MASARYK UNIVERSITY FACULTY OF INFORMATICS

A Checker for

BACHELOR THESIS

Marek Blahuš

Brno, May 2008

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Supervisor: RNDr. Petr Sojka, Ph.D.

Acknowledgement

I would like to thank RNDr. Petr Sojka, Ph.D., the supervisor of my bachelor thesis, for his comments, suggestions and the time he spent helping me with this work.

I would also like to thank Dr. Petr Chrdle, CSc., owner of the KAVA-PECH publishing house, who has provided me with a complimentary copy of the Plena ilustrita vortaro de Esperanto 2005. I am grateful to Dr. Ludvic Lazar Zamenhof, the initiator of Esperanto.

Abstract

This thesis provides a brief overview of spell checking software and describes the process of constructing a spell checker for the Esperanto language and its implementation as a dictionary (i.e. an affix file and a word list) for the Hunspell spell checker. The word list is an adaptation of word roots coming from the renowned Esperanto dictionary PIV. Recognition of morphologically complex words, which are common in Esperanto due to its agglutinative nature, is made possible by the affix file, which has been built based on ready-made morpheme segmentation of word derivations appearing in the same source. Rules derived in the latter process have been improved by semantic classification of all involved roots, for which a system has been created based on corpus analysis and several specialized dictionaries, in combination with knowledge of the capability of each affix to accept roots from different semantic classes, acquired from the PMEG reference grammar. The resulting spell checker is a working proof of concept, to be further improved and integrated in the grammar checker project of the E@I organization.

Abstrakto

Tiu ĉi disertaĵo donas koncizan trarigardon de literumkontrola programaro kaj priskribas la procezon de konstruado de literumkontrolilo por la lingvo Esperanto kaj ties implementon forme de vortaro (t.e. afiksa dosiero kaj vortlisto) por la literumkontrolilo Hunspell. La vortlisto estas adaptaĵo de vortradikoj venantaj de la renoma Esperanto-vortaro PIV. Rekono de morfologie kompleksaj vortoj, kiuj estas oftaj en Esperanto pro ties aglutina ĥaraktero, estas ebla pro la afiksa dosiero kiu konstruiĝis surbaze de preta morfemstrukturi- de vortderivaĵoj aperantaj en la sama fonto. Reguloj ekestintaj en tiu procezo estas plibonigitaj per semantika klasado de ĉiuj engaĝitaj radikoj, por kio kreiĝis sistemo bazita sur tekstara analizo kaj kelkaj fakvortaroj, kombine kun scioj pri akceptemo de ĉiu afikso al radikoj el diversaj semantikaj klasoj, akiritaj de la gramatika manlibro PMEG. La rezultinta literumkontrolilo estas funkcianta koncept-pruvo, plibonigota kaj integrota en la gramatikkontrolilan projekton de la organizo E@I.

Keywords

spell checker, Esperanto, Hunspell, corpus, morphology, semantic classification

Ŝlosilvortoj

literumkontrolilo, Esperanto, Hunspell, tekstaro, morfologio, semantika klasado

Table of Contents

1 Introduction
2 Overview of Existing Software
  2.1 Kontrolu Literumadon
  2.2 Esperantilo
  2.3 Ispell
  2.4 GNU Aspell
  2.5 MySpell
  2.6 Hunspell
    2.6.1 Structure of Hunspell Data Files
  2.7 Overview Table
3 A Hunspell Dictionary for Esperanto
  3.1 Strengths and Weaknesses of the Existing Dictionaries
  3.2 Adopting a Suitable Approach to Esperanto Morphology
    3.2.1 The Structure of Words in Esperanto
    3.2.2 System for a Semantic Classification of Stems
  3.3 DFD Diagram for Dictionary Construction
  3.4 Compiling a New Word List
    3.4.1 Identifying Relevant Sources for the Word List
    3.4.2 Moore Machine for Semantic Classification
    3.4.3 Implementation of the Semantic Classification
  3.5 Compiling a New Set of Affix Rules
    3.5.1 Dictionary-Based Word Derivation System
    3.5.2 Implementation of Esperanto Morphology in Hunspell
  3.6 Integration in OpenOffice.org and the E@I Grammar Checker
4 Evaluation of the Newly Constructed Dictionary
5 Conclusion
Bibliography
Appendix A: The 16 Rules of ...
Appendix B: An Overview of Esperanto Affixes

Chapter 1 Introduction

The development of computer technologies and the internet has had a strong impact on the world in which we live, often introducing major changes into certain fields of human activity and within human communities, sometimes even giving brand new potential to tools which we already had. Such is also the case of Esperanto, the international auxiliary language created in 1887 by Dr. Ludvic Lazar Zamenhof, and its worldwide community of speakers, which, according to some sources (Gordon, 2005, the article about Esperanto), is estimated to count up to 2 million people, spread across 115 countries of the world from South America through Europe to eastern Asia.

The impact of these new technologies on the Esperanto community, a language-based diaspora, has been mainly positive: the borders between countries seem to be disappearing, geographical distances are becoming less of a burden, and there are more opportunities for maintaining international contacts. Esperanto speakers and students can meet each other more easily, and documents, music, literature and other resources in the language are easier to find. Never before has it been possible to address such a wide audience at once, in Esperanto as in other languages. There are hundreds of thousands of webpages in Esperanto on the internet, and writing an article for the Wikipedia or for one’s own online diary does not take a serious effort. Sending an e-mail message to an online discussion group is much easier than mailing a letter to the editor of a magazine or sending out bulk mail on one’s own, and this is being taken advantage of.

However, apart from all these positive aspects, negative consequences are emerging as well. The language capabilities of an average Esperantist have probably not improved much over the recent decades, but more and more published Esperanto texts are now written by average Esperantists, with little or no attention to their language level, and thus the quality of Esperanto texts accessible to the public is declining.

On the other hand, technology does not only create the problem; it also gives us the tools to solve it. In 2007, a group of young people from E@I (Education@Internet, an international organization promoting usage of the internet among Esperanto speakers) came up with the idea of creating a language checking software package for Esperanto, intended both for students who still need to cultivate their language skills and for skilled Esperantists who type pages of text a day and whose mistakes usually come merely from a lack of attention. A good spell and grammar checker would be useful in every kind of text processor, e-mail client and web browser, and everywhere on the internet where texts in Esperanto are being written (forums, chats, blogs). This bachelor thesis is a part of the project, and its goal is to design a proof-of-concept realization of the spell checking part of software that would fulfill such needs.

Chapter 2 Overview of Existing Software

Automatic spell checking by computer has a history of a few decades. This chapter provides an overview of several existing spell checkers which are either dedicated exclusively to spell-checking Esperanto texts, or support multiple languages and include Esperanto support by default or as an optional downloadable dictionary module, or at least provide a fitting environment for the creation of such a module. In the last case, the module may not have been developed so far only because the software is recent, or because the present community of Esperanto-speaking NLP software developers is not interested in it or not familiar with it.

The aim of this chapter is to underline the differences between the existing solutions in order to learn about the good and bad aspects of the work done so far in this field, and to identify a tool which would provide a good starting point for the new one. This could mean merely taking loose inspiration from a particular solution’s advantages, implementing an Esperanto spell checking dictionary for an existing piece of software, or simply devising a more efficient version in case a fitting Esperanto dictionary is already available.

2.1 Kontrolu Literumadon

Kontrolu Literumadon (Esperanto for “Check Spelling”; Lendon, 1992) in its version 1.0, created in 1992 by Klivo Lendon from Canada using Prolog and distributed as shareware, is probably one of the very first spell checkers ever developed for Esperanto. It is a dedicated, stand-alone, non-suggesting spell checker and provides an MS-DOS pseudo-graphic interface for spell-checking plain text and WordPerfect 5.1 files given on the command line. Upon execution, the program displays the content of the file and indicates in color the words which it did not recognize. There is no text-editing or result-outputting functionality. Since the program was created before the advent of Unicode, a large part of the author’s effort went into supporting those characters of the Esperanto alphabet which are not present in the basic ASCII character set.¹ The correct appearance of these characters is secured by rendering texts in graphic mode. To recognize them in input files, the program understands the common x-convention, circumflex-convention and Zamenhof-convention, and also provides a user-editable character set file called supersgn.inf.
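
The x-convention mentioned above can be illustrated with a short Python sketch. It is not taken from Kontrolu Literumadon (which is written in Prolog); the function name `from_x_convention` and the mapping table are assumptions made for this illustration, based on the usual rule that an "x" written after c, g, h, j, s or u stands for the corresponding accented letter.

```python
import re

# Usual x-convention digraphs and the Esperanto letters they stand for.
X_MAP = {"cx": "ĉ", "gx": "ĝ", "hx": "ĥ", "jx": "ĵ", "sx": "ŝ", "ux": "ŭ"}

def from_x_convention(text):
    """Replace x-convention digraphs by the accented Esperanto letters."""
    def replace(match):
        pair = match.group(0)
        letter = X_MAP[pair.lower()]
        # Keep the capitalisation of the base letter ("Cx" becomes "Ĉ").
        return letter.upper() if pair[0].isupper() else letter
    return re.sub(r"[cghjsuCGHJSU][xX]", replace, text)

print(from_x_convention("ehxosxangxo cxiujxauxde"))
```

A real checker would also have to handle the other conventions and the rare cases where "x" after one of these letters is meant literally; the sketch ignores both.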

¹ See Appendix A for the Esperanto alphabet and the conventions used for representing its non-ASCII characters in circumstances where an 8-bit character encoding is imposed.

2.2 Esperantilo

Esperantilo (Esperanto for “Tool for Esperanto”; Trzewik, 2003), started in 2003 as EsperantoEdit, is a stand-alone, multi-platform UTF-8 text editor with special linguistic functions for Esperanto, including a spell checker, a grammar checker and machine translation. It is distributed as free software under the GNU GPL, maintained by Artur Trzewik from Germany and programmed using XOTcl and XOTclIDE, a set of rather uncommon programming tools. The spell-checking capabilities of Esperantilo employ an internal dictionary, and the program can also use existing Hunspell dictionaries or launch Aspell externally. The internal dictionary of Esperantilo consists of a word list of approx. 60,000 word forms and a list of approx. 9,000 word roots. Spell checking can thus work on two levels: words which were not found in the word list but may still be considered valid derivations within the system of word roots are marked in green, and words which were neither found in the word list nor recognized as valid derivations are marked in red. For each unknown word the program provides a set of suggestions.

2.3 Ispell

Ispell (Gorin, Willisson, Kuenning, 1971) is the first in a line of spell checkers that continues with Aspell, MySpell and Hunspell. It first emerged in 1971, in connection with the appearance of Unix, and was meant to serve the text processing application of this operating system, developed since 1971 by Bell Labs. Ispell was originally written in PDP-10 assembly language by R. E. Gorin, and later ported to the C programming language by Pace Willisson of MIT. During its evolution, Ispell introduced several innovations, including the generalized affix description system, which has since been imitated by other spell checkers such as MySpell, and the programmatic interface for the Emacs text editor, which was a pioneering attempt at separating the spell-checking functionality into an external module usable by other applications. Ispell’s weaknesses include its inability to spell-check texts in character sets other than basic ASCII (which makes it usable for only a very limited set of languages, particularly those of Western Europe) and its limited suggestion system, which is based simply on a Damerau-Levenshtein distance of 1. These were among the reasons for the later emergence of GNU Aspell as Ispell’s successor. Ispell is, however, still being maintained, and its current version, International Ispell by Geoff Kuenning, already supports Unicode and is available under the BSD license. There is a 70,000-item Esperanto word list for Ispell (Pokrovskij, 1997), created by Sergio Pokrovskij from Russia. This word list was probably the first of its kind in existence, and was later adapted and used for many other purposes, including the spell checkers which followed Ispell. Among modern applications which still support Ispell and make use of this word list, UniRed (an abbreviation for “Unikoda Redaktilo”, meaning “Unicode Editor” in Esperanto), created by Yuri Finkel of Russia, is worth mentioning.
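
To make the notion of suggestions at a Damerau-Levenshtein distance of 1 more concrete, the following Python sketch generates every string reachable from a word by one deletion, transposition of adjacent letters, replacement or insertion, and keeps those present in a lexicon. It is only an illustration of the distance measure, not Ispell’s actual code; the names `edits1` and `suggest` and the sample lexicon are assumptions made here.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings at Damerau-Levenshtein distance 1 from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts) - {word}

def suggest(word, lexicon):
    """Lexicon entries that are exactly one edit away from `word`."""
    return sorted(edits1(word) & set(lexicon))

LEXICON = {"work", "word", "play"}  # hypothetical miniature word list
print(suggest("wrok", LEXICON))
```

With such a scheme, a misspelling that needs two edits receives no suggestion at all, which is the limitation referred to in the text.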

2.4 GNU Aspell

GNU Aspell (Atkinson, 1998), currently maintained by Kevin Atkinson of the USA as part of the GNU software system and distributed as free software under the GNU LGPL, is GNU’s standard spell checker and was first published in 1998 with the aim of eventually replacing Ispell. The main improvements were better support for spell-checking the English language (a developed suggestion system based on English pronunciation rules) and better memory management (for example, GNU Aspell can use shared memory for dictionaries when several Aspell processes are open at once). However, later versions have also taken steps towards further internationalization of the software, including built-in support for UTF-8 (without having to use a special dictionary) and an effort to respect the current locale setting. GNU Aspell is written in C++, can be compiled on all Unix-like operating systems as well as on Microsoft Windows, and can be used either as a library or as a stand-alone command line program.

GNU Aspell maintains backward compatibility with Ispell. It can be used with virtually any program that expects Ispell, since it is capable of simulating Ispell’s behavior when used through a pipe. Although Aspell’s compiled dictionary format is completely different from that of Ispell, virtually all old Ispell dictionaries have been converted so that they can be used with Aspell. This was also the case for the Esperanto word list by Sergio Pokrovskij.

2.5 MySpell

MySpell (Hendricks, 2000) was the former spell checker library included in Writer, the text processing software of the OpenOffice.org office suite. MySpell’s main developer, Kevin Hendricks of Canada, created it in C++, with assistance from Kevin Atkinson (GNU Aspell’s maintainer), with the aim of integrating various open source spell checkers and adding a spell-checking capability to OpenOffice.org (a project started in 2000). For every locale (a combination of language and geographic territory), MySpell can store separate files for spelling, hyphenation and a thesaurus. The spell-checking routine uses a word list file (.dic) in connection with an affix file (.aff), in a manner similar to that introduced by Ispell, to support languages with a rich affix system.

Important applications using MySpell include AbiWord, a text processor, and Mozilla Thunderbird and Mozilla Firefox, the e-mail client and web browser of the Mozilla Foundation. OpenOffice.org itself, however, has replaced MySpell with Hunspell since version 2.0.2. The same is expected to happen with Thunderbird and Firefox when they appear in version 3. There is an Esperanto word list and affix file created by Dmitri Gabinski of Belarus, based on the older dictionary of Sergio Pokrovskij (Gabinski, Pokrovskij 2003).

2.6 Hunspell

Hunspell (Németh, 2005a, 2005b, 2005c) is an open source spell checker based on MySpell, created and maintained by László Németh of Hungary, written in C++ and distributed as a stand-alone program or a library under a GPL/LGPL/MPL tri-license. It has been designed specially for languages with rich morphology and a complex system of word compounding, originally for Hungarian. Its dictionary format is backward-compatible with that of MySpell, but Hunspell has the extra capability of working with UTF-8 encoded dictionaries. The affix classes used in Hunspell may also make use of UTF-8, allowing up to 65,535 affix classes in a dictionary. Major improvements of Hunspell’s spell checking algorithm include support for circumfixes, twofold stripping and recursive compound rules.

Hunspell was started in 2005, and since March 8, 2006 it has replaced MySpell as the default spell checker in OpenOffice.org (starting from version 2.0.2). The same switch was requested in a bug report for Mozilla Firefox and Mozilla Thunderbird in 2005, and since 2008, Hunspell has replaced MySpell in the beta versions of the upcoming Mozilla Firefox 3.

There is no Esperanto spell checking dictionary dedicated to Hunspell and making use of its improved support for agglutinative languages, among which Esperanto is often counted. However, because of backward compatibility, the MySpell adaptation of the Ispell Esperanto dictionary by Sergio Pokrovskij may be used with any software which employs Hunspell as its spell checker. This is presently the case for OpenOffice.org, for instance.

2.6.1 Structure of Hunspell Data Files

Spell checking dictionaries used by Hunspell consist of two data files. The first is a word list file containing words of the language; the second is an “affix file” defining the meaning of special flags used in the word list and word compounding patterns found in the affix file itself. Depending on the character of the dictionary, these affix classes may be used to refer to actual affixes or even to each individual morpheme. They may also be used to mark words in the word list with regard to word compounding. Sometimes, however, they may have little or no relation to morphology and serve merely as an instrument of data compression (through the creation of a “pseudo-affix” class for word forms which happen to start or end in the same letters). In extreme cases, affix classes may not be used at all.

Hunspell’s word list files may be recognized by the .dic extension. The first line of the file contains the approximate word count, after which comes the word list itself, with one word per line. Each word may be followed by a slash (“/”) and one or more flags which represent the affix classes the word can accept or attributes related to word compounding. Optionally, a field of morphological information may follow after a tab character or a space.
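
The layout of a word list file described above can be sketched as a small Python reader. This is an illustrative reading of the description, not Hunspell’s own parser: the function name `parse_dic` and the sample content are hypothetical, flags are assumed to be single characters, and words containing spaces or an escaped slash are not handled.

```python
def parse_dic(text):
    """Split the text of a .dic file into its declared count and its entries."""
    lines = text.splitlines()
    declared_count = int(lines[0].strip())  # first line: approximate word count
    entries = []
    for line in lines[1:]:
        if not line.strip():
            continue
        # Optional morphological information follows after a tab or a space.
        fields = line.split(None, 1)
        morph = fields[1].strip() if len(fields) > 1 else ""
        # Flags, if any, follow the word after a slash.
        word, _, flags = fields[0].partition("/")
        entries.append({"word": word, "flags": list(flags), "morph": morph})
    return declared_count, entries

# Hypothetical sample: two words carrying flag D, one word without flags.
SAMPLE = "3\nwork/D\nplay/D\tpo:verb\nhello\n"
count, entries = parse_dic(SAMPLE)
print(count, entries)
```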

An affix file (.aff extension) has a somewhat more complex structure. It is a collection of instructions, each on one line, which describe the meaning of the affix classes present in the word list and set different options which influence the behavior of the spell checker, such as character encoding, character set used in suggestions, assumed keyboard layout, explicit lists of frequent misspellings and easily interchangeable letters, etc. For each affix class, a set of rules is defined, which describe the possible derivations the affix may produce. The following is an excerpt from the US English Hunspell dictionary en_US which defines the creation of past tense forms of regular English verbs:

SFX D Y 4
SFX D 0 d e
SFX D y ied [^aeiou]y
SFX D 0 ed [^ey]
SFX D 0 ed [aeiou]y

The first line is the affix class header, which states the option name (PFX or SFX for a prefix or suffix rule, respectively), the flag which denotes the affix class in the word list, whether the affix can be combined with affixes of the opposite type on the same root (cross products), and the number of rules that follow.

Each affix rule consists of a repetition of the option name and the affix class flag, followed by the characters which get stripped from the beginning or end of the word when the affix is applied (or a zero if nothing gets stripped), the affix itself, and the condition under which its application is possible (a regular expression-like string or a dot if there is no special condition). The first rule, for instance, may be applied to the verb “breathe” (past tense “breathed”), the second one to the transition from “try” to “tried”, the third one to “work” → “worked”, and the fourth one to “play” → “played”.
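
The way these four rules operate can be traced with a short Python sketch. It is a simplified model written for this illustration and not Hunspell’s implementation: the rule tuples repeat the strip, affix and condition fields of the excerpt, while the function name `apply_suffix_rules` and the use of Python regular expressions for the conditions are assumptions.

```python
import re

# The four SFX D rules of the excerpt as (strip, affix, condition) triples.
RULES_D = [
    ("0", "d", "e"),
    ("y", "ied", "[^aeiou]y"),
    ("0", "ed", "[^ey]"),
    ("0", "ed", "[aeiou]y"),
]

def apply_suffix_rules(word, rules):
    """Return the forms produced by every rule whose condition fits `word`."""
    forms = []
    for strip, affix, condition in rules:
        # The condition is tested against the end of the unmodified word;
        # a dot means that there is no condition.
        if condition != "." and not re.search("(?:" + condition + ")$", word):
            continue
        # A zero in the strip field means that nothing is removed.
        stem = word if strip == "0" else word[: len(word) - len(strip)]
        forms.append(stem + affix)
    return forms

for verb in ("breathe", "try", "work", "play"):
    print(verb, apply_suffix_rules(verb, RULES_D))
```

Since the four conditions are mutually exclusive, each of the sample verbs matches exactly one rule.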

2.7 Overview Table

Selected properties of the discussed spell checkers are presented in the following table:

| Property | Kontrolu Literumadon | Esperantilo | Ispell | GNU Aspell | MySpell | Hunspell |
|---|---|---|---|---|---|---|
| Type | spell checker | text editor | spell checker | spell checker | spell checker | spell checker |
| Target languages | Esperanto | Esperanto | multiple languages | multiple languages | multiple languages | multiple; stress on Hungarian |
| Author | Klivo Lendon | Artur Trzewik | R. E. Gorin | Kevin Atkinson | Kevin Hendricks | László Németh |
| Started | 1992 | 2003 | 1971 | 1998 | 2000 | 2005 |
| Source code | Prolog | XOTcl | PDP-10, C | C++ | C++ | C++ |
| Platform | MS-DOS | Windows, Linux | Unix | Unix, Windows | Unix, Windows | Unix, Windows |
| License | shareware | GPL | BSD | GPL | BSD | GPL, LGPL, MPL |
| Operation mode | stand-alone | stand-alone | stand-alone | stand-alone, library | library | stand-alone, library |
| Input | plain text, WordPerfect | plain text | plain text | plain text | plain text | plain text |
| Output | display only | plain text | plain text | plain text | plain text | plain text |
| UTF-8 | no, only surrogates | yes | yes, not originally | yes, not originally | no | yes |
| Suggestions | no | yes | yes | yes | yes | yes |

Table 2.1: Overview table of existing spell checking software and its properties.

Chapter 3 A Hunspell Dictionary for Esperanto

As the previous chapter shows, spell checking is a living field of computer software which has been progressing since its origin, today maybe even more rapidly than ever before. The emergence of enhanced algorithms and new spell checkers, however, also creates a demand for up-to-date spell checking dictionaries if the new technology is to be useful to users in practice. Recently, the rise of Hunspell as the default spell checker in OpenOffice.org and Mozilla software has been particularly worth noticing. Hunspell introduces some features that its predecessors lacked, particularly with regard to the processing of agglutinative languages with a rich morphology. This fact, together with the fact that E@I has recently started a project for an Esperanto grammar checker which should also include a spell checking component, constitutes an especially good opportunity for a new Hunspell dictionary for Esperanto to be constructed. Such a dictionary may try to benefit from the new features of Hunspell, as well as make use of some recent resources such as larger corpora, and thus represent an update of the existing MySpell dictionary for Esperanto, which still carries the limitations of the original Ispell dictionary it is based on. For this reason, Hunspell was chosen as the spell checking engine used to implement the dictionary that is the subject of this paper.

3.1 Strengths and Weaknesses of the Existing Dictionaries

The present Esperanto spell checking dictionary for MySpell is poorly documented and consists of a word list of 19,342 items and an affix file with 58 affix classes (34 prefixes and 24 suffixes) and a total of 2,426 affix rules, listed in alphabetical order without any comments. There is a README file for the dictionary, but it limits itself to stating the names of the authors (Pokrovskij as author of the original word list and Gabinski as the one who adapted it for MySpell) and a reference to the GNU General Public License.

Fortunately, in order to understand the employed system of affix classes, the older work by Pokrovskij may be used, since the MySpell dictionary evidently uses the same affix classes and the flags assigned to them, although I have not been able to get in contact with Dmitri Gabinski in order to confirm this supposition. A list of the Ispell dictionary’s affix classes and their meaning can be found as comments in the respective affix file. And indeed, this list of 26 affix classes is a subset of the affix classes used by the MySpell dictionary. Additionally, the MySpell dictionary introduces several new ones, evidently in an attempt to simulate word compounding (they include prefixes such as numbers or , which are capable of entering a compound in Esperanto, such as “trikapa” from “tri” and “kapa”, meaning “three-headed”), which in such a form is not present in the work of Pokrovskij.

In any case, the restriction of MySpell, which supports only single-byte flags for affix classes, is ill-suited to an agglutinative language like Esperanto, whose principles of word building would require a much higher number of available affix classes if they were to be described in full detail. The addition of new affix classes in the MySpell dictionary is a clear demonstration of this demand; however, only Hunspell, with its capability of handling UTF-8 (two-byte) flags, seems to provide sufficient room for such a project.

It may seem that another advantage of using Hunspell for spell checking Esperanto texts is its novel capability of twofold suffix stripping (the possibility of defining suffixes for suffixes, i.e. one extra level of root modification), but this has proved to be of limited use, as explained later in this text.

Definitely worth noticing is Hunspell’s capability of word compounding, which could substitute for the prefixes added by Gabinski to his MySpell dictionary. MySpell supports basic compounding as well, but its compounding is too limited to be of use for the Esperanto dictionary. Hunspell’s improvements include a set of three options (to which affix flags are assignable) for basic compounding rules (COMPOUNDBEGIN, COMPOUNDMIDDLE and COMPOUNDEND), which, however, seem to be quickly becoming obsolete with the introduction of yet another instruction, COMPOUNDRULE, which makes it possible to define complex word compounding rules in Hunspell. The Esperanto dictionary which is to be constructed should try to make use of these features as well.

As far as the sources of Esperanto linguistic data used for building the dictionary are concerned, there is also a short description in Pokrovskij’s Ispell files, in the Esperanto README file legumin.l3. In it, Pokrovskij first regrets the absence of recursive affix stripping in Ispell (which forced him to introduce some complex affixes which are not present on their own in the Esperanto grammar, and which limits what he could achieve with his tool) and later mentions the uselessness of Ispell’s basic compounding system, as I have also noted above. Further in the file, he mentions the primary sources for his word list (the ready-made compounds from PIV, and several texts he himself had checked up to that time) and then goes into detail about his own preferences with regard to certain parts of the Esperanto lexicon and how they influenced his acceptance or rejection of certain words. This thorough inspection of the word list and the manual edits made to improve it seem worth imitating in the new spell checker as well.

In a personal communication of May 12, 2008, Pokrovskij confirmed to me that he was still developing his Ispell dictionary, although the lack of public interest kept him from updating its public distribution very frequently. He added that he had never heard of Hunspell before, suggested of his own accord that it could be a better solution for an agglutinative language such as Esperanto, and even admitted that if he were starting his work on an Esperanto spell checker now, he would certainly seriously consider using Hunspell for the purpose.² However, right afterwards, he added that the reported problems with its integration into the Emacs text editor make it of little use to him.

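² „Eble hunspell pli bone konvenus al la kunmetema lingvo kiel Esperanto; se mi estus komenconta mian laboron super literumilo por Esperanto, mi certe serioze konsiderus tion.“ (“Perhaps Hunspell would be a better fit for a compounding language like Esperanto; if I were about to start my work on a spell checker for Esperanto, I would certainly seriously consider it.”) Sergio Pokrovskij in an e-mail to Marek Blahuš on May 12, 2008.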

3.2 Adopting a Suitable Approach to Esperanto Morphology

In order to implement a spell checker exploiting the traits of Esperanto morphology, it is first necessary to adopt a suitable approach to this topic, which would make it possible to produce a set of rules defining all the possible derivations which are considered valid in the language.

The project of an international language, then called “Lingvo Internacia” and later named “Esperanto”, was first published by its initiator, Dr. L. L. Zamenhof, in 1887. The “Unua Libro” (“First Book”) included the Lord’s Prayer, some Bible verses, a letter, poetry, sixteen rules of grammar³ and 900 roots of vocabulary. In 1905, the sixteen rules were republished, along with a “universal dictionary” and a collection of exercises, as the “Fundamento de Esperanto” (“Foundation of Esperanto”; Zamenhof, 1905). The Fundamento is still considered a norm by most contemporary Esperanto speakers. Further evolution of the language is now observed and controlled by the “Akademio de Esperanto” (“Academy of Esperanto”).

Zamenhof had spent about ten years working on the project of his language, but since he had little background in scientific linguistics, a thorough linguistic analysis of his language was yet to be done. The first attempts in this field were made, for instance, by René de Saussure (brother of Ferdinand de Saussure), who also explored word formation in Esperanto in particular (Saussure, 1910). Probably the most significant and complete up-to-date description of Esperanto grammar (including morphology) is given by Bertilo Wennergren in PMEG (Wennergren, 2005). An English description (with particular stress on word building), partially based on PMEG, has been given by Jiří Hana in his master’s thesis (Hana, 1998).

Taking the word building principles described in PMEG as valid, it seems possible to adopt an approach to Esperanto morphology which enables us to construct a Hunspell affix file and a corresponding word list that reflect the derivational principles actually occurring in the usage of the language.

3.2.1 The Structure of Words in Esperanto

The word building principles described in PMEG are said to be in agreement with earlier decisions of the Akademio de Esperanto on the same topic (“Aktoj de la Akademio 1963–1967”, pp. 69–70). A description of these principles in English may also be found in Hana (Hana, 1998, pp. 33–45). The Academy has observed that roots in Esperanto may be divided into three categories, according to what part of speech is inherent to them. In compliance with the word endings typical for each part of speech,⁴ these categories are sometimes called A-roots (adjectival), I-roots (verbal) and O-roots (nominal).

Apart from roots, there are three other elements in the Esperanto word building system:

³ These original sixteen rules may be found in Appendix A of this paper.

⁴ See Appendix A for details on Esperanto endings for different parts of speech.

9 • affixes (both prefixes and suffixes), which in fact are a subset of specific, very pro- ductive roots, usually short in letters

• inflectional affixes and endings, capable of expressing the category (part of speech) in common words, the number and the accusative in adjectives and nouns, the tense and mood in verbs, and the active and passive participles

• primitive words, which are roots that do not require any category ending to form a word, although the addition of the ending may be possible

One of the significant observations of the Academy was that any root can accept any category ending, provided “the formed word has some meaning”. For example, the root “rapid” is inherently adjectival, so the base word form it produces is “rapida” (meaning “quick”), but it can also accept the nominal ending and produce “rapido” (meaning “speed”), or together with the verbal ending constitute the word form “rapidi” (meaning “to hurry”).

If the endings on their own are not enough to express a meaning, this can be achieved using affixes. PMEG lists a total of 10 official prefixes and 31 suffixes in Esperanto. These are attached to the root and can either add to or change its meaning. Several affixes of the same type may appear one after another.

New words, however, may also be formed as compounds, from two or more roots. For example, the above mentioned root “rapid” with the root “trajn” (meaning “train”) and the noun ending can produce the compound “rapidtrajno” (meaning “express train”; literally “quick train”). The effects such composition has on the semantics of the roots and on the overall meaning of the compound are different in each case and guided by a system of rules, which, however, is outside the scope of this work and unnecessary for our purpose. Furthermore, the difference between a root and an affix is often not clear. When compounding, a connector letter, which is either “a”, “o”, “i” or even “e” (the adverbial ending), is sometimes preserved at the compound boundary, for example “skribotablo” = “writing desk” from “skrib” = “write” and “tabl” = “table” (this issue of the connector, sometimes regarded as preservation of endings, is intentionally rather neglected in Hana 1998, where it is called “inserted o”).

In order to maintain clarity and to propose a system in which every Esperanto word could be constructed from a relatively small set of elements, I have decided to introduce the following word structure system and terminology, which represents a slightly modified version of the structure mentioned above, and to keep using it through the rest of this text:

A word in Esperanto, for the purpose of the developed spell checker, is a compound consisting of one or several compound parts, with an optional connecting letter between any pair of neighbouring compound parts, and optionally terminated by an ending. Each compound part consists of exactly one stem and any number of affixes placed around it. Written in the form of a regular expression, the structure looks like this:

(affix* stem affix* connector?)* affix* stem affix* ending?

All elements used in this word formation are called morphemes and each of them belongs to exactly one of the following categories: stems, affixes, endings and connectors. Each morpheme is unambiguously classifiable into one of these categories simply according to the letters it consists of; the only exception is that every connector is homonymous with an ending, but these two can be unambiguously distinguished by the position they occupy in the word.

Compared to the system used by PMEG and described above: all PMEG roots are stems in my system; all PMEG affixes are affixes in my system as well; of the inflectional affixes and endings, most are treated as endings (and there is always at most one ending, so complexes like “ajn” from “a”, “j” and “n”, denoting an accusative plural adjective, represent a single ending), and only the participles (“int”, “ant”, “ont”, “it”, “at”, “ot”) are handled in the same way as affixes (since they may appear not only at the end of words, but also inside them); and the primitive words, which I call “function words”, are all considered stems.

Following are several illustrations of my system of word structuring, with morpheme types and inner compound boundaries shown:

malsanulejdomo:5
MAL   – SAN  – UL    – EJ    | DOM  | O
affix   stem   affix   affix | stem | ending

ŝajnmultekosta:6
ŜAJN | MULT | E         | KOST | A
stem | stem | connector | stem | ending
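The word structure given above can be tried out on sequences of morpheme types. The following is only a minimal sketch, assuming that a word has already been segmented and that each morpheme has already been labeled with its type; the one-letter codes and the function name are my own illustrative choices and are not part of the spell checker itself.

```python
import re

# One letter per morpheme type (illustrative encoding, not from the thesis).
TYPE_CODES = {"affix": "a", "stem": "s", "connector": "c", "ending": "e"}

# Transcription of: (affix* stem affix* connector?)* affix* stem affix* ending?
WORD_STRUCTURE = re.compile(r"(?:a*sa*c?)*a*sa*e?")


def is_valid_structure(morpheme_types):
    """Return True if the sequence of morpheme types fits the word structure."""
    encoded = "".join(TYPE_CODES[t] for t in morpheme_types)
    return WORD_STRUCTURE.fullmatch(encoded) is not None


# malsanulejdomo: affix stem affix affix | stem | ending
print(is_valid_structure(["affix", "stem", "affix", "affix", "stem", "ending"]))

# ŝajnmultekosta: stem | stem | connector | stem | ending
print(is_valid_structure(["stem", "stem", "connector", "stem", "ending"]))

# An affix followed by an ending contains no stem, so it is not a word here.
print(is_valid_structure(["affix", "ending"]))
```

The sketch checks only the order of morpheme types; it says nothing about whether a given affix is semantically compatible with a given stem, which is the subject of the following section.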

3.2.2 System for a Semantic Classification of Stems

If we are to define rules for permitted combinations of affixes and stems, it is necessary to observe the conditions under which affixes accept certain stems (or vice versa) while not producing a meaningful combination with others. If we succeed in understanding this system, it is possible to create a spell checker which would recognize not only all words found in the sources it was based on, but also valid compounds and combinations of stems and affixes which may be untypical, but are still perfectly correct and usable in a context which the author of the dictionary simply did not think about. Also, implementing such a system of stem classification could result in a significantly shorter word list file, since all the possible combinations of stems and affixes would be described just by means of affix classes and not listed individually.

Semantic classification of stems and its usefulness for implementing rules of Esperanto morphology has already been touched upon by Hana in his morphological analyzer (Hana, 1998, pp. 54–55). He shows two examples of using two-level morphology rules to restrict the usage of some morphemes, in particular the prefixes “bo” (which can precede only a very limited set of family-related stems) and “pra” (which has two different meanings, and thus, in addition to family-related stems, it also accepts other stems). Later, in the conclusion of his paper (Hana, 1998, p. 65), he declares classification of stems a working approach which would be worth a wider application, but explains he has not pursued it extensively since such a classification would be very time-consuming.

5 meaning “a hospital house”, literally compounded as “the house [DOM+O] of the place [EJ] of the person(s) [UL] of the condition opposite [MAL] to healthy [SAN]”
6 meaning “seemingly expensive”, literally compounded as “costly [KOST+A] in the way [+E] of amount [MULT] of a seeming character [ŜAJN]”

However, exploiting data coming from grammar references, dictionaries and corpus analyses, it should be possible to pursue such a classification without the need for manual tagging of each and every stem in the word list. It is among the goals of this work to try out this approach and see if an automated semantic classification of Esperanto stems is feasible and the data coming out of it useful for the construction of a spell checking dictionary.

In order to devise an algorithm for semantic classification, a suitable system of semantic classes first has to be identified. This is to be done on the basis of known traits of all Esperanto affixes, since it is exactly the affixes and their relation to word stems which we are later going to control using rules that make use of the classification. A good enough description of the traits of Esperanto affixes has been given by Wennergren in PMEG (Wennergren, 2005, chapters 38.2 and 38.3). He also discusses the topic of semantic classification of stems, but provides only a provisional classification, declared as incomplete (Wennergren, 2005, chapter 37.1). The part of it concerning humans and animals and the problem of inherent genders in Esperanto stems is discussed in another chapter of PMEG (Wennergren, 2005, chapter 4.2) and, in a slightly updated form, in a separate document (Wennergren, 2008).

In spite of the overall incompleteness of Wennergren’s work on semantic classification, it seems possible to take his classification as a basis for the one to be designed in this chapter, so I present here an English translation of his semantic classes described in the sources cited above:

• Some stems refer to humans, persons, e.g.: AMIK, TAJLOR, INFAN, PATR, SINJOR, VIR...

• Other stems refer to animals, e.g.: ĈEVAL, AZEN, HUND, BOV, FIŜ, KOK, PORK...

• Other stems are plants, e.g.: ARB, FLOR, ROZ, HERB, ABI, TRITIK...

• Some stems are tools, e.g.:7 KRAJON, BROS, FORK, MAŜIN, PINGL, TELEFON...

7 This category was later not identified as necessary in my explorations, since I have not found any affix restricted to this particular semantic class, so I have not included it in my proposal for semantic classification of Esperanto stems.

• Many stems are names of actions, e.g.: DIR, FAR, LABOR, MOV, VEN, FRAP, LUD...

• Other stems are names of traits or qualities, e.g.: BEL, BON, GRAV, RUĜ, VARM, ĜUST, PRET...

Additionally, the stems referring to humans and animals may be sorted into masculine stems, feminine stems and neutral gender stems. An extensive list of examples for these categories, and particularly complete lists for some tricky combinations (e.g. masculine stems referring to animals, such as “taŭro” for “bull”), are provided in the cited sources.

Wennergren’s description of the Esperanto affix system lists the 10 official prefixes and 31 suffixes in Esperanto8 and includes detailed information on their possible use, with examples. In some cases, the semantic class of stems combinable with the particular affix is described very precisely; in other cases the description is unfortunately somewhat fuzzy (as is often also the actual usage of the affix-stem combination in question).

I have attempted to produce a list of distinguishable semantic classes from the description of affixes as well, and combined the two sets together. Using information collected from the cited sources, the given lists of examples and my own knowledge of the language, I have come up with a semantic classification system of Esperanto stems which is shown as a Venn diagram in Figure 3.1. The class of objects, which is especially complex, is shown in detail in the bottom part of the figure.

8 For a list of these affixes and short explanation of their meaning, see Appendix B.

Figure 3.1: Proposed system for a semantic classification of Esperanto stems.

According to the needs of each affix, a list of flags denoting particular semantic classes or their combinations has been derived from this classification. It is assumed that assigning such flags to Esperanto stems in a semantic classification process should be sufficient for later creation of rules controlling the possible combinations of stems and affixes. Each flag has been assigned a single letter for easier reference, and a mnemonic for easier orientation. A full list of these flags is shown in Table 3.1, with the mnemonic marked in bold.

Flag  Description
A     attribute stems, having the A-ending in their base word form
B     animals [“bestoj” in Esperanto]
C     common gender in beings (animals and persons)
F     female gender in persons
I     action stems, having the I-ending in their base word form
J     place stems, producing adverbs of spatial meaning [in Esperanto “ejo” = “place”]
K     plants [“kreskaĵoj” in Esperanto]
L     antonym-producing stems, which accept the prefix “mal”
M     male gender in beings (animals and persons)
N     numbers (numerals and some other stems of amount)
O     object stems, having the O-ending in their base word form
P     persons
T     transitive stems, producing transitive verbs
V     function words, which may appear without an ending [“funkcivortoj” in Esperanto]
Y     family relationships

Table 3.1: Semantic flags derived from the proposed system of semantic classification.

Finally, a reference table has been built, containing a haphazard sample of manually classified stems plus a carefully chosen group of stems selected to cover the remaining semantic classes. It shall later serve as a check list for the automated classification process which is to be implemented. The table shows for each stem and semantic flag whether the flag has been assigned to the stem or not. These data are presented in Table 3.2.

Stem       A B C F I K L M N O P R T X Y
ABON       - - - - - - L - - - - - T - -
ANANAS     - - - - - K - - - O - R - - -
ANAS       - B C - - - - - - O - R - - -
BOV        - B C - - - - - - O - R - - -
CENT       - - - - - - - - N O - - - - -
CENTR      - - - - - - L - - O - - - - -
DAM        - - - F - - - - - O P R - - -
INFORM     - - - - I - - - - - - - T - -
KOTIZ      - - - - I - - - - - - - - - -
LAND       - - - - - - - - - O - - - - -
OFIC       - - - - - - - - - O - - - - -
PAG        - - - - I - - - - - - - T - -
PATR       - - - - - - - M - O P R - X Y
REĜISOR    - - C - - - - - - O P R - - -
REPREZENT  - - - - I - - - - - - - T - -
SAM        A - - - - - L - - - - - - - -
TAŬR       - B - - - - - M - O - R - - -
TEMP       - - - - - - - - - O - - - - -
VER        - - - - - - L - - O - - - - -

Table 3.2: A check list with a manually classified selection of stems.
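The way such a check list can be used may be illustrated by a small comparison routine. This is only a sketch under my own assumptions: the three rows are copied from Table 3.2, while the "classified" data, the function name and the report format are hypothetical and only stand in for the output of an automated classification.

```python
# Three rows of Table 3.2, written as stem -> set of expected flags.
CHECK_LIST = {
    "PATR": set("MOPRXY"),
    "TAŬR": set("BMOR"),
    "SAM": set("AL"),
}


def compare_with_check_list(classified, check_list):
    """Return {stem: (missing_flags, unexpected_flags)} for every mismatch."""
    report = {}
    for stem, expected in check_list.items():
        actual = classified.get(stem, set())
        missing = expected - actual
        unexpected = actual - expected
        if missing or unexpected:
            report[stem] = (missing, unexpected)
    return report


# Hypothetical classifier output with two deliberate deviations.
classified = {
    "PATR": set("MOPRXY"),  # agrees with the check list
    "TAŬR": set("BOR"),     # flag M is missing
    "SAM": set("ALN"),      # flag N is unexpected
}

for stem, (missing, unexpected) in compare_with_check_list(classified, CHECK_LIST).items():
    print(stem, "missing:", sorted(missing), "unexpected:", sorted(unexpected))
```

An empty report would mean that, for the stems on the check list, the automated classification agrees with the manual one.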

3.3 DFD Diagram for Dictionary Construction

Having created a system for semantic classification of stems, it is possible to start designing an algorithm which would collect Esperanto stems, make use of this system to semantically classify them, and thereafter generate rules describing all the permitted ways in which they may be combined with each other and with affixes and endings in order to produce valid word forms of the language. The flow of linguistic data through the dictionary construction process is depicted in a data flow diagram in Figure 3.2.

[Figure 3.2 diagram. Elements of the main flow: Dictionary headwords, Semantic classification, Classified stems, Affix rules generation, Hunspell dictionary. Additional inputs: Function words, Corpus data, Specialized dictionaries, Ready-made morpheme segmentations.]

Figure 3.2: Data flow diagram for linguistic data in the dictionary construction process.

3.4 Compiling a New Word List

An essential part of a good spell checker is a solid word list, whose words are used for checking the words received from the user as input, either on their own or combined into compounds according to additional compounding rules. Various word lists (general word lists, proper names, terminology specific to some profession, unofficial words) have also been compiled for Esperanto, notably the separate word lists distributed with the Esperanto dictionary for Ispell (Pokrovskij, 1997), which give the user a limited possibility to influence the properties of the spell checking process. In our proof-of-concept dictionary for Hunspell, however, we do not go into such detail yet, and will concentrate on compiling a single, though large, word list of general vocabulary.

3.4.1 Identifying Relevant Sources for the Word List

Plena ilustrita vortaro de Esperanto 2005 (“The Complete Illustrated Dictionary of Esperanto”, PIV) is the latest version of a renowned monolingual dictionary of Esperanto, first compiled in 1970 by a large team of Esperanto linguists. There is an electronic list of the words found in the dictionary (Grimley Evans, 2005), downloadable under a free license from the internet. The list consists of a total of 46,890 lexical units, of which about 16,780 are head words (stems); the others are derivatives of these. Because of its high number of words, good coverage of general vocabulary (plus a significant amount of rather specialized terms), easy availability and general recognition in the Esperanto community, PIV, or actually its electronic version, is a good candidate for being the dictionary that will provide the base for our spell checking word list. It is also useful for obtaining information about the inherent part of speech of each stem, because the form with the corresponding ending is always the one which appears as the head word in the dictionary, while the derived forms are listed after it, if at all.
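The idea of reading the inherent part of speech off the ending of a head word may be sketched as follows. The sketch assumes that head words are available as plain strings ending in “a”, “i” or “o”; the actual format of the electronic word list is not reproduced here, and the function names are hypothetical.

```python
# Final letter of the head word -> flag for the inherent part of speech.
ENDING_TO_FLAG = {"a": "A", "i": "I", "o": "O"}


def stem_and_flag(headword):
    """Split a head word into (stem, flag); return None if it has no such ending."""
    flag = ENDING_TO_FLAG.get(headword[-1:])
    if flag is None:
        return None
    return headword[:-1], flag


def collect_flags(headwords):
    """Gather the part-of-speech flags of every stem found among the head words."""
    flags = {}
    for headword in headwords:
        result = stem_and_flag(headword)
        if result is not None:
            stem, flag = result
            flags.setdefault(stem, set()).add(flag)
    return flags


# "rapida" is the head word of the stem "rapid"; "bari" and "baro" are two
# head words sharing the stem "bar", which therefore collects two flags.
print(collect_flags(["rapida", "bari", "baro"]))
```

Note that this naive splitting would misread any function word that happens to end in one of the three letters, which is one reason why function words need to be handled separately.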

Apart from the most common simple adjectival, verbal and nominal stems, there are also stems which are function words, which means they may appear in a text on their own, without an ending. Some function words also have an adjectival, verbal or nominal character and may thus be used in word building; other function words lack this capability. Since it is more difficult to extract function words from PIV, another source is going to be used for obtaining a list of them. ESPSOF, a software package for analyzing and proofreading Esperanto texts, currently developed by Toon Witkam (Witkam, 2008), includes a list of 357 function words, classified into many different categories. For our purpose, only the distinction between inflected function words (which may accept an ending) and uninflected ones (which may only stand on their own) is important. Another advantage of Witkam’s list of function words is that it includes some word forms which are not actually function words, but are regarded as such. This is, for instance, the case of all personal pronouns, which hardly ever get combined with other stems and at the same time are very likely to be erroneously recognized within compounds because of their extremely short length (two or three letters). The need to set aside personal pronouns and similar words (including the few word forms they can produce by accepting an ending, such as the possessive form, plural and accusative) from the rest of the inflected stems and regard them as uninflected function words, along with several other possible improvements of Esperanto morphological analysis, has been mentioned by Witkam in his paper from the 2006 GIL conference (Witkam, 2007a).

Witkam’s ESPSOF, whose dictionary is also based on PIV, includes yet another type of useful extra data: information, for every verbal stem, on whether the verbs it produces are transitive or intransitive. This kind of information is present in the paper version of PIV as well, but was lost during its conversion into the electronic version. Witkam has managed to substitute it with the same kind of data coming from a reliable Esperanto-Japanese dictionary (Hirotaka, Ono, 1997), which I am also going to make use of. The advantage of such a source, apart from its reliability, is the fact that it completely covers the domain of words which enter the semantic classification process, i.e. the PIV words.

Corpus analysis seems to be a source of linguistic data especially worth exploring for the purpose of compiling the word list. There are several professionally compiled corpora for Esperanto, the largest of them being the 18.5-million-position Esperanto corpus created by Eckhard Bick (Bick, 2007). Corpora present the language in the form in which it is actually used, which often differs from the way it is described in dictionaries and grammar references. And if the goal of our spell checker is to make users aware of their mistakes rather than to force them to switch to a particular language style, we should make sure the linguistic data on which we construct the spell checker are as realistic as possible. Some semantic classes may be directly derived from corpus analysis: when searching for the set of stems which accept a certain affix, the easiest solution is, of course, to look up that affix in a corpus and extract the stem set from the result. This works perfectly, for example, for the “mal” prefix, which produces a meaning opposite to that of the stem it is modifying. Unfortunately, this method cannot be used in all cases, since very often an affix can be accepted by two or more semantic classes, which themselves are difficult to distinguish from one another. Also, not every affix is productive enough to have a high number of occurrences in a corpus, and not every affix is recognizable reliably enough in a corpus which is not morphologically tagged.
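The corpus lookup for the “mal” prefix may be sketched as follows. This is a deliberately simplified illustration and not the procedure actually used: it assumes the corpus is available as a list of word tokens, that a list of known stems already exists, and that a word has the shape “mal” + stem + ending. The list of endings is an abbreviated assumption of mine, and the function name is hypothetical.

```python
# Simplified, illustrative list of endings (longest first).
ENDINGS = ("ojn", "ajn", "oj", "aj", "on", "an", "en",
           "as", "is", "os", "us", "o", "a", "e", "i", "u")


def stems_accepting_mal(tokens, known_stems):
    """Collect known stems that occur in the tokens as 'mal' + stem + ending."""
    found = set()
    for token in tokens:
        word = token.lower()
        if not word.startswith("mal"):
            continue
        rest = word[len("mal"):]
        for ending in ENDINGS:
            if rest.endswith(ending):
                candidate = rest[: -len(ending)]
                if candidate in known_stems:
                    found.add(candidate)
    return found


known_stems = {"san", "ver", "centr", "dom"}
tokens = ["malsana", "Malveraj", "malto", "domo", "malcentre"]

# Stems found here would be candidates for the L flag.
print(sorted(stems_accepting_mal(tokens, known_stems)))
```

The token “malto” shows why the check against known stems matters: it begins with the letters “mal”, but what remains after removing them is not a known stem followed by an ending, so nothing is extracted from it.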

In order to be able to properly recognize all classes of the proposed semantic classification system, it is necessary to employ some specialized dictionaries as well. A particularly demanding field is the recognition of stems denoting animals, plants and human beings. For example, only these classes of stems may accept the Esperanto suffix “id”, used for forming the name of an offspring or descendant. This suffix is, however, not frequent enough for a satisfactory number of combinable stems to be identified by corpus analysis. That is why it is necessary to make use of specialized word lists during the semantic classification process. A significant contribution in this field was made by the recently deceased Wouter F. Pilger, who had been maintaining a set of “provisional personal lists” from the fields of botany, zoology and ornithology (Pilger, 1982, 1992, 1996a, 1996b, 1996c, 1997). Sometimes the words given in them are too scientific and do not cover the everyday vocabulary (such as listing a dog by the zoological word “kaniso” rather than the common word “hundo”), but these gaps may be filled by adding the vocabulary from several lernu! word lists (lernu!, 2002), which, on the other hand, attempt to present students of Esperanto with the most common vocabulary for each subject. Together, these word lists, if properly adapted, provide a solid base for recognizing the semantic classes of plants and animals.

Other specialized word lists which are used as an aid for the semantic classification are a list of professions, functions and ranks (Worsten, 2003), and several small closed vocabularies, such as the vocabulary of family-relationship stems, extracted from PMEG (Wennergren, 2005).

However, whenever a stem appears in a specialized word list but is not present in the original PIV-derived word list, it is thrown away, since such a stem would certainly not be fully classifiable. This measure also guarantees that only stems from a controlled vocabulary (those listed in PIV), and no other words, may get recognized by the spell checker. This may be welcomed by some users, and it at least helps to maintain a certain quality of the produced dictionary, even if it may have the side effect of not recognizing some rather specialized vocabulary.
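The filtering of a specialized word list against the PIV-derived word list amounts to a simple membership test, as the following sketch shows. The stem sets are toy data, the last stem of the specialized list is invented purely for illustration, and the function name is hypothetical.

```python
def flag_from_specialized_list(specialized_stems, piv_stems, flag):
    """Assign `flag` to stems known from PIV; collect the others as discarded."""
    accepted = {}
    discarded = []
    for stem in specialized_stems:
        if stem in piv_stems:
            accepted[stem] = flag
        else:
            discarded.append(stem)
    return accepted, discarded


# Toy stand-in for the PIV-derived word list.
piv_stems = {"hund", "bov", "dom"}

# Toy animal list; "ekzemplobest" is an invented stem that is not in the toy PIV set.
animal_stems = ["hund", "bov", "ekzemplobest"]

accepted, discarded = flag_from_specialized_list(animal_stems, piv_stems, "B")
print(accepted)
print(discarded)
```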

3.4.2 Moore Machine for Semantic Classification

The process of semantic classification of all stems is a complex one, because of all the different possible semantic classes that have to be taken into account and the number of sources that must be combined in order to achieve a good classification. Figure 3.3 in this chapter depicts the whole classification process each stem must go through, including the names of the tools that are used in each step of classification, the flags which are assigned to the stems in these steps, and the way in which each step of the classification is connected to the rest of the process. The diagram shown is a Moore machine, a finite state automaton in which the outputs (semantic flags) are determined by the current state alone. If a classification step is performed in a state, the state is labeled with an abbreviation denoting the source engaged, and there are transitions from this state to two or more other states, one for each possible result of the classification step in question. A state from which there are no transitions to other states is a final state. The flags output while walking through the automaton are accumulated, and when a final state is reached, the resulting output is the set of semantic flags for the particular stem. Thus, the set of all possible combinations of output semantic flags (a total of 85) may be obtained simply by enumeration of all possible passes through this automaton.
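The principle of accumulating state outputs and enumerating all passes may be illustrated on a miniature automaton. The states, transition labels and outputs below are hypothetical and much simpler than those of Figure 3.3; the sketch only assumes, as described above, that every state carries an output (possibly empty) and that a state without outgoing transitions is final.

```python
# Hypothetical miniature Moore machine; NOT the automaton of Figure 3.3.
# Output (flags) attached to each state; "" plays the role of the empty output.
OUTPUT = {"s0": "", "s_obj": "O", "s_act": "I",
          "s_person": "P", "s_trans": "T", "s_end": ""}

# Transitions: state -> {result of the classification step: next state}.
TRANSITIONS = {
    "s0": {"object": "s_obj", "action": "s_act"},
    "s_obj": {"person": "s_person", "non-person": "s_end"},
    "s_act": {"transitive": "s_trans", "non-transitive": "s_end"},
    "s_person": {},
    "s_trans": {},
    "s_end": {},
}


def classify(decisions, start="s0"):
    """Follow the given decisions and accumulate the output of each visited state."""
    state = start
    flags = OUTPUT[state]
    for decision in decisions:
        state = TRANSITIONS[state][decision]
        flags += OUTPUT[state]
    return flags


def all_flag_sets(state="s0", prefix=""):
    """Enumerate the flag combinations of all passes from `state` to a final state."""
    flags = prefix + OUTPUT[state]
    successors = TRANSITIONS[state]
    if not successors:
        return {flags}
    result = set()
    for target in successors.values():
        result |= all_flag_sets(target, flags)
    return result


print(classify(["object", "person"]))
print(sorted(all_flag_sets()))
```

In this toy automaton there are four passes and hence four flag combinations; the same enumeration applied to the full automaton is what yields the total stated above.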

[Figure 3.3 diagram: a state diagram with states q0 to q30. States in which a classification step is performed are labeled with their data source (ESPSOF, PIV, PMEG, PMEG+corpus, PMEG+worsten, Pilger+lernu, corpus); states output semantic flags or the empty output λ.]

Figure 3.3: Moore machine describing the process of automated semantic classification.

3.4.3 Implementation of the Semantic Classification

The above described system of semantic classification has been implemented by means of a set of Linux shell scripts which make use particularly of the programs sed, grep, and the utilities from the textutils package (such as cut, join, sort, uniq, wc).

Each of the steps (i.e. states labeled with an abbreviation of some data source) of the automaton shown in Figure 3.3 performs a step in the classification process, using the particular data source to classify the processed stem, and decides which transition should be followed next (and thus also whether a flag should be output and, if so, which one).

In practice, however, the automaton is not run on each stem from the word list individually. Instead, the whole word list goes through a batch process which iterates through all classification steps associated with states of the automaton and, for each step, runs the respective classifying program at once on all stems in the word list to which it is applicable. The order in which the steps are iterated through may, for instance, be the one that corresponds to the numbering of the states in the figure, but it may be proved that this order can be chosen arbitrarily as long as no path from the initial step to the step which is to be performed contains a step that has not been performed so far. This condition guarantees that the program realizing the following step will have at its disposal all the information about the classified stem from the previous steps it might require. Moreover, since all paths forming cycles in the graph may be distinguished from each other by the presence of a certain flag or set of flags in the output, it is possible at every moment to determine from the output flags by which series of steps a particular stem has been classified so far.
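The condition on the order of steps can be checked mechanically. The following sketch assumes that the steps and the transitions between them form a directed graph without directed cycles, in which case the condition is equivalent to requiring that every step comes after all steps from which it can be reached directly. The graph and the step names are hypothetical and do not correspond to Figure 3.3.

```python
def is_admissible(order, successors):
    """Check that each step is scheduled before every step that follows it."""
    position = {step: index for index, step in enumerate(order)}
    return all(
        position[source] < position[target]
        for source, targets in successors.items()
        for target in targets
    )


# Hypothetical step graph: step -> steps reachable by one transition.
successors = {
    "start": ["pos", "family"],
    "pos": ["number"],
    "family": ["number"],
    "number": [],
}

print(is_admissible(["start", "pos", "family", "number"], successors))
print(is_admissible(["start", "family", "pos", "number"], successors))
print(is_admissible(["start", "number", "pos", "family"], successors))
```

The first two orders differ only in the relative position of two independent steps, which illustrates the freedom of choice mentioned above; the third order runs a step before its predecessors and is therefore rejected.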

Each program performing a single classification step implements a universal interface, so that the programs may easily be lined up in a batch and swapped arbitrarily, as long as the condition given above is fulfilled. The word list which is being classified is stored in a text file, one item per line, each consisting of the text of the stem itself, a slash (“/”) and a list of flags that have been assigned to that stem so far. When run, a program performing a step of the classification receives a copy of this partially flagged word list as an input file, inspects it and generates an output file containing those stems which it has changed, followed after the slash by the newly added flag(s). Every time such a step has been performed, a merging program joins the two files, updating the main word list file with the new flags from the output file.

For example, the partially flagged word list may at some moment look like this:

    dom/JO
    patr/O

Then, if the program assigned to the step in state “q10”, whose task is to make use of PMEG to determine if a stem describes a family relationship, is run, it produces the following result:

    patr/YP

This is due to the fact that “patr” (the stem for “father”) describes a family relationship (and thus is given the flag “Y” for “family” and also “P” for “person”), while the other stem, “dom” (meaning “house”), does not describe a family relationship and thus is left out of the output.

Finally, the merging program updates the main word list with the information from this output file, after which the main word list looks like this:

    dom/JO
    patr/OYP
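The interplay of a classification step and the merging program in this example may be sketched as follows. The actual implementation consists of shell scripts; this Python sketch only mirrors the file format described above, working on lists of lines instead of files. The function names and the small set of family stems are illustrative assumptions.

```python
# Illustrative subset of family-relationship stems (an assumption for this sketch).
FAMILY_STEMS = {"patr", "fil", "frat"}


def parse_word_list(lines):
    """Parse 'stem/FLAGS' lines into a dict stem -> flag string, keeping order."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stem, _, flags = line.partition("/")
        entries[stem] = flags
    return entries


def family_step(word_list_lines):
    """Classification step: output only the stems that receive new flags."""
    return [f"{stem}/YP"
            for stem in parse_word_list(word_list_lines)
            if stem in FAMILY_STEMS]


def merge(main_lines, step_output_lines):
    """Merging program: append the newly assigned flags to the main word list."""
    main = parse_word_list(main_lines)
    for stem, new_flags in parse_word_list(step_output_lines).items():
        current = main.get(stem, "")
        for flag in new_flags:
            if flag not in current:
                current += flag
        main[stem] = current
    return [f"{stem}/{flags}" for stem, flags in main.items()]


word_list = ["dom/JO", "patr/O"]
step_output = family_step(word_list)
print(step_output)
print(merge(word_list, step_output))
```

Run on the two-line word list from the example, the step outputs only the line for “patr”, and the merge produces the updated word list shown above.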

The problem of homonyms is somewhat complicated, since the algorithm does not distinguish them. If a homonym occurs, such as when determining the inherent part of speech by means of PIV in state “q5” for the stem “bar”, which is the base of both the verbal “bari” (“to obstruct”) and the nominal “baro” (“bar”, the unit of pressure), then it is assigned two otherwise exclusive flags in the same step (becoming bar/IO). This may produce a combination of flags which could never occur in a single unambiguous stem, but it usually does little harm, since in the following steps it is still possible to recognize the series of steps by which the stem has been classified, although there may be some “superfluous” flags left, from the point of view of a step-performing program.

The only real problem with homonyms seems to emerge when a stem is assigned a flag due to the traits of one of its meanings, but this flag is also relevant to the other meaning, for which the presence of this flag may or may not have been determined so far. This is the case, for instance, of an attribute/object homonym, whose meanings get split in the “q5” state; the attribute and the object meanings may later be directed to the “q7” and the “q15” states, respectively, which both decide about the number character of the stem, assigning it the N flag or not. Now it may happen that one of the meanings has a number character and the other one has not, but as soon as both of them “meet again” in the following state, which is “q26”, this difference can no longer be identified. Probably, either restructuring the automaton or introducing an unambiguous set of flags (i.e. one flag may be generated only in one single state) would help to solve the problem. In practice, however, this does little harm, since there are very few homonyms in the Esperanto vocabulary (there are even Esperanto speakers who think there should be no homonyms in the language at all), and even if a homonym appears and a dubious event like the one described above occurs, the presence or absence of the particular flag usually does not have any impact on the decision processes in the steps that follow (for instance, it is not important whether a stem has a number character while it is being classified for the capability of producing antonyms in “q26”).

3.5 Compiling a New Set of Affix Rules

After a word list has been compiled and semantically classified, it is possible to create a set of rules which makes use of this classification by requiring that certain affixes be combined only with stems from certain semantic classes.

3.5.1 Dictionary-Based Word Derivation System

One possible approach would be to try to produce the rules manually, following theories of Esperanto word building that describe which affixes can be combined with which kind of stems (as we have seen in PMEG) and also make attempts at describing the Esperanto compounding system in general, using different terminologies and approaches. This process, however, would be particularly difficult, if feasible at all, because there is still no generally accepted overall theory of compounding in Esperanto and the agglutinative character of the language makes compounds very frequent and varied, although the speakers usually do not have problems understanding each other, since circumstances like context and their own national languages seem to make comprehension relatively easy.

Instead of working out rules by hand, particularly if we want to concentrate on the language as it is actually being used, the use of a dictionary or a corpus seems a suitable idea once again. There is presently no large enough Esperanto corpus which would contain information on the morphological structure of its words, but we may observe that Esperanto dictionaries such as PIV are actually full of compounds. Toon Witkam, who has written on the topic of deriving Esperanto morphology from dictionaries (Witkam, 2007), notes that in PIV compounds make up two thirds of all lexical units (although he also considers words with a single stem but at least one suffix added to it to be compounds, which differs from my usage in this work). In his ESPSOF software package (Witkam, 2008), he makes use of morphologically analyzed words from PIV, for a lot of which he had to add the structure information himself, since the dictionary does not provide full morpheme structure for all the words it lists. Once the morpheme structure of a large number of words is known, however, it may be used in morphological analysis, both as direct models for known words, and as a guide for analyzing words which are not present in the dictionary but are probably compounds of words which may be found there.

Witkam’s goal for ESPSOF, however, is to “construct a general word compound analyzer, which would work without any semantic knowledge of the text or the world” (Witkam, 2007; the translation from Esperanto is mine). But as we have developed a system for semantic classification of stems, we can now approach the problem of word derivation and compounding using this kind of information as well.

Taking Witkam’s list of ready-made morpheme segmentations, we may put it through an affix and stem recognition process, which tries to match each morpheme with either an affix or a stem from the word list we have compiled. Those words in which all components have been successfully recognized (which should be the case for most of them, since they all come from PIV, which is also the source of our word list) may serve as morphology models: together with the semantic classes assigned to the stems found in them, they give us hints about what the possible combinations of stems and affixes in Esperanto are.
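The recognition step described above can be illustrated with a minimal Python sketch. The affix set, the stem table and the function name `recognize` are hypothetical stand-ins for the much larger resources used in this work; only the flag letters (IT, O, I) are taken from the examples in the text.

```python
# Hypothetical miniature data; the real affix list and word list are far larger.
AFFIXES = {"dis", "mal", "aĵ", "ad", "ist", "ig", "iĝ"}
# Stem -> semantic flags, using the flag letters from the examples in the text.
STEMS = {"don": "IT", "sem": "O", "est": "I"}


def recognize(segmentation):
    """Label each morpheme of a 'dis/don'-style segmentation.

    Returns a list of (morpheme, kind, flags) tuples, or None when some
    morpheme is neither a known affix nor a known stem.
    """
    labelled = []
    for morpheme in segmentation.split("/"):
        if morpheme in AFFIXES:
            labelled.append((morpheme, "affix", ""))
        elif morpheme in STEMS:
            labelled.append((morpheme, "stem", STEMS[morpheme]))
        else:
            return None  # unrecognized morpheme: unusable as a model
    return labelled
```

In this sketch an affix reading takes priority over a stem reading, which is a simplifying assumption; a segmentation with any unknown morpheme is simply discarded as a model.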

An analysis of Witkam’s morpheme structures provides interesting information on the frequency of different affix combinations in Esperanto. In his database of approximately 33,000 ready-made morpheme segmentations of word forms from PIV, there are 632 different combinations of affixes and stem positions (independent of what the particular stems are). The frequencies of the combinations, except for the first two, approximately follow Zipf’s law. The twenty most common morpheme structures in Esperanto are listed in Table 3.3. Note that word endings were irrelevant to the analysis, and that an asterisk (*) denotes a position for a stem (in contrast with an affix).

Frequency   Structure
   12,970   *
    6,005   */*
    1,386   */o/*
      800   */aĵ
      795   */ad
      642   */ist
      631   */ig
      619   */ec
      505   */iĝ
      417   */il
      400   */ul
      376   */ej
      359   */et
      318   */ism
      317   */*/ig
      311   mal/
      254   */*/iĝ
      220   */uj
      215   */ar
      206   */*/il

Table 3.3: Most common morpheme structures in Esperanto.
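As an illustration of how such structure counts can be obtained, the following Python sketch abstracts every stem to an asterisk and tallies the resulting patterns. The affix subset, the sample segmentations and the function names are assumptions made for this example; they are not Witkam's data or tooling.

```python
from collections import Counter

# Hypothetical subset of the affix inventory.
AFFIXES = {"dis", "mal", "aĵ", "ad", "ist", "ig", "iĝ", "il"}


def structure(segmentation):
    """Replace every non-affix morpheme by '*', keeping affixes as they are."""
    return "/".join(m if m in AFFIXES else "*" for m in segmentation.split("/"))


def structure_frequencies(segmentations):
    """Count how many segmentations share each morpheme structure."""
    return Counter(structure(s) for s in segmentations)
```

Calling `structure_frequencies(...).most_common(20)` on a full list of segmentations would yield a ranking of the kind shown in Table 3.3.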

If we now group together the word forms having the same morpheme structure and classify the stems in them using our system of semantic classification, we get a list of stem semantic classifications for each morpheme structure. From this list, the prevailing semantic class for each stem position can be retrieved and used to produce a morphological rule describing the semantic conditions for the formation of words which follow the particular morpheme structure.

For example, inspecting the semantic analysis of stems in the 70 word forms of the morpheme structure “dis/*”, it is possible to notice that the stem position in this structure is mostly (in 53 cases) occupied by a verbal transitive stem (semantic flags IT). In all other cases (except for one, which is rather erroneous), the stem in the stem position is a common noun (semantic flag O). Eventually, the analysis of this particular morpheme structure may result in producing two new word building rules, namely that a combination of “dis” and a stem having either the O flag or the IT flags is a valid compound part in Esperanto and should be recognized as such by the constructed spell checker.
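The step from per-structure tallies to rules can be sketched in Python as follows. The function names and the 10 % threshold `min_share` are assumptions introduced for this illustration only; the text above does not state how marginal classes were filtered out.

```python
from collections import Counter


def stem_class_counts(models):
    """Tally the semantic flags seen in the stem slot of one-stem structures.

    `models` is an iterable of (structure, stem_flags) pairs, e.g.
    ("dis/*", "IT"); the result maps each structure to a Counter of flags.
    """
    counts = {}
    for struct, flags in models:
        counts.setdefault(struct, Counter())[flags] += 1
    return counts


def propose_rules(counts, min_share=0.1):
    """Keep every flag group that fills at least `min_share` of the slots."""
    rules = {}
    for struct, tally in counts.items():
        total = sum(tally.values())
        rules[struct] = sorted(f for f, n in tally.items() if n / total >= min_share)
    return rules
```

With the figures quoted for “dis/*” (53 stems flagged IT, the single outlier, and the remaining 16 flagged O), the two surviving flag groups are IT and O, matching the two rules described above.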

3.5.2 Implementation of Esperanto Morphology in Hunspell

After performing the analysis of Esperanto morpheme structures and the related combinations of affixes and semantic classes, the last part of the task of creating a spell checker for Esperanto was to implement the word list and the resulting rules in the form of a Hunspell dictionary.

Several approaches have been tried to represent the created semantic classification and word building rules using the morphology capabilities of Hunspell’s dictionary files. The most straightforward ones, unfortunately, turned out to be infeasible because of some limitations of the Hunspell software. The last approach tried, however, was successful and provided a working method of implementing the derived Esperanto morphology rules in the form of a Hunspell spell checking dictionary.

The main problem with implementing a complex word building system such as that of Esperanto in Hunspell has been Hunspell’s capabilities for compounding and affixation, which were found to be too limited; paradoxically so, for a spell checker which boasts the best support for morphology ever. The main causes of the failure of the first several attempts at implementing the Esperanto morphology in Hunspell were:

• The conflict between Hunspell’s old and new systems for defining permitted compound structure: The set of COMPOUNDFIRST, COMPOUNDMIDDLE, COMPOUNDEND and COMPOUNDFLAG instructions is useful for defining local compound rules (such as that a word ending may appear only as the last element of a word form and never alone), while the newer COMPOUNDRULE instruction, which takes a regular expression as its argument, may be used for defining the global structure of a compound (such as when trying to define word structure in a manner similar to what I have done in chapter 3.2.1 of this work). The problem is that these two sets of instructions may not be used together, so one has to opt for just one of them, and neither seems powerful enough at this time (although there are promises of COMPOUNDRULE becoming more powerful in the future and eventually replacing the old instructions).

• A hidden limitation of Hunspell’s well-known twofold suffix stripping. It actually works (it is possible to assign an affix class to another affix), but it is limited to the last compound part of the word form (or the first one, in the case of right-to-left languages). And, still, it is just a twofold stripping, so it cannot be used to freely implement the Esperanto morphology, where combinations of three affixes (especially if we count word endings among them) are relatively frequent.

These problems with making use of Hunspell’s prominent support for complex morphology have forced me to look for a less elegant solution, which, however, would at least be capable of expressing the systems of rules and stem classification I have developed. Finally, I settled on using solely word compounds (by means of the COMPOUNDRULE instruction) for implementing the Esperanto morphology, without using Hunspell’s actual affix system at all.

Thus, a feasible approach for implementing the created morphology rules is to produce a set of regular expressions (limited to the asterisk and question mark operators, however) which describe each of the identified rules, using a single character to refer to either a suffix or a stem semantic flag or semantic flag group. All the affixes, word endings and stems used must be put in the word list file and marked with the necessary Hunspell flags (“affix classes”) so that these flags may be used in the regular expressions. In order to keep things simple, I have kept the same system of flags I developed for semantic classification, introducing new ones where necessary, for example when a morphology rule expects a combination of semantic flags in a position, while Hunspell accepts only one flag per compound part in the COMPOUNDRULE regular expression.

An example implementation of the rules for the “dis/*” morpheme structure discussed in the previous chapter, for a sample dictionary of just several words, could look as follows:

    a/z
    i/z
    o/z
    dis/d
    don/ITx
    est/I
    sem/O

    COMPOUNDRULE dOz
    COMPOUNDRULE dxz

Here, the word list file (the top part) defines morphemes and their flags, and the excerpt from the affix file (the bottom part) shows how word building rules are written for the prefix “dis”. The flags for the morphemes “dis”, “don”, “est” and “sem” come directly from the semantic classification of stems, but a new auxiliary flag x has been introduced. It is merely a replacement for the flag combination IT which, not being a single flag, cannot appear in a compound rule on its own. The first compound rule describes the case when a nominal stem follows “dis” in a word and is followed by a word ending (the flag z here marks a sample subset of possible Esperanto word endings in the word list). The second compound rule describes the case when “dis” is followed by a verbal transitive stem, followed by a word ending. Using these rules, the spell checker recognizes the following word forms: “dissema”, “dissemi”, “dissemo”, “disdona”, “disdoni”, “disdono”.
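To make the effect of the two compound rules concrete, here is a small Python simulation of the matching logic. It is not Hunspell itself and does not use any Hunspell API: it only handles rules that are plain sequences of single-character flags (no asterisk or question mark operators), and it ignores further settings of a real affix file, such as COMPOUNDMIN, which the excerpt does not show. The names `WORDS`, `RULES`, `matches` and `accepted` are invented for this sketch.

```python
# Word list from the example: morpheme -> set of flags (one character each).
WORDS = {
    "a": set("z"), "i": set("z"), "o": set("z"),
    "dis": set("d"), "don": set("ITx"), "est": set("I"), "sem": set("O"),
}
# Compound rules from the example; each character names one flag.
RULES = ["dOz", "dxz"]


def matches(word, rule):
    """True if `word` splits into listed morphemes carrying the rule's flags in order."""
    if not rule:
        return word == ""
    return any(
        word.startswith(m) and matches(word[len(m):], rule[1:])
        for m, flags in WORDS.items()
        if rule[0] in flags
    )


def accepted(word):
    """True if at least one compound rule matches the whole word."""
    return any(matches(word, rule) for rule in RULES)
```

Under these assumptions the six word forms listed above are accepted, while a form such as “disesti” is rejected, because “est” carries only the flag I and therefore fits neither rule.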

3.6 Integration in OpenOffice.org and the E@I Grammar Checker

The above text shows that it is actually possible to develop a spell checking dictionary for Esperanto using Hunspell and that, in spite of some of Hunspell’s limitations, it is capable of expressing semantically based rules of Esperanto morphology, thus bringing into the field of Esperanto spell checking some new technology which would be worth trying out in practice.

Since this Esperanto spell checker is being developed within a broader grammar checker project called “Lingvohelpilo” (“Language Helper”) by the E@I organization, the first idea about its possible field of use is naturally this system, whose development started at the beginning of 2008 and is supposed to last for a full year. The spell checker should be an integral part of the project and in fact the first component to process the text after the user has submitted it. It would then forward the text, along with a list of unrecognized words and suggested corrections for each of them, to a separate component dedicated to grammar checking. The two processes could be connected with a pipeline, and some modifications to the Hunspell code are also foreseeable, so that the grammar checker receives the spell checker’s output in the form which is most convenient for it. Theoretically, it could also be possible to overcome some of Hunspell’s limitations on word building described above by creating a modified version of the Hunspell source code, but this is yet to be discussed and its feasibility explored.
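One way the hand-over between the two components might look is sketched below in Python. It assumes that the spell checker is run in the ispell-compatible pipe mode (the `-a` option) and that the grammar checker wants a plain list of misspelled words with their offsets and suggestions. The function name and the sample lines are made up for illustration; the actual interface of Lingvohelpilo is not specified in this work.

```python
def parse_pipe_output(lines):
    """Collect misspellings from ispell-style pipe-mode output lines.

    Returns (word, offset, suggestions) tuples; lines reporting correct
    words, the version banner and blank separators are skipped.
    """
    report = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("& "):
            # "& word count offset: first, second, ..."
            head, _, tail = line.partition(": ")
            _, word, _count, offset = head.split(" ")
            report.append((word, int(offset), tail.split(", ")))
        elif line.startswith("# "):
            # "# word offset" means: misspelled, no suggestions available
            _, word, offset = line.split(" ")
            report.append((word, int(offset), []))
    return report
```

Keeping the parser separate from the process handling means the grammar checker could later read the same structure from a modified Hunspell without changing its own code.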

A more straightforward and immediately possible use of the new Esperanto dictionary is with OpenOffice.org Writer, the open source word processor which originally gave birth to MySpell. Since OpenOffice.org has been supporting Hunspell for quite a long time already, it is very easy to use the newly created dictionary with any recent version of this office package. It is enough to replace the two old Esperanto dictionary files in the OpenOffice.org directory with the new ones and restart the program; the new dictionary then works in the application.

For a more user-friendly installation process, however, it would be worth getting the new dictionary officially accepted by the OpenOffice.org developer community, so that it could be downloaded from the project’s website, or even eventually become the package’s main spell checking dictionary for Esperanto.

At the same time, if enough attention were paid and some additional steps taken, it should also be possible to get the dictionary distributed automatically with each download of the Esperanto localization of OpenOffice.org. At present, unfortunately, the spell checking dictionary has to be downloaded separately after installing the office package, apparently because of problems with licensing (OpenOffice.org requires all components of its package to be triple-licensed, while the present Esperanto dictionary is available only under the GNU GPL). Achieving such recognition of the dictionary, however, seems to be a somewhat time-consuming process that is no longer closely related to computer programming, so, however worthwhile it is, it may be considered out of the scope of this work.

Chapter 4 Evaluation of the Newly Constructed Dictionary

The dictionary whose construction has been described in this work is intended to be a proof of concept, so it still has many flaws which should be addressed in the upcoming development stages. These known problems and omissions include:

• Presently, Hunspell provides suggestions only for single items from the word list, not for compounds. This is especially annoying since, due to the way in which the dictionary files were created, most Esperanto words are now represented as compounds. Either another working approach to implementing the constructed morphological rules should be found, or the Hunspell code should be modified, so that the program not only distinguishes valid words from invalid ones, but also provides suggestions in case of a misspelling.

• The issue of punctuation and non-alphabetical characters in general should be dealt with. A hyphen, for example, is sometimes used within an Esperanto compound word in order to give a hint on the word’s structure and achieve more clarity (e.g. to distinguish between “sent-ema” = “sensitive” and “sen-tema” = “without a topic”). Hyphens can be defined as word characters in Hunspell’s data files and used in the same manner as letters, but some external programs, such as OpenOffice.org, have their own hyphenation algorithms, which means the hyphen does not even get delivered to the spell checker.

• A common challenge in spell checkers is the domain of proper names. This problem, although worth attention, has not been touched by the proof-of-concept spell checker implementation described in this work. In order to develop a fully-featured Esperanto spell checking dictionary, however, a thorough analysis of the Esperanto system of proper nouns will have to be performed and its results implemented. Special care should be given to the suffixes “ĉj” and “nj”, which are used to form intimate forms of given names in Esperanto. They are unique in Esperanto, since they are the only affixes which virtually require the preceding stem to be shortened, producing forms such as “Joĉjo” from “Johano” (the male suffix) and “Manjo” from “Maria” (the female suffix). Perhaps it would be useful to employ Hunspell’s actual suffix system in order to implement this behavior.

• Last but not least, the semantic classification system realized in the scope of this work may be further developed and improved. The output of the semantic classification process for the check list given in chapter 3.2 has shown quite satisfying results (the only error being the classification of “reĝisoro” = “a director” as a neutral noun, since the stem was not present in the word list of professions that was used). However, it may still be the case that many other words now get classified incorrectly, probably often ending up in this very neutral noun category.
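The root shortening required by “ĉj” and “nj”, mentioned in the list above, can be illustrated with a small Python sketch. It simply enumerates every shortened root as a candidate and is an assumption about how such forms might be checked, not a description of any implemented component; the function names are hypothetical.

```python
def affectionate_candidates(root, suffix):
    """List candidate pet forms: a shortened root + suffix + noun ending o."""
    if suffix not in ("ĉj", "nj"):
        raise ValueError("only the suffixes ĉj and nj shorten the root")
    # Every proper, non-empty prefix of the root is treated as a candidate.
    return [root[:cut] + suffix + "o" for cut in range(1, len(root))]


def is_affectionate_form(word, name_roots):
    """True if `word` is a candidate pet form of one of the given name roots."""
    return any(
        word in affectionate_candidates(root, suffix)
        for root in name_roots
        for suffix in ("ĉj", "nj")
    )
```

Such a permissive enumeration accepts more forms than speakers actually use, so a real implementation would probably need additional constraints on where the root may be cut.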

Chapter 5 Conclusion

This thesis has discussed the topic of spell checking, with particular emphasis on spell checking of texts in the Esperanto language. An overview of major spell checking engines, their capabilities and history has been presented, including several tools dedicated to Esperanto. The structure of Hunspell data files has been discussed in more detail. Existing Esperanto dictionaries for MySpell and Ispell have been described and communication has been established with the author of the latter. Following an assessment of the strengths and weaknesses of these dictionaries and a discussion of achievable improvements, a plan for the construction of a new Hunspell dictionary for Esperanto has been presented. Foundations of Esperanto morphology have been discussed in order to adopt an approach to it that could be used during the construction of the dictionary. Due to specific traits of Esperanto word building, semantic classification of stems from its vocabulary has been identified as an especially useful precondition for this activity. A classification scheme has been proposed based on research in a grammar reference, and a complex system implementing this classification has been planned and realized, making use of various sources of linguistic data including dictionaries and corpora. An analysis has been performed on a corpus of morphological segmentations of Esperanto words, and word building rules employing semantic classes have been created based on the results of the analysis. A dictionary-based word derivation system implementing these rules has been realized in Hunspell, making it possible to use the acquired information on Esperanto morphology in the form of a spell checking dictionary. The dictionary has been successfully tested in OpenOffice.org as a proof of concept and shall now be improved and integrated in the grammar checker project of the E@I organization, which shall be released in 2009.

The created Hunspell dictionary for Esperanto may be accessed online at the following website: http://nlp.fi.muni.cz/~xblah/bc/

Bibliography

There are several essential present-day publications which are so common that they are well known just by their abbreviations. In this work, I follow this practice and refer to the “Plena manlibro de Esperanta gramatiko” (Wennergren, 2005) simply as “PMEG”, and to the “Plena ilustrita vortaro de Esperanto 2005” simply as “PIV”.

Aktoj de la Akademio 1963-1967. 2a eldono. Rotterdam : Akademio de Esperanto, 2007. Oficiala Bulteno de la Akademio de Esperanto; no. 9. Text in Esperanto. Available from WWW: .

ATKINSON, Kevin. GNU Aspell [online]. 0.50. SourceForge.net, 1998, 2002-08-21 [accessed 2008-05-21]. Text in English. Available from WWW: .

BICK, Eckhard. Tagging and Parsing an Artificial Language: An Annotated Web-Corpus of Esperanto. In Proceedings of Corpus Linguistics 2007. Birmingham : University of Birmingham, 2007. Text in English. Available from WWW: .

GABINSKI, Dmitri, POKROVSKIJ, Sergio. Esperanta literumilo por OpenOffice.org [online]. 2003, 2003-12-23 [accessed 2008-05-22]. Text in Esperanto. Available from WWW: .

GLEDHILL, Christopher. Grammar of Esperanto, The: A corpus-based description. Editor U.J. Lüders. München / Newcastle : LINCOM Europa, 1998. 100 pp. Text in English. ISBN 3895862177.

GORDON, Raymond G., Jr. (ed.). Ethnologue : Languages of the World. Fifteenth edition. Dallas : SIL International, 2005. 1272 pp. Text in English. Available from WWW: . ISBN 155671159X.

GORIN, R. E., WILLISSON, Pace, KUENNING, Geoff. International Ispell [online]. 3.3.02. 1971, 2005-06-11 [accessed 2008-05-21]. Text in English. Available from WWW: .

GRIMLEY EVANS, Edmund. Kapvortoj de PIV [online]. Versio 1.4. 2005-10-21 [accessed 2008-05-20]. Text in Esperanto. Available from the Internet Archive: .

HANA, Jiří. Two-level morphology of Esperanto [master thesis]. Prague, 1998. 85 pp. Charles University Prague, Faculty of Mathematics and Physics. Master thesis supervisor RNDr. Jan Hajič, Ph.D. Text in English. Available from WWW: .

HENDRICKS, Kevin. MySpell [online]. 3.0. 2000 [accessed 2008-05-21]. Text in English. Available from WWW: .

HIROTAKA, Masaaki, ONO, Takao. Plena Elektronika Vorto-Listo Esperanto-Japana [online]. 1-a eldono. Jokohamo : 1997, 1997-01-06 [accessed 2008-05-20]. Text in Japanese and Esperanto. Available from WWW: .

LENDON, Klivo. Kontrolu Literumadon [online]. 1.0. Oakville, Ontario, Canada : 1992, 1992-09-28 [accessed 2008-05-21]. Text in English and Esperanto. Available from FTP: .

lernu!. Learning / Words / World learning / By topic [online]. E@I, 2002 [accessed 2008-05-20]. Text in English. Available from WWW: .

NÉMETH, László. Hunspell : open source spell checking, stemming, morphological analysis and generation under GPL, LGPL or MPL licenses [online]. 1.2.2. SourceForge.net, 2005a, 2008-04-12 [accessed 2008-05-20]. Text in English. Available from WWW: .

NÉMETH, László. Hunspell − format of Hunspell dictionaries and affix files [online]. 2005b, 2008-04-12 [accessed 2008-05-20]. Text in English. Available from WWW: .

NÉMETH, László. Hunspell − spell checker, stemmer and morphological analyzer [online]. 2005c, 2008-04-12 [accessed 2008-05-20]. Text in English. Available from WWW: .

PILGER, Wouter F. Provizora privata listo de komunlingvaj nomoj de plantoj de nordokcidenta Eŭropo. Lelystad : Vulpo-libroj, 1982. 72 pp. Text in Esperanto. Available from WWW: . ISBN 9070074311.

PILGER, Wouter F. Provizora privata listo de nomoj de bestoj : Mamuloj. Lelystad : Vulpo-libroj, 1992. 72 pp. Text in Esperanto. Available from WWW: . ISBN 9070074354.

PILGER, Wouter F. Birdonomoj en Esperanto por Hejma Vortaro [online]. c1996a, 2002-05-10 [accessed 2008-05-20]. Text in Esperanto. Available from WWW: .

PILGER, Wouter F. Komunlingvaj nomoj de Eŭropaj birdoj [online]. c1996b, 2002-02-20 [accessed 2008-05-20]. Text in Esperanto. Available from WWW: .

PILGER, Wouter F. Provizora privata listo de nomoj de bestoj : Insektoj [online]. c1996c, 1996-09-02 [accessed 2008-05-20]. Text in Esperanto. Available from WWW: .

PILGER, Wouter F. Provizora privata listo de nomoj de legomoj en Nord-Okcidenta Eŭropo [online]. c1997, 1997-10-27 [accessed 2008-05-20]. Text in Esperanto. Available from WWW: .

Plena ilustrita vortaro de Esperanto 2005. Editor Gaston Waringhien. Paris : SAT, 2005. 1265 pp. Text in Esperanto. ISBN 2950243282.

POKROVSKIJ, Sergio. Vortaro por ISpell [online]. c1997, 2002-03-12 [accessed 2008-05-22]. Text in Esperanto, English. Available from WWW: .

SAUSSURE, René de. La construction logique des mots en Espéranto : réponse à des critiques. Genève : Par Antido, 1910. 83 pp. Text in French.

TRZEWIK, Artur. Esperantilo – text editor with particular Esperanto functions, spell and grammar checking and machine translation [online]. 0.982. [2003] [accessed 2008-05-21]. Text in English. Available from WWW: .

WENNERGREN, Bertilo. Plena manlibro de Esperanta gramatiko. El Cerrito : ELNA, 2005. 696 pp. Text in Esperanto. Available from WWW: . ISBN 0939785072.

WENNERGREN, Bertilo. Seksa signifo de vortoj kaj radikoj en Esperanto [online]. 2008-05-04 [accessed 2008-05-20]. Text in Esperanto. Available from WWW: .

Wikipedia contributors. Esperanto vocabulary. In Wikipedia, The Free Encyclopedia [online]. 2008-05-17 [accessed 2008-05-21]. Text in English. Available from WWW: .

WITKAM, Toon. Automatische Morphemanalyse in Esperanto macht Komposita besser lesbar auf dem Bildschirm. In BLANKE, Detlev. Esperanto heute – Wie aus einem Projekt eine Sprache wurde : Beiträge der 16. Jahrestagung der Gesellschaft für Interlinguistik e.V., 1.-3. Dezember 2006 in Berlin. Berlin : GIL, 2007a. Text in German. ISSN 1432-3567.

WITKAM, Toon. La ekscito de vortstatistiko : Kiel krudforta kunmet-analizo kompletigas tekstkontrolon. Utrecht : [s.n.], [2007b]. 32 pp. Text in Esperanto.

WITKAM, Toon. ESPSOF : Esperanto-Softvaro por Vindozo. versio 0.8. Utrecht : [s.n.], 2008. 7 pp. plus Microsoft Office files. Text in Esperanto, Dutch.

Worsten. Multlingva vortaro pri profesioj, funkcioj kaj rangoj [online]. Wrocław : c2003 , 2008-03 [accessed 2008-05-20]. Text in Esperanto, Czech, German, English, Spanish, French, Italian, Polish. Available from WWW: .

ZAMENHOF, L. L. Fundamento de Esperanto. Warszawa : [s.n.], 1905. Text in Esperanto, French, English, German, Russian, Polish. Available from WWW: .

Appendix A The 16 Rules of Esperanto Grammar

The grammar of Esperanto has been analyzed many times, and there are different theories and opinions as to how the language should be approached from the linguistic point of view. The most renowned contemporary Esperanto grammar reference is PMEG (Wennergren, 2005), but there have recently also been novel approaches to the topic, such as the description by Gledhill (Gledhill, 1998), which is based on the modern method of corpus analysis. The very first description of Esperanto grammar, in 16 rules, however incomplete and inexpert it may be, was given by the author of the language himself, Dr. L. L. Zamenhof, when the language was published in 1887. It later became a part of the so-called “Fundamento de Esperanto” (Zamenhof, 1905), which subsequently proved to be an efficient tool for preventing the language from splitting into dialects. The following is a copy of the English version of the original 16 rules of Esperanto grammar, provided here so that an uninformed reader may get a basic idea of the structure of Esperanto:

A) The alphabet

Aa (a as in “last”), Bb (b as in “be”), Cc (ts as in “wits”), Ĉĉ (ch as in “church”), Dd (d as in “do”), Ee (a as in “make”), Ff (f as in “fly”), Gg (g as in “gun”), Ĝĝ (j as in “join”), Hh (h as in “half”), Ĥĥ (strongly aspirated h, “ch” in “loch”), Ii (i as in “marine”), Jj (y as in “yoke”), Ĵĵ (z as in “azure”), Kk (k as in “key”), Ll (l as in “line”), Mm (m as in “make”), Nn (n as in “now”), Oo (o as in “not”), Pp (p as in “pair”), Rr (r as in “rare”), Ss (s as in “see”), Ŝŝ (sh as in “show”), Tt (t as in “tea”), Uu (u as in “bull”), Ŭŭ (u as in “mount” – used in diphthongs), Vv (v as in “very”), Zz (z as in “zeal”).

Remark: If it be found impracticable to print works with the diacritical signs (^, ˘), the letter h may be substituted for the sign (^), and the sign (˘), may be altogether omitted.[9]

B) Parts of Speech

1. The Article
There is no indefinite, and only one definite, article, la, for all genders, numbers, and cases.

2. Substantives
Substantives are formed by adding o to the root. For the plural, the letter j must be added to the singular. There are two cases: the nominative and the objective (accusative). The root with the added o is the nominative, the objective adds an n after the o. Other cases are formed by prepositions; thus, the possessive (genitive) by de, “of”; the dative by al, “to”, the instrumental (ablative) by kun, “with”, or other preposition as the sense demands. E. g. root patr, “father”; la patr'o, “the father”; la patr'o'n, “the father” (objective), de la patr'o, “of the father”; al la patr'o, “to the father”; kun la patr'o, “with the father”; la patr'o'j, “the fathers”; la patr'o'j'n, “the fathers” (obj.), por la patr'o'j, “for the fathers”.

[9] This surrogate system for representing some characters of the Esperanto alphabet is sometimes called the “Zamenhof-convention”, but it is not very popular. Nowadays, the most popular way to write Esperanto letters on a computer when Unicode support is missing is the “x-convention”, which places an “x” after the unaccented version of the particular letter. This has been promoted as more practical, since the difference between “ŭ” and “u” does not disappear as it does with the Zamenhof-convention, and words written using this surrogate alphabet still appear in proper order when sorted alphabetically. Yet another system is the “circumflex-convention”, which puts the circumflex symbol (“^”) after the letter.
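The x-convention mentioned in footnote 9 is simple enough to be shown as a short Python sketch. The mapping table follows directly from the convention; the function name `from_x_convention` is chosen for this illustration only.

```python
X_TO_UNICODE = {
    "cx": "ĉ", "gx": "ĝ", "hx": "ĥ", "jx": "ĵ", "sx": "ŝ", "ux": "ŭ",
}


def from_x_convention(text):
    """Replace x-convention digraphs by the accented Esperanto letters."""
    for digraph, letter in X_TO_UNICODE.items():
        text = text.replace(digraph, letter)                       # cx -> ĉ
        text = text.replace(digraph.capitalize(), letter.upper())  # Cx -> Ĉ
        text = text.replace(digraph.upper(), letter.upper())       # CX -> Ĉ
    return text
```

For example, `from_x_convention("negxo")` gives “neĝo”. The sketch assumes that every “x” following one of the six base letters is a surrogate marker, which holds for Esperanto words since the letter x is not part of the alphabet.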

3. Adjectives
Adjectives are formed by adding a to the root. The numbers and cases are the same as in substantives. The comparative degree is formed by prefixing pli (more); the superlative by plej (most). The word “than” is rendered by ol, e. g. pli blanka ol neĝo, “whiter than snow”.

4. Numerals
The cardinal numerals do not change their forms for the different cases. They are: unu (1), du (2), tri (3), kvar (4), kvin (5), ses (6), sep (7), ok (8), naŭ (9), dek (10), cent (100), mil (1000). The tens and hundreds are formed by simple junction of the numerals, e. g. 533 = kvin'cent tri'dek tri. Ordinals are formed by adding the adjectival a to the cardinals, e. g. unu'a, “first”; du'a, “second”, etc. Multiplicatives (as “threefold”, “fourfold”, etc.) add obl, e. g. tri'obl'a, “threefold”. Fractionals add on, as du'on'o, “a half”; kvar'on'o, “a quarter”. Collective numerals add op, as kvar'op'e, “four together”. Distributive prefix po, e. g., po kvin, “five apiece”. Adverbials take e, e. g., unu'e, “firstly”, etc.
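The formation of cardinals by simple junction can be sketched in Python as follows. The sketch is limited to the range 1 to 999, which is what the rule above spells out explicitly, and writes the words without the apostrophes used in elementary works; the function names are hypothetical.

```python
UNITS = ["", "unu", "du", "tri", "kvar", "kvin", "ses", "sep", "ok", "naŭ"]


def cardinal(n):
    """Spell a cardinal number from 1 to 999 following rule 4."""
    if not 1 <= n <= 999:
        raise ValueError("this sketch covers 1..999 only")
    hundreds, rest = divmod(n, 100)
    tens, units = divmod(rest, 10)
    words = []
    if hundreds:
        # "cent" alone for 100, otherwise the multiplier joined to "cent"
        words.append(("" if hundreds == 1 else UNITS[hundreds]) + "cent")
    if tens:
        words.append(("" if tens == 1 else UNITS[tens]) + "dek")
    if units:
        words.append(UNITS[units])
    return " ".join(words)


def ordinal(n):
    """Add the adjectival ending a to a single-word cardinal."""
    word = cardinal(n)
    if " " in word:
        raise ValueError("multi-word ordinals are outside this sketch")
    return word + "a"
```

With these definitions, `cardinal(533)` reproduces the example from the rule as “kvincent tridek tri”, and `ordinal(2)` gives “dua”.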

5. Pronouns
The personal pronouns are: mi, “I”; vi, “thou”, “you”; li, “he”; ŝi, “she”; ĝi, “it”; si, “self”; ni, “we”; ili, “they”; oni, “one”, “people”, (French “on”). Possessive pronouns are formed by suffixing to the required personal, the adjectival termination. The declension of the pronouns is identical with that of substantives. E. g. mi, “I”; mi'n, “me” (obj.); mi'a, “my”, “mine”.

6. Verbs
The verb does not change its form for numbers or persons, e. g. mi far'as, “I do”; la patr'o far'as, “the father does”; ili far'as, “they do”. The present tense ends in as, e. g. mi far'as, “I do”. The past tense ends in is, e. g. li far'is, “he did”. The future tense ends in os, e. g. ili far'os, “they will do”. The subjunctive mood ends in us, e. g. ŝi far'us, “she may do”. The imperative mood ends in u, e. g. ni far'u, “let us do”. The infinitive mood ends in i, e. g. far'i, “to do”. There are two forms of the participle in the international language, the changeable or adjectival, and the unchangeable or adverbial. The present participle active ends in ant, e. g. far'ant'a, “he who is doing”; far'ant'e, “doing”. The past participle active ends in int, e. g. far'int'a, “he who has done”; far'int'e, “having done”. The future participle active ends in ont, e. g. far'ont'a, “he who will do”; far'ont'e, “about to do”. The present participle passive ends in at, e. g. far'at'e, “being done”. The past participle passive ends in it, e. g. far'it'a, “that which has been done”; far'it'e, “having been done”. The future participle passive ends in ot, e. g. far'ot'a, “that which will be done”; far'ot'e, “about to be done”. All forms of the passive are rendered by the respective forms of the verb est (to be) and the participle passive of the required verb; the preposition used is de, “by”. E. g. ŝi est'as am'at'a de ĉiu'j, “she is loved by every one”.
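Rule 6 amounts to two small tables of endings, which the following Python sketch writes out. The table names and the helper functions are invented for this illustration; the endings themselves are those listed in the rule.

```python
FINITE_ENDINGS = {
    "present": "as", "past": "is", "future": "os",
    "subjunctive": "us", "imperative": "u", "infinitive": "i",
}
PARTICIPLE_SUFFIXES = {
    ("present", "active"): "ant", ("past", "active"): "int",
    ("future", "active"): "ont", ("present", "passive"): "at",
    ("past", "passive"): "it", ("future", "passive"): "ot",
}


def finite(root, form):
    """Join a verbal root with the ending of the requested finite form."""
    return root + FINITE_ENDINGS[form]


def participle(root, tense, voice, ending="a"):
    """Build a participle; ending 'a' is adjectival, 'e' is adverbial."""
    return root + PARTICIPLE_SUFFIXES[(tense, voice)] + ending
```

For instance, `finite("far", "past")` yields “faris” and `participle("am", "present", "passive")` yields “amata”, as in the passive example that closes the rule.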

7. Adverbs
Adverbs are formed by adding e to the root. The degrees of comparison are the same as in adjectives, e. g., mi'a frat'o kant'as pli bon'e ol mi, “my brother sings better than I”.

8. Prepositions
All prepositions govern the nominative case.

C) General Rules

9. Pronunciation
Every word is to be read exactly as written, there are no silent letters.

10. Accent
The accent falls on the last syllable but one, (penultimate).

11. Compounds
Compound words are formed by the simple junction of roots, (the principal word standing last), which are written as a single word, but, in elementary works, separated by a small line ('). Grammatical terminations are considered as independent words. E. g. vapor'ŝip'o, “steamboat” is composed of the roots vapor, “steam”, and ŝip, “a boat”, with the substantival termination o.

12. Negative
If there be one negative in a clause, a second is not admissible.

13. Direction
In phrases answering the question “where?” (meaning direction), the words take the termination of the objective case; e. g. kie'n vi ir'as? “where are you going?”; dom'o'n, “home”; London'o'n, “to London”, etc.

14. The Indefinite Preposition
Every preposition in the international language has a definite fixed meaning. If it be necessary to employ some preposition, and it is not quite evident from the sense which it should be, the word je is used, which has no definite meaning; for example, ĝoj'i je tio, “to rejoice over it”; rid'i je tio, “to laugh at it”; enu'o je la patr'uj'o, “a longing for one’s fatherland”. In every language different prepositions, sanctioned by usage, are employed in these dubious cases, in the international language, one word, je, suffices for all. Instead of je, the objective without a preposition may be used, when no confusion is to be feared.

15. New Words
The so-called “foreign” words, i. e. words which the greater number of languages have derived from the same source, undergo no change in the international language, beyond conforming to its system of orthography. – Such is the rule with regard to primary words, derivatives are better formed (from the primary word) according to the rules of the international grammar, e. g. teatr'o, “theatre”, but teatr'a, “theatrical”, (not teatricul'a), etc.

16. Elision
The a of the article, and final o of substantives, may be sometimes dropped euphoniae gratia, e. g. de l’ mond'o for de la mond'o; Ŝiller’ for Ŝiller'o; in such cases an apostrophe should be substituted for the discarded vowel.

Appendix B An Overview of Esperanto Affixes

This list covers some of the affixes that form an essential part of Esperanto morphology. It is a slight modification of the list found in the English Wikipedia (Wikipedia Contributors, 2008) and is provided here for reference, should any discussion of a particular affix in the text of this work be unclear.

Affix | Meaning | Examples
-aĉ | pejorative (expresses a poor opinion of the object or action) | skribaĉi (to scrawl, from 'write'); veteraĉo (foul weather); domaĉo (a hovel); rigardaĉi (to gape at, from 'look at')
-ad | imperfective aspect (frequent, repeated, or continual action); as a noun, an action or process | kuradi (to keep on running); parolado (a speech); adi (to carry on)
-aĵ | a concrete manifestation | manĝaĵo (food, from 'eat'); novaĵo (news, novelty)
-an | a member, follower, participant, inhabitant | kristano (a Christian); marksano (a Marxist); usonano (a US American) [cf. amerikano (a continental American)]
-ar | a collective group | arbaro (a forest, from 'tree'); vortaro (a dictionary, from 'word' [a set expression]); homaro (humanity, from 'human' [a set expression; 'crowd, mob' is homamaso])
-ĉj | masculine affectionate form; the root is truncated | Joĉjo (Jack); paĉjo (daddy); fraĉjo (bro)
-ebl | possible | kredebla (believable); videbla (visible)
-ec | an abstract quality | amikeco (friendship); boneco (goodness); italeca (Italianesque)
-eg | augmentative; sometimes pejorative connotations when used with people | domego (a mansion); virego (a giant); librego (a tome); varmega (boiling hot); ridegi (to guffaw)
-ej | a place characterized by the root (not used for toponyms) | lernejo (a school, from 'to learn'); vendejo (a store, from 'to sell'); juĝejo (a court, from 'to judge'); kuirejo (a kitchen, from 'to cook'); hundejo (a kennel, from 'dog'); senakvejo (a desert, from 'without water')
-em | having a propensity, tendency | ludema (playful); parolema (talkative); kredema (credulous)
-end | mandatory | pagenda (payable); legendaĵo (required reading)
-er | the smallest part | ĉenero (a link, from 'chain'); fajrero (a spark, from 'fire'); neĝero (a snowflake, from 'snow'); kudrero (a stitch, from 'sew'); ero (a crumb etc)
-estr | a leader, boss | lernejestro (a school principal); urbestro (a mayor, from 'city'); centestro (a centurion, from 'hundred')
-et | diminutive; sometimes affectionate connotations when used with people | dometo (a hut); libreto (a booklet); varmeta (lukewarm); rideti (to smile)
-id | an offspring, descendant | katido (a kitten); reĝido (a prince, from 'king'); arbido (a sapling, from 'tree'); izraelido (an Israelite)
-ig | to make, to cause (transitivizer/causative) | mortigi (to kill, from 'die'); purigi (to clean); konstruigi (to have built)
-iĝ | to become (intransitivizer/inchoative/middle) | amuziĝi (to enjoy oneself); naskiĝi (to be born); ruĝiĝi (to blush, from 'red')
-il | an instrument | ludilo (a toy, from 'play'); tranĉilo (a knife, from 'cut'); helpilo (a remedy, from 'help')
-in | female | bovino (a cow); patrino (a mother); studentino (a co-ed)
-ind | worthy of | memorinda (memorable); kredinda (credible); fidinda (dependable, trustworthy)
-ing | a holder, sheath | glavingo (a scabbard, from 'sword'); kandelingo (a candle-holder); dentingo (a tooth socket)
-ism | a doctrine, system (as in English) | komunismo (Communism); kristanismo (Christianity)
-ist | person professionally or avocationally occupied with an idea or activity (a narrower use than in English) | instruisto (teacher); dentisto (dentist); abelisto (a beekeeper); komunisto (a communist)
-nj | feminine affectionate form; the root is truncated | Jonjo (Joanie); panjo (mommy); anjo (granny)
-obl | multiple | duobla (double); trioble (triply)
-on | fraction | duona (half [of]); centono (one hundredth)
-op | collective numeral | duope (by twos); gutope (drop by drop)
-uj | a (loose) container, country (archaic when referring to a political entity), a tree of a certain fruit (archaic) | monujo (a purse, from 'money'); Anglujo (England [Anglio in current usage]); Kurdujo (Kurdistan, the Kurdish lands); pomujo (appletree [now pomarbo])
-ul | a person characterized by the root | junulo (a youth); sanktulo (a saint, from 'holy'); aboculo (a beginning reader, from aboco "ABC's"); aĉulo (a wretch, from the suffix aĉ); tiamulo (a contemporary, from 'then')
-um | undefined ad hoc suffix (used sparingly) | kolumo (a collar, from 'neck'); krucumi (to crucify, from 'cross'); malvarmumo (a cold, from 'cold'); plenumi (to fulfill, from 'full'); brakumi (to hug, from 'arm'); dekstrume (clockwise, from 'right')
bo- | relation by marriage, in-law | bopatro (a father-in-law); boedzino (a sister-wife)
dis- | separation, scattering | disĵeti (to throw about); dissendi (to distribute); disatomi (to split by atomic fission)
ek- | perfective aspect (beginning, sudden, or momentary action) | ekbrili (to flash); ekami (to fall in love); ekkrii (to cry out); ekde (inclusive 'from'); ek! (hop to!)
eks- | former, ex- | eksedzo (an ex-husband); eksbovo (a steer [jokingly, from 'bull']); Eks la estro! (Down with our leader!)
fi- | shameful, nasty | fihomo (a wicked person); fimensa (foul-minded); fivorto (a profane word); Fi al vi! (Shame on you!)
ge- | both sexes together | gepatroj (parents); gesinjoroj (ladies and gentlemen); la geZamenhofoj (the Zamenhofs); gelernejo (a coeducational school); geiĝi (to pair up, to mate)
mal- | antonym | malgranda (small); malriĉa (poor); malino (a male [jokingly]); maldekstrume (counter-clockwise)
mis- | incorrectly, awry | misloki (to misplace); misakuzi (to wrongly accuse); misfamiga (disparaging, from fama 'well-known' and the causative suffix -ig)
pra- | great-(grand-), primordial, proto- | praavo (a great-grandfather); prapatro (a forefather); prabesto (a prehistoric beast); prahindeŭropa (Proto-Indoeuropean)
re- | over again, back again | resendi (to send back); rekonstrui (to rebuild); reaboni (to renew a subscription); rebrilo (reflection, glare, from 'shine'); reira bileto (a return ticket, from iri 'to go')
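To illustrate how the affixes listed in this appendix combine with roots and grammatical endings, the following Python sketch splits a word into morphemes. It is only an illustration and not the method implemented in this thesis: the names `ROOTS`, `PREFIXES`, `SUFFIXES`, `ENDINGS`, `split_stem` and `segment` are hypothetical, the root list is a handful of roots taken from the examples in this appendix, and the sketch accepts any affix on any root, ignoring the semantic restrictions mentioned in the abstract.

```python
# Illustrative data only: a few roots from the examples in this appendix,
# not the word list built in this thesis.
ROOTS = {"dom", "lern", "patr", "grand", "skrib", "kred"}

# Affixes from the appendix; -ĉj and -nj are left out because they
# truncate the root and would need separate handling.
PREFIXES = ["bo", "dis", "ek", "eks", "fi", "ge", "mal", "mis", "pra", "re"]
SUFFIXES = ["aĉ", "ad", "aĵ", "an", "ar", "ebl", "ec", "eg", "ej", "em",
            "end", "er", "estr", "et", "id", "ig", "iĝ", "il", "in", "ind",
            "ing", "ism", "ist", "obl", "on", "op", "uj", "ul", "um"]

# Grammatical endings (an assumed, simplified set); longer endings are
# tried before the shorter ones they contain.
ENDINGS = ["ojn", "ajn", "oj", "aj", "on", "an", "en",
           "as", "is", "os", "us", "o", "a", "e", "i", "u"]


def split_stem(stem):
    """Return [prefixes..., root, suffixes...] for a stem, or None."""
    if stem in ROOTS:
        return [stem]
    # Try to peel one prefix off the front and analyse the remainder.
    for prefix in PREFIXES:
        if stem.startswith(prefix):
            rest = split_stem(stem[len(prefix):])
            if rest is not None:
                return [prefix] + rest
    # Otherwise try to peel one suffix off the end.
    for suffix in SUFFIXES:
        if stem.endswith(suffix):
            rest = split_stem(stem[:-len(suffix)])
            if rest is not None:
                return rest + [suffix]
    return None


def segment(word):
    """Split a word into morphemes, or return None if no analysis exists."""
    for ending in ENDINGS:
        if word.endswith(ending):
            parts = split_stem(word[:-len(ending)])
            if parts is not None:
                return parts + [ending]
    return None
```

With this toy lexicon, `segment("lernejestro")` yields `["lern", "ej", "estr", "o"]` and `segment("malgranda")` yields `["mal", "grand", "a"]`, while a string such as `"domxo"` has no analysis and yields `None`. Trying every ending and backtracking avoids confusing the endings -on and -an with the suffixes of the same form.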
