Oahpa! Opi!˜ Opiq! Developing free online programs for learning Estonian and Voro˜

Heli Uibo Jaak Pruulmann-Vengerfeldt University of UiT The Arctic University of Norway Cybernetica AS [email protected] [email protected]

Jack Rueter Sulev Iva University of Helsinki University of Tartu rueter.jack@.com [email protected]

Abstract (31.3% of the whole population in 2011). The mo- tivation to learn the is gener- This paper describes porting Oahpa, a set ally high among students and working-age people. of advanced interactive language learn- Estonian morphology and the use of correct cases ing programs, to two new languages both are the most difficult things for people with non- of which spoken in – Estonian as their mother . There- and Voro.˜ Our programs offer a plat- fore, a morphology-aware ICALL system would form where the user can practice vocab- be a helpful tool for Estonian language learners of ulary and the generation of morphologi- all ages. cally complex forms both in isolation and Recently, a couple of free online language within sentential contexts. An overview learning environments for Estonian have appeared of the Oahpa system and its two important that are not commercial but require the cre- building blocks – the morphological finite ation of a user account: keeleklikk.ee and state transducer and the pedagogical lexi- eestikeel.ee. These programs, however, con – is given. The development of mor- have slightly different foci and target groups com- phological finite state transducers for Es- pared to Oahpa. They are not very well suited to tonian and Voro,˜ as well as tailoring the the needs of university students. specific transducers for pedagogical pur- The Voro˜ language belongs to the same branch poses are described. The adaptation of of the Uralic as Estonian and both Estonian and Voro˜ Oahpa to the tar- Finnish. Traditionally it has been considered a get user groups is also discussed. subset of the group of the 1 Introduction Estonian language, but nowadays it has its own lit- erary language and the activists of Voro˜ are apply- 1.1 The languages ing for the recognition of Voro˜ as a regional offi- Estonian is the second largest Baltic Finnic lan- cial language in Estonian. The population of Voro˜ guage with approximately 1.2 million native speakers is estimated at 74,400, most of them re- speakers. It has several morphological features side in southeastern Estonia. common in agglutinative languages. Estonian, At the end of the 1980s a revival of South Es- however, has had a lot of influence from Swedish, tonian varieties started. A new standard of the German and Russian, as such it has lost har- Voro˜ language was developed by native speakers mony and is shifting towards becoming a fusional and activists, linguists and non-linguists alike. The language. standardisation led to the publication of a bilin- Estonian is the only official language in Esto- gual Voro-Estonian˜ dictionary in 2002, containing nia, and, in many professions, high-level Estonian 15,000 entries, and the Estonian-Voro˜ dictionary language skills are required. Free online programs in 2014, with 20,000 entries. for learning would contribute A course in the Voro˜ language and local (cul- to better Estonian language proficiency among tural) history, was taught in 19 schools in the lan- the people with other mother tongues in Estonia guage area in 2012/2013. The Voro˜ language is

Heli Uibo, Jaak Pruulmann-Vengerfeldt, Jack Rueter and Sulev Iva 2015. Oahpa! Õpi! Opiq! Developing free online programs for learning Estonian and Võro. Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015. NEALT Proceedings Series 26 / Linköping Electronic Conference Proceedings 114: 51–64. 51 taught mostly in primary school, in most cases The pedagogical motivation behind Oahpa was as an extracurricular activity, but as an elective in to develop a language tutoring system which nine schools (Koreinik, 2013). Most teaching materials for the Voro˜ language • has free-form dialogues and sophisticated er- have been created, published and provided by ror analysis the Voro˜ Institute. The materials include a • gives immediate error feedback and advice to reader/textbook (Vorokiilne˜ lugomik,˜ 1996), a the user primer (ABC kiraoppus,¨ 1998), a song collection (Tsirr-virr look˜ on˜ o,˜ 1999), a workbook for the • is flexible primer, a workbook for the audiotape, a book of local cultural history (Voromaa˜ kodolugu, 2004), • is easily integrated into instruction at schools an illustrated vocabulary (Piltsynastu, 2004), and and universities a variety of audio and (audio-)visual materials. In addition, there are many texts which can and are • enables the choice of main dialect and meta- being used for teaching: fiction, poetry, a travel- language ogue, print media and an annual series of the chil- dren’s own creation (Mino Voromaa,˜ since 1987). • is freely accessible via the Internet (Koreinik, 2013) Oahpa consists of six games: a vocabulary Since 1996 the Voro˜ language as a can quiz (Leksa) which is based solely on a se- be studied at the University of Tartu. Since 2003 mantically enriched electronic dictionary, a nu- the subject has been called “South Estonian I” for meral quiz (Numra) based on a small finite state beginners, and “South Estonian II” for advanced transducer that generates and recognises numbers, students. Since 2004 there have been two series of date and time expressions, the morphology drill lectures: “Modern Southern ” games Morfa-S (isolated word forms) and Morfa- and “History of the South Estonian literary lan- C (word forms in sentential contexts) that require guage”. The language of instruction of all these a morphological finite state transducer, a question- courses is Voro.˜ Some theses and dissertations answer drill (Vasta) and a dialogue game (Sahka). have also been defended in the Voro˜ language. In The last two games require morphological disam- 2006 and 2012 it was also taught at the University biguation and syntactic analysis on top of the mor- of Helsinki. phological analysis. A free online language learning system is very The first and so far the only instance of Oahpa important for the survival of the Voro˜ language. It that incorporates all the six modules – North will be integrated into the curriculum at Univer- Saami Oahpa – can be tried out on the URL sity of Tartu. At the same time we aim to design http://oahpa.no/davvi/. For some other languages the system in a way that would make it usable for a version of Oahpa with four modules exist individual internet-based learning. This is the only – Leksa, Numra, Morfa-S and Morfa-C. We are way to learn the Voru˜ literary language for many planning to create Voro˜ Oahpa in the same scale. people because most of the Voro˜ speakers have For Estonian our purpose is to go a step further never learned the language at school; there are still and also implement the fifth module, Vasta, that few possibilities for traditional learning and also assumes morphological disambiguation. the literary language is relatively new. Thanks to a cooperation project between the Universities of Tartu and Tromsø, we can make 1.2 Oahpa use of the powerful language technology devel- The ICALL system Oahpa (Antonsen et al, 2009) opment infrastructure (Moshagen et al, 2014) that has been developed at Giellatekno, the centre for has been set up at Giellatekno, and among other Saami language technology at UiT The Arctic things reuse their technologies of creating ICALL University of Norway. The intended target group applications. of Oahpa are adult language learners and it is pri- This paper presents work in progress. The de- marily meant as a supporting tool for learning vo- scribed systems are in the stage of development cabulary and grammar for a students attending re- and most of the modules of both Estonian and spective language courses. Voro˜ Oahpa are still incomplete.

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 52 2 The prerequisites for creating Oahpa 2.2 Morphological FST of Estonian

In order to set up the above mentioned modules North Saami and Estonian stand out among the of Oahpa the minimal set of language resources Uralic languages as the ones deviating most from consists of the agglutinative type. The net outcome of this is a system of non-concatenative morphology (con- sonant gradation, diphthong simplification) com- • a morphology engine, e.g. a morphological bined with a small set of reusable affixes. This finite state transducer (FST), requires concatenative and suprasegmental trans- ducers being composed as serial transducers in or- • a pedagogical lexicon that is enriched with der to represent the morphology in an adequate semantic categories and other information way (for an analysis of Saami and Estonian see that is used in Oahpa. (Trosterud and Uibo, 2005)).

2.1 Morphology engine 2.2.1 Existing implementations We have chosen finite state transducers as a model There are at least three implementations of com- for formalising Estonian and Voro,˜ partly because puter morphology of Estonian but they all share this technology is supported by the Giellatekno in- one common basis that was described in the lex- frastructure but also considering its theoretical and icon and grammar parts of Concise Morpholog- performance-related pros and cons. ical Dictionary (CMD) (Viks, 1992). On one Most modern natural language processing hand, CMD was created in cooperation with com- (NLP) applications perform their tasks using sta- putational linguists and is quite formal and easy tistical language models. At the same time, for to implement. On the other hand, CMD deals morphologically rich languages, estimation of the mostly with morphology of simple words and language models is problematic due to the high with some derivational processes but ignores com- number of compound words and inflected word pletely compounding which gives approximately forms. Thus, rule-based models are better suit- 10..20% of the words in Estonian texts. Also, its able for describing the morphology of highly in- base dictionary is an outdated normative dictio- flected languages. Another argument for choos- nary which has a lot of old words and words that ing the rule-based methods is the relatively limited are used only in some . There are words amount of electronically available texts for lan- that no one knows what they mean or where they guages such as Estonian with its 1.2 million speak- come from. That means that there are some prob- ers, and even more, Voro,˜ as its literary language lems with using this system for modern Estonian is new. both in rules and vocabulary. The attractiveness of the finite-state technology The best implementation of Estonian morphol- for natural language processing stems from four ogy is Estmorf (Kaalep and Vaino, 2000) by sources: modularity of the design; the compact Filosoft, with roots in CMD, the lexicon has been representation that is achieved through minimiza- heavily edited, rules have been adjusted and whole tion; efficiency, which is a result of linear recogni- new compounding mechanism is added so that tion time with finite-state devices; and reversibil- Estmorf would be suitable for using as a spelling ity, resulting from the declarative nature of such check engine and analyser for real Estonian texts. devices. (Wintner, 2008) Another implementation has been created at the Moreover, given the pedagogical applications in Institute of Estonian Language, based on the prin- sight, we were not only interested in automatic ciples of open morphology (Viks, 2000). It is morphological segmentation but in a system that mostly an implementation of CMD with an added would be able to generate the complete and correct mechanism to allow analysis of compounds. morphological paradigm for each lemma in the The third system is an FST-implementation of lexicon. That is, for our application correctness CMD that started its life as an experiment of was more important than coverage. The resources describing Estonian with two-level morphology of an educational application must be manually (Koskenniemi, 1983) in Heli Uibo’s master’s the- revised, otherwise such an application would not sis (Uibo, 1999). It was then gradually extended make any sense. with descriptions of some derivational processes

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 53 by Heli Uibo (Uibo, 2005) and with complete dic- automata to get a final lexical transducer. In order tionary of stems from CMD by Jaak Pruulmann- to build our FST in the Giellatekno infrastructure, Vengerfeldt in his master’s thesis (Pruulmann- our first step was to reorganize our sources. Some Vengerfeldt, 2010). Also, some compounding of the reorganization meant that we precompiled rules were added and the whole FST was com- some of the sources that were previously gener- piled of multiple smaller, specialized FSTs – there ated dynamically. Those build steps that com- was a FST that described generation of simple bined small FSTs were merged into the Giellate- word forms, another for simple-word exceptions kno infrastructure. The Giellatekno infrastructure that would override regular forms, a FST that de- is under active development to cater better to the scribed which of all possible concatenations of needs of languages and applications that use lan- simple word forms are allowed as compounds etc. guage descriptions. The maintainers of Giellate- All those smaller FSTs were combined to a large kno infrastructure added necessary hooks, so that final FST, that was able to generate and analyze we could do some specialised processing between word forms. There were a number of unsolved regular build steps of the Giellatekno infrastruc- problems like the need to revise the dictionary ture. similarly to what has been done for Estmorf, over- generation because of weak compounding rules After we had managed to build our sources us- etc. ing the Giellatekno infrastructure and get a FST that worked more or less identically to what we 2.2.2 Adaptation had had before, the next step was to adapt our Oahpa is built on the Giellatekno infrastucture and source. Mostly, this meant converting the tag set so far all the morphology systems that have been that was in use in the original FST to use the con- in Oahpa have been FST-based. Thus, it was quite ventions used in Giellatekno. Tag adaptation had natural to try and adapt the existing FST-based two aspects – most of the conversion was simple system for Oahpa by integrating the existing FST relabeling but in some instances the tag sets were into the Giellatekno infrastructure. This was use- not compatible or there were other reasons to con- ful for the other parts of the cooperation project sider bigger changes. Our existing tag system was that deal with machine translation as well. Also, mostly inspired by the structure that was dictated the wider context of cooperation project motivates by CMD. Specific labels were chosen so that it some of the decisions we made about FST. would be as compatible with an existing constraint For most languages, FST-s are described us- grammar syntax description of Estonian as possi- ing a large lexicon FST (usually as Xerox lexc ble. We suspect that Giellatekno’s tag system has (Karttunen, 1993) source or at least something similar roots. The tag system is based on the first that is compiled to become a lexc source) and an- supported language descriptions and it has been other FST to describe phonological processes us- extended and improved upon with the addition of ing two-level rules. The Giellatekno infrastruc- new languages with somewhat different require- ture is well suited for such a structure and offers ments. Most of the infrastructure and applications a comfortable set of supporting scripts and filters that depend on it have some adaptation to the ex- to generate a lot of specialized FSTs from the same isting Giellatekno tag set. We were also aware of source, if one follows some conventions. It is the fact that the constraint grammar tools we were also worth mentioning that most active languages originally trying to interface with were about to whose FST description is developed in Giellate- be integrated into the Giellatekno infrastructure as kno infrastructure are close relatives to Estonian well. So, it was decided that we would change tags – multiple Saami languages, Finnish and now also at source level in all our rules and in our morpho- Voro.˜ logical lexicon. In addition to simple relabeling Our FST started out with a two-FST model. where the same thing was expressed with differ- For various reasons, it was developed into a much ent tags (e.g. +in vs +Ine for inessive, +nom vs more complicated system of FSTs. The source +Nom for nominative) there were some minor dif- code of FSTs consisted of regular sources for au- ferences in the meaning of tags. For instance in tomata and custom made build scripts that gener- our system we had separate tags for number (e.g. ated full source files from smaller parts, compiled +pl, +sg) and person (e.g. +ps1, +ps2, +ps3) that binary FSTs from source and then combined those were combined where needed as Giellatekno has

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 54 precombined tags (e.g. +Sg1, +Pl2 for the first ial as some of such derivations are based on weak- person singular and the second person plural, re- grade stem (e.g. lugema ’to read’ and loetu spectively). ’something that has been read’). Some of our One smaller part of tag relabeling was to convert original more complicated system of FSTs dealt uppercase letters that were used in two level rules exactly with those derivations. However, Giel- to multichar tags so that uppercase letters could be latekno infrastructure does not do such things used as a part of regular alphabet. but rather adds derivation-tags to mark that this The sequence of tags in FSTs is as important as word was derived from some base lemma us- sequence of letters in a word. That means that in ing some specific derivation (loetu would be ana- order to generate a specific word form with a gen- lyzed as lugema+V+Der/tu+N+Sg+Nom instead erating FST, one usually has to know the lemma of loetu+N+Sg+Nom as before). It appeared that and the exact sequence of grammatical tags. The for disambiguation rules that kind of information simplest form of rule-based machine translation is useful and so, in the current version we analyze would take a word form in source language, an- (and generate) such derived words with both the alyze it, replace the stem using translation dic- synthetic lemma and original lemma with deriva- tionary and then try to generate the word using tional tags. One of the future tasks is to analyze the same grammatical tags as the original analy- whether the synthetic lemma is really useful for sis returned. Of course, the real languages are not any application or whether we could drop them that easy to translate and there are numerous more and simplify our build system. complicated rules but having the compatibility at During the adaptation and testing of our FST that level is still a desirable property. As Oahpa with Oahpa, it appeared that our system did needs to generate word forms as well, similar tag not have a good way to differentiate (partial) ordering rules for different languages are useful homonyms. There are quite a few paradigms that for developers and linguists who need to deal with have exactly the same written form for nomina- multiple versions of Oahpa in parallel. tive case forms, which is traditional the dictionary Our team members have studied languages that form for nouns and adjectives (e.g. sokk (nomina- we have been prioritized for a machine translation tive), soki (genitive) ’sock’ vs sokk (nominative), subproject, that is North Saami and Finnish. Com- soku (genitive) ’male goat’). This is often due to paring different languages we realized that there is the loss of the final vowel in the nominative and a lot of tradition involved in the ordering of tags such words actually inflect differently. For any in language descriptions. This means that even if application that needs to generate word forms by there were a generic ordering that could be used knowing the lemma and the grammatical informa- for all languages involved, for historical reasons, tion, that is of course a problem. So, to differen- it would be hard to enforce. tiate paradigms with overlapping nominative we used homonymy tags the Giellatekno tag set con- As a result of the analysis of different tagging tains. This means that applications using our FSTs conventions, the most notable change made to Es- have to be aware of those tags as well, usually in tonian system was the restructuring of the tags for the form of translation dictionaries having map- verb forms by Heiki-Jaan Kaalep theoretical foun- ping not to the usual nominative but to the lemma dations of which are described in (Kaalep, 2015). with an additional identifier (e.g. sokk+Hom1 and The aim of restucturing was to have a better match sokk+Hom2). between grammatical meanings and specific sur- face forms that are used nowadays. The generic conclusion from the last two prob- Another important difference between Giellate- lems is that the lemma in the morphological mod- kno tradition and our previous FST was that in ule is actually an identifier of paradigm and as with our original FST we automatically and dynami- other aspects of morphology module – what is cally generated regular derivations as if they were useful and what makes sense depends somewhat regular lemmas in the dictionary. For exam- on the intended users of the module. ple, there are productive rules that derive name The big problem of our current system is the of action and actor from a verb (e.g. ujuma overgeneration. The problem is largely a result of ’to swim’ gives ujumine ’swimming’ and ujuja the mechanism of compounding. The current sys- ’swimmer’). Generating lemmas is not quite triv- tem combines the automaton of simple words with

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 55 itself and then applies the filters that should only ence rules for different inflection classes (’majja’ allow proper compounds and simple words. Such is the preferred form but ’majasse’ is understood a structure enables us to add compounding rela- as well). Some inflection classes also allow mul- tively easily and deal with certain other problems tiple forms for some plural cases (regular plural that were hard to solve within our system of mul- with morpheme vs plural stem, e.g. opikutele˜ vs tiple automata. The current rules of compounding opikuile˜ ’onto the textbooks’). Knowing all those are too generic and there is a lot of allowed forms forms is necessary for analysis but for generation that actually are not used. Inclusion of some short in machine translation and for educational pur- words like names of musical notes in the lexicon poses it would be good to have only the preferred makes this problem worse. Using a regular un- forms generated. Giellatekno infrastructure offers weighted FST makes it also hard to prefer simple the tag +Use/NG to denote forms that are used but word analysis over compound word analysis. should not be generated for such purposes. We do The usual way to implement compounding rules have some experimental use of that flag but we still in the Giellatekno infra would use diacritic flags need to check and tag a lot of parallel forms either as described in (Beesley and Karttunen, 2003) and by word class or, in some cases, actually word by cycles in lexicon description. Converting our sys- word. tem to use compounding mechanism that would be more in line with other language descriptions that 2.3 Morphological FST of Voro˜ use the Giellatekno infra is something that we have Voro,˜ as is the situation with many of the other considered but have postponed so far. The most Uralic languages, has an abundance of regular important reason for this is that there is no clear morphology in both the nominal and verbal parts and formal description of compounding rules for of speech. As such, ready solutions for many of Estonian. The gap between existing formal rules the morphological challenges in Voro˜ might be (e.g. a noun can be added to the nominative or sought out in previous work done on the open- genitive form of another noun) and what really is source, Saami language technology infrastructure used (which of those two forms is preferred or if “Giellatekno” in Tromsø, Norway. Morphophono- some combinations are used at all) is quite big and logical work at Giellatekno on the Saami lan- sometimes explained only by tradition. guages has dealt with stem-internal vowel and As an experiment with flag diacritics we imple- change as well as orthographical word mented rules for the lowercasing of proper name compounding. derivations. We found that the current system of filters makes the creative use of extra flags quite In the initialization of a new language at “Giel- hard as the flags have to be precisely described latekno”, there are a number of default files for in filters to show where exactly they can appear. the development of two-level model and lexc de- Also, as flag diacritics are special and nontrivial scriptions. Concepts useful in the development to implement, they tend to cause problems that are morphophonological strategies and present in the hard to debug in both the filtering and composi- default files include triggers and allophonic vari- tion of FSTs. One issue, for instance, is whether ables. Typically, triggers might be used in coordi- the negation of the whole alphabet contain flag di- nating gradation in the stems, whereas allophonic acritics in different implementations of FST tools variables might be utilized in progressive vowel or even in different operations in the same tool? harmony. There are, however, a few advanced What happens when we apply priority union 1 to languages to follow in development at Giellate- FSTs with and without flag diacritics? kno, namely, Northern Saami, Southern Saami and The other aspect of overgeneration is parallel Finnish. When in doubt these are the ones to quote forms. There are alternative forms of illative (long and question because they are the scenes of most illative that uses regular morpheme and short illa- active development. tive that usually uses stem alteration, e.g. ma- The morphophonological characteristics of jasse vs majja ’into a house’) with some prefer- Voro,˜ at first glance, appear to be reminiscent of those attributed to Finnish. In addition to a par- 1an operation on FSTs which allows us to declare and allel of the gradation system found in Estonian combine a large regular FST and a smaller one with some exceptions that override regular rules with much shorter de- and Northern Saami, where changes in stem quan- scriptions than lexc-only descriptions would allow tity and quality can be attested without apparent

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 56 surface-level motivation, Voro˜ possesses progres- tral that do on block , in sive front- harmony. This said it was Voro˜ these vowels are e and i. Since work with easy to find parallels on the Giellatekno infrastruc- two-level rules is something that continues over a ture. certain period of time, it is always useful to pro- vide illustrative example contexts for the individ- 2.3.1 Initial approach to finite state ual rules. These examples will also help in future description development since the original writer will not al- Classification of the Voro˜ language is available ways remember or be there to explain them. from Sulev Iva’s dissertation (Iva, 2007) and on- Back vowel context, as illustrated in (2.1) can line in the Voro-Estonian-V˜ oro˜ dictionary site be broken into four increments. The first incre- http://synaq.org. Sulev Iva provided a dig- ment is a required word boundary followed by ital copy of his word type classification lists, and zero or more . The second increment work could be commenced. is the optional insertion of one or more neutral In a parallel to previous lexc work on the Giel- vowels followed by zero or more consonants. The latekno infrastructure, inflection type names are required third increment is the presence of an un- simply words representative of the given type. derlying or surface back vowel, which is followed Thus the inflection type name “VOROK˜ ON˜ O”˜ ’a by a fourth increment that cannot contain a word person from Voro’˜ is used to distinguish nominals boundary or front vowels, be they underlying of sharing its characteristics (cf. North surface. and South Saami and Finnish). Since subse- (2.1) quent work often includes syntactic disambigua- Back vowel context tion, a part-of-speech indicator is prefixed onto the BT = # Cns* ([VowNeutral]+ Cns:*) inflection type, which renders A VOROK˜ ON˜ O,˜ [VowBack: | :VowBack] N VOROK˜ ON˜ O˜ and PROP VOROK˜ ON˜ O˜ con- [# | VowFront: | :VowFront]* ; tinuation lexica that can be further directed to a (2.1.1) mutual nominal lexicon in NMN VOROK˜ ON˜ O˜ !e# viska%ˆWGStem%>%{A\"a%}q (it is done similarly in the morphological FSTs !e0 vis0a00aq of other Uralic languages such as Finnish, Livo- The example in (2.1.1) shows a combination of nian, Moksha, Hill Mari, Livvi, Skolt Saami and a trigger ˆ WGStem (weak grade stem) and an al- Nenets). lophonic target {Aa¨} – front-back harmony for According to the initial description used for low unrounded vowels a and a¨, where the result- Voro,˜ there are approximately 50 different declen- ing vowel harmony is back-harmony a¨. sion groups (47 vs 53) and 40 conjugation groups (2.1.2) (40 vs 36). The inflection groups contain words !e# fu¨usiga%ˆStrGStem%ˆVowRM%>i%>d%{¨ %}˜ e representative of both front and back vowel har- ! 0 fu¨usik0000i0de¨ mony, and therefore work was immediately begun (2.1.3) on the description of progressive front-back vowel !e# fu¨usiga%>l%{¨ OE%}˜ e harmony. ! 0 fu¨usiga0l¨ o˜ In accordance with previous work on Mari, Problematic contexts with mixed harmony can Erzya and certain Balto- a ready be observed in examples (2.1.2) and (2.1.3), where solution was reached for front-back vowel har- the word “fu¨usiga”¨ contains both front and back mony. Two-level rules can deal with vowel har- vowels. Here irregularity in just a few stems com- mony through the definition of vowel sets and promises the simplicity sought in two-level rules. contexts. In Voro˜ this was initially accomplished One possibility, of course, is to list these irregu- through the definition of back, front and neutral larities as exceptions. A second possibility, it will vowels (1), and subsequent contexts (2). be noted, is to classify stems on the basis of front- back harmony for all inflecting word classes. This (1) is what the OMorFi description of Finnish does, VowBack = a o u \˜{o}; no two-level rules are given for progressive vowel VowFront = \"a \"o \"u ; harmony. VowNeutral = e i ; The continuation lexica in the OMorFi Finnish As is the case in Finnish, there is a set of neu- description explicitly indicate both harmony and

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 57 gradation, and virtually leave the two-level rules utilized by the computer for meaningful feedback. unused. In practice this utilization of lexc dou- Such feedback in (3.1) might include “vowel loss” bles the number of inflection type lexica for those derived from the trigger “%ˆ VowRM”. The trig- with vowel harmony targets, which, in the case ger “%ˆ WGStem” would indicate “weak grade of Finnish nouns, would comprise seven instances stem”. Both would be associated with plural geni- out of twelve. The illative in Finnish provides its tive in “de”, as indicated expicitly in the continua- own challenge, since it entails a duplication of the tion lexicon “FRONT PL-GEN de” at (3.1.6) and stem-final vowel, i.e. eight vowels. Gradation in (3.1.7). The presence of the tag “+Use/NG” in Finnish centers on the “k”, “t” and “p”, (3.1.7) implies that the tagged sequence will be ac- which is a small number in comparison to what cepted by the analyzer but not generated. is attested in Voro.˜ These combinations are aug- (4) mented by the need for expressing pluralia tan- pereh:perre N\_PEREH ; tum, and the result is upward of 550 noun types (25.03.2015). Here the attribute values in the “” ele- This is a good time to ask whether such a solu- ment are hoped to be applied to the morphological tion might be used in Voro,˜ and whether it would games. be useful. First we have to ask ourselves what the (5) transducers will be used for. If we are interested in ICALL, then we want intelligent feedback for our language learners. in the descriptions accompanying each inflection pereh type. By establishing vowel harmony in an inflec- tion type, we are providing the computer the nec- essary information needed to prompt the learner ... with regard to vowel harmony issues. We are look- ing into making information available at the lexi- con level for computer reading. It is hoped that the 2.3.2 Brief assessment of progress information might be automatically added to indi- A happy medium is being sought for the intermin- vidual words directed through a given lexicon, see gling of lexc and two-level rules strategies. Grad- (3). ual progress is being made towards a lexc solu- (3) (3.1) LEXICON N_PEREH tion to vowel harmony, whereby word stems are (3.2) ! pereh:perre classified for front-back harmony, so as to allow (3.3) ! vowel_harmony: front for immediate vowel-harmony error generation. (3.4) ! gradation: yes Changes in the stem, however, are being worked (3.5) !! * Yaml: __N-pereh_gt-norm.yaml__ (3.6) :%ˆVowRM%>i FRONT_PL-GEN_de ; on with triggers paralleling morphophonological (3.7) +Use/NG:%ˆWGStem%> h FRONT_PL-GEN_de ; rules such that wrong triggers can be applied in The declarations for vowel harmony at (3.1.3), order to produce erroneous forms, according to and gradation at (3.1.4) are information bits that strategies already developed for Northern Saami. can be transferred to the ICALL infrastructure. In fact, it is the initial continuation lexicon associated 2.4 Pedagogical lexicons with a given lemma and stem pair that makes ref- We have used different approaches when compos- erence to all this information, see “N PEREH” in ing the lexicons for Estonian and Voro˜ Oahpa. (3.1). The lexicon of Estonian Oahpa has been created Hence the continuation lexicon “N PEREH” in on the basis of the word list of the Estonian text- (4) is associated with the definition at LEXICON book for beginners ”E nagu Eesti” (Pesti and Ahi, N PEREH in (3.1), and attribute values can be 2011). The word list divided into the twenty-five added to the “” element in (5). At the same chapters is given in the end of the textbook. It is time there is contemplation going on with regard a list of ca 1500 Estonian words and expressions to the use of well-documented triggers, such that with translations to English, Finnish, Russian and the information from a single input line could be German.

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 58 We can bring out the following advantages and framework for creating web applications support- disadvantages of using a textbook’s dictionary as ing the model-view-controller (MVC) design. a basis of Oahpa lexicon: 3.1 Setting up the Django application + Book and chapter information are given, the As there has already been set up a number of in- lexicon can be easily used for additional stances of Oahpa for different languages in the grammar training in courses that base on that Giellatekno infrastructure the process of creating book. Translations of words to four most Oahpa for Estonian and Voro˜ was quite routine common languages of learners of Estonian and did not require much effort. Obviously, each exist. Oahpa instance has different settings – paths to - Information about part-of-speech and seman- linguistic tools, database access data, list of sup- tics had to be added manually or semi- ported languages etc. We are aiming at having al- automatically. most all the language-specific information in the The creation of the Voro˜ Oahpa lexicon started settings file and in the database, rather than in the out with a small lexicon that had just one word Python code. However, the Python code is still for each inflection type. We have chosen this ap- not entirely language-independent. A few adjust- proach because we think it is important that the ments were needed in the lists of grammatical and morphological exercises cover all the inflection semantic categories, in the set of word attributes types. We have tagged the representative words that come in from the lexicon, in the list of lan- of inflection types so that exclusively these words guage pairs in the vocabulary drill game Leksa, in can be chosen for the morphology drill exercise the list of localisation languages, in the initial set- Morfa-S. Otherwise, the words are chosen from tings of the games and in the spell-relax function. the full lexicon where some of the inflection types 3.2 Creation of the database are less frequent and as the words are randomly selected there is a possibility that the user does not The complete database for an instance of Oahpa get a chance to practise some inflection types. Af- that incorporates Leksa, Morfa-S and Morfa-C ter that we added ca 2500 words from the lexicon (note that the program Numra is solely based on of North Saami Oahpa that incorporated transla- the finite state transductors converting the numer- tions to North Saami, Finnish, English, Norwegian ical, time and date expressions into textual form (bokmal),˚ Swedish and German. and vice versa) contains Advantages and disadvantages of using an • words, their translations and semantic classes Oahpa lexicon of another language as a basis were the following: • tags used in the morphological analysis and + Part-of-speech and ”Oahpa-style” semantic their possible sequences (paradigms) information were already there, as well as • word forms for Morfa-S translations to Finnish and English. • question templates for Morfa-C linked to the - The word list does not match the word list word forms that can replace the variables in of any textbook of Voro,˜ therefore this infor- the templates mation must be added afterwards. Because it was a North Saami lexicon, it contained many • morphological feedback information for each words that were irrelevant for Voro˜ (words word form (this is combined from different about Saami handicraft, reindeering, also too characteristics and features of the word that strong focus on the topic ”Christmas”). gives hints about how to inflect the particular word) - Translations to Voro˜ and Estonian had to be added. We use a morphological generator to make the paradigms automatically. During the generation of 3 Creation of the Oahpa applications for the word forms and saving them into the database Estonian and Voro˜ an error log is written to a file. So the database Oahpa is a web application developed in Django generation process also serves as testing of the framework. Django is a powerful open-source morphological FST.

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 59 Figure 1: Screenshot of the vocabulary drill program Leksa in Voro˜ Oahpa

So far we have set up Leksa, Morfa-S (substan- tonian). Other translation languages in the list are tives) and Morfa-C (substantives) for Estonian. Finnish, English, German, North Saami, Swedish The programs have been tested by the developers and Norwegian. The correct answers of the user of FSTs and Oahpa and demonstrated to the teach- are displayed in green. In the second column the ers of Estonian at Tartu and Uppsala universities correct answers are presented by the system. The and at Estonian School in Stockholm for getting correct answer is given in parentheses if the user’s some feedback. answer is correct. The working modules of Voro˜ Oahpa are The current version of Estonian Oahpa can Numra, Leksa and Morfa-S (substantives and be tried out at the address http://testing. verbs). The user interfaces of both Estonian and oahpa.no/eesti and Voro˜ Oahpa at http: Voro˜ Oahpa have been translated to Estonian. //testing.oahpa.no/voro. The programs are free to use for everyone and do not require any The user interface of Leksa in Voro˜ Oahpa can registration. be seen on Figure 1. There are three menus for specifying the exercise – teema (’topic’, i.e. se- 3.3 Some problems and their solutions mantic category), keeltepaar (’language pair’) and 3.3.1 Spell-relax Opik˜ / peatukid¨ (’book / chapters’). The first menu makes it possible to constrain the set of words Spell-relax means that the program accepts differ- offered to the user by semantics. On Figure 1 ent variants of typing for some characters or se- the words are chosen from the category Loomad quences of characters. This feature had been pre- (’animals’), for example orrav (’squirrel’), harm¨ viously implemented in Oahpa in order to make (’spider’), karbl¨ ane¨ (’fly’), sulg’ (’feather’). There Oahpa usable for users who do not have access to are 19 semantic categories in the list, among oth- a keyboard (either virtual or real) with the layout ers family, food/drink, time, body, clothes, build- of the language in focus. ings/rooms, work/economy/tools. From the sec- We have not implemented any spell-relax in ond pull-down menu the user can choose the lan- the Estonian Oahpa. We could perhaps consider guage pair. The default is from Voro˜ to the lan- accepting ’sh’ instead of ’s’ˇ and ’zh’ instead of guage of the user interface (in the given case – Es- ’z’ˇ because it might be that everybody has not

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 60 installed the Estonian keyboard but probably it sion of the North Saami Oahpa in 2009-2011) would be a better idea to have a link to installation were university students and other adult language instructions of an Estonian keyboard. The written learners who were learning North Saami as L2. Estonian is highly normative and it would not be The North Saami Oahpa has been integrated into pedagogical to accept wrong spellings where for the university courses at UiT. There are course example the letters a,¨ o,¨ u,¨ o˜ are replaced by cor- pages with different kinds of materials for learning responding letters without diacritics. It is also im- North Saami – texts with reading comprehension portant for the learner to capitalise proper names questions, recorded dialogues, grammar explana- etc. where needed. Uppercase/lowercase mistakes tions, lexicons. When taking a university course in are not tolerated. North Saami the students are working with Oahpa The situation is quite different for the Voro˜ lan- in the logged in mode that makes it possible for guage. We have implemented spell-relax in Voro˜ students to see their progress and for teachers and Oahpa because the written language of Voro˜ is researchers to track the activity of the students, relatively new and there is a big variety for how also they can see which topics in the course seem some phenomena are expressed. The things that to be most difficult and hence should be given are being spelled in various ways are not usual more attention. From the lessons there are direct but rather symbols that mark a slightly links to appropriate drill exercises in Oahpa. different pronunciation: On the course pages http://kursa. 1. palatalisation (conventionally denoted by oahpa.no and in North Saami Oahpa the modifier , but all the other scientific linguistic terminology is used and the apostrophe-like characters are also accepted) grammar is explained on a level that is appropriate for its primary target group – adult learners. It 2. glottal stop (conventionally denoted by the should be noted, however, that some primary and letter ’q’ but the use of ’q’ is not consequent secondary schools have also expressed an interest in the texts that are being published in the in using Oahpa. Since Oahpa is freely available Voro˜ language) on the Internet, it should be adapted to the wide 3.3.2 What is a correct word form? user group – there should be possibilities to ask for help about difficult terms etc. That is As a developing language with multiple accepted why the developers of the North Saami Oahpa forms, Voro˜ may prove overwhelming for the be- have introduced additional tooltip explanations ginner. For solely pedagogical purposes, it may of terms in Vasta and Sahka that pop up when prove necessary to limit the number of forms gen- pointing on a term in the error feedback. We erated by the computer prompter to one given stan- plan to implement such help tooltips for Estonian dard while allowing students the liberty of writing Oahpa as well. all possible forms. To this end the tag ”+Use/NG”, which has been used in MT previously at Giellate- The target group of Voro˜ Oahpa will be the stu- kno, can be used. Its use will provide form pref- dents at the Voro˜ language courses at University of erence, something parallel to word preference al- Tartu. These are typically students of the Estonian ready marked in the oahpa xml dictionaries with language, thus they usually have a solid linguistic the ”stat” attribute value ”pref”. background. Should we accept some forms that are not nor- The Estonian Oahpa will have three or four mative but widely used? We might do it for Voro˜ quite different user groups. The first group are as the standarisation of this language is not fi- the foreign students who have come to study at the nalised yet. The program should, however, sug- University of Tartu and are taking a course in Esto- gest the normative form as the correct answer after nian as L2. They will use Oahpa parallel to tradi- it has accepted a widely used but non-normative tional language lessons in the classroom. The stu- form. dents have different mother tongues and are learn- ing Estonian on the basis of English, Finnish or 4 User groups of Estonian and Voro˜ Russian. The University of Tartu Language Cen- Oahpa and adaptation issues tre is organising these kinds of courses and plenty The primary target group when designing Oahpa of students sign up for them each term. framework (that is, when developing the first ver- The second group are students who take the

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 61 Figure 2: Screenshot of a more advanced Morfa-C exercise web-based course in Estonian at Uppsala Univer- Oahpa to Swedish, for making Leksa usable for sity in . The course that has been cre- pupils at ESS. Leksa has been tested by the sec- ated by prof. Raimo Raag (Raag, 2010) is to- ond grade pupils at this school and the feedback tally internet-based, the teachers meet their stu- was positive. The teachers see the use of Leksa for dents only at video conferences. The course au- learning both Estonian and Swedish. For young thors and teachers estimated Estonian Oahpa po- children with Estonian as the mother and tentially useful for their students in achieving their for Estonian as L2 learners Leksa may be used for vocabulary and grammar learning goals, given that training spelling of Estonian words. Estonian chil- some more exercise types will be implemented dren who have recently moved to Sweden can also and links set from the lessons in the course ma- use Leksa for learning Swedish words. terials to the relevant exercises in Oahpa. Instead of international () case names that The third group are the pupils at Estonian are generally not known to primary and secondary School in Stockholm (ESS). This is the only school students and because the school grammar school in Sweden where Estonian language and books use Estonian case names, we are using case culture are taught. The pupils in this school have questions (e.g. kelle? ”whose” Omastav ”Geni- different language backgrounds. Part of them have tive” instead of Genitiiv) and Estonian case names. Estonian as their mother tongue and have recently Led by the feedback of university teachers we moved from Estonia to Sweden, another part has have deviated from the standard setup of case list lived in Sweden for a longer time and grown in Estonian Morfa-C. Instead of always explic- up in the environment. Some itly giving the case we have implemented some pupils are grandchildren of who moved exercises which include a choice between two to Sweden in the 1940s. Another part of the pupils grammatical forms. For example, in the exercise has no connection to the Estonian language at all. “Kuhu?” ’Where (to)?’ the student must choose Considering these different user groups we defi- between illative and allative. An example screen- nitely have to make some adaptations, in particular shot of this exercise is presented on Figure 2. for the young learners. The Estonian language teachers at ESS have We have translated the lexicon of Estonian also given some other ideas for Morfa-C exer-

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 62 cises that are on the waiting list of implementa- 5 Conclusions and future work tion: choosing the case, choosing the cor- rect infinitive form (there are two infinitives in Es- The use of FSTs for morphological analysis and tonian – da-infinitive and ma-supine – the usage generation and standardised XML formats to store of which in a given context is difficult for non- lexicon and exercise frames makes it possible natives) and more. These are examples of exer- to effectively create a variety of morphological cises that are well supported by the Oahpa frame- drills for learners of morphologically complex lan- work and easy to implement. guages. Our experiments with setting up language learn- Another possibility to adapt the same instance ing system for two new languages – Estonian of Oahpa to different user groups is to have the and Voro˜ – prove that the method that has choice between different sources of words. Both been worked out at Giellatekno research group in Leksa and Morfa-S have the corresponding menu Tromsø is efficient and makes it possible to create ’book’ in their user interface. Lexicons, word lists the first prototypes of vocabulary and morphology of textbooks, single chapters or groups of book drill modules with a relatively small effort. The chapters can be listed in this menu. obligatory prerequisites for creating such a system For teaching the university course of Estonian are a lexicon that can just be a word list of the as a foreign language the textbook ”E nagu Eesti” course textbook in the pdf format and the morpho- (Pesti and Ahi, 2011) is used both at the University logical FST that will be used for generating all the of Tartu and Uppsala University. inflection forms of the words in the lexicon. The major work that has to be done is the work We are also planning to add the dictionaries of in developing the morphological FST. At the same the Estonian textbooks used at the Estonian lan- time, the FST in itself is a multi-purpose building guage courses at ESS into the Oahpa lexicon. The block that can be used in a variety of applications same textbooks are also used at Russian schools in as for example spelling check, machine translation Estonia and the Estonian school in Riga. and an intelligent dictionary. One of the initial ideas when designing Oahpa In our case, we had to create the Voro˜ FST from was that only ”the known” words (words that oc- scratch. Despite that, we could start developing cur in the textbook’s word list and also in the the language learning system in parallel with that vocabulary drill program Leksa) will be used in and very soon come up with a prototype of the grammar exercises. We will make it more fine- morphology drill program that just contained one grained. According to feedback from the teachers representative from each inflection type. of Estonian it is important that beginners’ gram- There existed a “beta version” of FST for Esto- mar exercises would not contain too advanced vo- nian that had to be restructured somewhat in order cabulary. Thus, there is a new detail in the Morfa- to accommodate it in the Giellatekno infrastruc- C question frames for Estonian – not only the se- ture. mantic class but also the book chapter where the The pedagogical applications also set some ad- word is introduced is determined when selecting ditional constraints on the FST. It is usual that words for a particular grammar exercise. some of the parallel forms and infrequent words Some of the vocabulary can also be unknown have to be excluded from the pedagogical FST. because of cultural differences. For example food Here, once again, the Giellatekno infrastructure al- differs quite a lot even between otherwise cultur- ready had a solution that we could apply. ally quite similar countries Estonia and Sweden. The described systems are in the middle of de- People who have not grown up in Estonia may velopment but as the feedback from teachers has wonder what uhepajatoit¨ (a typical Estonian late been positive we feel optimistic about continuing summer / autumn hot pot usually made of pork, the work on both Estonian and Voro˜ finite state carrot, turnip and cabbage) or rosolje (a Russian transducers and the respective Oahpa instances beet root salad) is. We still think that learning where with first of all plan to complete Morfa-S a language cannot be separated from getting ac- and Morfa-C with other inflectable word classes. quainted with the culture. Probably, these words It is also important to add more guidelines and are not appropriate in the exercises meant for ab- error feedback to the systems. We are going to solute beginners but they could come a bit later. use the same approach as in North Saami where

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 63 the feedback of morphological forms is combined Kimmo Koskenniemi. 1983. Two-level Morphology: from pieces of information that characterise the in- A General Computational Model for Word-Form flection type of the given word. This approach is Recognition and Production. PhD Thesis. Univer- sity of Helsinki. described in (Antonsen, 2012) and (Antonsen et al., 2013). Sjur Moshagen, Jack Rueter, Tommi Pirinen, Trond Trosterud, and Francis M. Tyers. 2014. Open- Source Infrastructures for Collaborative Work on Under-Resourced Languages Proceedings of References CCURL (Collaboration and Computing for Under- Lene Antonsen, Saara Huhmarniemi, and Trond Resourced Languages in the Linked Open Data Trosterud. 2009. Interactive pedagogical programs Era) workshop 2014 organised with LREC2014: based on constraint grammar. Proceedings of the 71–77 European Language Resources Association 17th Nordic Conference of Computational Linguis- (ELRA). tics. Nealt Proceedings Series 4. Mall Pesti, and Helve Ahi. 2011. E nagu Eesti. Eesti Lene Antonsen. 2012. Improving feedback on L2 mis- keele opik˜ algajaile ’E as Estonia. Estonian for be- spellings – an FST approach. Proceedings of the ginners’. TEA Kirjastus, , Estonia. SLTC 2012 workshop on NLP for CALL, Lund, Jaak Pruulmann-Vengerfeldt. 2010. Praktiline loplikel˜ 25th October, 2012. Linkoping¨ Electronic Confer- automaatidel pohinev˜ eesti keele morfoloogiakirjel- ence Proceedings 80: 1-10. dus ’Practical Finite State Morphology of Estonian’. M.Sc. Thesis. Tartu Ulikool,¨ Tartu, Estonia. Lene Antonsen, Ryan Johnson, Trond Trosterud, and Heli Uibo. 2013. Generating modular grammar ex- Raimo Raag. 2010. Den sprakliga˚ mangfalden˚ – ercises with finite-state transducers. Proceedings of smaspr˚ akens˚ renassans.¨ (Ed.) Jenny Lee Kun- the second workshop on NLP for computer-assisted skapens nya varldar,¨ Uppsala: Uppsala Learning language learning at NODALIDA 2013, May 22-24, Lab, Uppsala universitet: 211–221. Oslo, Norway. NEALT Proceedings Series 17: 27- 38. Trond Trosterud and Heli Uibo. 2005. Consonant Gra- dation in Estonian and Sami: Two-Level Solution. Kenneth R. Beesley and Lauri Karttunen. 2003. Finite (Eds) Antti Arppe et al. Inquiries into Words, Con- State Morphology. CSLI publications in Computa- straints and Contexts: 136–150. tional Linguistics, USA. Heli Uibo. 1999. Eesti keele sonavormide˜ arvu- Sulev Iva. 2007. Voru˜ kirjakeele sonamuutmiss˜ usteem¨ tianalu¨us¨ ja -suntees¨ kahetasemelist morfoloogia- ’Inflectional Morphology in the Voro˜ Literary Lan- mudelit rakendades. ’The computerized analysis and guage’. PhD Thesis. Tartu Ulikooli¨ Kirjastus, Tartu, synthesis of Estonian word forms using the two-level Estonia. morphology model’. M.Sc. Thesis. Tartu Ulikool,¨ Tartu, Estonia. Heiki-Jaan Kaalep. 2015. Eesti verbi vormistik. ’Es- tonian verb paradigm’ Keel ja Kirjandus 1/2015: Heli Uibo. 2005. Finite-State Morphology of Es- 1–16. Eesti Teaduste Akadeemia ja Eesti Kirjanike tonian: Two-Levelness Extended. (Ed R. Mitkov) Liidu ajakiri. SA Kultuurileht, Tallinn, Estonia. Proceedings of the International Conference on Recent Advances in Natural Language Processing Heiki-Jaan Kaalep and Tarmo Vaino. 2000. Teksti (RANLP) 2005: 580–584. Borovets. taielik¨ morfoloogiline analu¨us¨ lingvisti to¨ovahendite¨ ¨ komplektis. ’The complete morphological anal- Ulle Viks. 1992. A concise morphological dictionary ysis of the text in the toolbox of a lin- of Estonian: introduction & grammar. Estonian guist.’ Arvutuslingvistikalt inimesele. Tartu Ulikooli¨ Academy of sciences, Institute of language and lit- uldkeeleteaduse¨ oppetooli˜ toimetised: 87–99. Tartu erature. ¨ Ulikooli Kirjastus, Tartu, Estonia. Ulle¨ Viks. 2000. Eesti keele avatud morfoloogia- mudel. ’An open morphology model of Esto- Lauri Karttunen. 1993. Finite-State Lexicon Com- nian’ Arvutuslingvistikalt inimesele. Tartu Ulikooli¨ piler. Technical Report. ISTL-NLTT-1993-04-02. uldkeeleteaduse¨ oppetooli˜ toimetised: 9–36. Tartu April 1993. Xerox Palo Alto Research Center. Palo Ulikooli¨ Kirjastus, Tartu, Estonia. Alto, California. Shuly Wintner. 2008. Strengths and weaknesses of Kadri Koreinik. 2013. The Voro˜ language in Esto- finite-state technology: a case study in morpho- nia. ELDIA Case-Specific Report. Studies in Euro- logical grammar development. Natural Language pean Language Diversity 23. (Ed.) Johanna Laakso Engineering 14(4):457-469. Cambridge University Research consortium ELDIA c/o Prof. Dr. An- Press. neli Sarhimaa, Northern European and Baltic Lan- guages and Cultures (SNEB), Gutenberg- Universitat¨ Mainz.

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015 64