Many shades of grammar checking – Launching a Constraint Grammar tool for North Sámi Linda Wiechetek [email protected] Sjur Moshagen [email protected] Børre Gaup [email protected] Thomas Omma [email protected]

Divvun UiT Norgga árktalaš universitehta

1 Introduction language community. The released grammar checker is a light gram- This paper discusses the characteristics and evalu- mar checker in the sense that it detects and corrects ation of the very first North Sámi spell- and gram- errors that do not require rearranging the whole mar checker. At its launch it supports LibreOf- sentence, but typically just one or several adjacent fice, MS and GoogleDocs. We describe word forms based on a grammatical analysis of its component parts, the technology used such as the sentence. Additionally, a number of format- Constraint Grammar (Karlsson, 1990; Karlsson ting errors are covered. The main new features et al., 1995; Bick and Didriksen, 2015) and Hfst- are implemented by means of several Constraint pmatch (Hardwick et al., 2015) and its evaluation Grammar-based modules. These include correc- with a new evaluation tool specifically designed tion of formatting and punctuation errors, filtering for that purpose. of speller suggestions, much improved tokenisa- Only the modules of the full-scale grammar tion and sentence boundary detection, as well as checker described in Wiechetek (2017) that have advanced error analysis and correction. been tested and improved sufficiently for the pub- This paper shows how a finite-state based lic to use are released in this launch. More spellchecker can be upgraded to a much more advanced syntactic errors will be included at powerful spelling and grammar checking tool by a later stage. The grammar checker modules adding several Constraint Grammar modules. are an enhancement to an existing North Sámi spellchecker (Gaup et al., 2006), following a phi- 2 Framework losophy of release early, release often, and using An open-source spelling checker for North Sámi continuous integration (ci) and continuous deliv- has been freely distributed since 2007, the be- ery (cd) to deliver updates with new error correc- ginnings of which have been described by Gaup tion types, as new parts of the grammar checker et al. (2006). The tool discussed in this paper are sufficiently tested. Releasing at an early stage includes the open-source spelling checker refer- of development gives the user community early enced above, but further developed and using the access to improved and much needed tools and al- hfst-based spelling mechanism described in Piri- lows the developers to improve the tools based on nen and Lindén (2014). The spelling checker is the community’s feedback. enhanced with five Constraint Grammar modules, North Sámi is a Uralic language spoken in cf. Figure 1. It should be noted that the spelling Norway, Sweden and Finland by approximately checker is exactly the same as the regular North 25 700 speakers (Simons and Fennig, 2018). Sámi spelling checker used by the language com- These countries have other majority languages, munity, but with the added functionality that all making all Sámi speakers bilingual. Bilingual suggestions are returned to the pipeline with their users frequently face bigger challenges regarding full morphological analysis. The analyses are then literacy in the lesser used language than in the ma- used by subsequent Constraint Grammar rules dur- jority language due to reduced access to language ing disambiguation and suggestion filtering. arenas (Outakoski, 2013; Lindgren et al., 2016). All components are compiled and built using Therefore, tools that support literacy, like spelling the Giella infrastructure (Moshagen et al., 2013). and grammar checkers, are more important in a Five constraint grammar modules, i.e. a va- minority language community than in a majority lency grammar (valency.cg3), a tokenizer (mwe- Figure 1: System architecture of the North Sámi grammar checker dis.cg3) and a morpho-syntactic disambiguator Sámi, but can directly be used by any language (grc-disambiguator.cg3), a disambiguation mod- in the Giella system, e.g. other Sámi languages ule for spellchecker suggestions (spellchecker.cg3) as well as any other language. Presently there is and a module for more advanced grammar check- an early version of a working Faroese grammar ing (grammarchecker-release.cg3) are included in checker in addition to the North Sámi one, and the spelling and grammar checker. initial work has started for a number of the other The current order of the modules has shown Sámi languages. to be the most optimal one for our use and The system does error detection at four differ- has been established during the work with the ent stages of the pipeline. Non-word typos are grammar checker. It follows the principle of marked by means of the spellchecker. Secondly, growing complexity, and information necessary a Constraint Grammar module marks whitespace to subsequent modules is made available to errors in punctuation contexts based on input from them. Valencies for example are used in the the second whitespace analyser. Compound errors disambiguation of compounds, which is why are identified by means of a Constraint Grammar- the module preceeds the multi-word disambigua- based tokenisation disambiguation file. And a tion module. The first whitespace tagging mod- fourth Constraint Grammar module marks quota- ule preceeds other modules because it is used tion errors. to inserts hints about the text structure, such As a tool intended to be used in production by as . Such hints regular users, it targets all types of errors, from must be available early on. The second whites- technical typesetting errors such as wrong quota- pace analyser is applied after the multiword dis- tion marks and faulty use of spaces, via spelling ambiguation, but could be applied later. Its pur- errors to advanced grammatical error detection pose is to tag potentially wrong use of whites- and correction. In the evaluation all of these are pace, and must be added before the final grammar counted as grammar checker errors, as we want to checking module, but any position between the evaluate the overall performance of the tool. The multiword disambiguation and grammar checking only errors not included in the evaluation are those is going to work. Spellchecking is performed be- that we do not target at all (which are quite a few fore disambiguation so that more sentence context in this first beta release). is available to the disambiguator. The sentence in (1) includes a typo (Norgag It should be noted that we do not use a guesser should be Norgga ‘Norway’ (Gen.)), a space er- for out-of-vocabulary . A major part of new ror (before ‘.’) and a compound error (iskkadan words are formed using productive morphology of bargguin should be iskkadanbargguin) ‘survey’ the language, like compounding and derivation, (Loc. Pl.). In addition, there is a congruence er- both of which are encoded in the morphological ror, i.e. dáid (Gen. Pl.) ‘these’ should be dáin analyser. Also, the lexicon is constantly being up- (Loc. Pl.), i.e. it should agree in case and num- dated, so that proper nouns and other potential out- ber with iskkadan bargguin. The launched gram- of-vocabulary words will quickly be covered. As mar checker can detect the first three errors, but updates are made available frequently, this should not the last one, since syntactic error rules are not be a major issue for users. not included in this initial launch of the grammar The infrastructure is not only valid for North checker. (1) Oktiibuot 13 Norgag doaktára leat sults for correcting both, spelling and compound- altogether 13 Norway.GEN doctor.GEN have ing errors are the following: precision is 90.8% leamaš mielde dáid and recall is 86.8%. been with these.GEN iskkadan bargguin_. As opposed to the evaluation done in Wiechetek testing work.LOC.PL et al. (2019), this evaluation will be automatic and ‘Altogether 13 Norwegian doctors have based on an evaluation corpus. This is meaning- participated in these surveys.’ ful as the error corrections intended in this launch typically include a limited amount of adjacent The whitespace analyser detects an erroneous word forms, which is relatively straightforward. space before ‘.’ The suggested correction is Reference-less approaches as proposed in Napoles "". The tokenisation disam- et al. (2016) are not an option as they require pre- biguation module detects the compound error and existing and independently developed tools that suggests iskkadanbargu+N+Sg+Com = iskkadan- are sensitive to grammatical errors, a luxury not bargguin. The tokeniser is used to disambiguate available to most minority languages. between syntactically related n-grams and mis- A part of the North Sámi SIKOR cor- spelled compounds, where the misspelling is an pus (SIKOR2016) containing error mark-up for erroneous space at the word boundaries. orthographical and grammatical errors is used as This module is clearly checking more than the evaluation corpus. It consists of 181 512 spelling conventions, i.e. grammar, as writing, words. SIKOR is split in one part that contains for example, two consecutive nouns as one or two publicly available texts, and one part that contains words has syntactic implications. Such noun-noun texts that cannot be redistributed.1 The genres rep- combinations do not necessarily need to be com- resented in the evaluation corpus are news, blogs, pounds even if the first element is in nominative and teaching materials.2 case. They can also be syntactically related as in We evaluate the output of the launched grammar the agent construction in ex. (2) where addin ‘giv- checker tool against a reference text that is manu- ing’ is a (nominalized) non-finite verb that modi- ally marked for spelling errors (only non-words), fies the second noun, vuod¯du¯ ‘basis’. compound errors, space errors and punctuation er- This grammar checker finds compound errors rors (only quotation marks for now), i.e. only the by distinguishing between syntactic readings as error types we actually try to correct. The per- the previous one and compound readings as addin- formance of the grammar checker is measured by vejolašvuoda¯ ‘giving possibility (Gen.)’. means of precision and recall. Good precision has (2) luossabivdu lea lunddolaš typically priority over good recall as users tend to salmon.fishing is natural react more critically to flagging correct input as Golf-rávnnji addin errors as opposed to not flagging error input. Golf-stream.GEN give.ACTIO.NOM vejolašvuoda¯ vuod¯du¯ 4 Conclusion possibility.GEN basis ‘salmon fishing is a natural resource for a This paper has presented the first released spelling possibility given by the Golf-stream’ and grammar checker tool for North Sámi that, in addition to spelling errors, checks and corrects 3 Evaluation (planned) punctuation errors, compound errors and white Previous evaluations of our most advanced error space errors. All error detection and correction, in- type correction have given good precision (76.6%) cluding context-aware filtering of spelling sugges- and recall (78.6%) (Wiechetek et al., 2019). The tions, is performed by Constraint Grammar mod- development of the grammar checker evaluation ules at all stages of the process. The automatic tool presented in this paper has been informed by evaluation of the launched grammar checker tool the previous evaluation, and many errors in the in- is expected to give good results based on recent frastructure and in the Constraint Grammar mod- 1https://giellalt.uit.no/ling/corpus_ ules have been corrected. Therefore higher preci- repositories.html (Retrieved 2019-06-19) sion and recall are expected in this evaluation. 2https://giellalt. uit.no/proof/nordplus/ As a reference, when Bick (2015) evaluates his StavekontrolltestingOgNorplusprosjektet. full-fledged grammar checker DanProof, the re- html (Retrieved 2019-06-19) previous evaluations. in Globalisation and Education, Cultural Studies Additionally this work also describes the full in- and Transdiciplinarity in Education, pages 55–68. frastructure for a full-scale grammar checker and Springer, Singapore. facilitates the implementation of any kind of gram- Sjur N. Moshagen, Tommi A. Pirinen, and Trond matical error correction as soon as these are con- Trosterud. 2013. Building an open-source de- sidered to be working well enough to be released. velopment infrastructure for language technology projects. In NODALIDA. The infrastructure is available for any language within the Giella framework. Courtney Napoles, Keisuke Sak- As the infrastructure is mainly ready, the plan aguchi, and Joel R. Tetreault. 2016. http://aclweb.org/anthology/D/D16/D16-1228.pdf is to include a wide range of real-word errors and There’s no comparison: Reference-less evalua- syntactic error detection in the next release. tion metrics in grammatical error correction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP References 2016, Austin, Texas, USA, November 1-4, 2016, pages 2109–2115. Eckhard Bick. 2015. DanProof: Pedagogical spell and grammar checking for Danish. In Proceedings of the Hanna Outakoski. 2013. Davvisámegielat cálamáhtuˇ 10th International Conference Recent Advances in konteaksta [The context of North Sámi literacy]. Natural Language Processing (RANLP 2015), pages Sámi diedalaš¯ áigeˇcála, 1/2015:29–59. 55–62, Hissar, Bulgaria. INCOMA Ltd. Tommi A. Pirinen and Krister Lindén. 2014. State- Eckhard Bick and Tino Didriksen. 2015. CG-3 – be- of-the-art in weighted finite-state spell-checking. In yond classical Constraint Grammar. In Proceed- Proceedings of the 15th International Conference on ings of the 20th Nordic Conference of Computa- Computational Linguistics and Intelligent Text Pro- tional Linguistics (NoDaLiDa 2015), pages 31–39. cessing - Volume 8404, CICLing 2014, pages 519– Linköping University Electronic Press, Linköpings 532, Berlin, Heidelberg. Springer-Verlag. universitet. SIKOR2016. 2016-12-08. SIKOR UiT The Arc- Børre Gaup, Sjur Moshagen, Thomas Omma, Maaren tic University of Norway and the Norwegian Palismaa, Tomi Pieski, and Trond Trosterud. 2006. Saami Parliament’s Saami text collection. URL: From Xerox to Aspell: A first prototype of a north http://gtweb.uit.no/korp (Accessed 2016-12-08). sámi speller based on twol technology. In Finite- State Methods and Natural Language Processing, Gary F. Simons and Charles D. Fennig, editors. 2018. pages 306–307, Berlin, Heidelberg. Springer Berlin http://www.ethnologue.com (Accessed 2018-10-09) Heidelberg. Ethnologue: Languages of the World, twenty-first edition. SIL International, Dallas, Texas. Sam Hardwick, Miikka Silfver- When grammar can’t be berg, and Krister Lindén. 2015. Linda Wiechetek. 2017. trusted – Valency and semantic categories in North http://aclweb.org/anthology/W/W15/W15-1842.pdf Sámi syntactic analysis and error detection Extracting semantic frames using hfst-pmatch. . PhD In Proceedings of the 20th Nordic Conference of thesis, UiT The Arctic University of Norway. Computational Linguistics, (NoDaLiDa 2015), Linda Wiechetek, Kevin Brubeck Unham- pages 305–308. mer, and Sjur Nørstebø Moshagen. 2019. https://www.aclweb.org/anthology/W19-6007 Fred Karlsson. 1990. Constraint Grammar as a Frame- Seeing more than whitespace – Tokenisation and work for Running Text. In Proceedings disambiguation in a North Sámi grammar checker. of the 13th Conference on Computational Linguis- In Proceedings of the third Workshop on the Use of tics (COLING 1990) , volume 3, pages 168–173, Computational Methods in the Study of Endangered Helsinki, Finland. Association for Computational Languages, pages 46–55. Linguistics.

Fred Karlsson, Atro Voutilainen, Juha Heikkilä, and Arto Anttila. 1995. Constraint Grammar: A Language-Independent System for Parsing Unre- stricted Text. Mouton de Gruyter, Berlin.

Eva Lindgren, Kirk P H Sullivan, Hanna Outakoski, and Asbjørg Westum. 2016. Researching literacy development in the globalised North: studying tri- lingual children’s english writing in Finnish, Nor- wegian and Swedish Sápmi. In David R. Cole and Christine Woodrow, editors, Super Dimensions