Launching a Constraint Grammar Tool for North Sámi
Many shades of grammar checking – Launching a Constraint Grammar tool for North Sámi

Linda Wiechetek, Sjur Moshagen, Børre Gaup, Thomas Omma
Divvun
UiT Norgga árktalaš universitehta

1 Introduction

This paper discusses the characteristics and evaluation of the very first North Sámi spell- and grammar checker. At its launch it supports LibreOffice, MS Word and GoogleDocs. We describe its component parts, the technology used, such as Constraint Grammar (Karlsson, 1990; Karlsson et al., 1995; Bick and Didriksen, 2015) and Hfst-pmatch (Hardwick et al., 2015), and its evaluation with a new evaluation tool specifically designed for that purpose.

Only the modules of the full-scale grammar checker described in Wiechetek (2017) that have been tested and improved sufficiently for the public to use are released in this launch. More advanced syntactic errors will be included at a later stage. The grammar checker modules are an enhancement to an existing North Sámi spellchecker (Gaup et al., 2006), following a philosophy of release early, release often, and using continuous integration (CI) and continuous delivery (CD) to deliver updates with new error correction types as new parts of the grammar checker are sufficiently tested. Releasing at an early stage of development gives the user community early access to improved and much needed tools and allows the developers to improve the tools based on the community's feedback.

North Sámi is a Uralic language spoken in Norway, Sweden and Finland by approximately 25 700 speakers (Simons and Fennig, 2018). These countries have other majority languages, making all Sámi speakers bilingual. Bilingual users frequently face bigger challenges regarding literacy in the lesser used language than in the majority language due to reduced access to language arenas (Outakoski, 2013; Lindgren et al., 2016). Therefore, tools that support literacy, like spelling and grammar checkers, are more important in a minority language community than in a majority language community.

The released grammar checker is a light grammar checker in the sense that it detects and corrects errors that do not require rearranging the whole sentence, but typically just one or several adjacent word forms, based on a grammatical analysis of the sentence. Additionally, a number of formatting errors are covered. The main new features are implemented by means of several Constraint Grammar-based modules. These include correction of formatting and punctuation errors, filtering of speller suggestions, much improved tokenisation and sentence boundary detection, as well as advanced compound error analysis and correction.

This paper shows how a finite-state based spellchecker can be upgraded to a much more powerful spelling and grammar checking tool by adding several Constraint Grammar modules.

2 Framework

An open-source spelling checker for North Sámi has been freely distributed since 2007, the beginnings of which have been described by Gaup et al. (2006). The tool discussed in this paper includes the open-source spelling checker referenced above, but further developed and using the hfst-based spelling mechanism described in Pirinen and Lindén (2014). The spelling checker is enhanced with five Constraint Grammar modules, cf. Figure 1. It should be noted that the spelling checker is exactly the same as the regular North Sámi spelling checker used by the language community, but with the added functionality that all suggestions are returned to the pipeline with their full morphological analysis. The analyses are then used by subsequent Constraint Grammar rules during disambiguation and suggestion filtering.

All components are compiled and built using the Giella infrastructure (Moshagen et al., 2013).

Five Constraint Grammar modules are included in the spelling and grammar checker: a valency grammar (valency.cg3), a tokenizer (mwe-dis.cg3), a morpho-syntactic disambiguator (grc-disambiguator.cg3), a disambiguation module for spellchecker suggestions (spellchecker.cg3) and a module for more advanced grammar checking (grammarchecker-release.cg3).

The current order of the modules has proven to be the best one for our use and has been established during the work with the grammar checker. It follows the principle of growing complexity, and information necessary to subsequent modules is made available to them. Valencies, for example, are used in the disambiguation of compounds, which is why the valency module precedes the multi-word disambiguation module. The first whitespace tagging module precedes other modules because it is used to insert hints about the text structure, such as <firstWordOfParagraph>. Such hints must be available early on. The second whitespace analyser is applied after the multiword disambiguation, but could be applied later. Its purpose is to tag potentially wrong use of whitespace, and it must be applied before the final grammar checking module, but any position between the multiword disambiguation and grammar checking will work. Spellchecking is performed before disambiguation so that more sentence context is available to the disambiguator.

It should be noted that we do not use a guesser for out-of-vocabulary words. A major part of new words are formed using productive morphology of the language, like compounding and derivation, both of which are encoded in the morphological analyser. Also, the lexicon is constantly being updated, so that proper nouns and other potential out-of-vocabulary words will quickly be covered. As updates are made available frequently, this should not be a major issue for users.

Figure 1: System architecture of the North Sámi grammar checker

The infrastructure is not only valid for North Sámi, but can directly be used by any language in the Giella system, e.g. other Sámi languages as well as any other language. Presently there is an early version of a working Faroese grammar checker in addition to the North Sámi one, and initial work has started for a number of the other Sámi languages.

The system does error detection at four different stages of the pipeline. First, non-word typos are marked by means of the spellchecker. Second, a Constraint Grammar module marks whitespace errors in punctuation contexts based on input from the second whitespace analyser. Third, compound errors are identified by means of a Constraint Grammar-based tokenisation disambiguation file. Fourth, a Constraint Grammar module marks quotation errors.

As a tool intended to be used in production by regular users, it targets all types of errors, from technical typesetting errors such as wrong quotation marks and faulty use of spaces, via spelling errors to advanced grammatical error detection and correction. In the evaluation all of these are counted as grammar checker errors, as we want to evaluate the overall performance of the tool. The only errors not included in the evaluation are those that we do not target at all (which are quite a few in this first beta release).

The sentence in (1) includes a typo (Norgag should be Norgga 'Norway' (Gen.)), a space error (before '.') and a compound error (iskkadan bargguin should be iskkadanbargguin 'survey' (Loc. Pl.)). In addition, there is a congruence error: dáid (Gen. Pl.) 'these' should be dáin (Loc. Pl.), i.e. it should agree in case and number with iskkadan bargguin. The launched grammar checker can detect the first three errors, but not the last one, since syntactic error rules are not included in this initial launch of the grammar checker.
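The ordering constraints on the modules described in this section can be written down and checked mechanically. The following Python sketch is only an illustration: the .cg3 file names come from the text, whereas the stage names `whitespace-tagger-1`, `whitespace-analyser-2` and `speller`, the complete linear order in `PIPELINE`, and the helper `ordering_violations` are assumptions made for the example and are not taken from Figure 1 or from the Giella infrastructure.

```python
# Assumed linear order of stages; only the *.cg3 names appear in the text,
# the other stage names are hypothetical labels for this sketch.
PIPELINE = [
    "whitespace-tagger-1",         # inserts hints such as <firstWordOfParagraph>
    "valency.cg3",                 # valency grammar
    "mwe-dis.cg3",                 # tokenisation / multi-word disambiguation
    "whitespace-analyser-2",       # tags potentially wrong use of whitespace
    "speller",                     # returns suggestions with full analyses
    "grc-disambiguator.cg3",       # morpho-syntactic disambiguation
    "spellchecker.cg3",            # filters speller suggestions
    "grammarchecker-release.cg3",  # final grammar checking module
]

# The stage that the text says must come before the other modules.
FIRST_STAGE = "whitespace-tagger-1"

# (earlier, later) pairs as we read them from the prose of this section.
MUST_PRECEDE = [
    ("valency.cg3", "mwe-dis.cg3"),
    ("mwe-dis.cg3", "whitespace-analyser-2"),
    ("whitespace-analyser-2", "grammarchecker-release.cg3"),
    ("speller", "grc-disambiguator.cg3"),
    ("speller", "spellchecker.cg3"),
]


def ordering_violations(order):
    """Return the (earlier, later) constraints that `order` breaks."""
    position = {name: index for index, name in enumerate(order)}
    broken = [(a, b) for a, b in MUST_PRECEDE if position[a] > position[b]]
    if order[0] != FIRST_STAGE:
        broken.append((FIRST_STAGE, order[0]))
    return broken


if __name__ == "__main__":
    print(ordering_violations(PIPELINE))
```

An empty list means that the assumed order is consistent with the constraints stated in the prose. Moving `whitespace-analyser-2` to any slot between `mwe-dis.cg3` and `grammarchecker-release.cg3` also yields an empty list, which mirrors the remark that any such position will work.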
(1) Oktiibuot   13  Norgag      doaktára    leat  leamaš  mielde  dáid       iskkadan  bargguin_.
    altogether  13  Norway.GEN  doctor.GEN  have  been    with    these.GEN  testing   work.LOC.PL
    'Altogether 13 Norwegian doctors have participated in these surveys.'

The whitespace analyser detects an erroneous space before '.' The suggested correction is "<iskkadan bargguin.>". The tokenisation disambiguation module detects the compound error and suggests iskkadanbargu+N+Sg+Com = iskkadanbargguin. The tokeniser is used to disambiguate between syntactically related n-grams and misspelled compounds, where the misspelling is an erroneous space at the word boundaries.

This module clearly checks more than spelling conventions, i.e. it checks grammar, since writing two consecutive nouns as one word or as two words has syntactic implications. Such noun-noun combinations do not necessarily need to be compounds even if the first element is in nominative case. They can also be syntactically related, as in the agent construction in ex. (2), where addin 'giving' is a (nominalized) non-finite verb that modifies the second noun, vuođđu 'basis'.

[...] results for correcting both spelling and compounding errors are the following: precision is 90.8% and recall is 86.8%.

As opposed to the evaluation done in Wiechetek et al. (2019), this evaluation will be automatic and based on an evaluation corpus. This is meaningful as the error corrections intended in this launch typically include a limited number of adjacent word forms, which is relatively straightforward. Reference-less approaches as proposed in Napoles et al. (2016) are not an option as they require pre-existing and independently developed tools that are sensitive to grammatical errors, a luxury not available to most minority languages.

A part of the North Sámi SIKOR corpus (SIKOR2016) containing error mark-up for orthographical and grammatical errors is used as the evaluation corpus. It consists of 181 512 words. SIKOR is split into one part that contains publicly available texts and one part that contains texts that cannot be redistributed.¹ The genres represented in the evaluation corpus are news, blogs, and teaching materials.²

We evaluate the output of the launched grammar checker tool against a reference text that is manually marked for spelling errors (only non-words),