Improving POS Tagging in Old Spanish Using TEITOK

Maarten Janssen (CELGA-ILTEC)
Josep Ausensi (Universitat Pompeu Fabra, Department of Translation and Language Sciences)
Josep M. Fontana (Universitat Pompeu Fabra, Department of Translation and Language Sciences)

Abstract

In this paper, we describe how the TEITOK corpus tools helped to create a diachronic corpus for Old Spanish that contains both paleographic and linguistic information, which is easy to use for non-specialists, and in which it is easy to perform manual improvements to automatically assigned POS tags and lemmas.

1 Introduction

Although the availability of computational resources for the study of language change has grown considerably in the last decade, scholars still face considerable challenges when trying to conduct research in certain areas such as syntactic change. This is true even in the case of languages for which there already exist large corpora that are freely accessible on the internet.

One such case is Spanish. Despite the size and quality of the textual resources available through online corpora such as CORDE[1] or the Corpus del Español[2], researchers interested in the evolution of the Spanish language cannot conduct the type of studies that have been conducted, for instance, on the evolution of the English language, because the diachronic corpora available for Spanish are scarcely annotated with the relevant linguistic information and the range of query options is not sufficiently broad.

This presentation reports work in progress within a project that seeks to redress this situation for Spanish. Our goal is to develop resources to study the evolution of Spanish in at least the same depth as is now possible for English. These resources have to satisfy the following requirements: (i) the texts should also contain paleographic information, (ii) they should be enriched with linguistic information (initially POS tagging and eventually also syntactic annotation), (iii) the corpus should be easy to use by non-experts in NLP, and (iv) after the initial development stage, the corpus should also be easily maintainable and improvable by non-experts in NLP. The last requirement was especially relevant in our context because the development of corpora can be a very long-term process, and the financial resources to hire collaborators with the necessary technical skills are not constant and are heavily dependent on grants and projects, which can be difficult to obtain for corpora that have already been financed through previous grants.

Specifically, we will discuss how the TEITOK interface helped in meeting these requirements for a diachronic corpus of Spanish (OLDES). A large portion of our corpus came from the electronic texts compiled, transcribed and edited by the Hispanic Seminary of Medieval Studies (HSMS)[3]. This is a large collection of critical editions of original medieval manuscripts which comprise a wide variety of genres and extend from the 12th to the 16th centuries. The HSMS texts were turned into a linguistic corpus enriched with POS tags and lemmas in the context of the dissertation work conducted by Sánchez-Marco (Sánchez-Marco et al., 2012). The initial version of this corpus was created in a traditional verticalized set-up using the Corpus Workbench (Evert and Hardy, 2015), henceforth CWB, and was tagged using a custom-built version of Freeling (Padró et al., 2010) for Old Spanish. See Sánchez et al. (2010; 2011; 2012) for a more detailed description of the corpus as well as of the problems encountered in the initial stages of development.

[1] http://corpus.rae.es/cordenet.html
[2] http://www.corpusdelespanol.org/hist-gen/
[3] See Corfis et al. (1997), Herrera and de Fauve (1997), Kasten et al.
(1997), Nitti and Kasten (1997), O'Neill (1999).

Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

2 TEITOK

The version of the corpus described here was created in the TEITOK platform (Janssen, 2015). TEITOK is an online corpus management platform in which a corpus consists of a collection of XML files in the TEI/XML format. Tokens are annotated inline, where token-based information such as POS and lemmas is modeled as attributes over those tokens. For searching purposes, an indexed version of the corpus in CWB is created automatically from the collection of XML files. With its CWB search option, TEITOK is comparable to systems like CQPWeb (Hardie, 2012), Korp (Ahlberg et al., 2013), or Bwananet (Vivaldi, 2009), with the difference that in TEITOK, the search engine additionally facilitates access to the underlying XML documents, along the lines of TXM (Heiden, 2010).

TEITOK has several features that enable it to meet the four requirements mentioned in the introduction.

(i) The files of a TEITOK corpus are encoded in TEI/XML, a format that has been used extensively for encoding paleographic information. In the TEITOK interface, this information is not just present in the source code, but is graphically rendered, meaning that a TEITOK document looks like a pleasant-to-read paleographic manuscript.

(ii) TEITOK has inline nodes for tokens, as in e.g. the XML version of the BNC (BNC, 2007), which can be adorned with any type of linguistic information that is traditionally encoded in a verticalized text format, such as POS tags, lemmas, dependency relations, etc. Furthermore, it makes a distinction between orthographic words and grammatical words, where a single orthographic word can contain multiple grammatical words (and vice-versa). This allows us to keep contractions such as del ('of the'), while also having the option of specifying the two grammatical words that form it: de ('of') and el ('the').

(iii) The online interface of TEITOK is designed for a broad and diverse audience, adding several features to make the corpus more easily accessible than traditional corpus interfaces: it provides an easy interface to search the corpus, in which it is possible to use the full CWB search language, but it also provides a simple form that will automatically generate a CWB search query behind the scenes. It also provides glosses for POS tags, eliminating the need to read through the tagset definitions.

(iv) Most relevantly for this paper, the same interface that is used for searching and viewing the corpus is also used to edit the corpus. This makes it easy for the administrators and authorized users to correct errors whenever they encounter them. There are also several tools available to make structural changes faster, which will be described in the next section.

Since philological information was removed in the CWB version of the corpus, we created the corpus again from the original files, this time keeping all the information provided in them. Since the two versions of the corpus were created independently, there are inevitably small differences between them: what counts as one token in one version sometimes counts as more than one in the other. This makes it close to impossible to import the tags from one version of the corpus to the other. We therefore used the Freeling parameters for Old Spanish that were developed as part of the original corpus, and applied them to the TEITOK version, resulting in a corpus that combines the linguistic and extralinguistic information in a single set of documents.

TEITOK allows for multiple orthographic realisations of the same word, which makes it possible to keep the paleographic form and add a form in modernized orthography, making the corpus much more accessible to those not familiar with the old spelling forms. Since the lemmas provided by Freeling are in modern spelling, the modern spelling of the words was provided automatically (wherever possible), by looking up which current word corresponds to the POS tag and the modernized lemma. For instance, the word rresçiban was tagged as a present subjunctive (VMSP3P0) of recibir ('receive'). The modern Spanish lexicon for Freeling lists the form reciban for this, which was hence added as the modernized form.

Despite the efforts put into the initial tagging of the OLDES corpus, the level of accuracy was still not entirely satisfactory. The main objective in this stage of development was to improve the overall quality of the tagging. To this end, we adopted the following strategy: we set apart a selection of texts summing up to 1 million tokens, and tagged it with the Freeling tagger for Old Spanish. We then used several techniques provided by TEITOK to manually correct errors in this gold standard part of the corpus. After correcting the major errors, we trained NeoTag (Janssen, 2012) on this gold standard corpus, and applied the trained tagger to the rest of the corpus.

3 Improving POS tags

Independently of how good a POS tagger is, in- [...]

[...] have to be corrected individually in this way, it becomes much easier to spot errors: any word that is not modernized was not recognized by the tagger, and will have an incorrect lemma, and, most likely, an incorrect POS tag as well. In many cases, if a word was recognized, but incorrectly tagged, it will have an incorrect modernized form. This makes it possible to just look for incorrect words in modern Spanish, which are much easier to spot than errors in POS or lemma. For instance, if the previous example rresçiban had been incorrectly modernized as recibían by the system, it would have been easy to recognize it by simply looking at the normalized version of the text.
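
The modernization lookup and the error-spotting heuristic described above can be illustrated with a minimal sketch. It is not the TEITOK or Freeling implementation: the dictionary MODERN_LEXICON, the functions modernize and flag_for_review, and the token layout are hypothetical names chosen for illustration. The only data taken from the text is the rresçiban example; the second token is invented to show an unrecognized word.

```python
from typing import Optional

# Toy stand-in for a modern Spanish form lexicon, keyed by (lemma, POS tag).
# The single entry is the example from the text; a real lexicon is far larger.
MODERN_LEXICON = {
    ("recibir", "VMSP3P0"): "reciban",
}


def modernize(lemma: Optional[str], pos: Optional[str]) -> Optional[str]:
    """Return the modern form for a (lemma, POS) pair, or None if unknown."""
    if lemma is None or pos is None:
        return None
    return MODERN_LEXICON.get((lemma, pos))


def flag_for_review(tokens):
    """Return the original forms of tokens that received no modernized form.

    A missing modern form suggests the tagger did not recognize the word,
    so its lemma (and probably its POS tag) deserves a manual check.
    """
    return [
        t["form"]
        for t in tokens
        if modernize(t.get("lemma"), t.get("pos")) is None
    ]


tokens = [
    {"form": "rresçiban", "lemma": "recibir", "pos": "VMSP3P0"},
    # Hypothetical token that the tagger failed to recognize.
    {"form": "xxyz", "lemma": None, "pos": None},
]

for t in tokens:
    print(t["form"], "->", modernize(t["lemma"], t["pos"]))
print("needs review:", flag_for_review(tokens))
```

Keying the lexicon on the (lemma, POS) pair mirrors the description in the text: a wrong tag or lemma yields either no modern form or a visibly wrong one, which is easier for a reader of modern Spanish to notice than the tag itself.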
