<<

Improving POS Tagging in Using TEITOK

Maarten Janssen Josep Ausensi Josep M. Fontana CELGA-ILTEC Universitat Pompeu Fabra Universitat Pompeu Fabra [email protected] Department of Translation Department of Translation and Language Sciences and Language Sciences [email protected] [email protected]

Abstract paleographic information, (ii) they should be en- riched with linguistic information (initially POS In this paper, we describe how the tagging and eventually also syntactic annotation), TEITOK corpus tools helped to create a (iii) the corpus should be easy to use by non- diachronic corpus for Old Spanish that experts in NLP, and (iv) after the initial develop- contains both paleographic and linguistic ment stage, the corpus should also be easily main- information, which is easy to use for non- tainable and improvable by non-experts in NLP. specialists, and in which it is easy to per- The last requirement was especially relevant in our form manual improvements to automati- context because the development of corpora can cally assigned POS tags and lemmas. be a very long term process and the financial re- sources to hire collaborators with the necessary 1 Introduction technical skills are not constant and are heavily de- pendent on grants and projects which can be dif- Although the availability of computational re- ficult to obtain for corpora that have already been sources for the study of language change has expe- financed through previous grants. rienced a considerable growth in the last decade, scholars still face considerable challenges when Specifically, we will discuss how the TEITOK trying to conduct research in certain areas such as interface helped in reaching these requirements for syntactic change. This is true even in the case of a diachronic corpus of Spanish (OLDES). A large languages for which there already exist large cor- portion of our corpus came from the electronic pora that are freely accessible on the internet. texts compiled, transcribed and edited by the His- 3 One of such cases is Spanish. Despite the panic Seminary of Medieval Studies (HSMS) . size and quality of the textual resources available This is a large collection of critical editions of through online corpora such as CORDE1 or the original medieval manuscripts which comprise a Corpus del Espanol˜ 2, researchers interested in the wide variety of genres and extend from the 12th to evolution of the cannot conduct the 16th centuries. The HSMS texts were turned the type of studies that have been conducted, for into a linguistic corpus enriched with POS tags instance, on the evolution of the English language and lemmas in the context of the dissertation work due to the fact that the diachronic corpora avail- conducted by Sanchez-Marco´ (Sanchez-Marco´ et able for Spanish are scarcely annotated with the al., 2012). The initial version of this corpus was relevant linguistic information and and the range created in a traditional verticalized set-up using of query options is not sufficiently broad. the Corpus Workbench (Evert and Hardy, 2015), This presentation reports work in progress henceforth CWB, and was tagged using a custom within a project that seeks to redress this situ- built version of Freeling (Padro´ et al., 2010) for ation for Spanish. Our goal is to develop re- Old Spanish. See Sanchez´ et al., (2010; 2011; sources to study the evolution of Spanish in at 2012) for a more detailed description of the cor- least the same depth as it is now possible for En- pus as well as of the problems encountered in the glish. These resources have to satisfy the follow- initial stages of development. ing requirements: (i) the texts should also contain

1http://corpus.rae.es/cordenet.html 3See Corfis et al. (1997), Herrera and de Fauve (1997), 2http://www.corpusdelespanol.org/hist-gen/ Kasten et al. (1997), Nitti and Kasten (1997), O’Neill (1999)

Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language 2 2 TEITOK CWB search language, but it also provides a simple form that will automatically gener- The version of the corpus described here was cre- ate a CWB search query behind the scenes. ated in the TEITOK platform (Janssen, 2015). It also provides glosses for POS tags, elim- TEITOK is an online corpus management plat- inating the need to read through the tagset form in which a corpus consists of a collection of definitions. XML files in the TEI/XML format. Tokens are annotated inline, where token-based information (iv) Most relevantly for this paper, the same in- such as POS and lemmas is modeled as attributes terface that is used for searching and view- over those tokens. For searching purposes, an in- ing the corpus is also used to edit the cor- dexed version of the corpus in CWB is created pus. This makes it easy for the administra- automatically from the collection of XML files. tors and authorized users to correct errors With its CWB search option, TEITOK is com- whenever they encounter them. There are parable to systems like CQPWeb (Hardie, 2012), also several tools available to make structural Korp (Ahlberg et al., 2013), or Bwananet (Vivaldi, changes faster, which will be described in the 2009), with the difference that in TEITOK, the next section. search engine additionally facilitates access to the underlying XML documents, along the lines of Since philological information was removed in TXM (Heiden, 2010). the CWB version of the corpus, we created the cor- TEITOK has several attributes that make it able pus again from the original files, this time keep- to respond to the four requirements mentioned in ing all the information provided in it. Since the the introduction. two versions of the corpus were created indepen- dently, there are inevitably small differences be- (i) The files of a TEITOK corpus are encoded tween them: what counts as one token in one ver- in TEI/XML, a format that has been used ex- sion sometimes counts as more than one in the tensively for encoding paleographic informa- other. This makes it close to impossible to im- tion. In the TEITOK interface, this informa- port the tags from one version of the corpus to the tion is not just present in the source code, other. As such, we used the Freeling parameters but is graphically rendered, meaning that a for Old Spanish that were developed as part of the TEITOK document looks like a pleasant-to- original corpus, and applied them to the TEITOK read paleographic manuscript. version, resulting in a corpus that combines the (ii) TEITOK has inline nodes for tokens, as in linguistic and extralinguistic information in a sin- e.g. the XML version of the BNC (BNC, gle set of documents. 2007), which can be adorned with any type TEITOK allows for multiple orthographic real- of linguistic information that is traditionally isations of the same word, which makes it possi- encoded in a verticalized text format, such as ble to keep the paleographic form, and add a form POS tags, lemmas, dependencies relations, in modernized orthography, making the corpus etc. Furthermore, it makes a distinction be- much more accessible to those not familiar with tween orthographic words and grammatical the old spelling forms. Since the lemmas provided words, where a single orthographic word by Freeling are in modern spelling, the modern can contain multiple grammatical words (and spelling of the words was provided automatically vice-versa). This allows us to keep contrac- (wherever possible), by looking up which current tions such as del (‘of the’), while also having word corresponds to the POS tag and the mod- the option of specifying the two grammatical ernized lemma. For instance, the word rresc¸iban words that form it: de (‘of’) and el (‘the’). was tagged as a present subjunctive (VMSP3P0) of recibir (‘receive’). The modern Spanish lexicon (iii) The online interface of TEITOK is designed for Freeling lists the form reciban for this, which for a broad and diverse audience, adding sev- was hence added as the modernized form. eral features to make the corpus more easily Despite the efforts put into the initial tagging accessible than traditional corpus interfaces: of the OLDES corpus, the level of accuracy was it provides an easy interface to search the still not entirely satisfactory. The main objective corpus, in which it is possible to use the full in this stage of development was to improve the

Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language 3 overall quality of the tagging. For this, we de- have to be corrected individually in this way, it be- cided to follow the following strategy: we set apart comes much easier to spot errors: any word that is a selection of texts summing up to 1 million to- not modernized was not recognized by the tagger, kens, and tagged it with the Freeling tagger for and will have an incorrect lemma, and, most likely, Old Spanish. We then used several techniques pro- an incorrect POS tag as well. In many cases, if vided by TEITOK to manually correct errors in a word was recognized, but incorrectly tagged, this gold standard part of the corpus. After correct- it will have an incorrect modernized form. This ing the major errors, we trained NeoTag (Janssen, makes it possible to just look for incorrect words 2012) on this gold standard corpus, and applied in modern Spanish, which are much easier to spot the trained tagger to the rest of the corpus. than errors in POS or lemma. For instance, if the previous example rresc¸iban had been incorrectly 3 Improving POS tags modernized as recib´ıan by the system, it would have been easy to recognize it by simply looking Independently of how good a POS tagger is, in- the normalized version of the text. Thus, in these correct tags will always be created. In the case of cases, there is no need to check the actual POS tag a closed corpus like the HSMS corpus, it quickly (something much harder to process), because the becomes more efficient to correct errors created tag can be inferred by the actual modern form. by the tagger than to attempt to improve the qual- ity of the tagger. Traditionally, tagging correction And finally, multiple tokens throughout the cor- has been done by hand, either in a text editor or pus can be corrected in batch mode using CWB an XML editor. Tools to facilitate tag correction queries. CWB can be used to search for very spe- are relatively new, such as ANNIS (Krause and cific words that are frequently tagged incorrectly, Zeldes, 2016) or eDictor (Feliciano de Faria et al., and all words in the resulting KWIC list are click- 2010). Unlike most of these tools, TEITOK allows able to correct any errors they contain. It is also editing directly from the XML interface. possible to correct all matching results in one go, either by changing the lemma for all of them to a The TEITOK version of the HSMS texts pro- specific value, or by going through all the matches vides a comfortable and quick way to manually in a verticalized format. Thus, TEITOK can use correct tagging errors. The base mode of editing the output of a CWB search to edit the underlying in TEITOK involves clicking on a word in the text. XML files. This provides a reliable and fast way This opens up an HTML form, where any of the to quickly correct the errors previously spotted on attributes of the word can be modified. Although the verticalized view; sometimes an error spotted this is very helpful when encountering a single er- while correcting a text on the verticalized view is ror while using the corpus, it is not very efficient indicative of a more general problem that applies for large corrections. Therefore, there are three to the whole corpus. This renders the whole cor- main options to speed up corrections. rection process faster since, by spotting a general- The first option is the closest to the traditional ized error on the verticalized view, the editor can way of correcting tagging errors: it is possible to simply correct all the incorrectly tagged tokens of get a verticalized version of a text, in which mul- the whole corpus via the interface. tiple tokens can be corrected at once, while still seeing the surrounding tokens. In the verticalized An example is given in figure 1, where a rel- version the editor can correct a token in all its dif- atively simple query is used to identify all words ferent layers of representation, i.e. transcription, starting with rr-, which is no longer used in current written form, editor form, expanded form, critical . We then asked the system to form and normalized form. It is possible to see the edit the normalized form for all of those, where different forms for the same token, and this ren- the normalized form was furthermore pre-treated ders the whole manual correcting process easier automatically by replacing all double rr for a sin- since it is possible to compare the original with gle r. This allows editing all such words in one go, the more modernized form of the same token. The independently of which XML file they appear in. verticalized version also allows the editor to cor- This general procedure can be enhanced via rect POS tags and lemmas at the same time. simple strategies to identify specific incorrectly A second option is to correct errors from the text tagged tokens. For instance, a recurrent incor- in modernized orthography. Although words still rectly tagged token is the word vienes, as it is of-

Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language 4 Figure 1: Multi-Editing in TEITOK ten tagged as a verb (‘you come’) even though it searching for the lemma ser (i.e. be, in any form) is actually related to the modern spelled bi- followed by partida marked as a noun will re- enes (‘goods’). By searching for all occurrences turn occurrences of partida that should have been of the word vienes it is possible to correct all in- tagged as a . correctly marked ones in one go. It is even possi- Since these different methods to correct errors ble to search specifically only for occurrences of in tags, lemmas, and normalized forms are easy to vienes that follow a determiner, by using a com- apply and do not require specific knowledge of the plex CQP query over multiple tokens in which the computational system, or imply that the corpus has word vienes is marked as the target word (using to be rebuilt by a computational linguist, TEITOK the CQP operator @). allows all administrators of the corpus to correct Other general problems that can be corrected errors over time - either by simply correcting indi- automatically include examples such as the fol- vidual errors, or by correcting multiple instances lowing: the form a is incorrectly tagged as a of an error throughout the corpus in batch mode as preposition (‘to’) when it relates to the modern described in the previous paragraph. This means form spelled ha (‘he has’); the form el´ (‘he’) is that the process of ironing out remaining errors is tagged as a pronoun when it relates to the deter- put back in the hands of the historical linguists, in- miner el, or partida is tagged as a noun (‘depar- stead of requiring the technical support of external ture’) when it relates to the participle (‘departed’). collaborators. All these generalized problems can be easily cor- rected taking advantage of the CWB interface and 4 Conclusion looking for specific combinations of the forms and specific lemmas or POS tags. For instance, search- In this article, we hope to have shown how the ing for occurrences of a marked as a preposition, TEITOK framework makes working with anno- that are followed by a participle, gives only oc- tated historical corpora much easier: not only does currences that should have been normalized as ha it allow one to keep all paleographic information from the verb haber, hence making it possible with the corpus, but it also makes it possible for to change all of them in batch mode. Searching linguists to correct annotation errors in an easy for el´ followed by a noun returns instances of el´ way, without the need to have detailed knowledge that should have been tagged as a determiner, or of the computational processes behind it. This re-

Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language 5 sult is a historical corpus that is useful not only for Maarten Janssen. 2015. Multi-level manuscript tran- corpus linguists or syntacticians, but also, for in- scription: TEITOK. In Congresso de Humanidades stance, for historical linguists or philologists, and Digitais em Portugal, Lisboa, October 8-9, 2015. which can be improved over time, given that it is Lloyd Kasten, John Nitti, and Wilhelmina Jonxis- possible to correct errors whenever they are en- Henkemans. 1997. The Electronic Texts and Con- countered. This is especially relevant in the con- cordances of the Prose Works of Alfonso , El Sabio. Madison. text of historical corpora, since there are so many different sources of possible errors. Thomas Krause and Amir Zeldes. 2016. Annis3: A new architecture for generic corpus query and vi- sualization. Digital Scholarship in the Humanities, References 31(1):118. Malin Ahlberg, Lars Borin, Markus Forsberg, Martin John Nitti and Lloyd Kasten. 1997. The Electronic Hammarstedt, Leif-Joran¨ Olsson, Olof Olsson, Jo- Texts and Concordances of Medieval Navarro- han Roxendal, and Jonatan Uppstrom.¨ 2013. Korp Aragonese Manuscripts. Madison. and karp – a bestiary of language resources: the re- John O’Neill. 1999. Electronic Texts and Concor- search infrastructure of sprakbanken.˚ dances of the Madison Corpus of Early Spanish Manuscripts and Printings. Madison. BNC. 2007. British national corpus, version 3 BNC XML edition. Llu´ıs Padro,´ Miquel Collado, Samuel Reese, Marina Lloberes, and Irene Castellon.´ 2010. Freeling Ivy A. Corfis, John O’Neill, and Jr. Theodore S. Beard- 2.1: Five years of open-source language processing sley. 1997. Early Celestina Electronic Texts and tools. In Proceedings of 7th Language Resources Concordances. Madison. and Evaluation Conference (LREC’10), La Valletta, Malta, May. Stefan Evert and Andrew Hardy. 2015. Twenty-first century corpus workbench: Updating a query archi- Cristina Sanchez-Marco,´ Gemma Boleda, and Llu´ıs tecture for the new millennium. In 10th Interna- Padro.´ 2010. Annotation and representation of a di- tional Conference on Open Repositories (OR2015), achronic corpus of spanish. In Proceedings of the June. Language Resources and Evaluation Conference, Malta, May. Association for Computational Linguis- Pablo Picasso Feliciano de Faria, Fabio Natanael Ke- tics. pler, and Maria Clara Paixao˜ de Sousa. 2010. An integrated tool for annotating historical corpora. Cristina Sanchez-Marco,´ Gemma Boleda, and Llu´ıs In Proceedings of the Fourth Linguistic Annotation Padro.´ 2011. Extending the tool, or how to anno- Workshop, LAW IV ’10, pages 217–221, Strouds- tate historical language varieties. In Proceedings of burg, PA, USA. Association for Computational Lin- the 5th ACL-HLT workshop on language technology guistics. for cultural heritage, social sciences, and humani- ties, pages 1–9. Association for Computational Lin- Andrew Hardie. 2012. Cqpweb combining power, guistics. flexibility and usability in a corpus analysis tool. In- ternational Journal of Corpus Linguistics, 17(3). Cristina Sanchez-Marco,´ .M. Fontana, and J. Domingo. 2012. Anotacion´ automatica´ de Serge Heiden. 2010. The TXM Platform: Build- textos diacronicos´ del espanol.˜ In Actas del VIII ing Open-Source Textual Analysis Software Com- Congreso Internacional de Historia de la Lengua patible with the TEI Encoding Scheme. In Ryo Espanola˜ , Universidad de Santiago de Compostela. Otoguro, Kiyoshi Ishikawa, Hiroshi Umemoto, Kei Jorge Vivaldi. 2009. Corpus and exploitation tool: Yoshimoto, and Yasunari Harada, editors, 24th Pa- Iulact and bwananet. In I International Confer- cific Asia Conference on Language, Information and ence on Corpus Linguistics (CICL 2009), A survey Computation, pages 389–398, Sendai, Japan. Insti- on corpus-based research, Universidad de Murcia, tute for Digital Enhancement of Cognitive Develop- pages 224–239. ment, Waseda University.

Mar´ıa Teresa Herrera and Mar´ıa Estela Gonzalez´ de Fauve. 1997. Textos y Concordancias Electronicos´ del Corpus Medico´ Espanol˜ . Madison.

Maarten Janssen. 2012. NeoTag: a POS tagger for grammatical neologism detection. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, pages 2118–2124.

Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language 6