Bigorna – a Toolkit for Orthography Migration Challenges
Total Page:16
File Type:pdf, Size:1020Kb
Bigorna – a toolkit for orthography migration challenges José João Almeida1, André Santos1, Alberto Simões2 1 Departamento de Informática, Universidade do Minho, Portugal 2 Escola Superior de Estudos Industriais e de Gestão, Instituto Politécnico do Porto, Portugal [email protected], [email protected], [email protected] Abstract Languages are born, evolve and, eventually, die. During this evolution their spelling rules (and sometimes the syntactic and semantic ones) change, putting old documents out of use. In Portugal, a pair of political agreements with Brazil forced relevant changes on the way the Portuguese language is written. In this article we will detail these two Orthographic Agreements (one in the thirties and the other more recently, in the nineties), and the challenges present on the automatic migration of old documents spelling to their actual one. We will reveal Bigorna, a toolkit for the classification of language variants, their comparison and the conversion of texts in different language versions. These tools will be explained together with examples of migration issues. As Birgorna relies on a set of conversion rules we will also discuss how to infer conversion rules from a set of documents (texts with different ages). The document concludes with a brief evaluation on the conversion and classification tool results and their relevance in the current Portuguese language scenario. 1. Introduction impact. In this article we are interested into the latest, that can be summarized as: Languages evolve. This evolution can be natural and grad- ual, when speakers change habits during decades, or can be • 1931: This was the first orthographic agreement be- forced and drastic, when some kind of regulatory institu- tween Portugal and Brazil trying to approximate both tion defines some rules (or heuristics) on how the language languages. This change abolished the silent s from orthography will change. words such as sciência, scena, scéptico. It also The case of the recent evolution on the Portuguese language changed the way compound verbs were written, from is the latest. This change was mainly a political issue to dir-se há and amar-te hei to dir-se-á and amar-te-ei. approximate the different countries that speak Portuguese. Details on the story of this evolution can be found on sec- • 1945: This new reform just hit the language in Por- tion 1.1.. tugal, cleaning a set of different accents, like remov- In any of the two presented evolution cases, documents lose ing completely the umlaut (trema, in Portuguese), and importance and relevance as soon as their orthography is removing the circumflex accent in homograph words outdated. such as, acêrto and acerto, cêrca and cerca, côr and With the latest changes on the Portuguese language, and cor, fôra and fora, dêsse and desse, and so on. the amount of information that is typical of this era, it be- came necessary to develop automatic methods for language • 1973: While not a political agreement, Portugal fol- modernization. lowed Brazil and abolished accent marks in secondary This document discusses Bigorna, a toolkit developed to stressed syllables (tètralogia and tetralogia). help on the update of documents orthography. It was devel- oped to be as language independent as possible, relying in • 1990/2009: On December the 16th of 1990 the Por- a set of rules learned by the comparison of old and actual tuguese Language Orthographic Agreement (PLOA) corpora orthographic differences. (da República, 1991) was approved in Lisbon by An- This section will present the main steps of Portuguese lan- gola, Cape Verde, Guinea-Bissau, Mozambique, Sao guage evolution, the motivation for our work, what was Tome and Principe, Brazil, Portugal and a delegation the initial state before the project beginnings, and Bigorna of galician observers. Its goal is to unify, as widely main goals. It will be followed by a section on language as possible, the general vocabulary of the Portuguese resources construction and tools development. Section 3. language. presents a brief evaluation of the developed tools. It is fol- The urge to bring closer all variants of Portuguese im- lowed by some conclusions and future work issues. plies some changes on the structure and content of the language itself. On those content changes, the pho- 1.1. Portuguese Language Evolution netic criteria (or pronunciation one) was chosen in- The Portuguese language evolved during the last century stead of the etymological one. This led to some cases through successive orthographic agreements, especially be- of multiple spelling (whenever a word is pronounced tween Portugal and Brazil. A full time-line of changes can differently according to the variant in use) like facto be consulted in (Wikipedia:Reforms, 2010). Some of the and fato, or suntuoso and sumptuoso. The lack of con- described changes are just political, some other had real sensus about whether some words are to be updated or not, and the absence of a common orthographic vocab- • therefore, it is important to create and maintain an Or- ulary (as foreseen in article 2 of the PLOA) gets this thographic Agreement’s Knowledge Base — OAKB, a process even more complex. table containing lemmas, changes and rules based on the currently existing lists, new examples found and While this latest political change was accepted and should the results of processing texts. This table will be up- be put in practice soon, real users of the language will con- dated and evolve (one might never know when it is tinue to write as they learned for some time, leading to a complete); double orthographic form. This means that the two forms (before-1990 and after-1990) are relevant at the moment. • the main problem is on how to determine which words are candidates to integrate the OAKB (what are the 1.2. Motivation changing words). The motivation to this work is two fold: to update old It was clear the need of a project dedicated to the migration documents in the public domain (with more than 70 years process. It was Bigorna project birth. after author dead) to current used orthographic form, and the need to update current documents to the future1 ortho- 1.4. Bigorna’s design goals graphic form, as defined by PLOA: In order to help in the migration process, it is necessary to have some resources and tools. The first set of requirements • With the actual tendency to public accessible repos- included: itories or libraries in the Internet, and as a special 2 • collect all the resources related to PLOA that are al- case Project Gutenberg , a lot of books that are cur- ready available in the Internet; rently under public domain are being scanned and transcribed. While it is crucial to keep the original • create a spell-checker dictionary for the new Por- orthographic form in the archive for linguistic stud- tuguese language; ies (for example), their modernization is also relevant, • build a tool to convert text from the pre-agreement namely for reading purposes. Portuguese to the updated Portuguese language; This leads to the necessity to develop a tool to per- form the systematic migration of language from before • build a classification tool, to identify the variant of 1930 to the currently used (before 1990/2009 reform). Portuguese used in a given text; • make the previous tools work with as few dependen- • We are in the information era. A lot of different doc- cies as possible. uments are currently available and there will be the need to update them to the new orthographic form. We soon understood that the set of changing words was be- ing updated frequently as PLOA is not defined by exhaus- The PLOA implementation demands a revision and tion (it does not list all words, just defines a few heuristics). an update of the existing linguistic tools and the cre- This lead to the following decisions: ation of new ones, and the complexity of these tasks increase due to the multiple spelling cases (see sec- • center all the information about the word changes in tion 1.3.). an unique table (OAKB); • build a system, that given that table, would generate: A great help to both processes would be the creation of a set of tools capable of inferring migration rules through the – the new spell-checker dictionary; comparison of texts. Tools like those would be useful to de- – the language migration tool; velop applications that people could use to face the issues – the Portuguese language variant classifier of PLOA. While this document will mainly focus PLOA implementation, the development tools also have a role in • develop tools to help in the construction and manipu- updating old documents spelling, or in future language up- lation of OAKB: dates. – tools to find lemma changing candidates from 1.3. Summarizing texts; – The current scenario in the Portuguese language, as created a tool to extract lexical differences from pairs of by the PLOA: texts; • make the previous tools as generic and reusable as pos- • the spelling changes dictated by the agreement cannot sible. be determined automatically, as they rely on phonetic criteria and may sometimes be ambiguous (different 2. Bigorna Development accents, for instance); This section describes the Bigorna approach and its devel- opment details. We will first focus how resources were 1We will use “current orthographic form” to refer to the ortho- compiled into OAKB and then how the spell checker dic- graphic form before 1990 agreement, and the “future form” for tionary was updated. Follows a description about the cre- the orthographic form after 1990 agreement. ated tools: conversion tools, the language classifier and the 2http://www.gutenberg.org/ lexical comparison tools. 2.1. Compiling OAKB Function newdic(oakdb,dicjs) for ( x 2 dom(oakb) To start the compilation of resources to integrate the knowl- ^ oakb[x].type = normal edge base we started by searching for free resources related ^ x 6= oakb[x].preferpt to the PLOA.