Apertium: a free/open-source machine translation platform

Mikel L. Forcada (1,2)

(1) Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)

(2) Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)

April 14-16, 2009: Free/open-source MT tutorial at the CNGL

Free/open-source software

Free software is also called open-source software:

0. anyone can use it for any purpose
1. anyone can examine it to see how it works and modify it for any new purpose
2. anyone can freely distribute it
3. anyone may release an improved version so that everyone benefits

For conditions 1 and 3 to be met, anyone should be able to access the source code, hence the name open source.

Copyleft/1

- Obviously a play on the word copyright.

- Copyleft, when added to a free license, means that modifications have to be distributed with the same (copylefted) license.

Copyleft/2

Free software and contents can

- be copylefted using, for instance:
  - the GNU General Public License (http://www.gnu.org/copyleft/gpl.html)
  - the Creative Commons Attribution-ShareAlike license, version 3.0

- or not copylefted using, for instance:
  - the “three-clause” BSD (“Berkeley Software Distribution”) license,
  - the MIT License,
  - the Apache License,
  - the Creative Commons Attribution license, version 3.0

Copyleft and the commons/1

- Commons: a piece of land subject to common use, as (a) undivided land used especially for pasture or (b) a public open area in a municipality

- Analogously: a software commons is code subject to common use

Copyleft and the commons/2

- Copyleft secures the existence of a software or knowledge commons.

- It protects the commons from private appropriation (incorporation into non-free software)

- and enables communities of programmers to build shared bodies of free software,

- by requiring that all derivative works are always distributed under the same license.

Business models using free software/1

Free/open-source software changes the business model:

- from license-centered models to service-centered models

- from vendor lock-in to technological partnership

Business models using free software/2

Note that third parties may engage in business without permission from the original authors (a vulnerability to competition?):

- Copylefted:
  - Installing, configuring, etc. for a customer.
  - Adapting the code (a price break may be offered in exchange for the right to distribute the modifications).
  - Distributing the software and charging for the distribution cost.

- Non-copylefted:
  - All of the above, plus:
  - Creating closed-source software (back to license-centered models)

MT software/1

- MT is special: it strongly depends on data:
  - rule-based MT (RBMT): dictionaries, rules
  - corpus-based MT (CBMT): sentence-aligned parallel text, monolingual corpora

- Three components in every MT system:
  - Engine (also decoder, recombinator...)
  - Data (linguistic data, corpora)
  - Tools to maintain these data and convert them to the format used by the engine

MT software/2: commercial

- Most commercial MT systems are RBMT (but some, such as LanguageWeaver’s, are CBMT).

- They use proprietary technologies which are not disclosed (perceived as their main competitive advantage).

- Only partial modification (customization) of linguistic data is allowed.

MT software/3: free/open-source machine translation

- For MT to be free/open-source, the engine, the data and the tools must all be free/open-source.

- In the case of CBMT this means that corpora must also be free/open-source (hard to come by).

Opportunities from free/open-source MT systems

- Even if reasonable-quality closed-source MT is available, the development and use of free/open-source MT systems provide additional opportunities:
  - Increased language expertise and resources
  - Increased technological independence

These advantages are especially clear when copyleft is used (expertise and resources form a commons that stands a better chance of being preserved and extended).

Increasing expertise and language resources

- When building a free/open-source MT system for a language pair, a variety of situations may occur.

- All of them involve building linguistic expertise and resources through:
  - reflection about the languages involved
  - elicitation of linguistic (monolingual and bilingual) knowledge about them
  - subsequent encoding of this knowledge

- The free/open-source setting makes new expertise and resources naturally available to the community.

- Three scenarios may occur:

Building data for an existing MT engine from scratch

- One needs:
  - A freely available (free/open-source or not) MT engine
  - Freely available (free/open-source or not) tools to manage linguistic data
  - Complete documentation on how to build linguistic data for use with the engine and tools

- This is a very unfavourable setting. Many decisions have to be made, e.g., defining the set of lexical categories and inflection indicators.

- The blank sheet syndrome may paralyze the project.

- If it is overcome, the expertise acquired and the resulting free/open-source data can be improved or used for other purposes: a positive effect on the languages involved.

Building data for an existing MT engine from existing language-pair data

- If a freely available engine and tools, together with free/open-source data, exist for another pair involving a similar or related language, the blank sheet syndrome is drastically reduced. One could, for example:
  - use the same set of lexical categories and inflection indicators
  - build inflection paradigms on top of existing ones

Adapting a free/open-source engine or tools for a new language pair

- If source code is available for the engine and tools, experts could enhance or adapt them to address features of the languages not dealt with adequately by the current code:
  - character sets
  - structural transfer that is not powerful enough, etc.

- This is more challenging than building new data.

- But programmers do not need to have full command of the languages involved (abstract management of linguistic issues is possible). Code rewriting would add expertise and resources to the language community.

Increasing technological independence

- Having a free/open-source engine, tools and data makes users of the involved languages less dependent on a single commercial, closed-source provider.

- This has an analogous effect, not only on machine translation, but also on other language technologies.

Background

Apertium is based on technologies developed by the Transducens group at the Universitat d’Alacant for two existing closed-source systems:

- interNOSTRUM (interNOSTRUM.com, Spanish–Catalan)

- Tradutor Universia (tradutor.universia.net, Spanish–Portuguese)

These technologies, initially designed for related-language pairs, have been extended to handle language pairs which are not so related.

Rationale /1

To generate translations which are

- reasonably intelligible and

- easy to correct

between related languages such as Spanish (es) and Catalan (ca) or Portuguese (pt), etc., or Nynorsk (nn), Bokmål (no) and Icelandic (is), or Irish (ga) and Scottish Gaelic (gd), one can just augment word-for-word translation with:

- robust lexical processing (including multi-word units)

- lexical categorial disambiguation (part-of-speech tagging)

- local structural processing based on simple and well-formulated rules for frequent structural transformations (reordering, agreement)

Rationale /2

For harder, not so related, language pairs:

- It should be possible to build on that simple model.

- It should be possible to generalize its concepts so that complexity is kept as low as possible.

Rationale /3

- It should be possible to generate the whole system from linguistic data (monolingual and bilingual dictionaries, grammar rules) specified in a declarative way.

- This information, i.e.:
  - (language-independent) rules to treat text formats
  - the specification of the part-of-speech tagger
  - morphological and bilingual dictionaries and dictionaries of orthographical transformation rules
  - structural transfer rules

  should be provided in an interoperable format ⇒ XML.

Rationale /4

- It should be possible to have a single generic (language-independent) engine reading language-pair data (“separation of algorithms and data”).

- Language-pair data should be preprocessed so that the system is fast (>10,000 words per second) and compact; for example, lexical transformations are performed by minimized finite-state transducers (FSTs).

Rationale /5

Reasons for the development of Apertium as free/open-source software:

- To give everyone free, unlimited access to the best possible machine-translation technologies.

- To establish a modular, documented, open platform for shallow-transfer machine translation and other human language processing tasks.

- To favour the interchange and reuse of existing linguistic data.

- To make integration with other free/open-source technologies easier.

Rationale /6

More reasons for the development of Apertium as free/open-source software

- To benefit from collaborative development
  - of the machine translation engine
  - of language-pair data for currently existing or new language pairs
  from industries, academia and independent developers.

- To help shift MT business from the obsolescent licence-centered model to a service-centered model.

- To radically guarantee the reproducibility of machine translation and natural language processing research.

- Because it does not make sense to use public funds to develop non-free, closed-source software.

The Apertium platform

Apertium is a free/open-source machine translation platform (http://www.apertium.org) providing:

1. A free/open-source modular shallow-transfer machine translation engine with:
   - text format management
   - finite-state lexical processing
   - statistical lexical disambiguation
   - shallow transfer based on finite-state pattern matching
2. Free/open-source linguistic data in well-specified XML formats for a variety of language pairs
3. Free/open-source tools: compilers to turn linguistic data into a fast and compact form used by the engine, and software to learn disambiguation or structural transfer rules.

The Apertium engine/1

SL text
  ↓
De-formatter
  ↓
Morphological analyser [←FST]
  ↓
Categorial disambiguator [←FST + stat.]
  ↓
[rules→] Structural transfer ↔ Lexical transfer [←FST]
  ↓
Morphological generator [←FST]
  ↓
Post-generator [←FST]
  ↓
Re-formatter
  ↓
TL text

The Apertium engine/2

Communication between modules: text (Unix “pipelines”); an example of this intermediate text is sketched after the list below. Advantages:

- Simplifies diagnosis and debugging

- Allows the modification of data between two modules using, e.g., filters

- Makes it easy to insert alternative modules (interesting for research and development purposes)

- An example: some language pairs have an additional module (based on VISL CG3) before the part-of-speech tagger.
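As an illustration, here is roughly what travels along the pipeline for the Spanish fragment “vino ayer” (“[he/she] came yesterday”) in an assumed Spanish–English direction. The notation follows Apertium’s stream format (^surface/lemma<tags>$), but the concrete tags and analyses shown are only indicative and differ across language pairs:

    After the morphological analyser (one or more LFs per SF):
      ^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^ayer/ayer<adv>$
    After the part-of-speech tagger (one LF per SF):
      ^venir<vblex><ifi><p3><sg>$ ^ayer<adv>$
    After transfer, generation, post-generation and re-formatting:
      came yesterday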

De-formatter

- Separates text from format information.

- Currently available for ISO-8859 or UTF-8 plain text, HTML, RTF, ODF, OpenOffice.org .sxw, etc.

- Based on finite-state techniques.

- Most of these filters are generated (using an XSLT stylesheet) from an XML de-formatter specification file; an example of de-formatted text is shown below.
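As a simplified sketch of what de-formatting looks like (the bracket notation encapsulates format as “superblanks” that the translation modules pass along untouched and the re-formatter later restores; the exact output of a real filter may differ):

    Input (HTML):        <p>vino <em>ayer</em></p>
    De-formatted text:   [<p>]vino [<em>]ayer[</em>][</p>]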

Morphological analyser

- segments the source text into surface forms (SFs),

- assigns to each SF one or more lexical forms (LFs), each one with:
  - lemma
  - lexical category (part of speech)
  - morphological inflection information

- processes contractions (en: can’t = can + not; won’t = will + not) and multi-word units, which may be invariable (es: a través de [=en through, across]) or variable (es: echó de menos → echar de menos [=en missed]),

- reads finite-state transducers generated from a morphological dictionary in XML (using a compiler); a sketch of such a dictionary is shown below.
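The following is a minimal sketch of a monolingual dictionary in Apertium’s XML dictionary format; the symbol names, the paradigm and the single entry (Spanish casa, “house”) are chosen purely for illustration, and real dictionaries declare an alphabet and many more symbols, paradigms and entries:

    <dictionary>
      <sdefs>
        <sdef n="n"/>   <!-- noun -->
        <sdef n="f"/>   <!-- feminine -->
        <sdef n="sg"/>  <!-- singular -->
        <sdef n="pl"/>  <!-- plural -->
      </sdefs>
      <pardefs>
        <!-- inflection paradigm: cas+a (singular), cas+as (plural) -->
        <pardef n="cas/a__n">
          <e><p><l>a</l><r>a<s n="n"/><s n="f"/><s n="sg"/></r></p></e>
          <e><p><l>as</l><r>a<s n="n"/><s n="f"/><s n="pl"/></r></p></e>
        </pardef>
      </pardefs>
      <section id="main" type="standard">
        <!-- surface form "casas" is analysed as casa<n><f><pl> -->
        <e lm="casa"><i>cas</i><par n="cas/a__n"/></e>
      </section>
    </dictionary>

The compiler turns such a file into the finite-state transducer that the analyser reads, mapping surface forms (left side) to lexical forms (right side).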

Categorial disambiguator (part-of-speech tagger)

- picks one of the LFs corresponding to each ambiguous SF (about 30% of them) according to context,

- uses hidden Markov models and hand-written constraint rules,

- is trained using representative corpora for the source language (manually disambiguated or not) or, recently, using statistical models for the TL,

- its behaviour is completely specified by an XML file (sketched below).
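For flavour, the sketch below is written in the spirit of Apertium’s tagger-definition XML; the element names and the toy constraint are assumptions of this sketch rather than excerpts from a released file, so they should be checked against the actual DTD. Coarse tagger categories are declared over the fine-grained morphological tags, and hand-written constraints can forbid label sequences:

    <tagger name="es">
      <tagset>
        <!-- group fine-grained tag sequences into coarse tagger categories -->
        <def-label name="NOUN"><tags-item tags="n.*"/></def-label>
        <def-label name="PREP" closed="true"><tags-item tags="pr"/></def-label>
      </tagset>
      <forbid>
        <!-- hand-written constraint ruling out a label sequence the linguist
             considers impossible (a toy example) -->
        <label-sequence>
          <label-item label="PREP"/>
          <label-item label="PREP"/>
        </label-sequence>
      </forbid>
    </tagger>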

Structural transfer /1

- It is based on finite-state techniques (finite-state recognizers).

- The XML transfer-rule file is preprocessed for faster interpreting.

- Rules have a pattern–action form.

- It detects LF patterns to be processed using a left-to-right, longest-match strategy.

- It executes the actions associated with each pattern in the rule file to generate the corresponding LF pattern for the TL (a sketch of a rule is shown below).
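As an illustration of the pattern–action form, here is a simplified rule in Apertium’s structural transfer XML; it assumes that the categories “adj” and “nom” have been declared earlier with def-cat elements, and it omits the agreement operations a real rule would also perform. It matches an adjective–noun pattern and outputs the two target-language LFs in noun–adjective order:

    <rule comment="ADJ NOM -> NOM ADJ (e.g. en 'red house' -> es 'casa roja')">
      <pattern>
        <pattern-item n="adj"/>
        <pattern-item n="nom"/>
      </pattern>
      <action>
        <out>
          <lu><clip pos="2" side="tl" part="whole"/></lu>
          <b pos="1"/>
          <lu><clip pos="1" side="tl" part="whole"/></lu>
        </out>
      </action>
    </rule>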

Structural transfer /2

For “harder” language pairs, a three-stage structural transfer is available:

- Patterns of LFs (chunks) are detected, processed and marked.

- Patterns of chunks are detected and processed: this allows for longer-range (“inter-chunk”) syntactic transformations.

- The output chunks are finished and the resulting LFs are written.

Lexical transfer module

- reads each SL LF and generates the corresponding TL LF,

- reads finite-state transducers generated from bilingual dictionaries in XML (using a compiler),

- is invoked by the structural transfer module (a sketch of a bilingual dictionary entry is shown below).
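For illustration, a bilingual dictionary uses the same XML dictionary format to pair SL lexical forms (left) with TL lexical forms (right); the entry below is a hypothetical Spanish–English example, and any trailing inflection tags not listed in the entry are typically passed through unchanged:

    <dictionary>
      <sdefs>
        <sdef n="n"/>   <!-- noun -->
        <sdef n="f"/>   <!-- feminine -->
      </sdefs>
      <section id="main" type="standard">
        <!-- es "casa" <-> en "house" -->
        <e><p><l>casa<s n="n"/><s n="f"/></l><r>house<s n="n"/></r></p></e>
      </section>
    </dictionary>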

Morphological generator

- Generates a TL SF from each TL LF after adequately inflecting it.

- It reads finite-state transducers generated from a morphological dictionary in XML (using a compiler).

Post-generator

- Performs some TL orthographical transformations, such as contractions (ca: de + els → dels; pt: dizer + o → dizê-lo; en: can + not → cannot), inserting apostrophes (ca: de + amics → d’amics), etc.

- It is based on finite-state transducers generated from a post-generation rule dictionary (using a compiler).

Re-formatter

- Reestablishes format information in the translated text.

- It is also used to modify URLs in links for translate-as-you-surf browsing.

- It is based on finite-state techniques.

- It is generated (using an XSLT stylesheet) from an XML de-formatter specification file.

Language-pair data

The Apertium project hosts the development of a large number of language pairs:

- Stable language pairs include: ca→eo, ca↔oc, cy→en, en↔ca, en↔es, en↔gl, es↔ca, es→eo, es↔fr, es↔gl, es↔pt, es↔oc, eu→es, pt↔ca, pt↔gl, ro→es.

- There is also a growing number of language pairs under development. Some include Celtic languages (br→fr, ga↔gd).

Project funding

Funded by

- The Ministry of Industry, Tourism and Commerce of Spain (also the Ministries of Education and Science and of Science and Technology of Spain)

- The Secretariat for Technology and the Information Society of the Government of Catalonia

- The Ministry of Foreign Affairs of Romania

- The Universitat d’Alacant

- The Ofis ar Brezhoneg (Breton Language Board)

- Google (Google Summer of Code 2009 scholarships)

- Companies: Prompsit Language Engineering, ABC Enciklopedioj, Eleka Ingeniaritza Linguistikoa, imaxin|software, Eolaistriu Technologies, etc.

The Apertium community/1

Not the ideal community development situation, but close. In addition to the original (funded) developers, a community (instigated by Francis Tyers) has formed around the platform.

- More than 100 developers in .net/projects/apertium/, many outside the original group; code is updated very frequently, with hundreds of SVN commits every month.

- A collectively maintained wiki documents current development and gives tips for people building new language pairs or code.

The Apertium community/2

- Externally developed tools and code:
  - a graphical user interface (apertium-tolk) and the diagnostic tool apertium-view
  - plugins for OpenOffice.org, for the Pidgin (previously Gaim) messaging program, for the WordPress content management system, etc.
  - a film subtitling application (apertium-subtitles)
  - dictionaries adapted to mobile phones and handhelds (tinylex)
  - Windows ports

- Many people gather and interact in the #apertium IRC channel (at freenode.net).

- Stable packages are ported to Debian GNU/Linux (and therefore to Ubuntu).

License

This work may be distributed under the terms of

- the Creative Commons Attribution–ShareAlike license: http://creativecommons.org/licenses/by-sa/3.0/

- the GNU GPL v. 3.0 license: http://www.gnu.org/licenses/gpl.html

Dual license! E-mail me to get the sources: [email protected]