Apertium: a free/open-source machine translation platform

Mikel L. Forcada (1,2)

(1) Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)

(2) Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)

April 14-16, 2009: Free/open-source MT tutorial at the CNGL

Free/open-source software

Free software is also called open-source software:

0. anyone can use it for any purpose
1. anyone can examine it to see how it works and modify it for any new purpose
2. anyone can freely distribute it
3. anyone may release an improved version so that everyone benefits

For conditions 1 and 3 to be met, anyone should be able to access the source code, hence the name open source.

Copyleft/1

- Obviously a play on the word copyright.

- Copyleft, when added to a free license, means that modifications have to be distributed with the same (copylefted) license.

Copyleft/2

Free software and contents can

- be copylefted using, for instance:
  - the GNU General Public License (http://www.gnu.org/copyleft/gpl.html)
  - the Creative Commons Attribution-ShareAlike license, version 3.0

- or not copylefted using, for instance:
  - the “three-clause” BSD (“Berkeley Software Distribution”) license,
  - the MIT License,
  - the Apache License,
  - the Creative Commons Attribution license, version 3.0

Copyleft and the commons/1

- Commons: a piece of land subject to common use, as (a) undivided land used especially for pasture or (b) a public open area in a municipality

- Analogously: a software commons is code subject to common use

Copyleft and the commons/2

- Copyleft secures the existence of a software or knowledge commons.

- It protects the commons from private appropriation (incorporation into non-free software)

- and enables communities of programmers to build shared bodies of free software,

- by requiring that all derivative works are always distributed under the same license.

Business models using free software/1

Free/open-source software changes the business model:

- from license-centered models to service-centered models

- from vendor lock-in to technological partnership

Business models using free software/2

Note that third parties may engage in business without permission from the original authors (a vulnerability to competition?):

- Copylefted:
  - Installing, configuring, etc. for a customer.
  - Adapting the code (a price break may be offered in exchange for the right to distribute the modifications).
  - Distributing the software and charging for the distribution cost.

- Non-copylefted:
  - All of the above, plus:
  - Creating closed-source software (back to license-centered models)

MT software/1

- MT is special: it strongly depends on data:
  - rule-based MT (RBMT): dictionaries, rules
  - corpus-based MT (CBMT): sentence-aligned parallel text, monolingual corpora

- Three components in every MT system:
  - Engine (also decoder, recombinator...)
  - Data (linguistic data, corpora)
  - Tools to maintain these data and convert them to the format used by the engine

MT software/2: commercial

- Most commercial MT systems are RBMT (but some, such as LanguageWeaver’s, are CBMT).

- They use proprietary technologies which are not disclosed (perceived as their main competitive advantage).

- Only partial modification (customization) of linguistic data is allowed.

MT software/3: free/open-source machine translation

- For MT to be free/open-source, the engine, the data and the tools must all be free/open-source.

- In the case of CBMT this means that corpora must also be free/open-source (hard to come by).

Opportunities from free/open-source MT systems

- Even if reasonable-quality closed-source MT is available, the development and use of free/open-source MT systems provide additional opportunities:
  - Increased language expertise and resources
  - Increased technological independence

These advantages are especially clear when copyleft is used (expertise and resources form a commons that stands a better chance of being preserved and extended).

Increasing expertise and language resources

- When building a free/open-source MT system for a language pair, a variety of situations may occur.

- All of them involve building linguistic expertise and resources through:
  - reflection about the languages involved
  - elicitation of linguistic (monolingual and bilingual) knowledge about them
  - subsequent encoding of this knowledge

- The free/open-source setting makes new expertise and resources naturally available to the community.

- Three scenarios may occur:

Building data for an existing MT engine from scratch

- One needs:
  - A freely available (free/open-source or not) MT engine
  - Freely available (free/open-source or not) tools to manage linguistic data
  - Complete documentation on how to build linguistic data for use with the engine and tools

- This is a very unfavourable setting. Many decisions have to be made, e.g., defining the set of lexical categories and inflection indicators.

- The blank sheet syndrome may paralyze the project.

- If it is overcome, the expertise acquired and the resulting free/open-source data can be improved or used for other purposes: a positive effect on the languages involved.

Building data for an existing MT engine from existing language-pair data

- If a freely available engine and tools, together with free/open-source data, exist for another pair involving a similar or related language, the blank sheet syndrome is drastically reduced. One could, for example:
  - use the same set of lexical categories and inflection indicators
  - build inflection paradigms on top of existing ones

Adapting a free/open-source engine or tools for a new language pair

- If source code is available for the engine and tools, experts could enhance or adapt them to address features of the languages not dealt with adequately by the current code:
  - character sets
  - structural transfer that is not powerful enough, etc.

- This is more challenging than building new data.

- But programmers do not need to have full command of the languages involved (abstract management of linguistic issues is possible). Code rewriting would add expertise and resources to the language community.

Increasing technological independence

- Having a free/open-source engine, tools and data makes users of the involved languages less dependent on a single commercial, closed-source provider.

- This has an analogous effect, not only on machine translation, but also on other language technologies.

Background

Apertium is based on technologies developed by the Transducens group at the Universitat d’Alacant for two existing closed-source systems:

- interNOSTRUM (interNOSTRUM.com, Spanish–Catalan)

- Tradutor Universia (tradutor.universia.net, Spanish–Portuguese)

These technologies, initially designed for related-language pairs, have been extended to handle language pairs which are not so related.

Rationale /1

To generate translations which are

- reasonably intelligible and

- easy to correct

between related languages such as Spanish (es) and Catalan (ca) or Portuguese (pt), etc., or Nynorsk (nn), Bokmål (no) and Icelandic (is), or Irish (ga) and Scottish Gaelic (gd), one can just augment word-for-word translation with:

- robust lexical processing (including multi-word units)

- lexical categorial disambiguation (part-of-speech tagging)

- local structural processing based on simple and well-formulated rules for frequent structural transformations (reordering, agreement)

Rationale /2

For harder, not so related, language pairs:

- It should be possible to build on that simple model.

- It should be possible to generalize its concepts so that complexity is kept as low as possible.

Rationale /3

- It should be possible to generate the whole system from linguistic data (monolingual and bilingual dictionaries, grammar rules) specified in a declarative way.

- This information, i.e.:
  - (language-independent) rules to treat text formats
  - the specification of the part-of-speech tagger
  - morphological and bilingual dictionaries and dictionaries of orthographical transformation rules
  - structural transfer rules

  should be provided in an interoperable format ⇒ XML.

Rationale /4

- It should be possible to have a single generic (language-independent) engine reading language-pair data (“separation of algorithms and data”).

- Language-pair data should be preprocessed so that the system is fast (>10,000 words per second) and compact; for example, lexical transformations are performed by minimized finite-state transducers (FSTs).

Rationale /5

Reasons for the development of Apertium as free/open-source software:

- To give everyone free, unlimited access to the best possible machine-translation technologies.

- To establish a modular, documented, open platform for shallow-transfer machine translation and other human language processing tasks.

- To favour the interchange and reuse of existing linguistic data.

- To make integration with other free/open-source technologies easier.

Rationale /6

More reasons for the development of Apertium as free/open-source software

- To benefit from collaborative development
  - of the machine translation engine
  - of language-pair data for currently existing or new language pairs
  from industries, academia and independent developers.

- To help shift MT business from the obsolescent licence-centered model to a service-centered model.

- To radically guarantee the reproducibility of machine translation and natural language processing research.

- Because it does not make sense to use public funds to develop non-free, closed-source software.

The Apertium platform

Apertium is a free/open-source machine translation platform (http://www.apertium.org) providing:

1. A free/open-source modular shallow-transfer machine translation engine with:
   - text format management
   - finite-state lexical processing
   - statistical lexical disambiguation
   - shallow transfer based on finite-state pattern matching
2. Free/open-source linguistic data in well-specified XML formats for a variety of language pairs
3. Free/open-source tools: compilers to turn linguistic data into a fast and compact form used by the engine, and software to learn disambiguation or structural transfer rules.

The Apertium engine/1

SL text
  ↓
De-formatter
  ↓
Morphological analyser [←FST]
  ↓
Categorial disambiguator [←FST + stat.]
  ↓
[rules→] Structural transfer ↔ Lexical transfer [←FST]
  ↓
Morphological generator [←FST]
  ↓
Post-generator [←FST]
  ↓
Re-formatter
  ↓
TL text

The Apertium engine/2

Communication between modules: text (Unix “pipelines”); an example of this intermediate text is sketched after the list below. Advantages:

- Simplifies diagnosis and debugging

- Allows the modification of data between two modules using, e.g., filters

- Makes it easy to insert alternative modules (interesting for research and development purposes)

- An example: some language pairs have an additional module (based on VISL CG3) before the part-of-speech tagger.
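As an illustration, here is roughly what travels along the pipeline for the Spanish fragment “vino ayer” (“[he/she] came yesterday”) in an assumed Spanish–English direction. The notation follows Apertium’s stream format (^surface/lemma<tags>$), but the concrete tags and analyses shown are only indicative and differ across language pairs:

    After the morphological analyser (one or more LFs per SF):
      ^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^ayer/ayer<adv>$
    After the part-of-speech tagger (one LF per SF):
      ^venir<vblex><ifi><p3><sg>$ ^ayer<adv>$
    After transfer, generation, post-generation and re-formatting:
      came yesterday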

De-formatter

- Separates text from format information.

- Currently available for ISO-8859 or UTF-8 plain text, HTML, RTF, ODF, OpenOffice.org .sxw, etc.

- Based on finite-state techniques.

- Most of these filters are generated (using an XSLT stylesheet) from an XML de-formatter specification file; an example of de-formatted text is shown below.
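As a simplified sketch of what de-formatting looks like (the bracket notation encapsulates format as “superblanks” that the translation modules pass along untouched and the re-formatter later restores; the exact output of a real filter may differ):

    Input (HTML):        <p>vino <em>ayer</em></p>
    De-formatted text:   [<p>]vino [<em>]ayer[</em>][</p>]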

Morphological analyser

- segments the source text into surface forms (SFs),

- assigns to each SF one or more lexical forms (LFs), each one with:
  - lemma
  - lexical category (part of speech)
  - morphological inflection information

- processes contractions (en: can’t = can + not; won’t = will + not) and multi-word units, which may be invariable (es: a través de [=en through, across]) or variable (es: echó de menos → echar de menos [=en missed]),

- reads finite-state transducers generated from a morphological dictionary in XML (using a compiler); a sketch of such a dictionary is shown below.
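The following is a minimal sketch of a monolingual dictionary in Apertium’s XML dictionary format; the symbol names, the paradigm and the single entry (Spanish casa, “house”) are chosen purely for illustration, and real dictionaries declare an alphabet and many more symbols, paradigms and entries:

    <dictionary>
      <sdefs>
        <sdef n="n"/>   <!-- noun -->
        <sdef n="f"/>   <!-- feminine -->
        <sdef n="sg"/>  <!-- singular -->
        <sdef n="pl"/>  <!-- plural -->
      </sdefs>
      <pardefs>
        <!-- inflection paradigm: cas+a (singular), cas+as (plural) -->
        <pardef n="cas/a__n">
          <e><p><l>a</l><r>a<s n="n"/><s n="f"/><s n="sg"/></r></p></e>
          <e><p><l>as</l><r>a<s n="n"/><s n="f"/><s n="pl"/></r></p></e>
        </pardef>
      </pardefs>
      <section id="main" type="standard">
        <!-- surface form "casas" is analysed as casa<n><f><pl> -->
        <e lm="casa"><i>cas</i><par n="cas/a__n"/></e>
      </section>
    </dictionary>

The compiler turns such a file into the finite-state transducer that the analyser reads, mapping surface forms (left side) to lexical forms (right side).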

Categorial disambiguator (part-of-speech tagger)

- picks one of the LFs corresponding to each ambiguous SF (about 30% of them) according to context,

- uses hidden Markov models and hand-written constraint rules,

- is trained using representative corpora for the source language (manually disambiguated or not) or, recently, using statistical models for the TL,

- its behaviour is completely specified by an XML file (sketched below).
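For flavour, the sketch below is written in the spirit of Apertium’s tagger-definition XML; the element names and the toy constraint are assumptions of this sketch rather than excerpts from a released file, so they should be checked against the actual DTD. Coarse tagger categories are declared over the fine-grained morphological tags, and hand-written constraints can forbid label sequences:

    <tagger name="es">
      <tagset>
        <!-- group fine-grained tag sequences into coarse tagger categories -->
        <def-label name="NOUN"><tags-item tags="n.*"/></def-label>
        <def-label name="PREP" closed="true"><tags-item tags="pr"/></def-label>
      </tagset>
      <forbid>
        <!-- hand-written constraint ruling out a label sequence the linguist
             considers impossible (a toy example) -->
        <label-sequence>
          <label-item label="PREP"/>
          <label-item label="PREP"/>
        </label-sequence>
      </forbid>
    </tagger>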

Structural transfer /1

- It is based on finite-state techniques (finite-state recognizers).

- The XML transfer-rule file is preprocessed for faster interpreting.

- Rules have a pattern–action form.

- It detects LF patterns to be processed using a left-to-right, longest-match strategy.

- It executes the actions associated with each pattern in the rule file to generate the corresponding LF pattern for the TL (a sketch of a rule is shown below).
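As an illustration of the pattern–action form, here is a simplified rule in Apertium’s structural transfer XML; it assumes that the categories “adj” and “nom” have been declared earlier with def-cat elements, and it omits the agreement operations a real rule would also perform. It matches an adjective–noun pattern and outputs the two target-language LFs in noun–adjective order:

    <rule comment="ADJ NOM -> NOM ADJ (e.g. en 'red house' -> es 'casa roja')">
      <pattern>
        <pattern-item n="adj"/>
        <pattern-item n="nom"/>
      </pattern>
      <action>
        <out>
          <lu><clip pos="2" side="tl" part="whole"/></lu>
          <b pos="1"/>
          <lu><clip pos="1" side="tl" part="whole"/></lu>
        </out>
      </action>
    </rule>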

Structural transfer /2

For “harder” language pairs, a three-stage structural transfer is available:

- Patterns of LFs (chunks) are detected, processed and marked.

- Patterns of chunks are detected and processed: this allows for longer-range (“inter-chunk”) syntactic transformations.

- The output chunks are finished and the resulting LFs are written.

Lexical transfer module

- reads each SL LF and generates the corresponding TL LF,

- reads finite-state transducers generated from bilingual dictionaries in XML (using a compiler),

- is invoked by the structural transfer module (a sketch of a bilingual dictionary entry is shown below).
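For illustration, a bilingual dictionary uses the same XML dictionary format to pair SL lexical forms (left) with TL lexical forms (right); the entry below is a hypothetical Spanish–English example, and any trailing inflection tags not listed in the entry are typically passed through unchanged:

    <dictionary>
      <sdefs>
        <sdef n="n"/>   <!-- noun -->
        <sdef n="f"/>   <!-- feminine -->
      </sdefs>
      <section id="main" type="standard">
        <!-- es "casa" <-> en "house" -->
        <e><p><l>casa<s n="n"/><s n="f"/></l><r>house<s n="n"/></r></p></e>
      </section>
    </dictionary>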

Morphological generator

- Generates a TL SF from each TL LF after adequately inflecting it.

- It reads finite-state transducers generated from a morphological dictionary in XML (using a compiler).

Post-generator

- Performs some TL orthographical transformations, such as contractions (ca: de + els → dels; pt: dizer + o → dizê-lo; en: can + not → cannot), inserting apostrophes (ca: de + amics → d’amics), etc.

- It is based on finite-state transducers generated from a post-generation rule dictionary (using a compiler).

Re-formatter

- Reestablishes format information in the translated text.

- It is also used to modify URLs in links for translate-as-you-surf browsing.

- It is based on finite-state techniques.

- It is generated (using an XSLT stylesheet) from an XML de-formatter specification file.

Language-pair data

The Apertium project hosts the development of a large number of language pairs:

- Stable language pairs include: ca→eo, ca↔oc, cy→en, en↔ca, en↔es, en↔gl, es↔ca, es→eo, es↔fr, es↔gl, es↔pt, es↔oc, eu→es, pt↔ca, pt↔gl, ro→es.

- There is also a growing number of language pairs under development. Some include Celtic languages (br→fr, ga↔gd).

Project funding

Funded by

- The Ministry of Industry, Tourism and Commerce of Spain (also the Ministries of Education and Science and of Science and Technology of Spain)

- The Secretariat for Technology and the Information Society of the Government of Catalonia

- The Ministry of Foreign Affairs of Romania

- The Universitat d’Alacant

- The Ofis ar Brezhoneg (Breton Language Board)

- Google (Google Summer of Code 2009 scholarships)

- Companies: Prompsit Language Engineering, ABC Enciklopedioj, Eleka Ingeniaritza Linguistikoa, imaxin|software, Eolaistriu Technologies, etc.

The Apertium community/1

Not the ideal community development situation, but close. In addition to the original (funded) developers, a community (instigated by Francis Tyers) has formed around the platform.

- More than 100 developers in .net/projects/apertium/, many outside the original group; code is updated very frequently, with hundreds of SVN commits every month.

- A collectively maintained wiki documents current development and gives tips for people building new language pairs or code.

The Apertium community/2

- Externally developed tools and code:
  - a graphical user interface (apertium-tolk) and the diagnostic tool apertium-view
  - plugins for OpenOffice.org, for the Pidgin (previously Gaim) messaging program, for the WordPress content management system, etc.
  - a film subtitling application (apertium-subtitles)
  - dictionaries adapted to mobile phones and handhelds (tinylex)
  - Windows ports

- Many people gather and interact in the #apertium IRC channel (at freenode.net).

- Stable packages are ported to Debian GNU/Linux (and therefore to Ubuntu).

License

This work may be distributed under the terms of

- the Creative Commons Attribution–ShareAlike license: http://creativecommons.org/licenses/by-sa/3.0/

- the GNU GPL v. 3.0 license: http://www.gnu.org/licenses/gpl.html

Dual license! E-mail me to get the sources: [email protected]