Apertium: a Rule-Based Machine Translation Platform
Total Page:16
File Type:pdf, Size:1020Kb
Apertium: A rule-based machine translation platform Francis M. Tyers HSL-fakultehta UiT Norgga árktalaš universitehta 9018 Romsa (Norway) 28th September 2014 1 / 41 Outline Introduction Design Development Status Teaching Courses Google Summer of Code Google Code-in Research New language pairs Applying unsupervised methods Hybrid systems Future work and challenges Challenges Plans Collaboration 1 / 41 History I 2004 — Spain gets a new government which launches a call for proposals to build machine translation systems for the languages of Spain I The Universitat d’Alacant (UA), in consortium with EHU, UPC, UVigo, Eleka, Elhuyar (Basque Country) and Imaxin (Galicia) get funded to develop two MT systems: I Apertium: Spanish–Catalan, Spanish–Galician I Matxin: Spanish!Basque I Apertium was not built from scratch, but was rather a rewrite of two existing closed-source systems which had been built by UA 2 / 41 Focus I Marginalised: Languages which are on the periphery either societally or technologically (from Breton to Bengali). I Lesser-resourced: Languages for which few free/open-source language resources exist. I Closely-related: Languages which are suited to shallow-transfer machine translation. 3 / 41 Translation philosophy I Build on word-for-word translation I Avoid complex linguistic formalisms I We like saying “only secondary-school linguistics required”1 I Intermediate representation based on morphological information I Transformer-style systems I Analyse the source language (SL), then I Apply rules to ‘transform’ the SL to the TL 1Otte, P. and Tyers, F. (2011) “Rapid rule-based machine translation between Dutch and Afrikaans”. Proceedings of the 16th Annual Conference of the European Association of Machine Translation 4 / 41 Pipeline SL deformatter text morph. morph. lexical lexical structural morph. post- analyser disambig. transfer selection transfer generator generator reformatter TL text 5 / 41 Step-by-step SL deformatter text morph. morph. lexical lexical structural morph. post- analyser disambig. transfer selection transfer generator generator reformatter TL text Morphological analysis $ echo "Tiilitalojen keskellä hän kirjoitti minulle nimensä kiinalaisin kirjaimin." | apertium -d . fin-eng-morph ^Tiilitalojen/tiilitalo<n><pl><gen>$ ^keskellä/keskellä<adv>/keskellä<post>$ ^hän/hän<prn><pers><sg><nom>$ ^kirjoitti/kirjoittaa<vblex><actv><past><p3><sg>$ ^minulle/minä<prn><pers><sg><all>$ ^nimensä/nimi<n><pl><nom><pxsp3>/nimi<n><sg><gen><pxsp3>/nimi<n><sg><nom><pxsp3>$ ^kiinalaisin/kiinalainen<adj><pos><pl><ins>/kiinalainen<adj><v→n><sup><sg><nom>/kiinalainen<n><pl><ins>$ ^kirjaimin/kirjain<n><pl><ins>$^./.<sent>$ 6 / 41 Step-by-step SL deformatter text morph. morph. lexical lexical structural morph. post- analyser disambig. transfer selection transfer generator generator reformatter TL text Morphological disambiguation $ echo "Tiilitalojen keskellä hän kirjoitti minulle nimensä kiinalaisin kirjaimin." | apertium -d . fin-eng-tagger ^Tiilitalo<n><pl><gen>$ ^keskellä<post>$ ^hän<prn><pers><sg><nom>$ ^kirjoittaa<vblex><actv><past><p3><sg>$ ^minä<prn><pers><sg><all>$ ^nimi<n><sg><gen><pxsp3>$ ^kiinalainen<adj><pos><pl><ins>$ ^kirjain<n><pl><ins>$^.<sent>$ 6 / 41 Step-by-step SL deformatter text morph. morph. lexical lexical structural morph. post- analyser disambig. transfer selection transfer generator generator reformatter TL text Lexical transfer $ echo "Tiilitalojen keskellä hän kirjoitti minulle nimensä kiinalaisin kirjaimin." | apertium -d . fin-eng-biltrans ^Tiilitalo<n><pl><gen>/Brick house<n><pl><gen>$ ^keskellä<post>/in between<pr>$ ^hän<prn><pers><sg><nom>/she<prn><pers><f><sg><nom>/they<prn><pers><mf><pl><nom>/ he<prn><pers><m><sg><nom>$ ^kirjoittaa<vblex><actv><past><p3><sg>/author<vblex><actv><past><p3><sg>/ cut<vblex><actv><past><p3><sg>/write<vblex><actv><past><p3><sg>/type<vblex><actv><past><p3><sg>/ draft<vblex><actv><past><p3><sg>/spell<vblex><actv><past><p3><sg>/...$ ^minä<prn><pers><sg><all>/I<prn><pers><mf><sg><all>$ ^nimi<n><sg><gen><pxsp3>/name<n><sg><gen><pxsp3>/title<n><sg><gen><pxsp3>/header<n><sg><gen><pxsp3>$ ^kiinalainen<adj><pos><pl><ins>/Chinese<adj><pos><pl><ins>$ ^kirjain<n><pl><ins>/letter<n><pl><ins>/character<n><pl><ins>$^.<sent>/.<sent>$ 6 / 41 Step-by-step SL deformatter text morph. morph. lexical lexical structural morph. post- analyser disambig. transfer selection transfer generator generator reformatter TL text Lexical selection $ echo "Tiilitalojen keskellä hän kirjoitti minulle nimensä kiinalaisin kirjaimin." | apertium -d . fin-eng-lextor ^Tiilitalo<n><pl><gen>/Brick house<n><pl><gen>$ ^keskellä<post>/in between<pr>$ ^hän<prn><pers><sg><nom>/they<prn><pers><p3><mf><sg><nom>$ ^kirjoittaa<vblex><actv><past><p3><sg>/write<vblex><actv><past><p3><sg>$ ^minä<prn><pers><sg><all>/I<prn><pers><mf><sg><all>$ ^nimi<n><sg><gen><pxsp3>/name<n><sg><gen><pxsp3>$ ^kiinalainen<adj><pos><pl><ins>/Chinese<adj><pos><pl><ins>$ ^kirjain<n><pl><ins>/character<n><pl><ins>$^.<sent>/.<sent>$ 6 / 41 Step-by-step SL deformatter text morph. morph. lexical lexical structural morph. post- analyser disambig. transfer selection transfer generator generator reformatter TL text Structural transfer $ echo "Tiilitalojen keskellä hän kirjoitti minulle nimensä kiinalaisin kirjaimin." | apertium -d . fin-eng-postchunk ^In between<pr>$ ^brick house<n><pl>$ ^they<prn><pers><p3><mf><sg>$ ^write<vblex><past>$ ^to<pr>$ ^I<prn><pers><p1><mf><sg><acc>$ ^name<n><sg>$ ^with<pr>$ ^Chinese<adj>$ ^character<n><pl>$^.<sent>$ 6 / 41 Step-by-step SL deformatter text morph. morph. lexical lexical structural morph. post- analyser disambig. transfer selection transfer generator generator reformatter TL text Morphological generation $ echo "Tiilitalojen keskellä hän kirjoitti minulle nimensä kiinalaisin kirjaimin." | apertium -d . fin-eng In between brick houses they wrote to me name with Chinese characters. 6 / 41 Modularity The pipeline architecture makes it straightforward to insert or substitute modules: I HFST: I Replacement for lttoolbox I Used for most “morphologically complex” languages I VISL CG: I Accompanies apertium-tagger I Rule-based morphological disambiguation and shallow-syntactic analysis 7 / 41 What kind of systems are developed? Dissemination Assimilation Low precision High precision Gisting Min!Min Min!Maj Maj!Min Closely-related languages Unrelated or distant languages Real translation aids Reading aids 8 / 41 9 / 41 10 / 41 10 / 41 Effort How much effort does it take to develop a system with Apertium ? I 2 weeks minimum I Fastest system made (Macedonian!English) I After 3 months, success rate is around 50% I Around half of the students that participate in the Google Summer of Code are able to finish their systems in 12 weeks I After 6 months, perhaps 75% success rate I Students who don’t quite make it in the Google Summer of Code can have their projects picked up and finished in about three months more. I After 1 year, … I When asking for funding, this is probably the minimum amount of time you’d put. 11 / 41 Success factors Technical: I How much data is freely available I In terms of dictionaries, rule descriptions and corpora I Stability and compatibility of tagsets (intermediate representation) Linguistic: I Morphological complexity of the languages I Genetic and structural similarity of the languages Personal: I Experience of the developer with the tools I Staying power of the developer I Do they have the staying power to check 100s of lexicon entries/day for weeks on end ? 12 / 41 Language pairs 2005 spa cat glg por 3 13 / 41 Language pairs 2006 spa oci cat fra eng glg por 6 13 / 41 Language pairs 2007 ron spa oci cat fra eng glg por 7 13 / 41 Language pairs 2008 eus ron spa oci cat fra epo cym eng glg por 16 13 / 41 Language pairs 2009 eus ron spa oci bre cat fra epo cym eng nob glg swe nno por dan 21 13 / 41 Language pairs 2010 eus ast ron spa oci bre arg cat fra hbs bul epo cym mkd eng isl nob glg swe nno por dan 27 13 / 41 Language pairs 2011 eus ast ron spa oci ita bre arg cat fra hbs bul epo cym mkd eng isl nob glg swe nno por dan 30 13 / 41 Language pairs 2012 eus ast ron spa oci ita bre arg cat afr fra hbs bul nld epo cym mkd eng isl nob glg swe nno por dan 31 13 / 41 Language pairs 2013 eus ast kaz mlt ind ron spa oci ita tat ara msa bre arg cat slv afr fra hbs bul nld epo cym mkd sme eng isl nob glg swe nno por dan 38 13 / 41 Language pairs 2014 eus ast kaz mlt ind ron urd spa oci ita tat ara msa hin bre arg cat slv afr fra hbs bul nld epo cym mkd sme eng isl nob glg swe nno por dan 40 13 / 41 Major users I La Voz de Galicia I La Generalitat de Catalunya I Ofis Publik ar Brezhoneg I WikiMedia I Oslo School District , 14 / 41 Online users 15 / 41 Online translation statistics 16 / 41 Outline Introduction Design Development Status Teaching Courses Google Summer of Code Google Code-in Research New language pairs Applying unsupervised methods Hybrid systems Future work and challenges Challenges Plans Collaboration 16 / 41 Courses organised I Luxembourg, May 2011 I Shupashkar, January 2012 I Helsinki, May 2013 17 / 41 Luxembourg, May 2011 I 2-day course, funded by the European Commission DGT I Held at the European Commission I Organised for translators I Produced an 80-page course book used in later courses 18 / 41 Luxembourg, May 2011 I 2-day course, funded by the European Commission DGT I Held at the European Commission I Organised for translators I Produced an 80-page course book used in later courses I Translators not ever so interested in making MT systems , 18 / 41 Shupashkar, January 2012 I 5-day course, funded by Apertium I Held in Shupashkar (Cheboksary), Chuvashia I 8 modules, covering all aspects of the system I 3 teachers I 30 participants from Russia I At least two participants have gone on to work on MT 19 / 41 Helsinki, May 2013 I Course organised at the Dept.