Introducing Linguistic Knowledge Into Statistical Machine Translation

Ph.D. Thesis Dissertation

Introducing Linguistic Knowledge into Statistical Machine Translation

Author: Adrià de Gispert Ramis
Advisor: Prof. José B. Mariño Acebal

TALP Research Center, Speech Processing Group
Department of Signal Theory and Communications
Universitat Politècnica de Catalunya

Barcelona, October 2006

Time flies like an arrow. Fruit flies like a banana.
Groucho Marx

Abstract

This Ph.D. dissertation addresses the use of morphosyntactic information to improve the performance of Statistical Machine Translation (SMT) systems, providing them with additional linguistic information beyond the surface level of words from parallel corpora.

The statistical machine translation system used in this work follows a tuple-based approach, modelling joint-probability translation models via log-linear combination of bilingual n-grams with additional feature functions. A detailed study of the approach is conducted. This includes its initial development from a speech-oriented Finite-State Transducer architecture implementing X-grams towards a large-vocabulary text-oriented n-grams implementation, training and decoding particularities, portability across language pairs and tasks, and main difficulties as revealed in error analyses.

The use of linguistic knowledge to improve word alignment quality is also studied. A cooccurrence-based one-to-one word alignment algorithm is extended with verb form classification, with successful results. Additionally, we evaluate the impact on word alignment and translation quality of Part-Of-Speech, base form, verb form classification and stemming when used with state-of-the-art word alignment tools.

Furthermore, the thesis proposes a translation model tackling verb form generation through an additional verb instance model, reporting experiments on English→Spanish tasks. Disagreement is addressed by incorporating a target Part-Of-Speech language model.
Finally, we study the impact of morphology derivation on the N-gram-based SMT formulation, empirically evaluating the quality gain that can be obtained via morphology reduction.

Summary

This thesis is devoted to studying the use of morphosyntactic information within stochastic translation systems, with the aim of improving their quality by incorporating linguistic information beyond the superficial symbolic level of words.

The stochastic translation system used in this work follows an approach based on tuples, bilingual units that make it possible to estimate a joint-probability translation model by combining, within a log-linear framework, n-gram chains and additional feature functions. A detailed study of this approach is presented, including its transformation from an X-gram implementation in finite-state automata, more oriented to speech translation, into the current n-gram solution oriented to large-vocabulary text translation. The thesis also studies the training and decoding stages, as well as performance on different tasks (varying the corpus size or the language pair) and the main problems revealed in the error analyses.

The thesis also investigates the incorporation of linguistic information specifically into word alignment. It proposes extending a co-occurrence-based word-to-word alignment algorithm with verb form classification, with positive results. Likewise, it empirically evaluates the impact on alignment and translation quality of morphological tagging, lemmatisation, verb form classification and truncation (stemming) of the parallel text.

As for the translation model, the thesis proposes handling verb forms by means of an additional instantiation model, with experiments carried out in the English→Spanish direction. It also introduces a language model of target-side morphological tags in order to address agreement problems. Finally, it studies the impact of morphological derivation on the formulation of n-gram-based stochastic translation, empirically evaluating the possible gain derived from morphology reduction strategies.

Acknowledgements

I want to thank Pepe with all my heart, not only for having helped me carry this thesis forward, but also for always being an exemplary person inside and outside the professional environment, from whom I have learned many values and many admirable attitudes and ways of working. I will always be very grateful to him for everything he has done for me.

Thanks also to my colleagues in the statistical translation group, and especially to Josep Maria Crego, without whose contribution this thesis would not have gone so far, and with whom I have learned to do research seriously.

I also want to thank those who, through their work and perhaps without realising it, have been and always will be a reference of commitment and motivation for me, such as Climent Nadeu, Jaume Padrell, Xavi Pérez (the workmate everyone would wish to have), Lluís Padró and Lluís Màrquez. Thanks also to the many fellow Ph.D. students with whom I have shared very good times, such as Alberto, Ramon, Pere, Ignasi, Marta, Javi or Joel, among many others.

Many thanks also to the great friends who have accompanied me on this journey, not only for putting up with me when it was needed, but for making me see clearly that they would always be there again if necessary (Jordi, Mari, Oriol, Núria, Paqui, Ferran, Pablo, Ana, Lluís, David, etc.).

Finally, I want to dedicate this thesis to my parents and to Aleyda, who with their constant love give me the strength I need to work and live with enthusiasm. This thesis is, in large part, yours too.

Adrià
Barcelona, October 2006

Contents

1 Introduction
  1.1 Machine Translation and the Statistical Approach
    1.1.1 A brief history of MT
    1.1.2 Approaches to MT
    1.1.3 Statistical Machine Translation
  1.2 Motivation
  1.3 Objectives of this Ph.D.
  1.4 Thesis Organisation
  1.5 Research Contributions

2 State of the art
  2.1 Word-based translation models
    2.1.1 IBM translation and alignment models
    2.1.2 Training and decoding tools
  2.2 Phrase-based translation models
    2.2.1 Alignment templates
    2.2.2 Phrase-based SMT
    2.2.3 Training and decoding tools
  2.3 Tuple-based translation model
    2.3.1 Finite-State Transducer implementation
    2.3.2 Other implementations
  2.4 Feature-based models combination
    2.4.1 Minimum-error training
    2.4.2 Re-ranking
  2.5 Statistical Word Alignment
    2.5.1 Evaluating Word Alignment
    2.5.2 Word Alignment approaches
  2.6 Use of linguistic knowledge into SMT
    2.6.1 Other approaches
  2.7 Machine Translation evaluation
    2.7.1 Automatic evaluation metrics
      2.7.1.1 BLEU score
      2.7.1.2 NIST score
      2.7.1.3 mWER
      2.7.1.4 mPER
      2.7.1.5 Other evaluation metrics
    2.7.2 Human evaluation metrics
    2.7.3 International evaluation campaigns

3 The Bilingual N-gram Translation Model
  3.1 Introduction
  3.2 X-grams FST implementation
    3.2.1 Reviewing X-grams for Language Modelling
    3.2.2 Bilingual X-grams for Speech Translation
      3.2.2.1 Training from parallel data
      3.2.2.2 Preliminary experiment
    3.2.3 Tuple definition: from one-to-many to many-to-many
    3.2.4 Monotonicity vs. word reordering
      3.2.4.1 Studying English–Spanish cross patterns
      3.2.4.2 An initial reordering strategy
      3.2.4.3 Morphology-reduced word alignment
    3.2.5 The TALP X-grams translation system
      3.2.5.1 FAME project public demonstration
      3.2.5.2 IWSLT'04 participation
  3.3 N-gram implementation
    3.3.1 Modelling issues
      3.3.1.1 History length
      3.3.1.2 Pruning strategies
      3.3.1.3 Smoothing the bilingual model
    3.3.2 Case study: the Catalan-Spanish task
  3.4 The Tuple as Translation Unit
    3.4.1 Embedded words
    3.4.2 Tuple segmentation
