PANACEA Project Grant Agreement No.: 248064
SEVENTH FRAMEWORK PROGRAMME
THEME 3 Information and communication Technologies

Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies

D5.1 Parallel technology tools and resources

Dissemination Level: Public
Delivery Date: July 16th 2010
Status – Version: Final v1.0
Author(s) and Affiliation: Pavel Pecina (DCU), Antonio Toral (DCU), Gregor Thurmair (LinguaTec), Andy Way (DCU)

Table of contents

1 Introduction
2 Terminology
2.1 Definitions
2.2 Acronyms
2.3 Related documents
3 Task description
3.1 Alignment
3.1.1 Sentential alignment
3.1.2 Sub-sentential alignment
3.2 Bilingual dictionary induction
3.3 Transfer grammar induction
3.3.1 Problem description
3.3.2 Scope and limitations
4 Current state of the art
4.1 Alignment
4.1.1 Sentential alignment
4.1.2 Sub-sentential alignment
4.2 Bilingual dictionary induction
4.3 Transfer grammar induction
4.3.1 Learning structural transfer rules
4.3.2 Learning simple lexical transfer: Extracting transfer entries from corpora
4.3.3 Selecting the best transfer for a given context
5 Existing tools
5.1 Alignment
5.1.1 Sentential alignment
5.1.2 Sub-sentential alignment
5.2 Bilingual dictionary induction
5.3 Transfer grammar induction
5.3.1 Apertium
5.3.2 Required tools
6 Resource description
6.1 Alignment
6.1.1 Types of parallel corpora
6.1.2 Domains and languages
6.2 Bilingual dictionary induction
6.3 Transfer grammar induction
7 Solution path and work plan
7.1 Alignment
7.1.1 Strategy
7.1.2 Setup
7.1.3 Preprocessing
7.1.4 Parallel sentence alignment and extraction from comparable corpora
7.1.5 Sub-sentential alignment
7.1.6 Additional experiments
7.1.7 Testing
7.2 Bilingual dictionary induction
7.2.1 Strategy
7.2.2 Setup
7.2.3 Methodology
7.2.4 Testing
7.3 Transfer grammar induction
7.3.1 Strategy
7.3.2 Setup
7.3.3 Preprocessing
7.3.4 Topic test processing
7.3.5 Grammatical testing
7.3.6 Conceptual context determination
7.3.7 Order of tests, entry packages
7.3.8 Testing
8 Bibliography
A. Tool Documentation Forms
A.1 Hunalign
A.2 Geometric Mapping and Alignment
A.3 Bilingual Sentence Aligner
A.4 Giza++
A.5 Berkeley Aligner
A.6 OpenMaTrEx
A.7 Subtree Aligner

1 Introduction

The main