PANACEA Project Grant Agreement No.: 248064

SEVENTH FRAMEWORK PROGRAMME
THEME 3
Information and communication Technologies

PANACEA Project
Grant Agreement no.: 248064
Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies

D5.1 Parallel technology tools and resources

Dissemination Level: Public
Delivery Date: July 16th 2010
Status – Version: Final v1.0
Author(s) and Affiliation: Pavel Pecina (DCU), Antonio Toral (DCU), Gregor Thurmair (LinguaTec), Andy Way (DCU)

Table of contents

1 Introduction
2 Terminology
  2.1 Definitions
  2.2 Acronyms
  2.3 Related documents
3 Task description
  3.1 Alignment
    3.1.1 Sentential alignment
    3.1.2 Sub-sentential alignment
  3.2 Bilingual dictionary induction
  3.3 Transfer grammar induction
    3.3.1 Problem description
    3.3.2 Scope and limitations
4 Current state of the art
  4.1 Alignment
    4.1.1 Sentential alignment
    4.1.2 Sub-sentential alignment
  4.2 Bilingual dictionary induction
  4.3 Transfer grammar induction
    4.3.1 Learning structural transfer rules
    4.3.2 Learning simple lexical transfer: Extracting transfer entries from corpora
    4.3.3 Selecting the best transfer for a given context
5 Existing tools
  5.1 Alignment
    5.1.1 Sentential alignment
    5.1.2 Sub-sentential alignment
  5.2 Bilingual dictionary induction
  5.3 Transfer grammar induction
    5.3.1 Apertium
    5.3.2 Required tools
6 Resource description
  6.1 Alignment
    6.1.1 Types of parallel corpora
    6.1.2 Domains and languages
  6.2 Bilingual dictionary induction
  6.3 Transfer grammar induction
7 Solution path and work plan
  7.1 Alignment
    7.1.1 Strategy
    7.1.2 Setup
    7.1.3 Preprocessing
    7.1.4 Parallel sentence alignment and extraction from comparable corpora
    7.1.5 Sub-sentential alignment
    7.1.6 Additional experiments
    7.1.7 Testing
  7.2 Bilingual dictionary induction
    7.2.1 Strategy
    7.2.2 Setup
    7.2.3 Methodology
    7.2.4 Testing
  7.3 Transfer grammar induction
    7.3.1 Strategy
    7.3.2 Setup
    7.3.3 Preprocessing
    7.3.4 Topic test processing
    7.3.5 Grammatical testing
    7.3.6 Conceptual context determination
    7.3.7 Order of tests, entry packages
    7.3.8 Testing
8 Bibliography
A. Tool Documentation Forms
  A.1 Hunalign
  A.2 Geometric Mapping and Alignment
  A.3 Bilingual Sentence Aligner
  A.4 Giza++
  A.5 Berkeley Aligner
  A.6 OpenMaTrEx
  A.7 Subtree Aligner

1 Introduction

The main
