16 Alignment

Dekai Wu
The Hong Kong University of Science and Technology

16.1 Introduction
16.2 Definitions and Concepts
    Alignment • Constraints and Correlations • Classes of Algorithms
16.3 Sentence Alignment
    Length-Based Sentence Alignment • Lexical Sentence Alignment • Cognate-Based Sentence Alignment • Multifeature Sentence Alignment • Comments on Sentence Alignment
16.4 Character, Word, and Phrase Alignment
    Monotonic Alignment for Words • Non-Monotonic Alignment for Single-Token Words • Non-Monotonic Alignment for Multitoken Words and Phrases
16.5 Structure and Tree Alignment
    Cost Functions • Algorithms • Strengths and Weaknesses of Structure and Tree Alignment Techniques
16.6 Biparsing and ITG Tree Alignment
    Syntax-Directed Transduction Grammars (or Synchronous CFGs) • Inversion Transduction Grammars • Cost Functions • Algorithms • Grammars for Biparsing • Strengths and Weaknesses of Biparsing and ITG Tree Alignment Techniques
16.7 Conclusion
Acknowledgments
References

16.1 Introduction

In this chapter, we discuss the work done on automatic alignment of parallel texts for various purposes. Fundamentally, an alignment algorithm accepts as input a bitext and produces as output a bisegmentation relation that identifies corresponding segments between the texts. A bitext consists of two texts that are translations∗ of each other. Bitext alignment fundamentally lies at the heart of all data-driven machine translation methods, and the rapid research progress on alignment since 1990 reflects the advent of statistical machine translation (SMT) and example-based machine translation (EBMT) approaches. Yet the importance of alignment extends as well to many other practical applications for translators, bilingual lexicographers, and even ordinary readers.

∗ In a “Terminological note” prefacing his book, Veronis (2000) cites Alan Melby as pointing out that the alternative term parallel text creates an unfortunate and confusing clash with the translation theory and terminology community, who use the same term instead to mean what NLP and computational linguistics researchers typically refer to as non-parallel corpora or comparable corpora—texts in different languages from the same domain, but not necessarily translations of each other.


Automatically learned resources for MT, NLP, and humans. Bitext alignment methods are at the core of many methods for machine learning of language resources to be used by SMT or other NLP applications, as well as by human translators and linguists. The side effects of alignment are often of more interest than the aligned text itself. Alignment algorithms offer the possibility of extracting various sorts of knowledge resources, such as (a) phrasal bilexicons listing word or collocation translations; (b) translation examples at the sentence, constituent, and/or phrase level; or (c) tree-structured translation patterns such as transfer rules, translation frames, or treelets. Such resources constitute a database that may be used by SMT and EBMT systems (Nagao 1984), or they may be taken as training data for further machine learning to mine deeper patterns. Alignment has also been employed to infer sentence bracketing or constituent structure as a side effect.

Biconcordances. Historically, the first application for bitext alignment algorithms was to automate the production of the cross-indexing for bilingual concordances (Warwick and Russell 1990; Karlgren et al. 1994; Church and Hovy 1993). Such concordances are consulted by human translators to find the previous contexts in which a term, idiom, or phrase was translated, thereby helping the translator to maintain consistency with preexisting translations, which is important in government and legal documents. An additional benefit of biconcordances is a large increase in navigation ease in bitexts.

Bitext for readers. Aligned bitexts, in addition to their use in translators’ concordances, are also useful for bilingual readers and language learners.

Linked biconcordances and bilexicons. An aligned bitext can be automatically linked with a bilexicon, providing a more effective interface for lexicographers, corpus annotators, as well as human translators. This was implemented in the BICORD system (Klavans and Tzoukermann 1990).

Translation validation. A word-level bitext alignment system can be used within a translation checking tool that attempts to automatically flag possible errors, in the same way that spelling and style checkers operate (Macklovitch 1994). The alignment system in this case is primed to search for deceptive cognates (faux amis) such as library/librairie in English and French.

A wide variety of techniques now exist, ranging from the most simple (counting characters or words) to the more sophisticated, sometimes involving linguistic data (lexicons) that may or may not have been automatically induced themselves. Some techniques work on precisely translated parallel corpora, while others work on noisy, comparable, or nonparallel corpora. Some techniques make use of apparent morphological features, while others rely on cognates and loan-words; of particular interest is work done on languages that do not have a common writing system. Some techniques align only shallow, flat chunks, while others align compositional, hierarchical structures. The robustness and generality of different techniques have generated much discussion.

Techniques have been developed for aligning segments at various granularities: documents, paragraphs, sentences, constituents, collocations or phrases, words, and characters, as seen in the examples in Figure 16.1. In the following sections, we first discuss the general concepts underlying alignment techniques. Each of the major categories of alignment techniques is considered in turn in subsequent sections: document-structure alignment, sentence alignment, alignment for noisy bitexts, word alignment, constituent and tree alignment, and biparsing alignment.

We attempt to use a consistent notation and conceptual orientation throughout. Particularly when discussing algorithms, we stick to formal data constructs such as “token,” “sequence,” and “segment,” rather than linguistic entities. Many algorithms for alignment are actually quite general and can often be applied at many levels of granularity. Unfortunately, this is often obscured by the use of linguistically loaded terms such as “word,” “phrase,” or “sentence.” For these reasons, some techniques we discuss may appear superficially quite unlike the original works from which our descriptions were derived.

FIGURE 16.1 Alignment examples at various granularities: (a) sentence/paragraph alignment; (b) Turkish–English character/word/phrase alignment (“Kâtip Uyku dan Uyan miş, gösleri mahmur” / “The scribe had woken up from his sleep, with sleepy eyes”).

16.2 Definitions and Concepts

16.2.1 Alignment

The general problem of aligning a parallel text is, more precisely, to find its optimal parallel segmentation or bisegmentation under some set of constraints:

Input. A bitext (e, f). Assume that the vector e contains a sequence of T tokens e0, ..., eT−1 and E is the set {0, 1, 2, ..., T − 1}. Similarly, the vector f contains a sequence of V tokens f0, ..., fV−1,∗ and F is the set {0, 1, 2, ..., V − 1}. When the direction of translation is relevant, f is the input language (foreign) string, and e is the output language (emitted) string.†

∗ We use the following notational conventions: bold letters are vectors, calligraphic letters are sets, and capital (non-bold) letters are constants.
† The e and f convention dates back to early statistical MT work where the input foreign language was French and the output language emitted was English.

FIGURE 16.1 (continued) Alignment examples at various granularities: (c) English–Chinese character/word/phrase alignment of “The authority will be accountable to the financial secretary.”; (d) English–Chinese character/word/phrase tree alignment of the same sentence.

Output. A bisegmentation of the bitext, designated by a set A of bisegments, where each bisegment (p, r) couples an emitted segment p to a foreign segment r and often can instead more conveniently be uniquely identified by its span pair (s, t, u, v). Any emitted segment p has a span span(p) = (s, t) that bounds the passage es..t formed by the subsequence of tokens es, ..., et−1 where 0 ≤ s ≤ t ≤ T. Similarly, any foreign segment r has a span span(r) = (u, v), which bounds the passage fu..v formed by the subsequence of tokens fu, ..., fv−1 where 0 ≤ u ≤ v ≤ V. Conversely, we write p = seg(s, t) and r = seg(u, v) when the span uniquely identifies a segment. Otherwise, for models that permit more than one segment to label the same span, we instead write segs(s, t) or segs(u, v) to denote the set of segments that label the span.

Note that a bisegmentation inherently defines two monolingual segmentations, on both of the monolingual texts e and f.
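To make these definitions concrete, the following minimal sketch (in Python; the representation and names are ours, not from the chapter) stores a bisegmentation as a set of span pairs (s, t, u, v) and reads off the two monolingual segmentations it induces.

from typing import List, Set, Tuple

Bisegment = Tuple[int, int, int, int]   # (s, t, u, v): couples e[s:t] with f[u:v]

def induced_segmentations(A: Set[Bisegment]):
    """Return the monolingual segmentations of e and f defined by the bisegmentation A."""
    e_segments = sorted({(s, t) for (s, t, u, v) in A})
    f_segments = sorted({(u, v) for (s, t, u, v) in A})
    return e_segments, f_segments

# Example: one 1-to-1 bisegment and one 1-to-2 bisegment over token positions.
A = {(0, 1, 0, 1), (1, 2, 1, 3)}
print(induced_segmentations(A))   # ([(0, 1), (1, 2)], [(0, 1), (1, 3)])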

FIGURE 16.1 (continued) Alignment examples at various granularities: (e) English–Chinese character/word/phrase alignment of “Where is the secretary of finance when needed?”; (f) English–Chinese character/word/phrase tree alignment of the same sentence.

16.2.1.1 Monotonic Alignment

The term “alignment” has become something of a misnomer in computational linguistics. Technically, in an alignment the coupled passages must occur in the same order in both texts, that is, with no crossings. Many alignment techniques have their roots in speech recognition applications, where acoustic waveforms need to be aligned to transcriptions that are of course in the same order. In bitext research, the term “alignment” originally described the reasonable approximating assumption that paragraph and sentence translations always preserve the original order. However, “word alignment” was subsequently co-opted to mean coupling of words within sentences, even when word-coupling models do not assume that order is monotonically preserved across translation. Since this permits permutations where the word segments are reordered in translation, properly speaking, such a non-monotonic “alignment” is rather a bisegmentation, which can be seen as a (partial) binary relation from the segments es..t to the segments fu..v. To avoid confusion, we will use the term monotonic alignment or monotone alignment whenever we mean “alignment” in its proper sense.

16.2.1.2 Disjoint (Flat) Alignment

As shown in Figure 16.1a through c and e, a common type of alignment is the simple, flat case of disjoint alignment where all segments are restricted to be disjoint in both languages, meaning that no two segments es..t and es′..t′ participating in the alignment overlap and, likewise, no two segments fu..v and fu′..v′ participating in the alignment overlap. That is, for any two bisegments (s, t, u, v) and (s′, t′, u′, v′), either t ≤ s′ or t′ ≤ s, and either v ≤ u′ or v′ ≤ u. In the simplest case of disjoint alignment where every segment is exactly one token long, i.e., where t − s = 1 and v − u = 1, A can represent a non-total many-to-many map from E to F.
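As a concrete illustration of the disjointness condition just stated, here is a small sketch using the span-pair representation above (the helper names are ours):

def spans_disjoint(a, b):
    """True iff the half-open token intervals a = (x1, y1) and b = (x2, y2) do not overlap."""
    (x1, y1), (x2, y2) = a, b
    return y1 <= x2 or y2 <= x1

def bisegments_disjoint(q1, q2):
    """True iff two bisegments (s, t, u, v) are disjoint on both the e side and the f side."""
    s1, t1, u1, v1 = q1
    s2, t2, u2, v2 = q2
    return spans_disjoint((s1, t1), (s2, t2)) and spans_disjoint((u1, v1), (u2, v2))

# A flat (disjoint) alignment requires this to hold for every pair of distinct bisegments.
assert bisegments_disjoint((0, 1, 0, 1), (1, 2, 1, 3))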

16.2.1.3 Compositional (Hierarchical) Tree Alignment

Another common type of alignment is the more general, hierarchical case of compositional alignment or tree alignment where smaller bisegments may be nested within larger bisegments, as shown in Figure 16.1d and f. The simplest case is to align sentences within paragraphs that are themselves aligned. A more complex case is to couple nested constituents in sentences. More precisely, two segments es..t and es′..t′ participating in the alignment may be either disjoint or nested, i.e., for any two bisegments q = (s, t, u, v) and q′ = (s′, t′, u′, v′), either q and q′ are disjoint in E so that t ≤ s′ or t′ ≤ s, or they are nested so that s ≤ s′ ≤ t′ ≤ t or s′ ≤ s ≤ t ≤ t′; and similarly for F. A compositional alignment forms a tree, where the external leaf nodes are a set of disjoint (bi)segments, and internal nodes are bisegments whose children are nested (bi)segments.

16.2.1.4 Subtokens and Subtokenization

Some models are designed to align tokens primarily at one granularity, yet the tokens can be further broken into even finer pieces for secondary purposes. Tokens, after all, are just segments at a specific level of disjoint flat segmentation that has been designated as primitive with respect to some algorithm. Subtokenization is especially common with compositional and hierarchical alignments, in applications such as paragraph/sentence alignment or tree-structured word alignment.

Intuitively, e′ is a subtokenization of e if it refines the tokens in e into finer-grained subtokens (Guo 1997). More precisely, let the tokens in e and e′ both be defined on some primitive alphabet Σ. Let G(e) be the string generation operation that maps any token es into the string in Σ∗ that the token represents, and assume both e and e′ generate the same string, i.e., G(e0..T) = G(e′0..T′). Then e′ is a subtokenization of e if every token es corresponds to the concatenation of one or more tokens e′s′..t′, i.e., G(es) = G(e′s′..t′).

16.2.1.5 Bijective, Injective, Partial, and Many-to-Many Alignments

When we are speaking with respect to particular monolingual disjoint segmentations of (both sides of) a bitext, a few other concepts are often useful. Where the pair of monolingual segmentations comes from depends on our assumptions. In some situations, we might assume monolingual segmentations that are fixed by monolingual preprocessors such as morphological analyzers or word/phrase segmenters. In other situations, we may instead assume whatever monolingual segmentations result from aligning a bitext. In the trivial case, we can simply assume the two monolingual segmentations that are inherently defined by the alignment’s output bisegmentation. A more useful case, under compositional tree alignment, is to assume the monolingual segmentations imposed by the leaf nodes of the tree.

Whatever assumptions we use to arrive at a pair of monolingual segmentations, then, in a 1-to-1 alignment or bijective alignment, every segment in each text is coupled to exactly one segment in the other text. A bijective alignment is total, meaning that no segment remains uncoupled. In practice, bijective alignments are almost never achievable except at the chapter/section granularity, or perhaps at the paragraph granularity for extremely tight translations. Ordinarily, we aim for a partial alignment, in which some segments remain uncoupled singletons; and/or we aim for a many-to-many alignment, in which segments may be coupled to multiple segments. Another often-useful approximating assumption is that the alignment is a (partial) function from a language-0 position to a language-1 position (but not necessarily vice versa). Such a relation is a many-to-1 or right-unique alignment and is written v = a′(t). Similarly, a 1-to-many or left-unique or injective relation is written t = a(v). A bijective 1-to-1 alignment is injective in both directions, i.e., both left-unique and right-unique. For convenience, we will sometimes freely switch between the set notation A and the function notation a(·) to refer to the same injective alignment.

Unless we specify otherwise, “alignments” are non-monotonic, many-to-many, partial, and non-compositional.

16.2.2 Constraints and Correlations

Every alignment algorithm inputs a bitext and outputs a set of couplings. The techniques are nearly all statistical in nature, due to the need for robustness in the face of imperfect translations. Much of the variation between techniques lies in the other kinds of information—constraints and correlations—that play a role in alignment. Some alignment techniques require one or more of these as inputs, and bring them to bear on the alignment hypothesis space. Others derive or learn such kinds of information as by-products. In some cases the by-products are of sufficient quality as to be taken as outputs in their own right, as mentioned above. Important kinds of information include the following.

Bijectivity constraint
Bijectivity is the assumption that the coupling between passages is 1-to-1 (usually in the sense of partial bijective maps, which allow some passages to remain uncoupled). This assumption is inapplicable at coarser granularities than the sentence level. However, it is sometimes useful for word-level alignment, despite being clearly inaccurate. For example, consider the case where the words within a sentence pair are being aligned (Melamed 1997). If only one word in the language-0 sentence remains uncoupled and similarly for the language-1 sentence, then the bijectivity assumption implies a preference for coupling those two words, by the process of elimination. Such benefits can easily outweigh errors caused by the inaccuracies of the assumption.

Monotonicity constraint
This assumption reduces the problem of coupling bitext passages to a properly monotonic alignment problem: coupled passages occur in the same order in both sides of the bitext.

Segment constraints (disjoint and compositional)
A segment constraint prevents an alignment algorithm from outputting any bisegment that crosses the monolingual segment’s boundaries. That is, if a segment es..t is taken as a constraint, then for any bisegment (s′, t′, u′, v′) that forms part of an alignment, either (s, t) and (s′, t′) are disjoint in e so that t ≤ s′ or t′ ≤ s, or they are nested so that s ≤ s′ ≤ t′ ≤ t or s′ ≤ s ≤ t ≤ t′. Two cases of monolingual segment constraints are very common. A disjoint segmentation lets algorithms for disjoint (flat) alignment focus on a coarser granularity than the raw token level; for example, for sentence alignment it is usual to first break the input texts into sentence segments. On the other hand, a compositional segmentation of one or both of the monolingual texts can be obtained by monolingually parsing the text(s), imposing the resulting nested segments as constraints.

Bisegment constraints (anchor and slack constraints)
An anchor is a known bisegment, i.e., a pair of segments (or boundaries, in the case of zero-length segments) that is a priori known to be coupled, and is a hard constraint. As shown in Figure 16.2, anchor constraints are bisegment constraints that can be thought of as confirmed positions within the matrix that represents the alignment candidate space. The special boundary cases of the bisegments (0, 0, 0, 0) and (T, T, V, V) are the origin and terminus, respectively.

FIGURE 16.2 Anchors.

FIGURE 16.3 Slack bisegments between anchors.

When taken together with the monotonicity constraint, a set of anchors divides the problem into a set of smaller, independent alignment problems. Between any two anchors are passages whose alignment is still undetermined, but whose segment couplings must remain inside the region bounded by the anchors. As shown in Figure 16.3, the correct alignment could take any (monotonic) path through the rectangular region delimited by the anchors. We call the candidate subspace between adjacent anchors a slack bisegment. A set of slack bisegments can be used either as bisegment constraints (slack constraints), as for example described below, or to define features. There are two common cases of anchor/slack bisegment constraints:

1. End constraints. Most techniques make the assumption that the origin and terminus are anchor boundaries. Some techniques also assume a coupling between the first and last passages.
2. Incremental constraints. A previous processing stage may produce an alignment of a larger passage size. If we are willing to commit to the coarser alignment, we obtain a set of anchor boundaries and/or slack bisegments.

The latter kind of anchor occurs in compositional alignment methods based on iterative refinement schemes, which progressively lay down anchors in a series of passes that gradually restrict

FIGURE 16.4 Banding the slack bisegments using variance.

the alignment candidate space. At the outset, candidate couplings may lie anywhere in the global (root) slack bisegment comprising the entire rectangular matrix; eventually, they are restricted to fall within many small local (leaf) slack bisegments. Each slack bisegment between adjacent anchors is a smaller alignment subproblem that can be processed independently. A variation on this scheme is hierarchical iterative refinement, which produces a compositional alignment by performing one pass at each level in a predefined hierarchy of token granularities. The first pass aligns segments at the coarsest token level (commonly, sections or paragraphs); this level is chosen so that the total numbers (T and V) of tokens (commonly, paragraphs or sentences) are small. Committing to the output of this stage yields a bisegmentation that is taken as the slack bisegment constraints for the next pass, which aligns segments at the next finer token level (commonly, paragraphs or sentences). This approach is taken so that the alignment subproblem corresponding to any slack bisegment has small T and V values. The alignment is refined on each pass until the sentence-level granularity is reached; at each pass, the quadratic cost is kept in check by the small number of tokens within each slack bisegment (between adjacent anchors).

Bands
A common heuristic constraint is to narrow the shape of the rectangular slack bisegments into slack bands, as shown in Figure 16.4. We call this banding. Banding relies on the assumption that the correct couplings will not be displaced too far from the average, which is the diagonal between adjacent anchors. Different narrowing heuristics are possible. One method is to model the variance, assuming the displacement for each passage is independently and identically distributed. This means the standard deviation at the midpoint of a T : V bitext is O(√T) for the language-0 axis and O(√V) for the language-1 axis. Kay and Röscheisen (1993) approximate this by using a banding function such that the maximum width of a band at its midpoint is O(√T).∗ Another method, due to Simard and Plamondon (1996), is to prune areas of the slack bisegment that are further from the rectangle’s diagonal than some threshold, where the threshold is proportional to the distance between the anchors. This results in slack bands of the shape in Figure 16.5.

∗ They do not give the precise function.
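As a rough illustration of variance-based banding, the following sketch computes, for each language-0 position in a slack bisegment, the range of language-1 positions inside a band around the diagonal whose half-width grows on the order of √T toward the midpoint. The scaling constant and the particular width profile are our own illustrative choices, not the (unspecified) function of Kay and Röscheisen.

import math

def band_limits(t, T, V, k=3.0):
    """Return the (lo, hi) range of language-1 positions allowed for language-0 position t
    within a T:V slack bisegment, under a diagonal band whose half-width is on the order
    of k * sqrt(T), maximal at the midpoint and zero at the anchors."""
    diagonal = t * V / T                                   # expected language-1 position
    half_width = k * math.sqrt(T) * math.sin(math.pi * t / T)
    lo = max(0, int(math.floor(diagonal - half_width)))
    hi = min(V, int(math.ceil(diagonal + half_width)))
    return lo, hi

# Example: allowable couplings for position 120 in a 255:300 slack bisegment.
print(band_limits(120, 255, 300))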

FIGURE 16.5 Banding the slack bisegments using width thresholds.

Banding has the danger of overlooking correct couplings if there are large differences in the translations. Kay and Röscheisen (1993) report a maximum displacement of 10 sentences from the diagonal, in a bitext of 255:300 sentences. However, most bitext, especially in larger collections, contains significantly more noise. The width of the band can be increased, but at significant computational expense.

Guides
A heuristic constraint we call guiding, due to Dagan et al. (1993), is applicable when a rough guide alignment already exists as the result of some earlier heuristic estimate. The preexisting alignment can be used as a guide to seek a more accurate alignment. Assuming the preexisting alignment is described by the mapping function a0(v), a useful alignment range constraint is to define an allowable deviation from a0(v) in terms of some distance d: a language-0 position t is a possible candidate to couple with a language-1 position v iff

a0(v) − d ≤ t ≤ a0(v) + d (16.1)

This is depicted in Figure 16.6. We denote the set of (s, s + 1, t, t + 1) couplings that meet all alignment range constraints as B.

Alignment range constraints
Monotonicity, anchors, banding, and guiding are all special cases of alignment range constraints. There are many other possible ways to formulate restrictions on the space of allowable alignments. In general, alignment range constraints play an important role. Some alignment techniques are actually computationally infeasible without strong a priori alignment range constraints. Other techniques employ an iterative modification of alignment range constraints, similar to iterative refinement with anchors.

Bilexicon constraints
A bilexicon holds known translation pairs or bilexemes. In general, machine translation models work with phrasal bilexicons that may contain characters, tokens, words, phrases, compounds, multi-word expressions, or collocations—particularly for non-alphabetic languages where the difference is not so clear. The strongest lexeme-coupling constraints come from lexemes (or tokens, including punctuation) that have 1-to-1 deterministic translations. These in effect provide anchors if monotonicity is assumed. However, the rarity of such cases limits their usefulness.

FIGURE 16.6 Guiding based on a previous rough alignment.

A bilexicon usually supplies lexical translations that are 1-to-many or many-to-many. The set of known translations can be used to cut down on the space of candidate couplings. Clearly, whatever pruning method is used must still allow for unknown word translations, since translation is not always word-by-word, and since bilexicons have imperfect coverage (especially in methods that learn a bilexicon as they perform the alignment). A weighted bilexicon stores additional information about the degree of correlation or association in word pairs. This can be handled particularly cleanly in probabilistic alignment techniques. Various other statistical or ad hoc scores can also be employed.

Cognates
Cognates are word pairs with common etymological roots, for example, the bilexemes financed:financier or government:gouvernement in English and French. For alphabetic languages that share the same (or directly mappable) alphabets, it is possible to construct heuristic functions that compare the spelling of two words or passages. In the simplest case, the function returns true or false (a decision function), acting like a low-accuracy, easily constructed bilexicon with low memory requirements. Alternatively, a function that returns a score acts like a weighted bilexicon. In either case, a cognate function can be used either in place of or in addition to an alignment bilexicon.

Segment lengths
The lengths of passages at or above the sentence granularity can be strong features for determining couplings. Most reported experiments indicate that the correlation is relatively strong even for unrelated languages, if the translation is tight. This means that the utility of this feature is probably more dependent on the genre than the language pair. Segment lengths are typically measured in bytes, characters, or simple word tokenizations. Length-based methods are discussed in Section 16.3.1.

Syntactic parses
Using syntactic information from automatic parsers to improve word alignment accuracy is increasingly common, for example, as discussed in Sections 16.5 and 16.6.

Language universals
Outside of the relatively superficial features just discussed, relatively little attempt has been made to bring to bear constraints from theories about language-universal grammatical properties. One exception is discussed in Section 16.6. Language-universal constraints apply to all (or a large class) of language pairs, without making language-specific assumptions, and apply at the sentence and word granularities.

16.2.3 Classes of Algorithms

Broadly speaking, alignment techniques search for an optimal bisegmentation using (1) dynamic programming, (2) greedy best-first, (3) discriminative, or (4) heuristic algorithms for constrained optimization. Although authors vary greatly on notations and descriptions, the majority of alignment algorithms can in fact be formulated as search algorithms that attempt to minimize the total cost of A, the entire set of couplings, subject to some set of constraints:

    Cost(A) = Σ_{(p,r)∈A} Cost(p, r)    (16.2)
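In code, the objective of Equation 16.2 is simply a sum of per-bisegment costs over a candidate bisegmentation; a minimal sketch (the per-bisegment cost function is a placeholder to be supplied by a particular model):

def total_cost(A, bisegment_cost):
    """Equation 16.2: the cost of an alignment A is the sum of the costs of its bisegments.
    `bisegment_cost` is any model-specific function of a span pair (s, t, u, v),
    e.g., a negative log probability under a generative model."""
    return sum(bisegment_cost(q) for q in A)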

Search techniques can be heuristic or exhaustive, and can employ greedy, backtracking, beam, or exhaustive strategies.

16.3 Sentence Alignment

Sentence level techniques generally adopt a restriction to monotonic alignment, and are applicable to larger units as well, such as paragraphs and sections. Taking advantage of this, hierarchical iterative refinement usually works rather well, by first aligning paragraphs, yielding biparagraphs whose internal sentences can subsequently be aligned. Broadly speaking, sentence alignment techniques rely on sentence lengths, on lexical constraints and correlations, and/or on cognates. Other features could no doubt be used, but these approaches appear to perform well enough.

16.3.1 Length-Based Sentence Alignment

The length-based approach examines the lengths of the sentences. It is the most easily implemented technique, and performs nearly as well as lexical techniques for tightly translated corpora such as government transcripts. The overall idea is to use dynamic programming to find a minimum cost (maximum probability) alignment, assuming a simple hidden generative model that emits sentences of varying lengths. Purely length-based techniques do not examine word identities at all, and regard the bitext as nothing more than a sequence of sentences whose lengths are the only observable feature; Figure 16.7 depicts how the alignment algorithm sees the example of Figure 16.1a. The length-based approach to sentence alignment was first introduced by Gale and Church (1991a) and Brown et al. (1991), who describe essentially similar techniques; a more thorough evaluation is found in Gale and Church (1993). This is a first example of the most common successful modeling paradigm for all sorts of alignment tasks: search for an optimal cost alignment that explains the bitext, assuming it was emitted by some underlying generative model of transduction. The generative model is usually stochastic, so the costs being minimized are simply weights representing negative log probabilities.

FIGURE 16.7 Example sentence lengths in an input bitext. Language 1: 10 30 15 12. Language 2: 12 14 15 12 11 2.

In most cases, the generative models are most clearly described as transducers (procedural automata) or equivalent transduction grammars∗ (declarative rule sets), which can be viewed in three ways:

Generation. A transducer or transduction grammar generates a transduction, which is a set of string translation pairs or bistrings, just as an ordinary (monolingual) language grammar generates a language, which is a set of strings. In the bilingual case, a transduction grammar simultaneously generates two strings (e, f) at once, which are translation pairs. The set of all bistrings that can be generated defines a relation between the input and output languages.

Recognition. A transducer or transduction grammar accepts or biparses all bistrings of a transduction, just as a language grammar parses or accepts all strings of a language.

Transduction. A transducer or transduction grammar translates or transduces foreign input strings f to emitted output strings e.

As we shall see, there is a hierarchy of equivalence classes for transductions—just as there is Chomsky’s hierarchy of equivalence classes for languages. Just as in the monolingual case, there is a tradeoff between generative capacity and computational complexity: the more expressive classes of transductions are orders of magnitude more expensive to biparse and train:

    MONOLINGUAL (Chomsky hierarchy)                   BILINGUAL

    regular or finite-state languages                 regular or finite-state transductions
      FSA  O(n^2)                                       FST  O(n^4)
      or CFG that is right regular or left regular      or SDTG (or synchronous CFG) that is
                                                          right regular or left regular

    context-free languages                            inversion transductions
      CFG  O(n^3)                                       ITG  O(n^6)
                                                        or SDTG (or synchronous CFG) that is
                                                          binary or ternary or inverting

                                                      syntax-directed transductions
                                                        SDTG (or synchronous CFG)  O(n^(2n+2))

Fortunately, for length-based sentence alignment, it suffices to model one of the simplest bilingual classes—finite-state transductions—which can be generated by a finite-state transducer or FST as depicted in Figure 16.8, or by an equivalent finite-state transduction grammar or FSTG, written as a transduction grammar that is right regular.† For sentence alignment, the natural tokens that are being aligned are sentences, but for length-based models, there is also a useful subtokenization at the byte, character, or simple word level. The model emits bitext as a monotonic series of bisegments, each bisegment being a short sequence of E emitted sentence tokens in language 0 coupled with a short sequence of F foreign sentence tokens in language 1.∗

∗ Also recently called “synchronous grammars.”
† FSTGs may also be written in left regular form.

FIGURE 16.8 Equivalent stochastic or weighted (a) FST and (b) FSTG notations for a finite-state bisegment generation process: (a) an FST model in transition network notation, with a node for each bisegment type (1-to-0, 0-to-1, 1-to-1, 1-to-2, 2-to-1, and 2-to-2), each with a different bisegment output distribution; (b) an alternative notation for the same model using FSTG transduction rules of the form 1-to-1 → [ ei/fj 1-to-1 ] for all 1-to-1 bisegments of subtoken lengths i, j, with the same pattern for the 1-to-0, 0-to-1, 2-to-1, 1-to-2, and 2-to-2 rules. Note that the node transition probability distributions are often tied to be the same for all node/nonterminal types.

For example, Figure 16.1a shows one sequence of 1-to-1, 1-to-2, and 0-to-1 bisegments that could have generated the bitext length sequences of Figure 16.7.

Formally, we denote a transducer or transduction grammar by G = (N, W0, W1, R, S), where N is a finite set of nodes (states) or nonterminals, W0 is a finite set of words (terminals) of language 0, W1 is a finite set of words (terminals) of language 1, R is a finite set of transduction rules (which can be written as either transition rules or rewrite rules), and S ∈ N is the start node (state) or nonterminal. The space of bisegments (terminal-pairs) X = (W0 ∪ {ε}) × (W1 ∪ {ε}) contains lexical translations denoted x/y and singletons denoted x/ε or ε/y, where x ∈ W0 and y ∈ W1. Each node (state) or nonterminal represents one type of bisegment. The usual practice is to allow bisegment types representing at least the following E-to-F configurations: 0-to-1, 1-to-0, 1-to-1, 1-to-2, and 2-to-1. Allowing 2-to-2 bisegments in addition is reasonable. It can be convenient to think of these bisegment types as operations for translating language-0 passages into language-1 passages, respectively: insertion, deletion, substitution, expansion, contraction, and merger. Bisegments with three or more sentences in one language are not generally used, despite the fact that 1-to-3, 2-to-3, 3-to-3, 3-to-2, and 3-to-1 sentence translations are found in bitext once in a while. This is because alignment of such passages using only the sentence length feature is too inaccurate to make the extra parameters and computation worthwhile. Similarly, it generally suffices to ignore the state history, by tying all states so they share the same outgoing transition probability distribution, a simple multinomial distribution over the bisegment types that can be estimated from a small hand-aligned corpus. Alternatively, EM can be used to estimate this distribution simultaneously with the segment length distributions.

∗ Brown et al. (1991) describe this FST as a single-state hidden Markov model (HMM) that emits bisegments (which they call “beads”). However, the type of bisegment must be stochastically selected each time the state is reached, and then its output probabilities must be conditioned on the bisegment type. This effectively splits the single state into separate states for each bisegment type, as described here.

The output distribution for each state (i.e., bisegment type) is modeled as a function of its monolingual segments’ lengths as measured in subtokens (typically bytes, characters, or word subtokens). Given the bisegment type, the length l0 = t − s of its emitted segment es..t is determined. The length of each sentence token in the emitted segment is assumed to be independent, and to follow some distribution. This distribution is implicitly Poisson in the case of Gale and Church (1991b). In Brown et al. (1991), relative frequencies are used to estimate probabilities for short sentence lengths (up to approximately 80 simple English and French words), and the distribution for longer sentences is fit to the tail of a Poisson. Since the empirical distribution for short sentences is fairly Poisson-like,∗ these variations do not appear to have a significant impact. Finally, the length l1 = v − u of the foreign segment fu..v is determined, by assuming that its difference from the length of the emitted segment follows some distribution, usually a normal distribution. The following difference functions are used by Gale and Church (1991b) and Brown et al. (1991), respectively:

    δ(l0, l1) = (l1 − l0 c) / √(l0 s²)    (16.3)

    δ(l0, l1) = log (l1 / l0)    (16.4)
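A minimal sketch of these two length-difference functions (variable names are ours; c and s² are the corpus-dependent mean and variance parameters of the length model, and the default values below are merely illustrative):

import math

def delta_gale_church(l0, l1, c=1.0, s2=6.8):
    """Equation 16.3: normalized length difference (Gale and Church 1991b).
    c and s2 are corpus-dependent parameters; the defaults here are illustrative."""
    return (l1 - l0 * c) / math.sqrt(l0 * s2)

def delta_brown(l0, l1):
    """Equation 16.4: log ratio of segment lengths (Brown et al. 1991)."""
    return math.log(l1 / l0)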

Aside from normalizing the mean and variance to one’s arbitrary preference, the only decision is whether to take the logarithm of the sentence length difference, which sometimes produces a better fit to the empirical distribution. These assumptions are sufficient to compute the probability of any bisegment. It is convenient to write a candidate bisegment as (bisegment-type, s, u) where its emitted segment begins with the sth passage in e and its foreign segment begins with the uth passage in f (since we assume total disjoint segmentations in both languages, t and v can be inferred). For instance, a candidate 1-to-2 bisegment that hypothesizes coupling e31..32 with f36..38 has the estimated probability P̂(1-to-2, 31, 36).

Given this generative model, the actual alignment algorithm relies on dynamic programming (Bellman 1957) to find the maximum probability alignment. For each bisegment, we take Cost(p, r) to be its negative log probability, and we minimize Equation 16.2 subject to the constraint that the bisegmentation is bijective, and is a total cover of all sentence tokens in both languages. The recurrence has the structure of dynamic time-warping (DTW) models and is based on the fact that the probability of any sequence of bisegments can be computed by multiplying the probability of the last bisegment with the total probability of all the bisegments that precede it. Let the minimum cost (maximum log probability) up to passages (t, v) be δ(t, v) = − log P(e0..t, f0..v) where 0 ≤ t ≤ T and 0 ≤ v ≤ V. The recurrence chooses the best configuration over the possible types of the last bisegment.

1. Initialization.

    δ(0, 0) = 0    (16.5)

2. Recursion.

    δ(t, v) = min {  δ(t, v−1) − log P̂(0-to-1, t, v−1),
                     δ(t−1, v) − log P̂(1-to-0, t−1, v),
                     δ(t−1, v−1) − log P̂(1-to-1, t−1, v−1),
                     δ(t−1, v−2) − log P̂(1-to-2, t−1, v−2),
                     δ(t−2, v−1) − log P̂(2-to-1, t−2, v−1),
                     δ(t−2, v−2) − log P̂(2-to-2, t−2, v−2)  }    (16.6)

Note that the form of the recurrence imposes a set of slope constraints on the time warping.
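A compact sketch of this dynamic program, assuming a caller-supplied function neg_log_prob(bisegment_type, s, u) that returns −log P̂ for a candidate bisegment; the interface and the backpointer bookkeeping are ours, not from the original papers.

# Bisegment types as (number of language-0 sentences, number of language-1 sentences).
BISEGMENT_TYPES = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

def align_by_length(T, V, neg_log_prob):
    """Minimum-cost monotonic sentence alignment (Equations 16.5 and 16.6).
    Returns the list of bisegments (s, t, u, v) along the best path."""
    INF = float("inf")
    delta = [[INF] * (V + 1) for _ in range(T + 1)]
    back = [[None] * (V + 1) for _ in range(T + 1)]
    delta[0][0] = 0.0                                     # initialization (16.5)
    for t in range(T + 1):
        for v in range(V + 1):
            for (de, df) in BISEGMENT_TYPES:              # recursion (16.6)
                s, u = t - de, v - df
                if s < 0 or u < 0 or delta[s][u] == INF:
                    continue
                cost = delta[s][u] + neg_log_prob((de, df), s, u)
                if cost < delta[t][v]:
                    delta[t][v], back[t][v] = cost, (s, u)
    # Trace back from the terminus (T, V) to recover the bisegmentation.
    bisegments, t, v = [], T, V
    while (t, v) != (0, 0):
        s, u = back[t][v]
        bisegments.append((s, t, u, v))
        t, v = s, u
    return list(reversed(bisegments))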

∗ For example, see Figure 16.4 in Brown et al. (1991).

The most significant difference between the methods is that the Gale and Church (1991b) method measures sentence token lengths in terms of number of character subtokens, whereas the Brown et al. (1991) method uses number of simple English/French word subtokens (as determined by European language-specific heuristics relying primarily on whitespace and punctuation separators). Gale and Church (1993) report that using characters instead of words, holding all other factors constant, yields higher accuracy (in their experiment, an error rate of 4.2% for characters as compared to 6.5% for words). The reasons are not immediately obvious, though Gale and Church (1993) use variance measures to argue that there is less uncertainty since the number of characters is larger (117 characters per sentence, as opposed to 17 words). However, these experiments were conducted only on English, French, and German, whose large number of cognates improve the character length correlation. Wu (1994) showed that for a large English and Chinese government transcription bitext, where the cognate effect does not exist, the Gale and Church (1991b) method is somewhat less accurate than for English, French, and German, although it is still effective. Church et al. (1993) show that sentence lengths are well correlated for the English and Japanese AWK manuals, but do not actually align the sentences. We know of no experimental results comparing character and word length methods on non-cognate languages; the lack of such experiments is in part because of the well-known difficulties in deciding word boundaries in languages such as Chinese (Chiang et al. 1992; Lin et al. 1992; Chang and Chen 1993; Lin et al. 1993; Wu and Tseng 1993; Sproat et al. 1994; Wu and Fung 1994).

Bitexts in some language pairs may exhibit highly dissimilar sentence and clause structures, leading to very different groupings than the simple bisegments we have been considering. For example, although the sentence byte length correlations are strong in the English–Chinese government transcriptions used by Wu (1994), Xu and Tan (1996) report that the CNS English–Chinese news articles they use have very different clause and sentence groupings, and therefore suggest generalizing the bisegments to allow many-clause-to-many-clause couplings.

In general, length-based techniques perform well for tightly translated bitexts. However, they are susceptible to misalignment in the case where the bitext contains long stretches of sentences with roughly equal length, as for example in dialogues consisting of very short utterances, or in itemized lists. One possible solution is discussed in the section on lexical techniques.

The basic algorithm is O(TV) in both space and time, that is, approximately quadratic in the number of segments in the bitext. This is prohibitive for reasonably large bitexts. Three basic methods for circumventing this problem are banding, the hierarchical variant of iterative refinement, and thresholding. Banding is often a feasible approach since the true alignment paths in many kinds of bitexts lie close enough to the diagonal as to fall within reasonable bands. However, banding is not favored when the other two approaches can be used, since they make fewer assumptions about the alignment path. Hierarchical iterative refinement appears to work well for tightly translated bitexts such as government transcripts, which are typically organized as one document or section per session. Generally, only a few passes are needed: document/section (optional), paragraph, speaker (optional), and sentence.

Thresholding techniques prune the δ matrix so that some alignment prefixes are abandoned during the dynamic programming loop. Many variants are possible. Relative thresholding is more appropriate than absolute thresholding; the effect is to prune any alignment prefix whose probability is excessively lower than the most probable alignment prefix of the same length. Beam search approaches are similar, but limit the number of “live” alignment prefixes for any given prefix length. It is also possible to define the beam in terms of an upper/lower bounded range (v⁻, v⁺), so that the loop iteration that computes the tth column of δ(t, v) only considers v⁻ ≤ v ≤ v⁺. This approach is similar to banding, except that the center of the band is dynamically adjusted during the dynamic programming loop.

Thresholding can cause large warps (deletions, insertions) to be missed. Chen (1993) suggests a resynchronization method to improve the robustness of relative thresholding schemes against large warps, based on monitoring the size of the “live” prefix set during the dynamic programming loop. If this set reaches a predetermined size, indicating uncertainty as to the correct alignment prefix, the presence of a large warp is hypothesized and the alignment program switches to a resynchronization mode. In this mode, both sides of the bitext are linearly scanned forward from the current point, seeking rare words. When corresponding rare words are found in both sides, a resynchronization point is hypothesized. After collecting a set of possible resynchronization points, the best one is selected by attempting sentence alignment for some significant number of sentences following the candidate resynchronization point, taking the resulting probability as an indication of the goodness of this resynchronization point. Resynchronization can improve error rates significantly when the bitext contains large warps.
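As a sketch of how relative thresholding could be grafted onto such a dynamic programming loop (the pruning margin and the data layout are our own illustrative choices):

def prune_column(delta_column, margin=10.0):
    """Relative thresholding: given a mapping from language-1 position v to the cost
    (negative log probability) of the best alignment prefix ending at that cell, keep
    only the cells within `margin` of the cheapest one. Returns the surviving positions."""
    finite = {v: c for v, c in delta_column.items() if c != float("inf")}
    if not finite:
        return set()
    best = min(finite.values())
    return {v for v, c in finite.items() if c <= best + margin}

# Inside the DP loop, cells of column t that do not survive would simply be skipped
# when extending candidate alignment prefixes to column t + 1.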

16.3.2 Lexical Sentence Alignment

The prototypical lexically based sentence alignment technique was proposed by Kay and Röscheisen (1988) in the first paper to introduce a heuristic iterative refinement solution to the bitext alignment problem.∗ The method employs banding, and has the structure shown in Algorithm 1.†

Algorithm 1 Lexical Sentence Align 1: Initialize the set of anchor (sentence) alignments A 2: repeat 3: Compute the (sentence alignment) candidate space by banding between each adjacent pair of anchors 4: Collect word-similarity statistics from the candidate space 5: Build a bilexicon containing sufficiently similar words 6: Collect sentence-similarity statistics from the candidate space, with respect to the bilexicon 7: Add sufficiently confident candidates to the set of alignments A 8: until no new candidates were added to A 9: return A

Although this algorithm can easily be implemented in a straightforward form, its cost relative to lexicon size and bitext length is prohibitive unless the data structures and loops are optimized. In general, efficient implementations of the length-based methods are easier to build and yield comparable accuracy, so this lexical algorithm is not as commonly used. A notable characteristic of this algorithm is that it constructs a bilexicon as a by-product. Various criteria functions for accepting a candidate bilexeme are possible, rating either the word pair’s degree of correlation or its statistical significance. For the correlation between a candidate bilexeme (w, x), Kay and Röscheisen (1988) employ Dice’s coefficient (van Rijsbergen 1979),

    2 N(w, x) / (N(w) + N(x))    (16.7)

whereas Haruno and Yamazaki (1996) employ mutual information (Cover and Thomas 1991),

    log ( N N(w, x) / (N(w) N(x)) )    (16.8)

For the statistical significance, Kay and Röscheisen (1988) simply use frequency, and Haruno and Yamazaki (1996) use the t-score:

    ( P(w, x) − P(w) P(x) ) / √( P(w, x) / N )    (16.9)
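A small sketch of these three association scores, computed from co-occurrence counts over the candidate space (N is the number of aligned contexts considered; the function names are ours):

import math

def dice(n_wx, n_w, n_x):
    """Equation 16.7: Dice's coefficient for a candidate bilexeme (w, x)."""
    return 2.0 * n_wx / (n_w + n_x)

def mutual_information(n_wx, n_w, n_x, N):
    """Equation 16.8: (pointwise) mutual information."""
    return math.log(N * n_wx / (n_w * n_x))

def t_score(n_wx, n_w, n_x, N):
    """Equation 16.9: t-score of the joint probability against independence."""
    p_wx, p_w, p_x = n_wx / N, n_w / N, n_x / N
    return (p_wx - p_w * p_x) / math.sqrt(p_wx / N)

# A candidate (w, x) is accepted into the bilexicon only if it exceeds thresholds on both
# a correlation score (16.7 or 16.8) and a significance score (frequency or 16.9).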

∗ A simplified version of the method was subsequently applied to a significantly larger corpus by Catizone et al. (1989).
† For the sake of clarifying the common conceptual underpinnings of different alignment techniques, our description uses different terms from Kay and Röscheisen (1988). Roughly, our A is the Sentence Alignment Table, our candidate space is the Alignable Sentence Table, and our bilexicon is the Word Alignment Table.

In either case, a candidate bilexeme must exceed thresholds on both correlation and significance scores to be accepted into the bilexicon.

The bilexicon can be initialized with as many pre-existing entries as desired; this may improve alignment performance significantly, depending on how accurate the initial anchors are. Even when a good pre-existing bilexicon is available, accuracy is improved by continuing to add entries statistically, rather than “freezing” the bilexicon. Haruno and Yamazaki (1996) found that combining an initial seed bilexicon (40,000 entries from a commercial machine-readable translation dictionary) with statistical augmentation significantly outperformed versions of the method that either did not employ the seed bilexicon or froze the bilexicon to the seed bilexicon entries only. Results vary greatly depending on the bitext, but show consistent significant improvement across the board. Precision and recall are consistently in the mid-90% range, up from figures in the 60% range.

With respect to applicability to non-Indo-European languages, the Haruno and Yamazaki (1996) experiments show that the algorithm is usable for Japanese–English bitexts, as long as the Japanese side is presegmented and tagged. Aside from the variations already discussed, two other modified strategies are employed. To improve the discriminativeness of the lexical features, an English–Japanese word coupling is only allowed to contribute to a sentence-similarity score when the English word occurs in only one of the candidate sentences to align to a Japanese sentence. In addition, two sets of thresholds are used for mutual information and t-score, to divide the candidate word couplings into high-confidence and low-confidence classes. The high-confidence couplings are weighted three times more heavily. This strategy attempts to limit damage from the many false translations that may be statistically acquired, but still allows the low-confidence bilexicon entries to influence the later iterations when no more discriminative leverage is available from the high-confidence entries.

It is possible to construct lexical sentence alignment techniques based on underlying generative models that probabilistically emit bisegments, similar to those used in length-based techniques. Chen (1993) describes a formulation still using bisegment types with 0-to-1, 1-to-0, 1-to-1, 1-to-2, and 2-to-1 sentence tokens, but at the same time, a subtokenization level where each bisegment is also viewed as a multiset of bilexemes. Instead of distributions that govern bisegment lengths, we have distributions governing the generation of bilexemes.∗ Word order is ignored, since this would be unlikely to improve accuracy significantly despite greatly worsening the computational complexity. Even with this concession, the cost of alignment would be prohibitive without a number of additional heuristics. The basic method employs the same dynamic programming recurrence as Equation 16.6, with suitable modifications to the probability terms; the quadratic cost would be unacceptable, especially since the number of bilexeme candidates per bisegment candidate is exponential in the bisegment length. Chen employs the relative thresholding heuristic, with resynchronization. The probability parameters, which are essentially the same as for the length-based methods except for the additional bilexeme probabilities, are estimated using the Viterbi approximation to EM.

Like the Kay and Röscheisen (1988) method, this method induces a bilexicon as a side effect. For the method to be able to bootstrap this bilexicon, it is important to provide a seed bilexicon initially. This can be accomplished manually, or by a rough estimation of bilexeme counts over a small manually aligned bitext. The quality of bilexicons induced in this manner has not been investigated, although related EM-based methods have been used for this purpose (see below). The alignment precision of this method on Canadian Hansards bitext may be marginally higher than that of the length-based methods. Chen (1993) reports 0.4% error as compared with 0.6% for the Brown et al. (1991) method (but this includes the effect of resynchronization, which was not employed by Brown et al. (1991)). Alignment recall improves by roughly 0.3%. Since the accuracy and coverage of the induced bilexicon have not been investigated, it is unclear how many distinct bilexemes are actually needed to obtain this level of performance improvement. The running time of this method is “tens of times” slower than that of the length-based methods, despite the many heuristic search approximations.

∗ Similar, in spirit, to IBM word alignment Model 1, discussed elsewhere.

To help reduce the running time requirements, it is possible to use a cheaper method to impose an initial set of constraints on the coupling matrix. Simard and Plamondon (1996) employ a cognate-based method within a framework similar to SIMR (discussed later) to generate a set of anchors, and then apply banding. The word_align method (Dagan et al. 1993) may also be used for lexical sentence alignment, as discussed in Section 16.4.2.

16.3.3 Cognate-Based Sentence Alignment

For some language pairs like English and French, the relatively high proportion of cognates makes it possible to use cognates as the key feature for sentence alignment. Simard et al. (1992), who first proposed using cognates, manually analyzed approximately 100 English–French bisentences from the Canadian Hansards, and found that roughly 21% of the words in bisentences were cognates. In contrast, only 6% of the words in randomly chosen non-translation sentence pairs were cognates. Cognate-based alignment can in fact be seen as a coarse approximation to lexical alignment, where the bilexemes are based on a heuristic operational definition of cognates, rather than an explicitly listed lexicon. The Simard et al. (1992) method’s cognate identification heuristic considers two words w, x to be cognates if

1. w, x are identical punctuation characters;
2. w, x are identical sequences of letters and digits, with at least one digit; or
3. w, x are sequences of letters with the same four-character prefix.

This heuristic is clearly inaccurate, but still discriminates true from false sentence pairs fairly well: it considers 30% of the words in bisentences to be cognates, versus only 9% in non-translation sentence pairs. A sketch of this heuristic appears below. Given such a definition, any of the lexical alignment methods could be used, simply by substituting the heuristic for lexical lookup. The techniques that require probabilities on bilexemes would require that a distribution of suitable form be imposed. Simard et al. (1992) use the dynamic programming method; however, for the cost term, they use a scoring function based on a log-likelihood ratio rather than a generative model. The score of a bisegment containing c cognates and average sentence length n is

    − log ( (P(c | n, t) / P(c | n, ¬t)) · P(bisegment-type) )    (16.10)
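A minimal sketch of the cognate test described above (the minimum-length requirement in the prefix rule is our own assumption):

def are_cognates(w, x):
    """Heuristic cognate test in the spirit of Simard et al. (1992)."""
    if w == x and len(w) == 1 and not w.isalnum():
        return True                                   # identical punctuation characters
    if w == x and w.isalnum() and any(ch.isdigit() for ch in w):
        return True                                   # identical letter/digit sequences with a digit
    if w.isalpha() and x.isalpha() and len(w) >= 4 and len(x) >= 4:
        return w[:4].lower() == x[:4].lower()         # same four-character prefix
    return False

print(are_cognates("financed", "financier"))   # True
print(are_cognates("library", "librairie"))    # True (a deceptive cognate, faux ami)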

The pure cognate-based method does not perform as well as the length-based methods for the Canadian Hansards. In tests on the same bitext sample, Simard et al. (1992) obtain a 2.4% error rate, where the Gale and Church (1991b) method yields a 1.8% error rate. The reason appears to be that even though the mean number of cognates differs greatly between coupled and non-coupled sentence pairs, the variance in the number of cognates is too large to separate these categories cleanly. Many true bisentences share no cognates, while many non-translation sentence pairs share several cognates by accident. Cognate-based methods are inapplicable to language pairs such as English and Chinese, where no alphabetic matching is possible.

16.3.4 Multifeature Sentence Alignment

Higher accuracy can be obtained by combining sentence length, lexical, and/or cognate features in a single model. Since a length-based alignment algorithm requires fewer comparison tests than a lexical alignment algorithm, all other things being equal, it is preferable to employ length-based techniques when comparable accuracy can be obtained. However, as mentioned earlier, length-based techniques can misalign when the bitext contains long stretches of sentences with roughly equal length. In this case, it is possible to augment the length features with others. One method of combining features is by incorporating them into a single generative model. Alternatively, the less expensive length-based feature can be used in a first pass; afterward, the uncertain regions can be realigned using the more expensive lexical or cognate features. Uncertain regions can be identified either by low log probabilities in the dynamic programming table, or by the size of the “live” prefix set during the dynamic programming loop.

The method of Wu (1994) uses a single generative model that combines sentence length with lexical (bi)subtoken features, which in this case were English words and Chinese characters. A very small translation lexicon of subtokens is chosen before alignment; to be effective, each subtoken should occur often in the bitext and should be highly discriminative, i.e., the translation of each chosen subtoken should be as close to deterministic as possible. The total set of chosen subtokens should be small; otherwise, the model degenerates into full lexical alignment with all the computational cost. The model assumes that all subtokens in a bisegment, except for those that belong to the set of chosen subtokens, are generated according to the same length distribution as in the basic length-based model. The dynamic programming algorithm is generalized to find the maximum probability alignment under this hybrid model. Significant performance improvement can typically be obtained with this method, though the degree is highly dependent upon the bitext and the bitoken vocabulary that is chosen.

Simard et al. (1992) combine sentence length with cognate features using a two-pass approach. The first pass is the Gale and Church (1991b) method, which yields a 1.8% error rate. The second pass employs the cognate-based method, improving the error rate to 1.6%.

16.3.5 Comments on Sentence Alignment

Sentence alignment can be performed fairly accurately regardless of the corpus length and method used. Perhaps surprisingly, this holds even for lexicon-learning alignment methods, apparently because the decreased lexicon accuracy is offset by the increased constraints on alignable sentences. The advantages of the length-based method are its ease of implementation and speed. Although it is sometimes argued that the asymptotic time complexity of efficiently implemented lexical methods is the same as that of length-based methods, even then the constant factor for the lexical methods would be significantly costlier. The advantages of the lexical methods are greater robustness, and the side effect of automatically extracting a lexicon. Multifeature methods appear to offer the best combination of running time, space, accuracy, and ease of implementation for aligning sentences in tightly translated bitexts.

16.4 Character, Word, and Phrase Alignment

We now consider the case where the vectors e and f denote sentences that are sequences of lexical units or lexemes—character or word/phrase tokens—to be aligned (rather than documents, sections, or paragraphs that are sequences of sentence tokens, as in the preceding section).

The first new difficulty is how to define the input "word" tokens. Compared to paragraph or sentence tokens, it is significantly more slippery to define what a word token is. The frequently used, but simplistic, tokenization assumption that whitespace and punctuation separate "words" does not really hold even for Western European languages—as for example, in the multi-"word" words put off, by the way, inside out, thank you, roller coaster, break a leg, and so on. For many other major languages, the simplifying assumption is even less workable. In Chinese, for example, which has tens of thousands of unique characters and is written without any spaces, any attempt to heuristically guess where "word" boundaries might lie is inherently highly error-prone. Premature commitment to artificial "word" boundaries significantly impacts alignment and translation accuracy, since the tokens may have been wrongly segmented to begin with. For this reason, a conservative approach is to avoid a priori word tokenization altogether, and assume character tokens instead.

Many simpler word alignment models in fact perform only token alignment, where all bisegments are one token or zero tokens long, in both languages. Other models tackle the more realistic general case of multitoken alignment, where the lexemes being aligned may be words/phrases spanning multiple tokens. For convenience, we will often use "word" and "lexeme" to refer generally to multitoken lexical units that may be characters, phrases, compounds, multi-word expressions, and/or collocations.

16.4.1 Monotonic Alignment for Words For languages that share very similar constituent ordering properties, or simply to obtain a very rough word alignment as a bootstrapping step toward more sophisticated word alignments, it is possible to apply many of the techniques discussed for sentence alignment and noisy bitext alignment. For such applications, we may make the simplifying assumption that word order is more or less monotonically preserved across translation. The same dynamic programming algorithms for monotonic alignment of Section 16.3.1 can be used, for example, to perform multitoken alignment yielding bisegments that couple 0-to-1, 1-to-0, 1-to-1, 1-to-2, or 2-to-1 lexeme tokens. Unless the languages are extremely similar, however, length-based features are of low accuracy. In general, useful features will be the lexical features, and for some language pairs also the cognate features.
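A minimal sketch of such a monotonic dynamic programming alignment follows. It assumes a user-supplied bisegment_cost function (for example, a negative log bilexeme probability plus insertion/deletion penalties); the allowed bisegment shapes are the 0-to-1, 1-to-0, 1-to-1, 1-to-2, and 2-to-1 couplings mentioned above, and all names are illustrative.

```python
def monotonic_align(e, f, bisegment_cost):
    """Monotonic multitoken alignment by dynamic programming.

    e and f are token sequences; bisegment_cost(e_seg, f_seg) returns the cost
    of coupling two (possibly empty) token tuples.  Returns the total cost and
    the list of couplings, in order.
    """
    INF = float("inf")
    shapes = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1)]   # allowed bisegment sizes
    T, V = len(e), len(f)
    cost = [[INF] * (V + 1) for _ in range(T + 1)]
    back = [[None] * (V + 1) for _ in range(T + 1)]
    cost[0][0] = 0.0
    for t in range(T + 1):
        for v in range(V + 1):
            if cost[t][v] == INF:
                continue
            for dt, dv in shapes:
                t2, v2 = t + dt, v + dv
                if t2 > T or v2 > V:
                    continue
                c = cost[t][v] + bisegment_cost(tuple(e[t:t2]), tuple(f[v:v2]))
                if c < cost[t2][v2]:
                    cost[t2][v2] = c
                    back[t2][v2] = (t, v)
    # Trace back the optimal monotonic bisegmentation.
    couplings, t, v = [], T, V
    while (t, v) != (0, 0):
        pt, pv = back[t][v]
        couplings.append((tuple(e[pt:t]), tuple(f[pv:v])))
        t, v = pt, pv
    return cost[T][V], couplings[::-1]
```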

16.4.2 Non-Monotonic Alignment for Single-Token Words For true coupling of words (as opposed to simple linear interpolation up to the word granularity), generally speaking, we must drop the monotonicity constraint. Realistically, translating a sentence requires permuting (or reordering) its component lexemes. Motivations to move to a more accurate lexical coupling model include

Alignment accuracy. In principle, lexical coupling without false monotonicity assumptions leads to higher accuracy.

Sparse data. Smaller bitexts should be alignable, with bilexicons automatically extracted in the process. The lexical alignment models discussed above require large bitexts to ensure that the counts for good bilexemes are large enough to stand out in the noisy word-coupling hypothesis space.

Translation modeling. Accurate lexical coupling is necessary to bootstrap learning of structural translation patterns.

Complexity of coupling at the word level rises as a result of relaxing the monotonicity constraint. Partly for this reason, the assumption is usually made that the word coupling cannot be many-to-many. On the other hand, models that permit one-to-many couplings are feasible. The TM-Align system (Macklovitch and Hannan 1996) employs such a model, namely IBM Model 3 (Brown et al. 1988, 1990, 1993). This model incorporates a fertility distribution that governs the probability of the number of language-1 words generated by a language-0 word. Macklovitch and Hannan (1996) report that 68% of the words were correctly aligned. Broken down more precisely, 78% of the content words are correctly aligned, compared to only 57% of the function words.∗

∗ Although the IBM statistical translation models are all based on word alignment models, they themselves have not reported accuracy rates for word alignment.

The word_align method (Dagan et al. 1993) employs a dynamic programming formulation similar to that discussed in Section 16.3.1, but has slope constraints that allow the coupling order to move slightly backward as well as forward. The method does not involve any coarser (section, paragraph, sentence) segmentation of the bitexts. The output of the basic algorithm is a partial word alignment. The method requires as input a set of alignment range constraints, to restrict the search window to a feasible size inside the dynamic programming loop. Dagan et al. (1993) employ char_align as a preprocessing stage to produce a rough alignment a0(·), which is then used to guide and restrict the alignments considered by word_align.

An underlying stochastic channel model is assumed, similar to that of the IBM translation model (Brown et al. 1990, 1993). We assume each language-1 word is generated by a language-0 word. Insertions and deletions are permitted, as in the sentence alignment models. However, we can no longer simplistically∗ assume that bisegments describe both sides of the bitext in linear order, since the monotonicity assumption has been dropped: the language-1 words do not necessarily map to language-0 words in order. To model this, a new set of offset probabilities o(k) is introduced. The offset k of a pair of coupled word positions t, v is defined as the deviation of the position t of the language-0 word from where it "should have been," given where v is:

$$ k(t, v) = t - a'(v) \qquad (16.11) $$

where a′(v) is where the language-0 word "should have been." We determine this by linear extrapolation along the slope of the diagonal, from the language-0 position a(v⁻) corresponding to the previous language-1 word:

$$ a'(v) = a(v^-) + (v - v^-)\,\frac{T}{V} \qquad (16.12) $$

where v⁻ is the position of the most recent language-1 word to have been coupled to a language-0 word (as opposed to being an insertion). Given a language-0 string w, the channel model generates the language-1 translation x with the probability

$$ P(x \mid w) = K \sum_{a} \prod_{v=0}^{V-1} P(x_v \mid w_{a(v)}) \cdot o\bigl(a(v) - a'(v)\bigr) \qquad (16.13) $$

In theory this is computed over all possible alignments a, but we approximate by considering only those within the alignment range constraints:†

$$ P(x \mid w) = K \sum_{A \subset B} \prod_{(t,v) \in A} P(x_v \mid w_t) \cdot o\bigl(t - a'(v)\bigr) \qquad (16.14) $$

Assuming we have estimated distributions P̂(x|w) and ô(k), the alignment algorithm searches for the most probable alignment A∗:

$$ A^* = \arg\max_{A \subset B} \prod_{(t,v) \in A} \hat{P}(x_v \mid w_t) \cdot \hat{o}\bigl(t - a'(v)\bigr) \qquad (16.15) $$

The dynamic programming search is similar to the sentence alignment cases, except on the word granularity and with different slope constraints. It considers all values of v proceeding left-to-right through the words of the language-1 side, maintaining a hypothesis for each possible corresponding value of t. The recurrence is as follows. Let δ(t, v) denote the minimum cost partial alignment up to the vth word of the language-1 side, such that the vth word is coupled to the tth position of the language-0 side.

∗ There are no strings of "beads" as in Brown et al. (1991).
† Remember that A and a refer to the same alignment.

$$ \delta(t, v) = \min \begin{cases} \displaystyle\min_{t^-:\,(t^-,\,v-1)\,\in\,B \;\wedge\; t^- - d \,\le\, t \,\le\, t^- + d} \delta(t^-, v-1) - \log \hat{P}(x_v \mid w_t) - \log \hat{o}\bigl(t - a'(v)\bigr) \\ \delta(t, v-1) - \log \hat{P}(x_v \mid \epsilon) \\ \delta(t-1, v) - \log \hat{P}(\epsilon \mid w_t) \end{cases} \qquad (16.16) $$

$$ \phantom{\delta(t, v)} = \min \begin{cases} \displaystyle\min_{t^-:\,(t^-,\,v-1)\,\in\,B \;\wedge\; t^- - d \,\le\, t \,\le\, t^- + d} \delta(t^-, v-1) - \log \hat{P}(x_v \mid w_t) - \log \hat{o}\Bigl(t - \bigl(t^- + (v - v^-)\tfrac{T}{V}\bigr)\Bigr) \\ \delta(t, v-1) - \log \hat{P}(x_v \mid \epsilon) \\ \delta(t-1, v) - \log \hat{P}(\epsilon \mid w_t) \end{cases} \qquad (16.17) $$

The t⁻ values that need to be checked are restricted to those that meet the alignment range constraints and lie within d of t. In practice, the probabilities of word insertion and deletion, P̂(x_v|ε) and P̂(ε|w_t), can be approximated with small flooring constants. The offset and bilexeme probability estimates, ô(k) and P̂(x|w), are estimated using EM on the bitext before aligning it. The expectations can be accumulated with a dynamic programming loop of the same structure as that for alignment. An approximation that Dagan et al. (1993) make during the EM training is to use the initial rough alignment a0(v) instead of a′(v).

An experiment performed by Dagan et al. (1993) on a small 160,000-word excerpt of the Canadian Hansards produced an alignment in which about 55% of the words were correctly aligned, 73% were within one word of the correct position, and 84% were within three words of the correct position. Quantitative performance of word_align on non-alphabetic languages has not been extensively evaluated, but Church et al. (1993) align the English and Japanese AWK manuals using the technique. The Japanese text must first be segmented into words, which can lead to an incorrectly segmented text.

It is possible to use word_align as a lexical sentence alignment scheme, simply by post-processing its output down to the coarser sentence granularity. The output set of word couplings can be interpreted as a set of candidate anchors, of higher credibility than the usual set of possible word couplings. Using dynamic programming again, the post-processor may choose the set of sentence bisegments that maximizes the (probabilistically weighted) coverage of the candidate anchor set. However, it is not clear that the additional word-order model would significantly alter performance over a model such as that of Chen (1993).

Many single-token word alignment methods, particularly the IBM models [see Section 17.7.1], are inherently asymmetric, leading to artifacts in the alignments that can degrade translation accuracy. To compensate, symmetrization methods are commonly used to improve the alignments (Och and Ney 2003). IBM models are first trained in both directions, giving two different token alignments Aef and Afe for any sentence pair. A family of symmetrization heuristics can then be applied, for example,

• Intersection: A = Aef ∩ Afe
• Union: A = Aef ∪ Afe
• Grow: Start with the intersection. Iteratively extend A by adding token couplings (t, v) that are found in only one of Aef or Afe, provided that neither et nor fv has a coupling in A, or the coupling (t, v) has a horizontal neighbor (t − 1, v) or (t + 1, v) or a vertical neighbor (t, v − 1) or (t, v + 1) in A.
• Grow-diag: Same as grow, but allow diagonal neighbors as well (Koehn et al. 2003).
• Grow-diag-final-and: Same as grow-diag, but as a final step also add non-neighbor token couplings of words that otherwise meet the requirements (Koehn et al. 2003).
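The sketch below follows the grow family of heuristics as just described, starting from the intersection and adding couplings from the union. The exact acceptance condition varies across implementations, and the final-and step is omitted, so this is only an approximation of the heuristics of Och and Ney (2003) and Koehn et al. (2003); all names are illustrative.

```python
def neighbors(t, v, diagonal=False):
    """Horizontal/vertical (and optionally diagonal) neighbors of a coupling (t, v)."""
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonal:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    return [(t + dt, v + dv) for dt, dv in offsets]

def grow(a_ef, a_fe, diagonal=False):
    """Symmetrize two directional token alignments (sets of (t, v) couplings)."""
    a = set(a_ef) & set(a_fe)                 # intersection: high precision
    union_only = (set(a_ef) | set(a_fe)) - a  # union-only points: high recall
    changed = True
    while changed:
        changed = False
        for (t, v) in sorted(union_only - a):
            e_covered = any(t == t2 for (t2, _) in a)
            f_covered = any(v == v2 for (_, v2) in a)
            has_neighbor = any(n in a for n in neighbors(t, v, diagonal))
            # Add the coupling if neither token is covered yet, or if it is
            # adjacent to a coupling already accepted.
            if (not e_covered and not f_covered) or has_neighbor:
                a.add((t, v))
                changed = True
    return a

# grow(a_ef, a_fe) approximates "grow"; grow(a_ef, a_fe, diagonal=True) "grow-diag".
```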

A simple variant of the ITG model of Wu (1997) restricted to 1-to-1 token alignment can also be trained via EM (Wu 1995d) to produce highly accurate single-token bisegmentations (Section 16.6).

[.... graphical examples including a complete m-to-n coupling plus various sparse and medium-density couplings, all attempting to describe exactly the same coupling of an m-token lexeme to an n-token lexeme ....]

FIGURE 16.9 Multitoken lexemes of length m and n must be coupled in awkward ways if constrained to using only 1-to-1 single-token bisegments.

16.4.3 Non-Monotonic Alignment for Multitoken Words and Phrases

The techniques of the previous section make the simplifying assumption that a word/lexeme is a single token, i.e., that bisegments couple only single tokens. When confronted with the more realistic case where an m-token lexeme needs to be coupled to an n-token lexeme, such models must resort to using some subset of the m × n possible 1-to-1 single-token couplings, as for example in Figure 16.9. Ultimately, this becomes problematic, because it forces models to make artificial, meaningless distinctions between the many arbitrary variations like those exemplified in Figure 16.9. By failing to capture straightforward sequence patterns, the 1-to-1 single-token simplification wastes probability mass on unnecessary alignment ambiguities, and creates needless computational complexity. These problems do not arise in word alignment with m-to-n bisegments that couple multiple tokens (such as we also used for aligning sentences). Because this impacts translation quality significantly, alignment is increasingly seen in terms of finding segmentations of multitoken words/lexemes to align. This is commonly emphasized by terms like "multi-word expression," "phrase," and "phrase alignment."

The key to multitoken alignment is to integrate the segmentation decisions into the alignment cost optimization, since the optimal coupling of segments depends on how e and f are segmented, and vice versa. IBM Models 4 and 5 support a limited asymmetric form of 1-to-n bisegments (Brown et al. 1993), but still do not cope well with multitoken words/lexemes since they cannot occur on the e side. In contrast, the general ITG model introduced by Wu (1997) fully integrates alignment and segmentation, producing multitoken m-to-n word/phrase alignments subject to compositionally structured constraints on lexeme and segment permutations, using an underlying generative model based on a stochastic transduction grammar (Section 16.6).

Generally speaking, multitoken alignment is best performed with respect to some bilexicon∗ that enumerates possible translations for multitoken lexemes—i.e., a list of all legal bisegment types. Thus, there are two steps to multitoken alignment: (1) induce a bilexicon, and (2) compute a minimum-cost bisegmentation of the bitext using the induced bilexicon.

In practice, very simple methods for inducing a (phrasal) bilexicon perform surprisingly well, if a good 1-to-1 token alignment is available to bootstrap from. A commonly used method is simply to take as a bilexeme every possible bisegment that is consistent with the 1-to-1 token alignment seen in training sentence pairs, up to some maximum token length. Bilexeme probabilities are estimated simply using relative frequency, and (optionally) very low-frequency bilexemes may be discarded. Although the accuracy of multitoken alignment was not directly evaluated, experimental results from Koehn et al. (2003) indicate that when evaluated on translation accuracy, this approach outperforms more complicated methods that filter out bisegments violating syntactic constituent boundaries. On the other hand, the quality of the induced bilexicon has proven to be fairly sensitive to the quality of the initial 1-to-1 token alignment. The method of Koehn et al. (2003) employs the intersection of IBM alignments trained via EM in both directions, supplemented with the grow-diag-final-and family of heuristics discussed in Section 16.4.2.
However, comparative experiments by Saers and Wu (2009) indicate that higher accuracy is obtained by using ITG alignments trained via EM and restricted to 1-to-1 token alignment.
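Under these assumptions, a minimal sketch of this bilexicon induction recipe might look as follows. The consistency test, length cutoff, and frequency threshold are the commonly used heuristics described above, but all names and default values are illustrative.

```python
from collections import Counter

def extract_bisegments(e, f, alignment, max_len=4):
    """Enumerate bisegments consistent with a 1-to-1 token alignment.

    alignment is a set of (i, j) couplings between tokens of e and f.  A
    bisegment (e[s:t], f[u:v]) is consistent if it contains at least one
    coupling and no coupling links a token inside it to a token outside it.
    """
    pairs = []
    for s in range(len(e)):
        for t in range(s + 1, min(s + max_len, len(e)) + 1):
            for u in range(len(f)):
                for v in range(u + 1, min(u + max_len, len(f)) + 1):
                    touching = [(i, j) for (i, j) in alignment
                                if s <= i < t or u <= j < v]
                    if touching and all(s <= i < t and u <= j < v
                                        for (i, j) in touching):
                        pairs.append((tuple(e[s:t]), tuple(f[u:v])))
    return pairs

def induce_bilexicon(aligned_bitext, min_count=2):
    """Relative-frequency bilexeme probabilities P(f-segment | e-segment)."""
    counts, e_counts = Counter(), Counter()
    for e, f, alignment in aligned_bitext:
        for pe, pf in extract_bisegments(e, f, alignment):
            counts[(pe, pf)] += 1
            e_counts[pe] += 1
    return {pair: n / e_counts[pair[0]]
            for pair, n in counts.items() if n >= min_count}
```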

∗ Often referred to as a phrase table in work on phrase-based SMT.

Given an induced bilexicon, multitoken alignment of any sentence pair can then be obtained by computing a minimum-cost bisegmentation, for example by using the dynamic programming method of the ITG biparsing model (Section 16.6) or greedy or heuristic variants.

16.5 Structure and Tree Alignment

A structure alignment algorithm produces an alignment between constituents (or sentence substructures) within sentence pairs of a bitext. The segments to be aligned are labeled by the nodes in the constituent analyses of the sentences; the output set A contains pairs of labeled segments (p, r) corresponding to the coupled nodes. All existing techniques process the bitext one bisentence at a time, so sentence alignment always precedes structure alignment. Coupled constituents may be useful in translators' concordances, but they are usually sought for use as examples in EBMT systems (Nagao 1984), phrase-based SMT systems (Koehn et al. 2003), and tree-based SMT systems like those of Wu (1997) and Chiang (2005). Alternatively, the examples constitute training data for machine learning of transfer patterns.

Tree alignment is the special case of structure alignment where the output A must be a strictly compositional, hierarchical alignment. (The same constraint is applicable to dependency tree models.) This means that tree alignment obeys the following:

Crossing constraint. Suppose two nodes p0 and p1 in language-0 correspond to two nodes r0 and r1 in language-1, respectively, and p0 dominates p1. Then r0 must dominate r1.

In other words, couplings between subtrees cannot cross one another, unless the subtrees' immediate parent nodes are also coupled to each other. Most of the time this simplifying assumption is accurate, and it greatly reduces the space of legal alignments and thereby the search complexity. An example is shown in Figure 16.10, where both Security Bureau and police station are potential lexical couplings to 公安局, but the crossing constraint rules out the dashed-line couplings because of the solid-line couplings.

[.... graphical example: the English sentence "The Security Bureau granted authority to the police station ." coupled to its Chinese translation, with solid-line couplings and dashed-line candidate couplings to 公安局 ....]

FIGURE 16.10 The crossing constraint.

The crossing constraint reflects an underlying cross-linguistic hypothesis that the core arguments of frames tend to stay together over different languages. A special case of the crossing constraint is that a constituent will not be coupled to two disjoint constituents in the other language, although it may be coupled to multiple levels within a single constituent subtree.

Structure alignment is usually performed using a parse-parse-match strategy. Tree alignment methods require as input the constituent analysis of each side of the bitext (with the exception of the biparsing methods described later). Unfortunately, it is rarely possible to obtain bitext in which the constituent structures of both sides have been marked. However, if suitable monolingual grammars for each of the languages are available, each side can (independently) be parsed automatically, yielding a low-accuracy analysis of each side, before the tree alignment begins. A variant on this is to supply alternative parses for each sentence, either explicitly in a list (Grishman 1994) or implicitly in a well-formed substring table (Kaji et al. 1992). Note that the parsed bitext is not parallel in the sense that corresponding sentences do not necessarily share a parallel constituent structure. It is difficult, if not impossible, to give an interpretation based on some underlying generative model.

The kind of structures that are to be coupled clearly depends on the linguistic theory under which the sides are parsed. The simplest approach is to use surface structure, which can be represented by bracketing each side of the bitext (Sadler and Vendelmans 1990; Kaji et al. 1992). Another approach was described by Matsumoto et al. (1993), who use LFG-like (Bresnan 1982) unification grammars to parse an English–Japanese bitext. For each side, this yields a set of candidate feature-structures corresponding to the constituent structures. The feature-structures are simplified to dependency trees. Structural alignment is then performed on the dependency trees, rather than the original constituent trees. (Alignment of dependency trees can be performed in essentially the same way as alignment of constituent trees.)
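A direct way to test the crossing constraint on a candidate set of couplings is sketched below, representing each constituent by its token span and treating span containment as dominance. This is only an illustrative check, not a published algorithm; in particular, dominance is taken reflexively on the coupled side so that a node coupled to several levels of a single subtree (as allowed above) is not flagged.

```python
def contains(a, b):
    """Reflexive dominance for span-denoted constituents: span a = (start, end) covers b."""
    return a[0] <= b[0] and b[1] <= a[1]

def properly_dominates(a, b):
    """Proper dominance: a covers b and is not identical to it."""
    return contains(a, b) and a != b

def violates_crossing_constraint(couplings):
    """couplings: iterable of (p_span, r_span) pairs of language-0/language-1 spans.

    Crossing constraint: if p0 dominates p1, then r0 must dominate r1 (checked
    in both directions).  Returns True if any pair of couplings breaks this.
    """
    pairs = list(couplings)
    for (p0, r0) in pairs:
        for (p1, r1) in pairs:
            if properly_dominates(p0, p1) and not contains(r0, r1):
                return True
            if properly_dominates(r0, r1) and not contains(p0, p1):
                return True
    return False
```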

16.5.1 Cost Functions Various cost functions may be employed. Some of the qualitative desiderata, along with exemplar cost functions, are as follows.

Couple leaf nodes (words/lexemes) that are lexical translations. This is simply word alignment, recast here as the special case of structure alignment at the leaf level. A bilexicon is assumed. Filtering heuristics may be employed, for example, to ignore any words that are not open-class or content words. The simplest cost function of this type is

$$ \text{Cost}(p, r) = \begin{cases} -1 & \text{if } (w, x) \in \text{Bilexicon} \\ 0 & \text{otherwise} \end{cases} \quad \text{where } w = e_{\text{span}(p)},\; x = f_{\text{span}(r)} \qquad (16.19) $$

where, in other words, the pair of leaf nodes p, r holds the words w and x. Alternatively, if a probabilistic bilexicon is available, a soft version of the cost function may be used, in which the negative log probability of a word pair is taken as its cost (instead of −1).

Couple leaf nodes (words/lexemes) that are similar. To overcome the spotty coverage of most bilexicons, thesauri may be employed. One approach (Matsumoto et al. 1993) employs a single language-0 thesaurus to estimate the cost (dissimilarity) between a word pair w, x. All possible translations of the language-1 word x are looked up in the language-0 thesaurus, and the length l of the shortest path from any one of them to w is taken as the cost. Matsumoto et al. (1993) use an additional biasing heuristic that always subtracts a constant d from the path length to get l (if the path length is less than d, then l is assumed to be zero).

$$ \text{Cost}(p, r) = \begin{cases} -6 & \text{if } (w, x) \in \text{Bilexicon} \\ l & \text{otherwise} \end{cases} \quad \text{where } w = e_{\text{span}(p)},\; x = f_{\text{span}(r)} \qquad (16.20) $$

Couple internal nodes that share coupled leaf nodes (words/lexemes). Let Awords denote the subset of A that deals with coupling of leaf nodes (usually this subset is precomputed in an earlier stage, according to one of the criteria above and possibly with additional constraints). Let Keywords(p) be the set of leaves of p that are deemed important, for example, all the content words. The simplest cost function of this type is

$$ \text{Cost}(p, r) = \begin{cases} -1 & \text{if } (w, x) \in A_{\text{words}} \text{ for all } w \in \text{Keywords}(p),\; x \in \text{Keywords}(r) \\ 0 & \text{otherwise} \end{cases} \qquad (16.21) $$

This cost function is what Kaji et al. (1992) in effect use, permitting nodes to be matched only if they share exactly the same set of coupled content words. A softer approach is to maximize the number of shared coupled leaf nodes:

$$ \text{Cost}(p, r) = \sum_{w \in \text{Keywords}(p),\; x \in \text{Keywords}(r)} \text{Cost}(w, x) \qquad (16.22) $$

Again, probabilistic versions of these cost functions are straightforward, if a probabilistic bilexicon is available. Couple nodes that share as many coupled children/descendants as possible. Similar ad hoc cost functions to those above can be formulated. The idea is to maximize structural isomorphism. Through the recursion, this also attempts to share as many coupled leaf nodes as possible.
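The leaf-level and keyword-sharing cost functions above (Equations 16.19 and 16.22) translate almost directly into code. A small sketch, with illustrative names:

```python
def leaf_cost(w, x, bilexicon):
    """Equation 16.19: -1 if the leaf word pair is a known translation, else 0."""
    return -1 if (w, x) in bilexicon else 0

def shared_keyword_cost(keywords_p, keywords_r, a_words):
    """Soft internal-node cost in the spirit of Equation 16.22: sum leaf-coupling
    costs over all keyword pairs, so node pairs sharing more coupled keywords
    receive a lower (better) cost.  a_words is the set of coupled leaf pairs."""
    return sum(-1 if (w, x) in a_words else 0
               for w in keywords_p for x in keywords_r)
```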

16.5.2 Algorithms

Alignment still optimizes Equation 16.2, but the bisegments may now be either compositional or disjoint. In general, constituent alignment has lower complexity than tree alignment, as there is no need to enforce the crossing constraint.

Kaji et al. (1992) describe a simple bottom-up greedy strategy. Suppose that a bracketed sentence pair contains P language-0 nodes and R language-1 nodes. The algorithm simply considers all P × R possible couplings between node pairs p, r, starting with the smallest spans. Any pair whose cost is less than a preset threshold is accepted. Thus, as many pairs as possible are output, so long as they meet the threshold.

For tree alignment as opposed to constituent alignment, bottom-up greedy search is still possible but performance is likely to suffer due to the interaction of the coupling hypotheses. Crossing constraints deriving from early miscouplings can easily preclude later coupling of large constituents, even when the later coupling would be correct. For this reason, more sophisticated strategies are usually employed.

Matsumoto et al. (1993) employ a branch-and-bound search algorithm. The algorithm proceeds top-down, depth-first. It hypothesizes constituent couplings at the highest level first. For each coupling that is tried, a lower bound on its cost is estimated. The algorithm always backtracks to expand the hypothesis with the lowest expected cost. The hypothesis is expanded by further hypothesizing couplings involving the constituent's immediate children (subconstituents).

Grishman (1994) employs a beam search. Search proceeds bottom-up, hypothesizing couplings of individual words first. Only hypotheses whose cost is less than a preset threshold are retained. Each step in the search loop considers all couplings involving the next larger constituents consistent with the current set of smaller hypotheses. The costs of the larger hypotheses depend on the previously computed costs of the smaller hypotheses.

Dynamic programming search procedures can be constructed. However, the worst case complexity is exponential, since it grows with the number of permutations of constituents at any level. The procedure of Kaji et al. (1992) employs two separate well-formed substring tables as input, one for each sentence of the input sentence pair, that are computed by separate dynamic-programming chart parsing processes. However, the actual coupling procedure is greedy, as described above. A true dynamic programming approach (Wu 1995c) is described later in Section 16.6.4 on biparsing alignment (in which permutation constraints guarantee polynomial time complexity).
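A sketch of the bottom-up greedy strategy of Kaji et al. (1992) described above follows. Node objects carrying a span-length attribute, and the threshold value, are assumptions for illustration.

```python
def greedy_constituent_coupling(nodes_e, nodes_f, cost, threshold=-0.5):
    """Consider all P x R node pairs, smallest spans first, and accept every
    pair whose cost is below a preset threshold.  No crossing constraint is
    enforced, as in constituent (rather than tree) alignment."""
    candidates = [(p, r) for p in nodes_e for r in nodes_f]
    candidates.sort(key=lambda pr: pr[0].size + pr[1].size)   # smallest spans first
    return [(p, r) for (p, r) in candidates if cost(p, r) < threshold]
```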

16.5.3 Strengths and Weaknesses of Structure and Tree Alignment Techniques

The most appropriate application of constituent alignment approaches appears to be for EBMT models. Given the same bitext, constituent alignment approaches can produce many more examples than strict tree alignment approaches (driving up recall), though many of the examples are incorrect (driving down precision). However, the nearest-neighbor tactics of EBMT models have the effect of ignoring incorrect examples most of the time, so recall is more important than precision. Strict tree alignment approaches tend to have higher precision, since the effects of local disambiguation decisions are propagated to the larger contexts, and vice versa. For this reason, these methods appear to be more suitable for machine learning of transfer patterns.

The parse-parse-match approach to constituent alignment has three main weaknesses:

Lack of appropriate, robust, monolingual grammars. This condition is particularly relevant for many low-resource languages. A grammar for this purpose must be robust since it must still identify constituents for the subsequent coupling process even for unanticipated or ill-formed input sentences.

Mismatch of the grammars across languages. The best-matching constituent types between the two languages may not include the same core arguments. While grammatical differences can make this problem unavoidable, there is often a degree of arbitrariness in a grammar's chosen set of syntactic categories, particularly if the grammar is designed to be robust. The mismatch can be exacerbated when the monolingual grammars are designed independently, or under different theoretical considerations. For example, the negative results of Och et al. (2004) were widely interpreted to be due to the mismatch between independent parsers for the different languages.

Inaccurate selection between multiple possible constituent couplings. A constituent in one sentence may have several potential couplings to the other, and the coupling heuristic may be unable to discriminate between the options.

16.6 Biparsing and ITG Tree Alignment

Bilingual parsing approaches use the same grammar to simultaneously parse both sides of a bitext (Wu 1995a). In contrast to the parse-parse-match approaches discussed in Section 16.5.3, biparsing approaches readily admit interpretations based on underlying generative transduction grammar models. Recall from Section 16.3.1 that a single transduction grammar governs the production of sentence pairs that are mutual translations. The most useful biparsers are the probabilistic versions, which work with stochastic transduction grammars that in fact constitute structured bilingual language models.

Biparsing approaches unify many of the concepts discussed above. The simplest version accepts sentence-aligned, unparsed bitext as input. The result of biparsing includes parses for both sides of the bitext, plus the alignment of the constituents.

There are several major variants of biparsing. It is possible to perform biparsing alignment without a linguistic grammar of either language, using the BITG technique discussed below. On the other hand, biparsing can make use of a monolingual grammar if one exists. If any a priori brackets on the input bitext are available, biparsing can accept them as constraints.

16.6.1 Syntax-Directed Transduction Grammars (or Synchronous CFGs)

Aside from the finite-state transductions considered in Section 16.3.1, the other major equivalence class of transductions from classic compiler theory is the class of syntax-directed transductions, which are transductions that can be generated by a syntax-directed transduction grammar or SDTG (Lewis and Stearns 1968; Aho and Ullman 1969a,b; Aho and Ullman 1972), which have also recently been called "synchronous context-free grammars."∗ As with the finite-state models in Section 16.3.1, formally G = (N, W0, W1, R, S), the only difference being that the transduction rules in R are more expressive.

The transduction rules in SDTGs superficially resemble the production rules in CFGs, except for generating terminal symbols on two streams instead of one, and allowing the right-hand-side symbols to be reordered by any permutation for language-1, as for example in A → a0 a1 a2 a3 :: a1 a3 a0 a2 or simply A → a0 a1 a2 a3 :: 1 3 0 2. However, the similarity ends upon closer inspection. Unlike monolingual CFGs, bilingual SDTGs (or synchronous CFGs) do not form a single primary equivalence class. As seen in Section 16.3.1, in the Chomsky (1957) hierarchy, all CFGs form a single equivalence class, independent of the rank, which is the maximum number k of nonterminals in the rules' right-hand-sides. However, in the bilingual case of SDTGs, the hierarchy shatters into an infinite number of classes—for any k ≥ 4, the class of SDTGs of rank k has strictly greater generative capacity than the class of SDTGs of rank k − 1. (The cases of k = 2 and k = 3 are surprisingly different, however, as in the discussion of ITGs below.)

In SDTGs (or synchronous CFGs), segment order variation (for words or phrases) between languages is accommodated by letting the symbols on each rule's right-hand-side appear in different order for language-0 and language-1. Any permutation of the right-hand-side symbols is allowed. Any SDTG trivially implements the crossing constraint. This is because in the generative model, at any time the rewrite process only substitutes for a single constituent, which necessarily corresponds to contiguous spans in (both of) the sentences.

The main issue for alignment models based on transduction grammars is how much flexibility to allow. It is obvious that finite-state transductions are insufficiently expressive at a subsentential granularity, since they do not allow reordering of words and phrases across languages. SDTGs were widely used in rule-based MT, as well as example-based MT (Nagao 1984). However, computational complexity for SDTGs (or synchronous CFGs) is excessively expensive. No polynomial-time biparsing or parameter training algorithm is known for the general case, since the growth in the number of possible alignments follows the number of permutations of right-hand-side symbols, which is exponential—resulting in algorithms that are O(n^{2n+2}), as seen in the hierarchy in Section 16.3.1.

∗ The motivation for the standard term "syntax-directed" stems from the roots of SDTGs in compiler theory, where the input–output connotation of the term reflects a transduction view, but the theory of course remains compatible with the generation or recognition (biparsing) views as well. Also, "transduction" and "translation" are often used interchangeably, as are "grammar" and "schema."

16.6.2 Inversion Transduction Grammars

In current state-of-the-art machine translation models, an intermediate equivalence class of transductions that offers a balance of generative capacity and computational complexity falling in between that of FSTGs and SDTGs (or synchronous CFGs) has become widely used, namely that of inversion transduction grammars or ITGs. Theoretical analyses and numerous empirical results have accumulated indicating a better fit of inversion transductions to modeling translation and alignment between many human language pairs, despite the fact that ITGs do not attempt to fully model so-called free word order languages and "second-order" phenomena such as raising and topicalization (Wu 1995b, 1997). Given that the polynomial-time biparsing and training algorithms for ITGs are much less expensive than for SDTGs—no more than O(n⁶), as seen in the hierarchy in Section 16.3.1—and translation accuracy is at the same time empirically higher using ITGs, it is often worth accepting the restricted expressiveness of ITG constraints upon the space of permitted reordering permutations.

For ITGs the transduction rules in R are more expressive than in finite-state models, but less expressive than in SDTGs (or synchronous CFGs). For example, inversion transductions do not permit the permutation described by the transduction rule example in Section 16.6.1. A simple example of a permutation allowed in inversion transductions is shown in Figure 16.1c and d. However, inversion transductions are surprisingly flexible; even the extreme reordering shown in Figure 16.1e and f falls within the class.

Inversion transductions can be described in at least three equivalent ways, as follows.

ITGs are transduction grammars with only straight or inverted rules. This means the order of right-hand-side symbols for language-1 is either the same as language-0 (straight orientation) or exactly the reverse (inverted orientation). A straight rule is written A → [a0 a1 ... ar−1], and an inverted rule is written A → ⟨a0 a1 ... ar−1⟩, where ai ∈ N ∪ X and r is the rank of the rule.∗

ITGs are the closest bilingual analogue of monolingual CFGs in that any ITG can be converted to a 2-normal form—which is not true for SDTGs (or synchronous CFGs) in general. In particular, a theorem in Wu (1995c) shows that for any inversion transduction grammar G, there exists an equivalent inversion transduction grammar G′ where all rules are either lexical rules or binary rank syntactic rules (analogous to Chomsky normal form for monolingual CFGs), such that every rule takes one of the following forms:

S → ε/ε
A → x/y
A → x/ε
A → ε/y
A → [B C]
A → ⟨B C⟩

where as before x and y are segments of one or more tokens (so the bisegments are typically phrasal), and A, B, and C are any nonterminals. The theorem leads directly to the second characterization of ITGs.

ITGs are transduction grammars with rules of rank ≤ 2. That is, any SDTG whose rules are all binary-branching is an ITG. The equivalence follows trivially from the fact that the only two possible permutations of a rank-2 right-hand-side are straight and inverted. This means that any SDTG (or synchronous CFG) of binary rank—having at most two nonterminals on the right-hand-side of any rule—is an ITG. (Similarly, any SDTG (or synchronous CFG) that is right regular is an FSTG.) Thus, for example, any grammar computed by the binarization algorithm of Zhang et al. (2006) is an ITG. Similarly, any grammar induced following the hierarchical phrase-based translation method, which always yields a binary transduction grammar (Chiang 2005), is an ITG.

ITGs are transduction grammars with rules of rank ≤ 3. It can be shown that all six possible permutations of a rank-3 right-hand-side can be generated using only straight and inverted rules in combination.

∗ This notation is a useful shorthand for the straight rule that can also be written in other forms such as A → a0 a1 ... ar−1 :: a0 a1 ... ar−1 or A → a0 a1 ... ar−1 :: 0 1 ... r−1, and the inverted rule that can also be written in other forms such as A → a0 a1 ... ar−1 :: ar−1 ... a1 a0 or A → a0 a1 ... ar−1 :: r−1 ... 1 0.

[.... diagrams of the 24 permutations of four constituents, each shown with its ITG parse tree where one exists ....]

FIGURE 16.11 The 24 complete alignments of length four, with ITG parses for 22. All nonterminal and terminal labels are omitted. A horizontal bar under a parse tree node indicates an inverted rule.

The expressiveness of ITG constraints on reordering words, phrases, and constituents in different natural languages is not straightforward to characterize formally. Some light is shed by Figure 16.11, which enumerates how ITGs can deal with the transposition of four adjacent constituents. This case is important because the number of core arguments of a frame is normally less than four, in nearly all linguistic theories. There are 4! = 24 possible permutations of four adjacent constituents, of which 22 can be produced by combining straight and inverted rules. The remaining two permutations are highly distorted "inside-out" alignments,∗ which are extremely rare in (correctly translated) bitext. For more than four adjacent constituents, many permutations cannot be generated and do not appear necessary.

∗ In fact, we know of no actual examples in any parallel corpus for languages that do not have free word order.

The ITG hypothesis posits a language universal, namely that the core arguments of frames, which exhibit great ordering variation between languages, are relatively few and surface in syntactic proximity. Of course, this assumption over-simplistically blends syntactic and semantic notions, but the ITG hypothesis has held true empirically to a remarkably large extent. That semantic frames for different languages share common core arguments is more plausible than syntactic frames, but ITGs depend on the tendency of syntactic arguments to correlate closely with semantics. If in particular cases this assumption does not hold, the biparsing algorithm can attempt to contain the damage by dropping some word couplings (as few as possible). For more detailed analyses see Wu (1997) and Saers and Wu (2009).
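The 22-out-of-24 figure can be checked mechanically: a permutation is ITG-generable exactly when it can be recursively split into two blocks that each cover a contiguous range of positions (in the same or reversed order). A small sketch:

```python
from itertools import permutations

def itg_generable(perm):
    """True if the permutation can be produced by composing straight and
    inverted binary ITG rules."""
    def contiguous(seq):
        return max(seq) - min(seq) == len(seq) - 1
    def ok(seq):
        if len(seq) <= 1:
            return True
        return any(contiguous(seq[:i]) and contiguous(seq[i:])
                   and ok(seq[:i]) and ok(seq[i:])
                   for i in range(1, len(seq)))
    return ok(list(perm))

# Of the 4! = 24 permutations of four constituents, exactly 22 are ITG-generable;
# the two exceptions are the inside-out permutations (2,4,1,3) and (3,1,4,2) in
# the 1-4 numbering of Figure 16.11.
assert sum(itg_generable(p) for p in permutations(range(4))) == 22
```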

16.6.3 Cost Functions

A cost function can be derived naturally from the generative model. First a stochastic version of the ITG is created by associating a probability with each rule. Just as for ordinary monolingual parsing, probabilizing the grammar permits ambiguities to be resolved by choosing the maximum likelihood parse. For example, the probability of the rule NN → [A N] is a_{NN→[A N]} = 0.4. The probability of a lexical rule A → x/y is b_A(x, y) = 0.001. Let W0, W1 be the vocabulary sizes of the two languages, and N = {A0, ..., A_{N−1}} be the set of nonterminals with indices 0, ..., N − 1. (For conciseness, we sometimes abuse the notation by writing an index when we mean the corresponding nonterminal symbol, as long as this introduces no confusion.) As with stochastic CFGs, the probabilities for a given left-hand-side symbol must sum to unity:

$$ \sum_{1 \le j,k \le N} \bigl(a_{i\to[jk]} + a_{i\to\langle jk\rangle}\bigr) + \sum_{\substack{1 \le x \le W_0 \\ 1 \le y \le W_1}} b_i(x, y) = 1 $$

Unlike the earlier structure/tree alignment cases, the same constituent structure applies to both sides of the sentence pair. This means the alignment A is more naturally thought of in terms of a single shared parse tree, rather than a set of couplings between constituents. We denote the parse tree by a set Q of nodes {q0, ..., q_{Q−1}}. Note that Q can always be transformed back into a set of couplings A. The natural cost function to be minimized is the entropy (negative log probability) of the parse tree Q:

$$ \text{Cost}(Q) = \sum_{q \in Q} \begin{cases} -\log a_{\text{Rule}(q)} & \text{if } q \text{ is an internal node} \\ -\log b_{\text{Rule}(q)} & \text{otherwise} \end{cases} \qquad (16.23) $$

16.6.4 Algorithms

The biparsing algorithm searches for the minimum cost parse tree on the input sentence pair, and selects the optimal segmentation at the same time. Empirically, this integrated translation-driven segmentation is highly effective at taking advantage of the phrasal terminals of ITGs to avoid alignment errors commonly caused by word segmenters that prematurely commit to inappropriate segmentations during preprocessing. The probabilistic cost function optimizes the overlap between the structural analysis of the two sentences. The algorithm resembles the recognition algorithm for HMMs (Viterbi 1967) and CKY parsing (Kasami 1965; Younger 1967).

Let the input English sentence be e0, ..., eT−1 and the corresponding input Chinese sentence be f0, ..., fV−1. As an abbreviation we write es..t for the sequence of words es, es+1, ..., et−1, and similarly for fu..v; also, es..s = ε is the empty string. Assuming an ITG in 2-normal form, it is convenient to use the 4-tuple (s, t, u, v) to uniquely identify each node q = (s, t, u, v) of the parse tree, where the substrings es..t and fu..v both derive from the node q. Denote the nonterminal label on q by ℓ(q). Then for any node q = (s, t, u, v), define

$$ \delta_q(i) = \delta_{stuv}(i) = \max_{\text{subtrees of } q} P\bigl[\text{subtree of } q,\; \ell(q) = i,\; i \Rightarrow^{*} e_{s..t}/f_{u..v}\bigr] $$

as the maximum probability of any derivation from i that successfully parses both es..t and fu..v. Then the best parse of the sentence pair has probability δ0,T,0,V(S).

The algorithm computes δ0,T,0,V(S) using the following recurrences. Note that optimal segmentation in both languages is integrated into the algorithm: s..t and u..v are spans delimiting segments of one or more tokens; if s = t or u = v, the span length is zero, meaning that the segment is the empty string ε. The argmax notation is generalized to the case where maximization ranges over multiple indices, by making the argument vector-valued. Note that [] and ⟨⟩ are just constants. The condition (S − s)(t − S) + (U − u)(v − U) ≠ 0 is a way to specify that the substring in one but not both languages may be split into an empty string and the substring itself; this ensures that the recursion terminates, but permits words that have no match in the other language to map to an ε instead.

1. Initialization

$$ \delta^{0}_{stuv}(i) = b_i(e_{s..t}/f_{u..v}), \qquad \begin{array}{l} 0 \le s \le t \le T \\ 0 \le u \le v \le V \end{array} \qquad (16.24) $$

2. Recursion

For all i, s, t, u, v such that 1 ≤ i ≤ N, 0 ≤ s ≤ t ≤ T, 0 ≤ u ≤ v ≤ V, and t − s + v − u > 2:

$$ \delta_{stuv}(i) = \max\bigl\{\delta^{[\,]}_{stuv}(i),\; \delta^{\langle\rangle}_{stuv}(i),\; \delta^{0}_{stuv}(i)\bigr\} \qquad (16.25) $$

$$ \theta_{stuv}(i) = \begin{cases} [\,] & \text{if } \delta^{[\,]}_{stuv}(i) > \delta^{\langle\rangle}_{stuv}(i) \text{ and } \delta^{[\,]}_{stuv}(i) > \delta^{0}_{stuv}(i) \\ \langle\rangle & \text{if } \delta^{\langle\rangle}_{stuv}(i) > \delta^{[\,]}_{stuv}(i) \text{ and } \delta^{\langle\rangle}_{stuv}(i) > \delta^{0}_{stuv}(i) \\ 0 & \text{otherwise} \end{cases} \qquad (16.26) $$

where

$$ \delta^{[\,]}_{stuv}(i) = \max_{\substack{1 \le j \le N,\; 1 \le k \le N \\ s \le S \le t,\; u \le U \le v \\ (S-s)(t-S)+(U-u)(v-U) \ne 0}} a_{i\to[jk]}\; \delta_{sSuU}(j)\; \delta_{StUv}(k) \qquad (16.27) $$

$$ \begin{bmatrix} \iota^{[\,]}_{stuv}(i) \\ \kappa^{[\,]}_{stuv}(i) \\ \sigma^{[\,]}_{stuv}(i) \\ \upsilon^{[\,]}_{stuv}(i) \end{bmatrix} = \operatorname*{argmax}_{\substack{1 \le j \le N,\; 1 \le k \le N \\ s \le S \le t,\; u \le U \le v \\ (S-s)(t-S)+(U-u)(v-U) \ne 0}} a_{i\to[jk]}\; \delta_{sSuU}(j)\; \delta_{StUv}(k) \qquad (16.28) $$

$$ \delta^{\langle\rangle}_{stuv}(i) = \max_{\substack{1 \le j \le N,\; 1 \le k \le N \\ s \le S \le t,\; u \le U \le v \\ (S-s)(t-S)+(U-u)(v-U) \ne 0}} a_{i\to\langle jk\rangle}\; \delta_{sSUv}(j)\; \delta_{StuU}(k) \qquad (16.29) $$

$$ \begin{bmatrix} \iota^{\langle\rangle}_{stuv}(i) \\ \kappa^{\langle\rangle}_{stuv}(i) \\ \sigma^{\langle\rangle}_{stuv}(i) \\ \upsilon^{\langle\rangle}_{stuv}(i) \end{bmatrix} = \operatorname*{argmax}_{\substack{1 \le j \le N,\; 1 \le k \le N \\ s \le S \le t,\; u \le U \le v \\ (S-s)(t-S)+(U-u)(v-U) \ne 0}} a_{i\to\langle jk\rangle}\; \delta_{sSUv}(j)\; \delta_{StuU}(k) \qquad (16.30) $$

3. Reconstruction

Initialize by setting the root of the parse tree to q1 = (0, T, 0, V) and its nonterminal label to ℓ(q1) = S. The remaining descendants in the optimal parse tree are then given recursively for any q = (s, t, u, v) by

$$ \text{LEFT}(q) = \begin{cases} \text{NIL} & \text{if } t-s+v-u \le 2 \\ \bigl(s,\; \sigma^{[\,]}_{q}(\ell(q)),\; u,\; \upsilon^{[\,]}_{q}(\ell(q))\bigr) & \text{if } \theta_q(\ell(q)) = [\,] \text{ and } t-s+v-u > 2 \\ \bigl(s,\; \sigma^{\langle\rangle}_{q}(\ell(q)),\; \upsilon^{\langle\rangle}_{q}(\ell(q)),\; v\bigr) & \text{if } \theta_q(\ell(q)) = \langle\rangle \text{ and } t-s+v-u > 2 \\ \text{NIL} & \text{otherwise} \end{cases} \qquad (16.31) $$

$$ \text{RIGHT}(q) = \begin{cases} \text{NIL} & \text{if } t-s+v-u \le 2 \\ \bigl(\sigma^{[\,]}_{q}(\ell(q)),\; t,\; \upsilon^{[\,]}_{q}(\ell(q)),\; v\bigr) & \text{if } \theta_q(\ell(q)) = [\,] \text{ and } t-s+v-u > 2 \\ \bigl(\sigma^{\langle\rangle}_{q}(\ell(q)),\; t,\; u,\; \upsilon^{\langle\rangle}_{q}(\ell(q))\bigr) & \text{if } \theta_q(\ell(q)) = \langle\rangle \text{ and } t-s+v-u > 2 \\ \text{NIL} & \text{otherwise} \end{cases} \qquad (16.32) $$

$$ \ell(\text{LEFT}(q)) = \iota^{\theta_q(\ell(q))}_{q}(\ell(q)) \qquad (16.33) $$

$$ \ell(\text{RIGHT}(q)) = \kappa^{\theta_q(\ell(q))}_{q}(\ell(q)) \qquad (16.34) $$

The time complexity of this algorithm in the general case is Θ(N³T³V³), where N is the number of distinct nonterminals and T and V are the lengths of the two sentences. (Compare this to the case of monolingual parsing, which is faster by a factor of V³.) The complexity is acceptable for corpus analysis that does not require real-time parsing.
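A stripped-down sketch of this chart recursion is shown below for the special case of a single-nonterminal bracketing ITG (Section 16.6.5) with 1-to-1 lexical couplings and singletons only, working in log probabilities. The full algorithm above additionally handles multiple nonterminals, phrasal lexical rules, and tree reconstruction; all parameter names here are illustrative.

```python
from collections import defaultdict

NEG_INF = float("-inf")

def bitg_viterbi(e, f, lex_logprob, log_a_straight, log_a_inverted, log_eps):
    """Viterbi biparsing chart for a single-nonterminal bracketing ITG.

    lex_logprob(x, y) gives log b(x/y) for a 1-to-1 word coupling; log_eps is
    the log probability of the singleton rules A -> x/eps and A -> eps/y;
    log_a_straight and log_a_inverted are the log probabilities of A -> [A A]
    and A -> <A A>.  Returns the best log probability and the backpointers.
    """
    T, V = len(e), len(f)
    delta = defaultdict(lambda: NEG_INF)   # (s, t, u, v) -> best log probability
    back = {}                              # (s, t, u, v) -> (orientation, S, U)

    # Initialization: lexical couplings and singletons (cf. Equation 16.24).
    for s in range(T):
        for u in range(V):
            delta[(s, s + 1, u, u + 1)] = lex_logprob(e[s], f[u])
    for s in range(T):
        for u in range(V + 1):
            delta[(s, s + 1, u, u)] = log_eps      # e[s] unmatched
    for u in range(V):
        for s in range(T + 1):
            delta[(s, s, u, u + 1)] = log_eps      # f[u] unmatched

    # Recursion over increasing total span length (cf. Equations 16.25-16.30).
    for total in range(2, T + V + 1):
        for elen in range(0, min(total, T) + 1):
            flen = total - elen
            if flen > V:
                continue
            for s in range(T - elen + 1):
                t = s + elen
                for u in range(V - flen + 1):
                    v = u + flen
                    best, bp = delta[(s, t, u, v)], None
                    for S in range(s, t + 1):
                        for U in range(u, v + 1):
                            # Skip splits that are degenerate in both languages.
                            if (S - s) * (t - S) + (U - u) * (v - U) == 0:
                                continue
                            straight = (log_a_straight + delta[(s, S, u, U)]
                                        + delta[(S, t, U, v)])
                            if straight > best:
                                best, bp = straight, ("[]", S, U)
                            inverted = (log_a_inverted + delta[(s, S, U, v)]
                                        + delta[(S, t, u, U)])
                            if inverted > best:
                                best, bp = inverted, ("<>", S, U)
                    if bp is not None:
                        delta[(s, t, u, v)] = best
                        back[(s, t, u, v)] = bp
    return delta[(0, T, 0, V)], back
```

The optimal bracketing and word couplings can then be read off the backpointers in the same manner as the reconstruction step above.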

16.6.5 Grammars for Biparsing Biparsing techniques may be used without a specific grammar, with a coarse grammar, or with detailed grammars.

No language-specific grammar (the BITG technique). In the minimal case, no language-specific grammar is used (this is particularly useful when grammars do not exist for both languages). Instead, a generic bracketing inversion transduction grammar or BITG is used:

a_[]:   A → [A A]
a_⟨⟩:   A → ⟨A A⟩
b_ij:   A → ei/fj    for all i, j lexical translation pairs
b_iε:   A → ei/ε     for all i in the language-0 vocabulary
b_εj:   A → ε/fj     for all j in the language-1 vocabulary

This grammar is sufficient to cover the full range of word-order transpositions that can be generated under any ITG (because the 2-normal form implies that rules with rank > 2 are not needed). The unlabeled tree in Figure 16.1f is an example of a BITG biparse tree.

All probability parameters may be estimated by EM (Wu 1995d). However, in practice alignment performance is not very sensitive to the exact probabilities, and rough estimates are adequate. The b_ij distribution can be estimated through simpler EM-based bilexicon learning methods, as discussed later. For the two singleton rules, which permit any word in either sentence to be unmatched, a small ε-constant can be chosen for the probabilities b_iε and b_εj, so that the optimal bracketing resorts to these rules only when it is otherwise impossible to match the singletons. Similarly, the parameters a_[] and a_⟨⟩ can be chosen to be very small relative to the b_ij probabilities of lexical translation pairs. The result is that the maximum-likelihood parser selects the parse tree that best meets the combined lexical translation preferences, as expressed by the b_ij probabilities.

BITG biparsing can be seen as being similar in spirit to word_align, but without positional offsets. The maximum probability word alignment is chosen, but with little or no penalty for crossed couplings, as long as they are consistent with constituent structure (even if the coupled words have large positional offsets). The assumption is that the language-universal core-arguments hypothesis modeled by ITGs is a good constraint on the space of alignments allowed. The word_align bias toward preferring the same word order in both languages can be imitated to a large extent by choosing a_⟨⟩ to be slightly smaller than a_[], thereby preferring the straight orientation.

Empirically, the BITG technique is fairly sensitive to the bilexicon accuracy and coverage. This is due to the fact that a missed word coupling can adversely affect many couplings of its dominating constituents. Also, in practice, additional heuristics are useful to compensate for ambiguities that arise from the underconstrained nature of a BITG. (The extreme case is where both sides of a sentence pair have the same word order; in this case there is no evidence for any bracketing.) A number of heuristics are discussed in Wu (1997).

Coarse grammar. Performance over the BITG technique may often be improved by introducing a small number of rules to capture frequent constituent patterns. This can be done by writing a very small grammar either for one of the languages (Wu 1995d) or for both languages simultaneously. The generic bracketing rules should still be retained, to handle all words that do not fit the constituent patterns. The labeled tree in Figure 16.1f is an example of a biparse tree with a linguistic grammar.

Detailed grammar. A detailed monolingual grammar for one of the languages can be augmented to convert it into an ITG. A simple but effective heuristic for doing this simply mirrors each rule into straight and inverted versions (Wu and Wong 1998). This biases the constituents that will be aligned to fit the selected language, at the expense of degraded parsing of the other language. Again, the labeled tree in Figure 16.1f is an example of a biparse tree with a linguistic grammar.

16.6.6 Strengths and Weaknesses of Biparsing and ITG Tree Alignment Techniques

Biparsing models have a stronger theoretical basis for selecting alignments, as they are clearly formulated with respect to a generative bilingual language model. This permits probabilistic trading-off between the amount of information in the monolingual parses versus the lexical correspondences.

Biparsing techniques may be used with pre-bracketed bitext, just as with parse-parse-match tree alignment techniques, by including the brackets as a priori constraints in the dynamic programming search (Wu 1997). Performance in this case is similar, except that the ordering constraints are slightly stronger with ITGs. Otherwise, the exhaustive dynamic programming search is more reliable than the heuristic search used by tree alignment methods.

The BITG technique can be used to produce a rough bracketing of bitext where no other parsers or grammars are available. It is the only approach under circumstances where no grammar is available for one or both of the languages (thereby precluding the possibility of preparsing the sides of the bitext). Clearly it is the weakest of all the structure/tree/biparsing alignment methods in terms of parsing accuracy. The flip side is that the BITG technique infers new bracketing hypotheses, which can be used for grammar induction. In fact, Zhang and Gildea (2004) show that unsupervised BITG alignments empirically produce significantly better AER scores than a supervised tree-to-string model that depends on the output of a monolingual parser, due to the "impedance mismatch" between the syntactic structure of Chinese and English.

Similarly, if a grammar is available for only one of the languages, a biparsing approach can use just the single grammar, whereas parse-parse-match techniques do not apply. The biparsing technique is also useful if two grammars are available but use very different constituent structures (as mentioned earlier, structure/tree alignment methods may be unable to come up with good constituent couplings under these circumstances).

One issue with (non-BITG) biparsing techniques is that it can be difficult and time-consuming to write a single grammar that parses both sides well. It is relatively easy to parse one side well, by using a grammar that fits one of the languages. However, the more language-specific details in the grammar, the more nonterminals it requires to keep the rules in sync with both languages. For this reason, it is important to retain backup generic rules for robustness.

To avoid the manual construction of transduction grammars, a great deal of recent research has focused on automatic methods for inducing transduction grammars—generally ITGs of one form or

another, and occasionally SDTGs.∗ One popular approach is the hierarchical phrase-based translation method of Chiang (2005), which learns a highly lexicalized ITG in a binary-rank normal form with either straight rules like A → a0 A1 a2 A3 a4 :: a0 A1 a2 A3 a4 (or simply A → [a0 A1 a2 A3 a4]) or inverted rules like A → a0 A1 a2 A3 a4 :: a4 A3 a2 A1 a0 (or simply A → ⟨a0 A1 a2 A3 a4⟩), where ai is any sequence of lexical translations x/y or singletons x/ε or ε/y, and there is only one undifferentiated nonterminal category A (as in BITGs). Various other lexicalized and headed variants of ITGs—including dependency models—have also been shown to improve SMT accuracy (Alshawi et al. 2000; Cherry and Lin 2003; Zhang and Gildea 2005).

Neither biparsing, nor structure/tree alignment techniques that work on surface constituency structures, can accommodate the general case of "second-order transformations" such as raising, topicalization, wh-movement, and gapping. Such transformations can cause the surface constituency structure to lose its isomorphism with the "deep" structure's frames and core arguments. No consistent set of couplings is then possible between the surface tree structures of the two languages. For these phenomena, working on the feature-structure tree instead of the constituency tree may hold more hope.

Nevertheless, a large body of empirical research has shown that inversion transductions appear optimal for the vast majority of statistical MT models (e.g., Lu et al. 2001; Lu et al. 2002; Simard and Langlais 2003; Zhao and Vogel 2003). Zens and Ney (2003) show a significantly higher percentage of word alignments are covered under BITG constraints than under IBM constraints, for German–English (96.5% vs. 91.0%), English–German (96.9% vs. 88.1%), French–English (96.1% vs. 87.1%), and English–French (95.6% vs. 87.6%). Wu et al. (2006) find similar results for Arabic–English (96.2%) and English–Arabic (97.0%). Furthermore, a comparison of IBM versus (Bracketing) ITG constraints on Japanese–English translation by Zens et al. (2004) found significantly higher MT accuracy measured via BLEU, NIST, WER, and PER scores.

The excellent fit of BITG reordering permutations to naturally occurring sentence translations has led to the development of automated MT evaluation metrics that correlate well with human judgment, such as invWER (Leusch et al. 2003), based on interpreting ITG constraints as a compositional block generalization of Levenshtein string edit distance, and its successor, CDER (Leusch et al. 2006). The biparsing algorithms for ITG alignment have moreover been successfully applied to numerous other tasks, including paraphrasing and textual entailment (Wu 2006) and mining parallel sentences from very nonparallel comparable corpora (Wu and Fung 2005).

For languages without explicit word boundaries, particularly Asian languages, the biparsing techniques are particularly well suited. With segmentation integrated into the biparsing algorithm, the word segmentation of the text can be chosen in tandem with choosing the bracketing and couplings, thereby selecting a segmentation that optimally fits the alignment (Wu 1995c).

∗ Recall that any SDTG (or synchronous CFG) that is binary rank, ternary rank, or only allows straight/inverted permutations is an ITG; and any SDTG that is right-regular or left-regular is an FSTG.

16.7 Conclusion

The alignment of matching passages can be performed at various granularities: document-structure, sentence, word, or constituent. A variety of techniques are available for each granularity, with trade-offs in speed and memory requirements, accuracy, robustness to noisy bitext, ease of implementation, and suitability for unrelated and non-alphabetic languages. Sources of leverage include lexical translations, cognates, end constraints, passage lengths, syntactic parses, and assumptions about monotonicity, approximate monotonicity, and word order.

New approaches to alignment in SMT are continually being developed. Recent perspectives on the tremendous amount of research on syntax and tree-structured models can be seen, for example, in the "Syntax and Structure in Statistical Translation" (SSST) series of workshops (Wu and Chiang 2007; Chiang and Wu 2008; Wu and Chiang 2009). Heuristic association-based models are easier to implement than generative models, and can provide reasonable performance (Melamed 2000; Moore 2005a). Recently, discriminative training methods have been applied to word alignment (Callison-Burch et al. 2004; Moore 2005b; Liu et al. 2005; Klein and Taskar 2005; Fraser and Marcu 2007; Lambert et al. 2007; Ma et al. 2009); one potential advantage is to directly train model parameters to maximize some translation accuracy objective function such as BLEU (Papineni et al. 2002), rather than measures such as alignment error rate or AER (Och and Ney 2003). Empirically, AER does not necessarily correlate with translation quality (Ayan and Dorr 2006); merely improving AER can reduce translation accuracy (Liang et al. 2006) and vice versa (Vilar et al. 2006).

Alignment models are the core of all modern SMT and EBMT models. They are also tremendously useful in supporting a broad range of functionality in tools such as translators' and lexicographers' workstations, as is well surveyed in the volume edited by Veronis (2000) and the seminal ARCADE project on evaluating alignment (Veronis and Langlais 2000). Increasingly today, alignment algorithms that are faster, more accurate, and more robust are being used for automatic and semi-automatic resource acquisition, especially the extraction of phrasal translation lexicons as well as compositional and hierarchical tree-structured translation examples.

Acknowledgments

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under GALE Contract No. HR0011-06-C-0023, and by the Hong Kong Research Grants Council (RGC) research grants RGC6083/99E, RGC6256/00E, and DAG03/04.EG09. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.

References

Aho, Alfred V. and Jeffrey D. Ullman (1969a). Properties of syntax-directed translations. Journal of Computer and System Sciences 3(3), 319–334.
Aho, Alfred V. and Jeffrey D. Ullman (1969b). Syntax-directed translations and the pushdown assembler. Journal of Computer and System Sciences 3(1), 37–56.
Aho, Alfred V. and Jeffrey D. Ullman (1972). The Theory of Parsing, Translation, and Compiling (Volumes 1 and 2). Englewood Cliffs, NJ: Prentice-Hall.
Alshawi, Hiyan, Shona Douglas, and Srinivas Bangalore (2000, Mar). Learning dependency translation models as collections of finite-state head transducers. Computational Linguistics 26(1), 45–60.
Ayan, Necip Fazil and Bonnie J. Dorr (2006, Jul). Going beyond AER: An extensive analysis of word alignments and their impact on MT. In 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL’06), Sydney, Australia, pp. 9–16.
Bellman, Richard (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Bresnan, Joan (Ed.) (1982). The Mental Representation of Grammatical Relations. Cambridge, MA: MIT Press.
Brown, Peter F., Jennifer C. Lai, and Robert L. Mercer (1991). Aligning sentences in parallel corpora. In 29th Annual Meeting of the Association for Computational Linguistics (ACL-91), Berkeley, CA, pp. 169–176.
Brown, Peter F., John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, Robert L. Mercer, and Paul S. Roossin (1988, Aug). A statistical approach to language translation. In 12th International Conference on Computational Linguistics (COLING-88), Budapest, Hungary, pp. 71–76.

Brown, Peter F., John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin (1990, Jun). A statistical approach to machine translation. Computational Linguistics 16(2), 79–85.
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer (1993, Jun). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2), 263–311.
Callison-Burch, Chris, David Talbot, and Miles Osborne (2004, Jul). Statistical machine translation with word- and sentence-aligned parallel corpora. In 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, pp. 176–183.
Catizone, Roberta, Graham Russell, and Susan Warwick (1989, Aug). Deriving translation data from bilingual texts. In Uri Zernik (Ed.), First Lexical Acquisition Workshop, Detroit, MI.
Chang, Chao-Huang and Cheng-Der Chen (1993, Jun). HMM-based part-of-speech tagging for Chinese corpora. In Workshop on Very Large Corpora (WVLC), Columbus, OH, pp. 40–47.
Chen, Stanley F. (1993). Aligning sentences in bilingual corpora using lexical information. In 31st Annual Meeting of the Association for Computational Linguistics (ACL-93), Columbus, OH, pp. 9–16.
Cherry, Colin and Dekang Lin (2003, Aug). A probability model to improve word alignment. In 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), Sapporo, Japan, pp. 88–95.
Chiang, David (2005, Jun). A hierarchical phrase-based model for statistical machine translation. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), Ann Arbor, MI, pp. 263–270.
Chiang, David and Dekai Wu (Eds.) (2008, Jun). Proceedings of SSST-2, Second Workshop on Syntax and Structure in Statistical Translation, at ACL-08:HLT. Columbus, OH: Association for Computational Linguistics.
Chiang, Tung-Hui, Jing-Shin Chang, Ming-Yu Lin, and Keh-Yih Su (1992). Statistical models for word segmentation and unknown word resolution. In ROCLING-92, Taipei, Taiwan, pp. 121–146.
Chomsky, Noam (1957). Syntactic Structures. The Hague, the Netherlands/Paris, France: Mouton.
Church, Kenneth Ward and Eduard H. Hovy (1993). Good applications for crummy machine translation. Machine Translation 8, 239–358.
Church, Kenneth Ward, Ido Dagan, William Gale, Pascale Fung, J. Helfman, and B. Satish (1993). Aligning parallel texts: Do methods developed for English-French generalize to Asian languages? In Pacific Asia Conference on Formal and Computational Linguistics, Taipei, Taiwan, pp. 1–12.
Cover, Thomas M. and Joy A. Thomas (1991). Elements of Information Theory. New York: Wiley.
Dagan, Ido, Kenneth Ward Church, and William A. Gale (1993, Jun). Robust bilingual word alignment for machine aided translation. In Proceedings of the Workshop on Very Large Corpora, Columbus, OH, pp. 1–8.
Fraser, Alexander and Daniel Marcu (2007, Sep). Measuring word alignment quality for statistical machine translation. Computational Linguistics 33(3), 293–303.
Gale, William A. and Kenneth Ward Church (1991a). A program for aligning sentences in bilingual corpora. In 29th Annual Meeting of the Association for Computational Linguistics (ACL-91), Berkeley, CA, pp. 177–184.
Gale, William A. and Kenneth Ward Church (1991b, Feb). A program for aligning sentences in bilingual corpora. Technical Report 94, AT&T Bell Laboratories, Statistical Research, Murray Hill, NJ.
Gale, William A. and Kenneth Ward Church (1993, Mar). A program for aligning sentences in bilingual corpora. Computational Linguistics 19(1), 75–102.
Grishman, Ralph (1994, Aug). Iterative alignment of syntactic structures for a bilingual corpus. In Second Annual Workshop on Very Large Corpora (WVLC-2), Kyoto, Japan, pp. 57–68.
Guo, Jin (1997, Dec). Critical tokenization and its properties. Computational Linguistics 23(4), 569–596.

Haruno, Masahiko and Takefumi Yamazaki (1996, Jun). High-performance bilingual text alignment using statistical and dictionary information. In 34th Annual Meeting of the Association for Computational Linguistics (ACL-96), Santa Cruz, CA, pp. 131–138.
Kaji, Hiroyuki, Yuuko Kida, and Yasutsugu Morimoto (1992, Aug). Learning translation templates from bilingual text. In 14th International Conference on Computational Linguistics (COLING-92), Nantes, France, pp. 672–678.
Karlgren, Hans, Jussi Karlgren, Magnus Nordström, Paul Pettersson, and Bengt Wahrolén (1994, Aug). Dilemma–an instant lexicographer. In 15th International Conference on Computational Linguistics (COLING-94), Kyoto, Japan, pp. 82–84.
Kasami, T. (1965). An efficient recognition and syntax analysis algorithm for context-free languages. Technical Report AFCRL-65-758, Air Force Cambridge Research Laboratory, Bedford, MA.
Kay, Martin and M. Röscheisen (1988). Text-translation alignment. Technical Report P90-00143, Xerox Palo Alto Research Center, Palo Alto, CA.
Kay, Martin and M. Röscheisen (1993). Text-translation alignment. Computational Linguistics 19(1), 121–142.
Klavans, Judith and Evelyne Tzoukermann (1990, Aug). The BICORD system. In 13th International Conference on Computational Linguistics (COLING-90), Helsinki, Finland, Volume 3, pp. 174–179.
Klein, Dan and Ben Taskar (2005, Jun). Max-margin methods for NLP: Estimation, structure, and applications. Tutorial at ACL-05, Ann Arbor, MI.
Koehn, Philipp, Franz Josef Och, and Daniel Marcu (2003, May). Statistical phrase-based translation. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL-2003), Companion Volume, Edmonton, Canada, pp. 48–54.
Lambert, Patrik, Rafael E. Banchs, and Josep M. Crego (2007, Apr). Discriminative alignment training without annotated data for machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2007), Rochester, NY, pp. 85–88.
Leusch, Gregor, Nicola Ueffing, and Hermann Ney (2003, Sep). A novel string-to-string distance measure with applications to machine translation evaluation. In Machine Translation Summit IX (MT Summit IX), New Orleans, LA.
Leusch, Gregor, Nicola Ueffing, and Hermann Ney (2006, Apr). CDER: Efficient MT evaluation using block movements. In EACL-2006 (11th Conference of the European Chapter of the Association for Computational Linguistics), Trento, Italy, pp. 241–248.
Lewis, Philip M. and Richard E. Stearns (1968). Syntax-directed transduction. Journal of the Association for Computing Machinery 15(3), 465–488.
Liang, Percy, Ben Taskar, and Dan Klein (2006). Alignment by agreement. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), New York, pp. 104–111.
Lin, Yi-Chung, Tung-Hui Chiang, and Keh-Yih Su (1992). Discrimination oriented probabilistic tagging. In Proceedings of the ROCLING-92, Taipei, Taiwan, pp. 85–96.
Lin, Ming-Yu, Tung-Hui Chiang, and Keh-Yih Su (1993). A preliminary study on unknown word problem in Chinese word segmentation. In Proceedings of the ROCLING-93, Taipei, Taiwan, pp. 119–141.
Liu, Yang, Qun Liu, and Shouxun Lin (2005, Jun). Log-linear models for word alignment. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), Ann Arbor, MI, pp. 459–466.
Lu, Yajuan, Sheng Li, Tiejun Zhao, and Muyun Yang (2002, Aug). Learning Chinese bracketing knowledge based on a bilingual language model. In 19th International Conference on Computational Linguistics (COLING-02), Taipei, Taiwan.

Lu, Yajuan, Ming Zhou, Sheng Li, Changning Huang, and Tiejun Zhao (2001, Feb). Automatic translation template acquisition based on bilingual structure alignment. Computational Linguistics and Chinese Language Processing 6(1), 83–108.
Ma, Yanjun, Patrik Lambert, and Andy Way (2009, Jun). Tuning syntactically enhanced word alignment for statistical machine translation. In 13th Annual Conference of the European Association for Machine Translation (EAMT 2009), Barcelona, Spain, pp. 250–257.
Macklovitch, Elliott (1994, Oct). Using bi-textual alignment for translation validation: The TransCheck system. In First Conference of the Association for Machine Translation in the Americas (AMTA-94), Columbia, MD, pp. 157–168.
Macklovitch, Elliott and Marie-Louise Hannan (1996). Line ’em up: Advances in alignment technology and their impact on translation support tools. In Second Conference of the Association for Machine Translation in the Americas (AMTA-96), Montreal, Canada, pp. 145–156.
Matsumoto, Yuji, Hiroyuki Ishimoto, and Takehito Utsuro (1993, Jun). Structural matching of parallel texts. In 31st Annual Meeting of the Association for Computational Linguistics (ACL-93), Columbus, OH, pp. 23–30.
Melamed, I. Dan (1997, Jul). A word-to-word model of translational equivalence. In 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), Madrid, Spain, pp. 104–111.
Melamed, I. Dan (2000, Jun). Models of translational equivalence among words. Computational Linguistics 26(2), 221–249.
Moore, Robert C. (2005a). Association-based bilingual word alignment. In Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, Ann Arbor, MI, pp. 1–8.
Moore, Robert C. (2005b, Oct). A discriminative framework for bilingual word alignment. In Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005), Vancouver, Canada, pp. 81–88.
Nagao, Makoto (1984). A framework of a mechanical translation between Japanese and English by analogy principle. In Alick Elithorn and Ranan Banerji (Eds.), Artificial and Human Intelligence: Edited Review Papers Presented at the International NATO Symposium on Artificial and Human Intelligence, pp. 173–180. Amsterdam, the Netherlands: North-Holland.
Och, Franz, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, and Dragomir Radev (2004, May). A smorgasbord of features for statistical machine translation. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL-2004), Boston, MA.
Och, Franz Josef and Hermann Ney (2003). A systematic comparison of various statistical alignment models. Computational Linguistics 29(1), 19–52.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002, Jul). BLEU: A method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), Philadelphia, PA, pp. 311–318.
Sadler, Victor and Ronald Vendelmans (1990). Pilot implementation of a bilingual knowledge bank. In 13th International Conference on Computational Linguistics (COLING-90), Helsinki, Finland, Volume 3, pp. 449–451.
Saers, Markus and Dekai Wu (2009). Improving phrase-based translation via word alignments from stochastic inversion transduction grammars. In Proceedings of SSST-3, Third Workshop on Syntax and Structure in Statistical Translation, at NAACL-HLT 2009, Boulder, CO, pp. 28–36.
Simard, Michel, George F. Foster, and Pierre Isabelle (1992). Using cognates to align sentences in bilingual corpora. In Fourth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-92), Montreal, Canada, pp. 67–81.
Simard, Michel and Philippe Langlais (2003, May). Statistical translation alignment with compositionality constraints. In HLT/NAACL-2003 Workshop on Building and Using Parallel Texts, Edmonton, Canada.

Simard, Michel and Pierre Plamondon (1996, Oct). Bilingual sentence alignment: Balancing robustness and accuracy. In Second Conference of the Association for Machine Translation in the Americas (AMTA-96), Montreal, Canada, pp. 135–144.
Sproat, Richard, Chilin Shih, William A. Gale, and Nancy Chang (1994, Jun). A stochastic word segmentation algorithm for a Mandarin text-to-speech system. In 32nd Annual Meeting of the Association for Computational Linguistics (ACL-94), Las Cruces, NM, pp. 66–72.
van Rijsbergen, Cornelis Joost (1979). Information Retrieval (2nd ed.). Newton, MA: Butterworth-Heinemann.
Veronis, Jean (Ed.) (2000). Parallel Text Processing: Alignment and Use of Translation Corpora. Dordrecht, the Netherlands: Kluwer.
Veronis, Jean and Philippe Langlais (2000, Aug). Evaluation of parallel text alignment systems: The ARCADE project. In Jean Veronis (Ed.), Parallel Text Processing: Alignment and Use of Translation Corpora. Dordrecht, the Netherlands: Kluwer. ISBN 0-7923-6546-1.
Vilar, David, Maja Popovic, and Hermann Ney (2006, Nov). AER: Do we need to “improve” our alignments? In International Workshop on Spoken Language Translation (IWSLT 2006), Kyoto, Japan, pp. 205–212.
Viterbi, Andrew J. (1967). Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory 13, 260–269.
Warwick, Susan and Graham Russell (1990). Bilingual concordancing and bilingual lexicography. In EURALEX-90, Malaga, Spain.
Wu, Dekai (1994, Jun). Aligning a parallel English-Chinese corpus statistically with lexical criteria. In 32nd Annual Meeting of the Association for Computational Linguistics (ACL-94), Las Cruces, NM, pp. 80–87.
Wu, Dekai (1995a, Jun). An algorithm for simultaneously bracketing parallel texts by aligning words. In 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), Cambridge, MA, pp. 244–251.
Wu, Dekai (1995b, Jul). Grammarless extraction of phrasal translation examples from parallel texts. In Sixth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-95), Leuven, Belgium, pp. 354–372.
Wu, Dekai (1995c, Aug). Stochastic Inversion Transduction Grammars, with application to segmentation, bracketing, and alignment of parallel corpora. In 14th International Joint Conference on Artificial Intelligence (IJCAI-95), Montreal, Canada, pp. 1328–1334.
Wu, Dekai (1995d, Jun). Trainable coarse bilingual grammars for parallel text bracketing. In Third Annual Workshop on Very Large Corpora (WVLC-3), Cambridge, MA, pp. 69–81.
Wu, Dekai (1997, Sep). Stochastic Inversion Transduction Grammars and bilingual parsing of parallel corpora. Computational Linguistics 23(3), 377–404.
Wu, Dekai (2006). Textual entailment recognition based on Inversion Transduction Grammars. In Joaquin Quiñonero Candela, Ido Dagan, Bernardo Magnini, and Florence d’Alché Buc (Eds.), Machine Learning Challenges, Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, First PASCAL Machine Learning Challenges Workshop, MLCW 2005, Southampton, U.K., April 11–13, 2005, Revised Selected Papers, Volume 3944 of Lecture Notes in Computer Science, pp. 299–308. Berlin: Springer.
Wu, Dekai, Marine Carpuat, and Yihai Shen (2006, Dec). Inversion Transduction Grammar coverage of Arabic-English word alignment for tree-structured statistical machine translation. In IEEE/ACL 2006 Workshop on Spoken Language Technology (SLT 2006), Aruba.
Wu, Dekai and David Chiang (Eds.) (2007, Apr). Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation. Rochester, NY: Association for Computational Linguistics.

Wu, Dekai and David Chiang (Eds.) (2009, Jun). Proceedings of SSST-3, Third Workshop on Syntax and Structure in Statistical Translation, at NAACL-HLT 2009. Boulder, CO: Association for Computational Linguistics.
Wu, Dekai and Pascale Fung (1994, Oct). Improving Chinese tokenization with linguistic filters on statistical lexical acquisition. In Fourth Conference on Applied Natural Language Processing (ANLP-94), Stuttgart, Germany, pp. 180–181.
Wu, Dekai and Pascale Fung (2005, Oct). Inversion Transduction Grammar constraints for mining parallel sentences from quasi-comparable corpora. In Second International Joint Conference on Natural Language Processing (IJCNLP 2005), Jeju, Korea, pp. 257–268.
Wu, Dekai and Hongsing Wong (1998, Aug). Machine translation with a stochastic grammatical channel. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL’98), Montreal, Canada.
Wu, Zimin and Gwyneth Tseng (1993). Chinese text segmentation for text retrieval: Achievements and problems. Journal of the American Society for Information Science 44(9), 532–542.
Xu, Donghua and Chew Lim Tan (1996, Jun). Automatic alignment of English-Chinese bilingual texts of CNS news. In International Conference on Chinese Computing (ICCC-96), Singapore, pp. 90–97.
Younger, David H. (1967). Recognition and parsing of context-free languages in time n³. Information and Control 10(2), 189–208.
Zens, Richard and Hermann Ney (2003, Aug). A comparative study on reordering constraints in statistical machine translation. In 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), Sapporo, Japan, pp. 192–202.
Zens, Richard, Hermann Ney, Taro Watanabe, and Eiichiro Sumita (2004, Aug). Reordering constraints for phrase-based statistical machine translation. In 20th International Conference on Computational Linguistics (COLING-04), Geneva, Switzerland.
Zhang, Hao and Daniel Gildea (2004, Aug). Syntax-based alignment: Supervised or unsupervised? In 20th International Conference on Computational Linguistics (COLING-04), Geneva, Switzerland.
Zhang, Hao and Daniel Gildea (2005, Jun). Stochastic lexicalized inversion transduction grammar for alignment. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), Ann Arbor, MI, pp. 475–482.
Zhang, Hao, Liang Huang, Daniel Gildea, and Kevin Knight (2006, Jun). Synchronous binarization for machine translation. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL-2006), New York, pp. 256–263.
Zhao, Bing and Stephan Vogel (2003, May). Word alignment based on bilingual bracketing. In HLT/NAACL-2003 Workshop on Building and Using Parallel Texts, Edmonton, Canada.