Guidelines for BOLT Chinese-English Word Alignment
Version 2.0 – April 10, 2014
Linguistic Data Consortium

Created by: Xuansong Li [email protected]
With contribution from: Niyu Ge [email protected], Stephanie Strassel [email protected]

BOLT_WAguide_V2.0 Guidelines for BOLT Word Alignment Annotation Page 1 of 35 Version 2.0 – April 10, 2014

TABLE OF CONTENTS

1 INTRODUCTION
2 DATA
3 TASKS AND CONVENTIONS
  3.1 TASKS
  3.2 CONVENTIONS
4 CONCEPTS AND GENERAL APPROACH
  4.1 TRANSLATED VERSUS NOT-TRANSLATED
    4.1.1 Translated
    4.1.2 Not-translated
  4.2 MINIMUM-MATCH
    4.2.1 Minimum-match in literal translations
    4.2.2 Maximum-match in idioms and non-literal translations
  4.3 ATTACHMENT APPROACH
5 ALIGNMENT AND ATTACHMENT RULES
  5.1 ANAPHORA (PRONOUN)
  5.2 DEMONSTRATIVE WORDS
  5.3 MEASURE WORDS
  5.4 COPULAR BE (LINK VERB)
  5.5 PROPER NOUN
  5.6 DETERMINERS (ARTICLES IN ENGLISH)
  5.7 AUXILIARY VERBS
  5.8 INFINITIVE "TO"
  5.9 EXPLETIVES
  5.10 CONJUNCTION
    5.10.1 Conjunctions with “and”
    5.10.2 Conjunctions without “and”
  5.11 PREPOSITIONS
  5.12 VERB PARTICLES
  5.13 POSSESSIVES
  5.14 PASSIVE SENTENCES
  5.15 SUBORDINATE CLAUSES
  5.16 PUNCTUATIONS
  5.17 CONTEXTUALLY ATTACHED WORDS
  5.18 RHETORICALLY ATTACHED WORDS
6 UNMATCHED/UNATTACHED WORDS
7 SPECIAL FEATURES IN CHINESE
  7.1 TENSE IN CHINESE
  7.2 DUPLICATION
    7.2.1 Noun duplication
    7.2.2 Verb duplication
    7.2.3 Adjective duplication
    7.2.4 Measure word duplication
  7.3 SEPARATED VERBS
  7.4 “的” “地” “得”
  7.5 PREFIX AND SUFFIX
    7.5.1 Prefix
    7.5.2 Word Suffix
    7.5.3 Sentence suffix
  7.6 WORD SEGMENTATION
    7.6.1 General Principles
  7.7 CONFLICTING RULES
    7.7.1 Possessive versus co-reference
8 ALTERNATE TRANSLATIONS
9 INFORMAL LANGUAGE FEATURES AND ALIGNMENT EXAMPLES

1 Introduction

This version of the word alignment guidelines for the BOLT project was developed from the guidelines for the GALE word alignment project. The task of word alignment consists of finding correspondences between words, phrases, or groups of words in a set of parallel texts. The resulting annotated data can be used as gold-standard training data for machine translation. Drawing on the Blinker and ARCADE project guidelines, these guidelines are designed specifically for Chinese-English word alignment, and LDC has developed a visual tool to facilitate the task.

Section 2 presents the data used for word alignment. Section 3 specifies the tasks and explains the conventions adopted in these guidelines. Section 4 describes the general annotation strategies for handling universal language features in word alignment. Section 5 elaborates more detailed specifications and rules, with examples. Section 6 handles unaligned words. Section 7 describes approaches to distinctive features of the Chinese language. Section 8 handles alternate translations. The last section discusses discussion forum features, with some examples.

2 Data

The data type is discussion forum. Tokenization is done automatically, without human correction. Tokenization of English follows the same guidelines used in the Penn English Treebank: split words by white space, separate punctuation from the preceding/following words, apostrophe S (‘s) is treated as separate