A CORPUS LINGUISTICS STUDY OF TRANSLATION CORRESPONDENCES IN ENGLISH AND GERMAN by ALEKSANDAR TRKLJA

Total Pages: 16

File Type: PDF, Size: 1020 KB

A thesis submitted to The University of Birmingham for the degree of DOCTOR OF PHILOSOPHY. School of English, The University of Birmingham, November 2013.

University of Birmingham Research Archive, e-theses repository: This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by the Copyright, Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder.

ABSTRACT

This thesis develops an analytical model for differentiating translation correspondences and for grouping lexical items according to their semantic similarities. The model combines the language-in-use theory of meaning with the distributional method of corpus linguistics. Translation correspondences are identified by exploring the occurrence of lexical items in a parallel corpus. Their classification into groups is based on the substitution principle, whereas the distinguishing features used to differentiate between lexical items emerge from the study of the local contexts in which these items occur. The distinguishing features are analysed with the help of various statistical measures. The results obtained indicate that the proposed model has advantages over traditional approaches that rely on the referential theory of meaning.
In addition to contributing to lexicology, the model also has applications in practical lexicography and in language teaching.

ACKNOWLEDGEMENTS

I would like to thank my supervisors, Professor Wolfgang Teubert and Dr. Paul Thompson, for their thoughtful and constructive supervision. I am indebted to Biman Chakraborty, Gabriela Saldanha, Geoff Barnbrook, Mary Snell-Hornby, Nick Groom, Oliver Mason, Pernilla Danielsson, Susan Hunston and my fellow research students for many discussions and much useful advice. I also want to thank my family and friends, and in particular my wife Barbara N. Wiesinger for her continued support during this long project.

Contents

List of Tables v
List of Figures vi
List of Tables in Appendix vii
List of Figures in Appendix vii

CHAPTER 1 INTRODUCTION 1
1.1 This thesis 1
1.2 Bilingual lexicography and onomasiological dictionaries 4
1.3 Contrastive corpus studies 8
1.4 Language in use theory of meaning 9
1.5 Research questions 11
1.6 Outline of the thesis 11

CHAPTER 2 PREVIOUS APPROACHES 13
2.1 Introduction 13
2.2 Componential approaches to semantic fields 14
2.2.1 Adrienne Lehrer's analysis of cooking words 15
2.2.2 Karcher's study of water words in English and German 20
2.3 Frame Semantics 23
2.4 Corpus approaches to semantic fields 28
2.4.1 Core words in semantic fields 28
2.4.2 Semantic mirrors 31
2.5 Studies beyond single words 35
2.5.1 Contextually defined translation equivalents 35
2.5.2 The Bilexicon project 37
2.6 Conclusion 40

CHAPTER 3 THEORETICAL BACKGROUND 41
3.1 Introduction 41
3.2 Theory and methodology 42
3.2.1 Language in use theory of meaning 42
3.2.2 Distributional approach 44
3.2.3 Translation correspondences 46
3.2.4 Sublanguages and local grammars 49
3.2.5 Corpus categories and corpus tools 52
3.2.6 Probability and differences between lexical items 56
3.2.7 Conclusion 59
3.3 Corpora 61
3.4 Data analysis procedure 63
3.5 Terms explained and conventions 65

CHAPTER 4 IDENTIFICATION OF TRANSLATION CORRESPONDENCES 67
4.1 Introduction 67
4.2 Identification of the TLD {CAUSE PROBLEM} and {PROBLEM BEREITEN} 67
4.3 Conclusion 75

CHAPTER 5 LEXICAL ITEMS FROM THE TLD {CAUSE PROBLEM} AND {PROBLEM BEREITEN} 77
5.1 Introduction 77
5.2 TLD {CAUSE PROBLEM} 78
5.2.1 Grammar structures 78
5.2.1.1 A local grammar of the lexical items from the TLD {CAUSE PROBLEM} 78
5.2.1.2 Conclusion 87
5.2.2 An intralinguistic analysis of items from the TLD {CAUSE PROBLEM} 88
5.2.2.1 General distribution 88
5.2.2.2 Modifiers of verbal elements 91
5.2.2.3 Modifiers of nominal elements 93
5.2.2.4 Unique collocates 108
5.2.2.5 Conclusion 111
5.2.3 An interlinguistic analysis of the items from the TLD {CAUSE PROBLEM} 115
5.2.3.1 General principles 115
5.2.3.2 Correspondence potential of English translation correspondences 117
5.2.3.3 Conclusion 121
5.3 TLD {PROBLEM BEREITEN} 121
5.3.1 Grammar structures 122
5.3.1.1 A local grammar of the lexical items from the TLD {PROBLEM BEREITEN} 122
5.3.1.2 Conclusion 129
5.3.2 An intralinguistic analysis of the items from the TLD {PROBLEM BEREITEN} 131
5.3.2.1 General distributional differences 131
5.3.2.2 Co-occurrence with modifiers of verbal elements 133
5.3.2.3 Co-occurrence with the modifiers of nominal elements 135
5.3.2.4 Unique collocates 151
5.3.2.5 Conclusion 152
5.3.3 An interlinguistic analysis of the lexical items from the TLD {PROBLEM BEREITEN} 156
5.3.3.1 Correspondence potential of the German translation correspondences 156
5.3.3.2 Conclusion 159

CHAPTER 6 IDENTIFICATION OF THE TLD {MANY COLLECTIVES} AND {VIELE KOLLEKTIVA} AND TLSD {MANY PROBLEMS} AND {VIELE PROBLEME} 161
6.1 Introduction 161
6.2 Translation lexical domains {MANY COLLECTIVES} and {VIELE KOLLEKTIVA} 165
6.3 Translation lexical sub-domains {MANY PROBLEMS} and {VIELE PROBLEME} 167
6.4 Conclusion 168

CHAPTER 7 TLD {MANY COLLECTIVES} AND {VIELE KOLLEKTIVA} AND TLSD {MANY PROBLEMS} AND {VIELE PROBLEME} 170
7.1 Introduction 170
7.2 TLD {MANY COLLECTIVES} and TLSd {MANY PROBLEMS} 171
7.2.1 Frequency and the number of collocates 172
7.2.2 Classification of COLLECTIVES 175
7.2.3 Shared collocates 181
7.2.4 Unique collocates 187
7.2.5 Lexical items from the TLSd {MANY PROBLEMS} 188
7.2.6 Correspondence potential 190
7.3 TLD {VIELE KOLLEKTIVA} and TLSd {VIELE PROBLEME} 194
7.3.1 Frequency and the number of collocates 194
7.3.2 Classification of KOLLEKTIVA 197
7.3.3 Shared collocates 201
7.3.4 Unique collocates 206
7.3.5 Lexical items from the TLSd {VIELE PROBLEME} 207
7.3.6 Correspondence potential 209
7.3.7 Conclusion 213

CHAPTER 8 DISCUSSION OF RESULTS AND CONCLUSION 215
8.1 Introduction 215
8.2 Review of findings 217
8.3 Significance of findings 220
8.3.1 Contribution to lexicology 220
8.3.2 Contribution to practical lexicography 224
8.3.2.1 Selection of options 224
8.3.2.2 Bilingual learners' dictionaries 229
8.3.2.3 Translation dictionaries 236
8.3.3 Contribution to the use of translation in language teaching 238
8.4 Limitations of the study and further research 242
REFERENCES 244
I) APPENDIX A 258
II) APPENDIX B 266

List of Tables

Table 2.1: The semantic field cooking (adapted from Lehrer, 1974: 31) 17
Table 2.2: A semantic and grammatical description of the verb argue (from Boas, 2002: 1367) 25
Table 3.1: Representation of a local grammar of evaluation (adapted from Hunston and Sinclair, 2000: 91) 51
Table 4.1: Typical collocates of the noun <rise> 68
Table 5.1: Lexical items from the TLD {CAUSE PROBLEM} 79
Table 5.2: Grammatical structures for <cause problem> 80
Table 5.3: Local grammar classes of modifiers that occur with <problem> and <difficulty> 82
Table 5.4: Local grammar structures for lexical items from the TLD {CAUSE PROBLEM} 88
Table 5.5: Raw frequency of English lexical items in ukWaC 89
Table 5.6: Co-occurrence of verbal elements with the lemmata <problem> and <difficulty> 90
Table 5.7: Co-occurrence of verbal elements with four word forms of the lemmata <problem> and <difficulty> 91
Table 5.8: The number of modifiers and frequency of lexical units that collocate with the word form problems 98
Table 5.9: Frequency and the number of modifiers for lexical units that collocate with the word form problem 104
Table 5.10: The number of modifiers and frequency of lexical units that collocate with the word form difficulties 106
Table 5.11: The number of modifiers and frequency of lexical units that collocate with the word form difficulty 108
Table 5.12: Distinguishing features for lexical items from the TLD {CAUSE PROBLEM}
Recommended publications
  • Cohesion and Coherence in Indonesian Textbooks for the Tenth Grader of Senior High School
    Advances in Social Science, Education and Humanities Research, volume 301. Seventh International Conference on Languages and Arts (ICLA 2018). COHESION AND COHERENCE IN INDONESIAN TEXTBOOKS FOR THE TENTH GRADER OF SENIOR HIGH SCHOOL. Syamsudduha, Johar Amir, and Wahida, Universitas Negeri Makassar, Makassar, Indonesia ([email protected], [email protected], [email protected]). Abstract: This study aims to describe (1) cohesion and (2) coherence in Indonesian textbooks for XI Grade Senior High School. This is qualitative analysis research. The data are the statements or disclosures that contain cohesion and coherence in the Indonesian textbooks. The sources of data are the texts in the Indonesian Textbooks for XI Grade Senior High School published by the Ministry of Education and Culture of the Republic of Indonesia. Data collection was carried out by documenting, reading, and noting; data analysis through reduction, presentation, and inference/verification of the data. The results of this study showed (1) the use of two types of cohesion markers, grammatical cohesion and lexical cohesion: the markers of grammatical cohesion were reference, substitution, ellipsis, and conjunction, and the markers of lexical cohesion were repetition, synonymy, antonymy, hyponymy, and collocation; and (2) the use of coherence markers, namely addition, sequencing, resistance, more, cause-consequence, time, terms, ways, uses, and explanations. These findings showed that the texts of the Indonesian Textbooks for XI Grade Senior High School published by the Ministry of Education and Culture of the Republic of Indonesia had cohesion and unity of meaning, so the textbooks were declared suitable for use as teaching material.
  • Machine-Translation Inspired Reordering As Preprocessing for Cross-Lingual Sentiment Analysis
    Master in Cognitive Science and Language, Master's Thesis, September 2018. Machine-translation inspired reordering as preprocessing for cross-lingual sentiment analysis. Alejandro Ramírez Atrio. Supervisor: Toni Badia. Abstract: In this thesis we study the effect of word reordering as preprocessing for Cross-Lingual Sentiment Analysis. We try different reorderings in two target languages (Spanish and Catalan) so that their word order more closely resembles that of our source language (English). Our original expectation was that a Long Short-Term Memory classifier trained on English data with bilingual word embeddings would internalize English word order, resulting in poor performance when tested on a target language with a different word order. We hypothesized that the more the word order of either of our target languages resembles that of our source language, the better the overall performance of our sentiment classifier would be when analyzing the target language. We tested five sets of transformation rules for our part-of-speech reorderings of Spanish and Catalan, extracted mainly from two sources: two papers by Crego and Mariño (2006a and 2006b) and our own empirical analysis of two corpora, CoStEP and Tatoeba. The results suggest that the bilingual word embeddings we train our Long Short-Term Memory model with do not improve any English word-order learning by that part of the model when used cross-lingually. There is no improvement when reordering the Spanish and Catalan texts so that their word order more closely resembles English, and no significant drop in score even when applying a random reordering that makes them almost unintelligible, whether classifying between 2 options (positive, negative) or between 4 (strongly positive, positive, negative, strongly negative).
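The kind of POS-driven reordering the thesis tests can be sketched in a few lines; the single NOUN-ADJ swap rule below is an invented stand-in for the Crego and Mariño rule sets, not one of the five sets actually tested:

```python
# Toy POS-based reordering as preprocessing (illustrative rule only):
# swap Spanish/Catalan NOUN-ADJ pairs into English-like ADJ-NOUN order.
def reorder(tagged):
    """tagged: list of (word, pos) tuples; returns the reordered word list."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "NOUN" and out[i + 1][1] == "ADJ":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return [w for w, _ in out]

sentence = [("una", "DET"), ("casa", "NOUN"), ("blanca", "ADJ")]
print(reorder(sentence))  # ['una', 'blanca', 'casa']
```

A real preprocessing pass would run rules like this over POS-tagged corpora before training the sentiment classifier.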
  • Exploring the Relationship Between Word-Association and Learners’ Lexical Development
    Exploring the Relationship between Word-Association and Learners' Lexical Development. Jason Peppard. An assignment for Master of Arts in Applied Linguistics, Module 2 (Lexis), July 2007. Centre for English Language Studies, Department of English, University of Birmingham, Birmingham B15 2TT, United Kingdom. LX/06/02. Task: Follow task 123 outlined on page 152 of McCarthy (1990, Vocabulary, OUP). You do not have to use students: anyone who has an L2 but has not been brought up as a bilingual will do. Use at least four subjects and test them on their L2 (or L3/L4 etc.). Report your findings, giving the level(s) of subjects' L2 (L3, etc.) and including the prompt words and responses. Follow McCarthy's list of evaluation points, adding other points to this if you wish. Approx. 4530 words. 1.0 Introduction: Learning or acquiring your first language was a piece of cake, right? So why then is it so difficult for some people to learn a second language? For starters, many learners might be wondering why I referred to a dessert in the opening sentence of a paper concerning vocabulary. The point here is summed up neatly by McCarthy (1990): "No matter how well the student learns grammar, no matter how successfully the sounds of L2 are mastered, without words to express a wide range of meanings, communication in an L2 just cannot happen in any meaningful way" (p. viii). This paper, based on Task 123 of McCarthy's Vocabulary (ibid: 152), aims to explore the L2 mental lexicon. A simple word association test consisting of eight stimulus words was administered to both low-level and high-level Japanese EFL students, as well as to a group of native English speakers for comparison.
  • Student Research Workshop Associated with RANLP 2011, Pages 1–8, Hissar, Bulgaria, 13 September 2011
    RANLPStud 2011. Proceedings of the Student Research Workshop associated with The 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), 13 September 2011, Hissar, Bulgaria. ISBN 978-954-452-016-8. Designed and printed by INCOMA Ltd., Shoumen, Bulgaria. Preface: The Recent Advances in Natural Language Processing (RANLP) conference, already in its eighth year and ranked among the most influential NLP conferences, has always been a meeting venue for scientists coming from all over the world. Since 2009, we have provided an arena for the younger and less experienced members of the NLP community to share their results with an international audience. For this reason, following the first successful and highly competitive Student Research Workshop associated with RANLP 2009, we are pleased to announce the second edition of the workshop, held during the main RANLP 2011 conference days on 13 September 2011. The aim of the workshop is to provide an excellent opportunity for students at all levels (Bachelor, Master, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. We received 31 high-quality submissions, among which 6 papers were accepted as regular oral papers and 18 as posters. Each submission has been reviewed by
  • Translating Lexical Semantic Relations: the First Step Towards Multilingual Wordnets*
    Translating Lexical Semantic Relations: The First Step Towards Multilingual Wordnets*. Chu-Ren Huang, I-Ju E. Tseng, Dylan B.S. Tsai. Institute of Linguistics, Preparatory Office, Academia Sinica, 128 Sec. 2 Academy Rd., Nangkang, Taipei 115, Taiwan, R.O.C. [email protected], {elanna, dylan}@hp.iis.sinica.edu.tw. Abstract: Establishing correspondences between wordnets of different languages is essential both to multilingual knowledge processing and for bootstrapping wordnets of low-density languages. We claim that such correspondences must be based on lexical semantic relations, rather than top ontology or word translations. In particular, we define a translation equivalence relation as a bilingual lexical semantic relation. Such relations can then be part of a logical entailment predicting whether source language semantic relations will hold in a target language or not. Our claim is tested with a study of 210 Chinese lexical lemmas and their possible semantic relation links bootstrapped from the Princeton WordNet. Wordnets express ontology via a network of words linked by lexical semantic relations. Since these words are by definition the lexicon of each language, this design feature ensures versatility in faithfully and comprehensively representing the semantic content of each language. Hence, on one hand, these conceptual atoms reflect linguistic idiosyncrasies; on the other hand, the lexical semantic relations (LSRs) receive universal interpretation across different languages. For example, the definition of relations such as synonymy or hypernymy is universal. The universality of the LSRs is the foundation that allows wordnets to serve as a potential common semantic network representation for all languages. The premise is
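The core idea, that a bilingual translation equivalence link licenses projecting a lexical semantic relation into the target language, can be sketched with toy lexicons; the mini-dictionaries and the `project_hypernym` helper below are invented for illustration, not the paper's actual data or code:

```python
# Toy English hypernym relation and EN -> ZH translation equivalence links.
hypernyms_en = {"dog": "animal"}
translation = {"dog": "狗", "animal": "動物"}

def project_hypernym(word_en):
    """Bootstrap a Chinese hypernym pair from an English relation,
    provided both relation members have translation equivalents."""
    hyper = hypernyms_en.get(word_en)
    if hyper and word_en in translation and hyper in translation:
        return (translation[word_en], translation[hyper])
    return None

print(project_hypernym("dog"))  # ('狗', '動物')
```

The paper's entailment question is then whether such projected pairs actually hold in the target wordnet.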
  • Machine Reading = Text Understanding (David Israel): What Is It to Understand a Text?
    Machine Reading = Text Understanding. David Israel. What is it to understand a text? To understand a text, starting with a single sentence, is to: determine its truth (perhaps along with the evidence for its truth); determine its truth-conditions (roughly: what must the world be like for it to be true), so one does not have to check how the world actually is, that is, whether the sentence is true; calculate its entailments; take, or at least determine, appropriate action in light of it (and some given preferences/goals/desires); translate it accurately into another language (and what does accuracy come to?); ground it in a cognitively plausible conceptual space; and so on. These are not necessarily competing alternatives. For contrast: what is it to engage intelligently in a dialogue? Respond appropriately to what your dialogue partner has just said (and done), typically taking into account the overall purpose(s) of the interaction and the current state of that interaction (the dialogue state). Compare and contrast dialogue between "equals" and between a human and a computer trying to assist the human; Siri is a very simple example of the latter. A little ancient history of dialogue systems: STUDENT (Bobrow 1964), question answering for word problems in simple algebra. "If the number of customers Tom gets is twice the square of 20% of the number of advertisements he runs, and the number of advertisements he runs is 45, what is the number of customers Tom gets?" Overview of the method: 1. Map referential expressions to variables. 2. Use regular expression templates to identify and transform mathematical expressions.
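The STUDENT method outlined above (map referential expressions to variables, then apply regular-expression templates to the mathematical expressions) can be illustrated on the quoted Tom problem; this toy Python sketch is my illustration, not Bobrow's program:

```python
import re

# The word problem quoted in the slides.
problem = ("If the number of customers Tom gets is twice the square of 20% of "
           "the number of advertisements he runs, and the number of "
           "advertisements he runs is 45, what is the number of customers Tom gets?")

# Step 1: map a referential expression ("the number of advertisements he runs")
# to a variable by extracting its stated value.
ads = float(re.search(r"advertisements he runs is (\d+)", problem).group(1))

# Step 2: a regex template for "twice the square of N% of ..." -> 2 * ((N/100) * x) ** 2
pct = float(re.search(r"twice the square of (\d+)% of", problem).group(1))
customers = 2 * ((pct / 100) * ads) ** 2

print(int(customers))  # 162
```

Real STUDENT generalized this with a library of such templates rather than two hand-picked patterns.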
  • Hindi Wordnet
    Description: Hindi WordNet. The Hindi WordNet is a system for bringing together different lexical and semantic relations between Hindi words. It organizes the lexical information in terms of word meanings and can be termed a lexicon based on psycholinguistic principles. The design of the Hindi WordNet is inspired by the famous English WordNet. In the Hindi WordNet the words are grouped together according to their similarity of meanings. Two words that can be interchanged in a context are synonymous in that context. For each word there is a synonym set, or synset, in the Hindi WordNet, representing one lexical concept. This is done to remove ambiguity in cases where a single word has multiple meanings. Synsets are the basic building blocks of WordNet. The Hindi WordNet deals with content words, or the open-class categories of words; thus it contains the following categories of words: noun, verb, adjective and adverb. Each entry in the Hindi WordNet consists of the following elements. 1. Synset: a set of synonymous words. For example, "विद्यालय, पाठशाला, स्कूल" (vidyaalay, paaThshaalaa, skuul) represents the concept of school as an educational institution. The words in the synset are arranged according to frequency of usage. 2. Gloss: describes the concept. It consists of two parts. Text definition: explains the concept denoted by the synset. For example, "वह स्थान जहाँ प्राथमिक या माध्यमिक स्तर की औपचारिक शिक्षा दी जाती है" (vah sthaan jahaan praathamik yaa maadhyamik star kii aupachaarik sikshaa dii jaatii hai) explains the concept of school as an educational institution.
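The synset-plus-gloss organisation just described can be sketched as a small data structure; the `Synset` class below is an invented illustration, not the Hindi WordNet's actual storage format or API:

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    words: list          # synonyms, ordered by frequency of usage
    gloss: str           # text definition of the concept
    examples: list = field(default_factory=list)

# The 'school as an educational institution' concept from the description.
school = Synset(
    words=["विद्यालय", "पाठशाला", "स्कूल"],   # vidyaalay, paaThshaalaa, skuul
    gloss="वह स्थान जहाँ प्राथमिक या माध्यमिक स्तर की औपचारिक शिक्षा दी जाती है",
)

# The most frequent member heads the synset.
print(school.words[0])  # विद्यालय
```

Linking such synsets to one another would then carry the lexical and semantic relations the description mentions.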
  • Introduction to Lexical Resources
    Introduction to lexical resources. Bolette Sandford Pedersen, Centre for Language Technology, University of Copenhagen, [email protected]. Outline 1: (1) Why are lexical semantic resources relevant for semantic annotation? (2) What is the relation between traditional lexicography and lexical resources for annotation purposes? Outline 2: WordNet, primary focus on vertical, taxonomical relations; VerbNet, primary focus on horizontal, syntagmatic relations, thematic roles; PropBank, primary focus on horizontal, syntagmatic relations, argument structure; FrameNet, primary focus on horizontal, syntagmatic relations, thematic roles (= frame elements); PAROLE/SIMPLE, Semantic Information for Multifunctional, PluriLingual Lexicons, based on the Generative Lexicon (Pustejovsky 1995). (Bolette S. Pedersen, CLARA Course 2011)
  • Second Workshop on Abusive Language Online (ALW2), Pages 1–10, Brussels, Belgium, October 31, 2018
    EMNLP 2018. Second Workshop on Abusive Language Online. Proceedings of the Workshop, co-located with EMNLP 2018, October 31, 2018, Brussels, Belgium. © 2018 The Association for Computational Linguistics. Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL), 209 N. Eighth Street, Stroudsburg, PA 18360, USA. Tel: +1-570-476-8006, Fax: +1-570-476-0860, [email protected]. ISBN 978-1-948087-68-1. Introduction: Interaction amongst users on social networking platforms can enable constructive and insightful conversations and civic participation; however, on many sites that encourage user interaction, verbal abuse has become commonplace. Abusive behavior such as cyberbullying, hate speech, and scapegoating can poison the social climates within online communities. The last few years have seen a surge in such abusive online behavior, leaving governments, social media platforms, and individuals struggling to deal with the consequences. As a field that works directly with computational analysis of language, the NLP community is uniquely positioned to address the difficult problem of abusive language online; encouraging collaborative and innovative work in this area is the goal of this workshop. The first year of the workshop saw 14 papers presented in a day-long program including interdisciplinary panels and active discussion. In this second edition, we have aimed to build on the success of the first year, maintaining a focus on computationally detecting abusive language and encouraging interdisciplinary work. Reflecting the growing research focus on this topic, the number of submissions received more than doubled, from 22 in last year's edition of the workshop to 48 this year.
  • Using Morphemes from Agglutinative Languages Like Quechua and Finnish to Aid in Low-Resource Translation
    Using Morphemes from Agglutinative Languages like Quechua and Finnish to Aid in Low-Resource Translation. John E. Ortega, [email protected], Dept. de Llenguatges i Sistemes Informatics, Universitat d'Alacant, E-03071, Alacant, Spain. Krishnan Pillaipakkamnatt, [email protected], Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA. Abstract: Quechua is a low-resource language spoken by nearly 9 million persons in South America (Hintz and Hintz, 2017). Yet in recent times there are few published accounts of successful adaptations of machine translation systems for low-resource languages like Quechua. In some cases, machine translations from Quechua to Spanish are inadequate due to error in alignment. We attempt to improve previous alignment techniques by aligning two languages that are similar due to agglutination: Quechua and Finnish. Our novel technique allows us to add rules that improve alignment for the prediction algorithm used in common machine translation systems. 1 Introduction. The problem of translating natural languages as they are spoken by humans into machine-readable text is a complex one; yet it is partially solvable, as shown by the accuracy of machine translations when compared to human translations (Kleinberg and Tardos, 2005). Statistical machine translation (SMT) systems such as Moses require that an algorithm be combined with enough parallel corpora (text from distinct languages that can be compared sentence by sentence) to build phrase translation tables from language models. For many European languages, the translation task of bringing words together in a sequential sentence-by-sentence format for modeling, known as word alignment, is not hard, due to the abundance of parallel corpora in large data sets such as Europarl.
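The general idea of making agglutinative word forms easier to align by splitting off morphemes can be sketched as follows; the two-suffix list is invented for the demo and is not the authors' rule set:

```python
# Illustrative morpheme segmentation before alignment (toy suffix inventory:
# e.g. Quechua plural -kuna, Finnish inessive -ssa).
SUFFIXES = ["kuna", "ssa"]

def segment(word):
    """Split off one known suffix, yielding stem + suffix tokens."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            return [word[: -len(suf)], "+" + suf]
    return [word]

print(segment("wasikuna"))  # ['wasi', '+kuna']  ('houses')
print(segment("talossa"))   # ['talo', '+ssa']   ('in a house')
```

Segmented tokens give an aligner smaller, more frequent units to match, which is the motivation behind morpheme-aware alignment for agglutinative languages.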
  • A Massively Parallel Corpus: the Bible in 100 Languages
    Lang Resources & Evaluation, DOI 10.1007/s10579-014-9287-y. ORIGINAL PAPER. A massively parallel corpus: the Bible in 100 languages. Christos Christodouloupoulos, Mark Steedman. © The Author(s) 2014. This article is published with open access at Springerlink.com. Abstract: We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora. Keywords: Parallel corpus, Multilingual corpus, Comparative corpus linguistics. 1 Introduction. Parallel corpora are a valuable resource for linguistic research and natural language processing (NLP) applications. One of the main uses of the latter kind is as training material for statistical machine translation (SMT), where large amounts of aligned data are standardly used to learn word alignment models between the lexica of two languages (for example, in the GIZA++ system of Och and Ney 2003). Another interesting use of parallel corpora in NLP is projected learning of linguistic structure. In this approach, supervised data from a resource-rich language is used to guide the unsupervised learning algorithm in a target language. Although there are some techniques that do not require parallel texts (e.g. Cohen et al. 2011), the most successful models use sentence-aligned corpora (Yarowsky and Ngai 2001; Das and Petrov 2011). C. Christodouloupoulos, Department of Computer Science, UIUC, 201 N.
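Part of what makes the Bible convenient as a parallel corpus is that verse identifiers give sentence alignment across languages essentially for free. A minimal sketch with invented toy data (one verse, two languages):

```python
# Verse-keyed toy corpora; the real resource spans full books in 100 languages.
english = {"GEN 1:1": "In the beginning God created the heaven and the earth."}
german  = {"GEN 1:1": "Am Anfang schuf Gott Himmel und Erde."}

# Align by shared verse key: each shared ID yields one sentence pair.
pairs = [(text, german[key]) for key, text in english.items() if key in german]
print(len(pairs))  # 1
```

Such key-based pairing sidesteps the automatic sentence alignment step that most parallel corpora require.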
  • A Corpus-Based Study of Unaccusative Verbs and Auxiliary Selection
    To be or not to be: A corpus-based study of unaccusative verbs and auxiliary selection. Master's Thesis presented to The Faculty of the Graduate School of Arts and Sciences, Brandeis University, Waltham, Massachusetts. Department of Computer Science. James Pustejovsky, Advisor. In Partial Fulfillment of the Requirements for the Master's Degree. By Richard A. Brutti Jr., May 2012. © Copyright by Richard A. Brutti Jr. 2012. All Rights Reserved. ABSTRACT: Since the introduction of the Unaccusative Hypothesis (Perlmutter, 1978), there have been many further attempts to explain the mechanisms behind the division in intransitive verbs. This paper aims to analyze and test some of the theories of unaccusativity using computational linguistic tools. Specifically, I focus on verbs that exhibit split intransitivity, that is, verbs that can appear in both unaccusative and unergative constructions, and on determining the distinguishing features that make this alternation possible. Many formal linguistic theories of unaccusativity involve the interplay of semantic roles and temporal event markers, both of which can be analyzed using statistical computational linguistic tools, including semantic role labelers, semantic parsers, and automatic event classification. I use auxiliary verb selection as a surface-level indicator of unaccusativity in Italian and Dutch, and test various classes of verbs extracted from the Europarl corpus (Koehn, 2005). Additionally, I provide some historical background for the evolution of this distinction, and analyze how my results fit into the larger theoretical framework.
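Using auxiliary selection as a surface cue can be sketched as a simple corpus count; the observation list below is invented toy data, not the thesis's Europarl extraction pipeline:

```python
from collections import Counter

# (verb, auxiliary) observations, e.g. parsed from Italian perfect tenses.
observations = [
    ("arrivare", "essere"), ("arrivare", "essere"),  # 'arrive': BE -> unaccusative
    ("correre", "essere"), ("correre", "avere"),     # 'run': alternates (split)
]

def aux_profile(obs):
    """Return {verb: Counter of auxiliaries} as an unaccusativity indicator:
    a strong preference for 'essere' suggests unaccusative behavior."""
    profile = {}
    for verb, aux in obs:
        profile.setdefault(verb, Counter())[aux] += 1
    return profile

profile = aux_profile(observations)
print(profile["correre"]["essere"])  # 1
```

Verbs whose counts split between the two auxiliaries are exactly the split-intransitivity cases the thesis targets.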