VARTRA: a Comparable Corpus for Analysis of Translation Variation

Ekaterina Lapshinova-Koltunski
Universität des Saarlandes
A 2.2 Universität Campus
66123 Saarbrücken
Germany
[email protected]

Proceedings of the 6th Workshop on Building and Using Comparable Corpora, pages 77–86, Sofia, Bulgaria, August 8, 2013. © 2013 Association for Computational Linguistics

Abstract

This paper presents a comparable translation corpus created to investigate translation variation phenomena in terms of contrasts between languages, text types and translation methods (machine vs. computer-aided vs. human). These phenomena are reflected in linguistic features of translated texts belonging to different registers and produced with different translation methods. For their analysis, we combine methods derived from translation studies, language variation and machine translation, concentrating especially on textual and lexico-grammatical variation. To our knowledge, none of the existing corpora can provide comparable resources for a comprehensive analysis of variation across text types and translation methods. Therefore, the corpus resources created, as well as our analysis results, will find application in different research areas, such as translation studies, machine translation, and others.

1 Introduction: Aims and Motivation

Comparable corpora serve as essential resources for numerous studies and applications in linguistics (contrastive language, text analysis), translation studies and natural language processing (machine translation, computational lexicography, information extraction). Many comparable corpora are available or are being created for different language pairs, such as (a) English, German and Italian (Baroni et al., 2009); (b) English, Norwegian, German and French (Johansson, 2002); (c) written or spoken English and German (Hansen et al., 2012) or (Lapshinova et al., 2012).

However, comparable corpora may be of the same language, as the feature of 'comparability' may relate not only to corpora of different languages but also to those of the same language. The main feature that makes them comparable is that they cover the same text type(s) in the same proportions, cf. for instance (Laviosa, 1997) or (McEnery, 2003), and thus can be used for a certain comparison task.

As our research goal is the analysis of translation variation, we need a corpus which allows us to compare translations that differ in the source/target language, the type of the text translated (genre or register) and the method of translation (human with/without CAT tools¹, machine translation). There are a number of corpus-based studies dedicated to the analysis of variation phenomena, cf. (Teich, 2003; Steiner, 2004; Neumann, 2011) among others. However, all of them concentrate on the analysis of human translations only, comparing translated texts with non-translated ones. In some works on machine translation, the focus does lie on comparing different translation variants (human vs. machine), e.g. (White, 1994; Papineni et al., 2002; Babych and Hartley, 2004; Popović, 2011). However, they all serve the task of automatic evaluation of machine translation (MT) systems and use the human-produced translations as references or training material only. None of them provides an analysis of specific (linguistic) features of different text types translated with different translation methods.

¹ CAT = computer-aided translation

The same tendencies are observed in the corpus resources available, as they are mostly built for certain research goals. Although a number of translation corpora exist, none of them fits our research task. Most of them include one translation method only: EUROPARL (Koehn, 2005) and JRC-Acquis (Steinberger et al., 2006) contain translations produced by humans, whereas DARPA-94 (White, 1994) contains machine-translated texts only. Moreover, they all contain one register only and, therefore, cannot be applied to a comprehensive analysis of variation phenomena.

Therefore, we decided to compile our own comparable corpus which contains translations from different languages, of different text types, produced with different translation methods (human vs. machine). Furthermore, both human and machine translations contain further varieties: they are produced by different translators (both professional and student), with or without CAT tools, or by different MT systems.

This resource will be valuable not only for our own research goals and those of other translation researchers; it can also find further applications, e.g. in machine translation or CAT tool development, as well as in translation quality assessment.

The remainder of the paper is structured as follows. Section 2 presents studies we adopt as theoretical background for the selection of features and requirements for corpus resources. In section 4, we describe the compilation and design of the comparable translation corpus at hand. In section 5, we demonstrate some examples of corpus application, and in section 6, we draw some conclusions and provide more ideas for corpus extension and its further application.

2 Theoretical Background and Resource Requirements

To design and annotate a corpus reflecting variation phenomena, we need to define the (linguistic) features of translations under analysis. As sources for these features, we use studies on translation and translationese, those on language variation, as well as works on machine translation, for instance MT evaluation and MT quality assessment.

2.1 Translation analysis and translationese

As already mentioned in section 1 above, translation studies either analyse differences between original texts and translations, e.g. (House, 1997; Matthiessen, 2001; Teich, 2003; Hansen, 2003; Steiner, 2004), or concentrate on the properties of translated texts only, e.g. (Baker, 1995). Importantly, most of them consider translations to have specific properties of their own which distinguish them from originals (in both the source and the target language) and thus establish a specific language of translations, translationese.

Baker (1995) excludes the influence of the source language on a translation altogether, analysing characteristic patterns of translations independent of the source language. Within this context, she proposed translation universals, i.e. hypotheses on the universal features of translations: explicitation (a tendency to spell things out rather than leave them implicit), simplification (a tendency to simplify the language used in translation), normalisation (a tendency to exaggerate features of the target language and to conform to its typical patterns) and levelling out (individual translated texts are alike), cf. (Baker, 1996). Additionally, translations can also have features of "shining through" as defined by Teich (2003): in this case, we observe some typical features of the source language in the translation. The author analyses this phenomenon by comparing different linguistic features (e.g. passive and passive-like constructions) of originals and translations in English and German.

In some recent applications of translationese phenomena, e.g. those for cleaning parallel corpora obtained from the Web, or for the improvement of translation and language models in MT (Baroni and Bernardini, 2005; Kurokawa et al., 2009; Koppel and Ordan, 2011; Lembersky et al., 2012), the authors succeeded in automatically identifying these features with machine learning techniques.

We aim at employing the knowledge (features described) from these studies, as well as the techniques applied to explore these features in the corpus.

2.2 Language variation

Features of translated texts, as well as those of their sources, are influenced by the text types they belong to, see (Neumann, 2011). Therefore, we also refer to studies on language variation which focus on the analysis of variation across registers and genres, e.g. (Biber, 1995; Conrad and Biber, 2001; Halliday and Hasan, 1989; Matthiessen, 2006; Neumann, 2011) among others. Register is described as functional variation, see Quirk et al. (1985) and Biber et al. (1999). For example, language may vary according to the activity of the involved participants, production varieties (written vs. spoken) of a language or the relationship between speaker and addressee(s). These parameters correspond to the variables of field, tenor and mode defined in the framework of Systemic Functional Linguistics (SFL), which describes language variation according to situational contexts, cf. e.g. Halliday and Hasan (1989) and Halliday (2004).

In SFL, these variables are associated with the corresponding lexico-grammatical features: field of discourse is realised in functional verb classes (e.g. activity, communication, etc.) or term patterns, tenor is realised in modality (expressed e.g. by modal verbs) or stance expressions, and mode is realised in information structure and textual cohesion (e.g. personal and demonstrative refer-

number of different phrase types in the source and target or difference between the depth of their syntactic trees, etc.).

3 Methodology

Consideration of the features described in the above-mentioned frameworks will give us new insights into variation phenomena in translation. Thus, we collect these features and extract information on their distribution across the translation variants of our corpus in order to evaluate them later with statistical methods.

Some of the features described
