An empirical study on the influence of suggestions’ provenance metadata

A Thesis Submitted for the Degree of

Doctor of Philosophy

by

Lucía Morado Vázquez

Department of Computer Science and Information Systems,

University of Limerick

Supervisor: Dr Chris Exton

Co-Supervisor: Reinhard Schäler

Submitted to the University of Limerick, August 2012


Abstract

In the area of localisation there is constant pressure to automate processes in order to reduce the cost and time associated with an ever-growing workload. One of the main approaches to achieving this objective is to reuse previously localised data and metadata using standardised translation memory formats, such as the LISA Translation Memory eXchange (TMX) format or the OASIS XML Localisation Interchange File Format (XLIFF).

This research aims to study the effectiveness and importance of the localisation metadata associated with the translation suggestions provided by Computer-Assisted Translation (CAT) tools. Firstly, we analysed the way in which localisation data and metadata can be represented in the current specification of XLIFF (1.2). Secondly, we designed a new format, called the Localisation Memory Container (LMC), to organise previously localised XLIFF files in a single container. Finally, we developed a prototype (XLIFF Phoenix) to leverage the data and metadata from the LMC into untranslated XLIFF files in order to improve the translator's task by helping CAT tools not only to produce more translation suggestions easily, but also to enrich those suggestions with relevant metadata.

In order to test whether this "CAT-oriented" enriched metadata has any influence on the behaviour of the translator involved in the localisation process, we designed an experimental translation task in which translators used a modified CAT tool (Swordfish II). A pilot study with translation students was carried out in December 2010 to test the validity of our methodology. The main study took place between December 2011 and January 2012 with the participation of 33 professional translators divided into three groups.

The analysis of the gathered data indicated that the groups which received the translation memory obtained, on average, significantly better results (less time and better quality scores) than the group which did not receive any translation memory. In terms of the participants' attitude towards the metadata received, most of the participants did not find it distracting, and the majority of them would prefer a translation memory which contained metadata; finally, half of the participants could mention a case where it was helpful for them.

In this thesis we present our research objectives, the methodology and procedures, and an analysis of the results of the experiments; finally, we draw reasoned conclusions, based on this evidence, about the importance of metadata during the localisation process.

Declaration

I hereby declare that this thesis is entirely my own work, and that it has not been submitted as an exercise for a degree at any other university.

The tool developed in the first stage of this research, XLIFF Phoenix, was presented at CNGL live demo presentations and at the LRC XV annual International Localisation Conference in 2010; a demo is also publicly available on the internet1. A short summary of its development was also included in a journal paper (Aouad et al 2011). The pilot study was presented at the International T3L Conference: Tradumàtica, Translation Technologies & Localization in 2011, and a paper following that conference has been accepted for publication. A full list of the publications and conference presentations can be consulted in Appendix K.

1 http://www.youtube.com/watch?v=E6b36IHAMgM

Acknowledgements

This thesis has been made possible thanks to the support and help of many people; I would like to express my gratitude to some of them:

I would like to start by thanking Dr. Chris Exton, my supervisor, who has wisely advised me since our first meeting. I have learnt so much from his wisdom and kind personality. He always had the perfect words to support me and guide me onto the right path when I was lost. I would also like to thank his wife, Geraldine, for helping me with my written English.

I would like to thank my other supervisors in the LRC: Dimitra Anastasiou, Reinhard Schäler and David Filip, who have also supported me in my work and advised me.

I cannot remember the first time I heard the word localisation, but I am sure it was in one of the lessons taught by Dr. Jesús Torres. I cannot express my gratitude enough for all the time you spent teaching and advising me in the field of localisation. It is thanks to your support during my research stays at the University of Salamanca that I was able to carry out the pilot study.

I would like to express my sincere gratitude to Rodolfo Raya, secretary of the XLIFF Technical Committee and main developer at MaxPrograms, who kindly modified one of his tools (Swordfish) to accommodate the needs of this research. As well as advising me on specific technical issues, he also provided me with the necessary licenses to carry out the experiment and gave three full licenses to raffle among the experiment's participants. I would also like to thank the rest of the XLIFF TC members.

I want to thank Karl Kelly, manager of the LRC, for all his help and support during these years, especially for the configuration of the server where the translation experiments took place and for his continuous support during their execution. I also want to thank all of my LRC colleagues who walked with me all this way and helped me whenever I asked them to. I also want to thank the CNGL industrial partners who contributed to this research either with material or by helping me to find volunteers.

I want to thank my parents for giving me the best inheritance a person can get: my education. I also want to thank my siblings and the rest of my family for their constant support during these years. A big thank you should also go to my best friend, Laura, who has been there for me since day one; her moral support helped me to continue working during the worst moments and helped me to find a balance between my life and the PhD. I want to thank all of my other friends, who helped me see that there was a life after the PhD.

I want to thank Marisé, María and Marta for making my Irish summers much sunnier.

Finally, I would like to thank all the participants (students and professionals) who donated their time to this research.

This research is supported by Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at the University of Limerick.

Table of Contents

Abstract ...... ii

Declaration ...... iv

Acknowledgements ...... v

List of Tables...... xi

List of Figures ...... xii

List of Appendices ...... xv

List of Abbreviations...... xvi

Chapter 1- Introduction ...... 1

1.1. Introduction ...... 1

1.2. Research Questions – Hypothesis ...... 7

1.3. Methodological Approach ...... 9

1.4. Motivation ...... 16

1.5. Thesis Layout ...... 17

Chapter 2 – Literature review ...... 18

2.1. Introduction ...... 18

2.2. Defining Localisation ...... 18

2.3. Digital Content ...... 23

2.4. Standards of localisation ...... 28

2.4.1. LISA and TMX ...... 29

2.4.2. OASIS XLIFF ...... 31

2.5. Previous research on CAT tools and translation memories ...... 34

Chapter 3 – Methodology I ...... 38

3.1. Introduction ...... 38

3.2. Design and Creation ...... 38

3.2.1. Localisation Memory Container ...... 39

3.2.2. Localisation metadata in XLIFF ...... 42

3.2.3. XLIFF PHOENIX ...... 47

Chapter 4 – Methodology II ...... 81

4.1. Introduction ...... 81

4.2. Distribution of groups ...... 82

4.3. Participants ...... 82

4.4. Experiment environment ...... 83

4.5. Data used ...... 85

4.5.1. Source text ...... 85

4.5.2. Translation Memory ...... 86

4.5.3. Provenance metadata ...... 87

4.6. Logistics ...... 91

4.7. Translation assignment instructions ...... 92

4.8. Data collection methods ...... 94

4.8.1. Design of the demographic questionnaire ...... 94

4.8.2. Design of the task specific questionnaire ...... 99

4.9. Validity ...... 104

Chapter 5 – Results ...... 109

5.1. Introduction ...... 109

5.2. Participants ...... 109

5.2.1. Group A – No TM ...... 109

5.2.2. Group B - TM ...... 110

5.2.3. Group C – TM + Provenance Metadata ...... 110

5.3. Demographic Questionnaire ...... 112

5.3.1. Group A ...... 112

5.3.2. Group B ...... 123

5.3.3. Group C ...... 133

5.3.4. Overall ...... 145

5.4. Task Specific Questionnaire ...... 156

5.4.1. Group A ...... 156

5.4.2. Group B ...... 168

5.4.3. Group C ...... 179

5.4.4. Overall ...... 190

5.5. Task Specific Questionnaire. Group C, additional questions ...... 201

5.6. Review of the Translation ...... 212

5.6.1. Group A ...... 215

5.6.2. Group B ...... 216

5.6.3. Group C ...... 217

5.6.4. Overall ...... 218

5.7. Video observations ...... 221

5.7.1. Analysis of the Keylog information ...... 221

5.7.2. Analysis of the videos ...... 230

5.7.3. Group A ...... 230

5.7.4. Group B ...... 232

5.7.5. Group C ...... 233

5.7.6. Overall ...... 234

Chapter 6 –Analysis & Interpretation ...... 238

6.1. Summary and analysis of the demographic data ...... 238

6.1.1. Personal data ...... 239

6.1.2. Translation Experience...... 239

6.1.3. Experience with CAT tools ...... 240

6.1.4. Correlation between the different variables measured ...... 240

6.2. Quality ...... 245

6.2.1. Correlation with other variables ...... 245

6.3. Time ...... 251

6.3.1. Correlation with other variables ...... 251

6.4. Participants' attitudes and opinions towards the metadata ...... 258

Chapter 7 – Conclusions & Recommendations ...... 260

7.1. Summary of results ...... 260

7.2. Conclusion and recommendations ...... 261

7.3. Impact of the research ...... 262

7.4. Limitations and Future Research ...... 264

Bibliography ...... 267

List of Tables

Table 1. Zachman Framework applied to XLIFF 1.2 attributes ...... 44
Table 2. Attributes allowed in the alt-trans element ...... 45
Table 3. Group C. Participants' position ...... 111
Table 4. Age. Group A ...... 112
Table 5. Experience Years. Group A ...... 114
Table 6. Hours per day. Group A ...... 114
Table 7. Other translation-related activities. Group A ...... 115
Table 8. Experience Years. Group B ...... 124
Table 9. Hours per day. Group B ...... 125
Table 10. Other translation-related activities. Group B ...... 126
Table 11. Experience years. Group C ...... 134
Table 12. Hours per day. Group C ...... 136
Table 13. Other activities. Group C ...... 137
Table 14. Age. All participants ...... 145
Table 15. Experience years. All participants ...... 146
Table 16. Hours per day. All participants ...... 147
Table 17. CAT tools usage. All participants ...... 151
Table 18. Doubts. Group A ...... 167
Table 19. External resources. Group A ...... 167
Table 20. Doubts. Group B ...... 177
Table 21. External resources. Group B ...... 178
Table 22. Linguistic difficulty. Group C ...... 186
Table 23. Doubts. Group C ...... 188
Table 24. External resources. Group C ...... 189
Table 25. Doubts. Overall ...... 199
Table 26. External resources. Overall ...... 200
Table 27. Correlation between the demographic variables ...... 241

List of Figures

Figure 1. XLIFF Phoenix. General Filter ...... 52
Figure 2. XLIFF Phoenix. Advanced Filter ...... 53
Figure 3. Enriched XLIFF file in Swordfish II ...... 62
Figure 4. XLIFF Phoenix Architectural Diagram ...... 71
Figure 5. XLIFF Phoenix. Main screen ...... 72
Figure 6. XLIFF Phoenix. Loaded XLIFF source file ...... 73
Figure 7. XLIFF Phoenix. Loaded LMC file ...... 74
Figure 8. XLIFF Phoenix. Filtering option in the Wizard ...... 75
Figure 9. XLIFF Phoenix. General filtering options ...... 75
Figure 10. XLIFF Phoenix. Advanced filtering options ...... 76
Figure 11. Enriched file ...... 77
Figure 12. XLIFF Phoenix. Saving warning message ...... 78
Figure 13. Enriched XLIFF file in Swordfish ...... 80
Figure 14. Server Environment ...... 84
Figure 15. Main language combinations. Group A ...... 116
Figure 16. Other language combinations. Group A ...... 117
Figure 17. CAT tools usage. Group A ...... 118
Figure 18. TM Usage. Group A ...... 119
Figure 19. Swordfish. Group A ...... 120
Figure 20. XLIFF. Group A ...... 121
Figure 21. Main language combinations. Group B ...... 126
Figure 22. Other language combinations. Group B ...... 127
Figure 23. CAT tools usage. Group B ...... 128
Figure 24. TM Usage. Group B ...... 129
Figure 25. Swordfish. Group B ...... 131
Figure 26. XLIFF. Group B ...... 132
Figure 27. Main language combinations. Group C ...... 138
Figure 28. Other language combinations. Group C ...... 138
Figure 29. Total number of language combinations. Group C ...... 139
Figure 30. CAT tools usage. Group C ...... 140
Figure 31. TM usage. Group C ...... 142
Figure 32. Swordfish. Group C ...... 143
Figure 33. XLIFF. Group C ...... 144
Figure 34. Main language combination. All participants ...... 149
Figure 35. Other language combinations. All participants ...... 149
Figure 36. TM Usage. All participants ...... 153
Figure 37. Swordfish. All participants ...... 154
Figure 38. Topic of the text. Group A ...... 157
Figure 39. Experience with Microsoft products. Group A ...... 158
Figure 40. Experience with Excel. Group A ...... 160
Figure 41. Experience with Word. Group A ...... 161
Figure 42. Experience with PowerPoint. Group A ...... 162
Figure 43. Experience with Access. Group A ...... 163
Figure 44. Linguistic difficulty. Group A ...... 165
Figure 45. Topic of the text. Group B ...... 168
Figure 46. Experience with Microsoft products. Group B ...... 169
Figure 47. Experience with Excel. Group B ...... 170
Figure 48. Experience with Word. Group B ...... 171
Figure 49. Experience with PowerPoint. Group B ...... 172
Figure 50. Experience with Access. Group B ...... 173
Figure 51. Experience with Outlook. Group B ...... 174
Figure 52. Linguistic difficulty. Group B ...... 175
Figure 53. Topic of the text. Group C ...... 179
Figure 54. Experience with Microsoft products. Group C ...... 180
Figure 55. Experience with Excel. Group C ...... 181
Figure 56. Experience with Word. Group C ...... 182
Figure 57. Experience with PowerPoint. Group C ...... 183
Figure 58. Experience with Access ...... 184
Figure 59. Experience with Outlook. Group C ...... 185
Figure 60. Experience with Microsoft products. Overall ...... 191
Figure 61. Experience with Excel. Overall ...... 192
Figure 62. Experience with Word. Overall ...... 193
Figure 63. Experience with PowerPoint. Overall ...... 194
Figure 64. Experience with Outlook. Overall ...... 196
Figure 65. Linguistic difficulty. Overall ...... 197
Figure 66. Screenshot of one of participants' LISA QA Model results ...... 214
Figure 67. LISA QA Model results. Group A ...... 215
Figure 68. LISA QA Model results. Group B ...... 216
Figure 69. LISA QA Model results. Group C ...... 217
Figure 70. LISA QA Model results. Arithmetic mean all groups ...... 219
Figure 71. LISA QA Model results. Overall ...... 220
Figure 72. Time all groups ...... 236

List of Appendices

Appendix A – Demographic Questionnaire ...... 274

Appendix B –Task Questionnaire. Groups A and B ...... 283

Appendix C –Task Questionnaire. Group C ...... 289

Appendix D – Answers to the Demographic Questionnaire ...... 297

Appendix E – Answers to the Task Questionnaire. Groups A and B, and first part of Group C...... 309

Appendix F – Answers to the Task Questionnaire. Second part of Group C ...... 320

Appendix G – Translation Evaluation. Results ...... 331

Appendix H – Keylog Information ...... 376

Appendix I – Video Observations ...... 378

Appendix J – Translation Text ...... 380

Appendix K – Publications, conference presentations and other research related activities ...... 408

Appendix L – Call for participation ...... 412

Appendix M – Participant Information Sheet ...... 413

List of Abbreviations

AR: Arabic

CA: Catalan

CNGL: Centre for Next Generation Localisation

CSIS: Computer Science and Information Systems

DCM: Digital Content Management

DTP: Desktop Publishing

DU: Dutch

EN: English

ES: Spanish

ETSI: European Telecommunications Standards Institute

FR: French

GALA: Globalization and Localization Association

GL: Galician

IT: Italian

LISA: Localization Industry Standards Association

LMC: Localisation Memory Container

LSP: Language Service Provider

LQA: Language Quality Assurance

LRC: Localisation Research Centre

MT: Machine Translation

OASIS: Organization for the Advancement of Structured Information Standards

OAXAL: Open Architecture for XML Authoring and Localization Reference Model

OSCAR: Open Standards for Container/Content Allowing Re-use

PMD: Provenance Metadata

QA: Quality Assurance

SF: Systems Framework

TM: Translation Memory

TC: Technical Committee

Vkeys: Virtual Keys

XLIFF: XML Localisation Interchange File Format

XML: Extensible Markup Language

XSL: Extensible Stylesheet Language


Eu non se esvarío. Vexo o mundo darredor de min e adoezo por entendelo. Vexo sombras e luces, nubeiros que viaxan, lume, árbores. Qué é todo esto?

[I am not raving. I see the world around me and I ache to understand it. I see shadows and lights, drifting clouds, fire, trees. What is all this?]

(Neira Vilas 1961 p.28)

Chapter 1 – Introduction

1.1. Introduction

Localisation is the "linguistic and cultural adaptation of digital content" (Schäler 2009 p.157). The term2 is normally used to refer both to the technical processes that are needed to transform digital content into another language and culture, and to the powerful industry – involving software companies, Language Service Providers (LSPs), translators, localisation engineers, etc. – which is behind those processes.

2 A discussion of the different definitions of localisation through time can be found in the section "Literature Review".

The localisation industry

If we exclude the open source movements from the bigger picture, localisation is an industry-driven area, mainly led by the big software companies, which have based their expansion into foreign markets on this field: they need to localise their products in order to sell them in other areas of the globe where languages other than English are spoken. Ireland has for years been the leading country in this field; the sector is estimated to be worth 680 million euros annually in this country, employing around 14,000-16,000 professionals (CNGL 2012a p.6). Many of the largest software companies are based here (ibid), as well as the largest localisation companies and the biggest research centre, the Centre for Next Generation Localisation (CNGL). This country was also the birthplace of the XML Localisation Interchange File Format (XLIFF TC 2008a p.9), one of the key data exchange standards in this area.

The CNGL is an academia-industry synergy whose objective is "to produce substantial advances in the basic and applied research underpinning the design, implementation and evaluation of the blueprints for the Next Generation Localisation Factory" (CNGL 2012b). The centre is composed of more than 100 researchers based in four Irish universities (University of Limerick, Dublin City University, University College Dublin and Trinity College Dublin) and of 10 software and localisation industry partners. The research within the CNGL is divided into four main tracks: Integrated Language Technologies (ILT), Digital Content Management (DCM), Localisation (LOC) and Systems Framework (SF). The present research is embedded in the LOC research group based at the Localisation Research Centre (LRC), in the Department of Computer Science and Information Systems (CSIS) at the University of Limerick. In this research the help of the industrial partners was of key importance: they provided us with real data to be used in our experiments and assisted us in finding volunteers for those experiments.

The localisation process

If we study localisation as a process, we can identify several phases or steps that are needed to create a successful new localised product. These phases involve different players, different tools and different levels of technical experience. A good example of the steps that a localisation project can have can be found in Esselink (2000 p.17): "Pre-Sales Phase3, Kick-Off Meeting, Analysis of Source Material, Scheduling and Budgeting, Terminology Setup, Preparation of Source Material, Translation of Software, Translation of Online Help and Documentation, Engineering and Testing of Software, Screen Captures, Help Engineering and DTP of Documentation, Processing Updates, Product QA and Delivery and Project Closure". As can be observed from the different steps described above, this field has a multidisciplinary nature: it combines the area of linguistics (necessary to transfer knowledge from one language and culture to the other) and computer science (as it deals with digital content which needs to be manipulated and transformed into other systems). It also has dependencies on other areas such as marketing, desktop publishing and workflow management.

3 Esselink defines the steps needed to localise a software product. Localisation, as we understand it, does not only involve this kind of product, and different steps would be needed for other types of products, e.g. a website.

The translation task and CAT tools

After acknowledging the vast area of study that localisation represents, and given the background of the researcher, we decided to focus specifically on the translation task. The translation task in the localisation process involves professional translators4, and Translation Memory (TM) tools are the standard tool option (García 2008 p.50). Computer Assisted Translation (CAT) tools are computer-based programs designed to assist translators in their daily task by automating some of the processes. They can have a wide range of functionalities: terminology management, translation memory systems, spellcheckers, quality assurance tests, etc. The purpose behind these tools is to help translators to automate their work, improve their productivity, reduce costs and ensure more consistent terminology in their output products. In this research we focus only on CAT tools that have translation memory functionality, and we refer to them interchangeably as CAT tools or TM tools throughout this dissertation. Although other functionalities or aspects of CAT/TM tools could also represent an interesting area of research, studying them is outside the scope of this research.

4 Although we can see crowdsourcing initiatives, such as the localisation of the Facebook platform, that involve bilingual users.

This is an industry that, whatever name it uses, is based on selling lots of translated words, with quality often taken for granted, time-to-market an important constraint, and price paramount. (García 2006 p.15)

The objective of the CNGL project is to improve the localisation process along the three traditional commercial axes: better quality (which García notes is normally taken for granted), less time (the above-mentioned "time-to-market") and lower cost (the aforementioned "price paramount"), together with an increase in automation, which is perceived by many to be the only means to deal with the future demands of an increasing use of the web by a varied language user base.

Translation Memory tools and formats

The automation and leveraging of previous translations have been identified as one of the key solutions (TAUS Data Association 2010 p.1) for addressing the above-mentioned challenges in this area. Translation Memory tools are CAT tools whose main functionality is the management of translation memories. Translation Memories (TMs) are databases composed of previously localised translations, which are segmented and stored for future reuse. Many agree that translation memories have proven to help translators and increase their productivity and the consistency of their texts (Brkić et al 2009). However, TMs are not a panacea for all localisation processes; it has been suggested that certain text types work better with TM systems than others (Christensen 2003; Bowker 2005; Christensen and Schjoldager 2010), and other studies also seem to establish a relation between the translation of technical texts and a higher use of TM systems by translators (Fulford and Granell-Zafra 2005; Lagoudaki 2006a). This relation could be explained by the nature of technical texts themselves, which use a limited range of terms and normally contain lexical and phraseological repetitions, all of which makes them the most suitable text type for TM usage (Bowker 2005; Christensen and Schjoldager 2010). Critics of the use of TMs have also pointed out additional constraints that this kind of tool imposes on the translator: the possible perpetuation of errors and confusion (de Saint Robert 2008 p.114) and the necessity of spending additional time training translators in the usage of the tool in order to obtain improvements in terms of productivity and consistency (de Saint Robert 2008 p.115; García 2008 p.55).

Translation Memory and Data Exchange Standards: TMX and XLIFF

The format in which translation memories are stored clearly determines their future reuse. In order to avoid being locked in to proprietary solutions, a standardised TM format, the Translation Memory eXchange (TMX), was developed by the Open Standards for Container/Content Allowing Re-use (OSCAR) group of the now-defunct Localisation Industry Standards Association (LISA). This format is an XML language which allows the storage of translation units for their later leverage in future projects. TMX has proved to be very successful and is supported by the main CAT tools in the industry as well as in the open source world (Gough 2010), although, with the dissolution of LISA in February 2011, its future is still uncertain. A complete review of the TMX standard and its support in CAT tools can be found in the section Standards of Localisation in Chapter 2.
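For illustration, the sketch below shows the basic shape of a TMX document holding a single English–Spanish translation unit. The segment text, tool name and language codes are invented for the example, but the element structure (header, body, tu, tuv, seg) follows the TMX specification:

<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="ExampleTM" adminlang="en"
          srclang="en-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-US">
        <seg>Click the File tab.</seg>
      </tuv>
      <tuv xml:lang="es-ES">
        <seg>Haga clic en la pestaña Archivo.</seg>
      </tuv>
    </tu>
  </body>
</tmx>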

Another standardised data exchange format in the localisation area, which was not originally designed to serve as a TM but can be used for that purpose, is the XML Localisation Interchange File Format (XLIFF). In a nutshell, XLIFF is a data container that carries localisation content from one localisation process to the next without loss or corruption of data. The first version of this standard, developed by the Organization for the Advancement of Structured Information Standards (OASIS), was published in 2001. The current version is 1.2 and it is moving towards 2.0. XLIFF was created as a standardised intermediate format for localisation in which localisable data extracted from different formats can be stored; this allows CAT tool developers to work with a single format instead of having to develop a new filter/parser for every new format, and they can therefore concentrate their efforts on improving other functionalities. The main feature of XLIFF is that it enables potential interoperability5 between different tools, thus eliminating vendor lock-in. An XLIFF file is basically composed of two differentiated sections: the <body>, where the localisable information extracted from another file is segmented and stored in several translation units (<trans-unit> elements); and the <header>, where several metadata items related to the original file and the localisation process are included. One of these items is the <skl> element, where information on how to recreate the original file is included. XLIFF files also allow the inclusion of workflow information (in the <phase-group> element), supplementary material (translation memories or glossaries, which can be stored in the <reference> and <glossary> elements) and other customised data through extension points. XLIFF files are therefore created from other native source files (through specialised filters) and can receive complementary localisation data in the process, rather than being developed independently. Although XLIFF was not initially designed to be used as a TM format, once an XLIFF file is localised, its structure and the information it contains represent a rich database of localised content and corresponding metadata that can be leveraged in future projects, provided that adequate tools are in place to recover that information. The first stage of our research answers this question: how can we recover the data and metadata kept in translated XLIFF files for later reuse? More information about the methodology put in place to answer this question can be found in Chapter 3.

5 In the localisation context, a file is "interoperable" if it can be transported and/or modified between different tools without loss or corruption of data; the XLIFF standard aims to provide a format that allows the creation of files with this feature.
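To make the structure described in this section concrete, the following is a minimal, hand-written XLIFF 1.2 document (not taken from the experiment data; the file names, date and segment text are invented). The <header> carries the skeleton reference and process metadata, while the <body> holds the extracted text in <trans-unit> elements:

<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="help.html" source-language="en-US"
        target-language="es-ES" datatype="html">
    <header>
      <skl>
        <external-file href="help.html.skl"/>
      </skl>
      <phase-group>
        <phase phase-name="t1" process-name="translation"
               contact-name="A. Translator" date="2011-12-01T10:00:00Z"/>
      </phase-group>
    </header>
    <body>
      <trans-unit id="1">
        <source>Click the File tab.</source>
        <target>Haga clic en la pestaña Archivo.</target>
      </trans-unit>
    </body>
  </file>
</xliff>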

Research on Translation Memories

The influence of TMs on the translator's behaviour has been the subject of some recent research efforts (Christensen and Schjoldager 2010). With the exception of Teixeira (2011), little has been studied about the metadata that surrounds the translation suggestions presented to translators through TM tools; this metadata can represent provenance information (about the author, date, state, or any other identifying information of the previous translation) or other additional information regarding the translation suggestion, such as the match percentage. To narrow the scope of our research we concentrate only on the provenance metadata. Although provenance metadata was not the central object of study of previous research, we did find some references to this kind of information in some of it: in Guerberof (2009) an experiment comparing translation suggestions coming from two different sources (translation memories and machine translation) is presented, and the provenance metadata that could identify the origin of each of the suggestions is hidden on purpose "because translators would ignore the nature of the source text, be it MT or TM, and thus they would not be biased towards either type of text during the post-editing process" (ibid); this statement clearly implies that provenance information could have an impact on the translator's behaviour. In another study, by de Saint Robert (2008 pp.113–114), the author, when talking about CAT tool constraints, points out (as a negative factor) that commercial tools do not always guarantee "document traceability" of the translation suggestions they provide. We can again interpret that not having information about "document traceability", or as we call it "provenance metadata", has an impact (in this case negative) on the translator's behaviour. Later in her paper, the author states:

“Additional tools are document alignment tools by language pairs. Indexing of large text corpora for retrieval of precedents are felt preferable to tools that provide text segments, be they paragraphs, sentences or sub-units with their respective translation, but6 without any indication of date, source, context, originator, name of translator and reviser to assess adequacy and reliability in an environment where many translators are involved.” (de Saint Robert 2008 p.118)

Focusing on the last part of that paragraph, we can infer that provenance information ("indication of date, source, context, originator, name of translator and reviser") helps translators to assess the adequacy and reliability of a given translation suggestion. Having identified this gap in the literature of our field of research – to our knowledge, no research efforts have focused on the study of the provenance metadata that surrounds translation suggestions – we decided to focus on studying the influence of the provenance metadata of translation suggestions on human translators during the translation process.

6 Emphasis added by the researcher.

Provenance metadata

Metadata is generally defined as data about data; in the area of computer science, metadata is normally used to indicate data that describes or carries information about other data (for example, a creation date). Provenance metadata is data which contains information about the origin of other data. In our research, our main object of study is the provenance metadata that surrounds translation suggestions, that is, information about the origin of a translation suggestion: for example, who translated it or when it was translated. Provenance metadata is not normally exposed to translators during their work, and data exchange standards still do not offer the possibility of adding most of the metadata items (which we identified as containing provenance information) to the elements that contain the translation suggestions in a standardised form that could be understood by different CAT tools. More information about our study of provenance metadata can be found in Chapter 3.

1.2. Research Questions – Hypothesis

At a very early stage of our research, our initial question was "how can we store, organise and reuse localisation knowledge?". In order to answer this question – that is, how localisation knowledge is stored and transmitted across different processes and over time – we decided to focus on the existing localisation data exchange standards. These standards have been specifically developed to allow the reuse of previously localised content and, by doing so, they help to increase the productivity of translators. After studying the different standards, we decided to focus only on XLIFF as a vehicle to investigate how data and metadata could be stored and transmitted to future assignments. As is often the case, our research question became more refined as we understood more about the problem domain.

We identified a gap in the reuse of the provenance metadata: most of the provenance information we identified in our analysis of the standard could not be reinserted in the alt-trans element, which is the element that can contain translation suggestions. Therefore we decided to demonstrate that this metadata could be inserted automatically and presented to translators during their work, and we designed and developed a tool prototype (XLIFF Phoenix) for that purpose. Having demonstrated that provenance metadata could be stored, organised and presented to translators, we decided to refine our research objectives and study the influence that provenance metadata can have on the translator's behaviour during their work. A complete description of all the steps we carried out to answer our initial research question can be found in Chapter 3.

As previously stated, after an initial broader question, the principal aim of this research became to determine the influence that translation suggestions' provenance metadata has on the behaviour of human translators during their work when using Computer Assisted Translation tools. The main question that we want to answer in this research is the following:

How does the provenance metadata that surrounds translation memory suggestions influence the behaviour of translators during their work in a localisation process?

Three hypotheses were considered for this main question.

H0- Provenance metadata has no effect on the behaviour of translators during their work.

H1- Provenance metadata has a positive impact on the behaviour of translators during their work.

H2- Provenance metadata has a negative impact on the behaviour of translators during their work.

Based on the quotation from de Saint Robert (2008 p.118) given above, we tended to believe that PMD can have a positive impact on the translator's behaviour (H1). By positive impact we mean a reduction in the time spent on the translation, an improvement in the quality of the translation and, through a combination of these two, a reduction in the cost of the task, in comparison with the same situation in the absence of the provenance metadata. Through the experimental phase of our research, explained in detail in Chapter 4, the impact on the translator's behaviour was measured using a triangular data collection method, in which we combined questionnaires, recording of the screen, keystroke information and the translated file itself. By negative impact we mean negative attitudes on the part of the translators towards the metadata provided (which could be detected through some of their answers to the questionnaires), an increase in the time spent on the translation, or worse quality in the translation task. Each of these indicators was analysed independently, and they are presented in Chapters 5 and 6.

1.3. Methodological Approach

A method triangulation strategy was used in our study. We first followed a "design and creation" strategy (Oates 2006 p.108): we studied how localisation knowledge was being stored, organised and reused, which led us to the study of the current localisation data exchange formats. We found a gap in the treatment that provenance metadata was receiving: in some cases it could not be encapsulated along with the translation suggestion without breaking the validity of the file. Therefore, we decided to design and develop a tool prototype (XLIFF Phoenix) that automated the leveraging of localisation data and metadata into new files to be presented to translators within a CAT tool. Once we had obtained an automated way to retrieve localisation provenance metadata and insert it along with the translation suggestions in new files, we decided to test whether these new enriched files (with provenance metadata) had an influence on the translator's behaviour (and subsequently on his or her output work) during the translation task in the localisation process. We used an "experimental strategy" (Oates 2006 p.126) to answer this second and main question.

Object of study: Tools vs. Standards

Is metadata defined first in the standards and then implemented in CAT tools? Or, on the contrary, are CAT tool providers influencing the development of standards in order to introduce the metadata items that they use or want to use in their tools? Taking this dichotomy into account, we identified two possible approaches to studying the provenance MD in translation suggestions:

1. Study how current CAT tools show provenance metadata to translators. That is, studying what type of data is presented and how it is presented in the GUI to their potential users (project managers, translators, reviewers, etc.).
2. Study the current standards for exchanging translation and localisation data and observe which provenance metadata items are defined in their specifications, which items are missing, and if that is the case, how they can be incorporated.

The main issue with the first approach is the rapid and constant development of CAT tools, which would make the object of study of this research obsolete by the time it was published; this represents one of the biggest impediments to research on TM tools (Pym 2011 p.5). Besides, most of these CAT tools are developed by private companies, which implies that we do not have direct access to their design and development strategies, and we did not want our research to rely on something over which we have no control. Taking into account the estimated duration of our PhD research and the fact that we are not members of any CAT tool development group, we decided to discard this option.

The second approach – concentrating our efforts at the standard level – was the one chosen. First of all, because the two main standards considered from the beginning were, and are, developed in an open way (they are open standards by definition); this means that we have access to the technical committee documentation and other related official material (such as the development wiki, meeting minutes, etc.). In addition, I have been a member of the XLIFF Technical Committee since the beginning of this research (March 2009), which has provided me with a better insight into its development, and the possibility of having a direct influence on it. Having said that, CAT tool developers participate actively in the development of translation data exchange standards – 23% of XLIFF TC members in April 2012 were CAT tool developers (Filip 2012 p.33) – therefore we can also assume that their requirements in terms of metadata inclusion in the standard are regularly presented to the Technical Committee for approval. On the other hand, standards also influence the development of CAT tools, as every time a new version of a standard is released a CAT tool might need to be modified if the developer wants to support the new version. Lastly, another reason for choosing this approach is the much slower rhythm of development of open standards, which not only keeps this research valid for longer, but also means that the results of the research itself could directly influence the development of the standard.

DESIGN AND CREATION

The XML Localisation Interchange File Format (XLIFF) was developed to allow interoperability between CAT tools and the seamless transmission of data and metadata during the localisation process (XLIFF TC 2008a). It is a bilingual document format that may contain parallel data inside the source and target elements of its translation or binary units. These parallel texts can easily be transformed into TMX using an XSL template (Raya 2004) and reused in future projects. However, much of the existing data and metadata will be lost during that process, as the TMX format was not designed to capture as much data as an XLIFF document can contain. In our research we use the whole XLIFF document as a memory element. We have developed a tool prototype – XLIFF Phoenix – which allows the leveraging of existing translations into unlocalised XLIFF files.
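As a rough illustration of how simple that transformation is, and of how much it discards, the following XSLT sketch (a simplified example written for this discussion, not Raya's actual template) copies each translated trans-unit into a TMX tu and leaves the skeleton, phases and all other header metadata behind. The language codes and tool name are hard-coded assumptions, and any inline markup within segments is flattened to plain text:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="urn:oasis:names:tc:xliff:document:1.2">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <tmx version="1.4">
      <header creationtool="xliff2tmx" creationtoolversion="0.1"
              segtype="sentence" o-tmf="XLIFF" adminlang="en"
              srclang="en-US" datatype="xml"/>
      <body>
        <!-- one TMX translation unit per translated XLIFF trans-unit -->
        <xsl:for-each select="//x:trans-unit[x:target]">
          <tu>
            <tuv xml:lang="en-US"><seg><xsl:value-of select="x:source"/></seg></tuv>
            <tuv xml:lang="es-ES"><seg><xsl:value-of select="x:target"/></seg></tuv>
          </tu>
        </xsl:for-each>
      </body>
    </tmx>
  </xsl:template>
</xsl:stylesheet>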

The Localisation Memory Container (LMC) is an XML vocabulary that was developed as a data descriptor to allow the storage of previously localised XLIFF documents within a single file. XLIFF Phoenix was developed to obtain information from the LMC and enrich unlocalised XLIFF documents with translation suggestions.

XLIFF Phoenix compares the XLIFF documents contained in the LMC with an unlocalised XLIFF file introduced into the system, and leverages the matching translation units along with their corresponding metadata (origin, source and target language, etc.); both the data and the metadata are introduced in an alt-trans element inside the corresponding trans-unit element. The tool then exports a valid XLIFF file that can be read by most of the CAT tools available on the market7 and can subsequently be used by translators and/or localisers, who benefit from the introduced data.

7 The resulting file is a valid XLIFF 1.2 file; therefore those tools that support this version of the standard will be able to work with it.
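A hypothetical example of such an enriched unit is shown below. The segment text and attribute values are invented, and the exact set of metadata items that XLIFF Phoenix transfers is described in Chapter 3; the point is simply that the leveraged translation and part of its provenance travel inside an alt-trans element nested in the matching trans-unit:

<trans-unit id="1">
  <source>Click the File tab.</source>
  <!-- translation suggestion leveraged from a previously localised XLIFF file -->
  <alt-trans origin="ExcelHelp_2010.xlf" match-quality="100%" xml:lang="es-ES">
    <source xml:lang="en-US">Click the File tab.</source>
    <target xml:lang="es-ES">Haga clic en la pestaña Archivo.</target>
  </alt-trans>
</trans-unit>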

EXPERIMENTS

Few experimental studies have been carried out in the research field of translation memories over recent years (Bowker 2005; Christensen and Schjoldager 2010). There are various reasons for this shortage, such as the difficulty of obtaining valid participants and useful research objects and supporting data (valid documents and TMs), the different rates of development between academia and industry (Pym 2011 p.5) and the difficulty of controlling all the different variables needed to achieve an externally valid experiment. Experiment variables, as far as CAT procedures are concerned, include the profiles of participants, workflow, time pressure, text types, translation instructions, TM/MT programs and language combinations chosen (ibid, p.7), as we encountered in the design of our approach. However, none of these constraints discouraged us from continuing with the experimental approach, and we tried to overcome them with all the means and resources we had available. To test the validity of our methodology we carried out a pilot study in December 2010 with translation students. One year later, in December 2011 and January 2012, the main experiment took place with the participation of professional translators.

Pilot Study

The experiment was carried out in a computer room in the Faculty of Translation and Documentation of the University of Salamanca, Spain. Although all participants worked simultaneously in the same room, they were not allowed to communicate among themselves. The participants in this experiment were ten translation students (nine female and one male) in the final year of their Translation and Interpreting undergraduate degree course.

The text given to the translators was donated by Microsoft, one of the CNGL's industrial partners; the text was part of the official MS Excel help documentation. We converted the original HTML file into XLIFF and leveraged past translations into it with our tool (XLIFF Phoenix), along with their corresponding metadata. We created three versions of the document, including different levels of data and metadata at different points. We split the participants into three groups, A, B and C, and each group received a different version of the master enriched file.

Prior to the translation task, general instructions were given to the participants for their assignment. They were asked to complete the translation from English into Spanish (es-ES) of one file using Swordfish II, plus any other useful resource they might access via a browser and internet connection. They were told to behave as they would in a real translation assignment scenario, the only restriction being that communication between them was forbidden.

In order to obtain the maximum amount of data from the experiment, and subsequently increase the validity of its results, we decided to use a triangular data collection mechanism: questionnaires, recording of the screen and the output translated document. Two questionnaires were completed by the participants: the first aimed to obtain background information about the participant and was completed before the translation task; the second focused on getting the translator's impressions of the completed task and was filled in after the task. We obtained different types of data from the various data collection methods, which needed to be analysed in different ways: the quality of the output translation was analysed using the LISA QA Model, which is based on an error-based system that deducts points from an initial 100% rate depending on the type of error found and its severity; the video recordings were observed using a video player program, and information about the translator's behaviour and the time spent on each of the segments was extracted and annotated; and finally, the data obtained from the questionnaires was analysed depending on the nature of each question: quantitative or qualitative.
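To illustrate this error-based scoring with purely assumed figures (the weights below are invented for the example and are not necessarily those applied in this study): if a reviewed translation contained three minor errors weighted at 1 point each and one major error weighted at 5 points, the deduction would be 3 × 1 + 1 × 5 = 8 points, giving a final quality score of 100% − 8 = 92%.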

Although our focus was not on studying the results of the experiments but on studying the validity of the methods used, we did extract some findings from this preliminary study. It was deduced from the participants' answers to the retrospective questionnaire that some metadata elements were taken into consideration more than others (contact-name, target-language and date were the most consulted metadata items), and there was an observed increase in quality (better LISA QA Model rates) and productivity (less time spent) in the sections that contained metadata information. More detailed information about the design of the pilot study can be found in Chapter 4.

Main Study

The main study was carried out one year after the pilot study. In this new iteration our target participants were professional translators. Translators traditionally work in two modes in the localisation industry: as in-house translators, working in the offices of a translation company or within the translation department of a bigger company (such as a software company); or as freelance translators, working from their own homes and receiving projects from different companies (commonly translation companies). Firstly, we explored the possibility of conducting our main experiment in a translation company with in-house translators; this option was soon discarded for two main reasons: first, because after consulting our CNGL partners we learned that none8 of them had a sufficient number of translators with the language combination (EN-ES) in a single place, and secondly, because conducting the experiments in an external translation company would be economically unfeasible with our resources. We then decided to target freelance translators, who work for translation companies from their own homes, usually using their own CAT tools on their own computers (although there are already some cloud-based translation platforms which are accessible via a web browser). Participants were recruited through the CNGL partners and through an open call for participation that was sent to several translation and localisation mailing lists.

8 Or at least those that answered our request.

A free webinar on the CAT tool Swordfish was offered to all the participants. This webinar attracted some of the participants and also allowed us to inform them about the basic functionality of the tool and the process of the experiment itself. Participants did not receive monetary compensation for their time, but three full licenses for Swordfish III were raffled among the participants who completed the experiment.

A controlled environment was designed to carry out the translation task. A server was set up in the Localisation Research Centre and 100 user accounts were created. The server was remotely accessible. The operating system was Windows 7, with Spanish as the main input language. After carrying out our own internal testing, the system was technically tested with seven participants in November 2011, with positive results. The server could only be accessed by one person at a time; therefore we used the doodle9 platform to allow participants to choose the timeslot that best suited their needs. An average of four participants completed the experiments per day. A total of 59 valid experiments were successfully recorded.

9 http://www.doodle.com/

We used the same data gathering process as in the pilot study (questionnaires, video recording and output translation file). On this occasion we decided to use a different screen recording program, which allowed us to introduce a new data gathering method: keystroke logging. The keystroke data, obtained through an XML file produced by the recording program, allowed us to automatically measure the typing effort of each of the participants.

A different distribution of participants and data was put in place for this experiment. Instead of having three documents, each containing three sections (one without TM, another with TM, and another with TM plus provenance MD) placed in three different positions depending on the group, we decided to simplify our choices and, subsequently, their later analysis: Group A received a text without TM, Group B received the same text with TM, and Group C received the same text with TM and its corresponding provenance MD. A new text was created from even segments (in terms of difficulty and length) of the Microsoft Excel help documentation.

You can find more detailed information about the design of the main experiment and how the text was composed in section 2 of Chapter 4.

Although our target participant was a professional translator, we did not state that in the call for participation, and we left the experiment open to translation students as well. In Group A (the group that had the text without TM), which represented our control group, we had ten participants, seven of whom were professional translators. In Group B (the group that had the text with TM), we also had ten participants, seven of whom were professional translators. Finally, in Group C (the group that had the text with TM and provenance MD), we had 39 participants, 25 of whom were professional translators. The larger number of participants in Group C was due to our desire to obtain a significant amount of qualitative data on the use of provenance metadata (through the retrospective questionnaire), as Group C was the only group exposed to that information.

One of the main threats to the validity of our experiments was the lack of representativeness of the participants and the possible subject variability between the groups, which could distort the results. We measured different personal variables that could affect our results: personal data (age and gender), translation experience (years of experience, current position, translation working hours per day) and experience with CAT tools. After performing a descriptive analysis of our dataset (which can be consulted in Chapter 5) we determined that there were no significant differences between the average measurements of the three groups. Therefore, differences observed between the groups in the two factors measured in the experiments (time and quality) could not be directly attributed to differences in the profiles of the participants.

As stated before, we measured the time spent and the quality achieved by each of the participants in their translation task. In terms of quality, Groups B and C obtained significantly better quality scores than Group A, which indicates that the use of a translation memory (with or without metadata) implies an improvement in the quality of the translation. Group B (the one without metadata) obtained slightly better results than Group C; however, the difference was not significant enough to infer any cause-effect relationship. In terms of time we obtained similar results: Group A spent more time than Groups B and C, which also indicates that the use of a translation memory (with or without metadata) implies a reduction in time. Group B spent less time than Group C; however, again the difference was not significant enough to draw any conclusion. A complete analysis and interpretation of the data obtained in the experiments can be found in Chapter 6.

We also studied the attitude that translators from Group C had towards the metadata they received: the majority of them would prefer to receive translation memories with provenance metadata. However, not all the participants declared that the metadata helped them during their work. Two key concepts arose from their answers: trust and reliability. They stated that the quantity of the metadata was not important; what was of key importance for them was the meaning that those metadata items had for them. A complete analysis and interpretation of the participants' attitudes can be found in Chapter 6.

To sum up, in our experiments we did not find any significant difference in the behaviour of the translators, in terms of quality and time, due to the presence or absence of the metadata. However, we found significant evidence that a translation memory (with or without metadata) can improve translators' output (better quality in less time). We did not find negative attitudes among the participants that would imply a negative impact on their behaviour, although not all the participants stated that the metadata actually helped them.

1.4. Motivation

There have been some research efforts in the area of TMs and how they affect the behaviour of the translator during his or her work (Christensen and Schjoldager 2010). These studies analyse the effect of the translation memories and only take into account the translation suggestion that is given to the translator. With the exception of the initial work of Teixeira (2011), discussed in the literature review section, the provenance metadata that surrounds translation suggestions has never been the central focus of any of those studies.

The present research aims to fill that gap and set a solid base for future research in this field applying different research strategies and methods, such as eye tracking.

1.5. Thesis Layout

The structure of this dissertation is as follows:

Chapter 2 – Literature Review. This chapter contains a historical literature review of the localisation field, with specific stress on localisation standards and previous research on translation memory.
Chapter 3 – Methodology I. In this chapter the general triangular methodology strategy is presented. The first part of this methodology, "Design and Creation", is then explained in detail: the development of the LMC language and the XLIFF Phoenix prototype tool.
Chapter 4 – Methodology II. This second methodology chapter explains in detail the design and implementation of the pilot study and the main experiment: recruitment of the participants, data preparation, issues encountered, setup of the server, tools used and data collection methods.
Chapter 5 – Results. A descriptive analysis of the data obtained in the experiments is presented in this chapter.
Chapter 6 – Analysis & Interpretation. An analysis of the data is presented in this chapter, along with an interpretation of its results.
Chapter 7 – Conclusions and Recommendations. A summary of the research results can be found in this chapter, as well as recommendations for future developments of this research.

Chapter 2 – Literature review

Writing about software localization is like fighting against time. (Esselink 2000 p.preface)

2.1. Introduction

Localisation10 and its industry form a relatively new area that goes back to the 1980s; since then it has grown dramatically, thanks in part to the development of the internet and the so-called globalisation of the world's economy (Esselink 2000 pp.5–6). It is an industry-oriented discipline, and in terms of academia there have been few research efforts, especially when compared with other related areas; in fact, only one research centre specifically devoted to localisation (the Localisation Research Centre at the University of Limerick) existed until very recently. The first academic journal specific to the topic was also edited by the LRC. In 2008 the CNGL was founded in Ireland, and four Irish universities (University of Limerick, Trinity College Dublin, University College Dublin and Dublin City University) are leading the research in the field, with more than 300 specialised publications since its establishment (CNGL 2012c).

This chapter presents a review and discussion of the different definitions of localisation through the recent history of the field. Then a review of the main standards of localisation is presented along with an overview of the existing research in CAT tools and translation memories.

2.2. Defining Localisation

Localisation is a new term and concept, as Esselink (2000 p.1) explains “[t]he term ‘localization’ is derived from the word ‘locale’, which traditionally means a small area or vicinity. Today, locale is mostly used in a technical context, where it represents a specific combination of language, region, and character encoding.”

Many authors and institutions have tried to define the concept of "localisation"; in the following paragraphs we present and discuss some of their most representative contributions:

10 In this dissertation we follow British English spelling conventions. However, in quotes by other authors we respect their writing preferences.

The Localization Industry Standards Association (LISA)11 has defined localisation as follows:

Localization is the process of modifying products or services to account for differences in distinct markets. (Lommel 2006 p.11)

This definition is too generic for the purposes of this research and does not specify the type of products that are "localised". A more extended definition can be found on the same organisation's webpage:

Localization refers to the actual adaptation of the product for a specific market. It includes translation, adaptation of graphics, adoption of local currencies, use of proper forms for dates, addresses, and phone numbers, and many other details, including physical structures of products in some cases. If these details were not anticipated in the internationalization phase, they must be fixed during localization, adding time and expense to the project. In extreme cases, products that were not internationalized may not even be localizable. (LISA 2009)

Again, the type of product is not clearly defined in this definition. This might be done on purpose to cover as many products as possible. However, for the purposes of this research we need to narrow our field.

After examining LISA's definition, which is by far the most accepted and quoted one in our area, and given that it does not fit our research objectives, we will make a historical review of the different attempts to define localisation, discuss them, and finally come up with our own definition.

In the year 2000 Robert C. Sprung edited the book Translating into Success, and despite its title, its main topic is the localisation process. In the introduction of the book Sprung defines the term localisation as follows:

[L]ocalization—taking a product (ideally, one that has been internationalized well) and tailoring it to an individual local market (e.g., Germany, Japan). “Localization” often refers to translating and adapting software products to local markets.

(Sprung and Jaroniec 2000 p.x)

11 See the section "Standards of localisation" for more information.

The definition does not differ much from the previous LISA one; in fact, there is a relation between them, as Michael Anobile, founding member and managing director of LISA at that time, wrote the foreword. However, in the second sentence of Sprung's definition the product is referred to as "software", and it is the first time that software is mentioned in connection with localisation.

Also in the year 2000, Bert Esselink wrote one of the most influential and quoted books in localisation, A Practical Guide to Localization, where the author defines the term "localisation" in the introduction as follows:

Generally speaking, localization is the translation and adaptation of a software or web product, which includes the software application itself and all related product documentation. (Esselink 2000 p.1)

In this definition the product that LISA mentions has been narrowed down to a "software or web product", which is more specific and better suited to our research. It is not only "software" but also "web products", and this is something that Esselink added with respect to a previous version of the book, titled Practical Guide to Software Localization, from 1998. The author acknowledges that with the rise of the internet, the term localisation should cover other areas, "such as web sites or 'traditional' documentation" (Esselink 2000 p.preface).

Four years later, in 2004, Anthony Pym, an academic in the field of translation studies, wrote the book The Moving Text: Localization, translation and distribution. In this book Pym does not talk about products, but about "texts". In his theory "text" is the key concept, and "translation" is not relegated to being merely one of the steps of the localisation process, but occupies a central position. There is an attempt to define the concept of localisation in the first part of the book:

(…). In this very practical sense, localization is the adaptation and translation of a text (like a software program) to suit a particular reception situation. (Pym 2004 p.1)

We could include Pym in what Morado et al (2009) defined as TOLS (Translation-Oriented Localisation Studies), where the central object of study in the localisation area is the translation process, and it is from that point of view that we would like to approach our research.

In 2006, the book Perspectives on Localization was published, addressing most of the topics and challenges that the growing discipline was encountering: terminology management, localisation education, and localisation standards. The book's editor, Dunne, defined localisation as:

[T]he process by which digital content and products developed in one locale (defined in terms of geographical area, language and culture) are adapted for sale and use in another locale. (Dunne 2006 p.115)

In this book the concept of "digital content" was introduced for the first time in connection with the localisation process. The general concept of "products" is abandoned, and the more specific term "digital products" is introduced, but the author does not specify which type of digital content he is referring to (e.g. software or web products). However, the idea of product is still present, and the purpose of localisation here is the sale of the product. Localisation of open source software, for example, would be excluded from this definition.

In 2009 the second edition of the Routledge Encyclopedia of Translation Studies edited by Mona Baker and Gabriela Saldanha was published. In the previous edition (from 1998), “localisation” was not present as an entry in the encyclopedia. In the second edition the entry for localisation was written by Reinhard Schäler, director of the Localisation Research Centre at the University of Limerick. Schäler defines localisation as follows:

Localization can be defined as the linguistic and cultural adaptation of digital content to the requirements and locale of a foreign market, and the provision of services and technologies for the management of multilingualism across the digital global information flow. (Schäler 2009 p.157)

Dunne’s “digital content” is also present in this definition. In fact, Schäler explains in the same article that “what makes localization, as we refer to it today, different from previous, similar activities, [is] namely that it deals with digital material. To be adapted or localized, digital material requires tools and technologies, skills, processes and standards that are different from those required for the adaptation of traditional material such as paper-based print or celluloid (…).” [Emphasis in the original].

The commercial purpose is still present in this definition with the word "market", which gives the impression that localisation cannot exist without an economic objective. This is not always the case, as many open source localisation efforts have been carried out successfully in recent times (Diaz Fouces and García González 2008). Instead, we prefer to use Pym's ambiguous "particular reception situation", which does not contain commercial connotations but helps us to understand the idea of change. Taking into account what has been said above, we should admit that it is understandable that most authors use the words "product" or "market", because localisation has been a very business-oriented discipline since its beginning, and this aspect cannot be avoided.

Schäler’s definition and his idea of the digital content suit the objectives of this current research perfectly. However, what is digital content? Is it software and web products as Esselink said? Or software alone? The answer to these questions will be discussed in the next section, under the title “Digital content”.

In summary, for the purposes of this research we understand localisation as "the linguistic and cultural adaptation of digital content."

(Adapted from Schäler 2009 p.157)

2.3. Digital Content

A bit has no color, size, or weight, and it can travel at the speed of light. It is the smallest atomic element in the DNA of information. It is a state of being: on or off, true or false, up or down, in or out, black or white. For practical purposes we consider a bit to be a 1 or a 0. The meaning of the 1 or the 0 is a separate matter. In the early days of computing, a string of bits most commonly represented numerical information. (Negroponte 1995 p.14)

It has been stated before that localisation implies the linguistic and cultural adaptation of digital content (adapted from Schäler 2009 p.157). In this section the terms "digital products" and "digital texts" will be used interchangeably to refer to "digital content". The discussion about the purposes of the content (commercial or otherwise) will not be raised in this section.

In this section we will specify what "digital" is and the consequences this has for our field, because we can predict that this digital nature will determine the way content is created, stored, modified and reused; storage, organisation and reuse are the three main axes that we started to study in our research: how can we store, organise and reuse localisation knowledge?

Nicholas Negroponte, co-founder and former director of the MIT Media Laboratory, wrote a visionary book, Being Digital, in 1995, which tried to explain the new digital era and its implications for our lives. In the book, Negroponte distinguishes between atoms and bits: atoms are things that have always been around us and that we can touch (for example, a book or a coffee machine); bits, on the other hand, are everything that can be represented in a digital form (in binary code) and cannot physically be touched (for example, a software program or a document created with a word processor).

In the following lines, we present six definitions of "digital" from computer science-specialised books and dictionaries:

[D]igital[:] Pertaining to digits or to showing data or physical quantities by digits.

(Marotta 1986 p.123)

Digital[:] A term describing devices such as a computer that process and store data in the form of ones and zeros. In a positive logic representation, 'one' might be +5 volts and zero 0 volts. This lowest of levels at which computers operate is known as machine code. Binary arithmetic and Boolean algebra (named after Irish mathematician George Boole) permit mathematical representation. Boolean algebra and Karnaugh maps are used widely for minimisation of logic algebraic expressions. Though digital signals exist at two levels (one or zero), an indeterminate state is possible.

(Botto 1999 p.105)

[D]igital (adj.) Describes anything that uses discrete numerical values to represent data or signals (as opposed to a continuously fluctuating flow of current or voltage). A computer processes digital data representing text, sound, pictures, animations, or video content.

(Hansen 2004 p.82)

What all these definitions have in common is the numerical nature of the way information is described. As the definitions get closer to the present, the concept develops from having the meaning of "pertaining to digits" (Marotta 1986), through becoming more specialised to refer to "binary numbers" (Botto 1999), before taking on the value of "representing data" like "text, sound, pictures, animations, or video content" (Hansen 2004). This new form of representing material and texts has many implications for their manipulation.

Today, it is not only about transforming atoms into bits, i.e. digitalising a printed book. The tendency is to create texts, products, and material directly in digital form that, in most cases, will not be transformed into atoms. A blog, for example, is a personal diary written in digital form, which is rarely transformed into atoms unless some entries are printed or turned into a book (in the case of a very successful blog).

One of the first questions that was addressed when the digital era started was: how do we digitalise in a way that other devices can interpret the same data in the same way? The answer came with the adoption of conventions, guidelines, standards, and ad hoc proprietary solutions, the latter being the least desirable option. In other words: an agreement between two or more parties to create, interchange, use or store data in a specific way (a specific binary sequence). The different layers (from the machine language to the user interface) are constructed on the basis of standards and conventions that facilitate the interchange and interoperability of data between devices and tools and also ensure that the information can be sent in a secure mode and without loss or corruption of data.

The localisation industry has also gone through this process of standardisation; this topic will be covered in the next section of this chapter. However, it should be said that localisation itself was born along with the digital era, as it works with digital material. So standards are an intrinsic part of localisation. It could be stated that localisation would be impossible without digital standards or conventions.

What does it mean to "digitalise" texts and other material?

One of the implications that the digitalisation of texts has is its new linked nature. In a digital world the interrelation of elements is a simple and sometimes necessary operation:

digitalisation allows and incites the interrelation and the tagging of each element in a multidimensional way. This means that in a digital environment, as well as it was in the oral nature (section 1), all the elements (text, image, sound, buttons…) can do (function) and mean something. [Emphasis in the original].

(Torres del Rey 2003, section 5) [Translated by Lucía Morado Vázquez]

Jesús Torres, lecturer and researcher in the field of localisation and translation studies, also states the consequences of the digitalisation of texts:

The text could lose its strictly linear nature:

• The hyperlinks allow you to surf/go associatively through related ideas and enter and go out whenever and wherever you want, even randomly.
• Texts are adapted to the user's needs: for example, a manual (contained in a single text or hypertext) can be consulted linearly, by sections, or by key words.

In order to manipulate that flexibility the meaning and function of the textual unities needs to be scrupulously tagged and organised.

(Torres del Rey 2003, section 5) [Translated by Lucía Morado Vázquez]

This idea of non-linear text is also addressed in Negroponte's book:

In a printed book, sentences, paragraphs, pages, and chapters follow one another in an order determined not only by the author but also by the physical and sequential construct of the book itself. While a book may be randomly accessible and your eyes may browse quite haphazardly, it is nonetheless forever fixed by the confines of three physical dimensions. (Negroponte 1995 p.69)

Negroponte follows the idea that the new digital components are no longer in our atoms’ three dimensional reality, but are multidimensional:

In the digital world, this is not the case. Information space is by no means limited to three dimensions. An expression of an idea or train of thought can include a multidimensional network of pointers to further elaborations or arguments, which can be invoked or ignored. The structure of the text should be imagined like a complex molecular model. Chunks of information can be reordered, sentences expanded, and words given definitions on the spot (something I hope you have not needed too often in this book). These linkages can be embedded either by the author at "publishing" time or later by readers over time.

Think of hypermedia as a collection of elastic messages that can stretch and shrink in accordance with the reader’s actions.

(Negroponte 1995 p.70)

We also see this new multidimensional nature as a main feature of digital texts. Take a digital text, for example the manual that Torres (2003) mentioned earlier: we can see it as a linear structure that can be read from the first to the last page, or we can see it as a modular structure and navigate through the points that interest us, depending on our own preferences at a particular moment (for example, when we want to see how we can change the contrast of our new TV screen).

This new form of understanding and presenting information has many implications from both the linguistic and technical points of view, such as the need to maintain consistency of terminology throughout the whole product.

Another example that can be pointed out to clarify this multidimensional nature is a software program. Each user will navigate in a different way through the different menus and options of the program depending on their specific needs. It is almost impossible to predict their exact behaviour. As a result of that, the terminology used needs to be consistent as well as the style and register of the language.

Following the same idea, Jesús Torres, in his book "La Interfaz de la Traducción", stated that two of the main characteristics of the localisation process are "dynamism" and "interactivity":

(...) [T]he continuous linking of segments and unities of different size (from letters to texts or sets of texts) with databases categorised with different criteria (semantic, structural, etc.), causes the changes introduced in one point of the connectively linked chain to affect not only the rest of the elements of the chain, but the whole process.

(Torres del Rey 2005 p.95) [Translated by Lucía Morado Vázquez]

In order to address these new challenges posed by digital texts and content, different techniques and tools are needed from the ones used in the pre-digital era. The tools that have been specifically developed to assist translators and localisers during their task are commonly named Computer-Assisted Translation (CAT) tools; we will discuss them later in this chapter. These tools aim to speed up the whole process and ensure consistency. Most of these tools make use of standards: for example, in order to maintain consistency of terminology, the use of the TBX (TermBase eXchange) standard helps not only to follow the preferred terminology in one specific text or product, but also allows the terminology to be reused in future versions of the product or in similar texts from the same company. A broader view of the topic of standards of localisation is presented in the following section.

2.4. Standards of localisation

The initial question of this research was "How can we capture, organise and use localisation knowledge?" Several industrial initiatives have been created and developed to try to address this problem and others that affect the localisation process; therefore, our first step in answering our question was to study how current systems store and transmit localisation data from past projects to new ones. This led us to the study of standards of localisation.

By standards of localisation we understand those standards that were created ad hoc to enable, facilitate and improve the various localisation efforts. There are other computer science standards and conventions that have a great influence on the localisation process (e.g. Unicode, IANA language codes, etc.), but they are outside the scope of the current research and will not be discussed.

We focused our research only on standards organisations that allow the creation and development of "open" standards, which means that their work should be transparent and royalty-free use should be guaranteed (Filip 2012 p.29). A good example of an open standard initiative is the work within the Technical Committee (TC) of the OASIS XLIFF standard: all the work carried out by its members is documented (minutes of the regular working meetings, mailing lists, and work on the wiki) and open to the general public and interested parties to consult at any time. The TC also offers non-members the possibility to participate with their feedback in the development process, through a special mailing list created with that specific purpose in mind.

In the recent history of localisation and its standards, two organisations have been leading the standardisation efforts: the Organization for the Advancement of Structured Information Standards (OASIS)12, which includes under its umbrella two localisation standards (OAXAL - Open Architecture for XML Authoring and Localization Reference Model- and XLIFF) and the Localization Industry Standards Association (LISA) with a committee formed by a group of experts (OSCAR –Open Standards for Container/content Allowing Reuse) which was devoted uniquely to the creation and development of standards of localisation.

12 OASIS develops many other standards that are not localisation related.

At the time of writing this dissertation, we are experiencing a changing momentum in the field of standards of localisation: on the one hand, LISA was declared insolvent in March 2011 and subsequently stopped operations; on the other hand, LISA's disappearance coincided in time with the flowering of several new standards initiatives (some of them more strictly open than others): the Unicode Localization Interoperability technical committee within the Unicode Consortium, the GALA (Globalization and Localization Association) Standards Initiative and The Interoperability Now! group (Filip 2012 pp.32–35). The XLIFF standard (developed by OASIS) is the main localisation standard and is currently enjoying a period of steady and vigorous activity: it has attracted new and influential members (ibid 2012, p.33) and its work is reaching a bigger audience through several promotion initiatives coordinated by a new subcommittee created within the main TC and devoted to promotion and liaison activities.

2.4.1. LISA and TMX

As stated above, the main standards organisation in the field of localisation was LISA (Localization Industry Standards Association). As its name implies, LISA had a clear commercial nature: it had more than 100 industrial members in the field (localisation companies, software developers, translation agencies, etc.) (LISA 2011a). Unfortunately, in March 2011, the association had to close due to financial problems, and its standards specifications were donated to another standards association, the European Telecommunications Standards Institute (ETSI). Since then, the former LISA standards have been moved forward through this organisation (Trillaud and Guillemin 2012 pp.39–41).

LISA carried out its standards activities through the OSCAR (Open Standards for Container/content Allowing Reuse) committee, which was responsible for the development of the following standards:

• Translation Memory eXchange (TMX). OSCAR's oldest standard, TMX supports the exchange of translation memory (TM) data between applications.
• Segmentation Rules eXchange (SRX). A companion standard to TMX, SRX allows application developers to define how their tools segment text.
• Term-Base eXchange (TBX). Based on ISO standards, TBX is the standard for the exchange of structured terminological data.
• XML Text Memory (xml:tm). xml:tm allows text memory (including translations) to be embedded within XML documents.
• Global Information Management Metrics eXchange - Volume (GMX-V). GMX-V provides a standard way to count characters and words in documents for information management purposes, including estimating translation costs.
• Term Link (proposed). Term Link is an XML namespace that enables XML elements to be linked to termbases.

(LISA 2011b)

For the aims of this research we were particularly interested in TMX, as its purpose "is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process" (OSCAR LISA 2005). This objective partially coincided with our initial research objectives (capture, organise and use); however, it is restricted to "translation data", and our scope was broader, as it involved a more generic process, localisation, of which translation is only a part; as the localisation academic Ignacio García (2006 p.15) states: "[i]t is worth noting that the localisation industry does not translate – it localises." We needed to look for a broader standard that could encapsulate not only translation data but the whole procedural data. That standard is called XLIFF (XML Localisation Interchange File Format) and will be discussed in the next section. The main difference between the two formats is that a TMX file is formed by a collection of translation unit elements with no specific order and without a mechanism to rebuild the original file (XLIFF TC 2007), whereas in an XLIFF file the translation units are ordered and identified, and from them the original file can be rebuilt. Moreover, more information about the original file is included in specific elements of the file, which can eventually also be reused in future processes, as we demonstrated with the first stage of our methodology.
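To illustrate this difference, the following is a minimal, hand-written TMX sample (the tool name and segment texts are purely illustrative, not taken from the cited specification); it simply lists source/target pairs, with no ordering information and no reference back to the structure of the original file:

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <!-- The header describes how the memory was created, not the original document -->
  <header creationtool="ExampleTool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="tmx" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <!-- Each <tu> is an isolated translation unit -->
    <tu>
      <tuv xml:lang="en"><seg>Hello World</seg></tuv>
      <tuv xml:lang="es"><seg>Hola Mundo</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Goodbye World</seg></tuv>
      <tuv xml:lang="es"><seg>Adiós Mundo</seg></tuv>
    </tu>
  </body>
</tmx>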

The XLIFF format can be used for different purposes and can be adjusted to ad hoc requirements. The information that is kept in the XLIFF file format can also be reused for future projects; "in modern localisation operations, TM is the primary enabler for cost reductions by maintaining multiple representations of information" (Lommel 2006 p.228). However, XLIFF was not originally designed to act as a translation memory in itself. The use of XLIFF as a TM was first associated with its conversion to TMX (Raya 2004; Raya 2005; XLIFF TC 2007), but that conversion from XLIFF into TMX tags, by mapping the former items to the equivalent ones in the latter format, results in the disposal of many pieces of information that cannot be matched in TMX. Among other reasons, this is because both standards (which share some similarities) were created with different goals and fulfil different needs during the localisation process (XLIFF TC 2007). Instead of using that approach we wanted to maintain the whole data set that an XLIFF document contains by storing the document itself inside a bigger container.

2.4.2. OASIS XLIFF

The XML Localisation Interchange File Format is an open standard developed by OASIS (Organization for the Advancement of Structured Information Standards). It is a data container that carries localisable data from one localisation process to another. Its main feature is that it allows interoperability between different tools.

The XLIFF Technical Committee (TC) is in charge of the development and maintenance of the standard. The current TC is formed by 21 voting members (including the author of this dissertation) coming from localisation-related industries and institutions: 33.3% from software corporations, 23.8% from tool vendors, 14.3% from associations, 9.5% from service tool providers, 9.5% from academic institutions and 9.5% of the members are registered as individuals (Filip 2012 p.33). The TC is currently developing the new version XLIFF 2.0 (the current specification is XLIFF 1.2) and there are two subcommittees: the XLIFF Inline Markup SC, which deals with the development and description of inline elements within the standard, and the XLIFF Promotion and Liaison SC, which works on promotion and liaison activities. Users and other interested people can also participate by sending their comments, questions or suggestions to an open mailing list devoted to that purpose. All the technical documents produced by the TC are open to the general public and can be downloaded from its web page: https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff.

The basic principle behind XLIFF is the extraction of the localisation-related data from its original format, which can be manipulated by those tools that support the standard, and its conversion to the original format once the whole localisation process is finished.


XLIFF extracting/merging principle schema (XLIFF TC 2008a p.11)

XLIFF ARCHITECTURE

The XLIFF 1.2 specification includes 37 elements, 80 attributes and 269 pre-defined values. That makes a total of 386 items, which makes it a complex XML vocabulary.

The main structure of a document is composed of the following elements: <xliff>, <file>, <header> and <body>. The <xliff> element is the root element. The <file> element describes the file that is being encapsulated in the XLIFF document; there can be one or more <file> elements inside an XLIFF document. The <file> element should include information about the name and location of the original file, its source language and the type of data that it contains. The <header> element contains the metadata of the file. The <body> element contains the extracted localisation information that needs to be treated and organised in translation or binary units (XLIFF TC 2008b).

Inside the translation units we find three elements: the <source> element, where the source text is introduced; the <target> element, where the translated version of the source element should be introduced; and the <alt-trans> element, where alternative translations can be introduced. The latter element, <alt-trans>, is the most relevant one for our research, because this is where we can save the information about the translation suggestion and its related metadata.
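As an illustration (the file name, identifiers and attribute values below are our own, not taken from the specification), a minimal XLIFF 1.2 document carrying one translation unit and one translation suggestion could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="greeting.properties" source-language="en"
        target-language="es" datatype="plaintext">
    <header/>
    <body>
      <trans-unit id="1">
        <source>Hello World</source>
        <!-- Target still to be filled in by the translator -->
        <target/>
        <!-- Translation suggestion leveraged from a previous project -->
        <alt-trans match-quality="100%" origin="previous-project.xlf">
          <source>Hello World</source>
          <target>Hola Mundo</target>
        </alt-trans>
      </trans-unit>
    </body>
  </file>
</xliff>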

The attributes that are present in the current specification can contain rich metadata that can be reused in future projects. An analysis of the attributes relevant to our research can be found in the methodology chapter.

The core parts of an XLIFF document are the translation unit elements. If we look at them closely, they are similar to a standardised TMX document (XLIFF TC 2007). However, an XLIFF document is much more than a translation memory: it can contain much more information about the process, the tools used, the people involved and the current state of the document.

The XLIFF standard is central to our research. We will analyse the metadata that it can contain and how that metadata is actually reused, and we will propose a new way of reusing more metadata and organising it in a way that can be helpful to localisation professionals. All these processes are explained in detail in the first methodology chapter.

LANGUAGES IN XLIFF:

An XLIFF document is by default a bilingual document, which contains the extracted content in the original language and its translation into the target language being localised. The languages of the source and target elements should be stated inside the <file> element, in the attributes "source-language" and "target-language", at the beginning of the document.

INVOLVEMENT IN THE XLIFF TC

The author of this research joined the XLIFF Technical Committee in March 2009 and has contributed to its development since then: participating in the fortnightly meetings, collaborating in the development of the new specifications, being one of the main organisers of the First XLIFF International Symposium13, and being a member of the scientific committee in the two following symposium editions. In March 2010 she joined the XLIFF Inline Markup Subcommittee14, a group that is devoted to the definition and specification of the inline elements of the XLIFF standard and their relation and possible interoperability with related standards such as TMX and ITS. The author has also been a member of the Promotion and Liaison Subcommittee since its establishment in 2011 and has been the main editor of the XLIFF Support in CAT Tools Survey, which obtains information about XLIFF support in CAT tools from the main CAT tool developers of the industry and the open source world (Morado Vázquez and Filip 2012).

13 You can visit the official webpage of the event at www.localisation.ie/xliff. 14 You can visit the official subcommittee webpage at http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff-inline

2.5. Previous research on CAT tools and translation memories

Computer-Assisted Translation (CAT) tools have been designed to support human translators in their job; although TM technology started to be developed in the late 1960s and early 1970s, it was not until the 1990s that commercial CAT tools began to spread (Benito 2009). They should not be confused with Machine Translation systems, whose ultimate objective is to substitute human translators. CAT tools were designed to help translators to eliminate repetitive tasks, to automate terminology lookups, and to reuse previously translated material (Esselink 2000 p.359). Some of the principal characteristics that CAT tools share are: text extraction and format filters (which separate the translatable text from the rest of the code), translation memory and terminology management, project management functionalities, quality controls, and preview options (Torres del Rey 2011). In our research, we focused only on the translation memory functionality and how it affects the translator's behaviour (along with the provenance metadata that we introduced).

Translation memories are databases made of previous translations (organised by source and target language) that can be reused in future projects. They can also be of much help when working on the same file or project, as when repetitive text appears its translation can be automatically suggested or translated by the tool; and when working with other translators, provided that a simultaneously accessible remote translation memory is in place. The format in which these translation memories are stored will determine their future reuse. As seen in the previous section, a specific standard (TMX) was developed by LISA to allow the interoperability of translation memories between different CAT tools. Other data exchange formats, like XLIFF, which was not originally created to be used as a translation memory format, can also be used for that purpose.

The main advantages of the use of translation memories for translators are an increase in productivity (less time spent on the translation task) and an improvement in the quality of the translation through increased consistency (Bowker 2005 p.15). The penetration rate of translation memory systems among professional translators has been reported as very high (Lagoudaki 2006b p.15), reaching 81% among freelance translators in that year.

There have not been many research efforts on CAT tools and translation memories and their impact on translators. Two recent academic papers (Christensen and Schjoldager 2010; García 2009) compiled and reviewed the recent studies in the field. García divides the research studies into descriptive studies (surveys, reports and case studies) and (quasi-)experimental studies. Our research clearly fits into the latter category. García also states that one of the risks for researchers working with TMs is that, due to rapid technological change, their work might become obsolete or irrelevant by the time it is published. Christensen and Schjoldager focused on compiling and describing existing TM research carried out with empirical methods after the year 2000; they identified nine studies which fulfilled those criteria. In their conclusion, Christensen and Schjoldager stated that "[m]ost practitioners seem to take for granted that TM technology speeds up production time and improve translation quality, but there are not studies that actually document this". Our experiment will also help to fill that gap in the field of translation research using CAT tools and translation memories, because, after analysing the results of our experiments, we determined that, on average, the two groups that received a translation memory obtained significantly better results in terms of time reduction and translation quality.

Apart from the risk of losing relevancy, mentioned by García and also present in Pym (2011), we have found through our research what we believe is one of the main constraints of this type of research: the difficulty of obtaining a large and representative set of participants. Translators can work as in-house professionals or as freelancers. Reaching both groups involves difficulties: on the one hand, reaching in-house translators, who work in offices, can be difficult if only one language combination is picked for the study; in our particular case, the industrial partners we consulted did not have a sufficient number of English-Spanish translators in place. On the other hand, freelance translators are also difficult to reach: they cannot easily be approached physically (as they work from their own homes), and they work in their own customised work environments with their own preferred translation tools; all those characteristics impose high subject variability and poor control over working environments, which can negatively influence the validity of the results of an empirical study. A deeper discussion of the limitations of our particular experiment and its validity can be found in Chapter 4. Some researchers (Bowker 2005; Christensen and Schjoldager 2011; Mesa-Lao 2011) used translation students, which helped them to have a bigger set of participants. We also used translation students in our pilot study; this study helped us to better define the boundaries of the experiment and to redefine and adjust some aspects of its design (such as the distribution of the groups). The biggest set of participants used in an empirical study involving translation memories was achieved by the TRACE project (conducted by several researchers based at the Autonomous University of Barcelona), which involved the participation of 9015 professional translators and whose main objective was to measure the impact of CAT tools on translated texts (Torres-Hostench 2010). To the best of our knowledge, our experiment represents the second biggest data set in terms of participants in the field of empirical research using translation memories, with a total of 33 valid participants.

15 We are not sure of the final number of participants, as they stated in the paper that some of the initial 90 recruited participants were discarded.

Empirical research on translation memories has focused on studying the following specific topics: their impact on speed and quality (Bowker 2005; Yamada 2011); their impact on cognitive processes (Christensen and Schjoldager 2011); productivity and quality in post-editing Machine Translation (O'Brien 2007; Guerberof 2009); consistency in translation memories (Moorkens 2011); translation research using eye-tracking (Doherty et al 2010; O'Brien 2010), which was also used to measure translators' attitudes to UI design (O'Brien et al 2010); and the impact of CAT tools on translated texts, carried out by the above-mentioned TRACE project, in which different researchers worked on studying three specific aspects: explicitation, linguistic interference and textuality (Torres-Hostench 2010).

Teixeira (2011) is the only research effort that explicitly relates to the study of provenance metadata in translation memories; his object of study is the combination of translation suggestions coming from human-created translation memories and from machine translation systems. In his study, he tried to identify whether the absence or presence of provenance metadata (the independent variable of his study) had any influence on the translators' work in terms of translation speed and quality (the dependent variables). In his initial study, which he intends to extend, he used two professional translators. Due to the limitations of his initial study, definitive conclusions could not be drawn. Although there are similarities with our research, our study did not make use of translation suggestions coming from machine translation systems; we only used translation memory suggestions coming from the official target translation, which was carried out by human translators. We also investigated a full range of provenance metadata items that surround the translation suggestion (name of the translator, date, topic of the text, target language and name of the original file), not only its origin (i.e. whether it was created by a human translator or by a machine translation system). Our research aims to contribute to the field of empirical research on translation memories by studying the influence that the provenance metadata surrounding translation suggestions has on the translator's behaviour.

Chapter 3 – Methodology I

3.1. Introduction

In the first stage of our research our main objective was to answer the following question: How can we capture, organise and use localisation knowledge? In order to answer this question we studied the current data exchange standards present in our research field (explained in detail in the literature review section); we then concentrated our efforts on one of those standards (the XLIFF standard) and identified the gaps where potentially useful metadata is not being captured, organised and reused. Therefore we decided to follow a "Design and Creation"16 research strategy that allowed us to design and develop a tool prototype demonstrating how the metadata that surrounds translation suggestions can be captured, organised and reused by translators in future projects. This chapter explains the process we followed to do so. In the next chapter we follow a different research strategy, "Experimental Research", where we conducted experiments with translators to study the influence that the metadata we are introducing, along with the translation suggestions, has on the translators during their task. We will discuss the two sequential strategies separately.

3.2. Design and Creation

The design and creation research strategy focuses on developing new IT products, also called artefacts.

(Oates 2006, p.106)

The first part of our methodology is the design and creation of a data container (later called the "Localisation Memory Container" or LMC) that allows us to encapsulate localisation data and metadata, and the second part is the development of a localisation tool that automates the process of recovering localisation data and metadata.

The purpose of these two steps is to answer the initial question of our research, that is, to demonstrate how we can organise (in a Localisation Memory Container file), capture (with an extracting and merging tool), and use (with a traditional Computer-Assisted Translation tool, or CAT tool) localisation knowledge. The resulting files from this strategy will be: a) enriched XLIFF files that will be used in CAT tools, and b) LMC files where XLIFF files can be stored for their later reuse. In the second part of our methodological strategy we will test whether this new localisation metadata that we are introducing in the enriched files is actually helpful for the translators and subsequently improves their work and productivity.

16 We have followed the terminology of the book "Researching Information Systems and Computing" by Briony J Oates (2006) and its research guidelines in this section.

This chapter is divided into three sections: the Localisation Memory Container, which describes the repository that we developed to organise XLIFF files for their later reuse; Localisation Metadata in XLIFF, which describes the process of identifying possible meaningful metadata within the current XLIFF 1.2 specification; and XLIFF Phoenix, which describes the design and development of our tool prototype.

3.2.1. Localisation Memory Container

The Localisation Memory Container is an XML vocabulary that was developed as a data descriptor that allows the storage of previous XLIFF documents in a single file.

Instead of using extracted translated data, as is done in the TMX standard, we want to use complete XLIFF files as localisation memory components that can be reused in future projects. It is possible to encapsulate several XLIFF files into one XLIFF file (XLIFF TC 2008b); however, this possibility did not give us the self-describing document we desired, that is to say, a document that holds the information about what the container contains in the same document. Our approach was to develop a new XML language in which we could store previous XLIFF files; the language should also be self-describing, which means that it holds information about the number of files that it contains, its author, and its creation and last modification dates.

As defined in its XML Schema below, the main structure of an LMC document is composed of the <header> and <body> elements. The header carries the following attributes: author (which should contain the information about the author of the LMC), last-mod (which should contain the date of the last modification of the LMC), total-files (which should contain the exact number of files that the LMC holds) and started (which should indicate the creation date of the LMC).

In the body, any XML that is correctly specified with a namespace can be included. For the purposes of our research we will only use XLIFF files. However, we left this option open to other XML languages, such as TMX, which could be incorporated if necessary in the future.

The XML schema of the LMC defines this structure formally, and a sample LMC file containing two XLIFF files illustrates it: the sample encapsulates the translated units "Hello World"/"Hola Mundo" and "Goodbye World"/"Adiós Mundo", together with a note.
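Because the original listings are not legible here, the following is only an illustrative reconstruction of such a sample, assuming the <header>/<body> structure and the header attributes described above; the root element name, the namespace URIs, the file names and the attribute values are our own assumptions:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical LMC container: element and namespace names are assumptions -->
<lmc xmlns="urn:example:lmc"
     xmlns:xlf="urn:oasis:names:tc:xliff:document:1.2">
  <header author="Jane Doe" started="2010-01-15"
          last-mod="2010-02-01" total-files="2"/>
  <body>
    <!-- First previously-localised XLIFF file, stored complete -->
    <xlf:xliff version="1.2">
      <xlf:file original="greeting.txt" source-language="en"
                target-language="es" datatype="plaintext">
        <xlf:body>
          <xlf:trans-unit id="1">
            <xlf:source>Hello World</xlf:source>
            <xlf:target>Hola Mundo</xlf:target>
            <xlf:note>This is a note</xlf:note>
          </xlf:trans-unit>
        </xlf:body>
      </xlf:file>
    </xlf:xliff>
    <!-- Second previously-localised XLIFF file -->
    <xlf:xliff version="1.2">
      <xlf:file original="farewell.txt" source-language="en"
                target-language="es" datatype="plaintext">
        <xlf:body>
          <xlf:trans-unit id="1">
            <xlf:source>Goodbye World</xlf:source>
            <xlf:target>Adiós Mundo</xlf:target>
          </xlf:trans-unit>
        </xlf:body>
      </xlf:file>
    </xlf:xliff>
  </body>
</lmc>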

3.2.2. Localisation metadata in XLIFF

Localisation Metadata connects the data present at different stages of the localisation process, from digital content creation, annotation, and maintenance, to content generation, and process management. The usage of open, rich, and flexible metadata is an effective step towards the aggregation and sharing of data between localisation sub-processes.

(Anastasiou & Morado Vázquez 2010, p.258)

As mentioned above, metadata can be very important in various phases of the localisation process, particularly when data is shared among people, in similar or different roles, along the globalisation cycle. However, not all metadata has the same relevance for localisation professionals; its relevance depends very much on the computer-assisted localisation tool and the type of content being localised. XLIFF provides many mechanisms to add metadata related to the product, the process, the supplementary material, or any other localisation factors. However, little use is made of this opportunity to improve the task of the localiser by enriching the data that they need to process.

Following the above definition, we carried out a complete analysis of the current XLIFF specification to identify the attributes that could encapsulate the localisation metadata relevant to our research. An explanation of how we decided their relevancy is given below. After that, we investigated which of our selected attributes could be introduced validly in the <alt-trans> element, and finally we proposed a new way of introducing the remaining metadata in the <alt-trans> element (an XLIFF 1.2 element explained later) without breaking the validity of the XLIFF file.

The metadata we selected as relevant to the localisation process comprises those attributes that can describe the localisation process by answering the traditional five W questions (why, who, what, where and when) and the sixth one: how.

Choice of metadata

In order to identify potentially relevant localisation metadata for our research, we first carried out a complete analysis of the current XLIFF 1.2 specification and looked for attributes that could describe the localisation process by answering the traditional Zachman Enterprise Framework (Warren 2007) WH questions: why, who, what, where, when and how. See the next table for a schematic view of our analysis:

What (data)
  What is it?                                              original
  What is being localised?                                 source-language, target-language, category, version
  What does it do?                                         datatype

Who (people)
  Who is localising/translating?                           contact-name
  Who localised similar files earlier? (matches)           contact-email
  Who is the product for?                                  translate

Where (network)
  Where is it localised [company/country]?                 company-name, target-language
  Where is it localised [tool]?                            tool-id, tool-name
  Where did the translation matches come from?             state-qualifier

When (date)
  When were the matches obtained?                          date
  When was the product made?                               date
  When are we in the localisation phase?                   state

Why (motivation)
  Why is it being localised?                               job-id

How (function)
  How is it localised [tool, process, auxiliary material]? tool-id, tool-name
  How is it being localised?                               state, state-qualifier

Table 1. Zachman Framework applied to XLIFF 1.2 attributes.

The next step was to check whether these attributes are already allowed in the <alt-trans> element. The <alt-trans> element is the place where translation suggestions can be introduced in an XLIFF 1.2 file.

We summarise the results of this analysis in the following table:

Attribute           Allowed in the alt-trans element   Included in our analysis
alttranstype        Yes                                No
approved            No                                 Yes
category            No                                 Yes
company-name        No                                 Yes
contact-email       No                                 Yes
contact-name        No                                 Yes
coord               Yes                                No
crc                 Yes                                No
css-style           Yes                                No
datatype            Yes                                Yes
date                No                                 Yes
exstyle             Yes                                No
extradata           Yes                                No
extype              Yes                                No
font                Yes                                No
format              No                                 Yes
help-id             Yes                                No
job-id              No                                 Yes
match-quality       Yes                                No
menu                Yes                                No
menu-name           Yes                                No
menu-option         Yes                                No
mid                 Yes                                No
origin              Yes                                No
original            No                                 Yes
phase-name          Yes                                No
resname             Yes                                No
restype             Yes                                No
source-language     No                                 Yes
state               No                                 Yes
state-qualifier     No                                 Yes
style               Yes                                No
target-language     No                                 Yes
tool                Yes (deprecated in XLIFF 1.2)      No
tool-company        No                                 Yes
tool-id             Yes                                Yes
tool-name           No                                 Yes
translate           No                                 Yes
ts                  Yes                                No
version             No                                 Yes
xml:lang            Yes                                No
xml:space           Yes                                No

Table 2. Attributes allowed in the alt-trans element.

The place of metadata

In XLIFF 1.2, only a set of predefined attributes are allowed in the <alt-trans> element (these are: mid, match-quality, tool, tool-id, crc, xml:lang, datatype, xml:space, ts, restype, resname, extradata, help-id, menu, menu-option, menu-name, coord, font, css-style, style, exstyle, extype, origin, phase-name, and alttranstype). Therefore, the next problem we encountered was to decide where and how we could introduce the metadata that is not currently allowed in the alt-trans element. We considered the following possibilities:

Introduce a new element

This solution implies the introduction of a hypothetical element called metadata, in which we could encapsulate all the metadata that surrounds the translation suggestions; however, this cannot be done without breaking the validity of the XLIFF file. There is a similar possibility that would not break the validity: designing a new XML language and embedding it through the XML namespace mechanism. However, this means that we would be proposing a non-standardised language, and the data contained in it would be lost between tools that do not recognise the new language. Extending the XLIFF format through the XML namespace mechanism has been a topic of much debate during the last few years within the XLIFF TC (Fredrik 2012; Savourel 2012), and it will also be included in the new version 2.0, which is under development.

Introduce the metadata in the extradata attribute

This attribute, which can be included in the <alt-trans> element, "stores the extra data of properties of an item" (XLIFF TC 2008b). It could be a good option to store the metadata information; however, only plain text can be introduced as the value of the attribute, and that reduces the possibilities of future reuse of that data. It means that a more complex system would need to be put in place to extract the data and present it separately and in a meaningful way to the translators.
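To illustrate the limitation, a hypothetical attempt to pack provenance information into extradata would flatten everything into one opaque string (the key/value convention shown here is our own, not part of the specification):

<alt-trans extradata="translator=Jane Doe; date=2010-11-30; original=manual.xlf"
           match-quality="100%">
  <source>Hello World</source>
  <target>Hola Mundo</target>
</alt-trans>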

Introduce the information through XML processing instructions

The XML processing instructions "allow documents to contain instructions for applications" (W3C 2008). With this mechanism we could keep our "attribute-value" structure intact for its later reuse without breaking the validity of the XLIFF specification. This solution was proposed by Rodolfo Raya (localisation and XLIFF expert17) during an interchange of private emails, where he also offered to modify his commercial localisation tool for the purposes of this research (see the section "Swordfish II" for more information). We decided to use this solution to introduce our metadata information because it did not break the validity of our output XLIFF files and our data could be displayed in a meaningful way in at least one CAT tool, which was later used for the experiments with translators.

17 As well as working as a tool developer in MaxPrograms, Rodolfo Raya has been actively involved in the two main localisation data exchange formats (TMX and XLIFF) during the last years. He was the editor of the latest TMX specification (2.0) and the latest XLIFF specification (1.2). He is currently the secretary of the XLIFF TC.
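As a sketch of this processing-instruction approach (the processing-instruction name and its attribute-value payload below are illustrative, not the exact convention agreed with Swordfish), the provenance metadata could travel inside the alt-trans element like this:

<alt-trans match-quality="100%" origin="lmc">
  <!-- Hypothetical processing instruction carrying provenance metadata -->
  <?lmc-metadata contact-name="Jane Doe" date="2010-11-30"
                 original="manual.xlf" target-language="es"?>
  <source>Hello World</source>
  <target>Hola Mundo</target>
</alt-trans>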

3.2.3. XLIFF PHOENIX

As its end approached, the phoenix fashioned a nest of aromatic boughs and spices, set it on fire, and was consumed in the flames. From the pyre miraculously sprang a new phoenix, which, after embalming its father’s ashes in an egg of myrrh, flew with the ashes to Heliopolis (“City of the Sun”) in Egypt, where it deposited them on the altar in the temple of the Egyptian god of the sun, Re. (Britannica Online Encyclopedia 2010)

The next step in our research methodology was the design and implementation of a tool prototype that could extract localisation data and metadata (see the previous section for the definition of the metadata) from an LMC file and introduce it into an untranslated XLIFF document (in the <alt-trans> element). We decided to name our tool "XLIFF Phoenix" because it aims to recover data and metadata from a completed localisation process and give them a new "life" in another localisation process.

3.2.3.1. Definition

XLIFF PHOENIX is a Computer-Assisted Translation (CAT) tool that allows the reuse of previously localised XLIFF documents. This is achieved by filtering the information included in them and matching it with a new document introduced into the system. The resulting document is an enriched file that contains translation recommendations (in the alt-trans element) and embedded metadata.

3.2.3.2. Development

In the first stages of the development of the tool, we stated the initial technical requirements and drafted a design of the GUI. The design process did not finish in the early stage, but continued during the whole implementation process, with new features being designed and others modified to improve the whole system.

The technical implementation of the tool was carried out by CSIS 2nd year intern Seán Mooney. He worked as an intern for 8 weeks with a grant from the CNGL. He was supervised mainly by Lucía Morado Vázquez, Chris Exton and Dimitra Anastasiou.

The tool was developed using the NetBeans platform. The programming language chosen for this implementation was Java. While C# was initially considered, it was discounted for the following reasons:

• C#, while faster than Java, failed to meet the requirement of universal cross-platform compatibility. While Mono is a promising solution to this issue, it was not mature enough to use at the time of implementation.
• As the implementation deadline for this project was 7 weeks, C# was also less favourable, as the programmer had greater experience with GUI development in the Java environment.

Due to the short time scale involved for the entire project, an Agile approach was followed during the implementation process. This proved to be beneficial to the implementation and design of the tool, as additional functionality could be assessed and implemented as required throughout the entire implementation process.

This process was facilitated by daily updates and weekly assessment of the development of the tool. This frequent interaction also ensured that each phase of development was in line with the final goal for the project.

Phase 1:

Phase 1 comprised the implementation of the core functionality of the tool.

Features implemented:

• Load and read valid XLIFF files.
• Load and read LMC (Localisation Memory Container) files.
• Filter the LMC by 6 general attributes (File Name, Source Language, Target Language, Format, Date and Tool Name).
• Leverage the exact matches into the alt-trans element.
• Export enriched XLIFF.

Phase 2:

Phase 2 involved extending the filter to accommodate all 18 parameters and resulted in an initial beta of the tool.

Additional features implemented:

• Simple fuzzy matching with adjustable threshold.
• Filter the LMC by:
  o 6 general attributes (File Name, Source Language, Target Language, Format, Date and Tool Name).
  o 12 specialised/advanced attributes (Locked, Tool Id, Domain, XLIFF Version, TU Status, TU State, Approved Translation, Tool Company, Company Name, Contact Name, Email and Job Id).

Phase 3:

Phase 3 involved the completion of the initial requirement of the project and marked the point at which the tool began to be extended.

Additional features implemented:

• Leverage the sentences into the alt-trans element along with the percentage and the origin information.
• Leverage the metadata surrounding the retrieved sentence.
• Transformation and exporting into valid TMX.
• Export in a “readable” HTML document.
• Refinement of the fuzzy matching algorithm for greater accuracy and performance.

Phase 4:

Phase 4 involved the development of a companion tool to construct LMC documents, named the LMC Builder. This was not part of the initial project requirements but was added to allow for the easy adoption of the LMC format and the XLIFF Phoenix tool.

Additional features implemented:

• Leverage statistics.
• Initial branding of the tool.
• Initial wizard prototype created.
• LMC Builder external tool, which included:
  o TMX to XLIFF converter.
  o Build an LMC from previous XLIFF documents.
  o Build an LMC from previous TMX documents.
  o Append TMX or XLIFF documents to an LMC.

Phase 5:

Phase 5 involved the extension of the tool to facilitate interoperability with both the CNGL Solas Platform18 and Swordfish. In addition to this, final quality assurance checking was undertaken in this phase.

Additional features implemented:

• Ability to load and save to the CNGL Solas Platform.
• Ability to get jobs from the CNGL Solas Platform.
• Metadata stored as XML processing instructions to allow tools such as Swordfish to make use of the metadata that is added.
• Addition of an option to disable the watermark and server features at runtime.
• Enrichment of the user interface to allow for easier use by adding hyperlinks and additional tooltips to explain the features of the filters in greater detail.
• Final branding.

3.2.3.3. Functionality:

Here is a summary of our tool’s functionality:

• Load and read valid XLIFF files.
• Load and read LMC (Localisation Memory Container) files.
• Filter the LMC by:
  o 6 general attributes (File Name, Source Language, Target Language, Format, Date and Tool Name).
  o 12 specialised/advanced attributes (Locked, Tool Id, Domain, XLIFF Version, TU Status, TU State, Approved Translation, Tool Company, Company Name, Contact Name, Email and Job Id).
• Fuzzy matching (with selection of the percentage threshold).
• Leverage the sentences into the alt-trans element along with the percentage and the origin information.
• Leverage the metadata surrounding the retrieved sentence; this information is introduced via processing instructions.
• Transformation and exporting into valid TMX.
• Export in a “readable” HTML document.

18 XLIFF Phoenix is one of the components of the SOLAS platform, developed by the CNGL (Aouad et al 2011).

3.2.3.4. Filtering

We use a filter that allows the user to define his/her preferences to a fine level of granularity. The filter also improves the performance of the tool, because it produces a smaller version of the LMC that can be processed faster in the later operations of the tool. The filtering options correspond to XLIFF 1.2 attributes (the ones that we selected as required metadata, see the previous section); however, some names do not correspond exactly to the specific attribute name in the specification, either because the latter was not very explicit (for example, we use the field “name”, which corresponds to the XLIFF attribute "origin") or because it was an abbreviated form.

The filter is divided into two sections: general and advanced. In the general section we find the following options: File Name, Source Language, Target Language, Format, Date and Tool Name. If the XLIFF 1.2 specification indicates predefined values for any attribute specified in this filter, those predefined values are shown in an editable dropdown list.
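As a minimal sketch of how such attribute-based filtering could work (assuming, purely for illustration, that each LMC translation unit is represented as an attribute-value map; this is not the actual XLIFF Phoenix code), the filter keeps only the units that match every value the user has specified, producing the smaller LMC mentioned above:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LmcFilter {

    // userCriteria holds the fields set in the general/advanced filter,
    // e.g. "source-language" -> "en", "tool-name" -> "Swordfish II";
    // empty criteria match everything.
    public static List<Map<String, String>> filter(
            List<Map<String, String>> lmcUnits,
            Map<String, String> userCriteria) {
        return lmcUnits.stream()
                .filter(unit -> userCriteria.entrySet().stream()
                        .allMatch(c -> c.getValue().isEmpty()
                                || c.getValue().equalsIgnoreCase(
                                        unit.getOrDefault(c.getKey(), ""))))
                .collect(Collectors.toList());
    }
}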


Figure 1. XLIFF Phoenix. General Filter

In the advanced section we find the following options: Locked, Tool Id, Domain, XLIFF Version, TU Status, TU State, Approved Translation, Tool Company, Company Name, Contact Name, Email and Job Id. If the XLIFF 1.2 specification indicates predefined values for any attribute specified in this filter, those predefined values are shown in an editable dropdown list.


Figure 2. XLIFF Phoenix. Advanced Filter

XLIFF Phoenix was developed successfully during the summer of 2010 and was presented at several international scientific events (see Appendix K for more information). A video with a demo presentation of the tool is available at: http://www.youtube.com/watch?v=E6b36IHAMgM.

3.2.3.5. Fuzzy Matching

The fuzzy matching of this tool consists of comparing source elements from translation units of the LMC with source elements from translation units of the XLIFF source file. The user can define their fuzzy matching percentage.

The fuzzy matching is done using the following algorithm (adapted from Cohen et al 2003):


Given the strings $S_1$ (search text) and $S_2$ (LMC text), and a weight $K$:

$$\mathrm{MatchQuality}(S_1,S_2)=\frac{K\cdot\mathrm{JaroWinkler}(S_1,S_2)+\mathrm{Jaccard}(S_1,S_2)}{K+1}$$

$$\mathrm{JaroWinkler}(S_1,S_2)=\mathrm{Jaro}(S_1,S_2)+\frac{P'}{10}\bigl(1-\mathrm{Jaro}(S_1,S_2)\bigr)$$

where $P$, the length of the longest common prefix of $S_1$ and $S_2$, gives $P'=\min(P,4)$.

$$\mathrm{Jaro}(S_1,S_2)=\frac{1}{3}\left(\frac{M}{|S_1|}+\frac{M}{|S_2|}+\frac{M-T}{M}\right)$$

$$\mathrm{Jaccard}(S_1,S_2)=\frac{|S_1\cap S_2|}{|S_1\cup S_2|}$$

where $M$ is the number of matching characters in $S_1$ and $S_2$, and $T$ is the number of transpositions between $S_1$ and $S_2$.

Hence, for a threshold value $\Omega$, $S_1$ matches $S_2$ if

$$\mathrm{MatchQuality}(S_1,S_2)\geq\Omega$$

When $L_1$, the length of $S_1$, is less than $L$ (32) characters, $S_1$ and $S_2$ are said to match if

$$\mathrm{MatchQuality}(S_1,S_2)\geq\Omega\left(2-\frac{L_1}{L}\right)$$

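The following is a minimal, illustrative Java implementation of this measure, not the code written for XLIFF Phoenix: in particular, the Jaccard component is assumed here to operate over the word sets of the two strings, the short-string rule follows the formula above, and the weight K, the threshold and the example strings are arbitrary.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MatchQuality {

    // Jaro similarity between two strings.
    static double jaro(String s1, String s2) {
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        int window = Math.max(Math.max(len1, len2) / 2 - 1, 0);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int m = 0;
        for (int i = 0; i < len1; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(len2 - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) return 0.0;
        int t = 0, k = 0; // transpositions: matched characters that are out of order
        for (int i = 0; i < len1; i++) {
            if (matched1[i]) {
                while (!matched2[k]) k++;
                if (s1.charAt(i) != s2.charAt(k)) t++;
                k++;
            }
        }
        t /= 2;
        return ((double) m / len1 + (double) m / len2 + (double) (m - t) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score by the common prefix length P' (capped at 4).
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        int p = 0;
        while (p < max && s1.charAt(p) == s2.charAt(p)) p++;
        return j + (p / 10.0) * (1 - j);
    }

    // Jaccard similarity, assumed here to be computed over the word sets of the strings.
    static double jaccard(String s1, String s2) {
        Set<String> a = new HashSet<>(Arrays.asList(s1.toLowerCase().split("\\s+")));
        Set<String> b = new HashSet<>(Arrays.asList(s2.toLowerCase().split("\\s+")));
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Weighted combination: (K * JaroWinkler + Jaccard) / (K + 1).
    static double matchQuality(String searchText, String lmcText, double k) {
        return (k * jaroWinkler(searchText, lmcText) + jaccard(searchText, lmcText)) / (k + 1);
    }

    // Threshold test, including the stricter requirement for search strings shorter than L = 32.
    static boolean matches(String s1, String s2, double k, double omega) {
        double q = matchQuality(s1, s2, k);
        int L = 32;
        return s1.length() < L ? q >= omega * (2.0 - (double) s1.length() / L) : q >= omega;
    }

    public static void main(String[] args) {
        String search = "2010 1st XLIFF International Symposium - Limerick, Test";
        String lmc = "2010 1st XLIFF International Symposium - Limerick, Ireland";
        System.out.printf("MatchQuality = %.2f%n", matchQuality(search, lmc, 2.0));
        System.out.println("Matches at 0.85: " + matches(search, lmc, 2.0, 0.85));
    }
}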
The result of the comparison is included in the alternative translation element (alt-trans), in its source and target elements. Other information is also included: the percentage of the fuzzy matching (in the match-quality attribute) and the attributes (included in the filter) that surrounded the translation unit. These attributes and their values are stored as processing instructions in the alt-trans element.

3.2.3.6. Enrichment (leveraging)

If the tool finds matches between the LMC and the XLIFF file, it will include the translation suggestion and its metadata in an alt-trans element inside the corresponding trans-unit. As explained in detail earlier in this chapter, we decided to include the metadata information through processing instructions.
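A hedged sketch of this enrichment step, again using the standard Java DOM API rather than the actual XLIFF Phoenix code: the attribute names follow XLIFF 1.2 and the alttranstype value is the one described below, whereas the origin value, the metadata map and the method names are illustrative assumptions.

import java.util.Map;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AltTransEnricher {

    // Adds an alt-trans suggestion with its metadata to a matching trans-unit.
    static void enrich(Document doc, Element transUnit, String sourceText,
                       String targetText, double matchQuality,
                       Map<String, String> metadata) {
        Element altTrans = doc.createElement("alt-trans");
        altTrans.setAttribute("alttranstype", "proposal");
        altTrans.setAttribute("origin", "LMC");
        altTrans.setAttribute("match-quality", String.valueOf(matchQuality));

        Element source = doc.createElement("source");
        source.setTextContent(sourceText);
        Element target = doc.createElement("target");
        target.setTextContent(targetText);
        altTrans.appendChild(source);
        altTrans.appendChild(target);

        // Each metadata attribute-value pair recovered from the LMC is carried
        // as a processing instruction, e.g. <?contact-name Lucia Morado?>.
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            altTrans.appendChild(
                    doc.createProcessingInstruction(entry.getKey(), entry.getValue()));
        }
        transUnit.appendChild(altTrans);
    }
}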

For example, if we have the following LMC file containing one XLIFF file:

2010 1st XLIFF International Symposium - Limerick, Ireland2010 Primer Simposio Internacional sobre XLIFF - Limerick, Irlanda


In sections 1 (datatype="html", original="Symposium.htm", source-language="en" target-language="es", tool-id="Swordfish", date="2010-05-01T12:00:00Z", category="standards" and job-id="SymposiumWebpage"), 2 (company-name="LRC" contact-name="Lucia Morado" and contact-email="[email protected]") and 3 (tool- company="Maxprograms" tool-id="Swordfish" tool-name="Swordfish II" and tool- version="2.0-1 7DA-4-C") we encounter different metadata items that could be retrieved using our tool.

This is the untranslated file that we want to enrich, which contains four translation units (see section 4):


text/html; charset=UTF-8 2010 1st XLIFF International Symposium - Limerick, Test The first XLIFF Symposium 2010 is taking place in Limerick, Ireland on 22nd September Test. It will be part of the LRC preconference Test.


And this is the enriched file after XLIFF Phoenix has found only one match (section 5) and has leveraged the corresponding translation suggestion with its corresponding metadata information (section 6):

text/html; charset=UTF-8 2010 1st XLIFF International Symposium - Limerick, Test 2010 1st XLIFF International Symposium - Limerick, Ireland 2010 Primer Simposio Internacional sobre XLIFF - Limerick, Irlanda Section6 The first XLIFF Symposium 2010 is taking place in Limerick, Ireland on 22nd September Test. It will be part of the LRC preconference Test.


The following extract of the code shows more clearly the single match that the tool found; it was added inside the trans-unit, within an alt-trans element:

2010 1st XLIFF International Symposium - Limerick, Ireland 2010 Primer Simposio Internacional sobre XLIFF - Limerick, Irlanda


Inside the opening alt-trans tag you can see the following attributes: alttranstype, which indicates the type of match that is offered to the translator (in this case a proposal); origin, which indicates that it comes from an LMC repository; match-quality, which indicates the degree of correspondence between the two sentences, in this case 94% similar; and datatype, which indicates the format of the original file, in this case an HTML file. The source and target elements follow and, after them, in processing instructions, we have the metadata that we recovered from the XLIFF document: category, which indicates the topic of the text, in this case “standards”; company-name, which indicates the name of the company in charge of the translation, in this case "LRC"; contact-name and contact-email, which indicate the name and email of the person in charge of a specific localisation task, in this case “Lucia Morado” and “[email protected]"; state, which indicates the status of a translation, in this case "translated"; state-qualifier, which indicates the particular state of a translation, in this case empty because the tool did not find that information in the LMC; and tool-company, tool-id, tool-name and tool-version, which indicate those specific characteristics of the tool used to create or modify the file.

All those metadata items were recovered from different elements within the XLIFF file.

And this is how the file would be displayed to the translator in Swordfish II. You can see that the editing section is located in the right and central part; on the right side you can consult the translation suggestion (or “TM match”, as the tool has named it), and the provenance metadata can be consulted in a box below the translation suggestion (or “Match Properties”, as the tool has named it):

Figure 3. Enriched XLIFF file in Swordfish II (annotated to indicate the translation suggestion and the provenance metadata)

3.2.3.7. Exporting

The tool can save the document with the translation recommendations in a new XLIFF file, or it can overwrite the original one. Moreover, the tool gives the option to export the file in an HTML format (to have a quick and human-readable version of the file) and in a TMX format; the TMX conversion is offered because some CAT tools still do not support the alt-trans element (Morado Vázquez & Filip 2012, p.11), but they do support TMX files. To obtain a summary of the leveraging process, the tool can also export a statistical breakdown of the elements contained in the new XLIFF document. The generated statistics can be exported in either XML or HTML format.

Exporting as an XLIFF format

This file is the main output of our tool: an enriched XLIFF file which contains translation suggestions with their corresponding information. An example of an XLIFF Phoenix output file can be seen in the previous section.

Exporting as HTML

An HTML version of the file can also be exported to provide the translator with a human-readable version of the file. The HTML contains a table with five columns: Source Text, where the source text of each of the translation units is included; Alt-Target Text, where the target text of the alt-trans is included19; Match %, where the match percentage is included; and Source Lang and Target Lang, where the information about the source and target language is included.

Reuseage

This data has been generated by Xliff Phoenix

19 If the tool has not found any match, this cell would be empty.
The file processed was:

2010 1st XLIFF International Symposium - Limerick, Ireland.htm

Source Text | Alt-Target Text | Match % | Source Lang | Target Lang
text/html; charset=UTF-8 | | | en |
2010 1st XLIFF International Symposium - Limerick, Test | 2010 Primer Simposio Internacional sobre XLIFF - Limerick, Irlanda | 94.0 | en | es
The first XLIFF Symposium 2010 is taking place in Limerick, Ireland on 22nd September Test. | | | en |
It will be part of the LRC preconference Test. | | | en |


© 2010 UNIVERSITY OF LIMERICK.
All rights reserved.
This material may not be reproduced, displayed, modified or distributed without
the express prior written permission of the copyright holder.
This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at University of Limerick .

The same HTML file displayed in a web browser:

Exporting as TMX

All the matches found in the processing action can be exported in a TMX format. If a tool does not support the alt-trans element, it would always be possible to add this TMX file to it and leverage its information during the translation task20. The metadata information is included as a comment in the code, which does not break the validity of the format.

94.0 2010 1st XLIFF International Symposium - Limerick, Ireland 2010 Primer Simposio Internacional sobre XLIFF - Limerick, Irlanda
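For illustration only (not the actual export code), the same Java DOM API could attach the provenance metadata to a TMX translation unit as an XML comment, which TMX-compliant tools simply ignore; the element and metadata strings below are assumptions:

import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TmxMetadataComment {
    // Adds provenance metadata to a TMX <tu> element as a comment node.
    static void addMetadata(Document doc, Element tu, String metadata) {
        // e.g. metadata = "company-name=LRC; contact-name=Lucia Morado"
        Comment comment = doc.createComment(metadata);
        tu.insertBefore(comment, tu.getFirstChild());
    }
}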

Exporting statistics information

It is possible to export the information regarding the leveraging process in an HTML or XML file. The file obtained includes a count of all the segments where a fuzzy match was obtained, and it also classifies each match depending on the percentage of the fuzzy match: 100%, 95%-99%, 85%-94%, 75%-84%, 50%-74% and <50%. Here is an example of an HTML file with count information:

20 If that TMX file is included within the active translation memories to be used in a specific translation task.

Reuseage

This data has been generated by XLIFF Phoenix


Match Range | Segments | Percent of Document
100% | 0 | 0
95%-99% | 0 | 0
85%-94% | 1 | 100
75%-84% | 0 | 0
50%-74% | 0 | 0
<50% | 0 | 0
Total | 1 | 100
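The range classification shown above could be computed with a simple bucketing step such as the following sketch; the boundaries are taken from the ranges listed, and the class and method names are illustrative, not the actual XLIFF Phoenix code.

public class MatchRange {
    // Maps a fuzzy-match percentage to the ranges used in the statistics export.
    static String rangeOf(double matchQuality) {
        if (matchQuality >= 100.0) return "100%";
        if (matchQuality >= 95.0)  return "95%-99%";
        if (matchQuality >= 85.0)  return "85%-94%";
        if (matchQuality >= 75.0)  return "75%-84%";
        if (matchQuality >= 50.0)  return "50%-74%";
        return "<50%";
    }
}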