EACL 2021 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

OPUS-CAT: Desktop NMT with CAT integration and local fine-tuning

Tommi Nieminen
University of Helsinki, Yliopistonkatu 3, 00014 University of Helsinki, Finland
[email protected]

Abstract

OPUS-CAT is a collection of software which enables translators to use neural machine translation in computer-assisted translation tools without exposing themselves to the security and confidentiality risks inherent in online machine translation. OPUS-CAT uses the public OPUS-MT models, which are available for over a thousand language pairs. The generic OPUS-MT models can be fine-tuned with OPUS-CAT on the desktop using data for a specific client or domain.

1 Introduction

Neural machine translation (NMT) has brought about a dramatic increase in the quality of machine translation in the past five years. The results of the latest European Language Industry Survey (FIT Europe et al., 2020) confirm that NMT is now routinely used in professional translation work. NMT systems used in translation work are developed by specialized machine translation vendors, translation agencies, and organizations that have their own translation departments. Translators use NMT either at the request of a client, in which case the client provides the NMT, or independently, in which case they usually rely on web-based services offered by large tech companies (such as Google or Microsoft) or specialized machine translation vendors. These web-based services are mainly used through machine translation plugins or integrations that are available for all major computer-assisted translation (CAT) tools, such as SDL Trados and memoQ.

Even though MT has been extensively used in the translation industry for over a decade (Doherty et al., 2013), there is still considerable scope for growth: according to FIT Europe et al. (2020), 78 percent of language service companies plan to increase or start MT use, and most independent translation professionals use MT only occasionally. One of the factors slowing down the adoption of MT is the risk related to confidentiality and security. There are well-known risks involved with using web services, which also concern the web-based NMT services available to translators and organizations: data sent to the service may be intercepted en route, or it may be misused or handled carelessly by the service provider. These security and confidentiality risks (even if they are unlikely to actualize) hinder MT use by independent translation professionals, since their clients often specifically forbid or restrict the use of web-based MT (European Commission, 2019). Even if using web-based MT is not expressly forbidden, translators may consider it unethical or they may fear it might expose them to unexpected legal liabilities (Kamocki et al., 2016).

Producing MT directly on the translator's computer without any communication with external services eliminates the confidentiality and security risks associated with web-based MT. This requires an optimized NMT framework which is capable of running on Windows computers (as most CAT tools are only available for Windows), and pre-trained NMT models for all required language pairs. The Marian NMT framework (Junczys-Dowmunt et al., 2018) fulfills the first requirement, as it is highly optimized and supports Windows builds. Pre-trained NMT models are available from the OPUS-MT project (Tiedemann and Thottingal, 2020), which trains and publishes Marian-compatible NMT models with the data collected in the OPUS corpus (Tiedemann, 2012).

OPUS-CAT is a software collection which contains a local MT engine for Windows computers built around the Marian framework and OPUS-MT models, and a selection of plugins for CAT tools. OPUS-CAT is aimed at professional translators, which is why it also supports the fine-tuning of the base OPUS-MT models with


Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 288–294, April 19–23, 2021. ©2021 Association for Computational Linguistics

[Figure 1: Diagram of the software and models used in OPUS-CAT.]

project-specific data.

2 OPUS-CAT MT Engine

The main component of OPUS-CAT is the OPUS-CAT MT Engine, a locally installed Windows application with a graphical user interface. OPUS-CAT MT Engine can be used to download NMT models from the OPUS-MT model repository, which contains models for over a thousand language pairs.

[Figure 2: Install OPUS-MT models locally (1,000+ language pairs available).]

Once a model has been downloaded, OPUS-CAT MT Engine can use it to generate translations by invoking a Marian executable included in the installation. Before the text is sent to the Marian executable, OPUS-CAT MT Engine automatically pre-processes the text using the same method that was originally used for pre-processing the training corpus of the model. Pre-processing is model-specific: the older OPUS-MT models use Subword NMT (Sennrich et al., 2016) for segmenting the text, while newer models use SentencePiece (Kudo and Richardson, 2018). Both Subword NMT and SentencePiece are coded in Python, but they are distributed as standalone Windows executables with OPUS-CAT MT Engine, as requiring the users to install Python on Windows would complicate the setup process.

The OPUS-CAT MT Engine user interface provides simple functionality for translating text, but the translations are mainly intended to be generated via an API that the OPUS-CAT MT Engine exposes. This API can be used via two protocols: net.tcp and HTTP. net.tcp is used with the plugins for the SDL Trados and memoQ CAT tools, while HTTP is used for other plugins and integrations. The motivation for using net.tcp is that exposing a net.tcp service on the local Windows computer does not require administrator privileges, which makes setting up the OPUS-CAT MT Engine much easier for non-technical users. However, Trados and memoQ are the only CAT tools with plugin development kits sophisticated enough to allow net.tcp connections; for other tools, the API can be used via HTTP with some extra configuration steps. The API has three main functionalities:

• Translate: Generates a translation for a source sentence (or retrieves it from a cache) and returns it as a reply to the request.

• PreorderBatch: Adds a batch of source sentences to the translation queue and immediately returns a confirmation without waiting for the translations to be generated.

• Customize: Initiates model customization using the fine-tuning material included in the request.
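As an illustration of the HTTP variant of this API, the minimal client sketch below requests translations from a locally running engine. The base URL, method names (TranslateSingle, PreOrderBatch), and parameter names are assumptions made for the sake of the example; the actual endpoint layout should be checked against the OPUS-CAT documentation.

```python
import json
import urllib.parse
import urllib.request

# Assumed base address of a locally running OPUS-CAT MT Engine;
# the real port and service path may differ from this sketch.
API_BASE = "http://localhost:8500/MtRestService"

def build_url(method, **params):
    """Build a request URL for one of the engine's API functionalities."""
    return f"{API_BASE}/{method}?{urllib.parse.urlencode(params)}"

def translate(text, src, trg):
    """Translate a single sentence, or fetch it from the engine's cache."""
    url = build_url("TranslateSingle", input=text,
                    srcLangCode=src, trgLangCode=trg)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def preorder_batch(sentences, src, trg):
    """Queue sentences for translation; returns without waiting for them."""
    url = build_url("PreOrderBatch", input="|".join(sentences),
                    srcLangCode=src, trgLangCode=trg)
    with urllib.request.urlopen(url) as response:
        return response.status == 200
```

A plugin built on such calls can keep the translator's upcoming segments warm in the engine's cache, so that moving to a new segment only requires a cache lookup.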



The OPUS-CAT MT Engine stores the local NMT models in the user's application data folder in order to avoid file permission issues. The local application data folder is used, as saving the models in the roaming application data folder could lead to unwanted copying of the models if the same user profile is used on multiple computers. The user interface of the OPUS-CAT MT Engine contains functionalities for managing the models installed on the computer, such as deletion of models, packaging of models for migration to other systems, and tagging the models with descriptive tags. The tags can be used to select specific models in CAT tool plugins; e.g., a model fine-tuned for a specific customer can be tagged with the name of the customer.

3 CAT tool plugins and integration

OPUS-CAT contains plugins for three CAT tools: SDL Trados, memoQ and OmegaT. OPUS-CAT can also be used with the Wordfast CAT tool, which supports fetching translations from services via APIs. As SDL Trados is the established market leader among CAT tools, and it has the most extensive plugin development support, the OPUS-CAT plugin for SDL Trados is more feature-rich than the other plugins. The other plugins simply support fetching translations through the Translate API method. The SDL Trados plugin also contains an option to initiate the fine-tuning of a model based on the bilingual material included in a translation project.

One difficult aspect of integrating MT services with CAT tools is latency. Delays in presenting the translation to the user affect the user experience adversely and may even lower productivity significantly. For most MT services the delay is due to web latency, but for OPUS-CAT the generation of translations itself may be so slow that it causes a visible delay, since OPUS-CAT uses a CPU for translation instead of a much faster GPU. In any CAT tool, this delay can be eliminated by pre-translating the translation project with the MT service prior to starting the translation.

[Figure 3: Trados plugin settings. Note the preordering function and model tag.]

In the OPUS-CAT plugin for SDL Trados there is also a feature which can be used to initiate the translation of segments ahead of time. Whenever the translator moves to a new segment, the plugin will send the segments following the selected segment (the number of segments can be configured in the plugin settings) to OPUS-CAT for translation. This means that when the translator moves to the next segment, it has already been translated and can simply be retrieved from the OPUS-CAT translation cache.

4 Local fine-tuning of models

OPUS-CAT is intended for professional translators, and the utility of generic NMT models in professional translation is uncertain (Sánchez-Gijón et al., 2019), while performance improvements resulting from the use of domain-adapted NMT models have been observed multiple times (Läubli et al., 2019; Macken et al., 2020). Because of this, OPUS-CAT MT Engine includes a functionality for fine-tuning models with small amounts of bilingual data. The method of fine-tuning is simple: the generic OPUS-MT model for a language pair is duplicated, and Marian training of the duplicated model is resumed with the domain-specific data as the training set. This particular method of fine-tuning was first described in Luong and Manning (2015), but adaptation of statistical and neural MT models with domain-specific data has been common for over a decade (Koehn and Schroeder, 2007). The fine-tuning is performed using the same Marian executable included with the OPUS-CAT MT Engine installation, which is also used for generating translations.

OPUS-CAT MT Engine is a Windows program intended to run on the mid-tier desktop and laptop computers that translators commonly use, so the fine-tuning process cannot be computationally intensive. The fine-tuning must rely on CPUs, since GPUs suitable for neural network training are not available on the translators' computers. This places severe restrictions on the size of the fine-tuning set and the duration of the training.
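These constraints correspond to conservative Marian training settings. As a sketch only, the defaults described in this section could be expressed in Marian's YAML configuration format roughly as follows; this is not the configuration OPUS-CAT actually generates, and the file names are hypothetical.

```yaml
# Illustrative CPU-only fine-tuning settings in Marian's configuration
# format; the configuration that OPUS-CAT generates may differ.
model: fine-tuned-en-fi/model.npz   # duplicated copy of the base OPUS-MT model
train-sets:                         # client- or domain-specific bilingual data
  - client.en
  - client.fi
after-epochs: 1                     # stop after a single epoch
learn-rate: 0.00002                 # lowered from the default 0.0001
cpu-threads: 1                      # use a single thread
workspace: 2048                     # memory workspace in MB
```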



Furthermore, the fine-tuning functionality is intended to be used by translators without specialist knowledge about machine translation, so the users cannot be expected to be able to adjust the fine-tuning settings. That is why the fine-tuning functionality has to have a conservative set of default settings that work in almost all environments and circumstances.

In a typical translation job, deadlines usually allow considerably more time for delivery of the translation than the actual translation work requires. This is due to the fact that translators normally have multiple jobs underway or lined up at any given time, and extended deadlines are required so that translators can organize and prioritize their work. This means that there is generally at least a couple of hours available for running the fine-tuning process before the actual translation work has to begin. On the basis of this estimate, the fine-tuning process should generally take at most two hours.

Another consideration in fine-tuning is that the process takes place on a computer that may be used for other tasks during the fine-tuning. This means that the fine-tuning process cannot take advantage of all the resources available on the computer, as doing so would cause performance issues for the user. In order to keep the fine-tuning as non-intrusive as possible, OPUS-CAT MT Engine uses only a single thread and a workspace of 2048 MB for fine-tuning.

Because of the limited amount of processing power available and the target duration of at most two hours, the fine-tuning is stopped after a single epoch by default. The actual duration will vary according to the sentence count and the sentence length distribution of the fine-tuning set. The duration will be approximately two hours when the fine-tuning set contains the default maximum number of sentences, which is 10,000.

It would be possible to allow the fine-tuning to last for multiple epochs and to adjust the number of epochs based on the sentence count, but informal testing during development indicated that a single epoch of fine-tuning tends to have a noticeable effect on the MT output even with small fine-tuning sets. Also, some output corruption indicating over-fitting was detected when fine-tuning was continued over many epochs. Because of these indications of over-fitting, the learning rate was also lowered to 0.00002 from the default 0.0001. Advanced users can change these default settings in the settings tab of the OPUS-CAT MT Engine.

[Figure 4: Initiating fine-tuning from OPUS-CAT MT Engine.]

Currently fine-tuning can be initiated directly from the OPUS-CAT MT Engine or from the SDL Trados plugin. When initiated from the OPUS-CAT MT Engine, a .tmx file or a pair of source and target files can be used as fine-tuning material. Fine-tuning from the SDL Trados plugin allows for much more sophisticated selection of fine-tuning material. The fine-tuning functionality in the SDL Trados plugin is implemented as a batch task, which is performed for a given translation project. Translation projects usually contain segments which have already been translated (full matches). By default, the fine-tuning task extracts these segments as fine-tuning material. These are assumed to be the most relevant material, since they pertain directly to the translation project.

[Figure 5: Initiating fine-tuning from OPUS-CAT plugin for SDL Trados.]

If the translation project does not contain enough full matches, it is possible to extract translation units from the translation memories attached to the translation project.
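Translation units of this kind are stored in TMX files, whose structure (tu, tuv and seg elements, with an xml:lang attribute per variant) is standardized. As a rough sketch, assuming a minimal namespace-free TMX file, aligned sentence pairs could be extracted like this:

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute name, expanded with the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_text, src_lang, trg_lang):
    """Extract (source, target) segment pairs from TMX content.

    A TMX body contains <tu> translation units, each holding one
    <tuv> variant per language with the text inside a <seg> element.
    """
    pairs = []
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext())
        if src_lang in segs and trg_lang in segs:
            pairs.append((segs[src_lang], segs[trg_lang]))
    return pairs
```

Pairs extracted in this manner would then be written out as the parallel source and target files used for fine-tuning.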



The fine-tuning task uses the fuzzy matching functionality of SDL Trados to extract partially matching translation units according to the specified minimum fuzzy percentage. The task can also extract fine-tuning material by performing a concordance search for words or sequences of words in the source segment. Finally, if the other methods have not managed to extract a sufficient amount of fine-tuning material, the task can simply bulk up the fine-tuning set by extracting segments from the translation memories, starting with the newest (assumed to be the most relevant).

[Figure 6: Monitoring fine-tuning progress.]

Progress of the fine-tuning can be monitored in the OPUS-CAT MT Engine. When a fine-tuning job is initiated, part of the fine-tuning set is separated for use as an in-domain validation set. For most language pairs, OPUS-CAT MT Engine also contains out-of-domain validation sets, which have been extracted from the Tatoeba corpus (Tiedemann, 2020). The in-domain and out-of-domain validation sets are combined and evaluated periodically during fine-tuning with SacreBLEU (Post, 2018), which is also included in OPUS-CAT as a standalone Windows executable. The evaluation results for each set are plotted as a graph in the OPUS-CAT MT Engine. OPUS-CAT MT Engine also displays an estimate of the remaining duration of the fine-tuning. These visual indications of progress are important, as the users of the fine-tuning functionality are translators without specialist technical skills.

The fine-tuning functionality also has an option to include tags found in the fine-tuning set as generic tag markers. The fine-tuned model will then learn to generate tag markers in the translations, and these can be used to transfer tags from source to target in CAT tools (currently only the SDL Trados plugin supports tag conversion). Placeholder tags and tag pairs are converted separately. This approach to tag handling is similar to the one found in Hanneman and Dinu (2020), but simpler. The main difference is that the same textual tag marker is used for every tag, so the tag handling assumes that the tag order is identical in both source and target.

5 Related work

Desktop MT systems have been available at least since the 1990s (Richards, 1994), when desktop computers became powerful enough to run rule-based MT systems. In the SMT era, the higher computational requirements made desktop MT difficult, but there were still some examples of desktop SMT, such as Slate Desktop (Slate Rocks!, 2021). Unlike OPUS-CAT, these earlier desktop MT programs were commercial products. As for NMT, the currently active Bergamot project (Bergamot, 2021) aims to make client-side MT available in web browsers, and it also uses Marian as its NMT framework. However, Bergamot is aimed at the general public, while OPUS-CAT is intended for professional translators. To our knowledge, there is no other software that is free to use and offers a local NMT fine-tuning functionality (commercial MT providers do provide local MT engine installations, which may support local fine-tuning).

6 Current status and future work

OPUS-CAT is based on software developed originally for the Fiskmö project (Tiedemann et al., 2020), and it is currently being developed as part of the European Language Grid programme. The previous version of the software has been used by several organizations in Finland for professional translation. Based on the feedback from users, the most important features that translators would like to see are real-time adaptation of NMT models with new translations, and the enforcement of correct terminology and document-level consistency. These will be the main priorities in the development of OPUS-CAT. We will also be collecting user experiences on the local fine-tuning capability, and will develop the feature and its documentation according to that feedback.

7 Conclusion

OPUS-CAT is a collection of software that makes it possible to use NMT locally on desktop computers



without the risks posed by web-based services. It uses models from the OPUS-MT project, which offers NMT models for over a thousand language pairs. OPUS-CAT is based on the efficient and optimized Marian NMT framework, which is fast enough to work usefully even on mid-tier computers. The local fine-tuning functionality makes it possible to adapt models to specific domains and clients, which is vital when using MT for professional translation. OPUS-CAT also contains plugins for several major CAT tools, and exposes an API which can be used in integrations with other tools. The OPUS-CAT plugin for SDL Trados is especially well suited for integration into translation workflows due to its sophisticated fine-tuning functionality, which is implemented as a workflow task. OPUS-CAT is licensed under the MIT License, and the source code and software releases are available at https://github.com/Helsinki-NLP/OPUS-CAT.

References

Bergamot. 2021. Bergamot.

Stephen Doherty, Federico Gaspari, Declan Groves, Josef Genabith, Lucia Specia, Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2013. Mapping the industry I: Findings on translation technologies and quality assessment. Globalization and Localization Association.

European Commission. 2019. Tender specifications: Translation of European Union documents.

FIT Europe, EUATC, ELIA, GALA, and LINDWeb. 2020. European language industry survey 2020.

Greg Hanneman and Georgiana Dinu. 2020. How should markup tags be translated? In Proceedings of the Fifth Conference on Machine Translation, pages 1160–1173, Online. Association for Computational Linguistics.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, Melbourne, Australia.

Pawel Kamocki, Jim O'Regan, and Marc Stauch. 2016. All your data are belong to us. European perspectives on privacy issues in 'free' online machine translation services.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 224–227, Prague, Czech Republic. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. pages 66–71.

Samuel Läubli, Chantal Amrhein, Patrick Düggelin, Beatriz Gonzalez, Alena Zwahlen, and M. Volk. 2019. Post-editing productivity with neural machine translation: An empirical assessment of speed and quality in the banking and finance domain. In MT Summit.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains.

Lieve Macken, Daniel Prou, and Arda Tezcan. 2020. Quantifying the effect of machine translation in a high-quality human translation production process. Informatics, 7.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

John Richards. 1994. LogoVista E to J. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, USA.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Slate Rocks! 2021. Slate Desktop.

Pilar Sánchez-Gijón, Joss Moorkens, and Andy Way. 2019. Post-editing neural machine translation versus translation memory segments. Machine Translation, 33:1–29.

Jörg Tiedemann. 2020. The Tatoeba Translation Challenge – Realistic data sets for low resource and multilingual MT. In Proceedings of the Fifth Conference on Machine Translation (Volume 1: Research Papers). Association for Computational Linguistics.

Jörg Tiedemann, Tommi Nieminen, Mikko Aulamo, Jenna Kanerva, Akseli Leino, Filip Ginter, and Niko Papula. 2020. The FISKMÖ project: Resources and tools for Finnish-Swedish machine translation and cross-linguistic research. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3808–3815, Marseille, France. European Language Resources Association.



Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT — Building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

