Translation Memory (TM) Software (+ Text Alignment Software, Etc.)

USING THEINTERNET FOR SPECIALISEDTRANSLATION

“muchtranslationworkis carried out in a computer-assisted translation (CAT)environment, which may vary from a standard desktopequippedwithwordprocessing softwareanda browserto a full-blowntranslatorworkstation consistingofa multiplicityoftools specificallycreatedfortranslators of technicaltextsand localizers."

“Translation agencies organize their workflow around project managementsystemsthatdistribute translationtasks,memoriesand terminologiestoand around individual translators.”

(F.Zanettin 2014, “Corpora inTranslation”) Translation technologies

• Electronic dictionariesand terminologicaldatabases, the arrivalof the Internet with its numerous possibilitiesfor research, documentation and communication, and the emergence of computer- assistedtranslationtools.

Alcina A. (2008) «Translation technologies - Scope, tools and resources». Target20:1,79–102 Degrees of Translationautomation Degrees of Translationautomation

• Theterm traditional human translationis understood to referto translation without any kind of automation • Fullyautomatichigh qualitytranslation(FAHQT)means translationthat is performed wholly by the computer,without any kindof human involvement,and is of “highquality” • Human-aidedmachinetranslation (HAMT)refersto systems in whichthe translationis essentiallycarriedout by the program itself, but aid required from humans • Machine-aided human translation (MAHT) comprises any process or degree of automation in the translationprocess, provided that the mechanical intervention provides some kind of linguisticsupport. Toolsvs.Resources

• The word tool refers to computer programs that enable translators to carry out a series of functions or tasks with a set of data that they have prepared and, at the same time, allows a particular kind of results to be obtained. • Internet search engines • Word processor • Trados, Wordfast, Déjà Vu, Across, OmegaT,… • Antconc, Wordsmith…

• By resourceswe refer to all sets of previously gathered linguistic data which are organized in a particular manner and made available in some electronic format so that they can be used or looked up or usedby translators used in the course of some phase of processing. Terminological databases (e.g. IATE), glossaries,… • (online) dictionaries • British National Corpus, … why and how can we mine the web? Extended units of meaning

• the study of words “by presenting them in thecompany they usually keep - that is to say, an element of their meaning is indicated when their habitual word accompaniments are shown”

• “Extended units of meaning” at work in language (Sinclair, 1996)

• collocation Words must be studied in • colligation context rather than in • semantic preference isolation • semantic prosody Extended units of meaning

• Differences in Italian between (from Taylor,1998: 61): ◦ “pressione alta” = “high (blood) pressure”[medical] ◦ “altapressione” = “(banks of) high pressure” [meteorological]

• collocation Words must be studied in context rather than in isolation Collocation

• “Tendency of certain words to co-occur regularly in a given language” (Mona Baker,1992: 47) • As observed in actual texts (vs. intuition)

• Key features of collocations o language-specific (collocations vary from language tolanguage)

• Collocationsare not stable or fixed o they may change diachronically (overtime) in generallanguage o they may change in LSP vs. generallanguage o they may change across LSPdomains

11 • “Provide”tends to occur with “information”,“service(s)”, “support”,“help”,“money”,“protection”,“food”,“care”

• “Cause” tends to occur with “pain”, “damage”,“harm” Semantic prosody •“A consistentauraof meaning with which a form is imbued by its collcates” (Louw1993) • “Feeling” or “aura” that is evoked by using certain words (reinforced by collocates, due to co-selectional implications and restrictions) •Usually this feeling is “positive or negative” • “Provide”tends to occur with wordsdenoting things which are desirable,necessary or good, such as “information”,“service(s)”, “support”,“help”,“money”,“protection”,“food”,“care” • cf. Italian “fornire” and“elargire” • “Cause” tends to occur with words denoting negative repercussions/consequences, such as “pain”,“damage”,“harm” • cf. Italian “causare” • “Commit” is used with e.g. “murder”, “crime”,“suicide”

•“Revoke”is used with e.g. “licence”, “permit”,“authorization”

14 Semantic preference

•Relation between a lemma and a set of semantically related words (Stubbs, 2001: 65) • Lemma: dictionary entry of a word • “Commit” is used with a group of semantically similar words, e.g. “murder”,“crime”,“suicide” (cf.Italian “commettere”) •“Revoke” is used with e.g. “licence”,“permit”,“authorization”

•Semantic prosody → positive/negativeevaluation •Semantic preference→ relation to words belongingto a particular,definable semantic field

15 Colligation

• We heard the visitorsleave/leaving. • We noticed him walk away/walkingaway. • We heard Pavarottising/singing. • We saw it fall/falling.

16 Colligation

hear, notice, see, watch enters into colligation with the sequence of object + either the bare infinitive or the -ing form; e.g.

• We heard the visitorsleave/leaving. • We noticed him walk away/walkingaway. • We heard Pavarottising/singing. • We saw it fall/falling.

•Relation between a pair of grammatical categories or a pairing of lexis and grammar (Stubbs, 2001:65) 17 Conclusionon using theWeb for specialised translation – Main advantages

• massive amount of texts and multi-sourceinformation can besearched • content is constantly “refreshed” (i.e. updatedand extended)

How to friend and unfriend someone on Facebook - ComputerHope 1.https://www.computerhope.com › ... › Facebook Help 24 gen 2018 - Before you can connect with another person on Facebook and view their full profile, you must first become friends. Below are the steps on how to find new friends on Facebook, add friends, and how to unfriend any of your current friends. How tofind friends on Facebook; How to friend someone on... Conclusionon using theWeb for specialised translation – Main advantages

• a lot of sources,text types and domains/topics are represented • manylanguages (Englishis dominant, good presenceof Italian) • replicable searchtechniques across(your working/target)languages • it is availableat anytime, at virtuallyno cost! Main disadvantages andproblems o need to differentiate good/reliablesourcesfrom questionableinformation ▪for facts (limited control over user-generatedcontent likeWikipedia) ▪for linguistic usage (badly translated, non-native texts, poorauthors) ▪it may be difficult to identify differences between expert/non-expert use o data/resultsstill need to be interpreted Main disadvantages and problems o Googlefocuseson content/information,rather than linguistic forms • the rankingand sortingof results are performed according to criterialike

• “popularity” of the websites, orgeographic relevance

• the same search can yield different numbers of hits, depending on unpredictable and uncontrollable factors as the time of the day,or the location from which the query is made -- wordcountsare not reliable+it is difficult to compare frequencies to verify translation hypothes

• data on which searches areperformed isunstable/changes Main disadvantages and problems

Particularly relevant to linguists/translators: ▪ no possible/meaningfulsortingof hits/results(esp.L/R-handcollocates) - e.g.alphabetical sorting of collocates,from least to most frequent,etc. - think of e.g.the “a * range/array of”, “on the vergeof” exercises ▪ punctuationanduppercase(capitals)areignored,e.g.“aids” vs.“AIDS” ▪ impossibleto search partsof words,e.g.start with “geo…”,endin “-itis” ▪ no lemmatised searches - hardto calculatefrequenciesof specificwordcombinations - e.g.to calculatehow frequentisthe combination “tirare l’acquaalproprio mulino”, all inflected forms must be searchedfor ▪ no POS-sensitivesearches - e.g.to searchfor ‘spot’as anoun vs.adverb ▪ no possibility to specifythe spano ccuringbetweentwosearchterms - i.e. the * wildcard can include zero to nwords «Googleology is bad science» (Kilgarriff 2007) MACHINE TRANSLATION (MT)

1 Machine translation (MT): 24 definition and key terms • Definition of machinetranslation:

“computerised systems responsible for the production of translations from one natural language into another, with or without human assistance” (Hutchins & Somers, 1992: 3)

• Some key terms:

o MT system / engine / service = the software that produces the translation

o input = the source text

o [raw] output = [unedited] target text

“L'inglese di Expo non sembra Google Translate, è Google Translate”

From http://www.linkiesta.it/it/blog-post/2015/02/12/linglese-di-expo-non-sembra- google-translate-e-google-translate/22476/ MT – popular conceptions

Probably the translation technology that attracts the most public attention, esp. among non-translators. Machine translation (MT): main architectures of MT systems 28 Parallel corpora: a collection of original texts in language L1 and their translations into a give L2

Texts in SL Texts in TL 29 Machine translation (MT):

why is MT so difficult? Or why is translation difficult for computers? • So why is translation difficultfor computers?

o Some blame the computer’s lack of “real-world knowledge”

legno bosco foresta IT

wood forest EN

bois forêt FR Machine translation (MT): 30 why is MT so difficult? Or why is translation difficult for computers? • Partly because the translation often depends on the context / situation, which the computer is not able to take into account

“The ball is in your court”

“Il pallone è nella vostra metà campo” “Il ballo è nella vostra corte” (the manager to the players) (the chamberlain to the king) Machine translation (MT): 31 why is MT so difficult? Or why is translation difficult for computers? • Lexical ambiguities (gramm. category <-> meaning <-> translation) for example, in EN:round

j) My team was eliminated in the first round (Noun: girone)

k) The cowboy started to round up the cattle (Verb: radunare)

l) We can use the round table for dinner (Adjective: rotondo)

m) Maggie is going on a cruise round the world (Preposition: intornoal)

• Machine translation (MT): 32 some linguistic phenomena that are particularly difficult forMT

1) The chimp eats the banana because it is greedy.

2) The chimp eats the banana because it is ripe.

3) The chimp eats the banana because it is lunchtime. ? • The case / example of pronominal anaphora (resolution), difficult for MT 33

MT post-editing MT post-editing 34

• The aim of post-editing is to make the revised output usable or understandable

• The priority is to save time andmoney • The extent and the accuracy of post-editing are negotiated/specified on a case by case basis, depending on the needs and requirements

• Different “types” and levels of post-editing (in companies, organisations): • no post-editing • internal circulation, almost never external publication • minimum post-editing • internal circulation, rarely external publication • full/complete post-editing (but… is it worth it?) • very rarely internal circulation, mostly external publication MT post-editing: introduction 35

• new skill that is acquired with experience, different from translation • in this scenario one has to balance and optimise quality-speed-cost, in relation to the intended use/durationof the translation

• ability of the readers/addresseesto makeuse of the doc. • length ofuse of the document • available and viable options • needs and expectations of the enduser(s) • type, length and “visibility” ofthe document Aims and level of PE (vs.translation/proofreading!) 36

• (minimum/full-complete) are decided specifically

• Factors to be considered(prioritised)

• save time and money (quality is less relevant) • understandability and correctness of general meaning are key

• Factors to be ignored (irrelevant in PE)

• any detail or nuance •elegance, fluency, naturalness of expression, etc. on average PE is paid roughly 50%of the “real/proper” translation 37

MT pre-editing 38

Limit input domain / topic

•There are two possibilities to limit the texts / language in / for MT: • adopt a controlled language (restricted input) • use the sublanguageapproach

• Common aims with both options (to the advantage of MT): • limited vocabulary • more certainty on interpretation • reduce syntacticvariation Controlled language 39

• Prescriptive rules aimed at normalising the style of the input (ST), e.g.

• do not write sentences with more than 20 words (general, language-neutral)

• avoid passive constructions, use only active verb forms

• avoid anaphoras, make all subjects and pronominal references explicit

• in EN: do not omit “that” in relative clauses (language-specific)

• in IT: do not use “solo” as an adverb, but use “soltanto/solamente” • in IT: use the word “minuto” only as a noun (i.e. to mean 60 seconds);

for the adjectival meaning, use only “piccolo”

Etc…… The result of controlled language is restricted input Sublanguage (1/2) 40

• Natural/normal behaviour of language within a well-defined domain (~ LSP, specialised language, jargon, etc.) • “sub-” in the mathematical sense as in “subset”, not derogatory! • referred to very well-defined, enclosed, limited domains and texts

• A sublanguage exists and is used regardless of MT, but one can design an MT system that takes advantage of this sublanguage • vocabulary • limited (relatively few concepts to be covered/expressed) • finite/closed (innovation/deviation tend to be avoided) • a few homographs, in general limited use of synonyms and coreferences • syntax • limited range of structures and constructions (regularity + repetitiveness) • usually sublanguages are very similar cross-linguistically between SL/TL(s) Machine translation (MT): 41 restrictions to the use of MT

• Input must be in (or converted into) electronic format

• Correct formatting and layout of the input are very important

o the word “e r r o r” (spaced letters) would not be recognised / translated

o spelling and typos are crucial: THEYBOOKS AROOM … (anybody would understand banal mistakes, but not an MT system!)

• Limited availability of language combinations (improving with SMT)

o coverage mostly limited to “usual” big languages with commercial interest COMPUTER-ASSISTED TRANSLATION (CAT) TOOLS

1 Computer-assisted translation (CAT)tools

• Computer-assisted translation or computer(machine)-aided translation (CAT) refers to a variety of tools, a family of software products designed to support professional translators in their work. • CAT is a “recent” development, derived from MT over the last 20 years

• The actual development of commercial CAT tools started in the 1990’s –the so-called“translator’s workstation / workbench”, which includes • terminology managementpackages • translation memory (TM) software (+ text alignment software, etc.)

• CATtools are pieces of software designed to enhance the work of translators: • maximise speed → higher productivity • improve coherence and precision → higher quality

43 CAT tools, example 1: terminology management packages

• Used to create, store, retrieve and manipulate bi-/multilingual termbases/glossaries

• As searching for terminology can be highly time-consuming (even up to 75% of translators’ time), setting up a database which gathers the terminology you come across is vital.

• Lists in word processors / spreadsheets (e.g. Excel) → limited options for presenting and sorting data

• The terminology covered is usually that of a given (sub-)discipline or the terms needed for a specific translation project.

• Terminology records consist of a number of flexible fields

CAT tools, example 2: translation memory (TM) software

• Translation memory (TM): “multilingual text archive containing […] multilingual texts, allowing storage and retrieval of aligned text segments against various search conditions” (EAGLES* 1995) * Evaluation of Natural Language Processing Systems •This roughly means: a “filing cabinet” (i.e. a database) of old translations whose bits can be retrieved and used when / as needed by the translator • essentially a textual database that can be searched • pairs of source-text and target-text segments

Note: Translation memory indicates both the software tool and the contents of the46 database, i.e. the whole set of aligned text segments that it includes Translation memory (TM) software

• Key idea: recycle similar past translations, never translate the same (or a similar) text twice • How it works: • TM tools divide the source text – which must be in (or turned into, e.g. with OCR) electronic/digital format –into segments, which translators can translate one-by-one in the traditional way. • These segments (usually sentences, or even phrases) are then sent to a built-in database. When there is a new source segment equal or similar to one already translated, the memory retrieves the previous translation from the database. • When is this most useful: • for the translation of any text that has a high degree of repeated terms and phrases which must be translated consistently, as is the case with e.g. user manuals, computer products and subsequent versions of the same document (e.g. website updates). • mostly relevant to technical/specialised translation (not l4i7terature) Using translation memory (TM) software

• Scenario ◦you have to translate the user manual of a printer (new model) from English into Italian ◦ a lot of repetition within the documentitself ◦overlap and repetitions across updated (old-new) versions of the documentation ◦ you have a relevant TM (similar topic / domain / texts / clients) ◦ you translated the previous manual(s) ◦ TM provided by client / translation agency / colleague

48 Using translation memory (TM) software

• Translation of a printer manual English (A) → Italian (B)

Source text (in language A) ST: There are 4 ways to change print settings for this printer

Exact/Perfect match (everything in the segment is exactly the same) A: There are 4 ways to change print settings for this printer B: Ci sono 4 modi per cambiare le impostazioni di stampa di questa stampante

Full match (only figures, dates and similar small details are different)

A: There are 2 ways to change print settings for thisprinter

B:49 Ci sono 2 modi per cambiare le impostazioni di stampa di questa stampante Using translation memory (TM) software

Source text (in language A) ST: “There are 4 ways to change print settings for thisprinter”

Fuzzy match 85% similar (a few words in translation unit are different) A: “There are several ways to change print settings for the printer” B: “Ci sono vari modi per cambiare le impostazioni di stampa alla stampante”

Fuzzy match 60% similar (some words in translation unit are different)

A: “There are several waysto modify the default setting of your printer” B: “Ci sono vari modi per modificare l’impostazione standard della tua stampante”

• With the acceptibility threshold of the TM tool set at 75%, no candidate translation unit under that level of similarity is retrieved a50nd shown to thetranslator!! • CAT tools - Advantages • can speed up the translation process and increase productivity •can improve translation quality (by enhancing terminological and phraseological coherence) • can help translatorsprovide quotations • allow for collaboration over large projects • TMs/termbases can be shared by several translatorsand updated in realtime • Useless for some text types (e.g.literature) • Essential for many specialized/technicaldomains • Translation agencies require translatorsto use (specific types of) CAT tools 16 Some issues about TMs

• Technical / practicalissues • different approaches: some CAT tools have a proprietary, stand-alone text editor, others are «integrated»(e.g. to Word processor), some recent ones are fully online • proprietary vs. interchange formats • no matches calculated below sentence-level (e.g. at phrase level) • but Concordancefunction is becomingstandard • criteriaused to define similarity / matches • matching is calculatednot on the basis of sentence or word meaning, but on the basis of character-string similarity TP: I bambini giocano in gruppo con il pallone FM1: I pampini giovano il grullo con il tallone (94% match) FM2: I bimbi si divertono giocando a calcio insieme (42% match) 16 Some issues about TMs

• Language / translation issues • segmentation implies that overall perception of the ST/TT is lost → ST structure tends to be reproduced in TT • cross-linguisticdifferencesin e.g. cohesive patternsmight be overlooked • using TMs limits the translator’s creativity, as s/he is usually expected to use the terminology and phraseology included in the TM • TMs can sometimes be reversed, as if translation direction did not matter… • need to control the reliability of translations within TM CORPORA AND TRANSLATION

1 What is a corpus? Some (authoritative) definitions

•“a collection of naturally-occurringlanguagetext, chosen to characterize a state or variety of a language” (Sinclair, 1991:171) •“a collection of texts assumed to be representative ofa given language, dialect, or other subset of a language, to be used for linguistic analysis” (Francis, 1992:7) • “a closed set of texts in machine-readable form establishedfor general or specificpurposes by previously definedcriteria”(Engwall,1992:167)

•“a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration” (McEnery& Wilson, 1996:23) •“a collection of (1) machine-readable(2) authentictexts […] which is (3) sampled to be (4) representative of a particular language or language variety”(McEneryet al., 2006:5) What is / is not a corpus…?

A newspaper archive on CD-ROM? The answer is An online glossary? always “NO” A digital library (e.g. Project Gutenberg)? (see All RAI 1 programmes (e.g. for spoken TV definition) language) Corpora vs. web •Corpora: – Usually stable •searches can be replicated – Control overcontents •we can select the texts to be included, or have control over selection strategies – Ad-hoclinguistically-awaresoftwareto investigate them •concordancers can sort / organise concordance lines • Web (as accessed via Google or other search engines): – Very unstable •results can change at any time for reasons beyond our control – No controlover contents •what/how many texts are indexed by Google’s robots? – Limited control oversearchresults •cannot sort or organise hits meaningfully; they are presented randomly

Click here for another corpus vs. Google comparison What types of corpora exist?Abrief overview

• A corpus is a principled collection of naturally occurring electronic texts designed to be a representative sample of language in actual use • Some of the main features and criteria used to describe and classify corpora: general closed / finite specialised open-ended (monitor) written raw (pre-corpus) spoken (transcribed) marked-up (augmented) multimodal (audio/video) POS-tagged (augmented) balanced (sample) annotated (augmented) opportunistic monolingual synchronic bi- / multilingual diachronic parallel static comparable dynamic An example of planned balance: the British National Corpus

100 m words of contemporary spoken and writtenBritish English Representative of British English “as a whole” Designed to be appropriate for a variety of uses: lexicography, education, research, commercial applications (computational tools) Balanced with regard to genre, subject matter andstyle Sampling and representativeness very difficult toensure Dynamic (Monitor) vs static (Finite)

A static corpus will give a snapshot of language use at a given time Easier to control balance of content May limit usefulness, esp. as time passes

A dynamic corpus is ever-changing Called “monitor” corpus because allows us to monitor language change overtime

Concordance for nodeword “eyes” (sorted 1L) generated from the BNC Parallel vs. comparable multilingual corpora Parallel (translational) corpora Comparable corpora •contain translationally •texts originally produced (not “equivalent” texts: STsand their translated) in the respective corresponding TTs languages •need to be aligned, usually at •consist of independent texts the sentence level, i.e. SL which are “similar” according to sentence X matched to TL some pre-determined criteria sentence X’ •the various language •context is provided to account components share a set of for “equivalence” and common features,e.g. text type, “translation shifts” between ST genre, publication span,domain, and TT topic •translation direction needs to •parameters definingthis be clear, i.e. which are SL and TL similarity vary widely components of the corpus 63 Bilingual parallel corpora on the web

• OPUS corpus, opus.lingfil.uu.se

• A variety of multilingual parallel corpora • European Parliament debates (EuroParl corpus) • European Central Bank corpus • UN documents • Subtitles (open subtitle project) • Software manuals (PHP, OO) • …

64 http://opus.lingfil.uu.se/ → EuroParl v7 search interface

help

Choose SL

Query Other usefulfunctions

ChooseTL(s) Sort + Launch the query Comparable Eng/Ita corpus on botany

66 Summing up: corpus use in translation Main uses: Test/generate hypotheses as to interpretation of the source text, and as to appropriate translations helpful when you’re dealing with little known text-types / domains helpful when you’re dealing with a little knownlanguage Improve quality – capture subtleties of source text, produce translations which read like native speaker texts More precisely, Reference corpora provide insights onphraseological regularities in discourse Comparable corpora (automatic and manual) can be used for (contrastive) specialised/genre-controlled text analysis Parallel corpora provide equivalents in context/evidence of translation strategies (and are more versatile than TMs) Question on MT

Discuss the most important milestones in the history of

Machine Translation, also with regard to the evolution of MT approaches.

8 Sample answer on MT

• Important points, to mention: ☺

– “Birth” of MT, idea by W. Weaver and Memorandum (1949) – Innovativeidea of usingcomputersto translate – First demonstrationGeorgetown-IBM(1954) – US governmentfunding, 1st generationdirect rule-basedapproaches – FAHQMT ideal (with short explanation) – ALPAC report (1966): gist of contentsand negativerepercussionsin the US

9 Sample answer onMT

• Important points, to mention: [continued]: ☺ – ’80s: PCs more and more widespread,commercialsystems in US and Europe – 2nd generation, rule-based transferapproach – rule-basedinterlinguaapproach (complex,disappointing) – Excessive ideal of FAHQMT graduallyreplacedby more realistic alternatives – Statistical approach replaces rule-basedapproach (advantages) – Increase in the numberof texts in digital format to be translated (Internet,email) – End ’90s: TA systems availableonline (evenfor free) – MT integrationwith CAT tools

– L1a0test developments: neural MT Sample answer onMT

• Less relevant, non essential points:

– It is not necessary to provide definitions (for MT, input, output, etc.), you can take them for granted

– Civil applications of MT (e.g. for warcryptography) – Individuals (apart from W. Weaver) – Conferences and journals, specific researchcentres – Not important to differentiate between countries (apart from US beginning)

– Not necessary to mention single MT tools or firms

11 Sample answer onMT

• FOR THIS QUESTION,avoid : 

– Explain pre-editingvs post-editing – Detailed explanation of sublanguageand controlled language – Detailed explanation of differentrule-basedapproaches – Detailed explanationof statistical MT approaches,the resources needed to developthem and their advantages

– Details on the ALPAC committee – Reviewof availableonlineMTsystems – Detailed explanation of how MT is integratedinto CAT tools