USING THE INTERNET FOR SPECIALISED TRANSLATION

Translation Technology

“much translation work is carried out in a computer-assisted translation (CAT) environment, which may vary from a standard desktop equipped with word processing software and a browser to a full-blown translator workstation consisting of a multiplicity of tools specifically created for translators of technical texts and localizers.”

“Translation agencies organize their workflow around project management systems that distribute translation tasks, memories and terminologies to and around individual translators.”

(F. Zanettin 2014, “Corpora in Translation”)

Translation technologies

• electronic dictionaries and terminological databases, the arrival of the Internet with its numerous possibilities for research, documentation and communication, and the emergence of computer-assisted translation tools.

Alcina A. (2008) «Translation technologies - Scope, tools and resources». Target 20:1, 79–102

Degrees of Translation automation

• The term traditional human translation is understood to refer to translation without any kind of automation
• Fully automatic high quality translation (FAHQT) means translation that is performed wholly by the computer, without any kind of human involvement, and is of “high quality”
• Human-aided machine translation (HAMT) refers to systems in which the translation is essentially carried out by the program itself, but aid is required from humans
• Machine-aided human translation (MAHT) comprises any process or degree of automation in the translation process, provided that the mechanical intervention provides some kind of linguistic support.

Tools vs. Resources

• The word tool refers to computer programs that enable translators to carry out a series of functions or tasks with a set of data that they have prepared and, at the same time, allow a particular kind of result to be obtained.
  o Internet search engines
  o Word processors
  o Trados, Déjà Vu, Across, OmegaT, …
  o AntConc, WordSmith, …

• By resources we refer to all sets of previously gathered linguistic data which are organized in a particular manner and made available in some electronic format so that they can be looked up or used by translators in the course of some phase of processing.
  o Terminological databases (e.g. IATE), glossaries, …
  o (online) dictionaries
  o British National Corpus, …

Why and how can we mine the web?

Extended units of meaning

• the study of words “by presenting them in the company they usually keep - that is to say, an element of their meaning is indicated when their habitual word accompaniments are shown”

• “Extended units of meaning” at work in language (Sinclair, 1996)

• collocation
• colligation
• semantic preference
• semantic prosody

Words must be studied in context rather than in isolation.

• Differences in Italian between (from Taylor, 1998: 61):
  ◦ “pressione alta” = “high (blood) pressure” [medical]
  ◦ “alta pressione” = “(banks of) high pressure” [meteorological]

Collocation

• “Tendency of certain words to co-occur regularly in a given language” (Mona Baker, 1992: 47)
• As observed in actual texts (vs. intuition)

• Key features of collocations
  o language-specific (collocations vary from language to language)

• Collocations are not stable or fixed
  o they may change diachronically (over time) in general language
  o they may change in LSP vs. general language
  o they may change across LSP domains
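Since collocations are observed in actual texts rather than intuited, a rough way to surface them is to count which words occur near a node word. The following is a minimal sketch only, assuming a plain-text sample; the function name `collocates`, the `span` parameter and the sample sentences are invented for illustration and are not taken from any particular corpus tool.

```python
import re
from collections import Counter

def collocates(text, node, span=4):
    """Count the words found within `span` tokens left or right of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Left and right context windows around this occurrence of the node
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text (an assumption, not corpus data)
sample = (
    "The storm caused serious damage to the harbour. "
    "Smoking can cause harm to unborn children. "
    "The charity will provide support and provide information to families."
)

for word, freq in collocates(sample, "provide", span=2).most_common():
    print(word, freq)
```

Note that the sketch matches word forms only: a search for "cause" does not find "caused", so a real study would need lemmatisation and a much larger corpus before any frequency could be trusted.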

Semantic prosody

• “A consistent aura of meaning with which a form is imbued by its collocates” (Louw 1993)
• “Feeling” or “aura” that is evoked by using certain words (reinforced by collocates, due to co-selectional implications and restrictions)
• Usually this feeling is “positive or negative”
• “Provide” tends to occur with words denoting things which are desirable, necessary or good, such as “information”, “service(s)”, “support”, “help”, “money”, “protection”, “food”, “care”
  o cf. Italian “fornire” and “elargire”
• “Cause” tends to occur with words denoting negative repercussions/consequences, such as “pain”, “damage”, “harm”
  o cf. Italian “causare”
• Not necessarily accessible to intuition.

Semantic preference

• Relation between a lemma and a set of semantically related words (Stubbs, 2001: 65)
  o Lemma: base form (lexeme) or dictionary entry of a word
• “Commit” is used with a group of semantically similar words, e.g. “murder”, “crime”, “suicide” (cf. Italian “commettere”)
• “Revoke” is used with e.g. “licence”, “permit”, “authorization”

• Semantic prosody → positive/negative evaluation
• Semantic preference → relation to words belonging to a particular, definable semantic field


Colligation

• Relation between a pair of grammatical categories or a pairing of lexis and grammar (Stubbs, 2001: 65)
• hear, notice, see, watch enter into colligation with the sequence of object + either the bare infinitive or the -ing form; e.g.
  o We heard the visitors leave/leaving.
  o We noticed him walk away/walking away.
  o We heard Pavarotti sing/singing.
  o We saw it fall/falling.
• Corresponding collocations and colligations in Italian for “break the law”?

Conclusion on using the Web for specialised translation – Main advantages

• massive amount of texts and multi-source information can be searched
• content is constantly “refreshed” (i.e. updated and extended)
• a lot of sources, text types and domains/topics are represented
• many languages (English is dominant, good presence of Italian)
• replicable search techniques across (your working/target) languages
• it is available at any time, at virtually no cost!

Main disadvantages and problems

o need to differentiate good/reliable sources from questionable information
  § for facts (limited control over user-generated content like Wikipedia)
  § for linguistic usage (badly translated, non-native texts, poor authors)
  § it may be difficult to identify differences between expert/non-expert use
o data/results still need to be interpreted
o Google focuses on content/information, rather than linguistic forms
  • the ranking and sorting of results are performed according to criteria like “popularity” of the websites, or geographic relevance
  • the same search can yield different numbers of hits, depending on unpredictable and uncontrollable factors such as the time of day or the location from which the query is made
    - word counts are not reliable
    - it is difficult to compare frequencies to verify translation hypotheses
  • the data on which searches are performed is unstable/changes

Particularly relevant to linguists/translators:
§ no possible/meaningful sorting of hits/results (esp. L/R-hand collocates)
  - e.g. alphabetical sorting of collocates, from least to most frequent, etc.
  - think of e.g. the “a * range/array of”, “on the verge of” exercises
§ punctuation and upper case (capitals) are ignored, e.g. “aids” vs. “AIDS”
§ impossible to search parts of words, e.g. start with “geo…”, end in “-itis”
§ no lemmatised searches
  - hard to calculate frequencies of specific word combinations
  - e.g. to calculate how frequent the combination “tirare l’acqua al proprio mulino” is, all inflected forms must be searched for
§ no POS-sensitive searches
  - e.g. to search for ‘spot’ as a noun vs. as a verb
§ no possibility to specify the span occurring between two search terms
  - i.e. the * wildcard can include zero to n words

«Googleology is bad science» (Kilgarriff 2007)

MACHINE TRANSLATION (MT)

Machine translation (MT): definition and key terms

• Definition of machine translation:

“computerised systems responsible for the production of translations from one natural language into another, with or without human assistance” (Hutchins & Somers, 1992: 3)

o Human intervention is not necessarily excluded, but if it does occur it is subordinated to the prevailing action of the computer

• Some key terms:

o MT system / engine / service = the software that produces the translation

o input = the source text (i.e. original that we are trying to translate)

o [raw] output = [unedited] target text (i.e. the translation that we obtain)

MT – popular conceptions

Probably the translation technology that attracts the most public attention, esp. among non-translators. Two extreme positions about MT:
1. MT is totally useless and a waste of time and money, as the quality of the output is generally very low (funny anecdotes)
   → underestimates possibilities
2. MT will bring down language barriers; in a few years’ time MT will be as good as human translation, no more need for translators
   → underestimates limitations

Quality varies according to language pairs, integrated tools (MT that learns) and pre-editing.
There will be more pre-editing and post-editing jobs, for which human expertise is required → new spheres of activity for translators/language professionals

“L'inglese di Expo non sembra Google Translate, è Google Translate”

From http://www.linkiesta.it/it/blog-post/2015/02/12/linglese-di-expo-non-sembra-google-translate-e-google-translate/22476/

Machine translation (MT): main architectures of MT systems

Parallel corpora: a collection of original texts in language L1 and their translations into a given L2

[Figure: texts in SL paired with texts in TL]

Machine translation (MT): why is MT so difficult? Or why is translation difficult for computers?

• So why is translation difficult for computers?

o Some blame the computer’s lack of “real-world knowledge”

o Focus on potential translation problems for EN-IT (with a computer!!)

o A simple example: lexical gaps and lexical asymmetries (concrete nouns) § legno / bosco / foresta in IT (+ EN, FR, DE and your other languages…)

IT   legno   bosco   foresta
EN   wood    wood    forest
FR   bois    bois    forêt

(EN “wood” and FR “bois” each cover both “legno” and “bosco”)

• Partly because the translation often depends on the context / situation, which the computer is not able to take into account

“The ball is in your court”

“Il pallone è nella vostra metà campo” (the manager to the players)
“Il ballo è nella vostra corte” (the chamberlain to the king)

• Lexical ambiguities (gramm. category <-> meaning <-> translation), for example, in EN: round

j) My team was eliminated in the first round (Noun: girone)

k) The cowboy started to round up the cattle (Verb: radunare)

l) We can use the round table for dinner (Adjective: rotondo)

m) Maggie is going on a cruise round the world (Preposition: intorno al)
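The four uses of “round” above can be pictured as a lookup problem: the right Italian equivalent is only recoverable once the grammatical category is known. The sketch below is a toy illustration under that assumption; the `LEXICON` table, the POS labels and both function names are hypothetical and do not describe how any real MT engine stores its lexicon.

```python
# Hypothetical mini-lexicon built from examples j) to m); the POS labels are assumptions.
LEXICON = {
    ("round", "NOUN"): "girone",
    ("round", "VERB"): "radunare",   # as in "round up"
    ("round", "ADJ"): "rotondo",
    ("round", "PREP"): "intorno al",
}

def translate_word(word, pos):
    """Return the Italian equivalent for (word, pos), or None if it is not listed."""
    return LEXICON.get((word.lower(), pos))

def candidates(word):
    """List every equivalent the system must choose between if the POS is unknown."""
    return sorted(it for (en, _), it in LEXICON.items() if en == word.lower())

print(translate_word("round", "ADJ"))
print(candidates("round"))
```

Without a reliable way of assigning the category first, the system is left with all four candidates, which is the ambiguity the examples are meant to show.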

• These sentences are ambiguous and very complex (for MT!):

Time flies like an arrow

Gas pump prices rose last time oil stocks fell

Machine translation (MT): some linguistic phenomena that are particularly difficult for MT

1) The chimp eats the banana because it is greedy. (it = the chimp)

2) The chimp eats the banana because it is ripe. (it = the banana)

3) The chimp eats the banana because it is lunchtime. (it = ?)

• The case / example of pronominal anaphora (resolution), difficult for MT

MT post-editing

• The aim of post-editing is to make the revised output usable or understandable, with the least possible effort (quickly)

• The priority is to save time and money
• The extent and the accuracy of post-editing are negotiated/specified on a case by case basis, depending on the needs and requirements

• Different “types” and levels of post-editing (in companies, organisations):
  o no post-editing
    - internal circulation, almost never external publication
  o minimum post-editing
    - internal circulation, rarely external publication
  o full/complete post-editing (but… is it worth it?)
    - very rarely internal circulation, mostly external publication

MT post-editing: introduction

• new skill that is acquired with experience, different from translation
• in this scenario one has to balance and optimise quality-speed-cost, in relation to the intended use/duration of the translation

• ability of the readers/addressees to make use of the doc.
• length of use of the document
• available and viable options
• needs and expectations of the end user(s)
• type, length and “visibility” of the document

Aims and level of PE (vs. translation/proofreading!)

• The aims and level of PE (minimum/full-complete) are decided specifically for each case

• Factors to be considered (prioritised)

• save time and money (quality is less relevant)

• understandability and correctness of general meaning are key

• Factors to be ignored (irrelevant in PE)

• any detail or nuance

• elegance, fluency, naturalness of expression, etc.

On average PE is paid at roughly 50% of the rate for “real/proper” translation

MT pre-editing

Limit input domain / topic

• There are two possibilities to limit the texts / language in / for MT:
  o adopt a controlled language (restricted input)
  o use the sublanguage approach

• Common aims with both options (to the advantage of MT):
  o limited vocabulary
  o more certainty on interpretation
  o reduce syntactic variation

Controlled language

• Prescriptive rules aimed at normalising the style of the input (ST), e.g.

• do not write sentences with more than 20 words (general, language-neutral)

• avoid passive constructions, use only active verb forms

• avoid anaphoras, make all subjects and pronominal references explicit

• in EN: do not omit “that” in relative clauses (language-specific)

• in IT: do not use “solo” as an adverb, but use “soltanto/solamente”

• in IT: use the word “minuto” only as a noun (i.e. to mean 60 seconds); for the adjectival meaning, use only “piccolo”

• etc.

The result of controlled language is restricted input

Sublanguage (1/2)

• Natural/normal behaviour of language within a well-defined domain (~ LSP, specialised language, jargon, etc.)
  o “sub-” in the mathematical sense as in “subset”, not derogatory!
  o referred to very well-defined, enclosed, limited domains and texts

• A sublanguage exists and is used regardless of MT, but one can design an MT system that takes advantage of this sublanguage
  o vocabulary
    - limited (relatively few concepts to be covered/expressed)
    - finite/closed (innovation/deviation tend to be avoided)
    - a few homographs, in general limited use of synonyms and coreferences
  o syntax
    - limited range of structures and constructions (regularity + repetitiveness)
  o usually sublanguages are very similar cross-linguistically between SL/TL(s)

Machine translation (MT): restrictions to the use of MT

• Input must be in (or converted into) electronic format

• Correct formatting and layout of the input are very important

o the word “e r r o r” (spaced letters) would not be recognised / translated

o spelling and typos are crucial: THEY BOOKS A ROOM … (anybody would understand banal mistakes, but not an MT system!)

• Limited availability of language combinations (improving with SMT)

o coverage mostly limited to “usual” big languages with commercial interest

COMPUTER-ASSISTED TRANSLATION (CAT) TOOLS

Computer-assisted translation (CAT) tools

• Computer-assisted translation or computer(machine)-aided translation (CAT) refers to a variety of tools, a family of software products designed to support professional translators in their work.
• CAT is a “recent” development, derived from MT over the last 20 years

• The actual development of commercial CAT tools started in the 1990s – the so-called “translator’s workstation / workbench”, which includes
  o terminology management packages
  o translation memory (TM) software (+ text alignment software, etc.)

• CAT tools are pieces of software designed to enhance the work of translators:
  o maximise speed → higher productivity
  o improve coherence and precision → higher quality

CAT tools, example 1: terminology management packages

• Used to create, store, retrieve and manipulate bi-/multilingual termbases/glossaries

• As searching for terminology can be highly time-consuming (even up to 75% of translators’ time), setting up a database which gathers the terminology you come across is vital.

• Lists in word processors / spreadsheets (e.g. Excel) → limited options for presenting and sorting data

• The terminology covered is usually that of a given (sub-)discipline or the terms needed for a specific translation project.

• Terminology records consist of a number of flexible fields


CAT tools, example 2: translation memory (TM) software

• Translation memory (TM): “multilingual text archive containing […] multilingual texts, allowing storage and retrieval of aligned text segments against various search conditions” (EAGLES* 1995)
  * Evaluation of Natural Language Processing Systems
• This roughly means: a “filing cabinet” (i.e. a database) of old translations whose bits can be retrieved and used when / as needed by the translator
  o essentially a textual database that can be searched
  o pairs of source-text and target-text segments

Note: Translation memory indicates both the software tool and the contents of the database, i.e. the whole set of aligned text segments that it includes

Translation memory (TM) software

• Key idea: recycle similar past translations, never translate the same (or a similar) text twice
• How it works:
  o TM tools divide the source text – which must be in (or turned into, e.g. with OCR) electronic/digital format – into segments, which translators can translate one-by-one in the traditional way.
  o These segments (usually sentences, or even phrases) are then sent to a built-in database. When there is a new source segment equal or similar to one already translated, the memory retrieves the previous translation from the database.
• When is this most useful:
  o for the translation of any text that has a high degree of repeated terms and phrases which must be translated consistently, as is the case with e.g. user manuals, computer products and subsequent versions of the same document (e.g. website updates).
  o mostly relevant to technical/specialised translation (not literature)

Using translation memory (TM) software

• Scenario
  ◦ you have to translate the user manual of a printer (new model) from English into Italian
  ◦ a lot of repetition within the document itself
  ◦ overlap and repetitions across updated (old-new) versions of the documentation
  ◦ you have a relevant TM (similar topic / domain / texts / clients)
  ◦ you translated the previous manual(s)
  ◦ TM provided by client / translation agency / colleague

• Translation of a printer manual English (A) → Italian (B)

Source text (in language A)
ST: There are 4 ways to change print settings for this printer

Exact/Perfect match (everything in the segment is exactly the same)
A: There are 4 ways to change print settings for this printer
B: Ci sono 4 modi per cambiare le impostazioni di stampa di questa stampante

Full match (only figures, dates and similar small details are different)
A: There are 2 ways to change print settings for this printer
B: Ci sono 2 modi per cambiare le impostazioni di stampa di questa stampante

Fuzzy match 85% similar (a few words in translation unit are different)
A: “There are several ways to change print settings for the printer”
B: “Ci sono vari modi per cambiare le impostazioni di stampa alla stampante”

Fuzzy match 60% similar (some words in translation unit are different)
A: “There are several ways to modify the default setting of your printer”
B: “Ci sono vari modi per modificare l’impostazione standard della tua stampante”
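The exact, full and fuzzy match levels illustrated above can be imitated with a character-based similarity score. The sketch below is only an approximation under stated assumptions: it uses Python's standard `difflib.SequenceMatcher` as a stand-in for the undisclosed scoring of commercial TM tools, so its percentages will not coincide with the 85% and 60% figures quoted above. The names `TM`, `mask_numbers`, `similarity` and `lookup` are invented for illustration.

```python
import re
from difflib import SequenceMatcher

# Toy translation memory: source segment -> target segment (taken from the examples above).
TM = {
    "There are 2 ways to change print settings for this printer":
        "Ci sono 2 modi per cambiare le impostazioni di stampa di questa stampante",
    "There are several ways to change print settings for the printer":
        "Ci sono vari modi per cambiare le impostazioni di stampa alla stampante",
    "There are several ways to modify the default setting of your printer":
        "Ci sono vari modi per modificare l’impostazione standard della tua stampante",
}

ST = "There are 4 ways to change print settings for this printer"

def mask_numbers(segment):
    """Replace every run of digits with '#', so that only figures can differ."""
    return re.sub(r"\d+", "#", segment)

def similarity(a, b):
    """Character-string similarity between 0.0 and 1.0 (not meaning-based)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def lookup(source, tm, threshold=0.75):
    """Return (kind, score, target) for each TM unit at or above the threshold."""
    hits = []
    for src, tgt in tm.items():
        score = similarity(src, source)
        if src == source:
            kind = "exact"
        elif mask_numbers(src) == mask_numbers(source):
            kind = "full"    # identical apart from figures
        else:
            kind = "fuzzy"
        if score >= threshold:
            hits.append((kind, round(score, 2), tgt))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

for kind, score, target in lookup(ST, TM):
    print(kind, score, target)
```

Raising or lowering `threshold` changes which candidates reach the translator, which is the role the acceptability threshold plays in a TM tool.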

• With the acceptability threshold of the TM tool set at 75%, no candidate translation unit under that level of similarity is retrieved and shown to the translator!!

CAT tools - Advantages

• can speed up the translation process and increase productivity
• can improve translation quality (by enhancing terminological and phraseological coherence)
• can help translators provide quotations
• allow for collaboration over large projects
• TMs/termbases can be shared by several translators and updated in real time
• Useless for some text types (e.g. literature)
• Essential for many specialized/technical domains
• Translation agencies require translators to use (specific types of) CAT tools

Some issues about TMs

• Technical / practical issues
  o different approaches: some CAT tools have a proprietary, stand-alone text editor, others are «integrated» (e.g. into a word processor), some recent ones are fully online
  o proprietary vs. interchange formats
  o no matches calculated below sentence level (e.g. at phrase level)
    - but the Concordance function is becoming standard
  o criteria used to define similarity / matches
    - matching is calculated not on the basis of sentence or word meaning, but on the basis of character-string similarity
      TP: I bambini giocano in gruppo con il pallone
      FM1: I pampini giovano il grullo con il tallone (94% match)
      FM2: I bimbi si divertono giocando a calcio insieme (42% match)

• Language / translation issues
  o segmentation implies that overall perception of the ST/TT is lost → ST structure tends to be reproduced in TT
  o cross-linguistic differences in e.g. cohesive patterns might be overlooked
  o using TMs limits the translator’s creativity, as s/he is usually expected to use the terminology and phraseology included in the TM
  o TMs can sometimes be reversed, as if translation direction did not matter…
  o need to control the reliability of translations within the TM

CORPORA AND TRANSLATION

What is a corpus? Some (authoritative) definitions

• “a collection of naturally-occurring language text, chosen to characterize a state or variety of a language” (Sinclair, 1991:171)
• “a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis” (Francis, 1992:7)
• “a closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria” (Engwall, 1992:167)
• “a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration” (McEnery & Wilson, 1996:23)
• “a collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety” (McEnery et al., 2006:5)

What is / is not a corpus…?

• A newspaper archive on CD-ROM?
• An online glossary?
• A digital library (e.g. Project Gutenberg)?
• All RAI 1 programmes (e.g. for spoken TV language)?

The answer is always “NO” (see definition)

Corpora vs. web

• Corpora:
  – Usually stable
    • searches can be replicated
  – Control over contents
    • we can select the texts to be included, or have control over selection strategies
  – Ad-hoc linguistically-aware software to investigate them
    • concordancers can sort / organise concordance lines
• Web (as accessed via Google or other search engines):
  – Very unstable
    • results can change at any time for reasons beyond our control
  – No control over contents
    • what/how many texts are indexed by Google’s robots?
  – Limited control over search results
    • cannot sort or organise hits meaningfully; they are presented in an order we cannot control
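What a concordancer adds over a search engine can be sketched in a few lines: keyword-in-context (KWIC) lines that can be sorted by the word to the left, plus searches on parts of words. This is a minimal sketch assuming a small plain-text corpus held in memory; the function names `kwic`, `sort_1l` and `starts_with` and the sample sentences are invented for illustration and do not reproduce any specific tool.

```python
import re

def kwic(text, node, width=3):
    """Return (left context, node, right context) for every occurrence of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            lines.append((tokens[max(0, i - width):i], tok, tokens[i + 1:i + 1 + width]))
    return lines

def sort_1l(lines):
    """Sort concordance lines alphabetically by the first word to the left (1L)."""
    return sorted(lines, key=lambda line: line[0][-1] if line[0] else "")

def starts_with(text, prefix):
    """Find all word forms beginning with `prefix` (a part-of-word search)."""
    return sorted(set(re.findall(rf"\b{re.escape(prefix)}\w*", text.lower())))

# Invented sample text (an assumption, not corpus data)
sample = "She closed her eyes. His blue eyes were tired. The cat opened its eyes slowly."

for left, node, right in sort_1l(kwic(sample, "eyes")):
    print(" ".join(left).rjust(20), node.upper(), " ".join(right))

print(starts_with("Geology and geography, not biology", "geo"))
```

Because the text is under our control and does not change between runs, the same query always returns the same lines, which is the replicability point made above.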

What types of corpora exist? A brief overview

• A corpus is a principled collection of naturally occurring electronic texts designed to be a representative sample of language in actual use
• Some of the main features and criteria used to describe and classify corpora:
  o general vs. specialised
  o written vs. spoken (transcribed) vs. multimodal (audio/video)
  o balanced (sample) vs. opportunistic
  o monolingual vs. bi- / multilingual (parallel or comparable)
  o closed / finite vs. open-ended (monitor)
  o raw (pre-corpus) vs. marked-up, POS-tagged, annotated (augmented)
  o synchronic vs. diachronic
  o static vs. dynamic

An example of planned balance: the British National Corpus

• 100 m words of contemporary spoken and written British English
• Representative of British English “as a whole”
• Designed to be appropriate for a variety of uses: lexicography, education, research, commercial applications (computational tools)
• Balanced with regard to genre, subject matter and style
• Sampling and representativeness very difficult to ensure

Dynamic (Monitor) vs static (Finite)

• A static corpus will give a snapshot of language use at a given time
  o Easier to control balance of content
  o May limit usefulness, esp. as time passes
• A dynamic corpus is ever-changing
  o Called “monitor” corpus because it allows us to monitor language change over time

[Figure: concordance for node word “eyes” (sorted 1L) generated from the BNC]

Parallel vs. comparable multilingual corpora

Parallel (translational) corpora
• contain translationally “equivalent” texts: STs and their corresponding TTs
• need to be aligned, usually at the sentence level, i.e. SL sentence X matched to TL sentence X’
• context is provided to account for “equivalence” and “translation shifts” between ST and TT
• translation direction needs to be clear, i.e. which are SL and TL components of the corpus

Comparable corpora
• texts originally produced (not translated) in the respective languages
• consist of independent texts which are “similar” according to some pre-determined criteria
• the various language components share a set of common features, e.g. text type, genre, publication span, domain, topic
• parameters defining this similarity vary widely

Bilingual parallel corpora on the web

• OPUS corpus, opus.lingfil.uu.se

• A variety of multilingual parallel corpora
  o European Parliament debates (EuroParl corpus)
  o European Central Bank corpus
  o UN documents
  o Subtitles (open subtitle project)
  o Software manuals (PHP, OO)
  o …
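Querying a sentence-aligned parallel corpus amounts to searching the SL side and returning each hit together with its aligned TL sentence. The sketch below illustrates that idea only: the three EN-IT pairs are invented for this example (they are not taken from OPUS or EuroParl), and `parallel_search` is a hypothetical helper, not part of any OPUS interface.

```python
# Hypothetical sentence-aligned EN-IT pairs, invented for illustration only.
ALIGNED = [
    ("The Commission must not break the law.",
     "La Commissione non deve violare la legge."),
    ("Anyone who breaks the law will be fined.",
     "Chiunque infranga la legge sarà multato."),
    ("The law entered into force in May.",
     "La legge è entrata in vigore a maggio."),
]

def parallel_search(pairs, query):
    """Return the (SL, TL) pairs whose SL sentence contains the query string."""
    q = query.lower()
    return [(sl, tl) for sl, tl in pairs if q in sl.lower()]

for sl, tl in parallel_search(ALIGNED, "break"):
    print(sl)
    print("   ", tl)
```

Reading the TL side of the hits shows candidate equivalents in context, here two different Italian verbs for “break the law”, which the translator still has to evaluate.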

http://opus.lingfil.uu.se/ → EuroParl v7 search interface

[Screenshot: choose SL, enter the query, choose TL(s), sort and launch the query; help and other useful functions]

Comparable Eng/Ita corpus on botany

Summing up: corpus use in translation

Main uses:
• Test/generate hypotheses as to interpretation of the source text, and as to appropriate translations
  o helpful when you’re dealing with little known text-types / domains
  o helpful when you’re dealing with a little known language
• Improve quality – capture subtleties of source text, produce translations which read like native speaker texts

More precisely,
• Reference corpora provide insights on phraseological regularities in discourse
• Comparable corpora (automatic and manual) can be used for (contrastive) specialised/genre-controlled text analysis
• Parallel corpora provide equivalents in context/evidence of translation strategies (and are more versatile than TMs)