
UNIVERSITÀ DEGLI STUDI DI MACERATA
Department of Humanities – Languages, Mediation, History, Literature, Philosophy
Master's Degree in Modern Languages for International Communication and Cooperation (Class LM-38)

Translation for International Communication – English – Module B

TOOLS AND TECHNOLOGIES FOR SPECIALISED TRANSLATION

What is a corpus? Some (authoritative) definitions

• “a collection of naturally-occurring text, chosen to characterize a state or variety of a language” (Sinclair, 1991:171)
• “a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis” (Francis, 1992:7)
• “a closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria” (Engwall, 1992:167)
• “a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration” (McEnery & Wilson, 1996:23)
• “a collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety” (McEnery et al., 2006:5)

What is / is not a corpus…?

• A newspaper archive on CD-ROM?
• An online glossary?
• A digital library (e.g. Project Gutenberg)?
• All RAI 1 programmes (e.g. for spoken TV language)?

The answer is always “NO” (see definition)

Corpora vs. web

• Corpora:
  – Usually stable
    • searches can be replicated
  – Control over contents
    • we can select the texts to be included, or have control over selection strategies
  – Ad-hoc linguistically-aware software to investigate them
    • concordancers can sort / organise lines
• Web (as accessed via Google or other search engines):
  – Very unstable
    • results can change at any time for reasons beyond our control
  – No control over contents
    • what/how many texts are indexed by Google’s robots?
  – Limited control over search results
    • cannot sort or organise hits meaningfully; they are ranked by the search engine's own criteria, not by linguistic ones

What types of corpora exist? A brief overview

• A corpus is a principled collection of naturally occurring electronic texts designed to be a representative sample of language in actual use
• Some of the main features and criteria used to describe and classify corpora:
  – general vs. specialised
  – written vs. spoken (transcribed) vs. multimodal (audio/video)
  – balanced (sample) vs. opportunistic
  – synchronic vs. diachronic
  – static vs. dynamic
  – closed / finite vs. open-ended (monitor)
  – raw (pre-corpus) vs. augmented (marked-up, POS-tagged, annotated)
  – monolingual vs. bi- / multilingual
  – parallel vs. comparable

An example of planned balance: the British National Corpus

• 100 million words of contemporary spoken and written British English
• Representative of British English “as a whole”
• Designed to be appropriate for a variety of uses: lexicography, education, research, commercial applications (computational tools)
• Balanced with regard to genre, subject matter and style
• Sampling and representativeness very difficult to ensure

BNC

• 4,124 texts: 90% written, 10% spoken
• Largest collection of spoken English ever collected (10 million words), but reflects typical imbalance in favour of written text (for understandable practical reasons)
• Written portion: 75% informative, 25% imaginative

BNC written material

Sources:
• 60% books
• 25% periodicals
• 5% brochures and other ephemera
  • e.g. bus tickets, produce containers, junk mail
• 5% unpublished letters, essays, minutes
• 5% plays (written to be spoken)

Register levels:
• 30% literary or technical “high”
• 45% “middle”
• 25% informal “low”

BNC Subject coverage

• Planned to reflect pattern of book publishing in UK over last 20 years

Subject             Number of texts   % of total written
Imaginative               625                22
World affairs             453                18
Social science            510                15
Leisure                   374                11
Applied science           364                 8
Commerce                  284                 8
Arts                      259                 8
Natural science           144                 4
Belief & thought          146                 3
Unclassified               50                 3

BNC Spoken corpus

• Context-governed material
  • Lectures, tutorials, classrooms
  • News reports
  • Product demonstrations, consultations, interviews
  • Sermons, political speeches, public meetings, parliamentary debates
  • Sports commentaries, phone-ins, chat shows
  • Samples from 12 different regions

BNC Spoken corpus

• Ordinary conversation
  • 2,000 hours from 124 volunteers, 38 different regions
  • Four different socio-economic groupings
  • Equal numbers of male and female participants, age range 15 to 60+
  • All conversations over a 2-day period recorded
  • No secret recording; participants were allowed to erase recordings
  • Systematic details kept of time, location, details of participants (sex, age, race, occupation, education, social group), topic, etc.
• Transcription issues:
  • include false starts, hesitations, etc.
  • some paralinguistic features (shouting, whispering)
  • use of dialect words/grammar
  • but no phonetic information

Dynamic (Monitor) vs static (Finite)

• A static corpus will give a snapshot of language use at a given time
  • Easier to control balance of content
  • May limit usefulness, esp. as time passes
• A dynamic corpus is ever-changing
  • Called “monitor” corpus because it allows us to monitor language change over time

Key concepts and technical notions in corpus-based studies

• Wordlist, frequency list, keyword list
• Types, tokens, type/token ratio (lexical variation)
• Function/grammatical words vs. content/lexical words (lexical density)
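The measures named in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tokeniser is a naive regular expression, the function-word list is a tiny hand-made sample (a real study would use a full stop list or POS tags), and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative list of function words (an assumption, far from complete).
FUNCTION_WORDS = {"the", "a", "an", "with", "of", "in", "to", "and"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (naive approach)."""
    return re.findall(r"\w+", text.lower())

def frequency_list(tokens):
    """Wordlist sorted by decreasing frequency: [(word, count), ...]."""
    return Counter(tokens).most_common()

def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by running words (tokens)."""
    return len(set(tokens)) / len(tokens)

def lexical_density(tokens, function_words=FUNCTION_WORDS):
    """Share of tokens that are content words, i.e. not in the function-word list."""
    content = [t for t in tokens if t not in function_words]
    return len(content) / len(tokens)
```

Applied to the sentence “The man saw the girl with the telescope”, this gives 8 tokens and 6 types, so a type/token ratio of 0.75; with the sample function-word list, 4 of the 8 tokens count as content words.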

“Type” and “token”

• “Token” means an individual occurrence of a word
• “Type” means a distinct word, counted once however many times it occurs
• The man saw the girl with the telescope
  • 8 tokens, 6 types (“the” occurs three times)
• “Type” may refer to lexeme, or individual word form
  • run, runs, ran, running: 1 or 4 types?

Key concepts and technical notions

• Concordance (concordancing software)
• KWIC (keyword in context)
• Nodeword
• Sorting

[Figure: Concordance for nodeword “eyes” (sorted 1L) generated from the BNC]

Key concepts and technical notions

• Collocation (collocates)
• Lemmatisation (morphological analysis)
• (POS-)Tagging (grammatical analysis)
• Parsing (syntactic analysis)

www.nature.com/nature/journal/v455/n7215/full/455835b.html

General / reference monolingual corpora (of English)

Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.

Took to the streets

• http://corpus.leeds.ac.uk/internet.html

• English

• Let’s try to understand:
  • Meaning
  • Extended (sentential) co-text, preferential co-selections
  • Context(s) of use
  • Semantic preference
  • Semantic prosody

Using general / reference monolingual corpora (from/on the Web): Leeds Internet corpora


http://corpus.leeds.ac.uk/internet.html

Let’s explore internal variation - Examples of (possible) useful queries

• Any other forms of the verb take? (colligational constraints)
• Plural/singular of the noun street? (colligational constraints)
• Other ? (collocational flexibility)
• Other nouns? (collocational flexibility)
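In the CQP query language accepted by these interfaces, questions like the ones above might be expressed roughly as follows. This is a sketch, not a reference: the attribute names `lemma` and `pos` and the tag patterns `V.*` and `N.*` are assumptions that depend on how a given corpus was annotated, and the helper functions are hypothetical conveniences that only build query strings to paste into the search box.

```python
def word(form):
    """A token matched by its exact word form, e.g. "to"."""
    return f'"{form}"'

def token(**attrs):
    """A token matched by annotation attributes, e.g. [lemma="take"].

    Called without arguments it returns [], which matches any single token.
    """
    if not attrs:
        return "[]"
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"

def query(*tokens):
    """Join token expressions into one CQP query (one expression per position)."""
    return " ".join(tokens)

# Attribute names and tag patterns below are assumptions about the annotation.
QUERIES = {
    "all forms of the verb": query(token(lemma="take"), word("to"), word("the"), token(lemma="street")),
    "singular noun only": query(token(lemma="take"), word("to"), word("the"), word("street")),
    "other verbs": query(token(pos="V.*"), word("to"), word("the"), token(lemma="street")),
    "other nouns": query(token(lemma="take"), word("to"), word("the"), token(pos="N.*")),
}
```

For instance, `QUERIES["all forms of the verb"]` is the string `[lemma="take"] "to" "the" [lemma="street"]`, which would retrieve take/takes/took/taken/taking followed by “to the street(s)” in a corpus that has a `lemma` attribute.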

• Select “CQP syntax only” (automatic POS-tagging!)
  • http://cwb.sourceforge.net/files/CQP_Tutorial/
• Look at the examples on the following slides for guidance and adapt those models to your searches
• Try out a number of different options to familiarise yourself with the search syntax, and understand what kinds of searches it can support

Now the translation into Italian of “took to the streets”

• Verb?
  • andare
  • scendere
  • …?
• Preposition?
  • in
  • nella/nelle
  • per la/per le?
  • …?
• Noun?
  • strada/strade
  • piazza/piazze
  • …?

Which queries do we need? How many are necessary?

Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.

REGISTER ONE’S OPPOSITION

• Now search the BNC for this expression.
• What does it mean?
• Which “feelings” are usually “registered”?
  • interest
  • concern
  • support
  • dismay
  • frustrations
  • dissatisfaction
  • disapproval
  • protest
  • commitment
  • …

Monolingual general / reference corpora available online (at least partially, i.e. as demos)

• British National Corpus (BNC, British English)
  • www.natcorp.ox.ac.uk
• COCA (American English)
  • http://corpus.byu.edu/coca/
• The CORIS corpus (Italian)
  • http://corpora.dslo.unibo.it/coris_ita.html
• Leeds Internet corpora
  • English, Chinese, Arabic, French, German, Italian, Japanese, Polish, Portuguese, Russian, Spanish: http://corpus.leeds.ac.uk/internet.html
• Mannheim corpora (German)
  • http://corpora.ids-mannheim.de/ccdb
• Corpus del Español (Spanish)
  • www.corpusdelespanol.org
• CREA (Spanish)
  • http://corpus.rae.es/creanet.html

Explore the Web to see what other corpora are available!

A translation-relevant corpus typology

Corpora
• general / reference
  – monolingual (usually)
• specialised
  – monolingual
  – multilingual
    • comparable: Comparable texts in terms of genre/text type or topic. Usually rather small, created ad hoc for specific tasks («DIY»), «disposable»
    • parallel: Original texts aligned to corresponding translations. Typically available and pre-compiled (as for general, reference monolingual corpora)

She is the author of numerous articles regarding learning disabilities and she speaks often before parent and teacher groups concerning learning and behavior problems.
È autrice di numerosi articoli riguardanti le disabilità di apprendimento e ha tenuto spesso conferenze davanti a gruppi di genitori e insegnanti sui problemi del comportamento e dell’apprendimento.

All expectations need to be direct and explicit. Don't require this child to 'read between the lines' to glean your intentions.
Esplicitare chiaramente tutte le aspettative, in modo da non richiedere al bambino di “leggere tra le righe” per cogliere le intenzioni.

Obviously, the child with nonverbal learning disorders would not be expected to be the 'scribe' in a cooperative grouping - her contribution should be in the verbal arena.
Ovviamente non ci si deve aspettare che sia lo “scriba” del gruppo cooperativo, il suo contributo deve essere inserito nell’arena verbale.

Sentence-level alignment (new line delimited)

Bilingual parallel corpora on the web

• OPUS corpus, opus.lingfil.uu.se
  • A variety of multilingual parallel corpora
    • European Parliament debates (Europarl)
    • European Central Bank corpus
    • UN documents
    • Subtitles (open subtitle project)
    • Software manuals (PHP, OpenOffice)
    • …
  • With linguistic annotation
  • Online interface based on CWB/CQP syntax
  • Corpora can also be downloaded for local use
• COMPARA (EN-PT)
• OSLO Multilingual Corpus
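Parallel corpora downloaded for local use are often distributed as two plain-text files aligned line by line, i.e. the new-line-delimited, sentence-level alignment illustrated earlier. The following is a minimal sketch of how such a pair of files could be read and searched in Python. The function names and file names are hypothetical, and the assumption that line N of one file translates line N of the other must be checked against the documentation of the corpus actually downloaded.

```python
from pathlib import Path

def align_lines(source_text, target_text):
    """Pair line N of the source text with line N of the target text."""
    source = source_text.splitlines()
    target = target_text.splitlines()
    if len(source) != len(target):
        raise ValueError(
            f"Not aligned: {len(source)} source lines vs {len(target)} target lines"
        )
    return list(zip(source, target))

def read_aligned(source_path, target_path):
    """Read two UTF-8 text files and return their aligned sentence pairs."""
    return align_lines(
        Path(source_path).read_text(encoding="utf-8"),
        Path(target_path).read_text(encoding="utf-8"),
    )

def search_pairs(pairs, term):
    """Return the pairs whose source sentence contains `term` (case-insensitive)."""
    term = term.lower()
    return [(src, tgt) for src, tgt in pairs if term in src.lower()]
```

A call such as `search_pairs(read_aligned("corpus.en", "corpus.it"), "learning")` (hypothetical file names) would then list every English sentence containing “learning” together with its Italian counterpart, which is the basic operation behind a bilingual concordance.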

http://opus.lingfil.uu.se/ → EuroParl v7 search interface

[Screenshot: EuroParl v7 search interface. Labelled controls: help; choose SL; query; other useful functions; choose TL(s); sort; launch the query]

Example query: [word="a|an|the"] [tnt="JJ.*"] "issue"

http://opus.lingfil.uu.se/ → OPUS multilingual search interface > Europarl

[Screenshot: OPUS multilingual search interface. Labelled controls: query; choose TL(s); launch the query; format of search results]

A translation-relevant corpus typology

Corpora
• general / reference
  – monolingual (usually)
• specialised
  – monolingual
  – multilingual
    • comparable: Comparable texts in terms of genre/text type or topic. Normally relatively small, created ad hoc for specific translation assignments («DIY»), «disposable», for texts belonging to specialised domains
    • parallel

Using comparable corpora for translation

• Learn something about a specific domain/topic
• Understand the source text
• Choose the “right” TL term/word/collocation
• Identify and reproduce the features of the specific genre/register in the TL
• Look for equivalents, definitions and contexts of use in both the source and target language

Source text (we are the translator/interpreter)

P. R. O. Bally (1959) “Monadenium arborescens”. Candollea 17:25-26. Coming from Tanzania, this is a robust growing species and is a semi woody succulent, forming a lightly branched shrub/tree up to 4.25 metres high. The stems can grow to 10 cm. thick, are five angled and may be slightly spirally twisted. They are erect and may be solitary or in twos. If branched, the branches are quite slender, grow erect, and are some 30 – 60 cm. apart. They are smooth, and covered in a green bloom. Leaf scars, which are 10 mm. in diameter, are borne 4 –7 cm. apart and below each leaf scar is a small tubercle which on older plants has a small reddish/brown spine, but a more robust one up to 2 cm. long on is produced on younger plants. The leaves are crowded terminally around the ends of the stems and are produced from the angles of the stems. They are obovate, pointed and heart shaped, 7 – 19 cm long and 5.6 – 11 cm. wide. Flowering takes place from an eye situated directly above the leaf scar and several cymes may be produced near the apex of the branches with peduncles 6 –7 cm. long and 5 – 6 mm. thick. The colour of the inflorescence is red. This species is not in general cultivation due to its rapid growth and size.

The process for manual corpus construction

• We want to build a bilingual specialised comparable corpus for the translation task (English → Italian)
• Two stages:
  a) Source language corpus component (English)
  b) Target language corpus component (Italian)

Searching for similar SL (English) texts for the corpus

• We look for:
  • web pages in English, as similar to our ST as possible
  • e.g. searching for ‘monadenium’ on google.co.uk

• We find, e.g.:
  • en.wikipedia.org/wiki/Monadenium_arborescens
  • www.sdcss.com/monadenium.html
  • davesgarden.com/guides/pf/go/65135/
  • www.gardening.eu/plants/Succulent-Plants/Monadenium-guentheri/3708/

• You can add to the search string: monadenium filetype:pdf

In general, pages in pdf format tend to be more informative and authoritative

[Screenshot] Uninformative, different genre
[Screenshot] Very informative, authoritative (source: San Diego Cactus Society), similar genre (journal article)
[Screenshot] Uninformative, little connected text, different function (promotional) and genre
[Screenshot] Low quality, unreliable (language)

Searching for TL texts

• We look for “monadenium” in Italian (reliable) webpages, e.g.:
  • http://www.giardinaggio.it/grasse/singolegrasse/Monadenium/Monadenium.asp

Practical considerations: file types

• Corpus files must be downloaded/saved in this format:
  • Simple/pure text (.txt)
  • save as “text only”
• Common formats of online texts (these must be converted into, i.e. saved as, .txt format):
  • HTML
    • File → save as → xxx.txt
    • (just modify the file extension)
  • Microsoft Word
    • Save as → xxx.txt
    • File type → plain text “.txt” → ok (ignore any error message)
  • pdf
    • image/“dead pdf” (not good) vs. searchable pdf (OK)
    • edit → select all → copy → paste in a new text file → save
• Plan separate folders for each corpus (sub-)component
  • e.g. SL/TL, but also more/less authoritative, different genres etc.
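When many web pages have to be converted, the HTML-to-text step can also be scripted. The sketch below uses only Python's standard library; the class and function names, and the folder layout in the usage note, are hypothetical. It is deliberately simple: every run of text between two tags goes on its own line, so a sentence containing inline markup (bold, links) will be split across lines.

```python
from html.parser import HTMLParser
from pathlib import Path

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside <script> or <style>.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the text content of an HTML string, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

def save_as_txt(html, txt_path):
    """Write the extracted text to a UTF-8 .txt file, creating folders as needed."""
    path = Path(txt_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html_to_text(html), encoding="utf-8")
```

For example, `save_as_txt(page_source, "corpus/TL/monadenium_01.txt")` (hypothetical path) would store one page in the target-language folder, in line with the advice to keep separate folders for each corpus component.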

Practical considerations: corpus query tools

Now that we have built the corpus, what concordancing tools are available?

• AntConc – user-friendly, many functionalities, and you can download it (for free and legally!!) from this URL:
  • www.laurenceanthony.net/software.html
• TextStat – free, includes an interesting web-spider which downloads as many pages as you want from a particular website (good if you have identified a reliable website)
  • http://neon.niederlandistik.fu-berlin.de/en/textstat/
• WordSmith Tools – commercial tool
  • (older) version 4.0 now freely available
  • http://lexically.net/wordsmith/version4/index.htm

And what can we do with AntConc?
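As a rough illustration of the core operation a concordancer performs (this is not AntConc's own code, and the function names are hypothetical), the sketch below builds KWIC lines for a nodeword and sorts them by the first word to the left (1L). The tokeniser is a naive regular expression and everything is lowercased, both simplifying assumptions.

```python
import re

def kwic(text, node, width=4):
    """Return (left context, node, right context) for every hit of `node`.

    Contexts are lists of up to `width` tokens on each side of the nodeword.
    """
    tokens = re.findall(r"\w+", text.lower())
    node = node.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_1l(hits):
    """Sort concordance lines alphabetically by the first word to the left (1L)."""
    return sorted(hits, key=lambda hit: hit[0][-1] if hit[0] else "")

def format_hit(hit, width=30):
    """Render one line with the nodeword in a fixed column, KWIC style."""
    left, node, right = hit
    return f"{' '.join(left):>{width}}  {node}  {' '.join(right)}"
```

Printing `format_hit(h)` for each `h` in `sort_1l(kwic(corpus_text, "eyes"))` lines up the nodeword in a central column, so that recurrent patterns to its left become visible at a glance.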
