Università Degli Studi Di Macerata

UNIVERSITÀ DEGLI STUDI DI MACERATA
Department of Humanities – Languages, Mediation, History, Literature, Philosophy
Master's Degree in Modern Languages for International Communication and Cooperation (Class LM-38)
Translation for International Communication – English, module B
TOOLS AND TECHNOLOGIES FOR SPECIALISED TRANSLATION

What is a corpus? Some (authoritative) definitions
• “a collection of naturally-occurring language text, chosen to characterize a state or variety of a language” (Sinclair, 1991:171)
• “a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis” (Francis, 1992:7)
• “a closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria” (Engwall, 1992:167)
• “a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration” (McEnery & Wilson, 1996:23)
• “a collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety” (McEnery et al., 2006:5)

What is / is not a corpus…?
• A newspaper archive on CD-ROM?
• An online glossary?
• A digital library (e.g. Project Gutenberg)?
• All RAI 1 programmes (e.g. for spoken TV language)?
The answer is always “NO” (see definition).

Corpora vs. web
• Corpora:
  – Usually stable
    • searches can be replicated
  – Control over contents
    • we can select the texts to be included, or have control over selection strategies
  – Ad-hoc linguistically-aware software to investigate them
    • concordancers can sort / organise concordance lines
• Web (as accessed via Google or other search engines):
  – Very unstable
    • results can change at any time for reasons beyond our control
  – No control over contents
    • what/how many texts are indexed by Google’s robots?
  – Limited control over search results
    • cannot sort or organise hits meaningfully; they are presented randomly

What types of corpora exist? A brief overview
• A corpus is a principled collection of naturally occurring electronic texts designed to be a representative sample of language in actual use
• Some of the main features and criteria used to describe and classify corpora:
  – general / specialised
  – written / spoken (transcribed) / multimodal (audio/video)
  – balanced (sample) / opportunistic
  – synchronic / diachronic
  – static / dynamic
  – closed (finite) / open-ended (monitor)
  – raw (pre-corpus) / marked-up, POS-tagged, annotated (augmented)
  – monolingual / bi- or multilingual
  – parallel / comparable

An example of planned balance: the British National Corpus
• 100 million words of contemporary spoken and written British English
• Representative of British English “as a whole”
• Designed to be appropriate for a variety of uses: lexicography, education, research, commercial applications (computational tools)
• Balanced with regard to genre, subject matter and style
• Sampling and representativeness very difficult to ensure

BNC
• 4,124 texts: 90% written, 10% spoken
• Largest collection of spoken English ever collected (10 million words), but reflects the typical imbalance in favour of written text (for understandable practical reasons)
• Written portion: 75% informative, 25% imaginative

BNC written material
Sources:
• 60% books
• 25% periodicals
• 5% brochures and other ephemera
  – e.g. bus tickets, produce containers, junk mail
• 5% unpublished letters, essays, minutes
• 5% plays, speeches (written to be spoken)
Register levels:
• 30% literary or technical “high”
• 45% “middle”
• 25% informal “low”

BNC subject coverage
• Planned to reflect the pattern of book publishing in the UK over the last 20 years

Subject             Number of texts   % of total written
Imaginative         625               22
World affairs       453               18
Social science      510               15
Leisure             374               11
Applied science     364               8
Commerce            284               8
Arts                259               8
Natural science     144               4
Belief & thought    146               3
Unclassified        50                3

BNC spoken corpus: context-governed material
• Lectures, tutorials, classrooms
• News reports
• Product demonstrations, consultations, interviews
• Sermons, political speeches, public meetings, parliamentary debates
• Sports commentaries, phone-ins, chat shows
• Samples from 12 different regions

BNC spoken corpus: ordinary conversation
• 2000 hrs from 124 volunteers, 38 different regions
• Four different socio-economic groupings
• Equal numbers of men and women, age range 15 to 60+
• All conversations over a 2-day period were recorded
• No secret recording; participants were allowed to erase recordings
• Systematic details kept of time, location, participants (sex, age, race, occupation, education, social group), topic, etc.
• Transcription issues:
  – includes false starts, hesitations, etc.
  – some paralinguistic features (shouting, whispering)
  – use of dialect words/grammar
  – but no phonetic information

Dynamic (monitor) vs static (finite)
• A static corpus gives a snapshot of language use at a given time
  – Easier to control balance of content
  – May limit usefulness, esp. as time passes
• A dynamic corpus is ever-changing
  – Called a “monitor” corpus because it allows us to monitor language change over time

Key concepts and technical notions in corpus-based translation studies
• Wordlist, frequency list, keyword list
• Types, tokens, type/token ratio (lexical variation)
• Function/grammatical words vs. content/lexical words (lexical density)
• Concordance (concordancing software)
  – KWIC (keyword in context)
  – Nodeword
  – Sorting
• Collocation (collocates)
• Lemmatisation (morphological analysis)
• (POS-)Tagging (grammatical analysis)
• Parsing (syntactic analysis)

“Type” and “token”
• “Token” means an individual occurrence of a word in the text
• “Type” means a distinct word, counted once however many times it occurs
• The man saw the girl with the telescope
  – 8 tokens, 6 types (“the” occurs three times but counts as one type)
• “Type” may refer to a lexeme or to an individual word form
  – run, runs, ran, running: 1 or 4 types?

[Figure: concordance for nodeword “eyes” (sorted 1L), generated from the BNC]

General / reference monolingual corpora (of English)
“Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.”
Source: www.nature.com/nature/journal/v455/n7215/full/455835b.html

Took to the streets
• http://corpus.leeds.ac.uk/internet.html
• English
• Let’s try to understand:
  – Meaning
  – Extended (sentential) co-text, preferential co-selections
  – Context(s) of use
  – Semantic preference
  – Semantic prosody

Using general / reference monolingual corpora (from/on the Web): Leeds Internet corpora
http://corpus.leeds.ac.uk/internet.html
Let’s explore internal variation. Examples of (possibly) useful queries:
• Any other forms of the verb take? (colligational constraints)
• Plural/singular of the noun street? (colligational constraints)
• Other verbs? (collocational flexibility)
• Other nouns? (collocational flexibility)
• Select “CQP syntax only” (automatic POS-tagging!)
  – http://cwb.sourceforge.net/files/CQP_Tutorial/
• Look at the examples on the following slides for guidance and adapt those models to your searches
• Try out a number of different options to familiarise yourself with the search syntax, and understand what kinds of searches it can support

Now the translation into Italian of “took to the streets”
• Verb?
  – andare
  – scendere
  – …?
• Preposition?
  – in
  – nella/nelle
  – per la/per le?
  – …?
• Noun?
  – strada/strade
  – piazza/piazze
  – …?
Which queries do we need? How many are necessary?

REGISTER ONE’S OPPOSITION
“Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.”
• Now search the BNC for this expression.
• What does it mean?
• Which “feelings” are usually “registered”?
  – interest
  – concern
  – support
  – dismay
  – frustrations
  – dissatisfaction
  – disapproval
  – protest
  – commitment
  – …

Monolingual general / reference corpora available online (at least partially, i.e. as demos)
• British National Corpus (BNC, British English)
  – www.natcorp.ox.ac.uk
• COCA (American English)
  – http://corpus.byu.edu/coca/
• The CORIS corpus (Italian)
  – http://corpora.dslo.unibo.it/coris_ita.html
• Leeds Internet corpora
  – English, Chinese, Arabic, French, German, Italian, Japanese, Polish, Portuguese, Russian, Spanish: http://corpus.leeds.ac.uk/internet.html
• Mannheim corpora (German)
  – http://corpora.ids-mannheim.de/ccdb
• Corpus del Español (Spanish)
  – www.corpusdelespanol.org
• explore the Web to see what other …
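
Worked example: types, tokens and a KWIC concordance
The type/token count and the sorted concordance described in the key-concepts slides can be illustrated with a few lines of Python. This is only a minimal sketch, not the method used by any of the concordancers or corpus interfaces mentioned in these slides. The function names (tokenize, type_token_stats, kwic, sort_1l) are invented for illustration, and the tokeniser makes two simplifying assumptions: words are lowercased (so “The” and “the” count as one type, which is what yields 6 types for the example sentence) and types are word forms, not lexemes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumption: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_stats(tokens):
    """Return (number of tokens, number of types, type/token ratio)."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))  # each distinct word form counts once
    ratio = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ratio


def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every occurrence of the nodeword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def sort_1l(lines):
    """Sort concordance lines by the first word to the left of the node (1L)."""
    return sorted(lines, key=lambda line: line[0][-1] if line[0] else "")


sentence = "The man saw the girl with the telescope"
tokens = tokenize(sentence)

n_tokens, n_types, ttr = type_token_stats(tokens)
print(n_tokens, n_types, ttr)  # 8 6 0.75

# A frequency list is just the types ranked by number of tokens.
print(Counter(tokens).most_common(1))  # [('the', 3)]

# KWIC lines for the nodeword "the", sorted on the word one position to the left.
for left, node, right in sort_1l(kwic(tokens, "the", width=2)):
    print(f"{' '.join(left):>12} [{node}] {' '.join(right)}")
```

Counting lexemes instead of word forms (run, runs, ran, running as 1 type) would require a lemmatiser in place of the simple tokeniser, which is why the two counts on the slide can differ.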
