Constructing and Analysing an Error-Tagged Learner Corpus of Persian


UNIVERSITY OF BELGRADE
FACULTY OF PHILOLOGY

Saeed G. Safari

CONSTRUCTING AND ANALYSING AN ERROR-TAGGED LEARNER CORPUS OF PERSIAN

Doctoral Dissertation

Belgrade, 2017

Title in Serbian: IZRADA I ANALIZA ANOTIRANOG KORPUSA PERSIJSKOG JEZIKA KAO STRANOG (Said G. Safari, doctoral dissertation, University of Belgrade, Faculty of Philology, Belgrade, 2017)
Title in Russian: Формирование и анализ аннотированного корпуса персидского языка (Саид Сафари, doctoral dissertation, University of Belgrade, Faculty of Philology, Belgrade, 2017)

Mentor and committee members

Mentor: Dr Maja Miličević Petrović, Associate Professor, Faculty of Philology, Belgrade

Committee members:
1. Dr Ljiljana Marković, Full Professor, Faculty of Philology, Belgrade
2. Dr Jelena Filipović, Full Professor, Faculty of Philology, Belgrade
3. Dr Reza Morad Sahraei, Faculty of Persian Literature and Foreign Languages, Allameh Tabataba’i University, Tehran

Date of defence: Belgrade, _______________

به نام خداوند جان آفرین حکیم سخن در زبانآفرین
(In the name of the Lord who created the soul, the Wise One who created speech on the tongue.)

I would like to express my sincere gratitude to my mentor for her continuous support of my thesis research and for her advice, comments, guidance and immense knowledge. I would like to thank my esteemed professors, […], Dr Julijana Vu[…] and […], for their constant enthusiasm and encouragement during my doctoral studies. I would also like to thank Reza Morad Sahraei, from Allameh Tabataba’i University, Tehran, for reviewing my research and for his valuable comments and feedback. My deepest and endless gratitude goes to my amazing family, to whom this thesis is dedicated, especially to my loving and supportive wife, Solmaz Taghdimi.

CONSTRUCTING AND ANALYSING AN ERROR-TAGGED LEARNER CORPUS OF PERSIAN

Summary

Linguistic corpora constitute reliable sources and empirical means for analyzing linguistic data.
They are also widely used in Second/Foreign Language Acquisition and Foreign Language Teaching research, where learner corpora are the most commonly used type. The present thesis, based on a methodological approach to building a learner corpus, is broadly in line with error analysis and the field of Learner Corpus Research. The thesis describes the process of constructing and developing an error-tagged Persian learner corpus, called the Salam Farsi Learner Corpus (SFLC), and presents an analysis of linguistic errors based on a collection of written texts produced by Serbian learners of Persian.

Three major stages were followed: constructing the corpus, proposing a system of error annotation, and developing tools and software. The practical phases are also described, including the systematic collection of data and metadata, the definition of the corpus design criteria, the creation of the error tagsets, and the development of the corpus interface, software and specific tools. The SFLC software is equipped with four main tools that allow it to function as an error-tagged learner corpus and to provide statistical reports: a tool for submitting data and metadata to the corpus database, a computer-aided error editor that facilitates error tagging, filter and search tools, and data statistics tools that show various statistical data related to the corpus.

Based on the SFLC statistical reports, the thesis discusses the frequency and distribution of errors in the whole corpus and compares error distributions across proficiency levels. The corpus statistics show that the most frequent errors made by Serbian learners of Persian initially occur in the domain of orthography, while later on the most frequent errors lie in the domains of lexis and syntax. Word Order is marked as the most frequent error type in the corpus as a whole.
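The kind of statistical report described above, error frequencies for the whole corpus and per proficiency level, can be illustrated with a minimal sketch. It is not the SFLC software: the record layout (level, error domain, error tag), the tag names other than Word Order, and the sample records are all hypothetical and invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical error annotations: (proficiency level, error domain, error tag).
# These records are invented for illustration; they are not SFLC data.
annotations = [
    ("A2", "orthography", "Spelling"),
    ("A2", "orthography", "Spelling"),
    ("A2", "syntax", "Word Order"),
    ("B1", "lexis", "Word Choice"),
    ("B1", "syntax", "Word Order"),
    ("C1", "syntax", "Word Order"),
]


def errors_by_level(records):
    """Count errors per domain within each proficiency level."""
    counts = defaultdict(Counter)
    for level, domain, _tag in records:
        counts[level][domain] += 1
    return counts


def relative_distribution(counter):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}


by_level = errors_by_level(annotations)
overall_tags = Counter(tag for _level, _domain, tag in annotations)

# Per-level distribution of errors across domains.
for level in sorted(by_level):
    print(level, relative_distribution(by_level[level]))

# Most frequent error tag in the (toy) corpus as a whole.
print("Most frequent tag overall:", overall_tags.most_common(1)[0][0])
```

Proportions are used rather than raw counts because the amount of text per level may differ, so raw counts alone would not be comparable across levels.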
As for the distribution of errors across specific proficiency levels, the results show that the total number of errors drops from level A2 to level C1, while errors in syntax increase, owing to the use of more complex syntactic structures at higher proficiency levels. The SFLC provides not only authentic data gathered from learners at different proficiency levels, but also statistics on error tags and metadata. Research into Persian as a second/foreign language can thus clearly benefit from the SFLC as a resource.

Keywords: Learner Corpus, Error Analysis, Second Language Acquisition, Teaching Persian as a Foreign Language.
Research area: Linguistics
Research subarea: Corpus linguistics, Second Language Acquisition
UDC number:

IZRADA I ANALIZA ANOTIRANOG KORPUSA PERSIJSKOG JEZIKA KAO STRANOG

Rezime (Summary in Serbian)

Linguistic corpora are an important source and means of analysing empirical language data. They are widely used, among other areas, in research on second/foreign language acquisition and in language teaching, where the importance of learner corpora deserves particular emphasis. This dissertation describes the construction of one such corpus, a learner corpus of Persian called the Salam Farsi Learner Corpus (SFLC). The corpus was built from texts written by learners whose native language is Serbian while they were attending Persian language courses. In addition to the texts being converted to digital format, the errors the learners made in writing are marked in the corpus. The three main stages in building the corpus were its design and digitisation, the proposal of an error annotation system, and the development of tools for building and searching the corpus. All three stages are described in detail in the dissertation. Specifically, attention is given to practical steps such as the collection of data and metadata, and to conceptual tasks such as defining the corpus design criteria, compiling the error tags and conceiving the corpus interface, software and tools.

The SFLC software relies on four main tools, which enable the entry of data and metadata into the corpus database, error tagging, retrieval and search of documents (by surface word forms or by errors) and the generation of statistical reports on errors. On the basis of the statistical reports that the SFLC provides, the dissertation also carries out an error analysis: it examines the frequency and distribution of errors in the corpus as a whole and at the individual levels of proficiency in Persian. The results of this corpus-based analysis show that learners whose native language is Serbian most often make errors in the domain of orthography at lower proficiency levels, while later on errors are more often found in the domains of lexis and syntax. Errors related to word order are marked as the most frequent error type overall in the entire corpus. The total number of errors decreases as learners move from level A2 to level C1. When it comes to syntax, however, the number of errors increases, owing to the use of more complex syntactic structures at higher levels.

The SFLC not only provides authentic data collected from learners at different proficiency levels, but also offers statistical data on the tagged errors and other corpus parameters. It is therefore concluded that the corpus can be of great benefit for research on and teaching of Persian as a second/foreign language.

Keywords: learner corpus, error analysis, second language acquisition, teaching Persian as a foreign language.
Research area: Linguistics
Research subarea: Corpus linguistics, applied linguistics
UDC number:

TABLE OF CONTENTS

1. Introduction
  1.1 Learner Corpora, Second Language Acquisition and Error Analysis
  1.2 Overarching Goals and Motivation
  1.3 Specific Objectives and Thesis Research Methodology
  1.4 Thesis Research Methodology
  1.5 Outline of the Thesis
2. Review of the Literature
  2.1 Corpora and Corpus Linguistics
    2.1.1 Types of Corpora
    2.1.2 Types of Corpora in Language Learning and Teaching
  2.2 Learner Corpora
    2.2.1 Learner Corpus Research
  2.3 Types of Learner Corpora
    2.3.1 Types of LC Based on Comparative Descriptions
    2.3.2 Types of LC Based on Corpus Features and Design Criteria
  2.4 Learner Corpora and SLA Research
  2.5 Stages in Learner Corpora Research
  2.6 Learner Corpora Applications
    2.6.1 Delayed Usage vs. Immediate Usage of LC
    2.6.2 Specific Applications of LC