UNIVERSITY OF BELGRADE
FACULTY OF PHILOLOGY

Saeed G. Safari

CONSTRUCTING AND ANALYSING AN ERROR-TAGGED LEARNER CORPUS OF PERSIAN

Doctoral Dissertation

Belgrade, 2017

UNIVERZITET U BEOGRADU
FILOLOŠKI FAKULTET

Said G. Safari

IZRADA I ANALIZA ANOTIRANOG KORPUSA PERSIJSKOG JEZIKA KAO STRANOG

Doktorska disertacija

Beograd, 2017.

УНИВЕРСИТЕТ БЕЛГРАДА
ФАКУЛЬТЕТ ФИЛОЛОГИИ

Саид Сафари

Формирование и анализ аннотированного корпуса персидского языка

Докторская диссертация

Белград, 2017 г.

Information on the Mentor and Committee Members

Mentor: Dr Maja Miličević Petrović, Associate Professor, Faculty of Philology, Belgrade

Committee members:
1. Dr Ljiljana Marković, Full Professor, Faculty of Philology, Belgrade
2. Dr Jelena Filipović, Full Professor, Faculty of Philology, Belgrade
3. Dr Reza Morad Sahraei, Faculty of Persian Literature and Foreign Languages, Allameh Tabataba’i University, Tehran

Date of defence: Belgrade, _______________

به نام خداوند جان آفرین
حکیم سخن در زبان آفرین
(In the name of the Lord who created the soul, the Wise One who created speech on the tongue)

I would like to express my sincere gratitude to my mentor for her continuous support of my thesis research and for her advice, comments, guidance and immense knowledge.

I would like to thank my esteemed professors, including Dr Julijana Vučo, for their constant enthusiasm and encouragement during my doctoral studies.

I would also like to thank Reza Morad Sahraei of Allameh Tabataba’i University, Tehran, for reviewing my research and for his valuable comments and feedback.

My deepest and endless gratitude goes to my amazing family, to whom this thesis is dedicated, and especially to my loving and supportive wife, Solmaz Taghdimi.

CONSTRUCTING AND ANALYSING AN ERROR-TAGGED LEARNER CORPUS OF PERSIAN

Summary

Linguistic corpora are reliable sources and empirical means for analysing linguistic data. They are also widely used in research on Second/Foreign Language Acquisition and Foreign Language Teaching, where the most commonly used type is the learner corpus.
The present thesis follows a methodological approach to building a learner corpus and is situated in the domain of error analysis and the field of Learner Corpus Research. It describes the construction and development of an error-tagged Persian learner corpus, called the Salam Farsi Learner Corpus (SFLC), and presents an analysis of linguistic errors in a collection of written texts produced by Serbian learners of Persian. The work followed three major stages: constructing the corpus, proposing a system of error annotation, and developing tools and software. The practical phases are described as well, including the systematic collection of data and metadata, the definition of the corpus design criteria, the creation of the error tagsets, and the development of the corpus interface, software and specific tools.

The SFLC software is equipped with four main tools that allow it to function as an error-tagged learner corpus and to provide statistical reports: a tool for submitting data and metadata to the corpus database, a computer-aided error editor that facilitates error tagging, filter and search tools, and data statistics tools that report various statistics on the corpus. Based on the SFLC statistical reports, the thesis discusses the frequency and distribution of errors in the corpus as a whole and compares error distributions across proficiency levels. The corpus statistics show that the most frequent errors made by Serbian learners of Persian are initially found in the domain of orthography, while later on the most frequent errors lie in the domains of lexis and syntax. Word Order is marked as the most frequent error type in the corpus as a whole.
As for the distribution of errors across specific proficiency levels, the results show that the total number of errors drops from level A2 to level C1, while errors in syntax increase, owing to the use of more complex syntactic structures at higher proficiency levels. The SFLC provides not only authentic data gathered from learners at different proficiency levels but also statistics on error tags and metadata. Research into Persian as a second/foreign language can thus clearly benefit from the SFLC as a resource.

Keywords: Learner Corpus, Error Analysis, Second Language Acquisition, Teaching Persian as a Foreign Language.
Research area: Linguistics
Research subarea: Corpus linguistics, Second Language Acquisition
UDC number:

IZRADA I ANALIZA ANOTIRANOG KORPUSA PERSIJSKOG JEZIKA KAO STRANOG

Rezime (Summary)

Linguistic corpora are an important source and means of analysing empirical language data. They are widely used in, among other areas, research on second/foreign language acquisition and in language teaching, where the importance of learner corpora deserves particular emphasis. This dissertation describes the construction of one such corpus, a learner corpus of Persian called the Salam Farsi Learner Corpus (SFLC). The corpus was built from texts written during Persian language courses by learners whose native language is Serbian. In addition to being converted to digital format, the texts were annotated for the errors the learners made in writing. The three main stages in building the corpus were its design and digitisation, the proposal of an error annotation system, and the development of tools for building and searching the corpus. All three stages are described in detail in the dissertation. Specifically, attention is given to practical steps such as the collection of data and metadata, and to conceptual tasks such as defining the corpus design criteria, compiling the error tags, and designing the corpus interface, software and tools.
The SFLC software relies on four main tools, which support entering data and metadata into the corpus database, tagging errors, retrieving and searching documents (by surface word form or by error), and generating statistical reports on errors. Based on the statistical reports that the SFLC produces, the dissertation also carries out an error analysis, examining the frequency and distribution of errors in the corpus as a whole and at individual levels of proficiency in Persian. The results of this corpus-based analysis show that learners whose native language is Serbian most often make errors in the domain of orthography at lower proficiency levels, whereas later on errors are more often found in the domains of lexis and syntax. Word order errors are marked as the most frequent error type overall in the whole corpus. The total number of errors decreases as learners move from level A2 to level C1. In syntax, however, the number of errors increases, owing to the use of more complex syntactic structures at higher levels.

The SFLC not only provides authentic data collected from learners at different proficiency levels but also offers statistics on the tagged errors and other corpus parameters. It is therefore concluded that the corpus can be of great benefit to research on and teaching of Persian as a second/foreign language.

Keywords: Learner corpus, error analysis, second language acquisition, teaching Persian as a foreign language.
Scientific field: Linguistics
Narrower scientific field: Corpus linguistics, applied linguistics
UDC number:

TABLE OF CONTENTS

1. Introduction
  1.1 Learner Corpora, Second Language Acquisition and Error Analysis
  1.2 Overarching Goals and Motivation
  1.3 Specific Objectives and Thesis Research Methodology
  1.4 Thesis Research Methodology
  1.5 Outline of the Thesis
2. Review of the Literature
  2.1 Corpora and Corpus Linguistics
    2.1.1 Types of Corpora
    2.1.2 Types of Corpora in Language Learning and Teaching
  2.2 Learner Corpora
    2.2.1 Learner Corpus Research
  2.3 Types of Learner Corpora
    2.3.1 Types of LC Based on Comparative Descriptions
    2.3.2 Types of LC Based on Corpus Features and Design Criteria
  2.4 Learner Corpora and SLA Research
  2.5 Stages in Learner Corpora Research
  2.6 Learner Corpora Applications
    2.6.1 Delayed Usage vs. Immediate Usage of LC
    2.6.2 Specific Applications of LC