A Cross-Linguistic Database of Children’s Printed Words in Three Slavic Languages

Source: CORE (core.ac.uk), provided by Cork Open Research Archive

Title: A cross-linguistic database of children's printed words in three Slavic languages
Author(s): Garabik, Radovan; Caravolas, Marketa; Kessler, Brett; Hoeflerova, Eva; Masterson, Jackie; Mikulajova, Marina; Szczerbinski, Marcin; Wierzchon, Piotr
Editor(s): Levicka, Jana; Garabik, Radovan
Publication date: 2007
Original citation: Garabík, R., Caravolas, M., Kessler, B., Höflerová, E., Masterson, J., Mikulajová, M., Szczerbiński, M., Wierzchoń, P. (2007). 'A cross-linguistic database of children’s printed words in three Slavic languages'. In Levická, J., & Garabík, R. (Eds.). Computer Treatment of Slavic and East European Languages: Fourth International Seminar, Bratislava, Slovakia, 25–27 October 2007: Proceedings (pp. 51–64). Bratislava: Tribun.
Type of publication: Conference item
Link to publisher's version: http://spell.psychology.wustl.edu/Garabík2007/ and http://korpus.juls.savba.sk/~slovko/2007/
Access to the full text of the published version may require a subscription.
Rights: © 2007, the respective authors. The articles can be used under the Creative Commons Attribution-ShareAlike 3.0 Unported License. http://creativecommons.org/licenses/by-sa/3.0/
Item downloaded from: http://hdl.handle.net/10468/736
Downloaded on: 2017-02-12T11:34:32Z

A Cross-linguistic Database of Children’s Printed Words in Three Slavic Languages

Radovan Garabík¹, Markéta Caravolas², Brett Kessler³, Eva Höflerová⁴, Jackie Masterson⁵, Marína Mikulajová⁶, Marcin Szczerbiński⁷, and Piotr Wierzchoń⁸

¹ Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava
² School of Psychology, University of Liverpool
³ Psychology Department, Washington University in St. Louis
⁴ Department of Czech Language and Literature with Didactics, University of Ostrava
⁵ Institute of Education, University of London
⁶ Faculty of Education, Comenius University, Bratislava
⁷ Department of Human Communication Sciences, The University of Sheffield
⁸ Institute of Linguistics, Adam Mickiewicz University, Poznań

Abstract. We describe a lexical database consisting of morphologically and phonetically tagged words that occur in the texts primarily used for language arts instruction in the Czech Republic, Poland and Slovakia in the initial period of primary education (up to grade 4 or 5). The database aims to parallel the contents and usage of the British English Children’s Printed Word Database. It contains words from texts of the most widely used Czech, Polish and Slovak textbooks. The corpus is accessible via a simple WWW interface, allowing regular expression searches and Boolean expressions across word forms, lemmas, morphology tags and phonemic transcription, and providing useful statistics on the textwords included. We anticipate extensive usage of the database as a reference in the development of psychodiagnostic batteries for literacy impairments in the three languages, as well as for the creation of experimental materials in psycholinguistic research.

1 Motivations for the West Slavic database

Lexical databases that reflect language use across the developmental spectrum are critical tools for research on the development of spoken and written language skills because they allow researchers to select materials for their studies that are age- and grade-appropriate. A number of databases exist for adult language, but only a few have been developed based on child language.
Available child-language corpora include the earlier American English sources The American Heritage Word Frequency Book (Carroll, 1971) and The Educator’s Word Frequency Guide (Zeno, Ivens, Millard & Duvvuri, 1995), and more recently, Manulex, a French database (Lété, Sprenger-Charolles & Colé, 2004) and the British English Children’s Printed Word Database, CPWD (Masterson, Stuart, Dixon & Lovejoy, 2003). However, to our knowledge, no corpus of children’s printed words has been published in any of the Slavic languages.

The data that can be generated from lexical databases have diverse applications in psycholinguistic research. For example, they can produce statistics about lexical and sublexical variables such as frequency of specific units, word length in terms of letters or syllables, orthographic and phonological neighbourhoods, and grapheme–phoneme consistency, to name but a few. Accumulating evidence shows that text-based variables such as these affect learning to read and spell from an early age (e.g. Caravolas, Kessler, Hulme & Snowling, 2005; Treiman & Kessler, 2006; Pacton, Perruchet & Fayol, 2001). An emerging key issue in this research area concerns the relative influence of orthographic depth on the learning process: Does the predictability (transparency) of a specific writing system significantly influence the way children learn to read and write it? Direct cross-language comparisons based on corpus statistics will play a critical role in answering this question. However, a current limitation is that there are still few children’s lexical databases in different languages, and those that do exist rarely generate directly comparable statistics. This is because databases may be designed for different scientific purposes and thus do not always contain similar information from language to language.
Moreover, linguistic features that are important in one language may be deemed to be of marginal importance and thus not warrant inclusion in another. Thus, a fundamental motivation for our project was to redress these shortcomings in the creation of a database that would allow direct cross-linguistic comparisons of a wide range of measures across Czech, Polish and Slovak. Cognizant of the prevalence of English-language research and of English-based models of language and literacy development, we based the West Slavic lexical database (Weslalex) on the existing English CPWD (Masterson et al., 2003). These design features will enable researchers of Slavic languages to investigate questions that could not be addressed without corpus data, and they will facilitate meaningful comparisons to English measures, which so often provide the benchmark in developmental psycholinguistic research.

2 Types of corpus statistics provided in existing children’s corpora

The existing American English corpora provide only word frequency information across (Carroll, 1971) and within (Zeno et al., 1995) primary school grades. The French Manulex (Lété et al., 2004) currently contains lemmatized and nonlemmatized grade-level word frequency lists, limited part of speech (POS) information, and letter frequencies. The more recent extension, Manulex-infra (Peereman, Lété & Sprenger-Charolles, in press), generates statistics at the sublexical level (syllable, grapheme-to-phoneme mappings, bigrams), and lexical level (lexical neighbourhood, homophony and homography). The British English CPWD (Masterson et al., 2003) allows searches by grade and it offers a wide range of possible searches at the lexical and sublexical levels. These include searches of orthographic and phonological attributes such as neighbourhoods, component letters and phonemes, word length, and frequency. A feature that is currently missing from all of these corpora is a detailed morphosyntactic level of analysis.
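
To make the lexical-level statistics named above more concrete (word frequency, word length, orthographic neighbourhood), here is a minimal Python sketch. It is not the implementation used by any of the cited databases. The token list is a hypothetical toy sample rather than data from a corpus, the function names are illustrative, and a neighbour is assumed to be a same-length word that differs in exactly one letter position, which is one common substitution-based definition. Length is counted in characters, so a digraph would count as two letters.

```python
from collections import Counter

# Hypothetical toy token list for illustration only; NOT taken from any database.
TOKENS = ["mama", "mala", "sama", "mama", "dom", "som", "dom", "mama"]


def word_frequencies(tokens):
    """Count how often each wordform occurs in the token list."""
    return Counter(tokens)


def orthographic_neighbours(word, vocabulary):
    """Return same-length words that differ from `word` in exactly one position.

    Assumption: substitution-only neighbours; insertions and deletions are ignored.
    """
    return sorted(
        other
        for other in vocabulary
        if other != word
        and len(other) == len(word)
        and sum(a != b for a, b in zip(word, other)) == 1
    )


if __name__ == "__main__":
    freq = word_frequencies(TOKENS)
    vocabulary = set(TOKENS)
    for wordform in sorted(vocabulary):
        # Columns: wordform, frequency, length in characters, neighbours.
        print(wordform, freq[wordform], len(wordform),
              orthographic_neighbours(wordform, vocabulary))
```

A real database would compute these figures over the full vocabulary for a given grade, and phonological neighbourhoods would need the same comparison applied to phonemic transcriptions instead of spellings.
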
Although impressive advances are now being made in several languages, no comprehensive children’s database that includes all of the above search possibilities has yet been developed.

3 Features of the West Slavic database

The database that we are developing is modelled in part on the CPWD, and one of our key objectives is to make possible parallel cross-linguistic searches in any of our Slavic languages and this English language resource. In addition, however, a truly useful tool for psycholinguistic research in the inflected Slavic languages requires information not only at the lexical and sublexical (grapho-phonological) levels, but also at the morphophonological, grammatical and phrase levels. Thus we include POS information derived from sentence-level analyses, and one of our search tools permits searching of multiword sequences. The integration of the lexical/sublexical database and of the sentence-level database is one of the critical challenges being addressed in our project.

4 A description of the pilot database materials

The database currently contains printed words in Czech (388 654 tokens, 64 411 distinct wordforms, 24 364 distinct lemmas), Polish (175 404 tokens, 34 067 distinct wordforms, 13 767 distinct lemmas), and Slovak (180 674 tokens, 30 060 distinct wordforms, 14 610 distinct lemmas)¹ from texts primarily used for language arts instruction in each country in grades up to 4 or 5. Based on surveys carried out in each of the three countries, we selected books and materials from those series that are currently the most widely used. Some intercultural differences necessarily emerged so that different numbers of books were sampled in each country.
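
As a small worked example, the token, wordform and lemma counts reported above can be turned into two simple ratios: tokens per distinct wordform and distinct wordforms per lemma. The Python sketch below uses only the figures quoted in the text. The class and attribute names are illustrative assumptions, and the ratios are not statistics reported by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusCounts:
    """Counts quoted in the text for one language of the pilot database."""
    tokens: int
    wordforms: int
    lemmas: int

    @property
    def tokens_per_wordform(self) -> float:
        # Average number of occurrences of each distinct wordform.
        return self.tokens / self.wordforms

    @property
    def wordforms_per_lemma(self) -> float:
        # Average number of distinct inflected forms attested per lemma.
        return self.wordforms / self.lemmas


# Figures copied from the description of the pilot materials.
COUNTS = {
    "Czech": CorpusCounts(tokens=388_654, wordforms=64_411, lemmas=24_364),
    "Polish": CorpusCounts(tokens=175_404, wordforms=34_067, lemmas=13_767),
    "Slovak": CorpusCounts(tokens=180_674, wordforms=30_060, lemmas=14_610),
}

if __name__ == "__main__":
    for language, counts in COUNTS.items():
        print(f"{language}: "
              f"{counts.tokens_per_wordform:.2f} tokens per wordform, "
              f"{counts.wordforms_per_lemma:.2f} wordforms per lemma")
```

With these counts, Czech comes out at roughly 2.6 distinct wordforms per lemma, Polish at about 2.5 and Slovak at about 2.1. Such ratios depend on corpus size as well as on morphology, so they should be read as a rough description of the samples, not as a direct comparison of the languages.
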
The simplest case proved to be Slovak, where only one language arts series is approved by the Ministry of Education; thus we selected the designated readers and one Slovak language grammar book from each primary grade (1 to 4). The total number of Slovak books is therefore relatively small (9 books), but they represent an exhaustive sample of the materials children read as part of their language arts instruction. The Czech case was less straightforward because
