A Study of Issues and Techniques for Creating Core Vocabulary Lists for English As an International Language


A STUDY OF ISSUES AND TECHNIQUES FOR CREATING CORE VOCABULARY LISTS FOR ENGLISH AS AN INTERNATIONAL LANGUAGE

BY C. JOSEPH SORELL

A thesis submitted to Victoria University of Wellington in fulfilment of the requirements for the degree of Doctor of Philosophy

Victoria University of Wellington 2013

ABSTRACT

Core vocabulary lists have long been a tool used by language learners and instructors seeking to facilitate the initial stages of foreign language learning (Fries & Traver, 1960: 2). In the past, these lists were typically based on the intuitions of experienced educators. Even before the advent of computer technology in the mid-twentieth century, attempts were made to create such lists using objective methodologies. These efforts regularly fell short, however, and in the end had to be tweaked subjectively. Now, in the 21st century, this is unfortunately still true, at least for those lists whose methodologies have been published.

Given the present availability of sizable English-language corpora from around the world and affordable personal computers, this thesis seeks to fill this methodological gap by answering the research question: How can valid core vocabulary lists for English as an International Language be created?

A practical taxonomy is proposed based on Biber’s (1988, 1995) multi-dimensional analysis of English texts. This taxonomy is based on correlated linguistic features and reasonably covers representative spoken and written texts in English. The four-part main study assesses the variance in vocabulary data within each of the four key text types: interactive (face-to-face conversation), academic exposition, imaginative narrative, and general reported exposition.
The variation in word types found at progressive intervals in corpora of various sizes is measured using the Dice coefficient, a coefficient originally used to measure species variation in different biotic regions (Dice, 1945). The second study proceeds to compare the most frequent vocabulary types in each of the four text types using an equal-sized collection of each text type. Of special interest is the difference between spoken and written texts.

Though types are arguably the proper unit to investigate when comparing vocabulary variation, few learners would want to approach vocabulary learning one word type at a time (Nation & Meara, 2002; Bauer & Nation, 1993). The third study thus compares the effect that reordering words as families (as opposed to types) has on core vocabulary lists. An analysis is made of the major differences resulting from grouping the members of each word family under a single headword and summing their individual frequencies.

Methods are then discussed for how core vocabulary lists of various sizes can be constructed based on the findings of these three studies. Recommendations are made regarding the size and composition of the source corpus and the core list extraction and construction methodology based on the learning objectives.

ACKNOWLEDGEMENTS

A simple reference in the bibliography would be insufficient recognition of the contribution made to my research by Prof. Stefan Gries. Without his text Quantitative Corpus Linguistics with R (2009e), I probably would not have finished this thesis. When the paperback edition came out in late 2010, I ordered it immediately and spent the next semester working page by page through that text and his Statistics for Linguistics with R. These texts are valuable resources, not only for learning how to use the R language but also for understanding corpus linguistic research.

Dr. Marc Hasenbank’s one-day seminar on R sponsored by the University Teaching Development Centre at Victoria University in July of 2010 and subsequent explanations by Dr. Sasha Calhoun were especially helpful in pointing me toward R as a useful solution to the programming I needed to do for this thesis. As I would later learn, small influences can lead to significant changes.

I also wish to express special thanks to Prof. Estate Khmaladze, Dr. Yuichi Hirose, Dr. Giorgi Kvizhinadze and Dr. Haizhen Wu for their time and patience in discussing Zipf’s law and word frequencies with someone whose math skills should have prevented him from even setting foot in the School of Mathematics.

Thanks go, as well, to the faculty and students of the School of Linguistics and Applied Language Studies at Victoria University for always welcoming me during my regular migrations between Taipei and Wellington, especially to the Vocabulary Discussion Group: Myq Larsen, Yosuke Sasao, Tatsuya Nakata, Michael Rogers, Betsy Quero, Tatsushiko Matsushita, Friederike Tegge and others.

One of the highlights of every stay in Wellington was the opportunity to spend each Shabbat at Temple Sinai. I want to especially thank Fred, Susan, Rick, Martin and many others for welcoming me into the community.

I owe an immense debt of gratitude to Prof. Paul Nation. His writings and the clarity with which he translates research findings into practical teaching applications were what first attracted me to Victoria University and the School of Linguistics and Applied Language Studies. At conferences in several countries, I have heard him encourage language teachers to engage in research, and to share their findings with the language learning community by publishing. To say that the writing topics he suggested to me were life-changing would not be an exaggeration.
Pursuing a doctoral degree is not usually good for one’s health, but the suggestion to write on Zipf’s law and word frequencies led to discoveries of how dynamic systems work, including the human body. Prof. Nation continues to inspire my learning and teaching, and now, the teaching of many of my former students.

I also wish to thank many of my colleagues here in Taipei who regularly encouraged me and Christ’s College for granting me leave on more than one occasion to focus more intently on research.

This thesis is dedicated to my wife, Shufeng, my son, Nathan, and my parents, Michael and Barbara. My parents were my first teachers and are still among the most dedicated I have ever met. During the time it took to write this thesis, Nathan has become a proficient user of two languages, both of which he uses to research the inner workings of computers. When my laptop began to black out from running week-long R scripts, he single-handedly designed and assembled a more powerful computer that I have used to complete most of the important studies in this thesis. During the last few years, Shufeng has put up with a husband who spent more time in the office than at home. When I did make it home, she always made sure there were plenty of fresh fruits and vegetables to keep me going. Shufeng also made sure Nathan paid attention to things other than computers. I look forward to being able to spend more time with them now that this thesis is finished and I have a better understanding of how research should be conducted.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1.0 INTRODUCTION
1.1 Problems with vocabulary lists
1.1.1 Words are listed in an inappropriate sequence
1.1.2 Word lists are not consistent
1.1.3 Word lists are not representative
1.1.4 Which variety of English?
1.1.5 What is a word?
1.1.6 How big is the core?
1.1.7 Perceptions of word lists
1.1.8 Summary of problems with core vocabulary lists
1.2 Purpose and Significance
1.3 Research questions and hypotheses
1.4 Definition of terms
1.4.1 Core vocabulary
1.4.2 Corpus
1.4.3 General and comparable corpora
1.4.4 Word types and tokens
1.4.5 Hapax legomenon and dislegomenon
1.4.6 Lemmas and word families
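
The abstract names two computations that a short sketch may make more concrete: the Dice coefficient between the word types of two text samples, and the regrouping of type frequencies under word-family headwords. The Python below is an illustration only and should not be read as the thesis's own procedure, which (per the acknowledgements) was written in R and is not shown in this excerpt. The function names, the toy sentences, and the hypothetical_families mapping are all invented for the example; the formula used is the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|).

```python
from collections import Counter


def dice_coefficient(types_a, types_b):
    """Dice coefficient between two collections of word types.

    Computed as 2 * |A ∩ B| / (|A| + |B|), so 1.0 means identical
    type sets and 0.0 means no shared types.
    """
    a, b = set(types_a), set(types_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty samples count as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def family_frequencies(type_counts, family_of):
    """Sum type frequencies under each word-family headword.

    family_of maps a word type to its headword (a hypothetical mapping here);
    any type missing from the mapping is treated as its own headword.
    """
    totals = Counter()
    for word, count in type_counts.items():
        totals[family_of.get(word, word)] += count
    return totals


# Toy samples, invented purely for illustration.
sample_a = "the cat sat on the mat and the cat slept".split()
sample_b = "the dog sat on the rug and the dog barked".split()

# Overlap between the type inventories of the two samples.
overlap = dice_coefficient(sample_a, sample_b)

# Regroup type counts as family counts under a hypothetical family mapping.
hypothetical_families = {"teaches": "teach", "teaching": "teach", "taught": "teach"}
type_counts = Counter({"teach": 4, "teaches": 2, "teaching": 3, "taught": 1, "word": 5})
by_family = family_frequencies(type_counts, hypothetical_families)

print(f"Dice overlap: {overlap:.3f}")
print(by_family.most_common())
```

With real data, the two samples would presumably be equal-sized slices of a corpus and the family mapping would come from an established word-family list; both are assumed here rather than taken from the source.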
Recommended publications
  • Teaching and Learning Academic Vocabulary
    Musa Nushi Shahid Beheshti University Homa Jenabzadeh Shahid Beheshti University Teaching and Learning Academic Vocabulary Developing learners' lexical competence through vocabulary instruction has always been high on second language teachers' agenda. This paper will be focusing on the importance of academic vocabulary and how to teach such vocabulary to adult EFL/ESL learners at intermediate and higher levels of proficiency in the English language. It will also introduce several techniques, mostly those that engage students’ cognitive abilities, which can in turn facilitate the process of teaching and learning academic vocabulary. Several online tools that can assist academic vocabulary teaching and learning will also be introduced. The paper concludes with the introduction and discussion of four important academic vocabulary word lists which can help second language teachers with the identification and selection of academic vocabulary in their instructional planning. Keywords: Vocabulary, Academic, Second language, Word lists, Instruction. 1. Introduction Vocabulary is essential to conveying meaning in a second language (L2). Schmitt (2010) notes that L2 learners seem cognizant of the importance of vocabulary in language learning, as evidenced by their tendency to carry dictionaries, and not grammar books, around with them. It has even been suggested that the main difference between intermediate and advanced L2 learners lies not in how complex their grammatical knowledge is but in how expanded and developed their mental lexicon is (Lewis, 1997). Similarly, McCarthy (1990) observes that "no matter how well the student learns grammar, no matter how successfully the sounds of L2 are mastered, without words to express a wide range of meanings, communication in an L2 just cannot happen in any meaningful way" (p.
  • Defining Vocabulary Words Grades
    Vocabulary Instruction Booster Session 2: Defining Vocabulary Words Grades 5–8 Vaughn Gross Center for Reading and Language Arts at The University of Texas at Austin © 2014 Texas Education Agency/The University of Texas System Acknowledgments Vocabulary Instruction Booster Session 2: Defining Vocabulary Words was developed with funding from the Texas Education Agency and the support and talent of many individuals whose names do not appear here, but whose hard work and ideas are represented throughout. These individuals include national and state reading experts; researchers; and those who work for the Vaughn Gross Center for Reading and Language Arts at The University of Texas at Austin and the Texas Education Agency. Vaughn Gross Center for Reading and Language Arts College of Education The University of Texas at Austin www.meadowscenter.org/vgc Manuel J. Justiz, Dean Greg Roberts, Director Texas Education Agency Michael L. Williams, Commissioner of Education Monica Martinez, Associate Commissioner, Standards and Programs Development Team Meghan Coleman, Lead Author Phil Capin Karla Estrada David Osman Jennifer B. Schnakenberg Jacob Williams Design and Editing Matthew Slater, Editor Carlos Treviño, Designer Special thanks to Alice Independent School District in Alice, Texas Introduction Explicit and robust vocabulary instruction can make a significant difference when we are purposeful in the words we choose to teach our students.
  • MASC: the Manually Annotated Sub-Corpus of American English
    MASC: The Manually Annotated Sub-Corpus of American English Nancy Ide*, Collin Baker**, Christiane Fellbaum†, Charles Fillmore**, Rebecca Passonneau†† *Vassar College Poughkeepsie, New York USA **International Computer Science Institute Berkeley, California USA †Princeton University Princeton, New Jersey USA ††Columbia University New York, New York USA Abstract To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing
  • Aspects of Idiom*
    University of Calgary PRISM: University of Calgary's Digital Repository Calgary (Working) Papers in Linguistics Volume 07, Winter 1982 1982-01 Aspects of idiom* Sayers, Coral University of Calgary Sayers, C. (1982). Aspects of idiom*. Calgary Working Papers in Linguistics, 7(Winter), 13-32. http://hdl.handle.net/1880/51297 journal article Downloaded from PRISM: https://prism.ucalgary.ca Aspects of Idiom* Coral Sayers 1. Introduction Weinreich (1969) defines an idiom as "a complex expression whose meaning cannot be derived from the meanings of its elements." This writer has collected examples of idioms from the English, German, Australian English, and Quebec French dialects (Appendix I) in order to examine the properties of the idiom, to explore in the literature the current concepts of idiom, and to relate relevant knowledge gained by these processes to the teaching of English as a Second Language. An idiom may be a word, a "lexical idiom," or, if it has a more complicated constituent structure, a "phrasal idiom." (Katz and Postal, 1964, cited by Fraser, 1970:23) As the lexical idiom, dominated by a single branch syntactic category (e.g. "squatter," a noun) does not present as much difficulty as does the phrasal idiom, this paper will focus upon the latter. Despite the difficulties presented by idioms to the Transformational Grammar model, e.g. the contradiction of the claim that virtually all sentences have a low occurrence probability and frequency (Coulmas, 1979), the consensus of opinion seems to be that idioms can be accommodated in the grammatical description. Phrasal idioms of English should be considered as a "more complicated variety of monomorphemic lexical entries." (Fraser, 1970:41) 2.
  • SELECTING and CREATING a WORD LIST for ENGLISH LANGUAGE TEACHING by Deny A
    Teaching English with Technology, 17(1), 60-72, http://www.tewtjournal.org SELECTING AND CREATING A WORD LIST FOR ENGLISH LANGUAGE TEACHING by Deny A. Kwary and Jurianto Universitas Airlangga Dharmawangsa Dalam Selatan, Surabaya 60286, Indonesia d.a.kwary@unair.ac.id Abstract Since the introduction of the General Service List (GSL) in 1953, a number of studies have confirmed the significant role of a word list, particularly GSL, in helping ESL students learn English. Given the recent development in technology, several researchers have created word lists, each of which claims to provide a better coverage of a text and a significant role in helping students learn English. This article aims at analyzing the claims made by the existing word lists and proposing a method for selecting words and creating a word list. The result of this study shows that there are differences in the coverage of the word lists due to the difference in the corpora and the source text analysed. This article also suggests that we should create our own word list, which is both personalized and comprehensive. This means that the word list is not just a list of words. The word list needs to be accompanied with the senses and the patterns of the words, in order to really help ESL students learn English. Keywords: English; GSL; NGSL; vocabulary; word list 1. Introduction A word list has been noted as an essential resource for language teaching, especially for second language teaching. The main purpose for the creation of a word list is to determine the words that the learners need to know, in order to provide more focused learning materials.
  • Background and Context for CLASP
    Background and Context for CLASP. Nancy Ide, Vassar College.
    The Situation: Standards efforts have been on-going for over 20 years. Interest and activity mainly in Europe in 90’s and early 2000’s. Text Encoding Initiative (TEI), 1987: still ongoing, used mainly by humanities. EAGLES/ISLE: developed standards for morpho-syntax, syntax, sub-categorization, etc. (links on CLASP wiki). Corpus Encoding Standard (now XCES - http://www.xces.org).
    Main Aspects: Harmonization of formats for linguistic data and annotations. Harmonization of descriptors in linguistic annotation. These two are often mixed, but need to deal with them separately (see CLASP wiki).
    Formats: The Past 20 Years (timeline slide): 1987 TEI; myriad of formats; 1994 MULTEXT, CES; ~1996 XML; 2000 ISO TC37 SC4; 2001 LAF model introduced; now LAF/GrAF, ISO standards; myriad of formats.
    Actually…: Things are better now. XML use. Moves toward common models, especially in Europe. US community seeing the need for interoperability. Emergence of common processing platforms (GATE, UIMA) with underlying common models.
    Resources (timeline slide, 1990 to present; era markers: World Wide Web, XML, Semantic Web): WordNet gains ground as a “standard” LR; Penn Treebank, Wall Street Journal Corpus; British National Corpus; EuroWordNet; Comlex; FrameNet; American National Corpus; Global WordNet; More FrameNets; SUMO; VerbNet; PropBank, NomBank; MASC.
    NLP software (timeline slide; year markers: 1994, 1995, 1996, 1998, 2003, 200?): MULTEXT > LT tools, LT XML; GATE (Sheffield); Alembic Workbench; ATLAS (NIST); What happened to this?; Callisto; UIMA. Now: GATE
  • An Analysis of the Merriam-Webster's Advanced Learner's English Dictionary, Second Edition
    An Analysis of the Merriam-Webster's Advanced Learner's English Dictionary, Second Edition Takahiro Kokawa Yukiyoshi Asada Junko Sugimoto Tetsuo Osada Kazuo Ikeda 1. Introduction The year 2016 saw the publication of the revised Second Edition of the Merriam-Webster's Advanced Learner's English Dictionary (henceforth MWALED2), after the eight year interval since the first edition was put on the market in 2008. We would like to discuss in this paper how it has, or has not changed through nearly a decade of updating between the two publications. In fact, when we saw the striking indigo color on the new edition’s front and back covers, which happens to be prevalent among many EFL dictionaries on the market nowadays, yet totally different from the emerald green appearance of the previous edition, MWALED1, we anticipated rather extensive renovations in terms of lexicographic features, ways of presentation and the descriptions themselves, as well as the extensiveness of information presented in MWALED2. However, when we compare the facts and figures presented in the blurbs of both editions, we find little difference between the two versions: ‘more than 160,000 example sentences,’ ‘100,000 words and phrases with definitions that are easy to understand,’ ‘3,000 core vocabulary words,’ ‘32,000 IPA pronunciations,’ ‘More than 22,000 idioms, verbal collocations, and commonly used phrases from American and British English,’ ‘More than 12,000 usage labels, notes, and paragraphs’ and ‘Original drawings and full-color art aid understanding [sic]’ are the common claims by the two editions.
  • Informatics 1: Data & Analysis
    Informatics 1: Data & Analysis Lecture 12: Corpora Ian Stark School of Informatics The University of Edinburgh Friday 27 February 2015 Semester 2 Week 6 http://www.inf.ed.ac.uk/teaching/courses/inf1/da Student Survey Final Day ESES: The Edinburgh Student Experience Survey http://www.ed.ac.uk/students/surveys Please log on to MyEd before 1 March to complete the survey. Help guide what we do at the University of Edinburgh, improving your future experience here and that of the students to follow. Ian Stark Inf1-DA / Lecture 12 2015-02-27 Lecture Plan XML We start with technologies for modelling and querying semistructured data. Semistructured Data: Trees and XML Schemas for structuring XML Navigating and querying XML with XPath Corpora One particular kind of semistructured data is large bodies of written or spoken text: each one a corpus, plural corpora. Corpora: What they are and how to build them Applications: corpus analysis and data extraction Homework Tutorial Exercises Tutorial 5 exercises went online earlier this week. In these you use the xmllint command-line tool to check XML validity and run your own XPath queries. Reading T. McEnery and A. Wilson. Corpus Linguistics. Second edition, Edinburgh University Press, 2001. Chapter 2: What is a corpus and what is in it? (§2.2.2 optional) Photocopied handout, also available from the ITO. Remote Access to DICE Much coursework can be done on your own machines, but sometimes it’s important to be able to connect to and use DICE systems.
  • The Expanding Horizons of Corpus Analysis
    The expanding horizons of corpus analysis Brian MacWhinney Carnegie Mellon University Abstract By including a focus on multimedia interactions linked to transcripts, corpus linguistics can vastly expand its horizons. This expansion will rely on two continuing developments. First, we need to develop easily used methods for each of the ten analytic methods we have examined, including lexical analyses, QDA (qualitative data analysis), automatic tagging, language profiles, group comparisons, change scores, error analysis, feedback studies, conversation analysis, and modeling. Second, we need to work together to construct a unified database for language studies and related sciences. This database must be grounded on the principles of open access, data-sharing, interoperability, and integrated structure. It must provide powerful tools for searching, multilinguality, and multimedia analysis. If we can build this infrastructure, we will be able to explore more deeply the key questions underlying the structure and functioning of language, as it emerges from the intermeshing of processes operative on eight major timeframes. 1. Introduction Corpus linguistics has benefitted greatly from continuing advances in computer and Internet hardware and software. These advances have made it possible to develop facilities such as BNCweb (bncweb.lancs.ac.uk), LDC (Linguistic Data Consortium) online, the American National Corpus (americannationalcorpus.org), TalkBank (talkbank.org), and CHILDES (childes.psy.cmu.edu). In earlier periods, these corpora were limited to written and transcribed materials. However, most newer corpora now include transcripts linked to either audio or video recordings. The development of this newer corpus methodology is facilitated by technology which makes it easy to produce high-quality video recordings of face-to-face interactions.
  • Prototypes and Discreteness in Terminology Examples of Domains
    Prototypes and Discreteness in Terminology Pius ten Hacken Swansea University Characterizing the nature of terms in their opposition to general language words is one of the tasks of a theory of terminology. It determines the selection of entries for a terminological dictionary. This task is by no means straightforward, because terms seem to have different properties depending on the field that is studied. This is illustrated by a brief discussion of examples: terms in mathematical linguistics, traffic law, piano manufacturing, and non-terms in the reporting of general experiences. Two properties can be derived from these discussions as candidates for the delimitation of terms from general words. Firstly, the degree of specialization. This property distinguishes specialized expressions in mathematical linguistics and in piano manufacturing from non-specialized expressions in traffic law and reporting general experiences. Secondly, the lack of a prototype. In mathematical linguistics and in traffic law, the definition of terms concentrates on the boundaries of the concept. In piano manufacturing and in reporting general experiences, concepts have a prototype and fuzzy boundaries. Defining the word term as a disjunction of the two properties implies that it is a less coherent concept than general language word, because it is only the complement of the latter. When the two properties are considered in isolation, it can be shown that the degree of specialization is a gradual property whereas the lack of a prototype is an absolute property. Whether or not we choose to use the name term for it, the latter property identifies a concept that is ontologically different from general vocabulary.
  • English Monolingual Learner's Dictionaries in a Cognitive Perspective
    Ludwig-Maximilians-University Munich Institute for English Philology Prof. Dr. Hans-Jörg Schmid PhD student: Carolin Hanika English monolingual learner’s dictionaries in a cognitive perspective (working title) 1. Research background and research question Pupils in Bavarian Grammar schools are allowed to use an English monolingual dictionary from year 10 onwards when completing homework or when sitting exams, for example the famous Oxford Advanced Learner’s Dictionary; students of English are encouraged to use it all the way up to their final examinations. But despite a long history of using a monolingual learner’s dictionary, only a very small number of users is able to make the most out of the tool: many admit to not knowing how to handle the dictionary but stress that they preferred bilingual over monolingual dictionaries anyway, since the former offered the requested information much more easily and at a higher speed. This incompetence in use is on the one hand often a result of a lack of instruction, on the other hand, it seems to be a sustained fact or even a “curse”: I had the opportunity to gain experience as a teacher in this field and had to accept that even a well-prepared and well-intended instruction is far from a guarantee of the successful use of a dictionary by the pupils. But dictionary use is not solely a didactic matter; many experimental scientific studies have dealt with the topic and have tried to find out how users perform e.g. in reading or text production tasks. Apart from the (not surprising!) insight that success in dictionary performance strongly correlates with a learner’s proficiency level of the language, studies have mostly dealt with performance tasks and left out cognitive processes of the user.
  • From the Bilingual to the Monolingual Dictionary Gabriele Stein
    From the Bilingual to the Monolingual Dictionary Gabriele Stein This paper is based on the following two assumptions: 1. Vocabulary acquisition has been seriously neglected in language teaching and language learning. 2. The critical period when a massive expansion of the foreign vocabulary is needed is the intermediate level (cf. F. Twadell, 1966:79; J. C. Richards, 1976:84; E. A. Levinston, 1979:149; P. Meara, 1982:100). The main reasons for this neglect are: a. the great dependence of language teaching methodologists on the research interests and fashions in linguistics and psychology, and b. the emphasis placed by language teaching methodologists and language acquisition researchers on the beginning stage. The belief generally held is that vocabulary acquisition can be delayed until the rudiments of pronunciation and a substantial proportion of the grammatical system have been mastered. Therefore language teaching has naturally concentrated on grammar (cf. E. A. Levinston, 1979:148). Teaching methods and curricula designs vary from country to country, and from language to language. As a non-native teacher of English I can only speak for the English context, but a number of the points that I shall make may well be applicable to the teaching of other languages. What we then need in the teaching of English as a foreign and as a second language is a major reorientation in teaching methodology. A reorientation that assigns vocabulary acquisition its proper place in the language acquisition process. Who are the pupils and students that are learning English at an intermediate level? 
In countries in which English is taught as a foreign or as a second language they typically are science students who have to improve their knowledge of English to read scientific material, students who want to do a degree in a situation where the medium of instruction is English, professionals who need a better command of English for promotion, and the future English language/literature students.