A Study of Issues and Techniques for Creating Core Vocabulary Lists for English As an International Language

A STUDY OF ISSUES AND TECHNIQUES FOR CREATING CORE VOCABULARY LISTS FOR ENGLISH AS AN INTERNATIONAL LANGUAGE

BY C. JOSEPH SORELL

A thesis submitted to Victoria University of Wellington in fulfilment of the requirements for the degree of Doctor of Philosophy

Victoria University of Wellington 2013

ABSTRACT

Core vocabulary lists have long been a tool used by language learners and instructors seeking to facilitate the initial stages of foreign language learning (Fries & Traver, 1960: 2). In the past, these lists were typically based on the intuitions of experienced educators. Even before the advent of computer technology in the mid-twentieth century, attempts were made to create such lists using objective methodologies. These efforts regularly fell short, however, and – in the end – had to be tweaked subjectively. Now, in the 21st century, this is unfortunately still true, at least for those lists whose methodologies have been published. Given the present availability of sizable English-language corpora from around the world and affordable personal computers, this thesis seeks to fill this methodological gap by answering the research question: How can valid core vocabulary lists for English as an International Language be created?

A practical taxonomy is proposed based on Biber’s (1988, 1995) multi-dimensional analysis of English texts. This taxonomy is based on correlated linguistic features and reasonably covers representative spoken and written texts in English.

The four-part main study assesses the variance in vocabulary data within each of the four key text types: interactive (face-to-face conversation), academic exposition, imaginative narrative, and general reported exposition. The variation in word types found at progressive intervals in corpora of various sizes is measured using the Dice coefficient, a coefficient originally used to measure species variation in different biotic regions (Dice, 1945).
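As a rough illustration of the measure named above, the Dice coefficient between two sets of word types can be computed as twice the number of shared types divided by the sum of the two set sizes. The sketch below is only an assumption-laden example, not the procedure used in the thesis: the function names `dice_coefficient` and `word_types`, the whitespace tokenisation, and the two sample sentences are all hypothetical.

```python
def dice_coefficient(types_a, types_b):
    """Dice coefficient between two collections of word types:
    2 * |A intersect B| / (|A| + |B|)."""
    a, b = set(types_a), set(types_b)
    if not a and not b:
        # Convention assumed for this sketch: two empty samples count as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def word_types(text):
    """Naive type extraction (an assumption): lowercase, split on whitespace."""
    return set(text.lower().split())


# Two tiny hypothetical samples standing in for corpus sections.
sample_1 = word_types("the cat sat on the mat")
sample_2 = word_types("the dog sat on the log")

# Each sample has 5 types and they share 3 ("the", "sat", "on").
overlap = dice_coefficient(sample_1, sample_2)
print(f"Dice coefficient: {overlap:.2f}")
```

A value of 1 means the two samples contain exactly the same word types, and a value of 0 means they share none. With real corpora the tokenisation step would need far more care than a whitespace split.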
The second study proceeds to compare the most frequent vocabulary types in each of the four text types using an equal-sized collection of each text type. Of special interest is the difference between spoken and written texts.

Though types are arguably the proper unit to investigate when comparing vocabulary variation, few learners would want to approach vocabulary learning one word type at a time (Nation & Meara, 2002; Bauer & Nation, 1993). The third study thus compares the effect that reordering words as families (as opposed to types) has on core vocabulary lists. An analysis is made of the major differences resulting from grouping the members of each word family under a single headword and summing their individual frequencies.

Methods are then discussed for how core vocabulary lists of various sizes can be constructed based on the findings of these three studies. Recommendations are made regarding the size and composition of the source corpus and the core list extraction and construction methodology based on the learning objectives.

ACKNOWLEDGEMENTS

A simple reference in the bibliography would be insufficient recognition of the contribution made to my research by Prof. Stefan Gries. Without his text Quantitative Corpus Linguistics with R (2009e), I probably would not have finished this thesis. When the paperback edition came out in late 2010, I ordered it immediately and spent the next semester working page by page through that text and his Statistics for Linguistics with R. These texts are valuable resources, not only for learning how to use the R language but also for understanding corpus linguistic research.

Dr. Marc Hasenbank’s one-day seminar on R sponsored by the University Teaching Development Centre at Victoria University in July of 2010 and subsequent explanations by Dr. Sasha Calhoun were especially helpful in pointing me toward R as a useful solution to the programming I needed to do for this thesis.
As I would later learn, small influences can lead to significant changes.

I also wish to express special thanks to Prof. Estate Khmaladze, Dr. Yuichi Hirose, Dr. Giorgi Kvizhinadze and Dr. Haizhen Wu for their time and patience in discussing Zipf’s law and word frequencies with someone whose math skills should have prevented him from even setting foot in the School of Mathematics.

Thanks go, as well, to the faculty and students of the School of Linguistics and Applied Language Studies at Victoria University for always welcoming me during my regular migrations between Taipei and Wellington, especially to the Vocabulary Discussion Group: Myq Larsen, Yosuke Sasao, Tatsuya Nakata, Michael Rogers, Betsy Quero, Tatsushiko Matsushita, Friederike Tegge and others.

One of the highlights of every stay in Wellington was the opportunity to spend each Shabbat at Temple Sinai. I want to especially thank Fred, Susan, Rick, Martin and many others for welcoming me into the community.

I owe an immense debt of gratitude to Prof. Paul Nation. His writings and the clarity with which he translates research findings into practical teaching applications were what first attracted me to Victoria University and the School of Linguistics and Applied Language Studies. At conferences in several countries, I have heard him encourage language teachers to engage in research, and to share their findings with the language learning community by publishing. To say that the writing topics he suggested to me were life-changing would not be an exaggeration. Pursuing a doctoral degree is not usually good for one’s health, but the suggestion to write on Zipf’s law and word frequencies led to discoveries of how dynamic systems work, including the human body. Prof. Nation continues to inspire my learning and teaching, and now, the teaching of many of my former students.
I also wish to thank many of my colleagues here in Taipei who regularly encouraged me, and Christ’s College for granting me leave on more than one occasion to focus more intently on research.

This thesis is dedicated to my wife, Shufeng, my son, Nathan, and my parents, Michael and Barbara. My parents were my first teachers and are still among the most dedicated I have ever met. During the time it took to write this thesis, Nathan has become a proficient user of two languages, both of which he uses to research the inner workings of computers. When my laptop began to black out from running week-long R-scripts, he single-handedly designed and assembled a more powerful computer that I have used to complete most of the important studies in this thesis. During the last few years, Shufeng has put up with a husband who spent more time in the office than at home. When I did make it home, she always made sure there were plenty of fresh fruits and vegetables to keep me going. Shufeng also made sure Nathan paid attention to things other than computers. I look forward to being able to spend more time with them now that this thesis is finished, and I have a better understanding of how research should be conducted.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1.0 INTRODUCTION
1.1 Problems with vocabulary lists
1.1.1 Words are listed in an inappropriate sequence
1.1.2 Word lists are not consistent
1.1.3 Word lists are not representative
1.1.4 Which variety of English?
1.1.5 What is a word?
1.1.6 How big is the core?
1.1.7 Perceptions of word lists
1.1.8 Summary of problems with core vocabulary lists
1.2 Purpose and Significance
1.3 Research questions and hypotheses
1.4 Definition of terms
1.4.1 Core vocabulary
1.4.2 Corpus
1.4.3 General and comparable corpora
1.4.4 Word types and tokens
1.4.5 Hapax legomenon and dislegomenon
1.4.6 Lemmas and word families
