Strategies for Learning Lexemes Efficiently: a Graph-Based Approach
Total Page:16
File Type:pdf, Size:1020Kb
COGNITIVE 2018 : The Tenth International Conference on Advanced Cognitive Technologies and Applications Strategies for Learning Lexemes Efficiently: A Graph-Based Approach Jean-Marie Poulin, Alexandre Blondin Massé and Alexsandro Fonseca Département d’informatique Université du Québec à Montréal Montréal, QC, Canada H3C 3P8 Email: [email protected] Abstract—Given a particular lexicon, what would be the best so as to minimize the overall learning effort: an “efficient strategy to learn all of its lexemes? By using elementary graph learning strategy”. theory, we propose a simple formal model that answers this question. We also study several learning strategies by comparing There is a large amount of related work aiming to identify their efficiency on eight digital English dictionaries. It turns small subsets of words from which one can learn all remaining out that a simple strategy based purely on the degree of the words of a given language. It has been for a long time of great vertices associated with the lexemes could improve significantly interest from psycholinguistic, pedagogical and computational the learning process with respect to other psycholinguistical points of view. For instance, in 1936, Ogden proposed a strategies. reduced list of 850 English words which would suffice to Keywords–Lexicons; Learning; Strategies; Graph theory. express virtually any complex words or thought [8]. In 1953, West [9] published the “General Service List” (GSL). Based on a corpus of 5 million words and containing about 2000 I. INTRODUCTION words, it is oriented toward the needs of students learning When learning a new language, the effort to develop a english as a second language. Despite having been criticized sufficient vocabulary basis plays an important role. Notwith- numerous times for its shortcomings and its incompleteness, standing the fact that various cognitive skills are required, it was considered until very recently as irreplaceable [10]. In being able to associate a word with its meaning, its definition, the last few years, two principal contenders have been vying is an essential part of the learning process. According to with one another to replace West’s GSL. At about the same Schmitt [1], the "form-meaning link is the first and most time in 2013, Brezina and Gablasova [11], and Browne [12] essential lexical aspect which must be acquired". But as Gu both presented what they call their New General Service List and Johnson [2] mention, vocabulary acquisition is an intri- (NGSL), whose purpose is to restrict the attention to the most cate task. Joyce identifies and compares two such strategies, basic English words that should be understood first by non aimed at vocabulary improvement [3]: L1 traduction, from the native speakers. A question remains open though: What is the speaker’s native language, and L2 definition, in the language optimal way to establish those word lists in an automated way? being learned. Traduction is in itself a very different problem, All the word lists discussed above were built using large which we do not address here. As for the “L2 definition” corpora. In a recent work, Nation [13] even describes a approach, it can be seen as the action of consulting a dictionary detailed corpus-based appoach to word list building. Our main to acquaint oneself with the definition of a word in the new contribution in this paper is to present a different, lexicon- language, thus establishing this crucial “form-meaning link”. based technique. To our knowledge, there has never been But what if this definition contains unknown words? Shall a fully computational, graph-based approach for identifying the reader examine in turn the definition of these unknown efficient learning strategies of a complete lexicon. Although in words in the lexicon? And then the unknown words in the real life the process of learning words is clearly more intricate definition of the unknown words, and so on? As discussed than the method we present here, our results suggest that an by Blondin Massé et al. in [4], this can lead to an infinite hybrid strategy, based both on cognitive observations and on regression, the symbol grounding problem [5]. At some point formal tools such as graphs, could enhance significantly the in time, it is necessary to learn some words in ways other way people improve their L2 vocabulary. than dictionary perusal: either by sensorimotor experience, The manuscript is divided as follows. In Section II, we or through some other external contribution. In particular, it introduce definitions and notation about lexicons, graphs and seems interesting to design a learning strategy to alleviate the grounding sets. In Section III, we discuss different lexicon- burden of these potentially expensive forms of learning. based learning strategies. Section IV describes the data sets Dansereau characterizes a learning strategy as a sequence used in our analyses. Section V is devoted to the comparison of "processes or steps that can facilitate the acquisition, storage of the efficiency of those different strategies. Finally, we briefly and/or utilization of information" [6]. And in the more specific conclude in Section VI. context of second language pedagogy, Bialystok defines it as "activities in which the learner may engage for the purpose of improving target language competence" [7]. One compelling II. LEXICONS,GRAPHS AND GROUNDING SETS alternative to the expensive direct learning approach is to We now propose formal definitions and notation about identify a sequence of words, as small as possible, and ordered lexicons, when viewed as directed graphs. We believe that Copyright (c) IARIA, 2018. ISBN: 978-1-61208-609-5 18 COGNITIVE 2018 : The Tenth International Conference on Advanced Cognitive Technologies and Applications this rather formal representation simplifies the discussion when to the classical book by Bondy and Murty [15], but for sake comparing the efficiency of several lexicon learning strategies. of self-consistency, we briefly recall some definitions and Roughly speaking, a lexicon can be defined as a set of notation. The formal representation of lexicons is inspired from lexical units, called lexemes enriched with definitions and the definition of dictionaries found in [4]. arbitrary additional information [14]. For our purposes, we A directed graph is an ordered pair G = (V; A), where consider the following simplified representation of a lexicon. V is a finite set whose elements are called vertices and A ⊆ V × V is a finite set whose elements are called arcs. Directed Definition 1. A lexicon is a quadruple X = (A; P; L; D), graphs are useful for representing the relation “lexeme ` defines where lexeme `0”: Given a disambiguated lexicon X = (A; P; L; D), (i) A is a finite alphabet, whose elements are called letters. we define the graph G(X) of X as the directed graph whose (ii) P is a nonempty finite set whose elements are syntactic set of vertices is V = L and whose set of arcs A satisfies 0 0 categories, called parts-of-speech (POS). In particular, (`; ` ) 2 A if and only if ` 2 D(` ). In other words, the 0 it contains a special element denoted by STOP, which lexemes are the vertices, and there is an arrow from ` to ` if 0 identifies lexemes whose semantic value is ignored. and only if ` appears in the definition of ` . Figure 1 depicts a subgraph of the graph G(X ) (see Example 1). (iii) L is a finite set of triples ` = (w; i; p) called lexemes, 1 i ∗ denoted by ` = wp, where w 2 A is a word form or simply word, i ≥ 1 is an integer and p 2 P. If p = STOP, then ` is called a stop lexeme. We say that (w; i; p) is Rest of the Lexicon the i-th sense of the pos-tagged word (w; p). To make the numbering consistent, we also assume that if (w; i; p) 2 L and i > 1, then (w; i − 1; p) 2 L as well. Whenever there exists (w; i; p) 2 L with i > 1, we say that (w; p) and L polysemic are . 1 2 foodN fruitN (iv) D is a map associating, with each lexeme ` 2 L, a finite ∗ sequence D(`) = (d1; d2; : : : ; dk), where di 2 A for 1 1 existenceN actionN 1 1 i = 1; 2; : : : ; k, called the definition of `. substanceN resultN eat1 flesh2 ∗ ∗ V sustain1 work1 N If we replace the condition di 2 A by di 2 A × P in V N (iv), then D(`) is called a POS-tagged definition of ` and we 1 1 consumeV animalN say that X is a POS-tagged lexicon. It is also convenient to 1 2 consider only the lemmatized, canonical form of words. If we mouthN partN 1 1 replace in (iv) the condition by di 2 L, then D(`) is called a edibleA fleshN safe1 part1 disambiguated definition and we say that X is a disambiguated A N 1 1 lexicon. Finally, if D(`) is non empty whenever ` is a non-stop useV developV lexeme, then we say that X is complete. 1 1 livingA smallA 1 1 groupN 1 newA 1 thingN 1 Example 1. Let X1 = (A; P; L; D) be the lexicon such that vegetableN fruitN A = fa; b; : : : ; zg; plant1 seed1 P = fN; V; A; R; Sg; N N where N, V, A, R, S stand for noun, verb, adjective, adverb, STOP respectively, and L and D are both defined by Table I. Then X1 is polysemic, lemmatized and disambiguated. More- Figure 1. Graph of a polysemic, lemmatized, disambiguated, complete lexicon over, assuming that all words used in at least one definition are defined as well, X1 is complete. Let G = (V; A) be a directed graph. Given u; v 2 V , we say that u is a predecessor of v if (u; v) 2 A.