The Extended Lexicon: Language Processing As Lexical Description
Total Page:16
File Type:pdf, Size:1020Kb
The Extended Lexicon: Language Processing as Lexical Description Roger Evans Natural Language Technology Group Computing Engineering and Mathematics University of Brighton, UK [email protected] Abstract In this paper we introduce an approach to lexical description which is sufficiently powerful to support language processing tasks such as part-of-speech tagging or sentence recognition, traditionally consid- ered the province of external algorith- mic components. We show how this ap- proach can be implemented in the lexi- cal description language, DATR, and pro- Figure 1: A simple inheritance-based lexicon vide examples of modelling extended lex- ical phenomena. We argue that applying might be a matter of theoretical disposition, or a a modelling approach originally designed practical consideration of how the lexicon is pop- for lexicons to a wider range of language ulated or used. phenomena brings a new perspective to In the Extended Lexicon, we introduce a third the relationship between theory-based and kind of linguistic object, called word instances (or empirically-based approaches to language just instances), consisting of word forms as they processing. occur in strings (sequences of words, typically 1 The Extended Lexicon sentences). For example, a string such as the cats sat on the mat contains two distinct instances of A lexicon is essentially a structured description of the word the. the cats slept contains further (dis- a set of lexical entries. One of the first tasks when tinct) instances of the and cats. However the in- developing a lexicon is to decide what the lexical stances in a repetition of the cats sat on the mat entries are. This task has two dimensions: what are the same as those in the original (because in- kind of linguistic object does a lexical entry de- stances are defined relative to strings, that is, string scribe, and what does it say about it. So for exam- types not string tokens). ple, one might decide to produce a lexicon which So in an extended lexicon, the lexical entries are describes individual word instances, and provides word instances, and the lexicon itself is a struc- the orthographic form and part-of-speech tag for tured description of a set of word instances. In or- each form. It is the first of these dimensions that is der to explore this notion in more detail, it is help- most relevant to the idea of the Extended Lexicon. ful to introduce a more specific notion of a ‘struc- Conventionally, there are two main candidates for tured description’. We shall use an inheritance- the type of linguistic object described by a lexi- based lexicon, in which there are internal abstract con: word forms (such as sings, singing, sang1), ‘nodes’ representing information that is shared by corresponding to actual words in a text and lex- several lexical entries and inherited by them. Fig- emes (such as SING, WALK, MAN), describing ab- ure 1 shows the structure of a simple inheritance- stract words, from which word forms are somehow based lexicon with some abstract high-level struc- derived. Choosing between these two candidates ture (CATEGORY, VERB, NOUN), then a layer of 1Typographical conventions for object types: ABSTRACT, lexemes (WALK, TALK, HOUSE, BANK), and be- LEXEME, wordform, instance, code. low that a layer of word forms (walks, walking, 270 Proceedings of Recent Advances in Natural Language Processing, pages 270–276, Hissar, Bulgaria, 7-13 September 2013. talked, house, houses, banks, as well as many more). Thus the word form walks inherits in- formation from the lexeme WALK, which inherits from abstract node VERB and then abstract node CATEGORY. Figure 3: A simple Extended Lexcion, with in- stance nodes linked into a chain plicitly encodes word strings, as maximal chains of linked instances. Figure 2: A lexicon with instance nodes 2 The Extended Lexicon in DATR 2.1 DATR in brief Adding instances to this model is in principle very easy: one just creates a further layer of nodes DATR (Evans and Gazdar, 1996) is a lexical de- below the word forms. The word instances are scription language originally designed to model now the lexical entries, and the word form nodes the structure of lexicons using default inheritance. are abstractions, representing information shared The core descriptive unit in DATR is called a node, by all instances of the form. Figure 2 shows a which has a unique node name (capitalised) and first pass at adding an instance layer to a lexicon has associated with it a set of definitional path supporting the string the cats sat on the mat, by equations mapping paths (sequences of features) adding new nodes for each instance in the string. onto value definitions. However, what is missing from this figure is any DOG: representation of the string as a whole – noth- <cat> == noun <form> == dog. ing distinguishes the two instance nodes the from each other, or indeed from their parent word form Figure 4: DATR description – version 1 node the, and nothing identifies them as members of a specific string. One way this information Figure 4 is a simple example of DATR code. This could be added is simply by stipulating it: each fragment defines a node called DOG with two path instance node could have a feature whose value is equations, specifying that the (syntactic) category the string, and another whose value is the index in is noun, and the (morphological) form is dog. the string of the current instance. However, in the NOUN: Extended Lexicon, we adopt a structural solution, <cat> == noun by linking the instance nodes of a string together <form> == "<root>". DOG: into a chain, using inheritance links prev (‘pre- <> == NOUN:<> vious’) and next to inherit information from this <root> == dog. instance’s neighbours in the string. Diagrammati- Figure 5: DATR description – version 2 cally, we represent this as in figure 3. To summarise, in the Extended Lexicon model, Figure 5 provides a slightly more complex def- a lexicon is an inheritance-based structured de- inition. In this version, there is an abstract node, scription of a set of word instances. This notion NOUN, capturing information shared between all simultaneously captures and combines two impor- nouns and a new definition for <form> which is tant modelling properties: first, that instances of defined to be the same as the path <root>. DOG the same word share properties via an abstract now specifies a value for <root>, and inherits word form node, and second that the lexicon im- everything else from NOUN. 271 Word1: Inheritance in DATR operates as follows: to de- <> == The:<> termine the value associated with a path at a partic- <next> == "Word2:<>". ular node, use the definition from the equation for Word2: <> == Dogs:<> the longest path that matches a leading (leftmost) <prev> == "Word1:<>" subpath of the desired path (if none matches, the <next> == "Word3:<>". value is undefined). The definition might give you Word3: <> == Slept:<> a value, or a redirection to a different node and/or <prev> == "Word2:<>". path, or a combination of these. If the definition Figure 7: Instance node for the dogs slept contains path values, extend those paths with the portion of the desired path that did not match the left-hand-side and seek the value of the resulting <table> has a default value which is just the expression. root, and a plural value which appends an s to the So in this example, the path <root> at DOG morphological root. Two word form nodes have matches a definition equation path exactly, and so also been added, Dog whose form will be dog, has value dog. The path <cat> is not defined at and Dogs whose form will be dog s. DOG and the longest defined subpath is <>, so this 2.2 Modelling the Extended Lexicon definition is used. It specifies a value NOUN:<>, but the path is extended with the unmatched part Figure 6 provides an example of DATR code to of the original path, so the definition becomes represent lexeme and word form nodes. Extend- NOUN:<cat>. This has the value noun, so ing this to represent instance nodes as well is quite this is the value for DOG:<cat> as well. Fi- straightforward. The instance nodes themselves nally, the path <form> at DOG similarly matches inherit directly from the corresponding word form the <> path and is rewritten to NOUN:<form>. nodes. The prev and next links map between This matches the definition in NOUN which spec- the instance nodes, as shown in figure 7, for the ifies "<root>". The quotes here specify this is word string the dogs slept. evaluated as DOG:<root> (without the quotes it As a first simple example of the Extended Lex- would be interpreted locally as NOUN:<root>), icon approach, figure 8 provides a definition for and because the entire path matched, there is noth- the lexeme A which varies the actual form ac- ing further to add to the path here, so the value is cording to whether the next word starts with a DOG:<root>, that is, dog. vowel or not. This definition presupposes a fea- ture <vstart> which returns true for words that NOUN: 3 <cat> == noun start with a vowel, false otherwise . A evaluates <num> == sing vstart not on itself, but on the word instance <form> == "<table "<num>" >" that follows it (signified by <next vstart>) to <table> == "<root>" <table plur> == "<root>" s. determine whether its own form is a or an. DOG: A: <> == NOUN:<> <> == DET <root> == dog. <form> == <table "<next vstart>"> Dog: <table> == a <> == DOG:<>. <table true> == an. Dogs: <> == DOG:<> Figure 8: Word form definition for A <num> == plur. Figure 6: DATR description – version 3 This example illustrates some important features of the approach.