Conversational Lexical Standards Kurt Fuqua, Cambridge Mobile [email protected]
Total Page:16
File Type:pdf, Size:1020Kb
Conversational Lexical Standards Kurt Fuqua, Cambridge Mobile [email protected] Abstract Intelligent morphology processing is essential to creating conversational applications and engines. This intelligence allows developers to create more powerful apps with less need to understand linguistics. A single standardized lexicon can be shared among engines and apps and the process becomes more efficient. This paper discusses standardization of parts of speech, grammatical features, phoneme set, morphologic processing, and a formalism for defining lexical grammars. Are the words ‘apple’ and ‘apples’ related? identical lexicon should be used for both Despite the obvious answer, many systems synthesis and analysis. If there are two treat these word forms as if they are unrelated different lexicons (one for synthesis and semantically and morphologically. This paper another for analysis) then synchronization addresses the need for lexical standards in problems arise. Likewise the morphologic order to support conversational applications. engine should be capable of processing morphology in both the synthesis and analysis Conversation requires tracking linguistic directions. information across multiple applications. A virtual agent is constantly tracking pronouns, From the root form, the engine can derive every and anaphoric references, expanding ellipsis regularly-derived form. This assures consistent and sending the expanded, resolved statements coverage. In current systems, coverage is to the appropriate application. This requires a inconsistent. It may be possible to use a verb in much higher level of abstraction than required the past tense but not the future tense, or a in stand-alone, non-conversational applications. noun in the singular but not the plural. The applications must share a common syntactic and lexical grammar, as well as a There are 3 required components: common lexicon. The information shared The lexicon information, between the applications must contain high- A formalism for defining the grammar, level semantic information. The morphologic engine for processing. The Need for Morphology What is in a Lexicon? If an application is to be conversational, it must There are two main distinctions between a support morphology. These are the changes word list and a lexicon. A lexicon contains that words undergo during regular inflection. grammatical information – not just The conversational system should be able to pronunciations. Second, lexicons are process these regular changes in an intelligent, lemmatized – that is they contain the root form symbolic fashion. An application developer may of the words. A lexicon is the repository for the need to synthesize a statement such as: grammatical information on each lemma (root “You have three messages.” form). The pluralization of ‘message’ should be done Each lexicon entry should contain: symbolically with automatic agreement The graphemic representation: ‘message’ between the number and noun. The lexicon The phonemic representation: /mɛsəd͡ʒ/ should only need to store the root form of each The part(s) of speech: noun word; a morphologic engine can synthesize the The grammatical feature-values proper plural form for the given language. The Language-specific token Animate animate Semantic token Natural Gender ngender Grammatical Gender ggender Each of these elements must be standardized Mood mood and be consistent between the lexical grammar, Valency valency morphologic engine, and application. Pronoun Type prontype Moreover, this consistency should extend Conjunction type conjtype across languages to the extent possible. Each feature has possible values. Some are Parts of Speech normally Boolean with a value of + or -. Others The most basic grammatical information is the have specifically enumerated values. These part of speech (POS). Very little processing can features can be used to tag entries in the be done without knowing a word’s part of lexicon or to derive word forms. The word speech. Currently, there is no W3C standard for qualifier is placed in brackets after the word. POS. This is the most basic requirement for creating a lexicon standard. ‘message’ <plur=+> While word categories are not without This has the meaning ‘the plural form of the controversy, it is possible to construct a set of word message’. The developer does not need basic POS tags which are indeed universal. This to concern himself with how the morphology of proposal follows the SLAPI POS standard. This English works, or whether that word is a regular standard has been used for dozens of forming plural – the morphology engine will languages. inflect it properly according to the defined lexical grammar. Open Class POS (5) verb, adverb, adjective, common noun, While the aggregate potential values for a given proper noun feature are fixed, the specific possible values may vary from language to language. Closed Class POS (10) pronoun, number, conjunction, English preposition, determiner, quantifier, n plur = (+, -) interjection, portmaneux, clitic, punctuation Arabic n plur = (+,-, dual) Binary representation of the POS is extremely important to efficient implementations. With In a report titled “Towards a Standard for the 15 POS, the POS of each polysemic entry can be Creation of Lexica”, a group of researchers set represented in bitwise form with just 16 bits. out a harmonized feature set for twelve (12) The importance of the bitwise representation European languages. The current SLAPI feature cannot be overemphasized! Therefore it is set encompasses all of these features plus necessary to limit the POS to 15 categories. several required for other languages. About 175 grammatical features with their possible Features values are defined. This is a good start but Each part of speech is subcategorized with there is room for improvement. grammatical features. Examples include: Plurality plur Pronunciations Count count Pronunciations are only meaningful if they can Case case be used across applications and tools in a consistent way. There is currently gross principals are described in the document. This inconsistency between vendors and even within is a good starting point for a lexical standard. a single vendor. The first requirement is to agree on what is represented. Does the In order to assure adequate minimal processing, pronunciation symbol represent a phoneme or every engine should be required to support a an allophone? The W3C standards are defined minimal set of phonemes. SLAPI and deliberately ambiguous. Some vendors CPS define a required set of 42 consonantal represent phonemes, others allophones. One phonemes. This includes every phoneme widely used mobile phone product from Google required for English, German, Danish, Dutch, actually contains an admixture of both Norwegian, Farsi, Japanese, French, Italian, phonemes and allophones. Such products are Spanish, Portuguese, Romanian, Catalan, unworkable even for a trained computational Finnish, Lithuanian, Slovenian, and Bulgarian. linguist. Core Lexicon Certification For standardized lexicons the symbols must be A common lexicon must be shared among the phonemes – not allophones! Phonologic rules conversational applications. Irrespective of the transform phonemes into allophones. There topic, there is a certain vocabulary content that are regular phonologic rules which apply within ought to be included for any conversational a word to morphologic derivations. Other application. This is referred to as the ‘core’. Sandhi rules apply between adjacent words. The core should contain all closed parts of Without phonemic symbols, phonologic rules speech (i.e. pronouns, prepositions, cannot be applied. conjunctions, determiners, etc), plus the most common words from the open parts of speech If the lexicons always contain phonemic entries, (verbs, nouns, adjectives, adverbs, proper a common engine can apply phonologic rules in nouns) and the most common irregular forms. a systematic, efficient and accurate manner. To this core, additional vocabulary can be added. Phoneme Sets Every language has a well-defined set of Generally, it is the lexicon core which involves phonemes from which all words of the language the greatest development effort. The core often are constructed. This phoneme set is the involves a great deal of polysemy and spoken equivalent of the written alphabet. exceptions, and must be manually created Determining a language’s phoneme set is fairly whereas the remainder of the lexicon can be objective for a linguist. The standard should created through automated or semi-automated include a published set of defined phonemes for processes. each language, and the symbols to represent those phonemes. The standard should contain metrics for measuring the completeness of the core. Under Even when scholars agree on the set of SLAPI, there are two levels defined: Core1 and phonemes, and use IPA to represent those Core2. To qualify as Core1 complete, the phonemes, they may use different symbols. lexicon must meet numeric completeness, Rarely do 2 dictionaries use the same symbols. feature completeness and semantic completeness. Here is a partial list of the The Cambridge Phoneme Set (CPS) defines the criteria for a Core1 complete lexicon. It must phonemes for about 52 languages. In addition contain at least: all non-archaic closed-part-of- to defining the phonemes used in each speech entries, 500 lexical lemma entries, 100 language, it defines a consistent set of symbols distinct verbs, 100 common nouns, 50 adj, 50 for representing