
Analogical Modeling: An Update

David Eddington and Deryle Lonsdale
Department of Linguistics & English Language
Brigham Young University
Provo, UT, USA 84602
{eddington,lonz}@byu.edu

Abstract

Analogical modeling is a supervised exemplar-based approach that has been widely applied to predict linguistic behavior. The paradigm has been well documented in the linguistics and cognition literature, but is less well known to the machine learning community. This paper sets out some of the basics of the approach, including a simplified example of the fundamental algorithm's operation. It then surveys some of the recent analogical modeling language applications, and sketches how the computational system has been enhanced lately to offer users increased flexibility and processing power. Some comparisons and contrasts are drawn between analogical modeling and other language modeling and machine learning approaches. The paper concludes with a discussion of ongoing issues that still confront developers and users of the analogical modeling framework.

1 Introduction

Analogical modeling (AM) is an exemplar-based modeling approach designed to predict linguistic behavior on the basis of stored memory tokens (Skousen, 1989; Skousen, 1992; Skousen, 1995; Skousen, 1998). It is founded on the premise that previous linguistic experience is stored in the mental lexicon. When the need arises to determine some linguistic behavior (pronunciation, morphological relationship, word, etc.), the lexicon itself is accessed. A search is conducted for the stored exemplars that are most similar to the one whose behavior is being predicted. The behavior of highly similar stored entities generally predicts the behavior of the one in question, although less similar ones have a small chance of applying as well.

AM has been implemented in a succession of computer implementations that allow users to specify an exemplar base and then to test unseen instances of language behavior against this accumulated store to predict (and quantify) the relevant and possible outcome(s).

A comprehensive examination of the system is beyond the scope of this paper.[1] Instead, an attempt will be made to situate AM within the space of exemplar-based modeling systems and, to a certain extent, within the larger context of machine learning systems and cognitive modeling systems in general. This paper begins by giving an informal overview of the algorithm used to perform AM. To accomplish this it traces in schematic form an early but typical application of the AM approach: Spanish gender guessing. The subsequent section discusses AM with respect to other types of modeling systems, and gives an overview of the types of linguistic phenomena modeled by AM. The paper then discusses several items of recent and ongoing work with the AM program. The conclusion mentions possible areas for future improvement and investigation.

[1] For further information, including a bibliography and software downloads, see the project webpage at: http://humanities.byu.edu/am

Subcontext   Members of the Subcontext   Pointers          # of Disagreements
pan          none                        none              0
p̄an          plan M                      plan M > plan M   0
pān          none                        none              0
pan̄          paz F, par M                paz F > paz F     2
                                         par M > par M
                                         *paz F > par M
                                         *par M > paz F
p̄an̄          cal F                       cal F > cal F     0
p̄ān          tren M, ron M               tren M > tren M   0
                                         ron M > ron M
                                         tren M > ron M
                                         ron M > tren M
pān̄          none                        none              0
p̄ān̄          rey M                       rey M > rey M     0

Table 1: Subcontexts and their disagreements.

2 The AM algorithm

Perhaps the best way to understand the AM algorithm is with a concrete illustration. Predictions are always made in terms of specific exemplars; therefore, for the purposes of the example, the following seven monosyllabic Spanish nouns and their corresponding genders will be considered:[2]

paz F
plan M
tren M
cal F
ron M
par M
rey M

The task will be to predict the gender of pan 'bread' on the basis of these seven items as the set of exemplars. The three variables used are the phoneme or phoneme cluster of the onset, nucleus, and coda. First, all possible exemplar items are assigned to a series of subcontexts which are defined in terms of the given context pan. This yields eight subcontexts (Table 1). A bar over a variable indicates that any value of the variable except the barred value is permitted.

By assigning members to subcontexts it is possible to determine disagreements. Disagreements are marked with asterisks in Table 1. A disagreement occurs when words that are equally similar to the given context exhibit different behaviors, in this case, different genders. The number of disagreements is determined by pairing all members of a subcontext with every other member, including itself, by means of unidirectional pointers, and counting the number of times the members of the pair have different behaviors. In this example, the only subcontext containing any disagreement is pan̄.

Subcontexts are then arranged into more comprehensive groups called supracontexts, as shown in Table 2. In Table 2, a hyphen indicates a wildcard. In the subcontextual analysis, a tally of all of the subcontextual disagreements is made. The supracontextual analysis consists of analyzing all of the words that appear in a given supracontext, and again tallying disagreements (Table 3).

In the supracontextual analysis, words that have more than one variable in common with pan appear in more than one supracontext. This is AM's way of allowing the gender of words which are more similar to pan to influence the gender assignment of pan to a greater extent.

The purpose of AM's algorithm is to determine which members of the exemplar base are most likely to affect the gender assignment of pan, and also to calculate the extent of analogical influence exerted.

[2] This example is a simplification; see (Eddington, 2002) for a complete AM treatment of this phenomenon.

Supracontext   Subcontexts in Supracontext                 # of Subcontextual Disagreements
p a n          pan                                         0
p a –          pan, *pan̄                                   2
p – n          pan, pān                                    0
– a n          pan, p̄an                                    0
p – –          pan, pān, *pan̄, pān̄                         2
– a –          pan, p̄an, *pan̄, p̄an̄                         2
– – n          pan, p̄an, pān, p̄ān                          0
– – –          pan, p̄an, pān, *pan̄, p̄an̄, p̄ān, pān̄, p̄ān̄    2

Table 2: Subcontextual analysis.
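The grouping steps just described can be made concrete with a short Python sketch. This is an expository illustration only, not code from any AM release (the implementations discussed in Section 3 are in Pascal, C, and Perl), and the segmentation assumed for each word is the sketch's own:

    # Sketch of subcontext assignment for the given context pan.
    from itertools import product

    GIVEN = ("p", "a", "n")  # onset, nucleus, coda
    EXEMPLARS = {            # assumed segmentations, with gender
        "paz":  (("p", "a", "z"), "F"),
        "plan": (("pl", "a", "n"), "M"),
        "tren": (("tr", "e", "n"), "M"),
        "cal":  (("k", "a", "l"), "F"),
        "ron":  (("r", "o", "n"), "M"),
        "par":  (("p", "a", "r"), "M"),
        "rey":  (("r", "e", "y"), "M"),
    }

    def pattern(phones):
        """True where the word matches the given context, False where barred."""
        return tuple(p == g for p, g in zip(phones, GIVEN))

    # Group the words into the eight subcontexts of Table 1.
    subcontexts = {pat: [] for pat in product((True, False), repeat=3)}
    for word, (phones, gender) in EXEMPLARS.items():
        subcontexts[pattern(phones)].append((word, gender))

    for pat, members in subcontexts.items():
        shown = " ".join(g if m else g + "\u0304" for m, g in zip(pat, GIVEN))
        print(f"{shown:<8} {', '.join(w + ' ' + o for w, o in members) or 'none'}")

Counting disagreements within each group, with every member paired against every member including itself and unlike genders counted, then yields the final column of Table 1. With the groupings in hand, the question becomes which supracontexts may contribute to the prediction.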

This is accomplished by calculating heterogeneity. Heterogeneity is determined by comparing the number of disagreements in the supracontextual and subcontextual analyses. If there are more disagreements in the supracontextual analysis, the supracontext is heterogeneous, and its members are eliminated from consideration as possible analogs. If the number of disagreements does not increase, the supracontext is homogeneous.

Words belonging to homogeneous supracontexts comprise the analogical set (Table 4). In the example under consideration, disagreements increase in the supracontexts – – – and – a –. Therefore, their members are eliminated from consideration. Cal and rey appear exclusively in these heterogeneous supracontexts. As a result, they do not form part of the analogical set. The word plan is also a member of both – – – and – a –; however, it is also a member of the homogeneous supracontexts – a n and – – n, so it will still be available to influence pan.

It should not be surprising that rey would be eliminated through heterogeneity; it has no phonemes in common with pan. However, consider the words ron and cal. Both share only one feature with pan, yet heterogeneity eliminates only cal, and not ron. This is due to the fact that ron appears in the supracontext – – n, and all of the members of that supracontext are masculine; therefore, there is no disagreement. Cal F, on the other hand, competes with masculine words in all of the supracontexts in which it appears.

The analogical set contains all of the exemplar items that can possibly influence the gender assignment of pan. There are two methods for calculating the influence of the analogical set on the behavior of the given context. One is to assign the most frequently occurring behavior to the given context (selection by plurality). Of the 18 pointers in the set, 14 point to masculine. Therefore, selection by plurality would assign masculine gender to pan. The other method (random selection) involves randomly selecting a pointer, and assigning the behavior of the word indicated by the pointer to the given context. In this case, the probability of masculine gender assignment would be 77.78% (14/18).
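The whole toy analysis can be replayed end to end. In the following sketch (again an expository illustration, not the released AM code) a context's disagreements are counted as unidirectional pointer pairs with unlike outcomes. Under that reading the two heterogeneous supracontexts (– a – and – – –) are rejected exactly as described, and the pointer counts of Table 4 emerge unchanged (14 masculine, 4 feminine), although this pairwise count gives 8 and 20 disagreements for the two rejected supracontexts where Table 3 prints 6 and 12:

    # End-to-end sketch of the pan example: subcontexts, supracontexts,
    # heterogeneity filtering, and pointer counting (illustration only).
    from itertools import product

    GIVEN = ("p", "a", "n")
    EXEMPLARS = [  # assumed (onset, nucleus, coda) segmentations, with gender
        (("p", "a", "z"), "F"), (("pl", "a", "n"), "M"),
        (("tr", "e", "n"), "M"), (("k", "a", "l"), "F"),
        (("r", "o", "n"), "M"), (("p", "a", "r"), "M"),
        (("r", "e", "y"), "M"),
    ]

    def disagreements(items):
        # Pair every member with every member (itself included) by
        # unidirectional pointers; count the pairs with unlike outcomes.
        return sum(o1 != o2 for (_, o1), (_, o2) in product(items, repeat=2))

    # Assign exemplars to the eight subcontexts (Table 1).
    subcontexts = {pat: [] for pat in product((True, False), repeat=3)}
    for phones, gender in EXEMPLARS:
        key = tuple(p == g for p, g in zip(phones, GIVEN))
        subcontexts[key].append((phones, gender))

    # Visit the eight supracontexts (True = variable fixed, False = wildcard).
    pointers = {"M": 0, "F": 0}
    for supra in product((True, False), repeat=3):
        inside = [sc for sc in subcontexts
                  if all(sc[i] for i in range(3) if supra[i])]
        members = [m for sc in inside for m in subcontexts[sc]]
        sub_tally = sum(disagreements(subcontexts[sc]) for sc in inside)
        # Homogeneous iff the supracontextual count shows no increase.
        if members and disagreements(members) <= sub_tally:
            for _, (_, gender) in product(members, repeat=2):
                pointers[gender] += 1  # one pointer per ordered pair, to the target

    total = sum(pointers.values())
    print(pointers, f"P(masculine) = {pointers['M'] / total:.2%}")
    # -> {'M': 14, 'F': 4} P(masculine) = 77.78%

Selection by plurality reads the majority outcome off these pointer counts; random selection uses 14/18 directly as the probability of masculine.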
3 Situating AM

The AM theoretical construct has been implemented computationally through several generations of application programs. Pascal code for the first version was listed as an appendix in (Skousen, 1989). A subsequent version was implemented in C. In recent years the system was reimplemented as a Perl module; extensive details on how to use the current version are available in (Skousen et al., 2002).

The system assumes a priori labeled input instances, and thus only works in supervised learning mode. Each exemplar must be represented as a fixed-length vector comprised of features that may or may not eventually be relevant to the phenomenon being modeled. The features themselves may be of fixed or variable length, as long as they are appropriately delimited. An outcome is specified for each exemplar instance vector. Vector features are purely symbolic and nominal, not admitting continuous values. When the use of continuous-valued features is needed (e.g. integers or real numbers), the features must be symbolically quantized within pre-specified ranges in order to be meaningful to the system. More details on how to choose, quantize, and encode exemplar vectors are given in (Skousen et al., 2002).
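For concreteness, exemplar vectors for the gender task above might be laid out as follows. The file layout shown (outcome, then features, then a comment field) and the frequency-binning helper are hypothetical illustrations, not the Perl module's actual syntax:

    # Hypothetical exemplar encoding (invented format; see Skousen et al.
    # 2002 for the real specification used by the AM software).
    EXEMPLAR_DATA = """\
    F p  a z paz
    M pl a n plan
    M tr e n tren
    F k  a l cal
    M r  o n ron
    M p  a r par
    M r  e y rey
    """

    def parse(text):
        """Yield (outcome, feature vector, comment) for each exemplar line."""
        for line in text.splitlines():
            outcome, *features, comment = line.split()
            yield outcome, tuple(features), comment

    def quantize(value, bounds=(1, 10, 100, 1000)):
        """Turn a continuous value into a symbol by pre-specified ranges
        (the ranges here are made up), since continuous features must be
        quantized before AM can use them."""
        return "bin%d" % sum(value >= b for b in bounds)

    assert quantize(57) == "bin2"   # 10 <= 57 < 100
    assert next(parse(EXEMPLAR_DATA))[0] == "F"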

Supracontext   Words in Supracontext            Pointers                # of Disagreements
p a n          none                             none                    0
p a –          par M, paz F                     paz F > paz F           2
                                                par M > par M
                                                *paz F > par M
                                                *par M > paz F
p – n          none                             none                    0
– a n          plan M                           plan M > plan M         0
p – –          par M, paz F                     paz F > paz F           2
                                                par M > par M
                                                *paz F > par M
                                                *par M > paz F
– a –          plan M, cal F, par M, paz F      plan M > plan M         6
                                                *plan M > cal F
                                                plan M > par M
                                                *plan M > paz F
                                                cal F > cal F
                                                *cal F > plan M
                                                *cal F > par M
                                                cal F > paz F
                                                *paz F > plan M
                                                *paz F > par M
                                                (not all shown; the rest
                                                involve agreement)
– – n          plan M, tren M, ron M            (not shown)             0
– – –          plan M, tren M, ron M, cal F,    (not shown)             12
               paz F, par M, rey M

Table 3: Supracontextual analysis.

As the system loads up the feature vectors, it builds the contextual lattice as sketched in the previous section. Once the lattice has been constructed, the system can process incoming test items to determine their outcome(s). Each test item consists of a feature vector containing the same number of features as the exemplar vectors do. Each test item also has the appropriate outcome specified, if known, or else some placeholder symbol. Each test item is matched against the lattice, its analogical set computed, and outcomes reported. AM, unlike some other machine learning approaches, can report all possible outcomes, and associates each with a probability rating or confidence rating reported as a percentage.

3.1 Sample implementations

AM has been applied to a number of different linguistic phenomena in several different languages. Although a comprehensive enumeration is not possible here, a representative sampling of applications gives an appreciation for the breadth and depth of coverage possible.

Homogeneous    Words in                  Pointers           # of Pointers   # of Pointers
Supracontext   Supracontext                                 to Masculine    to Feminine
p a –          par M, paz F              par M > par M      2               2
                                         par M > paz F
                                         paz F > par M
                                         paz F > paz F
– a n          plan M                    plan M > plan M    1               0
p – –          par M, paz F              par M > par M      2               2
                                         par M > paz F
                                         paz F > par M
                                         paz F > paz F
– – n          plan M, tren M, ron M     (not shown)        9               0

Table 4: Non-empty homogeneous supracontexts.

The earliest documented AM results with language data—and indeed the preponderance of work since—focus primarily on , , and lexical selection. Early applications (Skousen, 1989) included predicting Finnish past tense forms, describing allomorphic variation (e.g. the a/an distinction in English), and predicting lexical selection based on sociolinguistic variables (e.g. Arabic forms for "friend"). Subsequent efforts (Skousen et al., 2002) addressed such phenomena as German pluralization, Dutch compound linkers, Spanish gender on nouns, stress in Dutch, and Turkish consonantal alternation.

Most of these implementations involve feature vectors that specify relatively low-level linguistic features such as sound segments, phonetic or orthographic environments, syllable structure, and word boundaries. Often the information for constructing these features comes from widely-used annotated language resources such as lexicons or corpora. For example, several English, German, and Dutch applications have used features partially derived from the CELEX lexical database (); others have used WordNet (for English) (), TELL (for Turkish) (), and LEXESP (for Spanish) ().

With the advent of large-scale corpora, both annotated and unannotated, increasing amounts of information are available to language modelers. Recent AM efforts have benefited from speech and text corpora. The popular TIMIT speech database provided exemplars for recent work on flapping in English (Eddington, 2007). In probably the first AM application to diachronic language description, a recent paper (Chapman and Skousen, 2005) used the Helsinki corpus for modeling the historical development of English negative prefixes.

Increasingly, modelers are using their own experimentally obtained results to compare against more standard resources. The Arabic lexical selection example involved only features encoding novel researcher-collected situational observations. A recent paper on the comparative form of English adjectives (Elzinga, 2006) matches human judgments against occurrences from the internet collected via the Google search engine.

The choice of which features to include in the exemplar and test instances is an important issue for the AM user. Since features have to be enumerated and encoded a priori, at least some of the features chosen should reflect salient properties of the phenomenon in question. This does not presuppose a prescient omniscience; it is not the case that all and only the relevant features must appear in the vectors. Any features that end up not contributing to the outcome simply constitute irrelevant overhead. Of course, in an exponential system such overhead should be avoided wherever possible.

3.2 Comparison to other approaches

The algorithm sketched in Section 2 is what sets the AM paradigm apart from other machine learning and language modeling approaches. The system does not employ sub-symbolic representations or continuous-valued features and thus differs in fundamental ways from connectionist systems. The computation of analogical (dis)similarity resembles in many ways nearest-neighbor approaches, though in some cases a "gang" of similar items, none of which is the nearest neighbor, may conspire to overcome the nearest neighbor and impose their own outcome.
These gang effects are difficult to identify, though some have been discussed in the literature, and represent a unique aspect of AM results. When gang effects are not present, AM processing generally obtains results comparable to memory-based learning (e.g. in TiMBL). The latter, though, allows for continuous-valued features.

Another aspect where AM and some approaches differ is that AM considers each feature in any given vector of equal weight; only across the exemplar base do features emerge as more or less relevant. In this respect AM does not perform feature weighting in a separate processing stage across the exemplar base before testing new items. Similarly, no dependencies can be specified explicitly across features.

Extensive comparisons have been made between AM and other machine learning approaches including memory-based learning, version spaces, neural networks, and Optimality Theory; see (Skousen et al., 2002) for complete details.

The concept of an analogical set with quantifiable support is also unique to AM, and its relevance is being explored. For example, recent work (Chandler, 2005) has shown how AM results can map onto response times in certain types of psycholinguistic experiments.

In fact, another notable feature of AM is its ability to model human language errors and variants that seem to result from analogy. Skousen, in early AM work, discusses how "leakage" (i.e. erroneous AM guesses) often tends to follow observable usage errors (Skousen, 1989).

A comparison of corpus data for Russian verbs of motion—and subsequent modeling in AM combining lexical, morphological, and semantic variables—showed a tendency for the system's errors to coincide with those of language learners (Dodge and Lonsdale, 2006).

Certainly further comparison of the behavior observed in AM modeling tasks versus human performance constitutes a promising future direction to pursue.

When considering the place of AM in the machine learning realm, one area deserves considerably more effort. The natural language processing (NLP) field has profited greatly from machine learning, in part due to the increased availability of large-scale language resources. Though early work in AM illustrated the promise of the approach to typical NLP tasks (Jones, 1996), the paradigm has been underrepresented in recent NLP shared tasks and competitive evaluations addressing such problems as word-sense disambiguation, part-of-speech tagging, shallow parsing, semantic role labeling, and phoneme classification.

4 Recent progress in AM

The AM system has undergone substantial development since the last documented release of the system (Skousen et al., 2002), and this has permitted applications in areas that would have been previously impossible. In this section we survey some of the recent AM developments and consequent research they have enabled.

4.1 Algorithmic refinement

As explained above, the AM system's basic data structure is a fully connected lattice representing (dis)agreements among data items. The original algorithm for handling the lattice involved a straightforward but exhaustive traversal and was hence very costly. The most current version of the system involves a completely new way of extracting information from the lattice.

The crucial insight is to focus on heterogeneity instead of homogeneity. For each supracontext in the lattice, heterogeneity can be locally determined by looking for plurality of outcomes and plurality of subcontexts. When heterogeneity is detected, all more general supracontexts can be automatically discarded from that point on.

Implementing this insight has resulted in a substantial speed-up in processing time, and the ability to handle a greater number of features, exemplars, and test items.
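The pruning logic can be stated compactly. The following sketch is a paraphrase of the published description rather than the module's code: since any supracontext containing a heterogeneous one is itself heterogeneous, visiting supracontexts from most specific to most general allows whole regions of the lattice to be discarded untested:

    # Sketch of heterogeneity-driven pruning (a paraphrase of the idea,
    # not the AM module's implementation).
    from itertools import product

    def generalizations(supra):
        """Supracontexts reached by turning one fixed position into a wildcard."""
        for i, fixed in enumerate(supra):
            if fixed:
                yield supra[:i] + (False,) + supra[i + 1:]

    def homogeneous_supracontexts(n_vars, is_heterogeneous):
        """Test supracontexts most-specific-first, skipping inherited failures."""
        discarded, kept = set(), []
        for supra in sorted(product((True, False), repeat=n_vars),
                            key=sum, reverse=True):
            if supra in discarded or is_heterogeneous(supra):
                # Everything more general is heterogeneous too.
                discarded.update(generalizations(supra))
            else:
                kept.append(supra)
        return kept

Plugging the disagreement test from the Section 2 sketch in as is_heterogeneous reproduces the earlier analysis while skipping tests for supracontexts already known to fail.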
Figure 1 illustrates the run time for different versions of the AM program on a typical task (English part-of-speech tagging) involving 50,000 exemplars, 1000 test items, and 58 outcomes. It shows that the previous version of the system grew exponentially as the number of variables grew (though the maximum number of variables possible was fixed at 29); the algorithm also performed particularly poorly for a vector size of 3 features. This contrasts with the current version of the system, which runs largely linearly until the current maximum vector size of 59 features.

The rewrite of the algorithm also incorporated more decomposition in the handling of the lattice, in theory laying the groundwork for a truly parallel implementation. To date, though, no concurrent parallel implementation has been developed.

Even with this improvement, exponentiality still poses a problem. Two independent efforts have been pursued recently to address the processing bottleneck issue. One involves using Monte Carlo sampling to estimate the results that traditional AM processing provides (Johnsen and Johansson, 2005). The other method is to compute the dissimilarities via a quantum algorithm (Skousen, 2005). To discuss either would be beyond the scope of this paper.

4.2 Modularization

The newest version of the AM implementation has been repackaged as a Perl module. This was done primarily to improve flexibility in using the system. Its deployment within Perl supports better cross-architecture functionality. The provision of a number of processing hooks allows for various user-programmable add-on procedures for customizing the input, matching, and output stages of processing.

One example of the use of customized output involves the integration of AM with a cognitive modeling system for natural language called NL-Soar. In this work, which involves parsing, the cognitive agent must occasionally resolve prepositional-phrase attachment ambiguities (Lonsdale and Manookin, 2004; Manookin and Lonsdale, 2004); it does so by invoking AM to resolve the ambiguity using a PP-attachment exemplar base previously mined by others ??. This integration would not have been possible with prior versions of the AM system.

A further example involves sequential output: use of these options has been helpful in profiling system performance or tailoring system output. Lonsdale's work on Farsi name vocalization and Romanization (Lonsdale, 2006) is an example of work only possible with the changes to the most recent version.

4.3 Lattice visualization

For many users of the AM system, insight into the operation of the AM algorithm has proven difficult, with the system operating as a black box. A new visualization tool allows users to load up their exemplars, enter test items, and step through the lattice incrementally to view the effects of matches, homogeneity/heterogeneity, and gangs.

Figure 2: A screenshot of the AM lattice visualizer.

Figure 2 shows ...

4.4 Architectural improvements

Loading a large number of exemplar items is time-consuming. This is keenly felt in sequential implementations, where the outcome of one test might depend on the results of a previous one (cf. the Farsi example described above).

To alleviate this problem, a new implementation provides a client/server infrastructure. The exemplar data can be loaded on a server machine once, and the server stands by for test item queries (over a socket) from client machines.
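A client for such a setup takes only a few lines. The host, port, and line-oriented wire format below are invented for illustration; the actual server's query protocol is not described here:

    # Hypothetical AM client (the real server's protocol, host, and port
    # are not documented here; this wire format is an assumption).
    import socket

    def classify(test_vector, host="localhost", port=9000):
        """Send one space-separated feature vector; return the server's reply,
        imagined here as an outcome plus a confidence, e.g. 'M 0.7778'."""
        with socket.create_connection((host, port)) as sock:
            sock.sendall((" ".join(test_vector) + "\n").encode("utf-8"))
            return sock.makefile().readline().strip()

    # Example (assumes a server is listening): classify(("p", "a", "n"))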
This integration would vectors can sometime result in ill-formed vectors not have been possible with prior versions of the AM which are difficult to locate. A new Perl script al- system. lows users to check their feature files and receive a Sequential output example: use of these options report of which vectors do not comply to formatting has been helpful in profiling system performance specifications. or tailoring system output. Lonsdale’s work on For users unable to configure the basic system pa- Farsi name vocalization and Romanization (Lons- rameters within the Perl code, use of the system can dale, 2006) is an example of work only possible with be awkward. A basic Perl/TK graphical user inter- the changes to the most recent version. face has been developed to provide users top-level 50000 current-64 current 45000 previous

[Figure 1 plot: run time (y-axis, 0 to 50000) against number of variables (x-axis, 0 to 60) for the versions labeled "previous", "current", and "current-64".]

Figure 1: Performance statistics of previous and current versions of AM.

5 Final observations

Acknowledgements

We would like to thank Theron Stanford for his programming support work.

References

Steve Chandler. 2005. Computer simulations of RT in analogical modeling. Paper presented at the Thirty-second LACUS Forum at Dartmouth College, August 4, 2005.

Don Chapman and Royal Skousen. 2005. The negative prefix in English and analogical modeling of language. Journal of English Language and Linguistics, 9:333–357.

Inna Danielyan Dodge and Deryle Lonsdale. 2006. Modeling Russian verbs of motion: an analogical account. In Proceedings of the 28th Annual Conference of the Cognitive Science Society, pages 1239–1244. Lawrence Erlbaum.

David Eddington. 2002. Spanish gender assignment in an analogical framework. Journal of Quantitative Linguistics, 9:49–75.

David Eddington. 2007. Flapping and other variants of /t/ in American English: Allophonic distribution without constraints, rules, or abstractions. Cognitive Linguistics, 18:23–46.

Dirk Elzinga. 2006. English adjective comparison and analogy. Lingua, 116:757–770.

Lars Johnsen and Christer Johansson. 2005. Efficient modeling of analogy. In Proceedings of the CICLing Conference, volume 3406 of Lecture Notes in Computer Science, pages 694–703. Springer, Berlin/Heidelberg.

Daniel Jones. 1996. Analogical Natural Language Processing. Studies in Computational Linguistics. University College London Press Limited, London, England.

Deryle Lonsdale and Michael Manookin. 2004. Combining learning approaches for incremental on-line parsing. In Proceedings of the Sixth International Conference on Cognitive Modeling, pages 160–165. Lawrence Erlbaum.

Deryle Lonsdale. 2006. Exploring syllables, Romanization, and analogy in names. In Proceedings of the Sixth Annual Family History Technology Workshop. BYU Computer Science Department.

Michael Manookin and Deryle Lonsdale. 2004. Resolving automatic prepositional phrase attachments by non-statistical means. In LACUS Forum XXX: Language, Thought, and Reality, pages 291–300. The Linguistic Association of Canada and the United States.

Royal Skousen, Deryle Lonsdale, and Dilworth B. Parkinson, editors. 2002. Analogical Modeling: An Exemplar-Based Approach to Language. John Benjamins, Amsterdam/Philadelphia.

Royal Skousen. 1989. Analogical Modeling of Language. Kluwer Academic Publishers, Dordrecht, Netherlands.

Royal Skousen. 1992. Analogy and Structure. Kluwer Academic Publishers, Dordrecht, Netherlands.

Royal Skousen. 1995. Analogy: A non-rule alternative to neural networks. Rivista di Linguistica, 7:213–231.

Royal Skousen. 1998. Natural statistics in language modeling. Journal of Quantitative Linguistics, 5:246–255.

Royal Skousen. 2005. Quantum analogical modeling: A general quantum computing algorithm for predicting language behavior. Posted at arxiv.org.