<<

From: AAAI-98 Proceedings. Copyright © 1998, AAAI (www.aaai.org). All rights reserved.

Concepts From Time Series

Michael T. Rosenstein and Paul R. Cohen Computer Science Department, LGRC University of Massachusetts Box 34610 Amherst, MA01003-4610 {mtr, cohen}@cs.umass.edu

Abstract membership). A similar definition follows from the in- fluential work of Rosch involving categories and their This paper describes a way of extracting concepts from abstractions, called prototypes (Rosch & Lloyd 1978). streams of sensor readings. In particular, we demon- A protoype is best understood as a representative cat- strate the value of attractor reconstruction techniques egory member that may or may not correspond to any for transforming time series into clusters of points. observed instance. (In fact, the latter case is true for These clusters, in turn, represent perceptual categories the results in this paper.) Thus, whenever such abstrac- with predictive value to the agent/environment system. tions can take the place of a category, one can think of Wealso discuss the relationship between categories and concepts, with particular emphasis on class member- a concept as a category prototype plus its meaning or ship and predictive inference. predictive inferences.

FUNCTIONAL EPISTEMIC Introduction Time Series Instance This research is part of an effort to explain how senso- I rimotor agents develop symbolic, conceptual thought, clustering as every human child does. As in (Cohen et al. 1996) Concept Category Discovery Cluster we are trying to "grow" an intelligent agent fi’om min- I imal beginnings by having it interact with a complex observation environment. One problem with such projects is the Outc~omes Entailments transformation of streams of sensor data into symbolic concepts, cf. the symbol grounding problem (Harnad Time Series Instance 1990). Hence, the focus of this paper is an unsuper- I vised learning mechanism for extracting concepts from recognition time series of sensor readings. Concept Cluster Prototype Category Prototype Concepts are abstractions of experience that confer Use I a predictive ability for new situations. Moreover, for prediction this project we assume a predictive semantics where Outc~ome Meaning the meaning of a concept is the predictions it makes. This working definition applies equally well to both Figure 1: Functional view of concepts and the corresponding objects and activities and depends upon a notion of epistemic terms. category. For instance, just as one can form the cat- egory TOY for objects BALL and TOP, one can also create the category PLAYfor activities BOUNCEand Figure 1 illustrates the way we realize these defini- tions of concepts. In particular, we begin with time SPIN. From the interactionist perspective (Lakoff 1984; Johnson 1987), this correspondence is a natural one. In- series data and form clusters of points that have some- deed, objects and activities seem to be duals -- linked thing in common. We then observe the subsequent out- by sensorimotor experience -- with a category of one comes, i.e., the future properties of the cluster mem- connected to a category of the other. bers. Since we equate clusters with categories and out- Epistemically, a category is simply a collection of in- comes with entailments, we also equate the correspond- stances (of objects or activities) and a concept is a cat- ing acts of clustering and observation with the discovery egory plus its entailments (or consequences of category of concepts. Concept use then follows a similar two-step procedure: (1) Recognition. Given a new time series, Copyright © 1998, American Association for Artificial find the cluster prototype most like the new instance. Intelligence (www.aaai.org). All rights reserved. (2) Prediction. Report the most likely outcome for the cluster referred to by the matching prototype. CRASHor KISS, and interaction mapsfor predicting out- comes such as CONTACTor NO-CONTACT.These repre- Experimental Environment sentations, in turn, were based on phase portraits and To demonstrate concept discovery and use, a simple ex- basins of attraction -- commontools used by dynami- perimental environment was created where two agents cists for understanding system behavior. interact based on their predetermined movementstrate- During a learning phase, a library of activity maps gies. These agents are entirely reactive and pursue or was constructed by running thousands of trials in the avoid one another using one of four basic behaviors: simulator while recording the movementpatterns of each agent as well as the outcomeof every trial. In a su- 1. NONE.The agent follows a line determined by its pervised manner, one behavior mapwas built for each of initial Velocity. Noattention is paid to the opponent’s eight agent types (four basic movementstrategies with position. three levels of intensity for AVOIDand CRASH), and one 2. AVOID.The agent attempts to move away from its interaction mapwas built for each of the 36 distinct opponent and avoid contact. pairs of behaviors (64 possible pairs minus 28 symmet- 3. CRASH.The agent attempts to move toward its op- rically equivalent pairs). With the pursuit/avoidance ponent and initiate contact. simulator, this library of activity mapsproved sufficient for recognizing the participants of a new trial and for 4. KISS. The agent slows down before making contact predicting a CONTACTor NO-CONTACToutcome. (implemented as a combination of AVOIDand CRASH). Table 1 illustrates the performance of the recogni- Figure 2 shows two examples from the pur- tion algorithm in (Rosenstein et al. 1997). Interest- suit/avoidance simulator. During each trial, the agents ingly, this confusion table demonstrates the misinter- interact only whenthey are close enough to detect one pretation of the various behaviors. For example, 66% another, as represented by the inner circle in Figure of the time the algorithm confused KISSwith one of the 2. A trial ends when the agents make contact, when CRASHmovement strategies, yet rarely mistook KISS for they get too far apart, or whenthe trial exceeds a time one of the AVOIDtypes. The reason for this sort of limit whichis large comparedto the duration of a typi- confusion is that apparent behavior is dependent upon cal interaction. The simulator implements a movement not only the agent’s predetermined movementstrategy, strategy by varying the agent’s acceleration in response but also the circumstances, i.e., initial velocity and op- to the relative position of its opponent. (See (Rosen- ponent behavior. Actually, in some situations, a KISS stein et al. 1997) for details.) In particular, movement agent reacts just like a CRASHtype, and vice versa. strategies are equations of the form These results suggest that another way to categorize a = i. f(d, [l), (1) interactions is by the nature of the interaction itself. In fact, the explicit step of behavior recognition is no where a is acceleration, d is distance between agents, longer necessary with the clustering approach described f is a function that gives one of the basic movement shortly. strategies, and i is a scale factor that represents the agent’s strength or intensity. Recognizer Response Actual N A- A AT C- C G-t- K

N 40 21 7 3 26 3 0

A- 18 53 19 5 5 0 0

A 2 16 7*4 8 0 0 0

AT 0 2 8 90 0 0 0

C- 26 6 3 1 35 23 6 Figure 2: Simulator screen dump showing a representative C 2 2 0 1 21 25 21 28 trial of: (a) AVOIDVS. CRASH; (b) KISS VS. KISS. C÷ 0 0 0 0 1 9 7.’/" 13

Activity Maps K 8 1 0 0 21 20 25 25 Our previous approach to concept discovery was based on representations of dynamics called activity Table 1: Confusion table illustrating recognition perfor- maps (Rosenstein et al. 1997). In keeping with the mance with agent behaviors chosen randomly. Shown axe functional view of concepts in Figure 1, we also made response percentages, where behavior names are shortened the distinction betweentwo types of activity maps: be- to first letters only, and - and + indicate weak and strong havior maps for recognizing agent behaviors such as forms, respectively. Despite the success of activity mapsat recognizing behaviors and predicting outcomes, the approach has ESCAPE/ two drawbacks worth mentioning. First, even for a . .,-’"’""" simple simulator with just two agents, the size of the 0.8 map library scales as O(n2), where n is the number of movementstrategies. Preferably, the concept library should have size proportional to the number of needed concepts. Second, thousands of trials are necessary to build just one mapand the associated learning algo- rithm must operate in a supervised fashion. Instead, an agent should find the relevant categories for itself, without the imposed biases of an external teacher. For these reasons, we developed the current approach to CHASE concept discovery based on clustering techniques. Be- 0.2 low, we showthat few clusters and few experiences are CONTACT needed to recognize situations and predict outcomes -- ....’" / without a supervisor -- and to do so quite accurately. / I I I I 0.2 0.4 0.6 0.8 The Method of Delays obj ect-distance [ t-5 ] The formation of clusters requires a metric and so one must first devise a suitable metric space for the data. Figure 3: Two-dimensionaldelay portrait with J = 5 and In this work we makeuse of an attractor reconstruction trajectories illustrating CONTACT, ESCAPE, and CHASE. technique called the methodof delays or delay-space em- bedding. Takens proved that the method of delays pro- vides a means for mappinga time series to a topolog- this: if the state of the environmentat time t is uncer- ically equivalent spatial representation (an embedding) tain because the agent has, say, one sensor, then exam- of aa underlying dynamical system (Takens 1981). This ination of sensor readings prior to time t will reduce the mapping is accomplished by forming an m-dimensional ambiguity. vector of delay coordinates, X[ = [xi xi-j ~i-2J ... xi-(m-1)j], (2) Concepts and Clusters where {xl,x2, ...,xn} is the n-point time series, m is As shownearlier in Figure 1, there is a close relation- called the embedding dimension, and J is the embed- ship between concepts and categories. Indeed, "there is ding delay. As described in (Rosenstein, Collins, nothing more basic than categorization to our thought, De Luca 1994), and references therein, the practical ap- perception, action, and speech" (Lakoff 1984). But be- plication of Takens theorem requires some care on the fore any agent can make use of concepts, it must first experimenter’s part whenchoosing the delay value. In discover the pertinent categories. any case, the theoretical basis of attractor reconstruc- One general technique for discovering categories is to tion is geared toward dynamical systems, although the form clusters of points in a suitable space. This was techniques often prove useful for uncovering patterns in the basis of Elman’s work on learning lexical classes time series data, regardless of their source. from word co-occurrence statistics (Elman 1990). El- For instance, Figure 3 showsa delay portrait with rep- manfirst trained a recurrent neural network to predict resentative trajectories leading to three different out- successive words in a long input string. This then set comes in the pursuit/avoidance simulator. Each curve the stage for hierarchical clustering of the hidden-unit is based on the distance time series recorded during one activation space, where the result was groups of words interaction. Wechose object-distance as the agent’s that coincide with classes like NOUN-FOODor VERB- only sensor because we want to demonstrate concept PERCEPT.Similarly, Omlin and Giles described a way discovery while supplying as little innate knowledgeor to identify clusters in a network’s hidden-unit space, structure as possible. Additionally, the ability to mea- with each cluster representing a node in an associated sure object distance is a reasonable assumption since finite-state machine (Omlin & Giles 1996). most mobile robots are equipped with sonar or video- Our approach to clustering works directly from de- based position sensors. lay coordinates and proceeds with no prior training. The method of delays is important for this research, For the pursuit/avoidance simulator, we begin with the not because we have attractors to reconstruct, but be- first observedinteraction and immediatelycreate a clus- cause we need somebasis for discovering concepts with- ter consisting of this lone experience. A cluster itself is out detailed knowledge of the agent/environment sys- simply a data structure that stores the numberof mem- tem. In all but the simplest applications, an agent is bers, the frequency of each outcome, and a prototype unable to measureall relevant quantities in the world, for the class. Whenevera new experience arrives, the and so the intuition behind delay-space embeddingsis algorithm creates a new cluster from the data and then attempts to merge existing clusters based on a measure one of these outcomesfor each trial, and recorded the of similarity. Specifically, two clusters are replaced by percentage of correct responses. The prediction scheme one formed from their constituent information when- was a straightforward voting algorithm that followed ever the Euclidean distance between them (in delay- the two-step procedure for concept use: (1) Recogni- coordinate space) is less than a threshold parameter, tion. Find the cluster prototype nearest the time series 0s. Hence, the algorithm continually updates its list of in delay-coordinate space. (2) Prediction. Report the categories to reflect new experiences. Central to this majority outcome for the corresponding cluster. Fig- updating procedure is the use of cluster prototypes. ure 5 illustrates the prediction performanceas a func- Figure 4 shows the six cluster prototypes derived tion of observation time (the time when clusters are from 100 agent interactions, where the movement formed). The graph shows that the algorithm’s re- strategies were chosen randomly and the similarity sponse improves as the interaction unfolds and more threshold was set to 20% of the range in the dis- data becomeavailable. For example at observation time tance data. Each prototype is simply the average t = 10 the prediction is correct only 64%of the time, obj ect-distance time series formed from all interac- but by t = 35 prediction performance exceeds 90%. tions that makeup a given cluster. (Strictly speaking, prototype consists of m delay coordinates, although we 100 showthe entire time series to illustrate distinctive pat- terns.) This implementation is entirely consistent with 80 Rosch’s work on prototypes (Rosch & Lloyd 1978), and also serves two practical purposes: as a wayfor testing ~ 6o cluster similarity, and as a way for understanding the meaningof a cluster. o~ 40

2O

0.8 /ESCAPE ’ ’ I ’ ’ I ’ ’ I ’ ’ I OVERSHOOT-CONTACT ~.~ 0 15 S0 45 60 75 -~ 0.6 Observation Time i I CHASE "~ 0.4 Figure 5: Prediction performanceversus observation time for 1000test interactions and the clusters shownin Figure o 0.2 . ..~.co~..~A~...... 4. Clustering was based on 100 trials with 0, = 0.2, and .X...... error bars showthe standard deviation for five replications. I I I I 0 20 40 60 80 100 Observation Time How Many Concepts? Our previous work using three intensity levels, i = Figure 4: Cluster prototypesformed at observationtime t -- {0.5,1.0, 2.0}, required 36 interaction mapsand, there- 50, with similarity threshold 0, -- 0.2, embeddingdimension fore, 36 concepts. With our present effort, we made m -- 5, and embeddingdelay J = 1. things more difficult and selected i randomly from the interval [0.5,2.0]. Nevertheless, clustering in delay- The prototypes in Figure 4 correspond to six differ- coordinate space resulted in far fewer concepts with no ent categories of agent interactions, with each category substantial differences in prediction performance. Even possessing its ownentailments. Thus, the clustering al- when we increased the number of trials to 10,000, we gorithm discovered six concepts about the experimen- observed little change in the numberof clusters formed. tal domain. Moreover, these prototypes reflect actual (See Figure 6.) In fact, very few trials -- and, therefore, differences in the simulated environment. In particular, very few clusters -- were needed to achieve a reason- the algorithm found concepts that one could describe as able level of performance,after whichthe algorithm im- "chase," "contact," contact after the agents first "over- proved prediction performance by fine-tuning the exist- shoot" one another, and "escape" with short, medium, ing clusters rather than creating new ones. Our expla- and long escape times. But are these clusters any good nation is that the algorithm discovered all the concepts for recognition and prediction? that were available for discovering in the first place. The agents in the pursuit/avoidance simulator have Onecaveat about clustering is that the results can be- implicit goals of CONTACTand ESCAPE.These goals comeskewed by making a poor choice for the similarity correspond to two possible outcomes, and the exper- threshold. For example, Figure 7a shows the number of imental domain affords a third, emergent outcome of clusters formed for several values of 0s. Whenthe sim- CHASE.TO evaluate the usefulness of a set of clusters, ilarity threshold is too small, trials seemto have little we generated 1000 additional interactions, predicted in commonwith one another and the number of clus- 10 100

¸ 8 80

6 60.

0 4 40

2 20

0 ...... ~ : : 0 10 100 1000 10000 0.01 0.1 Trials Similarity Threshold

Figure 6: Semi-log plot of the numberof clusters formed versus numberof trials, where 08 = 0.2. Error bars show 100 the standard deviationfor five replications. 8O

ters is roughly the same as the number of trials. As 60 expected, an increase in 88 yields a shorter cluster list, (1) o with the total number approaching the limiting value 0 of 1. Figure 7b illustrates the other half of the story. o~ 40 When08 is too large, too few concepts are discovered ¸ and prediction performance is poor. But once the num- 20 ber of clusters exceeds a value of about five, prediction b) , , , performance saturates and there is no benefit to learn- I, i i ii ing a more detailed breakdownof "concept space." Put 0.01 0.1 differently, plots such as Figure 7 offer a way to de- Similarity Threshold termine the number of concepts worth learning in the given domain. Figure 7: Effects of similarity threshold on (a) number of clusters and (b) prediction performance,for observation Delay Coordinates times t = 10, 50. Clustering was based on 100 trials and In this section we demonstrate the benefits of delay error bars showthe standard deviation for five replications. coordinates for reducing ambiguity in sensor readings. Our results thus far rely on the choice of observation time, not the dynamics, to tease apart different inter- actions. In particular, Figure 4 suggests that by time Figure 8a shows the result of clustering in a five- t -- 20 we could classify most interactions correctly from dimensional delay-coordinate space with delay J -- 1. a single sensor reading. To makethe task moredifficult Oneprototype is similar to the category for CONTACTin (and more realistic) we nowshow clustering at a given Figure 4, whereas the other is an amalgamof all three value for object-distance. In other words, we fix the basic outcomes. For Figure 8b, we increased the em- first delay coordinate so all trials appear the same to bedding delay to J = 3 so the delay-coordinate vector a naive algorithm that performs clustering without ac- described by Eq. (3) spans a larger portion of the sen- counting for the dynamics. Moreover, we make use of sor stream (for the same fixed dimension). The effect delay coordinates forward in time, with Eq. (2) replaced clusters of greater homogeneitysince the algorithm is by able to capture manyof the subtle distinctions between X~= [z~ x~+jx~+2j ... z~+(m-1)j]. (3) two interactions. In practice, delay coordinates -- both forward and Figure 9 illustrates moreclearly the role of the em- backward in time -- require the agent to wait until bedding dimension. Whenm is too small, there is little the data becomeavailable. The difference is simply a time to observe a change between the first and last de- matter of perspective. For the forward view, the agent lay coordinates, and so manyinteractions fall near one buffers its sensor readings whentriggered by an event, another in delay-coordinate space. This situation yields such as the proximity of an opponent. For delay coor- relatively few clusters and poor prediction performance, dinates backwardin time, the agent updates its sensor muchlike Figure 7 for large values of 08. As m in- buffer continuously and associates an event with the creases, points spread apart in embeddingspace since most recent delay coordinate, rather than the oldest the additional coordinates reveal latent differences in one. the corresponding time series. 12

J=3 10 0.8

8 0.6. 6 0.4. o 4 J=l O 0.2 2.

(a) t , t , (a) I I I I I I I I I I 0 20 40 60 80 100 0 1 2 3 4 5 6 7 8 9 10 11 Observation Time EmbeddingDimension

100 -- (1:6:20) 0.8 8O , J. . 1 L J=3 0 T 1 o

0.6 o.) 60 P__ o (2:7:1)) 0 ~ 0.4 o~ 40 0 X 0 0.2 ~ 2) 2O

(30:0:2) (b) b) I I I I I I I I I I I I I I 20 40 6O 8O 1 O0 1 2 3 4 5 6 7 8 9 10 11 Observation Time EmbeddingDimension

Figure 8: Cluster prototypes for object-distance=0.7, Figure 9: Effects of embeddingdimension on (a) number with 0s = 0.1, embedding dimension m = 5, and embed- of clusters and (b) prediction performance,for embedding ding delay (a) J = 1 and (b) J = 3. Labels show delays J = 1, 3. Clustering was based on i00 trials, where makeupof each cluster. For instance, (29:18:26) indicates 0s = 0.1 and object-d±stance=0.7. Error bars showthe that the prototype was formedfrom 29 interactions ending standarddeviation for five replications. in CONTACT, 18 ending in CHASE,and 26 ending in ESCAPE. trajectory. The mapis then colored as a function of Interaction Maps the counter values in each cell. With the pursuit/avoidance simulator, the clustering Figure 10 shows the interaction mapfor the extreme approach to concept acquisition afforded a prediction case where all trials are placed into just one cluster. performance exceeding 95%. However, in more compli- (For graphical purposes, we split the map into three cated, possibly noisy, environments it maynot be pos- grayscale images, one for each outcome, although one sible to achieve this level of success with a prediction color diagram is often more informative.) Whenever algorithm based solely on outcomefrequencies. Our so- a trajectory enters a dark region of a map, one can lution is to exploit the dynamics in the environment confidently predict the corresponding outcome. With and store more detailed information about entailments interaction maps we saw a boost in prediction perfor- in the form of interaction maps. Weavoid the combina- mancefrom 48%to 78%at t = 10 and to 88%at t = 50. torial drawbacks of maplibraries by creating just one Notice that these results comparefavorably to those for interaction mapper cluster. six clusters as in Figure 5. Interaction maps are similar to diagrams that show basins of attraction, or sets of initial conditions that Conclusions lead to different limit sets (long-term outcomes). One Perhaps the most promising aspect of this work is the way to build an interaction mapinvolves the partition- possibility of an agent discovering concepts for itself. ing of delay-coordinatespace into cells, whereevery cell Anemphasis on the agent not only aids our understand- contains a counter for each interesting outcome. As ing of humancognitive development,but also influences two agents interact, they generate a trajectory in this the design of autonomous agents for more pragmatic space, and once the outcomeis known, the correspond- purposes. In practical applications, someone, either ing counter is incremented in each cell visited by the agent or designer, must carve up the world in some (a) (b) (c)

Figure 10: Interaction mapsfrom one cluster with outcomes(a) CONTACT,(b) ESCAPE,and (c) CHASE.Darker shades of indicate a greater likelihood of the associated outcome.Also shownis an overlay of the delay portrait in Figure 3. appropriate way, but what one deems relevant is quite References dependent on one’s perceptions of the world and one’s Cohen, P. R.; Oates, T.; Atkin, M. S.; and Beal, C. R. ability to affect those percepts through action. For in- 1996. Building a baby. In Proceedingsof the Eighteenth stance, as part of another project in our lab (Schmill et Annual Conference of the Cognitive Science Society, al. 1998), a mobile robot discovered an unexpected so- 518-522. lution to the problem of being jammed between two Elman, J. L. 1990. Finding structure in time. Cogni- objects. Whereas a programmer might instruct the tive Science 14:179-211. robot to rock back-and-forth -- as an automobile driver caught in mud or snow -- the robot found on its own Harnad, S. 1990. The symbol grounding problem. that opening its gripper could push itself far enough Physica D 42:335-346. away from one of the objects. The point of this ex- Johnson, M. 1987. The Body in the Mind. University ample is that a solution was found through interaction of Chicago Press. with the world. Indeed, Lakoff (Lakoff 1984) and John- Lakoff, G. 1984. Women,Fire, and Dangerous Things. son (Johnson 1987) provide cogent arguments that more University of Chicago Press. advanced symbolic thought is founded upon this very Omlin, C. W., and Giles, C. L. 1996. Extraction sort of sensorimotor interaction. of rules from discrete-time recurrent neural networks. Although it maybe a bit of a stretch to say that our Neural Networks 9(1):41-52. pursuit/avoidance agents perform "sensorimotor" inter- Rosch, E., and Lloyd, B. B. 1978. Cognition and action, the results in this paper provide a beginning for Categorization. Hillsdale, N J: LawrenceErlbaum As- future work with mobile robots. At the heart of the ap- sociates. proach is the discovery of sensor reading patterns, and the specific contribution of this paper is a technique for Rosenstein, M.; Cohen, P. R.; Schmill, M. D.; and deducing such patterns, i.e., clusters, from time series Atkin, M. S. 1997. Action representation, prediction data. These clusters, together with their entailments, and concepts. University of Massachusetts Computer then provide a meansfor recognizing situations and pre- Science Department Technical Report 97-31, also pre- dicting outcomesin the agent’s world. sented at the 1997 AAAIWorkshop on Robots, Soft- bots, Immobots: Theories of Action, Planning and Control. Acknowledgments Rosenstein, M. T.; Collinsl J. J.; and De Luca, C. J. 1994. Reconstruction expansion as a geometry-based This research is supported by the National Defense framework for choosing proper delay times. Physica D Science and Engineering Graduate Fellowship and by 73:82-98. DARPAunder contract No. N66001-96-C-8504. The Schmill, M. D.; Rosenstein, M. T.; Cohen, P. R.; and U.S. Governmentis authorized to reproduce and dis- Utgoff, P. 1998. Learning what is relevant to the tribute reprints for governmental purposes notwith- effects of actions for a mobile robot. To appear in standing any copyright notation hereon. The views and Proceedings of the Second International Conference on conclusions contained herein are those of the authors Autonomous Agents. and should not be interpreted as the official policies or Takens, F. 1981. Detecting strange attractors in tur- endorsements, either expressed or implied, of DARPA bulence. Lecture Notes in Mathematics 898:366-381. or the U.S. Government.