<<

Packing: A Geometric Analysis of Feature Selection and Category Formation

Shohei Hidaka and Linda B. Smith Department of Psychological and Brain Sciences Indiana University

This paper presents a geometrical analysis of how local interactions in a large population of categories packed into a feature space create a global structure of feature relevance. The theory is a formal proof that the joint optimization of discrimination and inclusion creates a smooth space of categories such that near categories in the space have similar generalization gradients. Packing theory offers a unified account of several phenomena in human categoriza- tion including the differential importance of different features for different kinds of categories, the dissociation between judgments of similarity and judgments of category membership, and children’s ability to generalize a category from very few examples.

There are an infinite number of objectively correct de- sults: (1) and theoretical analyses showing that scriptions of the features characteristic of any thing. Thus, the distribution of instances in a feature space is critical to Murphy and Medin (1985) argue that the key problem for the weighting of those features in adult category judgments any theory of categories is feature selection: picking out the and (2) evidence from young children’s category judgments relevant set of features for forming a category and generaliz- that near categories in the feature space are generalized in ing to new instances. The feature selection problem is partic- similar ways. Distributions of instances In a seminal paper, ularly difficult because considerable research on human cat- Rips (1989) reported that judgments of instance similarity to egories indicates that the features people think are relevant a category and judgments of the likelihood that the instance depend on the kind of category (Murphy & Medin, 1985; is a member of the category did not align. The result, now Macario, 1991; Samuelson & Smith, 1999). For example, replicated many times, shows that people take into account color is relevant to food but not to artifacts; material (e.g., how broadly known instances vary on particular properties wood versus plastic) is relevant to substance categories but (Kloos & Sloutsky, 2008; Zeigenfuse & Lee, 2009; Rips & not typically to artifact categories. This leads to a circu- Collins, 1993; Holland, Holyoak, Nisbett, & Thagard, 1986; larity as pointed out by Murphy and Medin; to know that Nisbett, Krantz, Jepson, & Kunda, 1983; Thibaut, Dupont, something is a pea, for example, one needs to attend to its & Anselme, 2002) With respect to judgments of the likeli- color, but to know that one should attend to its color, one hood that an instance was a member of the category, people has to know it is a potential pea. Children as young as two take into account the of features across known in- and three years of age seem to already know this and exploit stances and do not just judge the likelihood of membership these regularities when forming new categories (Colunga & in the category by similarity across all features. Importantly, Smith, 2005; Yoshida & Smith, 2003). This paper presents however, similarity judgments in Rips’ study were not influ- a new analysis of feature selection based on the idea that in- enced by the of features across cate- dividual categories reside in a larger geometry of other cat- gory instances (see also, Rips and Collins, 1993; Holland et egories. Nearby categories through processes of generaliza- al., 1986; Nisbett et al., 1983; Thibaut et al., 2002, Stewart tion and discrimination compete and these local interactions & Carter, 2002). Rather, the similarity relations of poten- set up a gradient of feature relevance such that categories that tial instances to each other and the importance of features to are near to each other in the feature space have similar fea- judgments of the likelihood of category membership appear ture distributions over their instances. We call this proposal to be separable sources of information about category struc- ”packing theory” because the joint optimization of general- ture with the distribution of features across known instances ization and discrimination yields a space of categories that most critical in determining the importance of features in de- is like a suitcase of well-packed clothes folded into the right cisions about category membership. 
Figure 1 provides an il- shapes so that they fit tightly together. The proposal, in the lustration. The figure shows the feature distribution on some form of a mathematical proof, draws on two empirical re- continuous dimension for two categories, A and B. A feature that is highly frequent and varies little within a category is more defining of category membership than one that is less frequent and varies more broadly. Thus, a novel instance that Linda B. Smith, Department of Psychological and Brain Sci- falls just to the right side of the dotted line would be farther ences, Indiana University, 1101 East Tenth Street, Bloomington, from the of B than A, but may be judged as Indiana 47405-7007, e-mail:[email protected]) a member of category B and not as a member of category A. 2 HIDAKA AND SMITH

2001; Markman, 1989; Booth and Waxman, 2002; Gather- cole & Min, 1997; Imai & Gentner, 1997; Landau, Smith & Optimal boundary Jones, 1988, 1992, 1998; Soja, Carey, & Spelke, 1991; see also, Gelman & Coley, 1991; Keil, 1994). Discrimination error Critically, the features that cue these categories - hav- ing eyes or legs, being angular or rounded, being solid or The boundary nonsolid - may be treated as continuous rather than as dis- crete and categorical features. When the degree to which in- AB stances present these features is systematically varied so that named exemplars are more or less animal-like or more or µ µ less artifact-like (Yoshida & Smith, 2003; Colunga & Smith, A B 2005, 2008) children show graded generalization patterns: σ A σ B Exemplars with similar features are generalized in similar ways and there is a graded, smooth, shift in the patterns of generalizations across the feature space. Based on an anal- θ ysis of the structure of early learned nouns, Colunga and Figure 1. Likelihoods of instances in two categories, A and B, in Smith (Colunga & Smith, 2005, 2008; Samuelson & Smith, a hypothetical feature space. The solid line shows the optimal de- 1999) proposed that children’s generalizations reflected the cision boundary between category A and B. The broken line shows instance distributions of early learned nouns that categories mean value of the instances in the feature space for each category. of the same general kind (artifacts versus animals versus substances) typically have many overlapping features and also have similar dimensions as the basis for including in- stances and discriminating membership in nearby categories. This is because the likelihood of that feature given category In brief, categories of the same kind will be similar to each B is greater than the likelihood of the feature given category other, located near each other in some larger feature geome- A. This can also be conceptualized in terms of this feature try1 of categories, and have similar patterns of instance distri- having greater importance to category A than B. butions. This has potentially powerful consequences: If fea- The potentially separate contributions of similarity and the ture importance is similar for nearby categories, then the lo- category likelihoods of features provide a foundation for the cation of categories -or even one instance – within that larger present theory. Similarity is a reflection of the proximity of space of categories could indicate the relevant features for instances and categories in the feature space. The density determining category membership. Thus, children’s system- of instances in this feature space result in local distortions atic novel noun generalizations may reflect the distribution of feature importance in the space, a result of competitions of features for nearby known categories. among nearby categories. The result is a patchwork of local The findings in the literature on children’s generalizations distortions that set up a global gradient of feature importance from a single instance of a category may be summarized that constrains and promotes certain category organizations with respect to Figure 2. The cube represents some large over others as a function of location in the global geometry. hyperspace of categories on many dimensions and features. Further, in this view, the weighting of features is locally dis- Within that space we know from previous studies of adult torted, but similarity is not. 
judgments of category structure and from children’s noun generalizations (Soja, Carey, & Spelke, 1991; Samuelson & Nearby categories Smith, 1999; Colunga & Smith, 2005) that solid, rigid and constructed things, things like chairs and tables and shov- Studies of young children’s novel noun generalizations els) are in categories in which instances tend to be similar in also suggest that the proximity of categories to each other in shape but different in other properties. This category gener- a feature space influence category formation. These results alization pattern is represented by the in the bottom derive from laboratory studies of how 2- and 3- year olds left corner; these are narrow in one direction (constrained in generalize a category to new instances given a single exem- their shape variability) but broad in other directions (varying plar. In these experiments, children are given a novel never- more broadly in other properties such as color or texture). seen-before thing, told its name (“This is a dax”) and asked We also know from previous studies of adult judgments of what other things have that name. The results show that chil- category structure and from children’s novel noun general- dren extend the names for things with features typical of an- izations (Soja et al., 1991; Samuelson & Smith, 1999; Col- imates (e.g., eyes) by multiple similarities, for things with unga & Smith, 2005), that nonsolid, nonrigid things with ac- features typical of artifacts by shape (e.g., solid and angular cidental shapes (things like sand, powder, and water) tend shapes), and for things with features typical of substances by to be in categories well organized by material. This cate- material (e.g., nonsolid, rounded flat shape). The children gory generalization pattern is represented by the ellipses in systematically extend the name to new instances by different the upper right corner of the hyperspace; these are broad in features for different kinds (Jones, Smith & Landau, 1991; one direction (wide variation in shape) but narrow in other Kobayashi, 1998; Jones & Smith, 2002; Yoshida & Smith, directions (constrained in material and texture). Finally, Col- PACKING: A GEOMETRIC ANALYSIS OF CATEGORIES 3 unga and Smith (2005; 2008) found evidence for a gradient we assume that at the probabilistic edges of categories, there of generalization patterns within one local region of feature is competition among categories for instances. Competition space of early-learned noun categories. In sum, the evidence characterizes representational processes at the cognitive, sen- suggests that near categories (with similar instances) have sory, motor, cortical and subcortical levels. In general, ac- similar generalization patterns and far categories (with dis- tivation of a representation of some property, event or ob- similar instances) have dissimilar generalization patterns. ject is at the expense of other complementary representations At present Figure 2 represents a theoretical conjecture, (Duncan, 1996; Beck & Kastner, 2009; Marslen-Wilson, one based on empirical evidence about one small region of 1987; Swingley & Aslin, 2007). Packing theory proposes the space of human categories and one that has not been that near by categories have similarly shaped generalizations theoretically analyzed. 
Still, it is a particularly interesting patterns because of the joint optimization of including nearby conjecture in the context of Rips’ (1989) insight that the dis- instances and discriminating instances associated with differ- tributions of features are distinct from similarity and deter- ent categories. Third, the present approach assumes some mine the relative importance of features in category judg- feature-based representation of categories (McRae, Cree, ment. Packing theory builds on these two ideas: distributions Seidenberg, & McNorgan, 2005). However, we make no as- of instances and proximity of categories in the feature space sumptions about the specific nature of these features or these to suggest how they jointly determine feature selection. origins. Although we will use perceptual features in ours dis- cussions and simulations, the relevant features could be per- ceptual, functional or conceptual. Packing Theory is a gen- Nonsolid nonrigid, eral theory, about any distribution of many instances in many Accidental shape categories across any set of features and dimensions. More- over, the theory does not need the right pre-specification of the set of features and dimensions. Optimization within the theory depends only on distance relations in the space (and thus on the number of orthogonal, that is uncorrelated, di- mensions but not on any assumptions about what orthogonal directions in that space constitute the dimensions). Further, the predictions are general; along any direction in that space (a direction that might consist of joint changes in two psy- chological dimensions, angularity and rigidity, for example), one should see near categories having more similar gener- alization patterns and far categories having more different Solid, generalization patterns. To present the theory, we will of- rigid, constructed shape ten use figures illustrating instance and category relations in Figure 2. A hyperspace of categories. The ellipses represent cate- a two-dimensional space; however, the formal theory as pre- gories with particular. generalization patterns (constrained in some sented in the proof -and the conceptual assumptions behind directions but allowing variability in others). Packing Theory pre- it -assume a high dimensional space. dicts that near categories in the space will have similar generaliza- tion patterns and that there should be a smooth gradient of changing Packing Theory category generalizations as one moves in any direction in the space. Past research shows that categories of solid, rigid and constructed Geometry is principally about how one determines neigh- things are generalized by shape but categories of nonsolid, nonrigid, bors. If the structure of neighboring categories determines and accidentally shaped things are generalized by material. Packing Theory predicts a graded transition in feature space between these feature selection, then a geometrical analysis should enhance two kinds of category organizations. our understanding of why categories have the structure they 1 do. Figure 2 is the starting conjecture for the theory pre- sented here and it suggests that near categories have similar instance distributions whereas as far categories have more Starting Assumptions dissimilar instance distributions. Thus, a geometry is needed when it represents both local distortions and the more global Three theoretical assumptions form the backdrop for structure that emerges from a space of such local distor- Packing Theory. 
First, as in many contemporary theories tions. Many theorists of categorization have suggested that of categorization, we take an exemplar approach (Nosofsky, although Euclidean assumptions work well within small and 1986; Ashby & Townsend, 1986; Nosofsky, Palmeri, & local stimulus spaces, a Riemann (or non-Euclidian) geom- McKinley, 1994). We assume that noun categories begin etry is better for characterizing the local and global struc- with mappings between names and specific instances with ture of large systems of categories (Tversky & Hutchinson, generalization to new instances by some weighted function 1986; Griffiths, Steyvers, & Tenenbaum, 2007; Steyvers & of feature similarity. Packing is about those weighted fea- Tenenbaum, 2005). Packing Theory follows this lead. We tures. This that categories do not have fixed or rule- first present a conceptual understanding of the main idea that like boundaries but rather probabilistic boundaries. Second, packing categories into a space creates a smooth structure 4 HIDAKA AND SMITH and then the formal proof. Packing Theory proposes that a smooth space of cate- gories results from the optimization with respect to two con- Well-Packed Categories straints: minimizing gaps and minimizing overlap. These constraints are understood in terms of the joint optimization Figure 3 shows three different sets of categories of including all experienced and potential instances in a cat- distributed uniformly within a (for exposition only 2- egory and discriminating instances of nearby categories. The dimensional) feature space. Figure 3a shows a geometry, inclusion-discrimination problem is illustrated with respect like that of young children; near categories have similar pat- to the simple case of two categories in Figure 4. Each cate- terns of feature distributions and far categories have different gory has a distribution of experienced instances indicated by ones. Such a geometry is not logically necessary (though it the diamonds and the crosses. We assume that the learner may be psychologically likely). One could have a geome- can be more certain about the category membership of some try of categories like that in Figure 3b, where each category instances than others can; that is, the probability that each has its own organization unrelated to those of near neighbors of these instances is in the category varies. If a category is and more specifically, a geometry in which near categories considered alone, it might be described in terms of its cen- do not share similar feature importance. The two spaces of tral tendency and estimated category distribution of instances categories illustrated in Figure 3a and 3b are alike in that in (or of the features over the instances). The solid both of these spaces there is little category overlap. That is, lines that indicate the confidence intervals around each cat- in both of these spaces, the categories discriminate among egory illustrate this. However, the learner needs to consider instances. However, the categories in 3b are not smooth in instances not just with respect to a single category, but also that near categories have different shapes. Moreover, this with respect to nearby categories. Therefore, such a learner structure leads to gaps in the space, possible instances (fea- might decrease the weight of influence of any instance on ture combinations) that do not belong to any known category. 
the estimated category structure by a measure of its confus- The categories in Figure 3b could be pushed close together ability between the categories. This is plausible because the to lessen the gaps. However, given the nonsmooth structure, shared instances in between these nearby categories are less there would always be some gaps, unless the categories are informative as exemplars of categories than the other non- pushed so close that they overlap as in Figure 3c. Figure 3c confusing instances. Thus, the estimation of category struc- then shows a space of categories with no gaps, but also one in ture may “discount” the instances in between the two cate- which individual categories do not discriminate well among gories. Doing this results in an estimated category distribu- instances. The main point is that if neighboring categories tion, that is shifted such that the generalization patterns for have different shapes there will either be gaps in the space the two categories are more aligned and more similar, which with no potential category corresponding to those instances is shown with dotted lines. This is the core idea of packing or there will be overlapping instances. A smooth space of theory. categories, a space in which nearby categories have similar shapes, can be packed in tighter. This is a geometry in which categories include and discriminate among all potential in- stances. 3

2

1

0 Dimension 2 (a) Smooth categories (b) Categories with gaps −1

−2

−3 −2 −1 0 1 2 3 4 5 Dimension 1 Figure 4. Two categories and their instances on two-dimensional (c) Overlapping categories feature space. The dots and crosses show the respective instances Figure 3. A cartoon of populations of categories in a feature space of the two categories. The broken and solid ellipses indicate equal- illustrating three different ways those might categories might fit into likelihood contours with and without consideration to category dis- the space. Each ellipsis indicates equal-likelihood contour of cate- crimination respectively. gory. The broken enclosure indicates the space of instances to be categorized.

1 PACKING: A GEOMETRIC ANALYSIS OF CATEGORIES 5

Formulation of the Theory This is the formal definition of the likelihoods of instances with respect to individual categories, which we call inclu- It is relatively difficult to describe the whole structure sion. In this formulation, the mean vectorsµ ˆ i and covariance formed by a large number of categories when they locally in- σ = , , ..., N(N−1) matrices ˆ i (i 1 2 N), maximize the log-likelihood teract across all categories at once. N categories have 2 (likelihood) as follows: possible pairs of categories. Moving one category to maxi- mize its distance from some other category may influence the ∑Ki other (N − 1) categories, and those other categories’ move- 1 ff µˆ = x (4) ments will have secondary e ects (even on the first one), i K k and so forth. Thus, two categories that compete with each k=1 other in a local region in a feature space influence the whole ∑Ki 1 T structure by chains of category interactions. The goal of the σˆ i = (xk − µˆ i)(xk − µˆ i) (5) K theory formulation is to describe the dynamics of category k=1 inclusion and discrimination in a general case, and to specify These maximum likelihood estimates are simply the mean a stable optimal state for the N-categories case. Mathemati- vectors and covariance matrices of all instances of a category. cally, the framework can be classified in one of variant of a broad sense of the Linear Discriminant Analysis (LDA, see Discrimination Duda, Hart, & Stork, 2000) , although we do not necessarily limit the theory to employ only linear transformation or as- Consider the simple case with two categories in a one- sume homoscedastic category distributions (See also, Ashby dimensional feature space as in the example from Rips & Townsend, 1986 for the similar formulation using normal (1989) in Figure 1. Likelihoods of category A and B are distribution in psychological field) . shown as the solid lines: category A has the central ten- dency on the left side in which the instance is most likely, Inclusion and category B has the central tendency on the right side. An We begin with a standard formalization of the distribu- optimal category judgment for a given instance is to judge tion of instances in a category as a multidimensional normal it as belonging to the most likely category. Thus the opti- distribution; that is, the conditional probabilistic density of mal solution is to judge an instance on the left side of the dashed line in Figure 1 to be in category A and otherwise to an instance having feature θ given category ci is assumed to follow a multi-dimensional : judge it to be in category B. Meanwhile, the error probability of discrimination in the optimal judgment is the sum of the non-maximum category likelihood, the shaded region in Fig- { } [ ] − 1 1 ure 1. Therefore, we formally define discriminability as the P(θ|c ) = (2π)2 |σ | 2 exp − (θ − µ )T σ−1(θ − µ ) (1) i i 2 i i i probability of discriminating error in this optimal category judgment. where µi and σi are respectively mean vector and covariance Although the minimum or maximum function is difficult matrix of the features that characterize the instances of cat- to solve, we can obtain the upper bound of the discriminating egory ci. The superscript “T” indicates transposition of the error (log-likelihood) Fi j for category ci and c j as follows: matrix. We identify the central tendency and distribution pat- [∫ ] { }− 1 tern of these features as the mean vector of the category, and 2 Fi j = log P(θ|ci)P(θ|c j) (6) the respectively. 
The motivation of this Ω formulation of category generalization derives from Shepard (1958) characterization of the generalization gradient as an In particular, when likelihoods of categories are normally exponential decay function of given psychological distance. distributed, it is called the Bhattecheryya bound (Duda, Hart, When we have Ki instances (k = 1, 2, ..., Ki) for category & Stork, 2000). In fact, the non-maximum likelihood of ci , the log-likelihood Gi that these instances come from cat- a given pair is the classification (instance, category) error egory ci is the joint probability of all instances rate and the non-maximum likelihood has the following up- α 1−α   per bound: min(P(θ|ci), P(θ|c j)) ≤ P(θ|ci) P(θ|c j) where ∏K  ∑Ki 0 ≤ α ≤ 1. Thus, Equation 6 is the upper bound of er- Gi = log  P(xk|ci)P(ci) = Gik (2) α = 1   ror in the optimal classification with 2 . The term k=1 k=1 α 1−α P(θ|ci) P(θ|c j) is the function of α, and a particular α α where Gik = log {P(xk|ci)P(ci)}. may allow the tightest upper bound with a general case of . Note that log(x) is a monotonic function for all x > 0. (Equation 6 is called the Chernoff bound). Here, we assume α = 1 Thus we can identify a solution of x that maximizes the 2 for simplification of formulation. log-likelihood. For∑ all categories (i = 1, 2, ..., N), the log- N likelihood is G = = Gi. i 1 1 T −1 Fi j = − (µi − µ j) (σi + σ j) (µi − µ j) (7) ∑N ∑Ki 4 ( ) G = log {P(xk|ci)P(ci)} (3) 1 1 1 − log (σi + σ j) + log |σi| σ j (8) i=1 k=1 2 2 4 6 HIDAKA AND SMITH

| µ − µ T σ + σ −1 µ − µ = P(ci)P(xk ci) The first term ( i j) ( i j) ( i j) indicates the egories to instances Rik ∑ ∑K , calculated in the N i P(c )P(x |c ) distance between mean vectors of the categories weighted i=1 k=1 i k i following derivation. Note that P(xk|ci) is a given binary con- by their covariance matrices, which is zero when µi = 1 1 stant variable, either one or zero, in case of supervised learn- µ j. And the second and third terms, − log (σi + σ j) + ( ) 2 2 ing, and it should be estimated from its likelihood in case of 1 |σ | σ 4 log i j , indicate the distance between the covari- unsupervised learning. ance matrices of categories which is zero when σi = σ j. Thus, obviously the discrimination error is maximized when Optimal solutions for covariance matrices µi = µ j and σi = σ j, when category ci and c j are iden- tical. Meanwhile the discriminating error is minimized Consider the case in which the mean of a category, but when the distance between( two central) ( tendencies) or dis- not its covariance, is specified. Packing Theory assumes a T µ − µ µ − µ → ∞ covariance for that category derives from the optimization of tributions[ goes to infinity] i j i j or ( ) ( ) discrimination and inclusion with respect to the known cate- µ − µ T µ − µ → ∞ tr i j i j where tr[X] is trace of the ma- gories. This is formally given as the solution of the optimal trix X. The minimum and maximum concern is the one com- covariance of an unknown category when the mean of the ponent of discrimination. category and the set of means and of the other known categories are given. Thus, in that case, we solve The packing metric the differential of the packing metric with respect to the co- matrix. We next derive these differentials to show The joint optimization of discrimination by reducing the that these optimal solutions imply a smooth organization of discriminating error and the optimization of the inclusion of categories in general. As a preview, the following deriva- instances with respect to the likelihood of known instances tions show that a solution that optimizes discrimination and results in a solution that is constrained to an inter region be- inclusion implies a particular pattern of organization, that of tween these two extremes. In the general case with N cat- N(N−1) smooth categories, that emerges out of chains of local cate- egories, we define the sum of all possible 2 pairs (in- gory interactions. In addition, the form of the optimal solu- cluding symmetric terms) of discriminating error as discrim- tion also suggests how known categories constrain the for- inability. mation of new categories.   The differential of the functions of likelihoods and dis- ∑N ∑N   1 1  σ   criminability with respect to covariance matrix i is, F = log  P(ci) 2 P(c j) 2 exp(Fi j) (9) = = i 1 j 1 ∂F { } i j = −1σ−1 µ − µ µ − µ T + σ − σ i ( i ¯)( i ¯) ˆ i j i (11) More formally, we obtain a set of optimal solutions by de- ∂σi j 4 riving the differential of Equation 3 (inclusion) and Equation 8 (discrimination). 
Since the desired organization of cate- σ = σ σ + σ −1σ µ = 1 σ σ−1µ + where ˆ i j 2 i( i j) j and ¯i j 2 ˆ i j( i i gories should maximize both discriminability and inclusion −1 −1 σ µ j) σ j, and simultaneously, we define the packing metric function as a j weighted summation of Equation (3) and Equation (8) with ∂G 1 { } a multiplier : ik = − σ−1 µ − µ − T − σ σ−1 ∂σ i ( i xik)( i xik) i i (12) L = G + λ(F − C) (10) i 2 where C is a particular constant, which indicates a level for . See Appendix for the detailed derivation of Equation (11) discriminability F to satisfy. According to the Lagrange and (12). Since a covariance matrix must be a positive def- ff ∂L = σ multiplier method, the di erential of parameters ∂X 0 inite, we parameterize the covariance matrix i using its l- gives the necessary condition for the optimal solution of th eigenvector yil and l-th eigenvalue ηil (l = 1, 2, ..., D), ∑ ∑ ∑ ∂ ⊂ {µ , σ , λ} D T ∂L N Ki Gik X i i on condition that the discrimination error is that is σi = = ηilyily . Solving ∂ = = = ∂ + = ∑ ∑ l 1 il yil i 1 k 1 yil a particular criterion F C. Since the probability of dis- N N ∂Fi j λ = = ∂ , we obtain the following generalized eigen- crimination error is calculated for all possible pairs (cate- i 1 j 1 yil gories i, j = 1, 2, ..., N), this is the maximization of likeli- value problem 1 (See also Appendix): hood with latent variables of the discriminated pairs. The   optimization is computed with the EM algorithm, in which ∑Ki ∑N    the E-step computes the expectation with the unknown pair-  Rik(S ik − σi) − λ Qi j(Sˆ i j + σˆ i j − σi) yil = 0 (13) ing probability, and the M-step maximizes the expectation k=1 j=1 of (Dempster, Laird, & Rubin, 1977). 1 Thus, the expected likelihood L, with respect to a variable Since the matrix in Equation (13) includesσ ˆ i j or Sˆ i j which ∑ ∑ ∂ ∂L N N Fi j has σ inside, it is not a typical eigenvalue problem, which has a for the i-th category Xi, is calculated: ∂ = Qi j ∂ + i ∑ ∑ Xi i j Xi fixed matrix. Equation (13) can be considered an eigenvalue prob- N Ki ∂Fik Rik ∂ , where the probability of category pairs are σ i k √Xi lem only when i is given. Thus, iterative method for eigenvalue P(ci)P(c j) exp(Fi j) problems such as the power iteration method would be preferable Q = ∑ ∑ √ , and the probability from cat- i j N N to calculate numerical values. i=1 j=1 P(ci)P(c j) exp(Fi j) PACKING: A GEOMETRIC ANALYSIS OF CATEGORIES 7

T T where S ik = (xik −µi)(xik −µi) and Sˆ i j = (¯µi j −µi)(¯µi j −µi) . (a) (b) Harmonic mean of Because Equation (13) is also an eigenvalue form with a par- σˆ adjacent covariance −1 ij ticular constant λ, we identify λ = η and, using the rela- matrices= il likelihood tionship between the eigenvector and its matrix σiyil = ηilyil, we obtain the quadratic eigenvalue problem. + + +… η2Φ + η Φ + Φ = ( il i2 il i1 i0)yil 0 (14) ∑ ∑ s where Φ = N Q (σ ˆ + Sˆ ), Φ = − Ki R σ − i Scatter matrix of ∑ i0 j=∑1 i j i j i j i1 k=1 ik ik instances belonging N Φ = Ki to the category i j=1 Qi j, i2 k=1 RikID, and ID is D-th order identity matrix. The Equation (13) indicates that the specific structure of ˆ Scatter matrix of Sij category distribution consists of three separate components. adjacent category centers The first component in Φi1 which includes S ik represents the deviation of instances from the central tendency of category tosimple toanimal like Shapeconstructed from Textures and materials represent (scatter matrix of instances). The first term in Φi0 represents xkinstances the “harmonic mean” of nearby covariance matricesσ ˆ i j. The Figure 5. An illustration of the local interactions among adjacent second term in Φi0 represents the scatter matrix Sˆ i j that char- categories according to the packing theory. Each ellipsis indicates acterizes local distribution of the central tendencies to nearby equal-likelihood contour of category. The star shows a central ten- categories. dency of a focal category, and triangles show the few experienced To understand the meaning of these components, we draw instances of this focal category. The covariance matrix of the focal the geometric interpretation of these three components, the category is estimated with deviation of instances (triangles; S ik), harmonic mean covariance matrix (ellipses) of adjacent categories scatter matrix of instances S ik, the scatter matrix of cate- (σ ˆ i j), and deviation of central tendencies (filled circles) of adjacent gories Sˆ i j, and the harmonic mean of covariance matrices categories (Sˆ i j). σˆ i j (Figure 5). The probabilistic density Qi j weighting to each component exponentially decays in proportion to the “weighted distance” Fi j between category ci and c j. This “weighted distance” is with respect to the distance between 1 central tendencies and also the distance between covariance distances of the central tendencies among categories increase matrices. That means that interactions among categories are (increasing gaps), the estimated category distribution will de- limited to a particular local region. In Figure 5b, the likeli- pend only on the covariance of instances S i. Meanwhile, if the number of experiences instances of categories decreases hood contours of the closest categories to a target category → → = , , ..., are shown as ellipses, and the probabilistic weighting Q be- (i.e., Ki 0 or Rik 0(k 1 2 Ki)) but proximity i j to other categories remains the same, the estimated category tween the target category (center) and others is indicated by ˆ +σ the shading. With respect to this locality, the scatter matrix distributions will depend more on discriminability, (S i j ¯ i j) 2. That is, when the number of known instances of some of categories Sˆ i j reflects the variance pattern from the cen- category is small, generalization to new instances will be ter of category ci to other relatively “close” category centers. 
more influenced by the known distributions of surrounding The harmonic means of covariance matricesσ ˆ i j indicate the averaged covariance matrices among the closest categories. categories. However, when there are already many known Note that it is a “harmonic” , not an “arithmetic” instances of a category, the experiences instances - and the ff one, because the inverse (reciprocal number) of the covari- known distribution of that category - will have a greater e ect ance matrix (and not the covariance matrix per se) is appro- on judgments of membership in that category (and a greater ff priate for the probabilistic density function with respect to e ect on surrounding categories). inclusion and discriminability. The point of all this is that Optimal solutions for mean vectors categories with similar patterns of covariance matrices that surround another category, will influence the surrounded cat- In the previous section, the optimal solution of covariance egory, distorting the feature weighting at the edges so that matrices was derived for a given set of fixed mean vectors. the surrounded category is more similar to the surrounding The optimal solutions for mean vectors may also be written categories in its instance distributions. as eigenvectors of a quadratic eigenvalue problem. The dif- These effects depend on the proximity of the categories ferential of discriminability and inclusion with respect to the and the need to discriminate instances at the edge of the dis- mean vectors are: tributions of adjacent categories S i is covariance of instances ∂F 1 belonging to a category, which is the natural statistical prop- i j = − σ + σ −1 µ − µ ∂µ ( i j) ( i j) (15) erty with respect to the likelihood of instances without dis- i 2 criminability. The magnitude of Qi j and Rik are quite influ- 2 Although too few instances may cause a non-full-rank covari- ential in determining the weighting between the likelihood ance matrix whose is zero (i.e., |σi| = 0), in this special of the instances of a category and the discriminability be- case, we still assume a particular variability |σi| = C. See also tween categories. If Qi j → 0( j = 1, 2, ..., N). Thus, if the Method in Analysis 4. 8 HIDAKA AND SMITH and K → 0 or R → 0(k = 1, 2, ..., K ) ), the optimal mean vec- ∂ i i j i Gik = −σ−1 µ − tor depends on both central tendencyx ¯ and distribution Σ of i ( i xik) (16) −1 −1 ∂µi instances (i.e., we obtain Φµ = λ Σ x¯ by assuming R¯ = 0). Thus in the latter extreme situation, the central tendencies µ . Then strongly depends on the estimated distribution of instances of each category Σ. In other words, this optimization of cen- ∂ ∑Ki ∑N LN = − σ−1 µ − + λ σ + σ −1 µ − µ tral tendencies also indicates the emergence of smooth cat- ∂µ Rik i ( i xik) Qi j( i j) ( i j) i k=1 j=1 egories, that is, there is predicted correlation between dis- (17) tances in central tendencies and distributions. This correla- The equation is rewritten with N times larger order of matrix tion between the distance of two categories and their feature as follows. distributions is a solution to the feature selection problem. Σ−1(R¯µ − x¯) − λΦµ = 0 (18) The learner can know the relevant features for any individual category from neighboring categories. In Equation 18, each term is as follows: µ = (µT , µT , ..., µT )T , ( 1) 2 N ∑ ∑ ∑ T = Ki T , Ki T , ..., Ki T x¯i k=1 R1k x1k k=1 R2k x2k k=1 RNk xNk , Analysis 1: Is the Geometry of Natural Categories Smooth?   σ ...  1 0 0   σ ... 
 If natural categories reside in a packed feature space in  0 2 0  Σ =  . . . .  (19) which both discrimination and inclusion are optimized, then  . . .. .   . . .  they should show a smooth structure. That is, near-by natural 0 0 . . . σN categories should not only have similar instances, but they should also have similar frequency distributions of features , across those instances. Analysis 1 provides support for this  ∑  prediction by examining the relation between the similarity  I K1 R 0 ... 0   D k=1 1k ∑  of instances and the similarity of feature distributions for 48  K2 ...   0 ID k=1 R2k 0  basic level categories. R¯ =  . . . .  (20)  . . .. .  A central problem for this analysis is the choice of fea-  ∑  0 0 ... I KN R tures across which to describe instances of these categories. D k=1 Nk One possibility that we considered and rejected was the use , and of features from feature generation studies (McRae et al., 2005; Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976;  ∑   N ¯ − ¯ ¯ ... ¯  Samuelson & Smith, 1999). In these studies, adults are given  i=1 Q1i Q11 ∑ Q12 Q1N   ¯ N ¯ − ¯ ... ¯  a category and asked to list the features characteristic of  Q21 i=1 Q2i Q22 Q2N  Φ =  . . . .  items in each category (e.g., has legs, made of wood, can  . . .. .   ∑  be sat on). The problem with this approach is that the fea- ¯ ¯ ... N ¯ − ¯ tures listed by adults as important to those queried categories QN1 QN2 i=1 QNi QNN (21) have (presumably) already been selected by whatever cogni- −1 where Q¯ i j = Qi j(σi + σ j) . tive processes make categories coherent. Thus, there is the Since this equation indicates a typical form of the least danger that the use of these generated features presupposes square error with a constraint, it is also rewritten as a typical the very phenomenon one seeks to explain. Accordingly, we quadratic eigenvalue problem as follows (Tisseur & Meer- chose to examine a broad set of polar dimensions unlikely bergen, 2001): to be specifically offered as important to any of these cate- gories. The specific features chosen do not need to be the 2 2 −2 T (λ ID − 2λΣ¯ − α Σ¯ x¯x¯ Σ¯)µ = 0 (22) exactly right features nor comprehensive. All they need to do is capture a portion of the similarity space in which in- where Σ¯ = R¯Σ−1R¯ and α2 = µT Φµ. Thus, the optimal mean stances and categories reside. If they do and if the packing vector µ is one of the eigenvectors given by this eigenvalue analysis is right, these features should nonetheless define an problem. This also suggests a similar structure as in the n-dimensional space of categories, which shows some de- optimal solution for the covariance. The magnitude of Qi j gree of smoothness: categories with instances similar to each and Rik are quite influential in determining the weighting be- other on these features should also show similar category tween the likelihood of the instances of a category and dis- likelihoods on these features. criminability between categories. 
If the distances between all To obtain this space, 16 polar opposites (e.g., wet-dry, pairs of categories are infinite (i.e., Qi j → 0( j = 1, 2, ..., N)), noisy-quiet, weak-strong) were selected that broadly encom- the optimal mean vector mainly depends on mean vectors of pass a wide of qualities (Osgood, Suci, & Tannenbaum, instances (and the matrix assigning instances to categories 1957; Hidaka & Saiki, 2004), that are also (by prior analyses) ) which is purely given by a set of instances (i.e., we obtain statistically uncorrelated (Hidaka & Saiki, 2004) but that nei- µ = R¯−1 x¯ by assuming Φ = 0 on Equation (22)). On the other ther by introspection nor by prior empirical studies seem to hand, if the number of instances of categories decreases (i.e., be specifically relevant to the particular categories examined PACKING: A GEOMETRIC ANALYSIS OF CATEGORIES 9 in this study. In this way, we let the packing metric select the tion 24). locally defined features. ∑M The analysis is based on the assumption that categories 1 with more variability in their feature distributions in the µ = d (23) i M ik world will yield more variability in the subjects’ judgments k=1 about the relevant features. Thus, the mean of the subjects’ ∑M 1 judgments for any category is used as an estimate of the mean σ = (d − µ )(d − µ )T (24) i M ik i ik i of the feature distributions for the category and the covari- k=1 ance of the subjects’ judgments is used as an estimate of co- variance. where M is number of subjects (104) and dik is 16 dimen- sional column vector having k-th subjects’ adjective ratings of i-th category. The smoothness index of the given cate- Method gories is defined by the correlation of central tendencies and covariance as follows: Participants ∑ , < (||µ || − ||µ¯||)(||σ || − ||σ¯ ||) = √ i j i i j i j The participants were 104 undergraduate and graduate S ∑ ∑ (25) ||µ || − ||µ|| 2 ||σ || − ||σ|| 2 students at Kyoto University and Kyoto Koka Women’s Uni- i, j

ing a correlation matrix instead of a covariance matrix. The Distances of paired corrected covariance matrices potential value of this approach is that it ignores artificial 1.5 correlations between means and (the absolute value of) the 0 2 4 6 8 10 12 14 variance. Distances of paired corrected mean vectors 3.4

Results and Discussion 3.2 (b) Figure 6a shows a of all possible pairs of cate- 3 gories; the x-axis is the Euclidian distance of the paired cor- 2.8 rected mean vectors and the y-axis is the Euclidian distance of the paired corrected covariance matrices. The correlation 2.6 between these two variables (with no dependence of paired 2.4 distances) is the Smoothness index (See Equation 25 for its definition). The median correlation was 0.537 (95% confi- 2.2 dence interval is from 0.223 to 0.756.). Figure 6b shows 2 the same scatter plot using the correlation matrix instead of 1.8 the covariance matrix as the measure of category likelihood; Distances of paired correlation matrices here the median correlation was 0.438 (95% confidence in- 1.6 terval is from 0.137 to 0.699). These positive correlations 1.4 between the distances of central tendencies and the distances 1 2 3 4 5 6 7 8 9 Distances of paired mean vectors of category likelihoods provide a first indication that natural Figure 6. If a space is smooth then the nearness of the categories categories may be smooth. (distances of the means) and similarity of the generalization patterns Figure 6 raises an additional possible insight. Not only do should be correlated. The two figures differ in their respective mea- categories near each other in feature space show similar pat- sures of the similarity of the generalization patterns of categories: terns of feature distribution, but across categories the changes (a) Scatter plot of the Euclidean distances of the covariance matri- in the feature distributions appear to be continuous, both in ces and the Euclidean distances of the means for pairs of categories. terms of location in the feature space and in terms of the fea- This correlation is the smoothness index, S=0.537. (b) Scatter plot ture likelihoods. This seamlessness of transitions within the of the Euclidean distances of the correlation matrices and the Eu- space of categories is suggested by the linear structure of the clidean distances of the means for pairs of categories, (S=0.438). scatter plot itself. This can emerge only if there are no big jumps or gaps in feature space or in the category likelihoods. Critically, the features analyzed in this study were not pre- Analysis 2: Learning a New selected to particularly fit the categories and thus the ob- Category served smoothness seems unlikely to have arrived from our choice of features or a priori notions about the kind of fea- According to Packing Theory, the generalization of a cat- tures that are relevant for different kinds of categories. In- egory to new instances depends not just on the instances that stead, the similarity of categories on any set of features (with have been experienced for that category but also on the dis- sufficient variance across the category) may be related to the tributions of known instances for nearby categories. From distribution of those features across instances. one, or very few new instances, generalizations of a newly PACKING: A GEOMETRIC ANALYSIS OF CATEGORIES 11 encountered category may be systematically aligned with the mean vectors and covariance matrices of the categories that distributions of instances from surrounding categories. This comprise the model’s background knowledge. 
Because chil- is illustrated in Figure 7: a learner who already knows some dren are unlikely to have complete knowledge of any cate- categories (shown as solid ellipses in Figure 7a) and observes gory, the mean and covariance for the background categories the first instance (a black star) of a novel category (a bro- are estimated from a random of 50% of the adult ken ellipsis) may predict the unknown generalization pattern judgments. This is done 50 times with each of the 48 noun shown by the broken ellipsis (Figure 7b). Because nearby categories from Analysis1 serving as the target category. categories have similar patterns of likelihoods, the system (via competition among categories and the joint optimization Estimation of a novel category from the first instance. of inclusion and discrimination) can predict the likelihood of Within the packing model, the variance (and covariance) of the unknown category, a likelihood that would also be similar the probabilistic density function is a critical determiner of to other known and nearby categories in the feature space. If the feature dimensions that are most important for a local categories did not have this property of smoothness, if they region and category. Thus, to predict the distribution of in- were distributed like that in Figure 7c, where each category stances for the target category, a category for which only one has a variance pattern unrelated to those of nearby categories, instance is given, we derive the covariance estimation for the the learner would have no basis on which to predict the gen- whole category. To do this, we let the scatter matrix of cate- ≈ eralization pattern. The goal of Analysis 2 is to show that gory i be zero (i.e., S i 0) by assuming the first instance is = = µ the packing metric can predict the feature distribution pat- close to the true mean (i.e., Ki 1 and xil i). In addition, terns of categories unknown to the model. In the simulation, we assume that the unknown likelihood of the novel category − the model is given the mean and covariance of 47 categories takes the form Gi log(C) where C is a particular constant. (from Analysis 1) and then is given a single instance of the In particular, in case . Based on this assumption, we can σ 48th category. The model’s prediction of the novel proba- obtain the covariance matrix of category Ci ( i) by solving bilistic density is calculated by an optimal solution with re- the Equation (13). Then the optimal covariance of the novel spect to the configuration of surrounding known noun cate- category is given as follows: gories. ∑N ( ) σi = Cˆ Qi j Sˆ i j + σˆ i j (26) (a) (b) (c) j=1 Likelihood ∑ ( ) ˆ = N ˆ + σ where C C j=1 Qi j S i j ˆ i j is derived from the con- Feature 2 Feature 2 ∂LN straint ∂λ = Gi −C = 0. Thus, estimated σi in Equation (26) optimize the packing metric, and it is considered as a special case of the general optimal solution when the covariance ma- trix of instances is collapsed to be zero (because there is the only instance). This equation indicates that a novel category with only instance can be estimated with harmonic mean of nearer known covariance matrices (σ ˆ i j) and nearer scatter matrix of weighted means (Sˆ i j). This directly means the co- variance matrix of the novel category is estimated from the Feature 1 Feature 1 other covariance matrices of nearby categories. Figure 7. (a) Each ellipsis indicates the equal-likelihood contour. 
We used Equation (26) in order to calculate a covariance Two schematic illustrations of a (b) smooth (c) and non-smooth matrix of a novel category σi from an instance sampled from space of categories. The broken ellipsis in each figure indicates the category (Ki = 1, xk = µi) and other known categories (µ j the equal-likelihood contour of the unknown category, and the star and σ j, j = 1, 2, ..., i−1, i+1, ..., 48). The scaling constant in indicates a given first instance of that category. The solid ellipses Equation (26) is assumed to have the same determinant of co- indicate equal-likelihood contour of known categories. A smooth variance matrix as the target category as the adult judgment space of categories provides more information for predicting the = | | |σ | = | | likelihood of the novel category contour. has (i.e, C S i so as to have i S i ). Control comparisons. The packing metric predicts the dis- tribution of instances in the target category by taking into Method 1account its general location (indicated by the one given in- stance) and the distributions of known instances for nearby On each trial of the simulation, one target category is as- categories. It is thus a geometric solution that derives not signed as unknown; the other 47categories serve as the back- from what is specifically known about the individual cate- ground categories that are assumed to be already learned. gory but from its position in a geometry of many categories. Each of the background-knowledge categories is assumed to Accordingly, we evaluate this central idea and the packing have a normal distribution, and the model predicts the covari- model’s ability to predict the unknown distribution of in- ance matrix of the target category of the base on the given stances by comparing the predictions of the packing model 12 HIDAKA AND SMITH to two alternative measures of the distribution of instances the simulation is that the accuracy of novel word generaliza- in feature space for that target category that take into ac- tions will be monotonic increasing function of number of cat- count only information about the target category and not in- egories. But here is what we do not know: As children learn formation about neighboring categories. These two alterna- categories, are some regions dense (e.g., dense animal cate- tive measures are: (1) the actual distribution of all instances gories) and other sparse (e.g., tools)? Are some regions of of the target category as given by the subjects in Analysis 1 the space -even those with relatively many categories -sparse and (2) three randomly selected instances from that subject in the sense of relatively few experienced instances of any generated distribution of instances. The comparison of the one category? Knowing just how young children’s category predictions of the packing metric to the actual distribution knowledge ”scales up” is critical to testing the role of the answers the question of how well the packing metric gener- joint optimization proposed by Packing Theory in children’s ates the full distribution given only a single instance but in- category development. formation about the distributions of neighboring categories. 
The formal analyses show that for the inherent in The second comparison answers the question of whether a the joint optimization of discrimination and inclusion require single instance in the context of a whole geometry of cat- many categories (crowding) and relatively many instances egories provides better information about the shape of that in these categories. This crowding will also depend on the category than more instances with no other information. dimensionality of the space as crowding is more likely in a lower than in a higher dimensional space, and we do not Results and Discussion know the dimensionality of the feature space for human cat- egory judgments. This limitation does not matter for test- The predicted covariance of the target category by the ing general predictions since the optimization depends only packing model correlates strongly with the actual distribu- on distance relations in the space (and thus on the number tion of instances as generated by the subjects in Analysis of orthogonal, that is uncorrelated, dimensions but not on 1. Specifically, the correlations between the packing met- any assumptions about what orthogonal directions in that ric predictions for the target category and measures of the space constitute the dimensions) and since the prediction of actual distributions from Analysis 1 were 0.876, 0.578 and smoothness should hold in any lower-dimensional charac- 0.559 for the covariances and (136 dimensions), terization of the space. The specification of the actual di- the variances considered alone (16 dimensions) and the co- mensionality of the space also may not matter for relative variances considered alone (120 dimensions). These are ro- predictions about more and less crowded regions of people’s bust correlations overall; moreover, they are considerably space of categories. Still, insight into the dimensionality of greater than those derived from an estimation of the target the feature space of human categories would benefit an un- category from three randomly chosen instances. For this derstanding of the development of the ability to generalize a ”control” comparison, we analyzed the correlation of covari- new category from very few instances. ance matrix for each category calculated from randomly cho- sen three instances of adults’ judgment with that calculated General Discussion from the whole set of instances. Their average correlations of 50 different random set of samples were 0.2266 in vari- A fundamental problem in category learning is knowing ances (S.D.=0.2610), 0.2273 in covariance (S.D.=0.1393) the relevant features for to-be-learned categories. Although and 0.4456 in both variance and covariance (S.D.=0.0655). this is a difficult problem for theories of categorization, peo- The packing metric –given one instance and information ple, including young children, seem to readily solve the prob- about neighboring categories– does a better job predicting lem. The packing model provides a unified account of feature category shape than a prediction from three instances. In selection and fast mapping that begins with the insight that sum, the packing metric can generate the distribution of in- the feature distributions across the known instances of a cat- stances in a category using its location in a system of known egory play a strong role, one that trumps overall similarity, in categories. 
This result suggests that a developing system of categories should, when enough categories and their instances are known, enable the learner to infer the distribution of a newly encountered category from very few instances. A geometry of categories, and the local interactions among them, creates knowledge of possible categories.

There are several open questions with respect to how the joint optimization of inclusion and discrimination should influence category development in children, who will have sparser instances and sparser categories than do adults. The processes presumed by Packing Theory may be assumed to always be in operation, as they seem likely to reflect core operating characteristics (competition) of the cognitive system. But their effects will depend on the density of categories and instances in local regions of the space. An implication of the simulation is that the accuracy of novel word generalizations will be a monotonically increasing function of the number of categories. But here is what we do not know: As children learn categories, are some regions dense (e.g., dense animal categories) and others sparse (e.g., tools)? Are some regions of the space (even those with relatively many categories) sparse in the sense of relatively few experienced instances of any one category? Knowing just how young children's category knowledge "scales up" is critical to testing the role of the joint optimization proposed by Packing Theory in children's category development.

The formal analyses show that the effects inherent in the joint optimization of discrimination and inclusion require many categories (crowding) and relatively many instances in these categories. This crowding will also depend on the dimensionality of the space, as crowding is more likely in a lower than in a higher dimensional space, and we do not know the dimensionality of the feature space for human category judgments. This limitation does not matter for testing general predictions, since the optimization depends only on distance relations in the space (and thus on the number of orthogonal, that is, uncorrelated, dimensions but not on any assumptions about which orthogonal directions in that space constitute the dimensions) and since the prediction of smoothness should hold in any lower-dimensional characterization of the space. The specification of the actual dimensionality of the space also may not matter for relative predictions about more and less crowded regions of people's space of categories. Still, insight into the dimensionality of the feature space of human categories would benefit an understanding of the development of the ability to generalize a new category from very few instances.
General Discussion

A fundamental problem in category learning is knowing the relevant features for to-be-learned categories. Although this is a difficult problem for theories of categorization, people, including young children, seem to solve it readily. The packing model provides a unified account of feature selection and fast mapping that begins with the insight that the feature distributions across the known instances of a category play a strong role, one that trumps overall similarity, in judgments as to whether some instance is a member of that category. This fact is often discussed in the categorization literature in terms of the question of whether instance distributions or similarity matter to category formation (Rips, 1989; Rips & Collins, 1993; Holland et al., 1986; Nisbett et al., 1983; Thibaut et al., 2002). The packing model takes the role of instance distributions and ties it to similarity in a geometry of categories in which nearby categories have similar category-relevant features, showing how this structure may emerge and how it may be used to learn new categories from very few instances. The packing model thus provides a bridge that connects the roles of instance distributions and similarity. The fitting of categories into a feature space is construed as the joint optimization of including known and possible instances and discriminating the instances belonging to different categories. The joint optimization of inclusion and discrimination aligns nearby categories such that their distributions of instances in the feature space are more alike. The chain reaction of these local interactions across the population of categories creates a smooth space. Categories that are similar (near in the space) have similar distributions of instances; categories that are dissimilar (far in the space) have more dissimilar distributions of instances.

In this way, the packing model provides the missing link that connects similarity to the likelihood of instances. Both similarity and feature distributions are deeply relevant to understanding how and why human categories have the structures that they do. However, the relevance is not with respect to the structure of a single category, but with respect to the structure of a population of categories. Smoothness implies a higher order structure, a gradient of changing feature relevance across the whole that is made out of, but transcends, the specific instances and the specific features of individual categories. It is this higher order structure that may be usable by learners in forming new categories. This higher order structure in the feature space aligns with what are sometimes called "kinds" or "superordinate categories" in that similar categories (clothes versus food, for example) will have similar features of importance and will be near each other in the space. However, there are no hard and fast boundaries, and the packing model does not directly represent these higher order categories. Instead, they are emergent in the patchwork of generalization gradients across the feature space. How such a space of probabilistic likelihoods of instances as members of basic level categories relates to higher and lower levels of categories (and the words one learns to name those categories) is an important question to be pursued in future work.
Packing Theory in relation to other topographic approaches

The packing model shares some core ideas with other topographic approaches such as self-organizing maps (SOM; see Kohonen, 1995; see also Tenenbaum et al., 2000). The central assumption underlying the algorithms used in SOM is that information coding is based on a continuous and smooth projection that preserves a particular topological structure. More particularly, within this framework, information coders (e.g., receptive fields, categories, memories) that are near each other code similar information, whereas coders that are more distant code different types of information. Thus, SOM and other topographical representations posit a smooth representational space, just as the packing metric does.

However, there are differences between the packing model and algorithms such as SOM. In the packing metric, categories may be thought of as the information coders, but unlike the information coders in SOM, these categories begin with their own feature importance and their own location in the map, which is specified by the feature distributions of experienced instances. Within the packing model, local competition "tunes" feature importance across the population of categories and creates a smooth space of feature relevance. SOM also posits a competition among information units, but of a fundamentally different kind. In the SOM algorithm, information coders do not explicitly have "their own type" of information. Rather, it is the topological relation among information coders that implicitly specifies their gradients of data distribution. In the SOM learning process, the information unit closest to an input is gradually moved to better fit that data point, and nearby units are moved to fit similar inputs. Thus nearby units end up coding similar inputs.
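For readers unfamiliar with the algorithm, the following is a minimal sketch of the SOM update step just described (the best-matching unit is pulled toward the input, and its map neighbors are pulled by a lesser amount). The grid size, learning rate, and neighborhood width are arbitrary illustrative choices and are unrelated to the packing model's parameters.

```python
import numpy as np

def som_update(weights, grid, x, lr=0.1, radius=1.0):
    """One update step of a self-organizing map (Kohonen, 1995).

    weights : (n_units, d) codebook vectors
    grid    : (n_units, 2) fixed positions of the units on the map
    x       : (d,) input vector
    """
    # 1. Find the best-matching unit: the codebook vector closest to the input.
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
    # 2. Units near the BMU on the map are also pulled toward the input, with a
    #    strength that falls off with map distance; this is what makes nearby
    #    units end up coding similar inputs.
    map_dist = np.linalg.norm(grid - grid[bmu], axis=1)
    h = np.exp(-(map_dist ** 2) / (2 * radius ** 2))
    return weights + lr * h[:, None] * (x - weights)

# Toy usage: a 5 x 5 map of 3-dimensional codebook vectors.
rng = np.random.default_rng(1)
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
weights = rng.normal(size=(25, 3))
for x in rng.normal(size=(200, 3)):
    weights = som_update(weights, grid, x)
```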
Topological algorithms such as SOM assume that a smooth structure is a good way to represent information, and this assumption is well supported by the many successful applications of these algorithms (Kohonen, 1995). However, just why a smooth structure is "good" is not well specified. The packing metric might provide an answer from a psychological perspective. The packing model neither assumes topological relations nor a smooth structure, but rather produces them through the joint optimization of discriminability and inclusion. Thus, a smooth space might be a good form of representation because of the trade-off between discrimination and generalization.

Packing Theory in relation to other accounts of fast mapping

Fast mapping is the term sometimes used to describe young children's ability to map a noun to a whole category given just one instance (Carey & Bartlett, 1978). Packing Theory shares properties with two classes of current explanations of fast mapping in children: connectionist approaches (Colunga & Smith, 2005; Rogers & McClelland, 2004; see also Hanson & Negishi, 2002, for a related model) and Bayesian approaches (Kemp, Perfors, & Tenenbaum, 2007; Xu & Tenenbaum, 2007). Like connectionist accounts, the packing model views knowledge about the different organization of different kinds as emergent and graded. Like rationalist accounts, the packing model is not a process model. Moreover, since the packing model is built upon a statistical optimality, it could be formally classified as a rationalist model (Anderson, 1990).

Despite these differences, there are important similarities across all three approaches. Both the extant connectionist and Bayesian accounts of children's smart noun generalizations consider category learning and generalization as a form of statistical inference. Thus, all three classes of models are sensitive to the feature variability within a set of instances. All agree on the main idea behind the packing model, that feature variability within categories determines feature relevance in category generalization. All three also agree that the most important issue to be explained is higher order feature selection, called variously second order generalizations (Smith, Jones, Landau, Gershkoff-Stowe, & Samuelson, 2002; Colunga & Smith, 2005), overhypotheses (Kemp, Perfors, & Tenenbaum, 2007), and smoothness (the packing model). Using the terms of Colunga and Smith (2005), the first order generalization is about individual categories and is a generalization over instances. The second order generalization is a generalization, over categories, of the distributions of categories. The central goal of all three approaches is to explain how people form higher-order generalizations.

There are also important and related differences among these approaches. The first set of differences concerns whether or not the different levels are explicitly represented in the theory. Colunga and Smith's (2005) connectionist account represents only input and output associations; the higher order representations of kind (that shape is more relevant for solid things than for nonsolid things, for example) are implicit in the structure of the input-output associations. They are not explicitly represented and they do not pre-exist in the learner prior to learning. In contrast, the Bayesian approach to novel word generalization (Kemp et al., 2007; Xu & Tenenbaum, 2007) has assumed categories structured as a hierarchical tree. The learner knows from the start that there are higher order and lower order categories in a hierarchy. Although the packing model is rationalist in approach, it is emergentist in spirit: smoothness is not an a priori expectation and is not explicitly represented as a higher order variable, but is an emergent and graded property of the population as a whole.

As it stands, the packing model also makes no explicit distinction between learned categories at different levels, such as learning the category animal versus the category dog. The present model considers only basic level categories and thus is moot on this point. However, one approach the packing metric could take with respect to this issue is to represent all levels of categories in the same geometry, with many overlapping instances, letting the joint optimization of inclusion and discrimination find the stable solution given the distributional evidence on the inclusion and discrimination of instances in the overlapping categories. This approach might well capture some developmental phenomena, for example, children's tendency to agree that unknown animals are "animals" but that well known ones (e.g., dogs) are not. Within this extended framework, one might also want to include, outside of the packing model itself, real-time processes that perhaps activate a selected map of categories in working memory or that perhaps contextually shift local feature gradients, enabling classifiers to flexibly shift between levels and kinds of categories and to form ad hoc categories (Barsalou, 1985; Spencer, Perone, Smith, & Samuelson, in preparation).

The second and perhaps most crucial difference between Packing Theory and the other two accounts is the ultimate origin of the higher order knowledge about kinds. For connectionist accounts, the higher order regularities are latent structure in the input itself. If natural categories are smooth, by this view, it is solely because the structure of the categories in the world is smooth and the human learning system has the capability to discover that regularity. However, if this is so, one needs to ask (and answer) why the to-be-learned categories have the structure that they do. For the current Bayesian accounts, a hierarchical representational structure (with variabilized over-hypotheses) is assumed and fixed (but see Kemp & Tenenbaum, 2008, for an approach that learns the structure). These over-hypotheses create a tree of categories in which categories near each other in the tree will have similar structure. Again, why the system would have evolved to have such an innate structure is not at all clear. Moreover, the kind of mechanisms or neural substrates in which such hierarchical pre-ordained knowledge resides is also far from obvious.

The packing model provides answers and new insights to these issues that locate smoothness neither in the data nor in a pre-specified outcome. Instead, smoothness is emergent in the local interactions of fundamental processes of categorization: inclusion and discrimination. As the proof and analyses show, the joint optimization of discriminability and inclusion leads to smooth categories, regardless of the starting point. The packing model thus provides an answer as to why categories are the way they are and why they are smooth. The answer is not that categories have the structure they do in order to help children learn them; the smoothness of categories in feature space is not a pre-specification of what the system has to learn, as it is in the current Bayesian accounts of children's early word learning (although the smoothness of the geometry of categories is clearly exploitable). Rather, according to Packing Theory, the reason categories have the structure they do lies in the local function of categories in the first place: to include known and possible instances but to discriminate among instances falling in different categories. The probabilistic nature of inclusion and discrimination, the frequency distributions of individual categories, and the joint optimization of discrimination and inclusion in a connected geometry of many categories create a gradient of feature relevance that is then useable by learners. For natural category learning, for categories that are passed on from one generation to the next, the optimization of inclusion and discrimination over these generations may make highly common and early-learned categories particularly smooth. Although the packing model is not a process model, processes of discrimination and inclusion and processes of competition in a topographical representation are well studied at a variety of levels of analysis, and thus bridges between this analytic account and process accounts seem plausible.
Testable Predictions

The specific contribution of this paper is a mathematical analysis showing that the joint optimization of inclusion and discrimination yields a smooth space of categories, and that given such a smooth space, that optimization can also accurately predict the instance distribution of a new category specified only by the location of a single instance. What is needed beyond this mathematical proof is empirical evidence showing that the category organizations and processes proposed by the packing model are actually observable in human behavior. The present paper provides a first step by indicating that the feature space of early-learned noun categories may be smooth (and smooth enough to support fast mapping). Huttenlocher et al. (2007) have reported empirical evidence that also provides support for local competitions among neighboring categories. Huttenlocher et al.'s (2007) method provides a possible way to test specific predictions from Packing Theory in adults.

The local interactions that create smoothness also raise new and testable hypotheses about children's developing category knowledge. Because these local competitions depend on the frequency distributions over known instances and on the local neighborhood of known categories, there should be observable and predictable changes as children's category knowledge "scales up". Several developmental predictions follow: (1) Learners who know (or are taught) a sufficiently large and dense set of categories should form and generalize a geometry of categories that is smoother than that given by the known instances. (2) The generalization of any category trained with a specific set of instances should depend on the instance distributions of surrounding categories and be distorted in the direction of the surrounding categories; thus, children should show smoother category structure and smarter novel noun generalizations in denser category regions than in sparser ones. (3) The effects of learning a new category on surrounding categories, or of surrounding categories on a new one, should depend in formally predictable ways on the feature distributions of those categories.

Conclusion

Categories (and their instances) do not exist in isolation but reside in a space of many other categories. The local interactions of these categories create a gradient of higher order structure: different kinds with different feature distributions. This structure, emergent from the interactions of many categories in a representational space, constrains the possible structure of both known and unknown categories. Packing Theory captures these ideas in the joint optimization of discrimination and generalization.

Acknowledgements

This study is based in part on the PhD thesis of the first author. The authors thank Dr. Jun Saiki and Dr. Toshio Inui for their discussion of this work. This study was supported by grants from NIH MH60200.

Appendix: Derivation of the differential with respect to the covariance matrix

We derive Equations (11) and (12) by expanding ∂F_ij/∂σ_i and ∂G_ik/∂σ_i. For the derivation, we use the vectorizing operator v(X), which forms a column vector from a given matrix X (see Magnus & Neudecker, 1988, and Turkington, 2002, for the matrix algebra). A useful formula for the vectorizing operator is as follows. For an m × n matrix A and an n × p matrix B,

\[
v(AB) = v(I_m A B) = \left(B^{T} \otimes I_m\right) v(A)
      = v(A I_n B) = \left(B^{T} \otimes A\right) v(I_n)
      = v(A B I_p) = \left(I_p \otimes A\right) v(B),
\]

where I_m is the m-th order identity matrix and ⊗ denotes the Kronecker product. Moreover, we use the following formulae in order to expand the differential with respect to a matrix (see also Turkington, 2002). For d × d matrices X, Y, Z and a constant matrix A that is not a function of X,

\[
\frac{\partial |X|}{\partial v(X)} = |X|\, v\!\left(X^{-1T}\right), \qquad
\frac{\partial v(X^{-1})}{\partial v(X)} = -\left(X^{-1} \otimes X^{-1T}\right), \qquad
\frac{\partial \operatorname{tr}(AX)}{\partial v(X)} = v\!\left(A^{T}\right),
\]
\[
\frac{\partial v(X)}{\partial x_k} = \left(\lambda_k I_d - X\right)\left(x_k^{T} \otimes I_d\right), \qquad
\frac{\partial v(Z)}{\partial v(X)} = \frac{\partial v(Y)}{\partial v(X)}\,\frac{\partial v(Z)}{\partial v(Y)},
\]

where x_k and λ_k are respectively the k-th eigenvector and eigenvalue of the d-th order real symmetric matrix X. We derive Equation (11) from Equation (8) using the formulae above:

\begin{align*}
-4\frac{\partial F_{ij}}{\partial v(\sigma_i)}
&= \frac{\partial\, v(\Delta\mu_{ij}\Delta\mu_{ij}^{T})^{T} v(\bar{\sigma}_{ij}^{-1})}{\partial v(\sigma_i)}
 + 2\frac{\partial \log|\bar{\sigma}_{ij}|}{\partial v(\sigma_i)}
 - \frac{\partial \log|\sigma_i|}{\partial v(\sigma_i)} \\
&= \frac{\partial v(\bar{\sigma}_{ij}^{-1})}{\partial v(\sigma_i)}\, v(\Delta\mu_{ij}\Delta\mu_{ij}^{T})
 + 2\frac{\partial v(\sigma_i)}{\partial v(\sigma_i)}\, v(\bar{\sigma}_{ij}^{-1})
 - \frac{\partial v(\sigma_i)}{\partial v(\sigma_i)}\, v(\sigma_i^{-1T}) \\
&= -\left(\bar{\sigma}_{ij}^{-1} \otimes \bar{\sigma}_{ij}^{-1T}\right) v(\Delta\mu_{ij}\Delta\mu_{ij}^{T})
 + 2 v(\bar{\sigma}_{ij}^{-1}) - v(\sigma_i^{-1T}) \\
&= -v\!\left(\bar{\sigma}_{ij}^{-1}\Delta\mu_{ij}\Delta\mu_{ij}^{T}\bar{\sigma}_{ij}^{-1} - 2\bar{\sigma}_{ij}^{-1} + \sigma_i^{-1}\right) \\
&= -v\!\left(\sigma_i^{-1}(\mu_i - \bar{\mu}_{ij})(\mu_i - \bar{\mu}_{ij})^{T}\sigma_i^{-1} + \sigma_i^{-1}\hat{\sigma}_{ij}\sigma_i^{-1} - \sigma_i^{-1}\right),
\end{align*}

where σ̄_ij = σ_i + σ_j and ∆µ_ij = µ_i − µ_j. Note that σ̄_ij^{-1} = σ_i^{-1} − 2^{-1} σ_i^{-1} σ̂_ij σ_i^{-1} and σ̄_ij^{-1} ∆µ_ij = σ_i^{-1}(µ_i − µ̄_ij) are used for the last line. Thus, we obtain Equation (11). In the same way as the derivation of Equation (11), we derive Equation (12) from Equation (2) as follows:

\begin{align*}
2\frac{\partial G_{ik}}{\partial v(\sigma_i)}
&= \frac{\partial \operatorname{tr}(S_{ik}\sigma_i^{-1})}{\partial v(\sigma_i)} + \frac{\partial \log|\sigma_i|}{\partial v(\sigma_i)} \\
&= \frac{\partial v(\sigma_i^{-1})}{\partial v(\sigma_i)}\, v(S_{ik}) + v(\sigma_i^{-1T}) \\
&= -\left(\sigma_i^{-1} \otimes \sigma_i^{-1T}\right) v(S_{ik}) + v(\sigma_i^{-1T}) \\
&= -v\!\left(\sigma_i^{-1} S_{ik}\sigma_i^{-1} - \sigma_i^{-1}\right).
\end{align*}

Thus, we obtain Equation (12). Next, we derive Equation (13) using the formula for the differential with respect to an eigenvector, as follows:

\begin{align*}
\frac{\partial LN}{\partial s_{il}}
&= \frac{\partial v(\sigma_i)}{\partial s_{il}}\,\frac{\partial LN}{\partial v(\sigma_i)}
 = \left(\eta_{il} I_D - \sigma_i\right)\left(s_{il}^{T} \otimes I_D\right) \frac{\partial LN}{\partial v(\sigma_i)}
 = \left(\eta_{il} I_D - \sigma_i\right) \frac{\partial LN}{\partial \sigma_i}\, s_{il}.
\end{align*}

Since |η_il I_D − σ_i| = 0 holds by definition, ∂LN/∂σ_i = 0 is necessary in order to obtain a non-trivial solution of ∂LN/∂s_il = 0. Therefore, we obtain Equation (13).
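The matrix-calculus formulae used in this appendix can be spot-checked numerically. The sketch below is an independent check, not part of the original derivation: it verifies the determinant and matrix-inverse rules by central finite differences, using column-stacking for v(·) and the derivative layout in which the chain rule reads ∂v(Z)/∂v(X) = [∂v(Y)/∂v(X)][∂v(Z)/∂v(Y)].

```python
import numpy as np

def v(M):
    """Column-stacking vectorization, as used in the appendix."""
    return M.flatten(order="F")

rng = np.random.default_rng(2)
d = 4
X = rng.normal(size=(d, d)) + d * np.eye(d)  # a well-conditioned invertible matrix
Xinv = np.linalg.inv(X)
eps = 1e-6

grad_det = np.zeros(d * d)        # finite-difference gradient of |X| w.r.t. v(X)
D_inv = np.zeros((d * d, d * d))  # row q: derivative of v(X^{-1}) w.r.t. the q-th entry of v(X)
for q in range(d * d):
    E = np.zeros(d * d)
    E[q] = eps
    Xp = X + E.reshape(d, d, order="F")
    Xm = X - E.reshape(d, d, order="F")
    grad_det[q] = (np.linalg.det(Xp) - np.linalg.det(Xm)) / (2 * eps)
    D_inv[q] = (v(np.linalg.inv(Xp)) - v(np.linalg.inv(Xm))) / (2 * eps)

# d|X|/dv(X) = |X| v(X^{-1 T})
print(np.allclose(grad_det, np.linalg.det(X) * v(Xinv.T), atol=1e-4))
# dv(X^{-1})/dv(X) = -(X^{-1} kron X^{-1 T}) in this layout
print(np.allclose(D_inv, -np.kron(Xinv, Xinv.T), atol=1e-4))
```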

References

Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Lawrence Erlbaum Associates.
Ashby, G. F., & Townsend, J. T. (1986). Varieties of perceptual independence. Psychological Review, 93, 154-179.
Barsalou, L. W. (1985). Ideals, central tendency, and frequency of instantiation as determinants of graded structure in categories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11, 629-654.
Beck, D. M., & Kastner, S. (2009). Top-down and bottom-up mechanisms in biasing competition in the human brain. Vision Research, 49, 1154-1165.
Booth, A. E., & Waxman, S. (2002). Word learning is 'smart': Evidence that conceptual information affects preschoolers' extension of novel words. Cognition, 84, B11-B22.
Carey, S., & Bartlett, E. (1978). Acquiring a single new word. Papers and Reports on Child Language Development, 15, 17-29.
Colunga, E., & Smith, L. (2005). From the lexicon to expectations about kinds: A role for associative learning. Psychological Review, 112, 347-382.
Colunga, E., & Smith, L. B. (2008). Flexibility and variability: Essential to human cognition and the study of human cognition. New Ideas in Psychology, 26(2), 174-192.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: John Wiley & Sons.
Duncan, J. (1996). Cooperating brain systems in selective perception and action. Attention and Performance XVI: Information Integration, 18, 193-222.
Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Thal, D. J., & Pethick, S. J. (1994). Variability in early communicative development. Monographs of the Society for Research in Child Development, 59, 1-173.
Gathercole, V. C. M., & Min, H. (1997). Word meaning biases or language-specific effects? Evidence from English, Spanish, and Korean. First Language, 17(49), 31-56.
Gelman, S. A., & Coley, J. D. (1991). Language and categorization: The acquisition of natural kind terms. In S. A. Gelman & J. P. Byrnes (Eds.), Perspectives on language and thought: Interrelations in development. Cambridge: Cambridge University Press.
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114(2).
Hanson, S. J., & Negishi, M. (2002). On the emergence of rules in neural networks. Neural Computation, 14, 2245-2268.
Hidaka, S., & Saiki, J. (2004). A mechanism of ontological boundary shifting. In The Twenty-Sixth Annual Meeting of the Cognitive Science Society (pp. 565-570).
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction. Cambridge, MA: MIT Press.
Huttenlocher, J., Hedges, L. V., Lourenco, S. F., Crawford, L. E., & Corrigan, B. (2007). Estimating stimuli from contrasting categories: Truncation due to boundaries. Journal of Experimental Psychology: General, 136(3), 502-519.
Imai, M., & Gentner, D. (1997). A cross-linguistic study of early word meaning: Universal ontology and linguistic influence. Cognition, 62, 169-200.
Jones, S. S., & Smith, L. (2002). How children know the relevant properties for generalizing object names. Developmental Science, 5, 219-232.
Jones, S. S., Smith, L. B., & Landau, B. (1991). Object properties and knowledge in early lexical learning. Child Development, 62, 499-516.
Keil, F. C. (1994). The birth and nurturance of concepts by domains: The origins of concepts of living things. In L. A. Hirschfeld & S. A. Gelman (Eds.), Mapping the mind: Domain specificity in cognition and culture. Cambridge: Cambridge University Press.
Kemp, C., Perfors, A., & Tenenbaum, J. B. (2007). Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3), 307-321.
Kemp, C., & Tenenbaum, J. B. (2008). The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31), 10687-10692.
Kloos, H., & Sloutsky, V. M. (2008). What's behind different kinds of kinds: Effects of statistical density on learning and representation of categories. Journal of Experimental Psychology: General, 137, 52-75.
Kobayashi, H. (1998). How 2-year-old children learn novel part names of unfamiliar objects. Cognition, 68, B41-B51.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer.
Landau, B., Smith, L. B., & Jones, S. (1992). Syntactic context and the shape bias in children's and adults' lexical learning. Journal of Memory and Language, 31(6), 807-825.
Landau, B., Smith, L. B., & Jones, S. S. (1998). Object shape, object function, and object name. Journal of Memory and Language, 38, 1-27.
Macario, J. F. (1991). Young children's use of color in classification: Foods and canonically colored objects. Cognitive Development, 6, 17-46.
Magnus, J. R. (1988). Linear structures. Oxford: Oxford University Press.
Markman, E. M. (1989). Categorization and naming in children: Problems of induction. Cambridge, MA: MIT Press.
Marslen-Wilson, W. D. (1987). Functional parallelism in spoken word-recognition. Cognition, 25, 71-102.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman & Hall.
McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, Instruments, & Computers, 37, 547-559.
Murphy, G., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.
Nisbett, R., Krantz, D., Jepson, C., & Kunda, Z. (1983). The use of statistical heuristics in everyday inductive reasoning. Psychological Review, 90, 339-363.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 39-57.
Nosofsky, R. M., Palmeri, T. J., & McKinley, S. C. (1994). Rule-plus-exception model of classification learning. Psychological Review, 101, 53-79.
Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning. Urbana, IL: University of Illinois Press.
Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning. Cambridge, England: Cambridge University Press.
Rips, L. J., & Collins, A. (1993). Categories and resemblance. Journal of Experimental Psychology: General, 122, 468-486.
Rogers, T. T., & McClelland, J. M. (2004). Semantic cognition: A parallel distributed processing approach. Cambridge, MA: The MIT Press.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.
Samuelson, L. K., & Smith, L. B. (1999). Early noun vocabularies: Do ontology, category structure and syntax correspond? Cognition, 73, 1-33.
Shepard, R. N. (1958). Stimulus and response generalization: Tests of a model relating generalization to distance in psychological space. Journal of Experimental Psychology, 55, 509-523.
Smith, L. B., Jones, S. S., Landau, B., Gershkoff-Stowe, L., & Samuelson, L. (2002). Object name learning provides on-the-job training for attention. Psychological Science, 13, 13-19.
Soja, N. N., Carey, S., & Spelke, E. S. (1991). Ontological categories guide young children's inductions of word meanings: Object terms and substance terms. Cognition, 38, 179-211.
Spencer, J. P., Perone, S., Smith, L. B., & Samuelson, L. (in preparation). Non-Bayesian noun generalization from a capacity-limited system. Manuscript in preparation.
Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 893-907.
Steyvers, M., & Tenenbaum, J. B. (2005). The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29, 41-78.
Swingley, D., & Aslin, R. N. (2007). Lexical competition in young children's word learning. Cognitive Psychology, 54, 99-132.
Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319-2323.
Thibaut, J. P., Dupont, M., & Anselme, P. (2002). Dissociations between categorization and similarity judgments as a result of learning feature distributions. Memory & Cognition, 30(4), 647-656.
Tisseur, F., & Meerbergen, K. (2001). The quadratic eigenvalue problem. SIAM Review, 43(2), 235-286.
Turkington, D. A. (2002). Matrix calculus and zero-one matrices: Statistical and econometric applications. Cambridge: Cambridge University Press.
Tversky, A., & Hutchinson, J. W. (1986). Nearest neighbor analysis of psychological spaces. Psychological Review, 93, 3-22.
Xu, F., & Tenenbaum, J. (2007). Word learning as Bayesian inference. Psychological Review, 114, 245-272.
Yoshida, H., & Smith, L. B. (2001). Early noun lexicons in English and Japanese. Cognition, 82, 63-74.
Yoshida, H., & Smith, L. B. (2003). Shifting ontological boundaries: How Japanese- and English-speaking children generalize names for animals and artifacts. Developmental Science, 6, 1-34.
Zeigenfuse, M. D., & Lee, M. D. (2009). Finding the features that represent stimuli. Acta Psychologica, 133, 283-295.