RESEARCH ARTICLES | COGNITIVE SCIENCE

Human-level concept learning through probabilistic program induction

Brenden M. Lake,1* Ruslan Salakhutdinov,2 Joshua B. Tenenbaum3

1Center for Data Science, New York University, 726 Broadway, New York, NY 10003, USA. 2Department of Computer Science and Department of Statistics, University of Toronto, 6 King's College Road, Toronto, ON M5S 3G4, Canada. 3Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA.
*Corresponding author. E-mail: [email protected]

People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms—for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world's alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. On a challenging one-shot classification task, the model achieves human-level performance while outperforming recent deep learning approaches. We also present several "visual Turing tests" probing the model's creative generalization abilities, which in many cases are indistinguishable from human behavior.

Despite remarkable advances in artificial intelligence and machine learning, two aspects of human conceptual knowledge have eluded machine systems. First, for most interesting kinds of natural and man-made categories, people can learn a new concept from just one or a handful of examples, whereas standard algorithms in machine learning require tens or hundreds of examples to perform similarly. For instance, people may only need to see one example of a novel two-wheeled vehicle (Fig. 1A) in order to grasp the boundaries of the new concept, and even children can make meaningful generalizations via "one-shot learning" (1–3). In contrast, many of the leading approaches in machine learning are also the most data-hungry, especially "deep learning" models that have achieved new levels of performance on object and speech recognition benchmarks (4–9). Second, people learn richer representations than machines do, even for simple concepts (Fig. 1B), using them for a wider range of functions, including (Fig. 1, ii) creating new exemplars (10), (Fig. 1, iii) parsing objects into parts and relations (11), and (Fig. 1, iv) creating new abstract categories of objects based on existing categories (12, 13). In contrast, the best machine classifiers do not perform these additional functions, which are rarely studied and usually require specialized algorithms. A central challenge is to explain these two aspects of human-level concept learning: How do people learn new concepts from just one or a few examples? And how do people learn such abstract, rich, and flexible representations? An even greater challenge arises when putting them together: How can learning succeed from such sparse data yet also produce such rich representations? For any theory of

Fig. 1. People can learn rich concepts from limited data. (A and B) A single example of a new concept (red boxes) can be enough information to support the (i) classification of new examples, (ii) generation of new examples, (iii) parsing an object into parts and relations (parts segmented by color), and (iv) generation of new concepts from related concepts. [Image credit for (A), iv, bottom: With permission from Glenn Roberts and Motorcycle Mojo Magazine]

1332 | 11 DECEMBER 2015 • VOL 350 ISSUE 6266 | sciencemag.org SCIENCE

learning (4, 14–16), fitting a more complicated model requires more data, not less, in order to achieve some measure of good generalization, usually the difference in performance between new and old examples. Nonetheless, people seem to navigate this trade-off with remarkable agility, learning rich concepts that generalize well from sparse data.

This paper introduces the Bayesian program learning (BPL) framework, capable of learning a large class of visual concepts from just a single example and generalizing in ways that are mostly indistinguishable from people. Concepts are represented as simple probabilistic programs—that is, probabilistic generative models expressed as structured procedures in an abstract description language (17, 18). Our framework brings together three key ideas—compositionality, causality, and learning to learn—that have been separately influential in cognitive science and machine learning over the past several decades (19–22). As programs, rich concepts can be built "compositionally" from simpler primitives. Their probabilistic semantics handle noise and support creative generalizations in a procedural form that (unlike other probabilistic models) naturally captures the abstract "causal" structure of the real-world processes that produce examples of a category. Learning proceeds by constructing programs that best explain the observations under a Bayesian criterion, and the model "learns to learn" (23, 24) by developing hierarchical priors that allow previous experience with related concepts to ease learning of new concepts (25, 26). These priors represent a learned inductive bias (27) that abstracts the key regularities and dimensions of variation holding across both types of concepts and across instances (or tokens) of a concept in a given domain. In short, BPL can construct new programs by reusing the pieces of existing ones, capturing the causal and compositional properties of real-world generative processes operating on multiple scales.

In addition to developing the approach sketched above, we directly compared people, BPL, and other computational approaches on a set of five challenging concept learning tasks (Fig. 1B). The tasks use simple visual concepts from Omniglot, a data set we collected of multiple examples of 1623 handwritten characters from 50 writing systems (Fig. 2) (see acknowledgments). Both images and pen strokes were collected (see below) as detailed in section S1 of the online supplementary materials. Handwritten characters are well suited for comparing human and machine learning on a relatively even footing: They are both cognitively natural and often used as a benchmark for comparing learning algorithms. Whereas machine learning algorithms are typically evaluated after hundreds or thousands of training examples per class (5), we evaluated the tasks of classification, parsing (Fig. 1B, iii), and generation (Fig. 1B, ii) of new examples in their most challenging form: after just one example of a new concept. We also investigated more creative tasks that asked people and computational models to generate new concepts (Fig. 1B, iv). BPL was compared with three deep learning models, a classic pattern recognition algorithm, and various lesioned versions of the model—a breadth of comparisons that serve to isolate the role of each modeling ingredient (see section S4 for descriptions of alternative models). We compare with two varieties of deep convolutional networks (28), representative of the current leading approaches to object recognition (7), and a hierarchical deep (HD) model (29), a probabilistic model needed for our more generative tasks and specialized for one-shot learning.

Bayesian Program Learning

The BPL approach learns simple stochastic programs to represent concepts, building them compositionally from parts (Fig. 3A, iii), subparts (Fig. 3A, ii), and spatial relations (Fig. 3A, iv). BPL defines a generative model that can sample new types of concepts (an "A," "B," etc.) by combining parts and subparts in new ways. Each new type is also represented as a generative model, and this lower-level generative model produces new examples (or tokens) of the concept (Fig. 3A, v), making BPL a generative model for generative models. The final step renders the token-level variables in the format of the raw data (Fig. 3A, vi). The joint distribution on types ψ, a set of M tokens of that type θ(1), ..., θ(M), and the corresponding binary images I(1), ..., I(M) factors as

P(ψ, θ(1), ..., θ(M), I(1), ..., I(M)) = P(ψ) ∏_{m=1}^{M} P(I(m) | θ(m)) P(θ(m) | ψ)    (1)

The generative processes for types P(ψ) and tokens P(θ(m) | ψ) are described by the pseudocode in Fig. 3B and detailed along with the image model P(I(m) | θ(m)) in section S2. Source code is available online (see acknowledgments). The model learns to learn by fitting each conditional distribution to a background set of characters from 30 alphabets, using both the image and the stroke data, and this image set was also used to pretrain the alternative deep learning models. Neither the production data nor any alphabets from this set are used in the subsequent evaluation tasks, which provide the models with only raw images of novel characters.

Handwritten character types ψ are an abstract schema of parts, subparts, and relations. Reflecting the causal structure of the handwriting process, character parts Si are strokes initiated by pressing the pen down and terminated by lifting it up (Fig. 3A, iii), and subparts si1, ..., sini are more primitive movements separated by brief pauses of
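The three-level structure behind Eq. 1 (types, tokens, images) can be sketched schematically. The following toy Python sketch mirrors the factorization P(ψ) ∏ P(I(m) | θ(m)) P(θ(m) | ψ); every primitive, distribution, and parameter in it is an invented placeholder rather than the learned model from the paper, and the image step is a stub that only shows the sampling structure:

```python
import random

# Toy sketch of the type -> token -> image hierarchy in Eq. 1.
# All distributions below are invented placeholders, not the
# empirical distributions BPL fits to the background set.

PRIMITIVES = ["line", "arc", "curve", "dot"]  # stand-in action library

def sample_type(rng):
    """Sample a type psi: parts (strokes) built from discrete subparts."""
    k = rng.randint(1, 3)                       # number of parts
    parts = [[rng.choice(PRIMITIVES) for _ in range(rng.randint(1, 2))]
             for _ in range(k)]
    relations = [rng.choice(["independent", "start", "end", "along"])
                 for _ in range(k)]
    return {"parts": parts, "relations": relations}

def sample_token(psi, rng):
    """Sample token variables theta given psi: add motor noise per subpart."""
    return [[(sub, rng.gauss(0.0, 0.1)) for sub in part]
            for part in psi["parts"]]

def sample_image(theta, rng, size=8):
    """Stub renderer: a grid of independent Bernoulli pixels.
    (A real renderer would ink the token's trajectory.)"""
    p_ink = 0.1                                 # placeholder ink probability
    return [[int(rng.random() < p_ink) for _ in range(size)]
            for _ in range(size)]

rng = random.Random(0)
psi = sample_type(rng)                               # one character type
tokens = [sample_token(psi, rng) for _ in range(3)]  # M = 3 tokens
images = [sample_image(t, rng) for t in tokens]
```

The point of the sketch is only the nesting: one type is sampled once, while tokens and images are sampled independently per example, exactly as the product over m in Eq. 1 requires.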

Fig. 2. Simple visual concepts for comparing human and machine learning. 525 (out of 1623) character concepts, shown with one example each.

the pen (Fig. 3A, ii). To construct a new character type, first the model samples the number of parts k and the number of subparts ni, for each part i = 1, ..., k, from their empirical distributions as measured from the background set. Second, a template for a part Si is constructed by sampling subparts from a set of discrete primitive actions learned from the background set (Fig. 3A, i), such that the probability of the next action depends on the previous. Third, parts are then grounded as parameterized curves (splines) by sampling the control points and scale parameters

Fig. 3. A generative model of handwritten characters. (A) New types are generated by choosing primitive actions (color coded) from a library (i), combining these subparts (ii) to make parts (iii), and combining parts with relations to define simple programs (iv). New tokens are generated by running these programs (v), which are then rendered as raw data (vi). (B) Pseudocode for generating new types ψ and new token images I(m) for m = 1, ..., M. The function f(·, ·) transforms a subpart sequence and start location into a trajectory.
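The second step of type generation above, in which the probability of the next action depends on the previous, amounts to a first-order Markov chain over the primitive library. A minimal sketch follows; the two-action inventory and the transition table are made up for illustration and stand in for the statistics BPL learns from the background set:

```python
import random

# Sketch of sampling one part's subpart sequence as a first-order
# Markov chain over primitive actions. The transition probabilities
# here are invented placeholders, not learned values.

TRANSITIONS = {                      # P(next action | previous action)
    "start": {"line": 0.5, "arc": 0.5},
    "line":  {"line": 0.3, "arc": 0.4, "stop": 0.3},
    "arc":   {"line": 0.4, "arc": 0.2, "stop": 0.4},
}

def sample_subparts(rng, max_len=10):
    """Walk the chain from "start" until "stop" (or max_len)."""
    seq, prev = [], "start"
    while len(seq) < max_len:
        r, cum = rng.random(), 0.0
        for action, p in TRANSITIONS[prev].items():
            cum += p
            if r < cum:
                break
        if action == "stop":
            break
        seq.append(action)
        prev = action
    return seq

part = sample_subparts(random.Random(1))
```

Each sampled sequence would then be grounded as a spline in the third step, which this sketch does not attempt.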

Fig. 4. Inferring motor programs from images. Parts are distinguished by color, with a colored dot indicating the beginning of a stroke and an arrowhead indicating the end. (A) The top row shows the five best programs discovered for an image along with their log-probability scores (Eq. 1). Subpart breaks are shown as black dots. For classification, each program was refit to three new test images (left in image triplets), and the best-fitting parse (top right) is shown with its image reconstruction (bottom right) and classification score (log posterior predictive probability). The correctly matching test item receives a much higher classification score and is also more cleanly reconstructed by the best programs induced from the training item. (B) Nine human drawings of three characters (left) are shown with their ground-truth parses (middle) and best model parses (right).
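The classification rule behind these scores can be sketched in a few lines: each training character's induced programs are refit to the test image, and the training class with the highest approximate log posterior predictive score wins. The per-parse scores below are invented stand-ins, and collapsing the refit parses with a log-sum-exp is a simplification of the discrete approximation described in section S3:

```python
import math

# Toy illustration of one-shot classification by log posterior
# predictive score. The numbers are invented stand-ins for the
# refit parse scores of the kind shown in Fig. 4A.

def log_sum_exp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def classify(test_scores_per_class):
    """test_scores_per_class: {class: [log joint score of each refit parse]}.
    Approximates log P(I_test | I_train) by marginalizing over parses."""
    predictive = {c: log_sum_exp(s) for c, s in test_scores_per_class.items()}
    return max(predictive, key=predictive.get)

scores = {"train_A": [-505.0, -593.0], "train_B": [-646.0, -1794.0]}
assert classify(scores) == "train_A"
```

A higher score means at least one set of parts and relations explains both images well, which is exactly the intuition the caption describes.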

for each subpart. Last, parts are roughly positioned to begin either independently, at the beginning, at the end, or along previous parts, as defined by relation Ri (Fig. 3A, iv).

Character tokens θ(m) are produced by executing the parts and the relations and modeling how ink flows from the pen to the page. First, motor noise is added to the control points and the scale of the subparts to create token-level stroke trajectories S(m). Second, the trajectory's precise start location L(m) is sampled from the schematic provided by its relation Ri to previous strokes. Third, global transformations are sampled, including an affine warp A(m) and adaptive noise parameters that ease probabilistic inference (30). Last, a binary image I(m) is created by a stochastic rendering function, lining the stroke trajectories with grayscale ink and interpreting the pixel values as independent Bernoulli probabilities.

Posterior inference requires searching the large combinatorial space of programs that could have generated a raw image I(m). Our strategy uses fast bottom-up methods (31) to propose a range of candidate parses. The most promising candidates are refined by using continuous optimization and local search, forming a discrete approximation to the posterior distribution P(ψ, θ(m) | I(m)) (section S3). Figure 4A shows the set of discovered programs for a training image I(1) and how they are refit to different test images I(2) to compute a classification score log P(I(2) | I(1)) (the log posterior predictive probability), where higher scores indicate that they are more likely to belong to the same class. A high score is achieved when at least one set of parts and relations can successfully explain both the training and the test images, without violating the soft constraints of the learned within-class variability model. Figure 4B compares the model's best-scoring parses with the ground-truth human parses for several characters.

Results

People, BPL, and alternative models were compared side by side on five concept learning tasks that examine different forms of generalization from just one or a few examples (example task in Fig. 5). All behavioral experiments were run through Amazon's Mechanical Turk, and the experimental procedures are detailed in section S5. The main results are summarized by Fig. 6, and additional lesion analyses and controls are reported in section S6.

One-shot classification was evaluated through a series of within-alphabet classification tasks for 10 different alphabets. As illustrated in Fig. 1B, i, a single image of a new character was presented, and participants selected another example of that same character from a set of 20 distinct characters produced by a typical drawer of that alphabet. Performance is shown in Fig. 6A, where chance is 95% errors. As a baseline, the modified Hausdorff distance (32) was computed between centered images, producing 38.8% errors. People were skilled one-shot learners, achieving an average error rate of 4.5% (N = 40). BPL showed a similar error rate of 3.3%, achieving better performance than a deep convolutional network (convnet; 13.5% errors) and the HD model (34.8%)—each adapted from deep learning methods that have performed well on a range of computer vision tasks. A deep Siamese convolutional network optimized for this one-shot learning task achieved 8.0% errors (33), still about twice as high as humans or our model. BPL's advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here. BPL's other key ingredients also make positive contributions, as shown by higher error rates for BPL lesions without learning to learn (token-level only) or compositionality (11.0% and 14.0% errors, respectively). Learning to learn was studied separately at the type and token level by disrupting the learned hyperparameters of the generative model. Compositionality was evaluated by comparing BPL to a matched model that allowed just one spline-based stroke, resembling earlier analysis-by-synthesis models for handwritten characters that were similarly limited (34, 35).

The human capacity for one-shot learning is more than just classification. It can include a suite of abilities, such as generating new examples of a concept. We compared the creative outputs produced by humans and machines through "visual Turing tests," where naive human judges tried to identify the machine, given paired examples of human and machine behavior. In our most basic task, judges compared the drawings from nine humans asked to produce a new instance of a concept given one example with nine new examples drawn by BPL (Fig. 5). We evaluated each model based on the accuracy of the judges, which we call their identification (ID) level: Ideal model performance is 50% ID level, indicating that they cannot distinguish the model's behavior from humans; worst-case performance is 100%. Each judge (N = 147) completed 49 trials with blocked feedback, and judges were analyzed individually and in aggregate. The results are shown in Fig. 6B (new exemplars). Judges had only a 52% ID level on average for discriminating human versus BPL behavior. As a group, this performance was barely better than chance [t(47) = 2.03, P = 0.048], and only 3 of 48 judges had an ID level reliably above chance. Three lesioned models were evaluated by different groups of judges in separate

Fig. 5. Generating new exemplars. Humans and machines were given an image of a novel character (top) and asked to produce new exemplars. The nine-character grids in each pair that were generated by a machine are (by row) 1, 2; 2, 1; 1, 1.
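The modified Hausdorff baseline used for one-shot classification can be made concrete. The sketch below implements the symmetric "max of directed averages" form of the modified Hausdorff distance; representing each character as the set of coordinates of its inked pixels in a centered image follows the baseline's description, while the exact variant is defined in reference (32):

```python
import math

# Minimal modified Hausdorff distance between two point sets:
# the maximum of the two directed average nearest-neighbor distances.

def directed_avg(a_pts, b_pts):
    """Average distance from each point in a_pts to its nearest b_pts point."""
    return sum(min(math.dist(a, b) for b in b_pts) for a in a_pts) / len(a_pts)

def modified_hausdorff(a_pts, b_pts):
    return max(directed_avg(a_pts, b_pts), directed_avg(b_pts, a_pts))

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
shifted = [(0.5, 0), (0.5, 1), (1.5, 0), (1.5, 1)]
assert modified_hausdorff(square, square) == 0.0
assert abs(modified_hausdorff(square, shifted) - 0.5) < 1e-9
```

As a classifier, the test character is simply assigned to the training character with the smallest distance, which is why this baseline captures image similarity but none of the causal stroke structure that BPL models.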

conditions of the visual Turing test, examining the necessity of key model ingredients in BPL. Two lesions, learning to learn (token-level only) and compositionality, resulted in significantly easier Turing test tasks (80% ID level with 17 of 19 judges above chance and 65% with 14 of 26, respectively), indicating that this task is a nontrivial one to pass and that these two principles each contribute to BPL's human-like generative proficiency. To evaluate parsing more directly (Fig. 4B), we ran a dynamic version of this task with a different set of judges (N = 143), where each trial showed paired movies of a person and BPL drawing the same character. BPL performance on this visual Turing test was not perfect [59% average ID level; new exemplars (dynamic) in Fig. 6B], although randomizing the learned prior on stroke order and direction significantly raises the ID level (71%), showing the importance of capturing the right causal dynamics for BPL.

Although learning to learn new characters from 30 background alphabets proved effective, many human learners will have much less experience: perhaps familiarity with only one or a few alphabets, along with related drawing tasks. To see how the models perform with more limited experience, we retrained several of them by using two different subsets of only five background alphabets. BPL achieved similar performance for one-shot classification as with 30 alphabets (4.3% and 4.0% errors, for the two sets, respectively); in contrast, the deep convolutional net performed notably worse than before (24.0% and 22.3% errors). BPL performance on a visual Turing test of exemplar generation (N = 59) was also similar on the first set [52% average ID level that was not significantly different from chance, t(26) = 1.04, P > 0.05], with only 3 of 27 judges reliably above chance, although performance on the second set was slightly worse [57% ID level; t(31) = 4.35, P < 0.001; 7 of 32 judges reliably above chance]. These results suggest that although learning to learn is important for BPL's success, the model's structure allows it to take nearly full advantage of comparatively limited background training.

The human productive capacity goes beyond generating new examples of a given concept: People can also generate whole new concepts. We tested this by showing a few example characters from 1 of 10 foreign alphabets and asking participants to quickly create a new character that appears to belong to the same alphabet (Fig. 7A). The BPL model can capture this behavior by placing a nonparametric prior on the type level, which favors reusing strokes inferred from the example characters to produce stylistically consistent new characters (section S7). Human judges compared people versus BPL in a visual Turing test (N = 117), viewing a series of displays in the format of Fig. 7A, i and iii. The judges had only a 49% ID level on average [Fig. 6B, new concepts (from type)], which is not significantly different from chance [t(34) = 0.45, P > 0.05]. Individually, only 8 of 35 judges had an ID level significantly above chance. In contrast, a model with a lesion to (type-level) learning to learn was successfully detected by judges on 69% of trials in a separate condition of the visual Turing test, and was significantly easier to detect than BPL (18 of 25 judges above chance). Further comparisons in section S6 suggested that the model's ability to produce plausible novel characters, rather than stylistic consistency per se, was the crucial factor for passing this test. We also found greater variation in individual judges' comparisons of people and the BPL model on this task, as reflected in their ID levels: 10 of 35 judges had individual ID levels significantly below chance; in contrast, only two participants had below-chance ID levels for BPL across all the other experiments shown in Fig. 6B.

Last, judges (N = 124) compared people and models on an entirely free-form task of generating novel character concepts, unconstrained by a particular alphabet (Fig. 7B). Sampling from the prior distribution on character types P(ψ) in BPL led to an average ID level of 57% correct in a visual Turing test (11 of 32 judges above chance); with the nonparametric prior that reuses inferred parts from background characters, BPL achieved a 51% ID level [Fig. 7B and new concepts (unconstrained) in Fig. 6B; ID level not significantly different from chance, t(24) = 0.497, P > 0.05; 2 of 25 judges above chance]. A lesion analysis revealed that both compositionality (68% and 15 of 22) and learning to learn (64% and 22 of 45) were crucial in passing this test.

Fig. 6. Human and machine performance was compared on (A) one-shot classification and (B) four generative tasks. The creative outputs for humans and models were compared by the percent of human judges to correctly identify the machine. Ideal performance is 50%, where the machine is perfectly confusable with humans in these two-alternative forced choice tasks (pink dotted line). Bars show the mean ± SEM [N = 10 alphabets in (A)]. The no learning-to-learn lesion is applied at different levels (bars left to right): (A) token; (B) token, stroke order, type, and type. Models compared: People; BPL; BPL lesion (no learning to learn); BPL lesion (no compositionality); Deep Siamese Convnet (applicable only to the classification task in panel A); Deep Convnet; Hierarchical Deep.

Discussion

Despite a changing artificial intelligence landscape, people remain far better than machines at learning new concepts: They require fewer examples and use their concepts in richer ways. Our work suggests that the principles of compositionality, causality, and learning to learn will be critical in building machines that narrow this gap. Machine learning and computer vision researchers are beginning to explore methods based on simple program induction (36–41), and our results show that this approach can perform one-shot learning in classification tasks at human-level accuracy and fool most judges in visual Turing tests of its more creative abilities. For each visual Turing test, fewer than 25% of judges performed significantly better than chance.

Although successful on these tasks, BPL still sees less structure in visual concepts than people do. It lacks explicit knowledge of parallel lines, symmetry, optional elements such as cross bars in "7"s, and connections between the ends of strokes and other strokes. Moreover, people use their concepts for other abilities that were not studied here, including planning (42), explanation (43), communication (44), and conceptual combination (45). Probabilistic programs could capture these richer aspects of concept learning and use, but only with more abstract and complex structure than the programs studied here. More sophisticated programs could also be suitable for learning compositional, causal representations of many concepts beyond simple perceptual categories. Examples include concepts for physical artifacts, such as tools, vehicles, or furniture, that are well described by parts, relations, and the functions

these structures support; fractal structures, such as rivers and trees, where complexity arises from highly iterated but simple generative processes; and even abstract knowledge, such as natural number, natural language semantics, and intuitive physical theories (17, 46–48).

Capturing how people learn all these concepts at the level we reached with handwritten characters is a long-term goal. In the near term, applying our approach to other types of symbolic concepts may be particularly promising. Human cultures produce many such symbol systems, including gestures, dance moves, and the words of spoken and signed languages. As with characters, these concepts can be learned to some extent from one or a few examples, even before the symbolic meaning is clear: Consider seeing a "thumbs up," "fist bump," or "high five" or hearing the names "Boutros Boutros-Ghali," "Kofi Annan," or "Ban Ki-moon" for the first time. From this limited experience, people can typically recognize new examples and even produce a recognizable semblance of the concept themselves. The BPL principles of compositionality, causality, and learning to learn may help to explain how.

To illustrate how BPL applies in the domain of speech, programs for spoken words could be constructed by composing phonemes (subparts) systematically to form syllables (parts), which compose further to form morphemes and entire words. Given an abstract syllable-phoneme parse of a word, realistic speech tokens can be generated from a causal model that captures aspects of speech motor articulation. These part and subpart components are shared across the words in a language, enabling children to acquire them through a long-term learning-to-learn process. We have already found that a prototype model exploiting compositionality and learning to learn, but not causality, is able to capture some aspects of the human ability to learn and generalize new spoken words (e.g., English speakers learning words in Japanese) (49). Further progress may come from adopting a richer causal model of speech generation, in the spirit of classic "analysis-by-synthesis" proposals for speech perception and language comprehension (20, 50).

Although our work focused on adult learners, it raises natural developmental questions. If children learning to write acquire an inductive bias similar to what BPL constructs, the model could help explain why children find some characters difficult and which teaching procedures are most effective (51). Comparing children's parsing and generalization behavior at different stages of learning and BPL models given varying background experience could better evaluate the model's learning-to-learn mechanisms and suggest improvements. By testing our classification tasks on infants who categorize visually before they begin drawing or scribbling (52), we can ask whether children learn to perceive characters more causally and compositionally based on their own proto-writing experience. Causal representations are prewired in our current BPL models, but they could conceivably be constructed through learning to learn at an even deeper level of model hierarchy (53).

Last, we hope that our work may shed light on the neural representations of concepts and the development of more neurally grounded learning models. Supplementing feedforward visual processing (54), previous behavioral studies and our results suggest that people learn new handwritten characters in part by inferring abstract motor programs (55), a representation grounded in production yet active in purely perceptual tasks, independent of specific motor articulators and potentially driven by activity in premotor cortex (56–58). Could we decode representations structurally similar to those in BPL from brain imaging of premotor cortex (or other action-oriented regions) in humans perceiving and classifying new characters for the first time? Recent large-scale brain models (59) and deep recurrent neural networks (60–62) have also focused on character recognition and production tasks—but typically learning from large training samples with many examples of each concept. We see the one-shot learning capacities studied here as a challenge for these neural models: one we expect they might rise to by incorporating the principles of compositionality, causality, and learning to learn that BPL instantiates.

Fig. 7. Generating new concepts. (A) Humans and machines were given a novel alphabet (i) and asked to produce new characters for that alphabet. New machine-generated characters are shown in (ii). Human and machine productions can be compared in (iii). The four-character grids in each pair that were generated by the machine are (by row) 1, 1, 2; 1, 2. (B) Humans and machines produced new characters without a reference alphabet. The grids that were generated by a machine are 2; 1; 1; 2.

REFERENCES AND NOTES
1. B. Landau, L. B. Smith, S. S. Jones, Cogn. Dev. 3, 299–321 (1988).
2. E. M. Markman, Categorization and Naming in Children (MIT Press, Cambridge, MA, 1989).
3. F. Xu, J. B. Tenenbaum, Psychol. Rev. 114, 245–272 (2007).
4. S. Geman, E. Bienenstock, R. Doursat, Neural Comput. 4, 1–58 (1992).
5. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Proc. IEEE 86, 2278–2324 (1998).


6. G. E. Hinton et al., IEEE Signal Process. Mag. 29, 82–97 (2012).
7. A. Krizhevsky, I. Sutskever, G. E. Hinton, Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
8. Y. LeCun, Y. Bengio, G. Hinton, Nature 521, 436–444 (2015).
9. V. Mnih et al., Nature 518, 529–533 (2015).
10. J. Feldman, J. Math. Psychol. 41, 145–170 (1997).
11. I. Biederman, Psychol. Rev. 94, 115–147 (1987).
12. T. B. Ward, Cognit. Psychol. 27, 1–40 (1994).
13. A. Jern, C. Kemp, Cognit. Psychol. 66, 85–125 (2013).
14. L. G. Valiant, Commun. ACM 27, 1134–1142 (1984).
15. D. McAllester, in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), Madison, WI, 24 to 26 July 1998 (Association for Computing Machinery, New York, 1998), pp. 230–234.
16. V. N. Vapnik, IEEE Trans. Neural Netw. 10, 988–999 (1999).
17. N. D. Goodman, J. B. Tenenbaum, T. Gerstenberg, in Concepts: New Directions, E. Margolis, S. Laurence, Eds. (MIT Press, Cambridge, MA, 2015).
18. Z. Ghahramani, Nature 521, 452–459 (2015).
19. P. H. Winston, in The Psychology of Computer Vision, P. H. Winston, Ed. (McGraw-Hill, New York, 1975).
20. T. G. Bever, D. Poeppel, Biolinguistics 4, 174 (2010).
21. L. B. Smith, S. S. Jones, B. Landau, L. Gershkoff-Stowe, L. Samuelson, Psychol. Sci. 13, 13–19 (2002).
22. R. L. Goldstone, in Perceptual Organization in Vision: Behavioral and Neural Perspectives, R. Kimchi, M. Behrmann, C. Olson, Eds. (Lawrence Erlbaum, Mahwah, NJ, 2003), pp. 233–278.
23. H. F. Harlow, Psychol. Rev.
48. T. D. Ullman, A. Stuhlmüller, N. Goodman, J. B. Tenenbaum, in Proceedings of the 36th Annual Conference of the Cognitive Science Society, Quebec City, Canada, 23 to 26 July 2014 (Cognitive Science Society, Austin, TX, 2014), pp. 1640–1645.
49. B. M. Lake, C.-y. Lee, J. R. Glass, J. B. Tenenbaum, in Proceedings of the 36th Annual Conference of the Cognitive Science Society, Quebec City, Canada, 23 to 26 July 2014 (Cognitive Science Society, Austin, TX, 2014), pp. 803–808.
50. U. Neisser, Cognitive Psychology (Appleton-Century-Crofts, New York, 1966).
51. R. Treiman, B. Kessler, How Children Learn to Write Words (Oxford Univ. Press, New York, 2014).
52. A. L. Ferry, S. J. Hespos, S. R. Waxman, Child Dev. 81, 472–479 (2010).
53. N. D. Goodman, T. D. Ullman, J. B. Tenenbaum, Psychol. Rev. 118, 110–119 (2011).
54. S. Dehaene, Reading in the Brain (Penguin, New York, 2009).
55. M. K. Babcock, J. J. Freyd, Am. J. Psychol. 101, 111–130 (1988).
56. M. Longcamp, J. L. Anton, M. Roth, J. L. Velay, Neuroimage 19, 1492–1500 (2003).
57. K. H. James, I. Gauthier, Neuropsychologia 44, 2937–2949 (2006).
58. K. H. James, I. Gauthier, J. Exp. Psychol. Gen. 138, 416–431 (2009).
59. C. Eliasmith et al., Science 338, 1202–1205 (2012).
60. A. Graves, http://arxiv.org/abs/1308.0850 (2014).
61. … (International Machine Learning Society, Princeton, NJ, 2015), pp. 1462–1471.
62. J. Chung et al., Adv. Neural Inf. Process. Syst. 28 (2015).

ACKNOWLEDGMENTS
This work was supported by a NSF Graduate Research Fellowship to B.M.L.; the Center for Brains, Minds, and Machines funded by NSF Science and Technology Center award CCF-1231216; Army Research Office and Office of Naval Research contracts W911NF-08-1-0242, W911NF-13-1-2012, and N000141310333; the Natural Sciences and Engineering Research Council of Canada; the Canadian Institute for Advanced Research; and the Moore-Sloan Data Science Environment at NYU. We thank J. McClelland, T. Poggio, and L. Schulz for many valuable contributions and N. Kanwisher for helping us elucidate the three key principles. We are grateful to J. Gross and the Omniglot.com encyclopedia of writing systems for helping to make this data set possible. Online archives are available for visual Turing tests (http://github.com/brendenlake/visual-turing-tests), the Omniglot data set (http://github.com/brendenlake/omniglot), and BPL source code (http://github.com/brendenlake/BPL).

SUPPLEMENTARY MATERIALS
www.sciencemag.org/content/350/6266/1332/suppl/DC1
Materials and Methods
Supplementary Text
Figs. S1 to S11
56,51 65 (1949). References (63–80) 24. D. A. Braun, C. Mehring, D. M. Wolpert, Behav. Brain Res. 206, 61. K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, D. Wierstra, 157–165 (2010). in Proceedings of the International Conference on Machine 1 June 2015; accepted 15 October 2015 25. C. Kemp, A. Perfors, J. B. Tenenbaum, Dev. Sci. 10, 307–321 Learning (ICML), Lille, France, 6 to 11 July 2015 (International 10.1126/science.aab3050 (2007). 26. R. Salakhutdinov, J. Tenenbaum, A. Torralba, in JMLR Workshop and Conference Proceedings,vol.27,Unsupervised and Transfer Learning Workshop,I.Guyon,G.Dror,V.Lemaire,G.Taylor, PHYSICAL CHEMISTRY D.Silver,Eds.(Microtome,Brookline,MA,2012),pp.195–206. 27. J. Baxter, J. Artif. Intell. Res. 12, 149 (2000). 28. Y. LeCun et al., Neural Comput. 1, 541–551 (1989). Spectroscopic characterization of 29. R. Salakhutdinov, J. B. Tenenbaum, A. Torralba, IEEE Trans. Pattern Anal. Mach. Intell. 35, 1958–1971 (2013). 30. V. K. Mansinghka, T. D. Kulkarni, Y. N. Perov, J. B. Tenenbaum, isomerization transition states Adv. Neural Inf. Process. Syst. 26, 1520–1528 (2013). 31. K. Liu, Y. S. Huang, C. Y. Suen, IEEE Trans. Pattern Anal. Mach. 1 1 2 3 Intell. 21, 1095–1100 (1999). Joshua H. Baraban, * P. Bryan Changala, † Georg Ch. Mellau, John F. Stanton, 32. M.-P. Dubuisson, A. K. Jain, in Proceedings of the 12th IAPR Anthony J. Merer,4,5 Robert W. Field1‡ International Conference on Pattern Recognition, Vol. 1, Conference A: Computer Vision and Image Processing, Jerusalem, Israel, 9 to 13 October 1994 (IEEE, New York, Transition state theory is central to our understanding of chemical reaction dynamics. We 1994), pp. 566–568. demonstrate a method for extracting transition state energies and properties from a characteristic 33. G. Koch, R. S. Zemel, R. Salakhutdinov, paper presented at ICML pattern found in frequency-domain spectra of isomerizing systems. This pattern—a dip in the Deep Learning Workshop, Lille, France, 10 and 11 July 2015. — 34. M. 
Revow, C. K. I. Williams, G. E. Hinton, IEEE Trans. Pattern spacings of certain barrier-proximal vibrational levels can be understood using the concept eff Anal. Mach. Intell. 18, 592–606 (1996). of effective frequency, w .The method is applied to the cis-trans conformational change in the – 35. G. E. Hinton, V. Nair, Adv. Neural Inf. Process. Syst. 18, 515 522 S1 state of C2H2 and the bond-breaking HCN-HNC isomerization. In both cases, the barrier (2006). heights derived from spectroscopic data agree extremely well with previous ab initio 36. S.-C. Zhu, D. Mumford, Foundations Trends Comput. Graphics Vision 2, 259–362 (2006). calculations. We also show that it is possible to distinguish between vibrational modes that 37. P. Liang, M. I. Jordan, D. Klein, in Proceedings of the 27th are actively involved in the isomerization process and those that are passive bystanders. International Conference on Machine Learning, Haifa, Israel, 21 to 25 June 2010 (International Machine Learning Society, he central concept of the transition state in energy path from reactants to products has re- Princeton, NJ, 2010), pp. 639–646. 38. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, chemical kinetics is familiar to all students mained essentially unchanged. Most of chemical IEEE Trans. Pattern Anal. Mach. Intell. 32, 1627–1645 (2010). of chemistry. Since its inception by Arrhe- dynamics is now firmly based on this idea of the 39. I. Hwang, A. Stuhlmüller, N. D. Goodman, http://arxiv.org/abs/ nius (1) and later development into a full transition state, notwithstanding the emergence 1110.5667 (2011). T theory by Eyring, Wigner, Polanyi, and Evans of unconventional reactions such as roaming (6, 7), 40. E. Dechter, J. Malmaud, R. P. Adams, J. B. Tenenbaum, in 2–5 Proceedings of the 23rd International Joint Conference on ( ), the idea that the thermal rate depends where a photodissociated atom wanders before Artificial Intelligence, F. 
Rossi, Ed., Beijing, China, 3 to 9 August primarily on the highest point along the lowest- abstracting from the parent fragment. Despite the 2013 (AAAI Press/International Joint Conferences on Artificial clear importance of the transition state to the field Intelligence, Menlo Park, CA, 2013), pp. 1302–1309. 1 of chemistry, direct experimental studies of the 41. J. Rule, E. Dechter, J. B. Tenenbaum, in Proceedings of the 37th Department of Chemistry, Massachusetts Institute of 8 Annual Conference of the Cognitive Science Society, D. C. Noelle Technology, Cambridge, MA 02139, USA. 2Physikalisch- transition state and its properties are scarce ( ). et al., Eds., Pasadena, CA, 22 to 25 July 2015 (Cognitive Chemisches Institut, Justus-Liebig-Universität Giessen, Here, we report the observation of a vibrational Science Society, Austin, TX, 2015), pp. 2051–2056. D-35392 Giessen, Germany. 3Department of Chemistry and pattern, a dip in the trend of quantum level spac- – 42. L. W. Barsalou, Mem. Cognit. 11, 211 227 (1983). Biochemistry, University of Texas, Austin, TX 78712, USA. ings, which occurs at the energy of the saddle 43. J. J. Williams, T. Lombrozo, Cogn. Sci. 34,776–806 (2010). 4Department of Chemistry, University of British Columbia, 44. A. B. Markman, V. S. Makin, J. Exp. Psychol. Gen. 127, 331–354 Vancouver, BC V6T 1Z1, Canada. 5Institute of Atomic and point. This phenomenon is expected to provide a (1998). Molecular Sciences, Academia Sinica, Taipei 10617, Taiwan. generally applicable and accurate method for 45. D. N. Osherson, E. E. Smith, 9,35–58 (1981). *Present address: Department of Chemistry and Biochemistry, characterizing transition states. Only a subset of 46. S. T. Piantadosi, J. B. Tenenbaum, N. D. Goodman, Cognition University of Colorado, Boulder, CO 80309, USA. †Present address: vibrational states exhibit a dip; these states contain 123, 199–217 (2012). JILA, National Institute of Standards and Technology, and 47. G. A. Miller, P. N. 
Johnson-Laird, Language and Perception Department of Physics, University of Colorado, Boulder, CO 80309, excitation along the reaction coordinate and are (Belknap, Cambridge, MA, 1976). USA. ‡Corresponding author. E-mail: [email protected] barrier-proximal, meaning that they are more

Human-level concept learning through probabilistic program induction, Brenden M. Lake et al., Science 350 (issue 6266), 1332 (11 December 2015); DOI: 10.1126/science.aab3050.




Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright 2015 by the American Association for the Advancement of Science; all rights reserved. The title Science is a registered trademark of AAAS.