
CBMM: the Science and Engineering of Intelligence

The Center for Brains, Minds and Machines (CBMM) is a multi-institutional NSF Science and Technology Center dedicated to the study of intelligence: how the brain produces intelligent behavior and how we may be able to replicate intelligence in machines.

Disciplines: Cognitive Science, Neuroscience, Computer Science / Computational Research

Funding 2013-2023: ~$50M
Research Institutions: ~4
Educational Institutions: 12
Faculty (CS + BCS + …): ~23
Researchers: ~100
Science + Engineering Publications: ~500

CBMM's focus is the Science and the Engineering of Intelligence. We aim to make progress in understanding the greatest of all problems in science: the problem of intelligence. This means understanding how the brain makes the mind, how the brain works, and how to build intelligent machines. We believe that the science of intelligence will enable better engineering of intelligence in the long term.

Key recent advances in the engineering of intelligence have their roots in basic research on the brain.

The problem of intelligence is the greatest problem in science.

The CBMM bet (different from DeepMind's):

understand how the brain works, (then) make intelligent machines

CBMM Organizational Chart (future)

Director: Tomaso Poggio | EAC (External Advisory Committee)

Leadership roles: Managing Director, Deputy Director, Associate Director, Research Coordinator, Education Coordinator, Diversity Coordinator, Knowledge Transfer (KT) Director, Evaluation, Trainee Coordinator
Leadership team: Kathleen Sullivan (MIT), Lizanne DeStefano (GT), Gabriel Kreiman (HU), Matt Wilson (MIT), Ellen Hildreth (WC), Mandana Sassanfar (MIT), Boris Katz (MIT), Kenneth Blum (HU)

Research modules: Module 1: VISUAL STREAM | Module 2: BRAIN OS | Module 3: COGNITIVE CORE | Module 4: TOWARDS SYMBOLS
Module leaders: Tomaso Poggio, Jim DiCarlo, Gabriel Kreiman, Nancy Kanwisher, Joshua Tenenbaum, Boris Katz, Shimon Ullman
Support: Administrative Assistant, Technology Director

EAC, May 2020 | CBMM Participants

[Chart: number of CBMM participants in Years 1 through 7, by category: Total, Faculty, Postdocs, Staff/Other, Undergrads, Grad Students, Research Scientists]

EAC

Demis Hassabis (DeepMind), Charles Isbell, Jr. (Georgia Tech), Allen Institute, Fei-Fei Li (Stanford), Lore McGovern (MIBR, MIT), Joel Oppenheim (NYU), Pietro Perona (Caltech), Marc Raibert (Boston Dynamics), Judith Richter (Medinol), Kobi Richter (Medinol), Amnon Shashua (Mobileye), David Siegel (Two Sigma), Susan Whitehead (MIT Corporation), Jim Pallotta (The Raptor Group)

Research, Education & Diversity Partners

MIT: Boyden, Desimone, DiCarlo, Kaelbling, Kanwisher, Katz, McDermott, Oliva, Poggio, Roy, Sassanfar, Saxe, Schulz, Tegmark, Tenenbaum, Torralba, Ullman, Wilson
Harvard: Blum, Gershman, Kreiman, Livingstone, Sompolinsky, Spelke

Boston Children's Hospital: Kreiman | Harvard Medical School: Kreiman, Livingstone | Florida International U.: Finlayson | Howard U.: Chouika, Manaye, Rwebangira, Salmani

Hunter College: Chodorow, Epstein, Sakas, Zeigler | Johns Hopkins U.: Isik | Queens College: Brumberg | Rockefeller U.: Freiwald | Universidad Central del Caribe (UCC): Jorquera | University of Central Florida: McNair Program | Museum of Science, Boston | Stanford U.: Goodman

UMass Boston: Blaser, Ciaramitaro, Pomplun, Shukla | UPR-Mayagüez and UPR-Río Piedras: Garcia-Arraras, Maldonado-Vlaar, Santiago, Vega-Riveros, Megret, Ordóñez, Ortiz-Zuazaga | Wellesley College: Hildreth, Wiest, Wilmer

International and Corporate Partners

International partners: A*STAR (Chuan Poh Lim), U. Genova (Verri, Rosasco), Hebrew U. (Weiss), KAIST (Sangwan Lee), IIT (Cingolani), MPI (Bülthoff), Weizmann (Ullman)

Corporate partners: Google, IBM, Microsoft, Orcam, Siemens, Honda, Fujitsu, NVIDIA, Boston Dynamics, DeepMind, GE, Schlumberger, Mobileye, Intel

Videos: ~950 (May 2014 - April 2020)

(YouTube subscribers account for only 18% of viewers)
Ellen Hildreth

Diversity Program

Mandana Sassanfar

Code, Software and Datasets

ObjectNet: a new benchmark for object recognition (in prep.)
Andrei Barbu, David Mayo, Josh Tenenbaum, Boris Katz
Existing object detection benchmarks overstate the performance of machines and understate the performance of humans. We are creating a dataset that removes biases and shows that machines are far inferior to humans when detecting objects.

There's Waldo! A Normalization Model of Visual Search Predicts Single-Trial Human Fixations in an Object Search Task
Thomas Miconi, Laura Groomes and Gabriel Kreiman
Cerebral Cortex 2016. See more at: http://klab.tch.harvard.edu/resources/miconietal_visualsearch_2016.html

Partially Occluded Hands
B. Myanganbayar, C. Mata, G. Dekel, B. Katz, G. Ben-Yosef, A. Barbu
A dataset of RGB images of hands holding objects and interacting with objects. We measured human accuracy on reconstructing occluded portions of hands: people are extremely good at this task, while networks are at near chance-level performance.

EAC, May 2020 | Summer Course at Woods Hole: our flagship initiative

Brains, Minds & Machines Summer Course An intensive three-week course gives advanced students a “deep” introduction to the problem of intelligence

Directors

Ellen Hildreth, Kathleen Sullivan, Gabriel Kreiman, Lizanne DeStefano, Kris Brewer, Boris Katz, Kenny Blum

A self-reproducing community of scholars is being formed: over 300 applicants, ~30 accepted. Fellowships sponsored by GoogleX, the Hidary Foundation and Fujitsu.

CBMM Summer School

• Signature CBMM (Education/Knowledge Transfer) activity aimed at creating an intergenerational community around the science and engineering of intelligence. • Students reported strong influence of lectures, working on projects, and interactions among faculty, TA’s, and peers on their own thinking and research development.

EAC, May 2020 | Our vision and mission: understand how the brain works, (then) make intelligent machines

WHY? Recent success stories in AI are based on RL and DL; DL and RL come from neuroscience.

Minsky's SNARC

Vision for the BMM Summer School

We focus on the combination of neuroscience and engineering to make progress on the problem of intelligence because, as in the recent past, several of the next breakthroughs in ML and AI are likely to come from neuroscience AND engineering.

A quick recap of 40 of the last ~50 years of neuroscience and ML, through my eyes (1972-2013)

Tuebingen, MPI fuer BK (1972-1981)

Werner Reichardt's PhD

Werner with Dr. Ruska (center), photo dated Nov. 17, 1952 (courtesy B. Reichardt)

The four directors of the MPI fuer Biologische Kybernetik

The beautiful eyes of flies

Work at 3 levels

• Fixation and tracking behavior of the fly

• Motion algorithms and circuits: the beetle (and the fly); relative motion algorithms and circuits

• Biophysics of computation

Fixation and tracking behavior: Reichardt's closed-loop flight simulator

Fixation and tracking behavior


Poggio, T. and W. Reichardt. A Theory of Pattern Induced Flight Orientation of the Fly, Musca Domestica, Kybernetik, 12, 185-203, 1972.

Cognition in flies: probabilistic theories then (coming only now to humans)

The beginning of untethered flight analysis: Bülthoff, Poggio & Wehrhahn, Z. Naturforsch. 35c, 811-815 (1980)

▪ most behavioral fly research was done with the Götz torque meter ▪ in 1976, based on this recording technology, Reichardt & Poggio developed their theory for: Visual control of orientation behaviour in the fly, Part I +II. Quart. Rev. Biophysics 9(3), 311-375

▪ open question: how well does this theory describe fly behavior in natural flight?

▪ in 1980 Wehrhahn started high-speed film recording of flies chasing each other ▪ single-frame analysis ▪ 3D stereo reconstruction

Cognitive theory of basic fly instincts predicts trajectory of the chasing fly …

Wehrhahn, C., T. Poggio and H. Bülthoff, Biological Cybernetics, 45, 123-130, 1982. Cognition in flies


Geiger, G. and T. Poggio. The Muller-Lyer Figure and the Fly, Science, 190, 479-480, 1975. Work at 3 levels

• Fixation and tracking behavior of the fly (cognition in the fly…similar to Bayesian approach to cognition in humans…no neurons!)

• Motion algorithms and circuits: the beetle (and the fly); relative motion algorithms and circuits

• Biophysics of computation

Motion algorithm: the beetle Chlorophanus and Reichardt's motion detector

Motion algorithm: the beetle and the fly

• The beetle follows the motion. • Each photoreceptor sees only an alternation of dark and light: how is motion computed? • Reichardt and Hassenstein (and Peter Kunze) found the rules used by neural circuits. The algorithm (refined by D. Varju) explained many data sets: the Reichardt detector (a minimal code sketch appears after these bullets). • The same model describes motion perception in flies: beautiful papers on anatomy, optics and organization of motion perception by Braitenberg, Kirschfeld, Goetz.

• An equivalent ("energy") model (Adelson) describes motion cells in primate cortex. • A form of it has been used by Matsushita in the first chips to stabilize video cameras (see also Buelthoff, Little and Poggio, Nature, 1989).
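As a rough illustration of the correlation scheme just described (a minimal sketch, not Reichardt's original formulation; the low-pass time constant, sampling step and drifting-grating stimulus are made-up choices), a Hassenstein-Reichardt detector can be written as two mirror-symmetric delay-and-multiply half-detectors whose outputs are subtracted:

```python
import numpy as np

def reichardt_detector(left, right, dt=1.0, tau=5.0):
    """Minimal Hassenstein-Reichardt correlation detector (toy sketch).

    left, right: luminance signals from two adjacent photoreceptors,
    sampled every dt. tau is the time constant of the low-pass filter
    that plays the role of the asymmetric delay.
    Returns the opponent output: positive for left-to-right motion.
    """
    alpha = dt / (tau + dt)                    # first-order low-pass coefficient
    d_left = np.zeros_like(left)
    d_right = np.zeros_like(right)
    for t in range(1, len(left)):              # the "delayed" (filtered) channels
        d_left[t] = d_left[t-1] + alpha * (left[t] - d_left[t-1])
        d_right[t] = d_right[t-1] + alpha * (right[t] - d_right[t-1])
    # Each half-detector multiplies the delayed signal of one receptor with the
    # direct signal of its neighbour; subtracting the two gives direction opponency.
    return d_left * right - d_right * left

# Toy stimulus: a drifting sinusoid reaches the right receptor slightly later.
t = np.arange(200.0)
left = np.sin(0.2 * t)
right = np.sin(0.2 * t - 0.5)
print(np.mean(reichardt_detector(left, right)))   # > 0 : rightward motion
print(np.mean(reichardt_detector(right, left)))   # < 0 : leftward motion
```

The opponent subtraction is what makes the output signed: a stimulus drifting from the "left" receptor to the "right" one yields a positive mean response, and the reversed motion flips the sign, the behavioural signature inferred from the insect's optomotor response.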

Relative motion and figure-ground discrimination: the fly

Work by Werner Reichardt (with Poggio and Hausen, and later with M. Egelhaaf and A. Borst)

Motion discontinuities and figure-ground discrimination: neural circuitry

Towards the neural circuitry, Reichardt, Poggio, Hausen, 1983 Relative motion

Two of the neurons….

Hermann Cuntz, Jürgen Haag, and Alexander Borst, 2003

Work at 3 levels

• Fixation and tracking behavior of the fly

• Motion algorithms and circuits: the beetle (and the fly); relative motion algorithms and circuits (similar in spirit to HMAX and DiCarlo)
• Biophysics of computation

Biophysics of computation (motion detection)

Biophysics of Computation


Computational vision and regularization theory

Tomaso Poggio, Vincent Torre* & Christof Koch
Laboratory and Center for Biological Information Processing, Massachusetts Institute of Technology, 545 Technology Square, Cambridge, Massachusetts 02139, USA
* Istituto di Fisica, Universita di Genova, Genova, Italy

Descriptions of physical properties of visible surfaces, such as their distance and the presence of edges, must be recovered from the primary image data. Computational vision aims to understand how such descriptions can be obtained from inherently ambiguous and noisy data. A recent development in this field sees early vision as a set of ill-posed problems, which can be solved by the use of regularization methods. These lead to algorithms and parallel analog circuits that can solve 'ill-posed problems' and which are suggestive of neural equivalents in the brain.

COMPUTATIONAL vision denotes a new field in artificial intelligence, centred on theoretical studies of visual information processing. Its two main goals are to develop image understanding systems, which automatically construct scene descriptions from image input data, and to understand human vision. Early vision is the set of visual modules that aim to extract the physical properties of the surfaces around the viewer, that is, distance, surface orientation and material properties (reflectance, colour, texture). Much current research has analysed processes in early vision because the inputs and the goals of the computation can be well characterized at this stage (see refs 1-4 for reviews). Several problems have been solved and several specific algorithms have been successfully developed; examples are stereomatching, the computation of the optical flow, structure from motion, shape from shading and surface reconstruction. A new theoretical development has now emerged that unifies much of these results within a single framework. The approach has its roots in the recognition of a common structure of early vision problems: problems in early vision are 'ill-posed', requiring specific algorithms and parallel hardware. Here we introduce a specific regularization approach, and discuss its implications for computer vision and parallel computer architectures, including parallel hardware that could be used by biological visual systems.

Early vision processes
Early vision consists of a set of processes that recover physical properties of the visible three-dimensional surfaces from the two-dimensional intensity arrays. Their combined output roughly corresponds to Marr's 2-1/2D sketch and to Barrow and Tenenbaum's intrinsic images. Recently, it has been customary to assume that these early vision processes are general and do not require domain-dependent knowledge, but only generic constraints about the physical world and the imaging stage. They represent conceptually independent modules that can be studied, to a first approximation, in isolation. Information from the different processes, however, has to be combined; different modules may interact early on; and the processing cannot be purely 'bottom-up': specific knowledge may trickle down to the point of influencing some of the very first steps in visual information processing. Computational theories of early vision modules typically deal with the dual issues of representation and process: they must specify the form of the input and the desired output (the representation) and provide the algorithms that transform one into the other (the process). Here we focus on processes and algorithms, for which we describe the unifying theoretical framework of regularization theories.

A good definition of early vision is that it is inverse optics. In classical optics or in computer graphics the basic problem is to determine the images of three-dimensional objects, whereas vision is confronted with the inverse problem of recovering surfaces from images. Because so much information is lost during the imaging process that projects the three-dimensional world into two-dimensional images, vision must often rely on natural constraints, that is, assumptions about the physical world, to derive unambiguous output. Two illustrative problems are the computation of motion and the detection of sharp changes in image intensity (for detecting physical edges): local measurements along a smooth contour provide only the component of velocity normal to the contour, the tangential component remaining 'invisible' to purely local measurements, so the measurement of the optical flow is inherently ambiguous and can be made unique only by adding information or assumptions.

Examples of early vision processes: edge detection; spatio-temporal interpolation and approximation; computation of optical flow; computation of lightness and albedo; shape from contours; shape from texture; shape from shading; binocular stereo matching; structure from motion; structure from stereo; surface reconstruction; computation of surface colour.
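To make the regularization idea concrete, here is a minimal sketch (a toy example, not one of the paper's algorithms): an ill-posed 1-D "surface reconstruction" problem solved by minimizing a data term plus a quadratic smoothness stabilizer. The grid size, sampling pattern and value of lam are illustrative assumptions.

```python
import numpy as np

def reconstruct_profile(samples, n, lam=2.0):
    """Toy 1-D 'surface reconstruction' by standard (Tikhonov) regularization.

    samples: list of (grid_index, measured_depth) pairs (sparse, noisy data).
    n: number of grid points of the reconstructed profile z.
    Minimizes  sum_k (z[i_k] - d_k)^2  +  lam * ||D2 z||^2,
    i.e. a data term plus a smoothness stabilizer (D2 = second differences).
    """
    A = np.zeros((len(samples), n))
    d = np.zeros(len(samples))
    for row, (i, val) in enumerate(samples):
        A[row, i] = 1.0
        d[row] = val
    D2 = np.zeros((n - 2, n))                 # second-difference operator
    for i in range(n - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]
    # Normal equations of the regularized least-squares functional
    return np.linalg.solve(A.T @ A + lam * D2.T @ D2, A.T @ d)

# Toy usage: recover a smooth profile from 6 noisy samples on a 50-point grid.
rng = np.random.default_rng(1)
true = np.sin(np.linspace(0, np.pi, 50))
idx = [2, 10, 20, 30, 40, 47]
samples = [(i, true[i] + 0.05 * rng.standard_normal()) for i in idx]
z = reconstruct_profile(samples, 50)
print(np.round(z[::10], 2))                   # smooth interpolation through the data
```

Without the stabilizer the problem is underdetermined (most of the 50 unknowns are unconstrained by the 6 measurements); the smoothness term selects the unique smooth profile consistent with the data, which is exactly the role the regularization framework assigns to natural constraints.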

A synaptic mechanism possibly underlying directional selectivity to motion

By V. Torre† and T. Poggio‡
† Universita di Genova, Istituto di Fisica, Genoa, Italy
‡ Max-Planck-Institut für biologische Kybernetik, Tübingen, Germany

(Communicated by B. B. Boycott, F.R.S. - Received 1 February 1978)

A specific synaptic interaction is proposed as the mechanism underlying the directional selectivity to motion of several nervous cells. It is shown that the hypothesis is consistent with previous behavioural and physiological studies of the motion detection process.

Detection of movement is one of the most basic and elementary computations performed by visual systems. Hence it is not surprising that the mechanisms and principles underlying movement detection have been approached in various species with a variety of techniques, from behavioural analysis and psychophysics to physiology. Although several investigators have provided a wealth of information in the last years, the early analyses of Hassenstein & Reichardt (1956), Reichardt (1957, 1961), Barlow & Hill (1963), and Barlow & Levick (1965) still represent the extent of our understanding of this function. These studies are in many respects complementary. Those of Reichardt & Hassenstein are centred on the functional principles of movement detection as inferred from the average optomotor behaviour of a whole insect, whereas Barlow & Levick attack the problem of the neural circuitry underlying directional selectivity in the ganglion cells of a vertebrate retina. Figure 1a and b summarize the main conclusions of the two approaches. Both models postulate the existence of two types of channels (1 and 2, from two adjacent receptor regions) with different conduction properties. In figure 1a, channel 1 and channel 2 are low-pass filters with a short and a long time constant, respectively, while in figure 1b, channel 2 simply contains a delay. Perhaps the most significant contribution of Barlow & Levick consists of the experimental recognition that movement detection, at the level of direction selectivity of the ganglion cells, results primarily from an inhibitory mechanism that 'vetoes' the response to simultaneous signals from the receptors (after appropriate asymmetric delay) rather than from the detection of the conjunction of excitation from two regions (see figure 1). On the other hand, the main thrust of Hassenstein & Reichardt's analysis is the demonstration that the interaction underlying movement detection must be nonlinear and, in particular, of a multiplicative type. Many experimental data suggest that this is indeed the functional scheme underlying movement detection in insects (Poggio & Reichardt 1976).

Cooperative Computation of Stereo Disparity: a cooperative neural network for stereo
D. Marr and T. Poggio, Science, Vol. 194, No. 4262 (Oct. 15, 1976), pp. 283-287.
http://links.jstor.org/sici?sici=0036-8075%2819761015%293%3A194%3A4262%3C283%3ACCOSD%3E2.0.CO%3B2-1

~1979, T. Poggio and D. Marr, MPI, Tuebingen

Vision: what is where

Vision A Computational Investigation into the Human Representation and Processing of Visual Information

Foreword by Shimon Ullman; Afterword by Tomaso Poggio

David Marr's posthumously published Vision (1982) influenced a generation of brain and cognitive scientists, inspiring many to enter the field. In Vision, Marr describes a general framework for understanding visual perception and touches on broader questions about how the brain and its functions can be studied and understood. Researchers from a range of brain and cognitive sciences have long valued Marr's creativity, intellectual power, and ability to integrate insights and data from neuroscience, psychology, and computation. This MIT Press edition makes Marr's influential work available to a new generation of students and scientists.

In Marr's framework, the process of vision constructs a set of representations, starting from a description of the input image and culminating with a description of three-dimensional objects in the surrounding environment. A central theme, and one that has had far-reaching influence in both neuroscience and cognitive science, is the notion of different levels of analysis: in Marr's framework, the computational level, the algorithmic level, and the hardware implementation level.

Now, thirty years later, the main problems that occupied Marr remain fundamental open problems in the study of perception, and Vision continues to provide inspiration for work on them.

A complex system must be understood at several different levels

Werner Reichardt’s scientific legacy: Integrative Neuroscience

• Marr’s book Vision (Marr, 1982) had a great impact on computational neuroscience: a system as complex as the brain must be understood at several different levels: — computation — algorithms — biophysics and circuits

• The argument came from “From Understanding Computation to Understanding Neural Circuits”, Marr and Poggio, 1977…

• …part of which comes from Reichardt and Poggio, 1976 (Q. Rev. Biophysics, Part I)…

• …which is a follow-up of Werner's argument for starting the Max-Planck-Institut fuer Biologische Kybernetik!

MIT (1981- )

43rd Stated Meeting of the NRP Associates, March 14-17, 1982

\min_{f \in \mathcal{H}} \left[ \frac{1}{\ell} \sum_{i=1}^{\ell} V\big(y_i, f(x_i)\big) + \mu \, \|f\|_K^2 \right]

Predictive regularization algorithms; learning theory + algorithms
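For concreteness, here is a minimal regularized least-squares learner that minimizes the functional above with the square loss V(y, f(x)) = (y - f(x))^2 and a Gaussian kernel; the kernel choice and the values of mu and sigma are illustrative assumptions, not values from the slide.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel matrix between two sets of points."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-d2 / (2.0 * sigma**2))

def rls_train(X, y, mu=0.01, sigma=1.0):
    """Regularized least squares: with the square loss, the minimizer of
    (1/l) sum_i V(y_i, f(x_i)) + mu ||f||_K^2 is, by the representer theorem,
    f(x) = sum_i c_i K(x, x_i) with c = (K + mu*l*I)^{-1} y."""
    l = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + mu * l * np.eye(l), y)

def rls_predict(X_train, c, X_test, sigma=1.0):
    return gaussian_kernel(X_test, X_train, sigma) @ c

# Toy usage: learn a noisy sine from 50 examples.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
c = rls_train(X, y, mu=0.01, sigma=0.7)
X_test = np.linspace(-3, 3, 5)[:, None]
print(np.round(rls_predict(X, c, X_test, sigma=0.7), 2))
```

Training reduces to a single linear solve; mu trades data fit against the smoothness of f, playing the same stabilizer role as the regularizers in the early-vision work above.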

Theorems on foundations of learning:

ENGINEERING APPLICATIONS:
• Bioinformatics
• Computer vision
• Computer graphics, speech synthesis, creating a virtual actor

Computational Neuroscience (models + experiments): how visual cortex works, and how it may suggest better computer vision systems

BULLETIN (New Series) OF THE AMERICAN MATHEMATICAL SOCIETY, Volume 39, Number 1, Pages 1-49, S0273-0979(01)00923-5, Article electronically published on October 5, 2001

ON THE MATHEMATICAL FOUNDATIONS OF LEARNING

FELIPE CUCKER AND STEVE SMALE

"The problem of learning is arguably at the very core of the problem of intelligence, both biological and artificial." (T. Poggio and C.R. Shelton)

Introduction
(1) A main theme of this report is the relationship of approximation to learning and the primary role of sampling (inductive inference). We try to emphasize relations of the theory of learning to the mainstream of mathematics. In particular, there are large roles for probability theory, for algorithms such as least squares, and for tools and ideas from linear algebra and linear analysis. An advantage of doing this is that communication is facilitated and the power of core mathematics is more easily brought to bear. We illustrate what we mean by learning theory by giving some instances. (a) The understanding of language acquisition by children or the emergence of languages in early human cultures. (b) In Manufacturing Engineering, the design of a new wave of machines is anticipated which uses sensors to sample properties of objects before, during, and after treatment. The information gathered from these samples is to be analyzed by the machine to decide how to better deal with new input objects (see [43]). (c) Pattern recognition of objects ranging from handwritten letters of the alphabet to pictures of animals, to the human voice. Understanding the laws of learning plays a large role in disciplines such as (Cognitive) Psychology, Animal Behavior, Economic Decision Making, all branches of Engineering, Computer Science, and especially the study of human thought processes (how the brain works). Mathematics has already played a big role towards the goal of giving a universal foundation of studies in these disciplines. We mention as examples the theory of Neural Networks going back to McCulloch and Pitts [25] and Minsky and Papert [27], the PAC learning of Valiant [40], Statistical Learning Theory as developed by Vapnik [42], and the use of reproducing kernels as in [17] among many other mathematical developments. We are heavily indebted to these developments. Recent discussions with a number of mathematicians have also been helpful.

Received by the editors April 2000, and in revised form June 1, 2001. 2000 Mathematics Subject Classification. Primary 68T05, 68P30. This work has been substantially funded by CERG grant No. 9040457 and City University grant No. 8780043.

© 2001 American Mathematical Society

letters to nature

General conditions for predictivity in learning theory
Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee & Partha Niyogi
Center for Biological and Computational Learning, McGovern Institute, Computer Science and Artificial Intelligence Laboratory, Brain Sciences Department, MIT; Departments of Computer Science and Statistics, University of Chicago; Center for Genome Research/Whitehead Institute, MIT; Honda Research Institute USA

Developing theoretical foundations for learning is a key step towards understanding intelligence. 'Learning from examples' is a paradigm in which systems (natural or artificial) learn a functional relationship from a training set of examples. Within this paradigm, a learning algorithm is a map from the space of training sets to the hypothesis space of possible functional solutions. A central question for the theory is to determine conditions under which a learning algorithm will generalize from its finite training set to novel examples. A milestone in learning theory was a characterization of conditions on the hypothesis space that ensure generalization for the natural class of empirical risk minimization (ERM) learning algorithms that are based on minimizing the error on the training set. Here we provide conditions for generalization in terms of a precise stability property of the learning process: when the training set is perturbed by deleting one example, the learned hypothesis does not change much. This stability property stipulates conditions on the learning map rather than on the hypothesis space, subsumes the classical theory for ERM algorithms, and is applicable to a broader class of learning algorithms.

What we assume in the above examples is a machine that is trained, instead of programmed, to perform a task, given data of the form S = (x_i, y_i), i = 1, …, n. Training means synthesizing a function that best represents the relation between the inputs x_i and the corresponding outputs y_i. The basic requirement for any learning algorithm is generalization: the performance on the training examples (empirical error) must be a good indicator of the performance on future examples (expected error), that is, the difference between the two must be 'small' (see Box 1 of the paper for formal definitions: convergence in probability, training data, learning algorithms). Probably the most natural learning algorithm is ERM: the algorithm 'looks' at the training set S and selects as the estimated function the one that minimizes the empirical error (training error) over the functions contained in a hypothesis space of candidate functions.
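A hedged toy check of the stability notion in the abstract (linear regularized least squares on synthetic data; not the paper's analysis or experiments): delete one training example at a time and measure how much the learned function changes on a set of probe points.

```python
import numpy as np

def rls_fit(X, y, mu):
    """Linear regularized least squares: w = (X^T X + mu*l*I)^{-1} X^T y."""
    l, d = X.shape
    return np.linalg.solve(X.T @ X + mu * l * np.eye(d), X.T @ y)

def leave_one_out_stability(X, y, mu, X_probe):
    """Largest change of the learned function on probe points when a single
    training example is deleted (a toy stand-in for the stability property)."""
    f_all = X_probe @ rls_fit(X, y, mu)
    worst = 0.0
    for i in range(len(y)):
        f_i = X_probe @ rls_fit(np.delete(X, i, axis=0), np.delete(y, i), mu)
        worst = max(worst, float(np.max(np.abs(f_all - f_i))))
    return worst

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(40)
X_probe = rng.standard_normal((100, 3))
for mu in (1e-4, 1e-2, 1.0):
    # Stronger regularization -> the hypothesis changes less when one example is removed.
    print(mu, leave_one_out_stability(X, y, mu, X_probe))
```

Increasing mu makes the learning map more stable under deletion of a single example, which is the property the paper links to generalization.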

Why do hierarchical architectures work?

~15-year-old CBCL computer vision research: face detection, on the market since 2006 (digital cameras...)

• Training database: 1,000+ real and 3,000+ virtual face patterns; 50,000+ non-face patterns

Sung & Poggio 1995

Moore-like law for ML (1995-2018)

Third Annual NSF Site Visit, June 8-9, 2016 | Vision: what is where

• Human Brain: 10^10-10^11 neurons (~1 million flies), 10^14-10^15 synapses

• Ventral stream in rhesus monkey: ~10^9 neurons in the ventral stream (350 × 10^6 in each hemisphere), ~15 × 10^6 neurons in AIT (Anterior InferoTemporal) cortex

Van Essen & Anderson, 1990 Vision: ventral stream

The ventral stream hierarchy: V1, V2, V4, IT A gradual increase in the receptive field size, in the “complexity” of the preferred stimulus, in “invariance” to position and scale changes

Kobatake & Tanaka, 1994 Cognition in people

Shape representation in the inferior temporal cortex of monkeys
Nikos K. Logothetis*, Jon Pauls* and Tomaso Poggio†
*Division of Neuroscience, Baylor College of Medicine, One Baylor Plaza, Houston, Texas 77030, USA. †Center for Computational and Biological Learning, and Department of Brain Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.

Background: The inferior temporal cortex (IT) of the monkey has long been known to play an essential role in visual object recognition. Damage to this area results in severe deficits in perceptual learning and object recognition, without significantly affecting basic visual capacities. Consistent with these ablation studies is the discovery of IT neurons that respond to complex two-dimensional visual patterns, or objects such as faces or body parts. What is the role of these neurons in object recognition? Is such a complex configurational selectivity specific to biologically meaningful objects, or does it develop as a result of extensive exposure to any objects whose identification relies on subtle shape differences? If so, would IT neurons respond selectively to recently learned views or features of novel objects? The present study addresses this question by using combined psychophysical and electrophysiological experiments, in which monkeys learned to classify and recognize computer-generated three-dimensional objects.

Results: A population of IT neurons was found that responded selectively to views of previously unfamiliar objects. The cells discharged maximally to one view of an object, and their response declined gradually as the object was rotated away from this preferred view. No selective responses were ever encountered for views that the animal systematically failed to recognize. Most neurons also exhibited orientation-dependent responses during view-plane rotations. Some neurons were found to be tuned around two views of the same object, and a very small number of cells responded in a view-invariant manner. For the five different objects that were used extensively during the training of the animals, and for which behavioral performance became view-independent, multiple cells were found that were tuned around different views of the same object. A number of view-selective units showed response invariance for changes in the size of the object or the position of its image within the parafovea.

Conclusion: Our results suggest that IT neurons can develop a complex receptive field organization as a consequence of extensive training in the discrimination and recognition of objects. None of these objects had any prior meaning for the animal, nor did they resemble anything familiar in the monkey's environment. Simple geometric features did not appear to account for the neurons' selective responses. These findings support the idea that a population of neurons - each tuned to a different object aspect, and each showing a certain degree of invariance to image transformations - may, as an ensemble, encode at least some types of complex three-dimensional objects. In such a system, several neurons may be active for any given vantage point, with a single unit acting like a blurred template for a limited neighborhood of a single view.

Current Biology 1995, 5:552-563

Background

Object recognition can be thought of as the process of matching the image of an object to its representation stored in memory. Because different viewing, illumination and context conditions generate different retinal images, understanding the nature of the stored representation and the process by which sensory input is normalized is one of the greatest challenges in research on visual object recognition. It is well known that familiar objects are recognized regardless of viewing angle, scale or position in the visual field. How is such perceptual object constancy accomplished? Does the brain transform the sensory or stored representation to discard the image variability resulting from different viewing conditions, or does generalization occur as a consequence of perceptual learning, that is, of being acquainted with different instances of any given object?

Most theories which postulate that transformations of an image representation precede matching assume either a complete three-dimensional description of an object [1], or a structural description of the image that specifies the relationships among viewpoint-invariant volumetric primitives [2,3]. In such theories, the locations are specified in a coordinate system defined by the viewed object. In contrast, theories assuming perceptual learning are viewer-centered, postulating that three-dimensional objects are modelled as a set of familiar two-dimensional views, or aspects, and that recognition consists of matching image features against the views held in this set.

Whereas object-centered theories correctly predict the view-independent recognition of familiar objects [3], they fail to account for performance in recognition tasks with certain types of novel objects [4-8]. Viewer-centered models, on the other hand, which can account for the performance of human subjects in any recognition task, are usually considered implausible because of the amount of memory a system would require to store all discriminable views of many objects. These objections, however, have recently been challenged by computer…

Correspondence to: Nikos K. Logothetis. E-mail address: [email protected]

Model's early predictions: neurons become view-tuned during recognition

Logothetis, Pauls, and Poggio, 1995; Logothetis, Pauls, 1995

9.520, spring 2003 Poggio, Edelman, Riesenhuber (1990, 2000) A model of the ventral stream in visual cortex (starting from work with Buelthoff and Logothetis)

Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva & Poggio 2007
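The models cited above alternate template matching ("simple", S) and max pooling ("complex", C) stages. The toy sketch below illustrates only that S/C motif (random templates, made-up sizes and pooling range, not the published architecture), showing how the max operation buys tolerance to position shifts:

```python
import numpy as np

def s_layer(image, templates, sigma=3.0):
    """S ("simple") stage: Gaussian-tuned template matching at every position."""
    K, th, tw = templates.shape
    H, W = image.shape
    maps = np.zeros((K, H - th + 1, W - tw + 1))
    for k in range(K):
        for i in range(maps.shape[1]):
            for j in range(maps.shape[2]):
                patch = image[i:i + th, j:j + tw]
                maps[k, i, j] = np.exp(-np.sum((patch - templates[k])**2) / (2 * sigma**2))
    return maps

def c_layer(s_maps, pool=4):
    """C ("complex") stage: max pooling over a neighbourhood of positions.
    Taking the max makes the response tolerant to where the preferred feature appears."""
    K, H, W = s_maps.shape
    out = np.zeros((K, H // pool, W // pool))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            out[:, i, j] = s_maps[:, i*pool:(i+1)*pool, j*pool:(j+1)*pool].max(axis=(1, 2))
    return out

# Toy check: shift the image by 2 pixels and compare response changes before
# and after pooling (like-for-like, both stages produce values in (0, 1]).
rng = np.random.default_rng(0)
templates = rng.standard_normal((4, 3, 3))
img = rng.standard_normal((16, 16))
shifted = np.roll(img, 2, axis=1)
s1, s2 = s_layer(img, templates), s_layer(shifted, templates)
print("S-stage mean |change|:", np.abs(s1 - s2).mean())
print("C-stage mean |change|:", np.abs(c_layer(s1) - c_layer(s2)).mean())  # typically smaller
```

Stacking several such S/C pairs, with the C outputs of one stage feeding the templates of the next, is what gradually increases receptive field size and invariance along the model hierarchy.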

Psychophysics of rapid categorization

Rapid categorization task (with a mask to test the feedforward model); image database collected by Oliva & Torralba. Stimulus sequence: image for 20 ms, blank interval (ISI) of 30 ms, 1/f-noise mask for 80 ms; the subject reports whether an animal is present or not.

Thorpe et al. 1996; Van Rullen & Koch 2003; Bacon-Mace et al. 2005

Feedforward models "predict" rapid categorization (82% model vs. 80% humans)

Decoding the neural code: matrix-like read-out from the brain. Agreement of model with IT readout data. Reading out category and identity invariant to position and scale.

Hung Kreiman Poggio DiCarlo 2005
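A minimal sketch of what a matrix-like linear read-out means in practice: a regularized linear classifier applied to a trials-by-neurons response matrix. The population responses below are synthetic and built to be position-invariant, so this illustrates only the decoding scheme, not the published recordings or numbers:

```python
import numpy as np

def train_linear_readout(R, labels, lam=1e-2):
    """One-vs-all regularized least-squares readout from a response matrix R
    (trials x neurons); a generic linear classifier, not the paper's exact one."""
    n_classes = int(labels.max()) + 1
    Y = -np.ones((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0          # +/-1 target coding
    Rb = np.hstack([R, np.ones((len(R), 1))])        # append a bias column
    return np.linalg.solve(Rb.T @ Rb + lam * np.eye(Rb.shape[1]), Rb.T @ Y)

def decode(R, W):
    Rb = np.hstack([R, np.ones((len(R), 1))])
    return np.argmax(Rb @ W, axis=1)

# Synthetic "population": each category evokes a noisy pattern that is, by
# construction, identical at two retinal positions (an invariant code).
rng = np.random.default_rng(0)
n_neurons, n_classes, trials = 60, 4, 200
prototypes = rng.standard_normal((n_classes, n_neurons))
labels = rng.integers(0, n_classes, trials)
position = rng.integers(0, 2, trials)
R = prototypes[labels] + 0.5 * rng.standard_normal((trials, n_neurons))

train, test = position == 0, position == 1           # train at one position only
W = train_linear_readout(R[train], labels[train])
print("accuracy at the untrained position:", np.mean(decode(R[test], W) == labels[test]))
```

Because the simulated population code does not change with position, a readout trained at one position transfers to the other, which is the kind of invariant decoding reported from IT populations.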

Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005 ……… in 2013……