On the Prospects for Building a Working Model of the Visual Cortex
Thomas Dean, Glenn Carroll and Richard Washington
Brown University, Providence, RI and Google Inc., Mountain View, CA
[email protected], {gcarroll,rwashington}@google.com

Abstract

Human visual capability has remained largely beyond the reach of engineered systems despite intensive study and considerable progress in problem understanding, algorithms and computing power. We posit that significant progress can be made by combining existing technologies from computer vision, ideas from theoretical neuroscience and the availability of large-scale computing power for experimentation. From a theoretical standpoint, our primary point of departure from current practice is our reliance on exploiting time in order to turn an otherwise intractable unsupervised problem into a locally semi-supervised, and plausibly tractable, learning problem. From a pragmatic perspective, our system architecture follows what we know of cortical neuroanatomy and provides a solid foundation for scalable hierarchical inference. This combination of features promises to provide a range of robust object-recognition capabilities.

Copyright © 2007, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In July of 2005, one of us (Dean) presented a paper at AAAI entitled "A Computational Model of the Cerebral Cortex" (Dean 2005). The paper described a graphical model of the visual cortex inspired by David Mumford's computational architecture (1991; 1992; 2003). At that same meeting, Jeff Hawkins gave an invited talk entitled "From AI Winter to AI Spring: Can a New Theory of Neocortex Lead to Truly Intelligent Machines?" drawing on the content of his popular book On Intelligence (2004). A month later at IJCAI, Geoff Hinton gave his Research Excellence Award lecture entitled "What kind of a graphical model is the brain?"

In all three cases, the visual cortex is cast in terms of a generative model in which ensembles of neurons are modeled as latent variables. All three of the speakers were optimistic regarding the prospects for realizing useful, biologically inspired systems. In the intervening two years, we have learned a great deal about both the provenance and the practical value of those ideas. The history of the most important of these is both interesting and helpful in understanding whether these ideas are likely to yield progress on some of the most significant challenges facing AI.

Are we poised to make a significant leap forward in understanding computer and biological vision? If so, what are the main ideas that will fuel this leap forward, and are they new or recycled? How important is the role of Moore's law in our pursuit of human-level perception? The full story delves into the role of time, hierarchy, abstraction, complexity, symbolic reasoning, and unsupervised learning. It owes much to insights that have been discovered and forgotten at least once every other generation for many decades.

Since J.J. Gibson (1950) presented his theory of ecological optics, scientists have followed his lead by trying to explain perception in terms of the invariants that organisms learn. Peter Földiák (1991) and, more recently, Wiskott and Sejnowski (2002) suggest that we learn invariances from temporal input sequences by exploiting the fact that sensory input tends to vary quickly while the environment we wish to model changes gradually. Gestalt psychologists and psychophysicists have long studied spatial and temporal grouping of visual objects and the way in which the two operations interact (Kubovy & Gepshtein 2003), and there is certainly ample evidence to suggest concrete algorithms.

The idea of hierarchy plays a central role in so many disciplines that it is fruitless to trace its origins. Even as Hubel and Wiesel unraveled the first layers of the visual cortex, they couldn't help but posit a hierarchy of representations of increasing complexity (1968). However, direct evidence for such hierarchical organization is slim to this day, and machine vision has yet to actually learn hierarchies of more than a couple of layers despite compelling arguments for their utility (Ullman & Soloviev 1999).

Horace Barlow (1961) pointed out more than forty years ago that strategies for coding visual information should take advantage of the statistical regularities in natural images. This idea is the foundation for work on sparse representations in machine learning and computational neuroscience (Olshausen & Field 1996).

By the time information reaches the primary visual cortex (V1), it has already gone through several stages of processing in the retina and lateral geniculate. Following the lead of Hubel and Wiesel, many scientists believe that the output of V1 can be modeled as a tuned band-pass filter bank. The component features, or basis, for this representation are called Gabor filters — mathematically, a Gabor filter is a two-dimensional Gaussian modulated by a complex sinusoid — and are tuned to respond to oriented dark bars against a light background (or, alternatively, light bars against a dark background). The story of why scientists came to this conclusion is fascinating, but the conclusion may have been premature; more recent work suggests that Gabor filters account for only about 20% of the variance observed in the output of V1 (Olshausen & Field 2005).

We're sure that temporal and spatial invariants, hierarchy, levels of increasingly abstract features, unsupervised learning of image statistics and other core ideas that have been floating around for decades must be part of the answer, but machine vision still falls far short of human capability in most respects. Are we on the threshold of a breakthrough and, if so, what will push us through the final barriers?

Temporal and Hierarchical Structure

The primate cortex serves many functions, and most scientists would agree that we've discovered only a small fraction of its secrets. Let's suppose that our goal is to build a computational model of the ventral visual pathway, the neural circuitry that appears to be largely responsible for recognizing what objects are present in our visual field. A successful model would, among other things, allow us to create a video search platform with the same quality and scope that Google and Yahoo! provide for web pages. Do we have the pieces in place to succeed in the next two to five years?

In many areas of science and engineering, time is so integral to the description of the central problem that it can't be ignored. Certainly this is the case in speech understanding, automated control, and most areas of signal processing. By contrast, in most areas of machine vision, time has been considered a distraction, a complicating factor that we can safely ignore until we've figured out how to interpret static images. The prevailing wisdom is that time will only make the problem of recognizing objects and understanding scenes more difficult. A similar assumption has influenced much of the work in computational neuroscience, but now that assumption is being challenged. There are a number of proposals that suggest time is an essential ingredient in explaining human perception (Földiák 1991; Dean 2006; Hawkins & George 2006; Wiskott & Sejnowski 2002). The common theme uniting these proposals is that the perceptual sequences we experience provide essential cues that we exploit to make critical discriminations. Consecutive samples in an audio recording or frames in a video sequence are likely to be examples of the same pattern undergoing changes in illumination, position, orientation, etc.

... functions. Kearns and Valiant showed that the problem of learning polynomial-size circuits (in Valiant's probably approximately correct learning model) is infeasible given plausible cryptographic limitations (1989). However, if we are provided access to the inputs and outputs of the circuit's internal subconcepts, then the problem becomes tractable (Rivest & Sloan 1994). This implies that if we had the "circuit diagram" for the visual cortex and could obtain labeled data, inputs and outputs, for each component feature, robust machine vision might become feasible. Instead of labels, we have knowledge — based on millennia of experience summarized in our genes — enabling us to transform an otherwise intractable unsupervised learning problem (one in which the training data is unlabeled) into a tractable semi-supervised problem (one in which we can assume that consecutive samples in a time series are more likely than not to have the same label).

This property, called temporal coherence, is the basis for the optimism of several researchers that they can succeed where others have failed.1 If we're going to make progress, temporal coherence has to provide some significant leverage. It is important to realize, however, that exploiting temporal coherence does not completely eliminate complexity in learning hierarchical representations. Knowing that consecutive samples are likely to have the same label is helpful, but we are still left with the task of segmenting the time series into subsequences having the same label, a problem related to learning HMMs (Freund & Ron 1995).

Learning Invariant Features

It's hard to come up with a trick that nature hasn't already discovered. And, while nature is reluctant to reveal its tricks, machine vision researchers have, over the decades, come up with their own. Whether or not we have found neural analogs for the most powerful of these, a pragmatic attitude dictates adapting and adopting them where possible. David Lowe (2004) has developed an effective algorithm for extracting image features called SIFT (for scale-invariant feature transform). The algorithm involves searching in scale space for features in the form of small image patches that can be reliably recovered in novel images. These features are used like words in a dictionary to categorize images.
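To make the V1 filter-bank picture discussed earlier concrete, the Gabor filter (a two-dimensional Gaussian modulated by a sinusoid) can be written down in a few lines. The sketch below is ours, not code from any of the cited work: it keeps only the real, cosine part of the complex sinusoid, uses an isotropic Gaussian envelope, and the parameter values (size, sigma, orientation, wavelength) are illustrative choices.

```python
import numpy as np

def gabor_kernel(size, sigma, theta, wavelength, phase=0.0):
    """A 2-D Gaussian envelope modulated by a cosine carrier oriented
    at angle `theta` (radians); the real part of a complex Gabor filter."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the carrier varies along orientation theta.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + y_t ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength + phase)
    return envelope * carrier

kernel = gabor_kernel(size=21, sigma=4.0, theta=np.pi / 4, wavelength=10.0)

# The filter's response to an image patch of the same size is the
# elementwise product summed (i.e., correlation at one position).
patch = np.ones((21, 21))
response = float(np.sum(kernel * patch))
```

Such a kernel responds most strongly to a light bar at the filter's orientation against a dark background; negating it gives the dark-bar-on-light variant mentioned above.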
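The segmentation task raised in the Temporal and Hierarchical Structure section can also be illustrated with a toy sketch. This is entirely our own construction, not a method from the paper: the frame features, the fixed distance threshold standing in for a learned boundary detector, and the seed labels are all invented. It shows the two steps temporal coherence licenses: cut the time series where consecutive frames differ sharply, then propagate one seed label per segment to every frame in it.

```python
import numpy as np

def segment_by_coherence(frames, threshold):
    """Start a new segment wherever consecutive frame features differ by
    more than `threshold`; within a segment, frames are presumed to share
    a label (the temporal-coherence assumption)."""
    boundaries = [0]
    for t in range(1, len(frames)):
        if np.linalg.norm(frames[t] - frames[t - 1]) > threshold:
            boundaries.append(t)
    return boundaries

def propagate_labels(n_frames, boundaries, seed_labels):
    """Give every frame in a segment the label of that segment's single
    seed: the semi-supervised step."""
    labels = [None] * n_frames
    segments = list(zip(boundaries, boundaries[1:] + [n_frames]))
    for (start, end), label in zip(segments, seed_labels):
        for t in range(start, end):
            labels[t] = label
    return labels

# Two synthetic "scenes": features vary slowly within a scene and
# jump sharply at the scene change.
rng = np.random.default_rng(0)
scene_a = rng.normal(0.0, 0.05, size=(10, 4))
scene_b = rng.normal(5.0, 0.05, size=(10, 4))
frames = np.vstack([scene_a, scene_b])

boundaries = segment_by_coherence(frames, threshold=2.0)
labels = propagate_labels(len(frames), boundaries, seed_labels=["cat", "dog"])
```

The hard part glossed over here, as the text notes, is that real boundary detection must itself be learned, which is what relates the problem to learning HMMs.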