Active Perception

RUZENA BAJCSY, MEMBER, IEEE

Invited Paper

Active Perception (Active Vision specifically) is defined as a study of Modeling and Control strategies for perception. By modeling we mean models of sensors, of processing modules, and of their interaction. We distinguish local models from global models by their extent of application in space and time. The local models represent procedures and parameters such as optical distortions of the lens, focal length, spatial resolution, band-pass filter, etc. The global models, on the other hand, characterize the overall performance and make predictions on how the individual modules interact. The control strategies are formulated as a search for the sequence of steps that would minimize a loss function while one is seeking the most information. Examples are shown as the existence proof of the proposed theory on obtaining range from focus and stereo/vergence, on 2-D segmentation of an image, and on 3-D shape parametrization.

I. INTRODUCTION

Most past and present work in machine perception has involved extensive static analysis of passively sampled data. However, it should be axiomatic that perception is not passive, but active. Perceptual activity is exploratory, probing, searching; percepts do not simply fall onto sensors as rain falls onto ground. We do not just see, we look. And in the course, our pupils adjust to the level of illumination, our eyes bring the world into sharp focus, our eyes converge or diverge, we move our heads or change our position to get a better view of something, and sometimes we even put on spectacles. This adaptiveness is crucial for survival in an uncertain and generally unfriendly world, as millennia of experiments with different perceptual organizations have clearly demonstrated. Yet no adequate account, theory, or example of active perception has been presented by machine perception research. This lack is the motivation for this paper.

II. WHAT IS ACTIVE SENSING?

In the robotics and computer vision literature, the term "active sensor" generally refers to a sensor that transmits energy (generally electromagnetic radiation, e.g., radar, sonar, ultrasound, microwaves, and collimated light) into the environment and receives and measures the reflected signals. We believe that the use of active sensors is not a necessary condition on active sensing, and that sensing can be performed with passive sensors (that only receive, and do not emit, information) employed actively. Here we use the term active not to denote a time-of-flight sensor, but to denote a passive sensor employed in an active fashion, purposefully changing the sensor's state parameters according to sensing strategies.

Hence the problem of Active Sensing can be stated as a problem of control strategies applied to the data acquisition process, strategies which depend on the current state of the data interpretation and on the goal or the task of the process. The question may be asked: "Is Active Sensing only an application of control theory?" Our answer is: "No, at least not in its simple version." Here is why:

1) The feedback is performed not only on sensory data but on complex processed sensory data, i.e., various extracted features, including relational features.
2) The feedback is dependent on a priori knowledge-models that are a mixture of numeric/parametric and symbolic information.

But one can say that Active Sensing is an application of intelligent control theory, which includes reasoning, decision making, and control.
This approach has been eloquently stated by Tenenbaum [1]: "Because of the inherent limitation of a single image, the acquisition of information should be treated as an integral part of the perceptual process . . . Accommodation attacks the fundamental limitation of image inadequacy rather than the secondary problems caused by it." Although he uses the term accommodation rather than active sensing, the message is the same.

The implications of the active sensing approach are the following:

1) The necessity of models of sensors. This is to say, first, the model of the physics of the sensors as well as of the noise of the sensors; second, the model of the signal processing and data reduction mechanisms that are applied to the measured data. These processes produce parameters with a definite range of expected values plus some measure of uncertainties. These models shall be called Local Models.

Manuscript received November 23, 1987; revised March 21, 1988. This work was supported in part by NSF Grant DCR-8410771, Air Force Grant AFOSR F49620-85-K-0018, Army Grant DAAG-29-84-K-0061, NSF Grant CER/DCR82-19196A02, DARPA/ONR, NIH Grant NS-10939-11 as part of the Cerebro Vascular Research Center, NIH Grant 1-RO1-NS-23636-01, NSF Grant INT85-14199, NSF Grant DMC85-17315, ARPA Grant N0014-85-K-0807, NATO Grant 0224/85, and by DEC Corporation, IBM Corporation, and LORD Corporation.

The author is with the Computer and Information Science Department, University of Pennsylvania, Philadelphia, PA 19104, USA.

IEEE Log Number 8822793.


2) The system (which mirrors the theory) is modular, as dictated by good computer science practices, and interactive, that is, it acquires data as needed. In order to be able to make predictions on the whole outcome, we need, in addition to models of each module (as described in 1) above), models for the whole process, including feedback. We shall refer to these as Global Models.

3) Explicit specification of the initial and final state/goal.

If the Active Vision theory is a theory, what is its predictive power? There are two components to our theory, each with certain predictions:

1) Local models. At each processing level, local models are characterized by certain internal parameters. One example of a local model is a region growing algorithm with internal parameters, the local similarity and the size of the local neighborhood. Another example is an edge detection algorithm with the parameter of the width of the bandpass filter with which one is detecting the edge effect. These parameters predict a) the definite range of plausible values, and b) the noise and uncertainty, which will determine the expected resolution and sensitivity/robustness of the output results from each module. Following the edge detection example, from the width of the bandpass filter we can predict how close two objects can be before their boundaries will merge, i.e., what is the minimum separation distance. The same parameter will predict what details of the contour are detectable, and so on.

2) Global models characterize the overall performance and make predictions on how the individual modules will interact, which in turn will determine how intermediate results are combined. The global models also embody the global/external parameters, the initial and final/global state of the system. The basic assumption of the Active Vision approach is the inclusion of feedback into the system and the gathering of data as needed. The global model represents all the explicit feedback connections, parameters, and the optimization criteria which guide the process.

Finally, all the predictions have been or will be tested experimentally. Although all our analysis applies to the modality of touch as well, we shall limit ourselves only to visual sensing, hence the name Active Vision (AV).

III. PREVIOUS WORK

To our knowledge, very little work has been performed on Active Vision in the sense of the above definition. To that extent the state of the art in this domain is poorly defined. In this section, we will briefly describe the systems most pertinent to our design.

The hand-eye system at Stanford used a pan-tilt head, a lens turret for controlling focal length, color and neutral filters mounted on a "filter wheel," and a vidicon camera whose sensitivity is programmable. This system is limited in processing and I/O speed (it used a PDP-6 computer) and in resolution (four bits), but its design is well conceived. It was successfully used for adaptive edge following [2].

POPEYE is a grey level vision system developed at Carnegie Mellon University [3]. It is a loosely coupled multiprocessor system on a MULTIBUS, with a MC68000, a Matrox frame grabber and buffer, an array processor, dedicated image processing units, and a programmable transform processor. Image positioning is achieved with a pan/tilt head and a motorized zoom/focus lens. Altogether, this is a powerful and flexible system. Weiss [4] describes a model reference adaptive control feedback system that uses image features (areas, centroids) as feedback control signals.

Kuno et al. [5] report a stereo camera system whose interocular distance, yaw, and tilt are computer controlled. The cameras are mounted on a specially designed linkage. The distance between the two cameras is controlled: the larger the distance, the more precisely disparity information from stereo can be converted to absolute distance; the smaller the distance, the easier is the solution to the correspondence problem. From the article, it is not clear how this flexibility will be exploited, nor how the 3 degrees of freedom will be controlled.

Recently the Rochester group [6] has reported a mobile stereo camera system, as has Poggio at MIT [7]. Both these groups are interested in modeling the biological systems rather than the machine perception problems.

One approach to formulating sensing strategies is described by Bajcsy and Allen [8]. But in that paper, control is predicated on feedback from different stages of object recognition. Such feedback will not be available from the processes determining spatial layout, which do not perform recognition.

Recently a very relevant paper to our subject matter appeared at the First International Conference on Computer Vision, by Aloimonos and Bandyopadhyay [9], with the title "Active Vision." They argue that "problems that are ill-posed, nonlinear or unstable for a passive observer become well-posed, linear or stable for an active observer." They investigated five typical computer vision problems: shape from shading, shape from contour, shape from texture, structure from motion, and optic flow (area based). The principal assumption that they make is that the active observer moves with a known motion and, of course, has available more than one view, i.e., more data. Hence, it is not surprising that with more measurements taken in a controlled fashion, ill-posed problems get converted into well-posed problems.

We differ, however, with Aloimonos and Bandyopadhyay in the emphasis of Active Vision (or perception in general) as a scientific paradigm. For us the emphasis is on the study of modeling and control strategies for perception, i.e., modeling of the sensors, the objects, the environment, and the interaction between them for a given purpose, which can be manipulation, mobility, or recognition.

IV. ACTIVE VISION

As we stated earlier, Active Vision for us is modeling and a study of control strategies. It is very difficult to talk about the theory of Active Vision without describing the system which embodies this theory. So our strategy of exposing the theory is to lay out the system in such a way that predictions can indeed be made and verified either formally or experimentally. Some of the predictions can be made specific only when one describes a concrete hardware/software setup. This is why we shall use some examples to support our arguments.

In Section II we have categorized models into local and global. The distinction is based on the extent of the process in time and space. By and large the physical and geometric properties are encoded in local models, while the interaction among the local models is controlled by the global models. We shall discuss different models as we proceed in our presentation of different systems. Here we just mention a few related works on this subject. Models and estimation theory have been successfully applied by Zucker [10]. The problem that Zucker addressed is how to find simple curves from visual data. He decomposed the process into three steps:

1) The measurement step, implemented via a series of convolutions.
2) The interpretation step. This has been realized via functional minimization applied to the results from the convolution.
3) Finally, the integration process in order to get the curve.

This decomposition into steps, with the parameters at each step explicit, allows Zucker to make clear predictions about where contours will be found and what the limitations are. The very same flavor of models and predictions can be found in the papers of Leclerc and Zucker [11] and Nalwa and Binford [12], which are applied to edge detection and image discontinuities. Signal models have been investigated by several workers; examples are Haralick [13] and recently Pavlidis [14].

Now we shall discuss the problem of control. There are three distinct control stages proceeding in sequence:

initialization,
processing in midterm,
completion of the task.

Strategies are divided with respect to the tradeoff between how much data measurement the system acquires (data driven, bottom-up) and how much a priori or acquired knowledge the system uses at a given stage (knowledge driven, top-down). Of course, there is also the strategy which combines the two.

To eliminate possible ambiguities with the terms bottom-up and top-down, we define them here. Bottom-up (data driven), in this discussion, is defined as a control strategy where no concrete semantic, context dependent model is available, as opposed to the top-down strategy where such knowledge is available. While semantic, context dependent models are not relevant to bottom-up analysis, other types of models serve an important function. For example, bottom-up analysis in active vision requires sensor models.

Table 1 below outlines the multilayered system of an Active Vision system, with the final goal of 3-D object/shape recognition. The layers are enumerated from 0, 1, 2, . . . with respect to the goal (intermediate results) and feedback parameters. Note that the first three levels correspond to monocular processing only. Naturally the menu of extracted features from monocular images is far from exhaustive. The levels 3-5 are based on binocular images. It is only the last level that is concerned with semantic interpretation. There is no clear delineation among the levels 0-5 with respect to the device control and software control. This of course is not surprising.

Table 1

Level 0 (control of the physical device)
  Feedback parameters -- directly measured: current lighting system, position of the camera, aperture open/close; computed: gross focus.
  Goal/stopping condition: grossly focused in the scene, camera optimally adjusted.

Level 1 (control of the physical device)
  Feedback parameters -- directly measured: focus, zoom; computed: contrast, distance from focus.
  Goal/stopping condition: focused on one subject.

Level 2 (control of low level vision modules)
  Feedback parameters -- computed only: threshold of the magnitude of an edge, width of a filter, similarity criterion on region growing.
  Goal/stopping condition: 2-D segmentation, i.e., maximum number of edges, max. and min. number of regions.

Level 3 (control of the binocular system, hardware and software)
  Feedback parameters -- directly measured: vergence angle; computed: threshold of similarity criteria for correspondence purposes, range of admissible depth values.
  Goal/stopping condition: depth map, 2-1/2-D sketch.

Level 4 (control of the intermediate geometric 2-1/2-D and 3-D vision module)
  Feedback parameters -- computed only: threshold of similarity criteria on surface growing and boundary detection.
  Goal/stopping condition: depth map segmentation into surface patches.

Level 5 (control of a) several views, b) the integration process, c) the matching between data and the model)
  Feedback parameters -- measured directly: position, viewing angle of several views of the scene; computed only: threshold on similarity criteria between consecutive views, threshold on similarity criteria between the data and the model.
  Goal/stopping condition: 3-D object description via volumetric categories.

Level 6 (control of the semantic interpretation)
  Feedback parameters -- computed only: similarity criterion on the matching between the data and the model.
  Goal/stopping condition: 3-D object description via the model; the model can be of various complexity, i.e., basic categories, hypercategories, and subcategories.
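To make the level-2 prediction concrete, here is a small illustration of our own (not code from the paper): a 1-D derivative-of-Gaussian edge filter of scale sigma applied to two nearby objects. Counting the surviving response peaks shows the predicted behavior, i.e., boundaries closer than roughly the filter width merge and the edge count drops.

```python
import numpy as np

def dog_kernel(sigma, radius=None):
    """First derivative of a Gaussian: a simple band-pass edge filter."""
    radius = radius or int(4 * sigma)
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    return -x / sigma ** 2 * g / g.sum()

def count_edges(signal, sigma, thresh=0.05):
    """Count local maxima of the filtered response magnitude above thresh."""
    r = np.abs(np.convolve(signal, dog_kernel(sigma), mode="same"))
    peaks = (r[1:-1] > r[:-2]) & (r[1:-1] > r[2:]) & (r[1:-1] > thresh)
    return int(peaks.sum())

# Two bright bars ("objects") with a shrinking gap between them.
sigma = 3.0
for gap in (12, 6, 3, 1):
    s = np.zeros(200)
    s[50:80] = 1.0                    # object 1
    s[80 + gap:110 + gap] = 1.0       # object 2, gap pixels away
    # 4 boundaries are reported while the objects are resolved; the two
    # inner responses cancel once the gap drops below the filter width.
    print(gap, count_edges(s, sigma))
```

The same parameter sigma thus doubles as the minimum-separation prediction and as the detectability limit for contour detail.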


Several comments are in order:

1) Although we have presented the levels in a sequential order, we do not believe that this is the only possible flow of information through the system. The only significance in the order of levels is that the lower levels are somewhat more basic and necessary for the higher levels to function.
2) In fact, the choice of the level at which one accesses the system very much depends on the given task and/or the goal.

This design is closest in spirit to that of Brooks [15]. In the remaining part of this paper we shall present two separate scenarios (one bottom-up, the other top-down) which will use some of the control levels from Table 1.

V. THE BOTTOM-UP SCENARIO

In this scenario we begin without any a priori given task or request. Hence the system must generate a task for itself, such as: open your eyes and look around, perhaps measure how far you are from an object, or segment the image. Naturally, for the system to be able to do so, it must have some built-in models of the signal and geometry in order to be able to function at all. Examples of such models are: edge models (step functions, linear edges, etc.), region models (piecewise constant, piecewise linear, etc.), and topological models.

There are two test cases that we shall present: one is to obtain depth maps using range from focus and vergence/stereo; the other is 2-D image segmentation.

A. Obtaining Depth Maps

The objective of the first case was to study the control strategies in the data driven situation of an Agile Camera system, and how to combine information from different low-level vision modules, such as focus and stereo ranging [16]. The Agile Camera system is an 11 degrees of freedom system under computer control. The camera system is shown in Fig. 1.

Fig. 1. Agile camera system.

The system has two cameras, each with 340 by 480 spatial resolution and a gray scale of 256 levels. Each camera has control of Focus, Zoom, and Aperture. There is control of the vergence angle between the two cameras. There is coupled control of Pan/Tilt, of Up/Down motion, and of Right/Left motion. In addition, a controlled lighting system is available in order to simulate a Lambertian source, and from one point source up to four point sources.

The initialization stage comprises two steps: waking up the camera system and gross focusing. After accomplishing this, we proceed to the intermediate stage, which is carried out by orienting the cameras for stereo imaging, stereo ranging with verification by focusing, and focus ranging with verification by stereo. The final task is to obtain a depth map with a given accuracy.

1) Waking Up: Waking up the sensors involves opening the device controllers and setting them to default parameters. The lens controller zooms both lenses out, focuses them at a distance of 2 m, opens the apertures, and verges to a parallel position. The platform controller positions the cameras in the middle of the gantry, and orients them to look straight ahead. The light controller adaptively illuminates the lamps until the image intensities reach reasonable mean values and good contrast. Now awake, the camera system is ready to acquire images for any application.

2) Gross Focusing: One of the cameras is the master, and the other is the slave. To bring the images into sharp focus and compute an initial, rough estimate of the range of objects in the scene, we grossly focus the master lens, beginning by zooming in the master camera. As analysed in Krotkov's work [16], the depth of field of the lens limits the precision of the computation of range from focusing. To maximize this precision, the depth of field should be as small as possible, which can be achieved by increasing the aperture diameter. For this, the process starts by turning off all the lights, and then opens the aperture as wide as possible without saturating. (An image saturates when more than a certain number S of pixels are at their maximum intensity; the value S = 200 is employed in the current implementation.) Next it adaptively illuminates the lamps until image saturation, and finally turns them down slightly. The gross focusing process determines the focusing distance bringing the scene into sharp focus by Fibonacci search over the entire field of view and the entire space of possible focus motor positions. The process records the best focus motor position and the range Z. The gross focusing terminates by zooming the master lens out, and servoing the slave focus motor position to that of the master lens.

3) Orienting the Cameras: The task now is to orient the cameras so that objects at distance Z will lie in the field of view and have zero disparity. Using the formula derived by Krotkov, one can compute the vergence angle, from which by further calculation we get the vergence motor position corresponding to the vergence angle, and finally servo the vergence motor to this position. After verging, some objects may have drifted out of view. To reacquire these objects, a corrective pan by the amount -a/2 is executed.

The reason we have gone into such detail in explaining this initial stage is to make the reader aware of how much interaction and feedback takes place just in the process of initial adjustment of the camera system.

4) The Final Task: Depth Maps. The steps that follow carry out the computation of how to get range from focus and range from stereo.
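Two numerical pieces of this initialization lend themselves to a short sketch, given here under simplifying assumptions of our own (Krotkov's derivations are not reproduced): a unimodal search over focus motor positions for the sharpest image, and the vergence angle for a symmetric stereo head with baseline b fixating at range Z, a = 2 arctan(b/2Z). We substitute the closely related golden-section search for the Fibonacci search named above; both assume the sharpness criterion is unimodal over the motor range.

```python
import math

def sharpest_focus(lo, hi, sharpness, tol=1.0):
    """Golden-section search (a close relative of Fibonacci search) for the
    focus motor position maximizing a unimodal sharpness criterion; each
    probe is a motor move, a grab, and a contrast measurement, so probes
    are expensive and few."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = float(lo), float(hi)
    x1, x2 = b - invphi * (b - a), a + invphi * (b - a)
    f1, f2 = sharpness(x1), sharpness(x2)
    while b - a > tol:
        if f1 < f2:                      # maximum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + invphi * (b - a)
            f2 = sharpness(x2)
        else:                            # maximum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - invphi * (b - a)
            f1 = sharpness(x1)
    return 0.5 * (a + b)

def vergence_angle(baseline, Z):
    """Total vergence angle (radians) so that a point at range Z has zero
    disparity; after verging, the corrective pan is -a/2."""
    return 2.0 * math.atan2(0.5 * baseline, Z)
```

With a hypothetical 0.3 m baseline and Z = 2 m, for example, the vergence angle comes to about 8.6 degrees, and the head pans back by roughly half of that to reacquire drifted objects.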

Here we shall not describe these follow-up steps in detail, but just summarize the results of this portion:

1) It is a cooperative sensing operation in that the two modules mutually check each other; that is, stereo ranging is verified by focusing and focus ranging is verified by stereo ranging.
2) The result is the conservative measure of distance that has passed the error analysis of both processes.
3) There is no feedback, only feed forward, in this portion.

In summary, Krotkov's work is an example of data driven initialization with the goal of obtaining a depth map of a scene. An alternative approach to the intermediate goal of the Initialization stage can be 2-D segmentation of each view, performed before or concurrently with the process for obtaining range from focus.

There are other researchers who have integrated different visual cues for obtaining range data; examples are Abbot and Ahuja [17] and Hoff and Ahuja [18]. The novelty of Krotkov's work is in the thorough modeling of each module, the explicit parametrization which allowed him to predict the range of distance values in which the measurements are physically possible. Furthermore, he measured the a posteriori distribution of errors during the calibration phase, which allowed him to implement a policy for integrating the computed ranges into a maximum likelihood estimate.

B. Image Segmentation, Another Example of Bottom-Up Process

The group at the University of Massachusetts [19] have been advocating for some time that one must be goal directed in order to do low level image processing and/or segmentation. They argue that the failure of general segmentation techniques can be traced to the following:

1) the image is too complex because of the physical situation from which the image was derived and/or the nature of the scene; or,
2) there is a problem of evaluating different region/line segmentations.

They say: ". . . it has been our experience that no low-level evaluation measure restricted to making measurements on the segmentation can provide a useful comparative metric. Rather the quality of segmentation can be measured only with respect to the goals of an interpretation . . ."

We agree that the images are complex, but some of the image acquisition process can be modeled, and hence one can account for the variability of the acquisition process, as shown by Krotkov [16]. Of course, one cannot predict the spatial arrangement of objects (or their surface properties), but we can have models that are somewhat invariant to these variables. We also agree with the authors that there are no good segmentation evaluation functions. It is known that the segmentation process is not unique given any number of parameters. But we wish to argue with the authors about the claim that the only thing that can determine the segmentation is the goal of its interpretation. If this were so, then we could never have a general (or even semi-general) purpose system which can bootstrap itself and adapt to an a priori unknown environment. In reality, they do present an intermediate system called GOLDIE which indeed has several syntactic (context-independent) evaluation functions. This is quite encouraging, although it still must be tested in a variety of domains. (It was tested only in one domain, so far.)

Recently Anderson [20] has considered a modular, context independent approach to the problem of 2-D segmentation, denoted as control level 2 in Table 1. The modules are edge and region formation modules, as shown in Fig. 2. It is a well-known fact that one can get very different segmentations from a picture by just changing the parameters. The most important results of this work so far are the following:

1) a definition of the task/goal and parameters for each module in terms of general, geometric goals rather than with respect to context and semantic information;
2) an extensive analysis of the relationship between the parameters and the false detection errors and the false dismissal errors of a true object boundary;
3) the detection of boundaries is an interdependent process between the edge and region formation processes;
4) the idea of feedback within a module and the interdependency between modules implies multiple outputs and hence the need for fusion, i.e., combination rules.

Items 1) and 2) are issues of local models, while items 3) and 4) are aspects of global models. The most popular approach to global models in the context of image segmentation is the cooperative network or relaxation approach [21]. Another such model is the Markov Random Fields model used by several researchers [22]-[24]. The principle of this model is that the effects of members of the field upon each other are limited to local interaction as defined by the neighborhood. This is also the weakness of this model, because it assumes an a priori spatial arrangement of objects and their projection on the image. This assumption is too strong, and applicable only in a few, highly controlled experiments. The work of Anderson shows that for image segmentation the global models should represent topological and integral (size and number of regions/edges) properties which are positionally invariant rather than neighborhood dependent (Fig. 4). The following predictions have been verified: the minimum separability between two adjacent objects, and the amount of detectable detail on the boundary (this follows from the width of the filter that is used with the edge detector); the homogeneity criterion for regions depends on the number of expected segments in the image. In Fig. 3 we show the original image, and then the result after edge detection with a bandpass filter of width 3 pixels. Note that the pennies separated by less than 3 pixels are merged together, as predicted. The last portion of this figure, labeled "Segmentation," is the result of a combination of edge detection and region growing which operates on local similarity in the neighborhood of one pixel. This small neighborhood explains the fact that one detects the touching boundary. The magnitude of the similarity criterion eliminates the single lines. The magnitude of the similarity criterion is a parameter that is adaptively determined by the external criterion of the number of desired regions. We have investigated the relationship between the number of regions and the similarity magnitude, and it is shown in Fig. 4.

Fig. 2. Segmentation system. (Flowchart: from the digitized input and a priori estimates -- perimeter-to-area ratio, range of the number of objects n, typical object size -- an estimator calculates the total number of perimeter pixels, the minimum region size, and the edge scale sigma; the edge detection and thinning module sets the gradient threshold t at maximum and improves it with the refined n; the region formation module fine-tunes the local difference limit d and adjusts it based on n; output: segmented image.)

Fig. 3. Image segmentation of pennies.

Fig. 4. Relationship of the number of regions and the similarity magnitude.

This curve shows that up to a point there is an almost linear relationship between the similarity magnitude and the number of regions. As one goes beyond a certain value (in this graph, around 10) the number of regions increases because one has detected more noise than true segmented regions.

Although we have shown that one can have a theory with predictions on the sensitivity and robustness of 2-D segmentation and of obtaining 3-D range data, there are still many open questions left in the bottom-up scenario. Some of them are the following:

* parallel explanation versus sequential;
* partial explanation versus total, i.e., should one say: "As far as I can see this is it . . ." or "From this view this is . . . , from another view this is . . ." etc.
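The adaptive determination of the similarity magnitude can be sketched as a simple feedback loop. This is our illustration, not Anderson's actual modules; `segment` is a hypothetical stand-in for the region formation module. It exploits the roughly monotone part of the curve in Fig. 4 by bisecting on the similarity magnitude until the external criterion, the desired number of regions, is met.

```python
def tune_similarity(segment, image, target_regions, lo=0.0, hi=20.0, iters=12):
    """Servo the similarity magnitude so the region formation module yields
    approximately target_regions regions; assumes the monotone portion of
    the regions-vs-similarity curve (Fig. 4)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        n = len(segment(image, mid))     # run region growing at magnitude mid
        if n == target_regions:
            return mid
        if n < target_regions:
            lo = mid                     # too few regions: raise the magnitude
        else:
            hi = mid                     # too many regions (noise): lower it
    return 0.5 * (lo + hi)
```

Past the knee of the curve (around 10 in Fig. 4) the monotonicity assumption breaks down, which is exactly the noise-dominated regime described above.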

VI. TOP-DOWN SCENARIO

We shall consider two cases of the top-down initialization process: task driven and query driven. While in the bottom-up mode the strategy is to start with the data, analyze it, and identify some gestalt property, such as a closed contour, or the largest or brightest object, in the top-down mode the strategy is to start with the Data Base (entered either via a query or by the Task Description), which should suggest what perceptual properties to look for. The task-driven mode differs from the query-driven mode only in the fact that in the task-driven mode the Active Vision System not only has to identify the queried object but also must monitor the work space and measure the changes that occur until the task is accomplished. But most importantly, in the task-driven mode the system, by interacting with its environment, may change it; while when you are a reporter, you have no direct influence on the events that you are reporting about. In the query-driven mode the system acts as an observer, while in the task-driven mode the system is a participant. An interesting example is the DARPA project, the Autonomous Land Vehicle paradigm. Within this paradigm, let us suppose that the task of an autonomous vehicle is to get from place A to place B, stay on the road, and avoid obstacles if there are any. Then we can ask further: is the Active Vision system acting as an Observer or a Participant? At some level of abstraction, it acts as an Observer, but on the sensor-action level it behaves as a Participant. Consider for example that the wheels leave tracks and hence change the surface of the road, which will require a change in processing strategy.

Top-down, model-driven recognition systems are quite popular in the vision community. There are several review articles; an example is Binford's work [25]. Their advantage over the bottom-up scenario is that the expectations, the models, can help in the recognition process even in the case of rather noisy and incomplete data. Naturally, the danger in this mode is that one may see anything that one wishes to see and not detect unexpected objects. How does the Active Vision approach fit into the top-down scenario? Clearly, if the expectations fit the data there is no need for further control. But this is hardly ever the reality. In any other case, i.e., when decisions must be made during the intermediate stage, the Active Vision approach becomes a necessity. In spite of all the past work in this mode there are still open issues, such as the following:

What is a satisfactory answer to a given query?
Should one include into the answer a degree of uncertainty?
Should one answer via another question?
If in doubt, what should one do next?
In the task driven mode, should one generate another subtask?
Can one talk about convergence to an answer?

VII. THE INTERMEDIATE STAGE

After the initial processing discussed in Sections IV and V, we have either: a) a segmented scene, or at least labeled entities such as two-dimensional regions, lines, or 3-D points (after a bottom-up initialization), in other words, some geometric description; or b) a set of perceptual properties to look for (after the top-down initialization).

In the case of a) we go into the Knowledge Base to search for some hypothesis-concepts. In the case of b) we go into the sensory data to search for the expected perceptual properties, which are formed by the low level models. The common fact for both of these cases a) and b) is that both have a number N: the number of concept hypotheses or the number of objects with the given perceptual properties. Now let us study this number N. If N = 0, then this means in case a) that there is no hypothesis that would match the measurements, and in case b) that there is no object in the data that matches the expectations. If N = 1, then there is a unique interpretation of the data. If N > 1, then in case a) we have multiple hypotheses, and in case b) we have multiple objects in the data with the expected perceptual properties. This number N can be a feedback signal for further processing. For example, if N = 1 we are done, unless we are specifically looking for more than one object in the data. In the case of interpretation we are done unequivocally. If N = 0 we are also done, at least for the moment; this fact tells us that in the current data the searched-for object does not exist. If N > 1, then the question is: what should be the Nmax? In the case of a) the Nmax corresponds to a short term conceptual buffer, which contains the hypotheses; in the case of b) the Nmax corresponds to the short term visual memory.

The open question is: how large should this buffer be? For the conceptual buffer, we would like to argue that it should be 7 plus or minus 2. This number is based on psychological studies [26]. In any case, whatever the Nmax is, the question remains: what do you do next?

The intermediate stage is structured and subdivided in our Table 1 into levels 3 through 6. The issue here is what are suitable models for grouping purposes, which would lead to useful geometric descriptions.

A very interesting global model in this sense, inspired by Gestalt psychological models, has been implemented by Reynolds and Beveridge [27]. The global model is a geometric grouping system based on similarity and spatial proximity of geometric tokens. Examples for lines are collinearity, rectangularity, and so on. We think that this is a promising approach, but what is missing are evaluation criteria and robustness studies. This cannot be avoided, especially with such vague notions as similarity and spatial (or other) proximity relations.

Another way of explaining the 3-D data is to choose volumetric models that can be estimated and fit to the data. The difficult task here is the choice of a representation which naturally maps into shape categories. There are several possibilities, as reviewed thoroughly by Besl and Jain [28]. One such model is the superquadric function [29], characterized by these parameters: position and orientation in space, size and roundness/squareness of edges, amount of tapering along the major axis, amount of bending along the major axis, and the size of cavities (their depth and steepness). Given these parameters, it is not difficult to see that one can classify objects with respect to their shapes into categories: squash, melon, pear, etc.

Fig. 5 displays an example of a squash. From top-left to bottom-right, the sequence shows first the gray scale image of the object, then the range data, and then the sequence of fits to the data, starting with an ellipsoid and finishing with the best fit as measured by the residuum. As a side product, one gets the above mentioned parameters. Naturally, there are common characteristics but also discernible features for each of these categories.

Fig. 5. Model recovery of a tapered and bent object - a squash.
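To make the volumetric model concrete, the sketch below is our own illustration, with only the size and roundness parameters; position/orientation, tapering, bending, and cavities [29] are omitted. It gives the standard superquadric inside-outside function and the residuum used to rank successive fits, as in the ellipsoid-to-best-fit sequence of Fig. 5.

```python
import numpy as np

def superquadric_F(pts, a1, a2, a3, eps1, eps2):
    """Inside-outside function of a superquadric in canonical pose:
    F < 1 inside, F = 1 on the surface, F > 1 outside.
    a1, a2, a3 are sizes; eps1, eps2 control roundness/squareness."""
    x, y, z = np.abs(pts[:, 0]), np.abs(pts[:, 1]), np.abs(pts[:, 2])
    xy = (x / a1) ** (2.0 / eps2) + (y / a2) ** (2.0 / eps2)
    return xy ** (eps2 / eps1) + (z / a3) ** (2.0 / eps1)

def residuum(pts, params):
    """Mean squared deviation of F from 1 over the range points; the fit
    with the smallest residuum is kept."""
    return float(np.mean((superquadric_F(pts, *params) - 1.0) ** 2))

# A fitter would minimize residuum over (a1, a2, a3, eps1, eps2),
# starting from the ellipsoid eps1 = eps2 = 1 as in Fig. 5.
```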

Those distinguishing properties will reduce the number N, i.e., the number of possible interpretations.

VIII. SEARCH FOR INFORMATION

The fundamental ingredient of the Active Vision theory is the mechanism for decision making; in general, how to choose data/information with respect to the perceptual goal. This involves both the method of estimation and the choice of a general strategy for control. Following the work of Hager and Mintz [30], we formalize a controllable measurement device as a mathematical system of the form

z_t = H(u_t, p) + V(u_t, p)

where u_t is an m-dimensional control vector, p is the s-dimensional quantity we are attempting to estimate, V(., .) is additive noise, and z_t is the observation.

The problem is to optimize, by the choice of some sequence u = [u_1, u_2, . . . , u_n], the performance of an estimation procedure delta_n(.) estimating p from z = [z_1, z_2, . . . , z_n]. In order to choose a particular estimation procedure, we must pick a criterion or loss by which to judge the merit of a decision rule. A commonly used loss criterion is the mean square error loss. The optimal estimation procedure is that which, given an a priori probability distribution on p and the conditional distribution on the observations, minimizes the quantity

E[ |delta(z) - p|^2 ].

This yields the Bayes solution to our estimation problem [31].

A control sequence is to be evaluated relative to its expected utility. This utility can be thought of in two parts: the performance of the estimation procedure for that choice of strategy, and the cost of implementing that strategy. We can write a general loss for the combined estimation/control problem as

l(p, n, u) = l_d(p, delta_n(z)) + c(n, u)

where l_d represents the loss attributed to the estimation procedure delta, and c represents the cost of taking n samples via control strategy u. The choice of actual forms for c and l_d reflects the desired behavior of the system.

Since sensors are to be used for reducing uncertainty, an appropriate quality measure is a tolerance, eps. Results must be computed within some time constraints, balanced against the sensor's resource budget. Finally, the results returned by the sensor should indicate the degree to which the set task was found achievable. Based on this discussion, an appropriate evaluation criterion for the performance of an estimation procedure delta_n in the context of control is the 0-1 loss

l_d(p, delta_n(z)) = 0 if |delta_n(z) - p| < eps, and 1 otherwise.
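A small simulation makes the tradeoff visible. Everything below is our own illustration, not from Hager and Mintz: a scalar p with a Gaussian prior, the sample mean as delta_n, and a linear sampling cost standing in for c(n, u). The Monte Carlo estimate of the 0-1 risk plus cost can then be minimized over the batch size n.

```python
import random

def risk(n, eps=0.5, cost_per_sample=0.01, noise_sd=1.0, trials=20000):
    """Monte Carlo estimate of r(n) = P(|delta_n(z) - p| >= eps) + c(n)
    for the sample-mean estimator delta_n under a Gaussian prior on p."""
    misses = 0
    for _ in range(trials):
        p = random.gauss(0.0, 1.0)                       # draw p from the prior
        z = [p + random.gauss(0.0, noise_sd) for _ in range(n)]
        if abs(sum(z) / n - p) >= eps:                   # 0-1 tolerance loss
            misses += 1
    return misses / trials + cost_per_sample * n         # add sampling cost

# Batch-mode choice of n: minimize the estimated risk over candidate sizes.
best_n = min(range(1, 31), key=risk)
```

Raising the cost per sample pushes best_n down; tightening the tolerance eps pushes it up, which is exactly the estimation-versus-cost balance formalized next.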

BAICSY: ACTIVE PERCFPTION By the previous definition of the loss function we can com- 3) the control criterion is a direct function of the infor- pute the Bayes decision risk of an estimation 6, as mation returned by the estimation procedure, and 4) the information is limited by sensor scope. r(n, U, 6) = ~p~z,p[~~(p,6n(z)) + c(n, U)] Given this setting, the EKF fails on the robustness issue. This = rd(a, 6,) + c(n, U) is so because the EKF is, like stochastic approximation, a If we disregard U, then we can consider the problem of differential correction techniqueand unlessone hasagood finding the n which minimizes the risk function for fixed prior estimate it will not converge. One can ask: "Can we U. This procedure is called batch procedure. On the other guarantee convergence by an appropriate gain?" Hager and hand wecould also solve this problem byevaluating the risk Mintz have shown via simulation that as the interval of conditioned on the observed data, and derive the stopping uncertainty widens, the error terms will drastically under- rule which says when enough data had been taken. This is represent the error estimation. For large enough intervals of course a sequential procedure in nature. In addition to the filter fails to converge. Next they investigate the finite determining batch size or stopping rule, we must choose Bayes approximation and show that this technique is more a plan of control to minimize the risk of the final outcome. robust and gives more predictable estimations. The impor- That we do by minimizing the risk given 6,: tant result of the work of Hager and Mintz is the critical analysis of the linearization techniques that are quite pop- min min r(n, U, &). nu ular currently in robotics and active perception. It is an Recently a great interest has been spurred in the computer important methodology to examine which methods/tools vision and robotics community on the integration of mul- are appropriate and what are the consequences of the tisensory information. A prime example of this is the work- approximations made at each step to a given problem. It shop on spatial reasoning and uncertainty, St. Charles, IL, also points to open problems that require new mathemat- October 1987, organized by A. Kak. The question is how to ical and formal tools that should be a challenge to theo- integrate multisensory information. For systems which are reticians. linear both in state and control parameters, the Kalman fil- ter is the optimal linear estimation procedure. When the Ix. CONCLUSIONS system is static, the Kalman filter becomes an iterative In conclusion we have defined active perception as a implementation of Baysian estimation procedure. There are problem of an intelligent data acquisition process. For that, no such general solutions for nonlinear measurement sys- one needs to define and measure parameters and errors tems, though there are a number of approximation tech- from the scene which in turn can be fed back to control the niques [32]. One approximation is to linearize a given non- data acquisition process. This is a difficult though impor- linear measurement system about a nominal trajectory and tant problem. Why? The difficulty is in the fact that many apply linear estimation technique. In reference to Kalman of the feedback parameters are context and scene depen- filter, this approximation is called the Extended Kalman Fil- dent. The precise definition of these parameters depends ter (EKF). 
on thorough understandingof the data acquisition devices Durant-Whyte [33] used a version of this technique to (camera parameters, illumination and reflectance param- solve the problem of updating location estimates from eters), algorithms (edge detectors, region growers, 3-D observations. Smith et al. [34] have applied this method to recovery procedures) as well as the goal of the visual pro- a mobile robot estimatingits position.Ayacheand Faugeras cessing. The importance however of this understanding is [35] have looked at several problems in stereo ranging by that one does not spend time on processing and artificially building an EKF a general constraint equation. However, it improving imperfect data but rather on accepting imper- is important to remember that EKF is only an approxima- fect, noisy data as a matter of fact and incorporating it into tion. The accuracy of this approximation depends on hav- the overall processing strategy. ing a relatively good prior estimate to linearize about and Why has it not been pursued earlier? The usual answers the approximation is only good in a small neighborhood one gets are: lack of understanding of static images, the about this point. Another method of nonlinear estimation need to solve simpler problems first, less data, etc. This of which does not rely on linearization is Stochastic Approx- course is a misconception. One view lacks information imation [36]. This technique is similar to Newton's method which is laboriously recovered when more measurements, (iterative method) adapted towork in the presence of noise. i.e., moreviews can resolve the problem easier. More views The choice of gain sequences is crucial to convergence. add a new dimension-time-which requires new under- There are no results for the small-sample behavior of this standing, new techniques, and new paradigms. estimator. The next question is the control of nonlinear sys- tems. While there are some results [31] in Bayesian sequential ACKNOWLEDGMENT and batch decision problems, there is no general theory. The authors would like to thank Patricia Grosse for her Hager and Mintz[30] haveanalyzed theoretically and in sim- editorial assistance on this document. ulation the EKF approach to the active perception task and have made the following conclusions: The Active percep- REFERENCES tion setting has the following attributes: [I] J. M. Tenenbaum, "Accommodation in computer vision," Ph.D. Thesis, , Nov. 1970. is 1) the system nonlinear both in state and control, [2] -,"A laboratory for hand-eye research," IFIPS, pp. 206-210, 2) the measurement noise depends on the control of 1971. measurement system, [3] R. Bracho, J. F. Schlag, and A. C. Sanderson, "POPEYE: Agray-

