Feature Selection Based on Reinforcement Learning for Object Recognition
Monica Piñol, Computer Science Dept., Computer Vision Center, Universitat Autònoma de Barcelona, 08193-Bellaterra, Barcelona, Spain, [email protected]
Angel D. Sappa, Computer Vision Center, Campus UAB, 08193-Bellaterra, Barcelona, Spain, [email protected]
Angeles López, Computer Science and Engineering Dept., Universitat Jaume I, 12071-Castellón de la Plana, Castellón, Spain, [email protected]
Ricardo Toledo, Computer Science Dept., Computer Vision Center, Universitat Autònoma de Barcelona, 08193-Bellaterra, Barcelona, Spain, [email protected]
ABSTRACT
This paper presents a novel method for learning the best feature to describe a given image, intended to be used in object recognition. The proposed approach is based on a Reinforcement Learning procedure that selects the best descriptor for every image from a given set. In order to do this, we introduce a new architecture joining a Reinforcement Learning technique with a Visual Object Recognition framework. Furthermore, for the Reinforcement Learning, a new convergence strategy and a new exploration-exploitation trade-off strategy are proposed. Comparisons show that the performance of the proposed method improves by about 6.67% with respect to a scheme based on a single feature descriptor. Improvements in the convergence speed have also been obtained using the proposed exploration-exploitation trade-off.

Categories and Subject Descriptors
I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence - Intelligent agents; I.4.8 [Image Processing and Computer Vision]: Scene Analysis - Object recognition

General Terms
Algorithms, Performance, Experimentation

Keywords
Reinforcement Learning, Object Recognition, Q-Learning, Visual Feature Descriptors

1. BACKGROUND
Usually, in Visual Object Recognition (VOR), scenes are represented in feature spaces where the classification and/or recognition tasks can be done more efficiently. There are several representations with different levels of performance. Often, the feature space is built around "interest points" of the scene. The interest points are obtained with detectors; then, for each interest point, a descriptor is computed.

Currently, researchers are beginning to work with reinforcement learning in computer vision. For instance, several approaches have been proposed in the image segmentation field [19, 21, 22]. In all these cases, reinforcement learning has been used to tackle threshold tuning, reaching a performance similar to state-of-the-art approaches. There are also some works on the face recognition problem. For instance, in [8] the authors propose to learn the set of dominant features of each image in order to recognize faces using a reinforcement learning based technique.

In the VOR domain, [13] presents a method using bottom-up and top-down strategies, joining Reinforcement Learning and Ordinal Conditional Functions. A similar approach has been proposed in [6], which joins Reinforcement Learning with First Order Logic. In [10] and [11], a new method for extracting image features is presented; it is based on "interest point" detection and uses reinforcement learning and aliasing to distinguish the classes. Finally, in [2] reinforcement learning is used to select the best classification in a Bag of Features approach. In contrast to previous works, in the current work we propose the use of a reinforcement learning based method to learn the best descriptor for each image.

One of the most recent and widely used approaches for VOR is Bag of Features (BoF) [4, 5]. Figure 2 (left) shows an illustration of the BoF architecture. In general, a BoF approach consists of four steps, detailed next (a minimal sketch of these steps is given at the end of this section). In the first step, feature extraction is carried out by image feature descriptors; in [15] the performance of descriptors is analysed. The second step is the creation of a dictionary of visual words from a training set, and the third step is the image representation. Both steps are solved with a Vocabulary Tree (VT) [17]. The VT defines a hierarchical tree by k-means clustering. In the algorithm, k defines the branch factor (the number of children of each tree node) and the process is applied recursively until a maximum depth is reached. At every tree level, each cluster is defined by a dictionary of visual words. The fourth step is the classification.

The BoF approach does not specify the descriptors or the classification algorithm. One solution for determining the descriptors is to concatenate all of them, but this introduces noise as well as increasing the CPU time. Furthermore, since each image can have different characteristics of color, texture, edges, etc., images from the same database can behave differently when the same descriptor is applied.

This paper proposes a new approach for enhancing BoF. The key idea is the introduction of a Reinforcement Learning technique [20] to learn the best descriptor for each image. These descriptors are then used in a VT scheme, which is used by a Support Vector Machine (SVM) algorithm for the classification. The remainder of the paper is organized as follows. Section 2 summarizes the Reinforcement Learning technique. Then, Section 3 presents the proposed method. Experimental results are provided in Section 4. Finally, conclusions and future work are given in Section 5.
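As referenced above, the following is a minimal sketch of the four BoF steps, not the authors' implementation: it assumes SIFT as the descriptor, replaces the hierarchical Vocabulary Tree of [17] with a single-level k-means dictionary, uses a linear SVM, and all function and parameter names are illustrative.

```python
# Minimal BoF sketch: descriptor extraction, dictionary, image representation,
# classification. A single-level k-means stands in for the Vocabulary Tree.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def extract_descriptors(gray_images):
    """Step 1: detect interest points and compute one descriptor per point."""
    sift = cv2.SIFT_create()
    return [sift.detectAndCompute(img, None)[1] for img in gray_images]

def build_vocabulary(descriptor_sets, n_words=200):
    """Step 2: cluster the training descriptors into a dictionary of visual words."""
    stacked = np.vstack([d for d in descriptor_sets if d is not None])
    return KMeans(n_clusters=n_words, n_init=4).fit(stacked)

def bof_histogram(descriptors, vocabulary):
    """Step 3: represent an image as a normalized histogram of visual words."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_classifier(histograms, labels):
    """Step 4: classify the BoF representation, here with a linear SVM."""
    return SVC(kernel="linear").fit(histograms, labels)
```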
2. REINFORCEMENT LEARNING
Reinforcement learning (RL) is a learning method based on trial and error, where the agent has no prior knowledge about which action is the correct one to take. The underlying model that RL learns is a Markov Decision Process (MDP). An MDP is defined as a tuple ⟨S, A, δ, τ⟩, where: S is the set of states; A is the set of actions; δ is a transition function δ: S × A → S; and τ is a reward/punishment function τ: S × A → ℝ.

The agent interacts with the environment and selects an action. Applying the action (a_h) at state (s_z), the environment gives a new state (s_{z+1}) and a reward/punishment (r_t) (see Fig. 1). In order to maximize the expected reward, the agent selects the best action a_h based on the τ(s_z, a_h) provided by τ: S × A → ℝ.

Figure 1: Scheme of interaction between the agent and the environment.

Different methods have been proposed in the literature for solving the RL problem: dynamic programming, Monte Carlo methods, and temporal difference learning. In the current work, a temporal difference learning based method has been selected, since it does not require a model and it is fully incremental [23]. In particular, our framework is based on the Q-learning algorithm [25]. In this case, the agent learns the action policy π: S → A, where π maps the current state s_z into an optimal action a_h so as to maximize the expected long-term reward.

Q-learning provides a strategy to learn an optimal policy π* when the δ function (δ: S × A → S) and the τ function (τ: S × A → ℝ) are not known a priori. The optimal policy is π* = argmax_a Q(s, a), where Q is the evaluation function the agent is learning. Note that since δ and/or τ are nondeterministic, the environment should be represented in a nondeterministic way. Hence, we can write Q in a nondeterministic environment as follows:
Q_n(s_z, a_h) \leftarrow (1 - \alpha_n) Q_{n-1}(s_z, a_h) + \alpha_n \left[ r + \gamma \max_a Q_{n-1}(s_{z+1}, a) \right],   (1)

\alpha_n = \frac{1}{1 + \mathit{visits}_n(s_z, a_h)},   (2)

where 0 ≤ γ < 1 is a discount factor for future reinforcements. Eq. (2) gives the value of α_n for a nondeterministic world, and visits is the number of iterations in which the Q-table has been visited at the tuple (s_z, a_h) [16].

Since in this problem it is not feasible to update the Q-table using Eq. (1), because the current state (s_z) is only affected by the previous visits (first order MDP), the expression in Eq. (1) is modified into Eq. (3):

Q_n(s_z, a_h) \leftarrow (1 - \alpha_n) Q_{n-1}(s_z, a_h) + \alpha_n \left[ r + \gamma \max_a Q_{n-1}(s_z, a) \right],   (3)

where 0 ≤ γ < 1 and α_n is defined as in Eq. (2).
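As a concrete reading of Eqs. (1)-(3), the sketch below implements the visit-count learning rate of Eq. (2) and the modified update of Eq. (3); the dictionary-based Q-table, the assumption that states are hashable, and the default value of γ are illustrative choices, not details taken from the paper.

```python
# Sketch of the nondeterministic Q-learning update of Eqs. (1)-(3).
# The dict-based Q-table and the default gamma are illustrative assumptions.
from collections import defaultdict

class QLearner:
    def __init__(self, actions, gamma=0.9):
        self.q = defaultdict(float)      # Q-table indexed by (state, action), zero-initialized
        self.visits = defaultdict(int)   # visit counter per (state, action), for Eq. (2)
        self.actions = actions
        self.gamma = gamma

    def update(self, s, a, reward, s_next=None):
        """Eq. (1) when a successor state is given; Eq. (3) when bootstrapping on s itself."""
        self.visits[(s, a)] += 1
        alpha = 1.0 / (1.0 + self.visits[(s, a)])                     # Eq. (2)
        bootstrap = s if s_next is None else s_next
        best_next = max(self.q[(bootstrap, b)] for b in self.actions)
        self.q[(s, a)] = (1.0 - alpha) * self.q[(s, a)] + alpha * (reward + self.gamma * best_next)

    def greedy_action(self, s):
        """Greedy policy pi*(s) = argmax_a Q(s, a)."""
        return max(self.actions, key=lambda a: self.q[(s, a)])
```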
3. PROPOSED METHOD
This paper proposes a new method in which the agent learns the best descriptor for each image. Figure 2 shows an illustration of the proposed scheme. The most important elements of the proposed approach are detailed in this section. First, the elements defining the tuple are introduced. Then, the proposed convergence strategy is presented. Next, the exploration-exploitation trade-off used to select the actions is introduced. Finally, the training stage is summarized. Note that the proposed method does not follow a classical RL scheme, where an action (a_h) and a reward/punishment (r_z) for a given state produce a new state. In our case, after applying a given action, the Q-table is updated but no new state results; instead, a new image is considered. The different stages are detailed next.

Figure 2: Illustration of the proposed scheme for image classification using Q-learning.
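To fix ideas, the interaction loop sketched below follows the flow of Figure 2 (image, state, descriptor choice, BoF/VT representation, SVM decision, reward, Q-table update). It is a hypothetical sketch, not the authors' algorithm: compute_state, extract_descriptor and bof_classify are invented helper names, the +1/-1 reward and the purely greedy action choice are placeholders (the paper's reward and exploration-exploitation strategy are defined later), and states are assumed to be discretized into hashable keys.

```python
# Hypothetical training loop for the proposed scheme (Fig. 2), using the
# QLearner sketched in Section 2. Helper names, the +1/-1 reward and the
# number of epochs are illustrative assumptions.
def train(images, labels, descriptors, learner, epochs=10):
    for _ in range(epochs):
        for image, true_label in zip(images, labels):
            s = compute_state(image)                   # image signature (Sec. 3.1.1), assumed hashable
            a = learner.greedy_action(s)               # action = choice of descriptor
            feats = extract_descriptor(image, descriptors[a])
            predicted = bof_classify(feats)            # BoF + VT representation, SVM decision
            reward = 1.0 if predicted == true_label else -1.0
            learner.update(s, a, reward)               # Eq. (3): no successor state is produced;
                                                       # the next iteration simply takes a new image
```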
3.1 Tuple definition
The tuple that defines our MDP is ⟨S, A, δ, τ⟩. The variables of the tuple are:

3.1.1 State definition, S
In our case, the state is a representation of the image. Many different image features can be used as a state. In our experiments, we convert the color image to a grey scale image. Then, we extract the mean, standard deviation and median values from that image (Fig. 3(a)). After that, the process is applied again by splitting the image into four equally sized square blocks and, for each sub-image, extracting again the mean, standard deviation and median values (as depicted in Fig. 3(b)). Additionally, the number of corners (Fig. 3(c)) and the number of blobs (Fig. 3(d)) are extracted from the whole grey scale image (Fig. 3(a)). The corners are obtained using a Harris corner detector [9],
while the blobs are obtained by converting the grey scale image to black and white using an Otsu threshold [18] and then applying a labelling algorithm (bwlabel) with 8-neighbour connectivity [7]. The result is a 17-dimensional vector with the following shape:
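A possible construction of this 17-dimensional state, sketched with OpenCV, is given below; the ordering of the components, the Harris and Otsu parameters, and counting thresholded Harris response pixels as "corners" are assumptions made for illustration only (the paper's own vector layout follows in the text).

```python
# Sketch of the 17-dimensional state of Sec. 3.1.1: 3 statistics of the whole
# image, 3 statistics of each of its four quadrants, a corner count and a blob count.
import cv2
import numpy as np

def compute_state(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape

    def stats(block):
        return [float(block.mean()), float(block.std()), float(np.median(block))]

    # mean / standard deviation / median of the whole image and its quadrants (15 values)
    state = stats(gray)
    for block in (gray[:h // 2, :w // 2], gray[:h // 2, w // 2:],
                  gray[h // 2:, :w // 2], gray[h // 2:, w // 2:]):
        state += stats(block)

    # number of corners, approximated by thresholding the Harris response [9]
    harris = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
    state.append(float((harris > 0.01 * harris.max()).sum()))

    # number of blobs: Otsu binarization [18] followed by 8-connected labelling [7]
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    n_labels, _ = cv2.connectedComponents(bw, connectivity=8)
    state.append(float(n_labels - 1))    # discard the background component

    return np.array(state)               # 17-dimensional state vector
```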