Classification with Costly Features Using Deep Reinforcement Learning
Jaromír Janisch, Tomáš Pevný and Viliam Lisý
Artificial Intelligence Center, Department of Computer Science
Faculty of Electrical Engineering, Czech Technical University in Prague
{jaromir.janisch, tomas.pevny, viliam.lisy}@fel.cvut.cz

Abstract

We study a classification problem where each feature can be acquired for a cost and the goal is to optimize a trade-off between the expected classification error and the feature cost. We revisit a former approach that framed the problem as a sequential decision-making problem and solved it by Q-learning with a linear approximation, where individual actions either request feature values or terminate the episode by providing a classification decision. On a set of eight problems, we demonstrate that by replacing the linear approximation with neural networks the approach becomes comparable to state-of-the-art algorithms developed specifically for this problem. The approach is flexible, as it can be improved with any new reinforcement learning enhancement, it allows the inclusion of a pre-trained high-performance classifier, and unlike prior art, its performance is robust across all evaluated datasets.

[Figure 1: Illustrative sequential process of classification. The agent sequentially asks for different features (actions a_f) and finally performs a classification (a_c). The particular decisions are influenced by the observed values.]

Introduction

In real-world classification problems, one often has to deal with limited resources - time, money, computational power and many others. Medical practitioners strive to make a diagnosis for their patients with high accuracy. Yet at the same time, they want to minimize the amount of money spent on examinations, or the amount of time that all the procedures take. In the domain of computer security, network traffic is often analyzed by a human operator who queries different expensive data sources or cloud services and eventually decides whether the currently examined traffic is malicious or benign. In robotics, an agent may utilize several measurement devices to determine its current position. Here, each measurement has an associated cost in terms of energy consumption. In all of these cases, the agent gathers a set of features, but it is not desirable to have a static set that works well on average. Instead, we want to optimize for a specific sample - the current patient, a certain computer on a network, or the robot's immediate location.

These real-world problems give rise to the problem of Classification with Costly Features (CwCF). In this setting, an algorithm has to classify a sample, but can only reveal its features at a defined cost. Each sample is treated independently, and for each sample the algorithm selects features sequentially, conditioning on the values already revealed. Inherently, a different subset of features can be selected for different samples. The goal is to minimize the expected classification error, while also minimizing the expected incurred cost.

In this paper, we extend the approach taken by Dulac-Arnold et al. (2011), which proposed to formalize the problem as a Markov decision process (MDP) and solve it with linearly approximated Q-learning. In this formalization, each sample corresponds to an episode, in which an agent sequentially decides whether to acquire another feature (and which one), or whether to classify the sample (see Figure 1). At each step, the agent can base its decision on the values of the features acquired so far. For the actions requesting a feature, the agent receives a negative reward equal to the feature cost. For the classification actions, the reward is based on whether the prediction is correct. Dulac-Arnold et al. prove in their paper that the optimal solution to this MDP corresponds to the optimal solution of the original CwCF problem.
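To make this sequential formulation concrete, a minimal sketch of the episode dynamics is given below. It is illustrative only, not the paper's implementation: the class name CwCFEnv, the masked-vector observation, and the exact interface are our assumptions (the 0/-1 classification reward and the -lambda*c(f) feature reward follow the problem definition below).

    import numpy as np

    class CwCFEnv:
        """Sketch of the CwCF episode dynamics: one sample per episode; actions either
        acquire a feature (reward -lambda * c(f)) or classify (reward 0 if correct, -1 otherwise)."""

        def __init__(self, X, Y, costs, lmbda):
            self.X, self.Y = X, Y          # dataset: feature matrix and labels
            self.costs = costs             # per-feature acquisition costs c(f)
            self.lmbda = lmbda             # cost scaling factor lambda

        def reset(self):
            i = np.random.randint(len(self.X))          # one episode = one sample
            self.x, self.y = self.X[i], self.Y[i]
            self.acquired = np.zeros(len(self.x), dtype=bool)
            return self._observation()

        def _observation(self):
            # The agent sees only the acquired feature values plus the acquisition mask.
            return np.where(self.acquired, self.x, 0.0), self.acquired.copy()

        def step(self, action):
            n = len(self.x)
            if action < n:                               # feature-selecting action a_f
                self.acquired[action] = True
                reward = -self.lmbda * self.costs[action]
                return self._observation(), reward, False
            predicted_class = action - n                 # classification action a_c
            reward = 0.0 if predicted_class == self.y else -1.0
            return self._observation(), reward, True

A Q-learning agent interacting with such an environment learns, per sample, when the expected improvement in classification accuracy no longer outweighs the λ-scaled acquisition cost, and then issues one of the classification actions.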
Since 2011, we are not aware of any work improving upon the method of Dulac-Arnold et al. In this paper, we take a fresh look at the method and show that a simple replacement of the linear approximation with neural networks can outperform the most recent methods specifically designed for the CwCF problem. We take this approach further and implement several state-of-the-art techniques from Deep Reinforcement Learning (Deep RL), where we show that they improve performance, convergence speed and stability. We argue that our method is extensible in various ways and we implement two extensions: First, we identify parts of the model that can be pre-trained in a fast and supervised way, which improves performance early in the training. Second, we allow the model to selectively use an external High-Performance Classifier (HPC), trained separately with all features. In many real-world domains, there is an existing legacy, cost-agnostic model which can be utilized. This approach is similar to (Nan and Saligrama 2017), but in our case it is a straightforward extension to the baseline model. We evaluate and compare the method on several two- and multi-class public datasets with the number of features ranging from 8 to 784. We do not perform any thorough hyperparameter search for each dataset, yet our method robustly performs well on all of them, while often outperforming the prior-art algorithms. The source code is available at github.com/jaromiru/cwcf.

Problem definition

Let (x, y) ∈ D be a sample drawn from a data distribution D. The vector x ∈ X ⊆ R^n contains feature values, where x_i is the value of feature f_i ∈ F = {f_1, ..., f_n}, n is the number of features, and y ∈ Y is a class. Let c : F → R be a function mapping a feature f to its real-valued cost c(f), and let λ ∈ [0, 1] be a cost scaling factor.

The model for classifying one sample is a parametrized couple of functions y_θ : X → Y and z_θ : X → ℘(F), where y_θ classifies and z_θ returns the features used in the classification. The goal is to find parameters θ that minimize the expected classification error along with the λ-scaled expected feature cost. That is:

    \operatorname*{argmin}_\theta \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \Big[ \ell(y_\theta(x), y) + \lambda \sum_{f \in z_\theta(x)} c(f) \Big]    (1)

We view the problem as a sequential decision-making problem, where at each step an agent selects a feature to view or makes a class prediction. We use the standard reinforcement learning setting, in which the agent explores its environment through actions and observes rewards and states. We represent this environment as a partially observable Markov decision process (POMDP), where each episode corresponds to the classification of one sample from the dataset. We use the POMDP definition because it allows a simple definition of the transition mechanics. However, our model solves a transformed MDP with stochastic transitions and rewards.

Let S be the state space, A the set of actions, and r, t the reward and transition functions. A state s = (x, y, F̄) ∈ S represents a sample (x, y) and the currently selected set of features F̄. The agent receives only an observation o = {(x_i, f_i) | ∀ f_i ∈ F̄}, that is, the selected parts of x without the label.

An action a ∈ A = A_c ∪ A_f is either one of the classification actions A_c = Y, or one of the feature-selecting actions A_f = F. Classification actions A_c terminate the episode, and the agent receives a reward of 0 in case of a correct classification and −1 otherwise. Feature-selecting actions A_f reveal the corresponding value x_i, and the agent receives a reward of −λ c(f_i).

By altering the value of λ, one can trade off precision against average cost. A higher λ forces the agent to prefer lower cost and shorter episodes over precision, and vice versa. Further intuition can be gained from Figure 2.

[Figure 2: Different settings of λ result in different cost-accuracy trade-offs. Five different runs, miniboone dataset.]

The initial state does not contain any disclosed features, s_0 = (x, y, ∅), and is drawn from the data distribution D. The environment is deterministic, with a transition function t : S × A → S ∪ {T}, where T is the terminal state:

    t((x, y, \bar{\mathcal{F}}), a) = \begin{cases} \mathcal{T} & \text{if } a \in \mathcal{A}_c \\ (x, y, \bar{\mathcal{F}} \cup a) & \text{if } a \in \mathcal{A}_f \end{cases}

These properties make the environment inherently episodic, with a maximal episode length of |F| + 1. Finding the optimal policy π_θ that maximizes the expected reward in this environment is equivalent to solving eq. (1), as shown by Dulac-Arnold et al. (2011).
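As a concrete reading of eq. (1), the sketch below computes its empirical value for a trained model over a dataset. It is only illustrative: the model.predict(x) interface returning the pair (y_θ(x), z_θ(x)) and the choice of ℓ as a 0/1 loss are assumptions, not the paper's API.

    def cwcf_objective(model, X, Y, costs, lmbda):
        """Empirical value of eq. (1): mean 0/1 classification loss plus lambda-scaled feature cost."""
        total = 0.0
        for x, y in zip(X, Y):
            y_hat, used_features = model.predict(x)     # assumed to return (y_theta(x), z_theta(x))
            loss = 0.0 if y_hat == y else 1.0           # l(y_theta(x), y) as a 0/1 loss
            feature_cost = sum(costs[f] for f in used_features)
            total += loss + lmbda * feature_cost
        return total / len(X)

With ℓ taken as the 0/1 loss, this quantity is exactly the negative of the expected episode return under the rewards defined above (−λ c(f_i) per acquired feature, 0 or −1 for the final classification), which is why maximizing the return is equivalent to solving eq. (1).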
Background

In Q-learning, one seeks the optimal function Q*, representing the expected discounted reward for taking an action a in a state s and then following the optimal policy. It satisfies the Bellman equation:

    Q^*(s, a) = \mathbb{E}_{s' \sim t(s,a)} \Big[ r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \Big]    (2)

where r(s, a, s') is the received reward and γ ≤ 1 is the discount factor. A neural network with parameters θ takes a state s and outputs an estimate Q^θ(s, a), jointly for all actions a. It is optimized by minimizing the MSE between the two sides of eq. (2) for transitions (s_t, a_t, r_t, s_{t+1}) empirically experienced by an agent following a greedy policy π_θ(s) = argmax_a Q^θ(s, a). Formally, we are looking for parameters θ by iteratively minimizing the loss function ℓ_θ for a batch of transitions B:

    \ell_\theta(\mathcal{B}) = \frac{1}{|\mathcal{B}|} \sum_{(s,a,r,s') \in \mathcal{B}} \Big( Q^\theta(s, a) - \big( r + \gamma \max_{a'} Q^\theta(s', a') \big) \Big)^2
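A minimal PyTorch sketch of this loss is shown below, assuming q_net maps a batch of states to Q-values for all actions and that batch holds tensors (s, a, r, s_next, done); these names and the terminal-state masking are our assumptions, not the paper's code.

    import torch
    import torch.nn.functional as F

    def q_learning_loss(q_net, batch, gamma):
        """MSE between the two sides of the Bellman equation (eq. 2) on a batch of transitions."""
        s, a, r, s_next, done = batch
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q_theta(s, a) for the taken actions
        with torch.no_grad():                                   # the bootstrap target is held fixed
            q_next = q_net(s_next).max(dim=1).values            # max_a' Q_theta(s', a')
            target = r + gamma * q_next * (1.0 - done)          # no future reward after a terminal state
        return F.mse_loss(q_sa, target)

Here a is an integer tensor of action indices and done a 0/1 float mask marking terminal transitions; in practice the bootstrap target is commonly computed with a separate target network, a standard DQN stabilization.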