
Deep Bayesian Active Learning: A Brief Survey on Recent Advances

Salman Mohamadi
Computer Science and Electrical Engineering
West Virginia University
Morgantown, WV, USA
[email protected]

Hamidreza Amindavar
Electrical Engineering
Amirkabir University of Technology
Tehran, Iran
[email protected]

Abstract—Active learning frameworks offer efficient data annotation without remarkable accuracy degradation. In other words, active learning starts training the model with a small set of labeled data while exploring the space of unlabeled data in order to select the most informative samples to be labeled. Generally speaking, representing uncertainty is crucial in any active learning framework; deep learning methods, however, are not capable of either representing or manipulating model uncertainty. At the same time, from the real-world application perspective, uncertainty representation is getting more and more attention in the machine learning community. Deep Bayesian active learning frameworks, and generally any Bayesian active learning setting, provide practical considerations in the model that allow training with small data while representing the model uncertainty for further efficient training. In this paper, we briefly survey recent advances in Bayesian active learning and, in particular, deep Bayesian active learning frameworks.

Index Terms—Bayesian active learning, Deep learning, Posterior estimation, Bayesian inference, Semi-supervised learning

I. INTRODUCTION

Active learning is a framework in the area of machine learning in which the model starts training with a small amount of labeled data and then, in a sequential process, asks for more data samples from a pool of unlabeled data to be labeled. In fact, the key idea behind this framework is to achieve the desired accuracy while lowering the cost of labeling by efficiently asking for more labeled data. Therefore, compared to many other frameworks, active learning tries to achieve the same or higher accuracy using a smaller amount of labeled data [1], [2]. In contrast, semi-supervised learning addresses a relatively similar problem domain; however, there are differences between their learning paradigms. In more detail, semi-supervised learning frameworks use the unlabeled data for feature representations in order to better model the labeled data [3]–[5].

Classical methods and tools in the signal processing area, with an emphasis on parametric modeling, have been used in many domains with different types of data [7]–[9]. However, recent advances in machine learning, and in particular artificial neural networks, have shown that non-parametric models are capable of modeling almost any type of data at the cost of higher complexity. In this line, deep learning can be incorporated with classical tools and frameworks of machine learning, such as active learning, in order to improve performance.

Representing the uncertainty of either the embedding space or the output probability space is challenging, especially when deep learning tools and concepts are used. This challenge shows up in multiple scenarios in which the model uncertainty must be measured, such as the problems addressed by classical active learning and its various versions. In practice, we are interested in forming a desired output probability space in a typical active learning framework, which necessitates measuring and representing the model uncertainty. In recent years, multiple efforts have been made to introduce deep learning tools into the active learning framework. The authors of [6], in a pioneering effort, discussed that bringing deep learning tools into the active learning setting poses two major problems: uncertainty representation and the amount of data needed to train the model.
In the next two sections, in order to give a taste of the basic concepts and some of the theoretical background, a brief overview of the learning paradigm of active learning and of Bayesian neural networks is presented.

Fig. 1. Learning paradigm of active learning; as shown, at every iteration the training starts from scratch on the modified training data.

II. LEARNING PARADIGM IN ACTIVE LEARNING

Simply put, the goal of active learning is to minimize the cost of data annotation or labeling by efficiently selecting the unlabeled data to be labeled. In more detail, in every iteration of active learning, a new labeled data sample (or even a batch of data) is added to the training data, and the training process starts from scratch. This sequential training continues until the accuracy reaches the desired level. An overview of the learning cycle of active learning is shown in Fig. 1.

In each iteration, all of the unlabeled data samples are evaluated in the sense of uncertainty, and the best one is selected by a function, namely the acquisition function. Generally speaking, the acquisition function performs uncertainty sampling, diversity sampling, or both. While random acquisition is considered the baseline, there are several acquisition functions depending on the data setting; some of them, for image data, are presented and approximately formulated in [6].
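To make the loop above concrete, the following minimal sketch implements pool-based active learning with a pluggable acquisition function. It is our illustration, not code from any surveyed paper; the logistic regression classifier, the synthetic dataset, the budget of 20 queries, and the `entropy_acquisition` scorer are all assumptions made for the example.

```python
# Minimal pool-based active learning loop (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def entropy_acquisition(probs):
    """Uncertainty sampling: predictive entropy of each pool sample."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# Seed the labeled set with a few oracle-labeled samples per class.
labeled = [int(i) for c in (0, 1) for i in np.where(y == c)[0][:5]]
pool = [i for i in range(len(X)) if i not in labeled]

for iteration in range(20):                    # labeling budget
    # Retrain from scratch on the current labeled set (cf. Fig. 1).
    model = LogisticRegression(max_iter=1000)
    model.fit(X[labeled], y[labeled])

    # Score the pool with the acquisition function and query the most
    # uncertain sample (the lookup of y[best] is the oracle call).
    scores = entropy_acquisition(model.predict_proba(X[pool]))
    best = pool[int(np.argmax(scores))]
    labeled.append(best)
    pool.remove(best)
```

Random acquisition, the baseline mentioned above, corresponds to replacing the argmax with a uniform draw from the pool; diversity-aware acquisition functions would additionally penalize samples close to those already queried.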

III. CONVOLUTIONAL NEURAL NETWORKS WITH BAYESIAN PRIOR

Deep learning algorithms mostly rely on training a convolutional neural network (CNN). In fact, with the recent advancement of CNNs, one of their main advantages is that they capture the spatial information of the data [10]. Bayesian CNNs are capable of learning from small amounts of data while at the same time allowing model uncertainty representation using an acquisition function, which essentially makes them the key tool for active learning on big data with high-dimensional samples. Accordingly, the authors of [11] proposed a new version of CNNs with a Bayesian prior on the set of weights, i.e., a Gaussian prior $p(\mathbf{w})$, where $\mathbf{w} = \{W_1, W_2, \ldots, W_N\}$. Bayesian CNNs for classification tasks with a softmax layer are formulated with the likelihood model

\[
p(y = c \mid x, \mathbf{w}) = \mathrm{softmax}\big(f^{\mathbf{w}}(x)\big). \tag{1}
\]

However, in order to practically implement this model, Gal et al. [6] suggest approximate inference using stochastic regularization techniques, performed by applying dropout during training as well as at test time (to estimate the posterior). In more detail, they do this by finding a distribution $q^{*}_{\theta}(\mathbf{w})$ which, given a set of training data $D$, minimizes the Kullback-Leibler (KL) divergence between the estimated posterior and the exact posterior $p(\mathbf{w} \mid D)$. Finally, using Monte Carlo integration, we have

\[
\begin{aligned}
p(y = c \mid x, D) &= \int p(y = c \mid x, \mathbf{w})\, p(\mathbf{w} \mid D)\, d\mathbf{w} && (2)\\
&\approx \int p(y = c \mid x, \mathbf{w})\, q^{*}_{\theta}(\mathbf{w})\, d\mathbf{w} && (3)\\
&\approx \frac{1}{T} \sum_{t=1}^{T} p(y = c \mid x, \hat{\mathbf{w}}_{t}), && (4)
\end{aligned}
\]

with $q_{\theta}(\mathbf{w})$ the dropout distribution and $\hat{\mathbf{w}}_{t}$ a sample drawn from $q^{*}_{\theta}(\mathbf{w})$ [6].
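The Monte Carlo estimate in (4) is simple to realize in code: keep dropout active at test time and average the softmax outputs over $T$ stochastic forward passes. The sketch below is our illustration under assumed names; the toy one-layer CNN, the dropout rate of 0.5, and $T = 50$ are placeholders rather than the exact setup of [6] or [11].

```python
# Monte Carlo dropout estimate of p(y = c | x, D), cf. Eq. (4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallBayesianCNN(nn.Module):
    """Toy CNN with dropout before the output weight layer."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.drop = nn.Dropout(p=0.5)       # kept stochastic at test time
        self.fc = nn.Linear(16 * 28 * 28, num_classes)

    def forward(self, x):
        h = F.relu(self.conv(x))
        h = self.drop(h.flatten(1))         # dropout before the weight layer
        return self.fc(h)                   # logits f^w(x)

@torch.no_grad()
def mc_dropout_predict(model, x, T=50):
    """Average softmax outputs over T stochastic forward passes; each
    pass amounts to drawing w_t from the dropout distribution."""
    model.train()                           # keep dropout layers active
    probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(T)])
    return probs.mean(dim=0)                # (1/T) * sum_t p(y | x, w_t)

model = SmallBayesianCNN()
x = torch.randn(4, 1, 28, 28)               # a batch of dummy images
predictive = mc_dropout_predict(model, x)   # approximate p(y | x, D)
```

Note that `model.train()` is used here only to keep the dropout layers stochastic; in a model with batch normalization, those layers would need to be handled separately.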
IV. REVIEW ON RECENT ADVANCES

Bayesian inference methods allow the introduction of probabilistic frameworks to machine learning and deep learning. The notion behind introducing these kinds of frameworks to machine learning is that learning from data is treated as inferring optimal or near-optimal models for data representation, such as automatic model discovery. In this sense, Bayesian methods, and here specifically Bayesian active learning methods, gain attention due to their ability to represent uncertainty and even generalize better on small amounts of data [20].

One of the main works on introducing model uncertainty measurement and manipulation to active learning is by Gal and Ghahramani [6]. In fact, the major contribution of this paper is the introduction of Bayesian uncertainty estimation to active learning in order to form a deep active learning framework. In more detail, deep learning tools are data hungry while active learning tends to use small amounts of data; moreover, generally speaking, deep learning is not suitable for uncertainty representation, while active learning relies on model uncertainty measurement or even manipulation. Understanding these fundamental differences, the authors found the Bayesian approach to be the solution. In fact, they refine the general active learning framework, which usually works with SVMs and small amounts of data, to scale well to high-dimensional data such as images in the big-data regime. In practice, the authors put a Bayesian prior on the kernels of a convolutional neural network serving as the training engine of the active learning framework. They refer to their previous work [21], suggesting that in order to have a practical Bayesian CNN, Bayesian inference could be done through approximate inference in the Bayesian CNN, which makes the solution computationally tractable. The interesting point is that they empirically showed that dropout is a Bayesian approximation which can be used as a way to introduce uncertainty to deep learning [22]. The point here is that dropout is not only used in the training process: they do inference by applying dropout before every weight layer during training, and also at test time to sample from the approximate posterior. Compared to other active learning methods addressing big image data, such as those using RBF kernels, this framework performs better.

In this line, Jedoui et al. [23] go even further in the level of model uncertainty by assuming that the output space is no longer mutually exclusive, for instance when there is more than one correct output for a single input. They empirically show that classical uncertainty sampling does not perform better than random sampling on such tasks, e.g., Visual Question Answering; therefore, referring to [6], [18], [19], [21], they take a similar strategy by using Bayesian uncertainty in a semantically structured embedding space rather than modeling the uncertainty of the output probability space. Referring to Gal and Ghahramani's works, they mention that dropout can be interpreted as a variational Bayesian approximation [21], [22], where the approximating distribution is a mixture of two Gaussians with small variances, the mean of one of the Gaussians being zero. The prediction uncertainty is caused by uncertainty in the weights, which can be measured through the approximate posterior using Monte Carlo integration.

The authors of [24] pose another, similar problem by introducing deep learning with relatively very large amounts of data and big networks into active learning, and suggest the necessity of systematic labeling requests in the form of batch active learning (a batch rather than a single sample in each active learning iteration). They offer batch active learning in order to address the problem that existing greedy algorithms become computationally costly and sensitive to the slightest changes of the model. The authors propose a model aimed at efficiently scaled active learning by well estimating the data posterior. They suggest scenarios in which more efficiency comes with one batch rather than one data sample at each iteration. In this paper, the authors take multiple active learning methods and different acquisition functions into account for their objective of efficient batch selection in the sense of sparsity, or sparse subset approximation. Moreover, they claim that, based on their experiments, reference [6], as a Bayesian approach, outperforms others in many problem settings. More specifically, with the same Bayesian active learning framework proposed by [6] for capturing uncertainty, they target the optimum batch selection by finding the data posterior; however, as the active learning setting does not provide access to the labels before querying the pool set, they take the expectation with respect to the current predictive posterior distribution. This work presents a closed-form solution consistent with the basic theoretical setting of reference [6].
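For contrast with the sparse-subset view of [24], the naive way to batch-ify the loop of Section II is to simply take the top-$B$ acquisition scores in each iteration. The sketch below is our illustrative baseline, not the method of [24]; its tendency to query near-duplicate points is precisely the problem that more careful batch constructions address.

```python
# Naive top-B batch acquisition (illustrative baseline only).
import numpy as np

def top_b_batch(scores, pool_indices, B=10):
    """Return the B pool indices with the largest acquisition scores.
    Near-duplicate samples receive near-identical scores, so this
    baseline often queries redundant batches."""
    order = np.argsort(scores)[::-1][:B]
    return [pool_indices[i] for i in order]

# Example: replaces the single-sample argmax in the Section II loop.
scores = np.random.rand(100)          # acquisition scores for the pool
batch = top_b_batch(scores, list(range(100)), B=10)
```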
Gal and Ghahramani [11] suggest a Bernoulli approximate variational inference method which prevents CNNs from over-fitting: by considering a Bayesian prior on the weights of the network, the model becomes capable of learning from small data with no over-fitting or higher computational complexity. This work can be considered one of the foundations of the development of deep Bayesian active learning.

The authors of [12], with an emphasis on the fact that the nature of active learning does not allow a thorough comparison of models and acquisition functions, explore more than four experiments with different models and acquisition functions for multiple natural language processing tasks, and finally show that deep Bayesian active learning consistently provides the best performance.

Kandasamy et al. [13] underscore the fact that classical methods for posterior estimation are query inefficient in the sense of estimating the likelihood. They suggest that a query-efficient approach would be posterior approximation using a Bayesian active learning framework. Considering a Gaussian prior, a utility function is formed as a measure of divergence between probabilities (here, densities), and at each time step the estimated most informative query is sent to an oracle; then the posterior is updated. Their experiments confirm the query efficiency of the approach.

Reference [14] suggests that since most distance metric learning methods are sensitive to the size of the data and select the training pairs randomly, they cannot return satisfactory results in many real-world problems. The authors address this problem by first introducing a Bayesian approach to distance metric learning, and then extending this framework with uncertainty modeling, which results in a Bayesian framework for active distance metric learning. This framework enables efficient pair selection for training as well as posterior probability approximation.

Reference [15] tries to combine the benefits of Bayesian active learning and semi-supervised learning by introducing an active expectation-maximization framework with pseudo-labels. The authors use the Monte Carlo dropout introduced in [6] to compute the probability outputs.

Zeng et al. [16] address the question of whether it is possible to measure the model uncertainty without fully Bayesian CNNs. Their results on several Bayesian CNNs confirm that in order to represent the model uncertainty, one only needs to apply the Bayesian prior to a few of the last layers before the output. With this setting, the model enjoys the benefits of both deterministic and Bayesian CNNs.

Lewenberg et al. [17] address the problem of active surveying using a Bayesian active learner. In fact, they use dimensionality reduction in an active learning framework, applying a Bayesian prior, in order to design a system that predicts the answers to unasked questions using limited sequential active question asking. Their framework outperforms several state-of-the-art frameworks based on enhanced linear regression in terms of prediction accuracy and response to missing data.

Houlsby et al. [18] propose a measure of predictive entropy which is then used in a classification framework based on Gaussian processes. Their method performs well compared to several similar classification frameworks, while the computational complexity is not greater than that of other methods. Finally, they extend their framework to Gaussian process preference learning by casting binary preference learning as a classification setting. One of the main advantages of this method is that it provides the desired accuracy at a relatively low computational complexity cost.
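The acquisition of Houlsby et al. [18], often referred to as BALD, scores a candidate by the mutual information between its label and the model parameters, i.e., the predictive entropy minus the expected conditional entropy; [6] approximates it with Monte Carlo dropout samples. The sketch below computes this quantity from an array of stochastic softmax outputs; the array name, shapes, and the Dirichlet-distributed dummy data are our assumptions for the example.

```python
# BALD acquisition from T stochastic forward passes (cf. [6], [18]):
# I(y; w) = H[ E_t p_t ] - E_t H[ p_t ], computed per pool sample.
import numpy as np

def bald_scores(mc_probs):
    """mc_probs: shape (T, n_pool, n_classes), softmax outputs of T
    dropout samples. Returns mutual-information scores, shape (n_pool,)."""
    eps = 1e-12
    mean_p = mc_probs.mean(axis=0)                              # E_t p_t
    h_mean = -np.sum(mean_p * np.log(mean_p + eps), axis=-1)    # H[E_t p_t]
    mean_h = -np.sum(mc_probs * np.log(mc_probs + eps), axis=-1).mean(axis=0)
    return h_mean - mean_h                                      # >= 0

# Example with dummy dropout samples: 20 passes, 100-point pool, 10 classes.
rng = np.random.default_rng(0)
mc_probs = rng.dirichlet(np.ones(10), size=(20, 100))           # (T, pool, C)
query = int(np.argmax(bald_scores(mc_probs)))                   # point to label
```

A high BALD score flags points on which the individual dropout models are each confident but disagree with one another, i.e., points whose labels are informative about the weights.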
V. CONCLUSION

In this paper, we surveyed recent advances in Bayesian active learning, with an emphasis on deep Bayesian active learning. Our main focus is on the works contributing to the theory of this problem domain; however, some interesting works on the application of Bayesian active learning are also surveyed. Bayesian inference approaches hold a very important place in machine learning, and recently much attention has shifted toward representing data, model, and even network structure uncertainty using these approaches. Bayesian active learning and its intersection with deep learning concepts provide very interesting frameworks in terms of theory and application.

REFERENCES

[1] Settles, Burr. "Active learning literature survey." University of Wisconsin-Madison Department of Computer Sciences, 2009.
[2] Sener, Ozan, and Silvio Savarese. "Active learning for convolutional neural networks: A core-set approach." arXiv preprint arXiv:1708.00489 (2017).
[3] Taherkhani, Fariborz, Nasser M. Nasrabadi, and Jeremy Dawson. "A deep face identification network enhanced by facial attributes prediction." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 553-560. 2018.
[4] Taherkhani, Fariborz, Hadi Kazemi, and Nasser M. Nasrabadi. "Matrix completion for graph-based deep semi-supervised learning." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 5058-5065. 2019.
[5] Taherkhani, Fariborz, Hadi Kazemi, Ali Dabouei, Jeremy Dawson, and Nasser M. Nasrabadi. "A weakly supervised fine label classifier enhanced by coarse supervision." In Proceedings of the IEEE International Conference on Computer Vision, pp. 6459-6468. 2019.
[6] Gal, Yarin, Riashat Islam, and Zoubin Ghahramani. "Deep Bayesian active learning with image data." arXiv preprint arXiv:1703.02910 (2017).
[7] Mohamadi, Salman, Hamidreza Amindavar, and S. M. Ali Tayaranian Hosseini. "ARIMA-GARCH modeling for epileptic seizure prediction." In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 994-998. IEEE, 2017.
[8] Mohamadi, Salman, Farhang Yeganegi, and Nasser M. Nasrabadi. "Detection and statistical modeling of birth-death anomaly." arXiv preprint arXiv:1906.11788 (2019).
[9] Mohamadi, Salman, Farhang Yeganegi, and Hamidreza Amindavar. "A new framework for spatial modeling and synthesis of genome sequence." arXiv preprint arXiv:1908.03342 (2019).
[10] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Communications of the ACM 60, no. 6 (2017): 84-90.
[11] Gal, Yarin, and Zoubin Ghahramani. "Bayesian convolutional neural networks with Bernoulli approximate variational inference." arXiv preprint arXiv:1506.02158 (2015).
[12] Siddhant, Aditya, and Zachary C. Lipton. "Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study." arXiv preprint arXiv:1808.05697 (2018).
[13] Kandasamy, Kirthevasan, Jeff Schneider, and Barnabás Póczos. "Bayesian active learning for posterior estimation." In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 3605-3611. 2015.
[14] Yang, Liu, Rong Jin, and Rahul Sukthankar. "Bayesian active distance metric learning." arXiv preprint arXiv:1206.5283 (2012).
[15] Rottmann, Matthias, Karsten Kahl, and Hanno Gottschalk. "Deep Bayesian active semi-supervised learning." In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 158-164. IEEE, 2018.
[16] Zeng, Jiaming, Adam Lesnikowski, and Jose M. Alvarez. "The relevance of Bayesian layer positioning to model uncertainty in deep Bayesian active learning." arXiv preprint arXiv:1811.12535 (2018).
[17] Lewenberg, Yoad, Yoram Bachrach, Ulrich Paquet, and Jeffrey S. Rosenschein. "Knowing what to ask: A Bayesian active learning approach to the surveying problem." In AAAI, pp. 1396-1402. 2017.
[18] Houlsby, Neil, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. "Bayesian active learning for classification and preference learning." arXiv preprint arXiv:1112.5745 (2011).
[19] Karpathy, Andrej, Armand Joulin, and Li F. Fei-Fei. "Deep fragment embeddings for bidirectional image sentence mapping." In Advances in Neural Information Processing Systems, pp. 1889-1897. 2014.
[20] Ghahramani, Zoubin. "Probabilistic machine learning and artificial intelligence." Nature 521, no. 7553 (2015): 452-459.
[21] Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Insights and applications." In Deep Learning Workshop, ICML, vol. 1, p. 2. 2015.
[22] Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." In International Conference on Machine Learning, pp. 1050-1059. 2016.
[23] Jedoui, Khaled, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. "Deep Bayesian active learning for multiple correct outputs." arXiv preprint arXiv:1912.01119 (2019).
[24] Pinsler, Robert, Jonathan Gordon, Eric Nalisnick, and José Miguel Hernández-Lobato. "Bayesian batch active learning as sparse subset approximation." In Advances in Neural Information Processing Systems, pp. 6359-6370. 2019.