Exploring Deep Learning Using Information Theory Tools and Patch Ordering

Anonymous authors
Paper under double-blind review

ABSTRACT

We present a framework for automatically ordering image patches that enables in-depth analysis of how a dataset relates to the learnability of a classification task using convolutional neural networks. An image patch is a group of pixels residing in a contiguous area of a sample. Our preliminary experimental results show that an informed shuffling of patches at the sample level can expedite training by exposing important features at early stages of training. In addition, we conduct systematic experiments and provide evidence that a CNN's generalization capability does not correlate with the human-recognizable features present in training samples. We utilize the framework not only to show that the spatial locality of features within samples does not correlate with generalization, but also to expedite convergence while achieving similar generalization performance. Using multiple network architectures and datasets, we show that ordering image regions by a mutual information measure between adjacent patches enables CNNs to converge in a third of the total steps required to train the same network without patch ordering.

1 INTRODUCTION

Advances in Deep Learning (DL) and Convolutional Neural Networks (CNNs) have dramatically improved the state of the art in computer vision tasks. Many of these breakthroughs are attributed to successive feature extraction and an increasingly abstract representation of the underlying training data using multi-stage simple operations such as convolution. These operations possess several model parameters, such as convolution filters, which are trained to amplify and refine information relevant to the classification and to suppress irrelevant information (Goodfellow et al., 2016). The training procedure uses the backpropagation algorithm with supervision. This algorithm, combined with Stochastic Gradient Descent (SGD), attempts to minimize the overall error, i.e., the deviation from the true label, by computing the error gradient of each parameter and performing small updates in the opposite direction. Despite their success, the theoretical characterization of deep learning and CNNs is still in its infancy, and valuable correlations, such as the number of layers needed to achieve a certain performance, are not well understood. However, the success of deep learning has spawned many research avenues that seek to explain deep networks' exceptional generalization performance (Saxe et al., 2018; Mehta and Schwab, 2014; Pai, 2016; Tishby and Zaslavsky, 2015). One promising theoretical characterization of deep learning, and one that supports the intuition motivating this work, views feature extraction through information theory. In particular, it is based on the information bottleneck (IB) method, which is concerned with the problem of how one extracts an efficient representation of the relevant information contained in a large set of features (Slonim, 2002). Saxe et al. (2018) propose to study deep learning through the lens of information theory using the IB principle. In this characterization, deep learning is modeled as representation learning.
Each layer of a deep neural network can be seen as a set of summary statistics that contain some of the information present in the training set, while retaining as much information about the target output as possible (Saxe et al., 2018). In this context, the relevant information for a cat vs. dog classification task, for instance, is the information pattern present in all cat samples that is useful for predicting any picture of a cat. With this view, the amount of information relating the training set and the labels encoded in the hidden layers can be measured over the course of training (Tishby and Zaslavsky, 2015). Inspired by this view, we use information-theoretic measures of entropy, extended to measure image characteristics, to develop preprocessing techniques that enable robust feature extraction during training. One relevant insight presented in these papers is that the goal of DL is to capture and efficiently represent the relevant information in the input variable that describes the output variable. This is equivalent to the IB method, whose goal is to find a maximally compressed mapping of the input while preserving as much relevant information about the output as possible. This characterization leads us to ask: can we utilize information-theoretic techniques for images to make training efficient? In particular, can we preprocess the training set and feature maps such that the relevant information is captured in the early stages of training?

In supervised learning, we are interested in good feature representations of the input pattern that enable good prediction of the label (Janocha and Czarnecki, 2017). As a result, a training set for image classification tasks that employ supervised learning is constructed with the help of a human labeler. For instance, for a cat vs. dog classification problem, the human labeler must categorize each sample into one of the classes. During this process, the labeler must recognize and classify each input using their own experience and distinguishing capabilities. Considering this, a natural question we must answer before addressing the question above is: does human classification performance on the training dataset affect the learnability of the task? In other words, can networks learn from 'scrambled' samples that cannot be classified by the naked eye? This question was investigated by Zhang et al. (2016) with intriguing outcomes. The authors presented results indicating that CNNs are capable of easily fitting a training set containing samples that have no correlation with the labels (see Fig. 3 for illustration). These results have us reconsider the traditional view that networks build a hierarchy of features in increasing abstraction, i.e., learn combinations of pixels that make edges in the lower layers, combinations of edges that make up object parts in the middle layers, combinations of parts that make up an object in the next layer, and so on. This view is challenged by the findings highlighted in Zhang et al. (2016) and in this paper (see Section 5 for details). We use the information-theoretic characterization of deep learning to shed light on these questions by developing preprocessing and learning techniques that reduce convergence time by improving feature extraction from images using multilayered CNNs.
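This excerpt does not specify how the entropy and mutual information measures are estimated for images. The sketch below is a minimal illustration under our own assumptions: histogram-based estimates over 8-bit grayscale intensities, with `patch_entropy`, `mutual_information`, and the bin counts being illustrative choices rather than the paper's actual implementation.

```python
import numpy as np

def patch_entropy(patch, bins=256):
    # Histogram the 8-bit intensities, normalize to a probability mass
    # function, and compute Shannon entropy in bits.
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(patch_a, patch_b, bins=64):
    # Joint histogram of co-located pixel intensities of two equal-sized
    # patches; MI = sum_xy p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
    joint, _, _ = np.histogram2d(patch_a.ravel(), patch_b.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal over patch_a intensities
    py = pxy.sum(axis=0, keepdims=True)  # marginal over patch_b intensities
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())
```

With these estimators, two patches drawn from the same smooth region share high mutual information, while a patch paired with unrelated texture scores near zero, which is the property the ordering techniques below rely on.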
We first show that human-recognizable features matching the labels are not necessary for CNNs, and that networks are able to fit a training set containing scrambled samples with minimal impact on generalization. Equipped with this result, we then utilize similarity and information-theoretic measures of image characteristics to preprocess images and ease feature extraction from them during training. Our methods aim to expose important features of each training sample earlier in training by reorganizing image regions. The contributions of our approach are:

1. We provide a framework and algorithms for preprocessing a dataset to reorder image patches using techniques that minimize the mutual entropy of adjacent image patches of each training sample (a sketch appears at the end of this section). As the results demonstrate, organizing the patches of each training sample using measures such as the entropy of a patch and the mutual information between patches enables faster convergence.

2. We present several techniques for ranking samples that use information-theoretic measures of the relationship between adjacent patches, and present results that show faster convergence compared to standard training.

The Inception architecture (Szegedy et al., 2015), known for achieving exceptional results on image classification tasks, is used for evaluation. The network is first evaluated on the corresponding datasets to create baseline reference performance metrics for comparison. For each network we used the Adam optimizer with cross-entropy loss to gather empirical training, validation, and test data.

The remaining content is presented as follows. In Section 2, we present the patch ordering approach and highlight the design and implementation of the algorithms used to preprocess data and feature maps based on patch ordering.
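The excerpt describes the ordering objective in two ways (minimizing the "mutual entropy" of adjacent patches, and ordering by mutual information), so the exact criterion is ambiguous here. Below is a minimal, hypothetical sketch of one plausible reading: greedily chain tiles so that each successive tile shares high mutual information with its predecessor, seeding the chain with the highest-entropy tile. `reorder_patches` and its parameters are our own illustrative names; the function reuses the `patch_entropy` and `mutual_information` helpers sketched above and assumes the image dimensions are divisible by the patch size.

```python
def reorder_patches(img, patch=8):
    # Split a (H, W) grayscale image into non-overlapping patch x patch
    # tiles, greedily chain them so consecutive tiles share high mutual
    # information, then reassemble the chain in row-major order.
    # NOTE: illustrative reconstruction, not the paper's actual algorithm.
    h, w = img.shape
    tiles = [img[r:r + patch, c:c + patch]
             for r in range(0, h, patch)
             for c in range(0, w, patch)]
    remaining = set(range(len(tiles)))
    # Seed with the highest-entropy (most feature-rich) tile.
    order = [max(remaining, key=lambda i: patch_entropy(tiles[i]))]
    remaining.remove(order[0])
    while remaining:
        last = tiles[order[-1]]
        nxt = max(remaining,
                  key=lambda i: mutual_information(last, tiles[i]))
        order.append(nxt)
        remaining.remove(nxt)
    # Write the chained tiles back into a new image of the same shape.
    out = np.empty_like(img)
    cols = w // patch
    for k, idx in enumerate(order):
        r, c = divmod(k, cols)
        out[r * patch:(r + 1) * patch,
            c * patch:(c + 1) * patch] = tiles[idx]
    return out
```

For a 32x32 image with patch=8 this yields only 16 tiles, so the quadratic greedy pass is cheap per sample; the paper's actual patch size and preprocessing pipeline are not specified in this excerpt.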
