Extending the Morphological Hit-or-Miss Transform to Deep Neural Networks
Muhammad Aminul Islam, Member, IEEE, Bryce Murray, Student Member, IEEE, Andrew Buck, Member, IEEE, Derek T. Anderson, Senior Member, IEEE, Grant Scott, Senior Member, IEEE, Mihail Popescu, Senior Member, IEEE, and James Keller, Life Fellow, IEEE

Muhammad Aminul Islam is with the Department of Electrical & Computer Engineering and Computer Science, University of New Haven, CT 06516, USA (e-mail: amin [email protected]). Bryce Murray, Andrew Buck, Derek T. Anderson, Grant J. Scott, Mihail Popescu, and James Keller are with the Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA. Manuscript revised June 2020.

Abstract—While most deep learning architectures are built on convolution, alternative foundations like morphology are being explored for purposes like interpretability and its connection to the analysis and processing of geometric structures. The morphological hit-or-miss operation has the advantage that it takes into account both foreground and background information when evaluating a target shape in an image. Herein, we identify limitations in existing hit-or-miss neural definitions and we formulate an optimization problem to learn the transform relative to deeper architectures. To this end, we model the semantically important condition that the intersection of the hit and miss structuring elements (SEs) should be empty, and we present a way to express Don't Care (DNC), which is important for denoting regions of an SE that are not relevant to detecting a target pattern. Our analysis shows that convolution, in fact, acts like a hit-or-miss transform through semantic interpretation of its filter differences. On these premises, we introduce an extension that outperforms conventional convolution on benchmark data. Quantitative experiments are provided on synthetic and benchmark data, showing that the direct encoding hit-or-miss transform provides better interpretability on learned shapes consistent with objects, whereas our morphologically inspired generalized convolution yields higher classification accuracy. Last, qualitative hit and miss filter visualizations are provided relative to a single morphological layer.

Index Terms—Deep learning, morphology, hit-or-miss transform, convolution, convolutional neural network.

I. INTRODUCTION

Deep learning has demonstrated robust predictive accuracy across a wide range of applications. Notably, it has achieved, and in some cases surpassed, human-level performance in many cognitive tasks, for example object classification, detection, and recognition, semantic and instance segmentation, and depth prediction. This success can be attributed in part to the ability of a neural network (NN) to construct an arbitrary and very complex function by composition of simple functions, thus empowering it as a formidable machine learning tool.

To date, state-of-the-art deep learning algorithms mostly use convolution as their fundamental operation, thus the name convolutional neural network (CNN). Convolution has a rich and proud history in signal/image processing, for example extracting low-level features like edges, noise filtering (low/high pass filters), frequency-orientation filtering via the Gabor filter, etc. In a continuous space, it is defined as the integral of two functions (an image and a filter, in the context of image processing) after one is reversed and shifted, whereas in a discrete space the integral is realized via summation. CNNs progressively learn more complex features in deeper layers, with low-level features such as edges in the earlier layers and more complex shapes in the later layers, which are composites of features from the previous layers. While that has been the claim of many to date, recent work has emerged suggesting that mainstream CNNs, e.g., GoogLeNet, VGG, and ResNet, are not sufficiently learning to exploit shape. In [1], Geirhos et al. showed that CNNs are strongly biased towards recognizing texture over shape, which, as they put it, "is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies." Geirhos et al. support these claims using a total of nine experiments totaling 48,560 psychophysical trials across 97 observers. Their research highlights this gap and stresses the importance of shape as a central feature in vision.
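As a concrete illustration of the discrete definition above, the following NumPy sketch (our own illustrative example, not code from this paper; the function name conv2d_valid is ours) computes a 2-D "valid" convolution by reversing (flipping) the filter and sliding a sum of products over the image. Dropping the flip yields correlation.

import numpy as np

def conv2d_valid(image, kernel):
    # Discrete 2-D convolution over the 'valid' region: flip the kernel,
    # then slide a sum-of-products window across the image.
    kf = np.flip(kernel)  # the reversal is what distinguishes convolution from correlation
    H, W = image.shape
    kh, kw = kf.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kf)
    return out

# Example: a simple vertical-edge filter responds at the intensity transition,
# the kind of low-level feature an early CNN layer might learn.
img = np.zeros((6, 6))
img[:, 3:] = 1.0
edge = np.array([[-1.0, 0.0, 1.0]] * 3)
print(conv2d_valid(img, edge))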
An argument against convolution is that its filter does not lend itself to interpretable shape. Because convolution is correlation with a time/space-reversed filter, the filter weights do not necessarily indicate the absolute intensities/levels in the shape; instead, they signify relative importance. Recently, investigations like guided backpropagation [2] and saliency mapping [3] have made it possible to visualize what CNNs are perhaps looking at. However, these algorithms offer no guarantees; they inform us what spatial locations are of interest, not what exact shape, texture, color, contrast, or other features led a machine to make the decision it did. Furthermore, these explanations depend on an input image as well as the learned filters; the filters alone do not explain the learned model. In many applications, it is not important that we understand the chain of evidence that led to a decision; the only consideration is whether an AI can perform as well as, if not better than, a human counterpart. However, other applications, e.g., medical image segmentation in healthcare or automatic target recognition in security and defense, require glass-box versus black-box AI when the systems they impact intersect human lives. In scenarios like these, it is important that we ensure that shape, when/where applicable, is driving decision making. Furthermore, the ability to seed, or at a minimum understand, what shape drove a machine to make its decision is essential.

In contrast to convolution, morphology-based operations are more interpretable, a property well known and well studied in image processing that has only been lightly studied and explored in the context of deep neural networks [4]–[16].

Morphology is based on set theory, lattice theory, topology, and random functions, and has been used for the analysis and processing of geometric structures [17]–[26]. The most fundamental morphological operations are erosion and dilation, which can be combined to build more complex operations like opening, closing, the hit-or-miss transform, etc. Grayscale erosion and dilation find the minimal offset by which the foreground and background of a target pattern fit in an image, providing an absolute measure of fitness, in contrast to the relative measure given by convolution, and thereby facilitating the learning of interpretable structuring elements (SEs).
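To make this notion of fit concrete, the sketch below gives one common grayscale formulation of erosion, dilation, and the hit-or-miss transform in NumPy. It is an illustrative example under our own naming (erode, dilate, hit_or_miss, hit_se, miss_se) and is not necessarily the exact formulation adopted later in this paper.

import numpy as np

def erode(f, se):
    # Grayscale erosion: at each location, how far the structuring element (SE)
    # can be raised and still fit under the image patch (an absolute measure of fit).
    H, W = f.shape
    kh, kw = se.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.min(f[i:i + kh, j:j + kw] - se)
    return out

def dilate(f, se):
    # Grayscale dilation: the dual operation, computed here with the reflected SE.
    H, W = f.shape
    kh, kw = se.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.max(f[i:i + kh, j:j + kw] + np.flip(se))
    return out

def hit_or_miss(f, hit_se, miss_se):
    # One common grayscale hit-or-miss: the response is large only where the hit SE
    # fits the foreground and the miss SE constrains the background.
    return erode(f, hit_se) - dilate(f, miss_se)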
Recently, a few deep neural networks have emerged based on morphological operations like dilation, erosion, opening, and closing [4], [27]. In [4], Mellouli et al. explored pseudo-dilation and pseudo-erosion defined in terms of a weighted counter-harmonic mean, which can be carried out as the ratio of two convolution operations. However, their network is not an end-to-end morphological network, but rather a hybrid of traditional convolution and pseudo-morphological operations. In [27], Nogueira et al. proposed a neural network based on binary SEs (consisting of 1s and 0s) indicating which pixels are relevant to the target pattern. Their proposed implementation requires a large number of parameters, specifically s^2 binary filters

... shape in terms of relevant and non-relevant elements (i.e., DNC). Herein, we provide the conditions that make elements under the conventional definition of hit-or-miss act as DNC, and we show that the valid ranges for target and DNC elements are discontinuous. However, this constraint poses a challenge to data-driven learning using gradient descent, which requires the variables to reside in a (constrained or unconstrained) continuous space. As a result, we propose hit-or-miss transforms that implicitly enforce the non-intersecting condition and address DNC.

Last, while convolution can act like a hit-or-miss transform (when its "positive" filter weights correspond to foreground, "negative" weights to background, and 0s for DNC), it differs in some important respects. For example, elements in a hit-or-miss SE indicate the absolute intensity levels of the target shape, whereas weights in a convolution filter indicate relative levels/importance. Another difference is that the sum operation gives equal importance to all operands, versus the max (or min) in the hit-or-miss. On these premises, we propose a new extension to convolution, referred to hereafter as generalized convolution, obtained by replacing the sum with the generalized mean. The use of a parametric generalized mean allows one to choose how values in the local neighborhood contribute to the result; e.g., all contribute equally (as in the case of the mean) or just one drives the result (as in the max), or something
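The aggregation behind this generalized convolution can be illustrated with the parametric generalized (power) mean. The snippet below is our own sketch (the helper generalized_mean is hypothetical and assumes nonnegative inputs); it only shows how the exponent p moves the result between the arithmetic mean, the max, and the min, not the full learned layer described later in the paper.

import numpy as np

def generalized_mean(x, p):
    # Power (generalized) mean of nonnegative values:
    # p = 1 gives the arithmetic mean, large positive p approaches the max,
    # and large negative p approaches the min.
    x = np.asarray(x, dtype=float)
    return np.mean(x ** p) ** (1.0 / p)

vals = np.array([0.2, 0.5, 0.9])
print(generalized_mean(vals, 1))    # arithmetic mean: every operand contributes equally
print(generalized_mean(vals, 50))   # close to max(vals): one operand dominates
print(generalized_mean(vals, -50))  # close to min(vals)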