
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

arXiv:1912.02259v2 [cs.CV] 28 Sep 2020

Extending the Morphological Hit-or-Miss Transform to Deep Neural Networks

Muhammad Aminul Islam, Member, IEEE, Bryce Murray, Student Member, IEEE, Andrew Buck, Member, IEEE, Derek T. Anderson, Senior Member, IEEE, Grant Scott, Senior Member, IEEE, Mihail Popescu, Senior Member, IEEE, James Keller, Life Fellow, IEEE

Muhammad Aminul Islam is with the Department of Electrical & Computer Engineering and Computer Science, University of New Haven, CT 06516, USA. E-mail: (amin [email protected]). Bryce Murray, Andrew Buck, Derek T. Anderson, Grant J. Scott, Mihail Popescu, and James Keller are with the Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211. Manuscript revised June, 2020.

Abstract—While most deep learning architectures are built on convolution, alternative foundations like morphology are being explored for purposes like interpretability and its connection to the analysis and processing of geometric structures. The morphological hit-or-miss operation has the advantage that it takes into account both foreground and background information when evaluating target shape in an image. Herein, we identify limitations in existing hit-or-miss neural definitions and we formulate an optimization problem to learn the transform relative to deeper architectures. To this end, we model the semantically important condition that the intersection of the hit and miss structuring elements (SEs) should be empty and we present a way to express Don't Care (DNC), which is important for denoting regions of an SE that are not relevant to detecting a target pattern. Our analysis shows that convolution, in fact, acts like a hit-miss transform through semantic interpretation of its filter differences. On these premises, we introduce an extension that outperforms conventional convolution on benchmark data. Quantitative experiments are provided on synthetic and benchmark data, showing that the direct encoding hit-or-miss transform provides better interpretability on learned shapes consistent with objects, whereas our morphologically inspired generalized convolution yields higher classification accuracy. Last, qualitative hit and miss filter visualizations are provided relative to a single morphological layer.

Index Terms—Deep learning, morphology, hit-or-miss transform, convolution, convolutional neural network

I. INTRODUCTION

Deep learning has demonstrated robust predictive accuracy across a wide range of applications. Notably, it has achieved and, in some cases, surpassed human-level performance in many cognitive tasks, for example, object classification, detection, and recognition, semantic and instance segmentation, and depth prediction. This success can be attributed in part to the ability of a neural network (NN) to construct an arbitrary and very complex function by composition of simple functions, thus empowering it as a formidable machine learning tool.

To date, state-of-the-art deep learning algorithms mostly use convolution as their fundamental operation, thus the name convolutional neural network (CNN). Convolution has a rich and proud history in signal/image processing, for example extracting low-level features like edges, noise filtering (low/high pass filters), frequency-orientation filtering via the Gabor, etc. In a continuous space, it is defined as the integral of two functions—an image and a filter in the context of image processing—after one is reversed and shifted, whereas in discrete space, the integral is realized via summation. CNNs progressively learn more complex features in deeper layers, with low-level features such as edges in the earlier layers and more complex shapes in the later layers, which are composites of features in the previous layer. While that has been the claim of many to date, recent work has emerged suggesting that mainstream CNNs–e.g., GoogLeNet, VGG, ResNet, etc.–are not sufficiently learning to exploit shape. In [1], Geirhos et al. showed that CNNs are strongly biased towards recognizing texture over shape, which as they put it "is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies." Geirhos et al. support these claims using a total of nine experiments totaling 48,560 psychophysical trials with respect to 97 observers. Their research highlights the gap and stresses the importance of shape as a central feature in vision.

An argument against convolution is that its filter does not lend itself to interpretable shape. Because convolution is correlation with a time/spatially reversed filter, the filter weights do not necessarily indicate the absolute intensities/levels in shape. Instead, they signify relative importance. Recently, investigations like guided backpropagation [2] and saliency mapping [3] have made it possible to visualize what CNNs are perhaps looking at. However, these algorithms are not guarantees; they inform us what spatial locations are of interest, not what exact shape, texture, color, contrast, or other features led a machine to make the decision it did. Furthermore, these explanations depend on an input image and the learned filters. The filters alone do not explain the learned model. In many applications, it is not important that we understand the chain of evidence that led to a decision. The only consideration is if an AI can perform as well, if not better, than a human counterpart. However, other applications, e.g., medical image segmentation in healthcare or automatic target recognition in security and defense, require glass versus black box AI when the systems that they impact intersect human lives. In scenarios like these, it is important that we ensure that shape, when/where applicable, is driving decision making. Furthermore, the ability to seed, or at a minimum understand, what shape drove a machine to make its decision is essential.

In contrast to convolution, morphology-based operations are more interpretable—a property well-known and well-studied in image processing, which has only been lightly studied and explored in the context of deep neural networks [4]–[16]. Morphology is based on set theory, lattice theory, topology and random functions and has been used for the analysis and processing of geometric structures [17]–[26]. The most fundamental morphological operations are dilation and erosion, which can be combined to build more complex operations like opening, closing, the hit-or-miss transform, etc. Grayscale erosion and dilation are used to find the minimal offset by which the foreground and background of a target pattern fit in an image, providing an absolute measure of fitness in contrast to the relative measure by convolution, facilitating the learning of interpretable structuring elements (SEs).

Recently, a few deep neural networks have emerged based on morphological operations like dilation, erosion, opening, and closing [4], [27]. In [4], Mellouli et al. explored pseudo-dilation and pseudo-erosion defined in terms of a weighted counter harmonic mean, which can be carried out as the ratio of two convolution operations. However, their network is not an end-to-end morphological network, rather a hybrid of traditional convolution and pseudo-morphological operations. In [27], Nogueira et al. proposed a neural network based on binary SEs (consisting of 1s and 0s) indicating which pixels are relevant to the target pattern. Their proposed implementation requires a large number of parameters, specifically s² binary filters of size s × s just to represent a single s × s SE, making the method expensive in terms of storage and computation and not suitable for deep learning. Furthermore, they did not conduct any experiments nor provide results for popular computer vision benchmark datasets, e.g., MNIST or Cifar. More importantly, none of these algorithms simultaneously apply dilation and erosion on an image to take into account both foreground and background. In the morphological community, there is a well-known operation for achieving this, the hit-or-miss transform, the subject of our current article.

Following the success of convolution based shared weight neural networks on handwritten digit recognition tasks, Gader et al. introduced a generalized hit-or-miss transform network, referred to as an image algebra network [28]. Later, the standard hit-or-miss transform was applied in target detection [7]. All of these methods employed two SEs, one for the hit to find the "fitness" of an image relative to the target foreground and another for the miss to find the "fitness" relative to the target background. However, existing grayscale hit-or-miss transform definitions as well as their neural network implementations [6], [17], [28]–[31] neither state nor enforce the condition that the intersection of the hit and miss SEs must be empty. Failing to meet this condition can result in semantically inconsistent and uninterpretable SEs. To address this, we put forth an optimization problem enforcing the non-intersecting condition.

However, considering only foreground and background is not sufficient to describe target shape. We also need Don't Care (DNC), which denotes regions of the SE that are not relevant to detecting a target pattern. While binary morphology considers 0s as DNCs and ignores them during computation, its grayscale extension unfortunately considers all elements, including 0s. Therefore, we propose a new extension to the hit-or-miss transform which allows it to describe a grayscale shape in terms of relevant and non-relevant elements (i.e., DNC). Herein, we provide the conditions that will make elements under the conventional definition of hit-or-miss act as DNC and we show that the valid ranges for target and DNC elements are discontinuous. However, this constraint poses a challenge to data-driven learning using gradient descent, which requires the variables to reside in a (constrained or unconstrained) continuous space. As a result, we propose hit-or-miss transforms that implicitly enforce the non-intersecting condition and address DNC.

Last, while convolution can act like a hit-or-miss transform – when its "positive filter weights" correspond to foreground, "negative weights" to background, and 0s to DNC – it differs in some important aspects. For example, elements in a hit-or-miss SE indicate the absolute intensity levels in the target shape whereas weights in a convolution filter indicate relative levels/importance. Another difference is that the sum operation gives equal importance to all operands versus max (or min) in the hit-or-miss. On these premises, we propose a new extension to convolution, referred to as generalized convolution hereafter, by replacing the sum with the generalized mean. The use of a parametric generalized mean allows one to choose how values in the local neighborhood contribute to the result; e.g., all contribute equally (as in the case of the mean) or just one drives the result (as in max), or something in between. Through appropriate selection of this parameter, performance can be significantly enhanced, as demonstrated by our experiments.

While convolution, likewise the hit-or-miss transform, considers foreground, background, and DNC, they differ in how fitness is evaluated. For example, convolution uses a relative measure whereas the hit-or-miss uses an absolute measure. One question naturally arises: how does this difference impact performance on two aspects of a learned model, explainability and accuracy? Our analysis (Section IV) shows that morphology provides better interpretability through its use of an absolute measure, while convolution yields higher accuracy as a relative measure is more robust.

In summary, our article makes the following specific contributions to neural morphology.

• We identify limitations in the current definition of the grayscale hit-or-miss and we formulate an optimization problem to properly learn the transform in a neural network.
• In light of this optimization, we propose an algorithm to learn the hit-or-miss transform and also its generalization.
• We extend "conventional convolution" used in most neural networks with a parametric generalized mean.
• Synthetic and benchmark datasets are used to show the behavior and effectiveness of the proposed theories in a quantitative (via accuracy) and qualitative (via preliminary shallow, single layer, filter visualizations) respect.

The remainder of this article is organized as follows. In Section II, we provide notations and definitions of binary and grayscale morphological operations. Section III introduces the optimization problem, learning algorithm, our generalization of the hit-or-miss transform, and our extension of convolution, followed by experiments and results in Section IV.

II. BINARY AND GRAYSCALE MORPHOLOGY

First, we briefly review definitions related to binary morphology in order to understand the basis of morphological operations and their semantic meaning pertaining to image processing. The most basic of morphological operations are dilation and erosion, which, coupled with algebraic operations (e.g., sum), create more complex morphological operations like opening, closing, the hit-or-miss transform, top-hat, thinning, thickening, and skeleton, to name a few.

A. Binary Morphology

Binary morphology is grounded in the theory of sets. Let Z be the set of integers.

Definition 1. (Dilation) Let A be an image, B an SE, and A, B ∈ Z². The dilation of A by B, denoted by A ⊕ B, is

  A ⊕ B = {z | (B̂)_z ∩ A ≠ ∅},

where B̂ is the reflection of B about its origin and (B̂)_z is the translation of B̂ by z [30], [31].

As the above definition shows, the dilation operation involves reflecting B and then shifting the reflected B by z. The dilation of A by B is the set of all displacements z such that B̂ and A overlap by at least one element. The set B is often referred to as the structuring element (SE).

Definition 2. (Erosion) Let A be an image, B an SE, and A, B ∈ Z². Then the erosion of A by B, denoted A ⊖ B, is

  A ⊖ B = {z | (B)_z ⊆ A},

where (B)_z is the translation of B by z [30], [31].

The above equation indicates that A ⊖ B is the set of all points z such that B, translated by z, is contained in A.

It is a well-known fact that dilation and erosion are duals of each other with respect to complement and reflection:

  (A ⊖ B)^c = A^c ⊕ B̂,

where A^c is the complement of A. Similarly,

  (A ⊕ B)^c = A^c ⊖ B̂.

The morphological hit-or-miss transform is a technique for shape detection that simultaneously matches both foreground and background shapes in an image.

Definition 3. (Binary Hit-or-Miss) The binary hit-or-miss transform w.r.t. SEs H and M satisfying H ∩ M = ∅ is

  A ⊛ (H, M) = (A ⊖ H) ∩ (A^c ⊖ M),

where H is the set associated with the foreground of an object and M is the set of elements associated with the background.

The elements in a binary SE are indexed w.r.t. the origin (or a reference point), which can be designated to any point within the SE. In order to compute the transform, both H and M are slid over the binary image for every possible location. In this way, A ⊛ (H, M) finds all the points (origins of the translated structuring elements) at which, simultaneously, H found a match ("hit") in A and M found a match in A^c. By using the dual relationship between erosion and dilation, the hit-or-miss transform equation can alternatively be written as

  A ⊛ (H, M) = (A ⊖ H) \ (A ⊕ M̂),    (1)

where \ is the set difference operation (A \ B = A ∩ B^c).

Though obvious from Def. 3, we emphasize that the intersection of the sets that define the foreground (aka hit) and background (aka miss) must be null or empty, i.e., H(x, y) and M(x, y) at a given location (x, y) cannot both be 1. This is because an element in the target structure can either be treated as foreground, background, or DNC (an element not part of the target structure, defined by 0s in both the hit and miss SEs), but it cannot simultaneously be foreground and background. We illustrate all these cases (e.g., non-intersecting and intersecting SEs) with examples in Fig. 1. Table I shows the combinations of hit-miss values for binary morphology.

TABLE I: Binary combinations for the hit-or-miss transform in binary morphology

  H   M   Semantic meaning
  0   0   DNC
  0   1   Background
  1   0   Foreground
  1   1   Inadmissible - semantically infeasible
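To make Definitions 1-3 concrete, the following is a minimal NumPy sketch of our own (not code from the paper): erosion checks that every foreground element of the SE lands on a 1, and the hit-or-miss intersects an erosion of the image with an erosion of its complement. The function names and the centered-origin convention are our assumptions.

```python
import numpy as np

def binary_erode(A, B):
    """Definition 2 with B's origin at its center: (B)_z must fit inside A."""
    H, W = A.shape
    h, w = B.shape
    out = np.zeros_like(A)
    for y in range(h // 2, H - h // 2):
        for x in range(w // 2, W - w // 2):
            patch = A[y - h // 2:y + h // 2 + 1, x - w // 2:x + w // 2 + 1]
            out[y, x] = np.all(patch[B == 1] == 1)  # every 1 of B lands on a 1 of A
    return out

def binary_hit_or_miss(A, Hse, Mse):
    """Definition 3: A hit-or-miss (H, M) = (A erode H) AND (A^c erode M)."""
    assert not np.any(Hse & Mse), "hit and miss SEs must be non-intersecting"
    return binary_erode(A, Hse) & binary_erode(1 - A, Mse)
```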

B. Grayscale Morphology

Let f be a grayscale image, b a structuring element, and f(x, y) the grayscale intensity at a location (x, y).

Definition 4. (Grayscale Dilation) The grayscale dilation of f by b, denoted as f ⊕ b, is [31]

  (f ⊕ b)(s, t) = max{f(s − x, t − y) + b(x, y) | (s − x), (t − y) ∈ D_f; (x, y) ∈ D_b},    (2)

where D_f and D_b are the domains of f and b, respectively.

Definition 5. (Grayscale Erosion) The grayscale erosion of f by b, denoted as f ⊖ b, is defined as

  (f ⊖ b)(s, t) = min{f(s + x, t + y) − b(x, y) | (s + x), (t + y) ∈ D_f; (x, y) ∈ D_b},    (3)

where D_f and D_b are the respective domains [31].

As noted in [28], [30], the umbra transform provides the theoretical basis for the grayscale extension of morphological operations by providing a mechanism to express grayscale operations in terms of binary operations. Interested readers can refer to [28], [30] for the theory and proof of the extension.

A major difference between binary and grayscale morphology is that, unlike binary morphological operations, there is no explicit DNC condition in grayscale morphology, i.e., all elements, including those with 0s, contribute to the result. So, a mechanism needs to be put in place to distinguish between target pixels and DNC.
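As an illustration, here is a small NumPy sketch of our own of Definitions 4-5, evaluated only on the valid region where the SE fits entirely inside the image; the helper names are ours.

```python
import numpy as np

def grey_erosion(f, b):
    # (f erode b)(s, t) = min over D_b of f(s + x, t + y) - b(x, y)
    H, W = f.shape
    h, w = b.shape
    out = np.full((H - h + 1, W - w + 1), np.inf)
    for x in range(h):
        for y in range(w):
            out = np.minimum(out, f[x:x + H - h + 1, y:y + W - w + 1] - b[x, y])
    return out

def grey_dilation(f, b):
    # duality: f dilate b = -((-f) erode reflected(b)), up to the origin
    # convention on the valid region
    return -grey_erosion(-f, b[::-1, ::-1])
```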

[Fig. 1 graphic: a binary image A, a template T, the derived hit and miss SEs, and the intermediate erosion/dilation and hit-or-miss outputs for the non-intersecting and intersecting cases.]

Fig. 1: Example of binary hit-or-miss transform structuring elements to detect a top-right corner. (a) shows a binary image, A, with a top-right corner in the top-right 3 × 3 window and a 3 × 3 template, T, that encodes the structure of the top-right corner and is used to construct SEs for the hit-or-miss transform; f stands for foreground, b for background, and empty cells are DNC. (b) The top row shows the hit-or-miss transform for non-intersecting SEs derived from T, which correctly finds a match for both foreground in the hit (1 in the erosion means the foreground is matched) and background in the miss (0 in the dilation means the background is matched). The bottom row shows intersecting SEs, which produce an empty set as they cannot find a match for both foreground and background. Note that the transform is calculated without padding of the input image, so the output size is 2 by 2, and that we considered the centers of the SEs as the origins, which are marked with circles.

Ideally, the DNC elements can be specified by −∞, which would result in the maximum value for the erosion and the minimum value for the dilation and thus will never contribute to the result. While suitable for hand-crafted SE design, it might not be feasible to learn −∞-valued elements in the context of data-driven learning unless some constraints are imposed. Instead, the SEs can be designed such that the DNC elements are set very low compared to neighborhood elements so that the difference is always relatively high and as such never carries over to the result. Thus, the filters can be designed smartly so that DNC is automatically enforced via appropriate selection of values. Alternatively, the erosion equation can be rewritten to consider only the foreground elements, as in binary morphology.

Next, we find the condition for an element in an erosion SE to act as DNC. Let I be an image in the interval [lb_I, ub_I]. Furthermore, let h be the erosion SE with foreground elements in the interval [lb_hf, ub_hf] and DNC elements in the interval [lb_hd, ub_hd]. Note that lb_I, lb_hf, and lb_hd denote the lower bounds of the image I, the foreground elements in h, and the DNC elements in h, respectively. Similarly, ub_I, ub_hf, and ub_hd correspond to the respective upper bounds. The maximum difference possible for foreground is v_max = ub_I − lb_hf. We want the difference produced by DNC to be higher than v_max for the lowest image value, lb_I, so that DNC elements are always ignored during the computation of the min. This leads to the condition lb_I − d ≥ v_max, or d ≤ lb_I − ub_I + lb_hf, where d is a DNC element. Consequently, ub_hd = lb_I − ub_I + lb_hf and lb_hd = −∞. Since lb_I − ub_I < 0 for a grayscale image, ub_hd < lb_hf. This reveals that there is a discontinuity between the valid ranges for foreground and DNC elements, i.e., a separation of ub_I − lb_I must exist between them. This poses a challenge to data-driven learning tasks, since the weights learned are real-valued in a continuous domain and discontinuity cannot be enforced. A similar analysis can be performed for the dilation SE, which yields the condition ub_md < lb_mb, where ub_md and lb_mb denote the upper bound of the DNC elements and the lower bound of the background elements, respectively, in the dilation SE m.

The hit-or-miss transform for grayscale is defined in the literature via Eq. (1), by replacing the set difference operation with an arithmetic subtraction operation [28], [30], [31].

Definition 6. (Grayscale Hit-or-Miss Transform) The grayscale hit-or-miss transform is

  f ⊛ (h, m) = (f ⊖ h) − (f ⊕ m^r),

where m^r is the reflection of m, i.e., m^r(x, y) = m(−x, −y), which gives

  (f ⊛ (h, m))(x, y) = min{f(x + a, y + b) − h(a, b) | (x + a), (y + b) ∈ D_f; a, b ∈ D_h}
                      − max{f(x + a, y + b) + m(a, b) | (x + a), (y + b) ∈ D_f; a, b ∈ D_m},

where D_f, D_h, and D_m are the domains of f, h, and m, respectively [31].

Let h and m be SEs with non-negative weights. The hit and miss SEs together define the target pattern, with the hit indicating the foreground and the miss indicating the background. For example, if h(x, y) > m(x, y), then that pixel is treated more as foreground than background, and vice versa. Similar to the binary case, the filters must be non-intersecting, i.e., satisfy the following constraints,

  h(x, y) ≤ m^c(x, y) or m(x, y) ≤ h^c(x, y),

where h^c and m^c are the complements of h and m, respectively. This condition prevents the hit and miss SEs from contradicting each other. According to this condition, if h(x, y) = 0.9, then m(x, y) must be less than 1 − 0.9 or 0.1 for a unit interval image.

C. Properties of morphological operations

Both grayscale erosion and dilation as well as the hit-or-miss transform are translation invariant, i.e.,

  (f + c) ⊖ b = f ⊖ b + c and
  (f + c) ⊕ b = f ⊕ b + c,

where c is an arbitrary value. Note that these operations, as seen from Eqs. (2) and (3), are not scale invariant. In contrast, convolution is scale invariant but not translation invariant.

III. METHODS

A. Morphological Shared Weight Neural Network

Inspired by the success of shared weight CNNs on handwritten digit recognition (MNIST dataset) by LeCun in 1990 [32], Gader et al. [28] introduced a morphology based image algebra network, substituting morphological operations for convolution. Particularly, they used the hit-or-miss transform because of its ability to take into account both the background and foreground of an object. This transform was extended with a power mean to soften the extremely sensitive max and min operations, where all the parameters, including the exponents of the power mean, were learned. In later works, Won et al. [6] and Khabou et al. [29] used the standard hit-or-miss transform. None of these works considered the following aspects of the hit-or-miss transform: the non-intersecting condition and DNC. As illustrated with examples of binary and grayscale morphology in Figs. 1 and 2, DNC plays an important role in the design of an SE, helping disregard irrelevant parts of an image not necessary for finding a target pattern while keeping focus only on the relevant parts. Without a mechanism in place to provide for DNC, each element will be treated as a part of the target pattern and contribute to the output even if it is not. This can hurt the performance when there is a lot of variation in context and object shape and size; however, such SEs still might perform well for a rigid pattern with fixed size and shape and little change in background and foreground.

Figure 2 illustrates the role of the non-intersecting condition and DNC in the grayscale hit-or-miss transform. Considering these conditions, we propose the following hit-or-miss transform

  (f ⊛ (h, m))(x, y) = min_{a,b ∈ D_hf}(f(x + a, y + b) − h(a, b))
                      − max_{a,b ∈ D_mb}(f(x + a, y + b) + m(a, b)),    (4)

subject to

  h(a, b) ≤ m^c(a, b), or m(a, b) ≤ h^c(a, b),

where x + a, y + b ∈ D_f, h^c(a, b) and m^c(a, b) are the complements of h and m, and h_f ⊆ h and m_b ⊆ m are the foreground and background elements in h and m, respectively. We remark that SEs learned without the non-intersecting condition may turn out to preserve this property; however, it cannot be guaranteed, so we make this condition explicit in our proposed definition.

B. Hit-or-Miss Transform Neuron

A major challenge in enforcing the non-intersecting condition via complement according to Eq. (4) is computing the ranges for the image and SEs. This is because the ranges can be at different scales and they can vary across layers and from one iteration to the next due to the updating of elements during optimization. To circumvent this issue, we take a more restrictive approach (analogous to binary morphology) where an element in a hit-or-miss transform is exclusively foreground, background, or DNC. We propose two algorithms, one with a single SE incorporating only the non-intersecting condition and another with two SEs incorporating both the non-intersecting condition and DNC.

Single SE hit-or-miss transform: Let f be an image and w an SE. The SE elements are partitioned into w_h and w_m such that their pairwise intersection is empty, where w_h = {w : w ≤ 0} and w_m = {w : w ≥ 0}. The hit-or-miss neuron is defined as

  (f ⊛ w)(x, y) = min_{a,b ∈ D_wh}(f(x + a, y + b) + w_h(a, b))
                 − max_{c,d ∈ D_wm}(f(x + c, y + d) + w_m(c, d)),    (5)

where w_h and w_m conceptually correspond to foreground and background. This formulation has advantages: (i) implicit complementary conditions, (ii) fewer parameters, and (iii) fewer algebraic operations, thus less complexity and more computational efficiency. A caveat of this method is that 0 acts as a transition point between foreground and background, so DNC cannot be enforced around this transition point, which otherwise would hinder the switching of foreground elements to background and vice versa.
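A minimal NumPy sketch (ours) of the single SE neuron in Eq. (5): one weight matrix is partitioned by sign, so the non-intersecting condition holds by construction; the valid-region convention is our assumption.

```python
import numpy as np

def single_se_hit_or_miss(f, w):
    H, W = f.shape
    k, l = w.shape
    hit = np.full((H - k + 1, W - l + 1), np.inf)
    miss = np.full((H - k + 1, W - l + 1), -np.inf)
    for a in range(k):
        for b in range(l):
            win = f[a:a + H - k + 1, b:b + W - l + 1]
            if w[a, b] <= 0:                        # w_h: hit (foreground) element
                hit = np.minimum(hit, win + w[a, b])
            if w[a, b] >= 0:                        # w_m: miss (background) element
                miss = np.maximum(miss, win + w[a, b])
    return hit - miss
```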

[Fig. 2 graphic: a grayscale image f, hit SE h, and miss SE m (with and without −∞ DNC entries), and the resulting erosion, dilation, and hit-or-miss outputs under three settings: "No DNC vs. DNC - Case I", "No DNC vs. DNC - Case II", and "Non-intersecting vs. Intersecting SEs". Comment row: SEs with and without DNC produce the same results when the SE, including its 0s, fits perfectly in the image; the SE with DNC considers only foreground and background elements and yields better results; intersecting SEs are unable to find a match for both foreground and background structures.]

Fig. 2: Grayscale hit-or-miss transform illustrating the importance of DNC and the non-intersecting condition with an example of top-right corner detection. In the SEs, −∞ is used for DNC and the origins are marked with a circle. The first two columns show the case when the SEs, including their 0s, exactly fit in the image, f. SEs with and without DNC produce the same results, as expected. The third and fourth columns are for the case where the SEs for hit and miss fit below and above, respectively, in the target area (top-right 3 × 3 window) of the input image. Without DNC, the 0s (vs. 0.7) in h determine the output, which remains the same even though the input image is changed. On the other hand, with DNC, the output latches onto the 0.7s, not the 0s in h, and varies with the change in input. The fifth and sixth columns compare the effect of non-intersecting and intersecting SEs. In the sixth column, erosion of f by the intersecting SE h produces −0.7 for all cells, meaning no matching foreground pattern is found in any window of the input image.

Dual SEs hit-or-miss transform: Algorithm 1 outlines the proposed algorithm. The algorithm takes an input image, the size of the SEs, and the threshold for DNC. To enforce DNC and/or the non-intersecting condition, we take the aid of two auxiliary variables, a_h and a_m, initialized with zeros. The elements not part of h and m are assigned −∞ in a_h and a_m, respectively. Then the hit is calculated as min(f − h − a_h) and the miss as max(f + m + a_m).

Because of the separate SEs for foreground and background, an element can switch back and forth from one to the other without transitioning through the DNC region. However, once an element falls below the threshold and enters into the DNC non-optimization space, it cannot revert, owing to the gradient being zero in this space.

Algorithm 1: The hit-or-miss transform using two SEs
 1  Input: Image f and threshold for DNC, th.
 2  Initialize two matrices, h and m (hit and miss SEs), pseudo-randomly w.r.t. a half-normal distribution.
 3  Find the mask for DNC as the indices I_D of D = {x | x ≥ max(h, m) and x ≤ th}.
 4  Find the mask of non-foreground elements in h as the indices I_Ah of A_h = {x | x ≤ max(h, m) and x ∈ h}.
 5  Initialize an auxiliary matrix a_h of the same size as h with 0's.
 6  Set a_h[I_D + I_Ah] = −∞.
 7  Calculate hit = min(f − h − a_h).
 8  Find the mask of non-background elements in m as the indices I_Am of A_m = {x | x ≤ max(h, m) and x ∈ m}.
 9  Initialize an auxiliary matrix a_m of the same size as m with 0's.
10  Set a_m[I_D + I_Am] = −∞.
11  Calculate miss = max(f + m + a_m).
12  Calculate the hit-or-miss transform as f ⊛ (h, m) = hit − miss.
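Below is a NumPy sketch of one forward pass of Algorithm 1 under our reading of the masks: D flags positions where both SEs fall below th (DNC), and the auxiliary matrices mask elements that lose the elementwise max(h, m) comparison. Variable names and the valid-region handling are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
k, th = 3, 0.05
h = np.abs(rng.normal(0.0, 1.0, (k, k)))       # step 2: half-normal init
m = np.abs(rng.normal(0.0, 1.0, (k, k)))

mx = np.maximum(h, m)
dnc = mx <= th                                  # step 3: DNC mask
a_h = np.where(dnc | (h < mx), -np.inf, 0.0)    # steps 4-6: mask non-foreground
a_m = np.where(dnc | (m < mx), -np.inf, 0.0)    # steps 8-10: mask non-background

def hit_or_miss_forward(f):
    H, W = f.shape
    hit = np.full((H - k + 1, W - k + 1), np.inf)
    miss = np.full((H - k + 1, W - k + 1), -np.inf)
    for a in range(k):
        for b in range(k):
            win = f[a:a + H - k + 1, b:b + W - k + 1]
            hit = np.minimum(hit, win - h[a, b] - a_h[a, b])    # step 7
            miss = np.maximum(miss, win + m[a, b] + a_m[a, b])  # step 11
    return hit - miss                                           # step 12
```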

Soft hit-or-miss (SHM): Eq. (4) for the hit-or-miss transform involves max and min operations, which are highly restrictive and overly sensitive to fluctuations and noise in the input. For instance, a sudden fluctuation in just one pixel can change the output from a target shape being present to absent. To date, numerous variants and extensions have been put forth to ameliorate this issue. For example, Gader et al. [28] generalized the hit-or-miss transform, substituting the weighted power mean ((1/n) Σ_{i=1}^{n} w_i x_i^p)^{1/p} for max and min, which has limitations such as being undefined for 0s and yielding complex output for negative inputs in the case of fractional p. Fuzzy and rank hit-or-miss transforms introduced in the literature are not suitable for gradient based optimization [17].

Motivated by [28], we extend the hit-or-miss transform in Eq. (4), referred to herein as the soft hit-or-miss (SHM), using a parametric soft-max and soft-min in place of max and min, respectively:

  (f ⊛s (h, m))(x, y) = softmin_{a,b ∈ D_hf}(f(x + a, y + b) − h(a, b))
                       − softmax_{a,b ∈ D_mb}(f(x + a, y + b) + m(a, b)),    (6)

subject to

  h(a, b) ≤ m^c(a, b), or m(a, b) ≤ h^c(a, b).

While there exist several formulae to define soft-min and soft-max, herein we opt for a generalized mean based on the smooth-max function parameterized by α that is favorable to gradient based optimization,

  s_α(x) = Σ x e^{αx} / Σ e^{αx},    (7)

where α ∈ R. This has an advantage over other generalized mean equations such as the power and Lehmer means in that it produces real-valued output for negative-valued inputs when a fractional exponent is used, whereas the power and Lehmer means produce complex-valued results. Based on Eq. (7), the softmax and softmin operators are defined as

  softmax = s_smax,α(x) = {s_α(x) | α ≥ 0},
  softmin = s_smin,α(x) = {s_{−α}(x) | α ≥ 0}.
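The smooth-max of Eq. (7) is straightforward to implement; the following minimal sketch (ours) includes the usual max-shift for numerical stability, which cancels in the ratio and so leaves s_α unchanged.

```python
import numpy as np

def s_alpha(x, alpha):
    # s_alpha(x) = sum(x * exp(alpha * x)) / sum(exp(alpha * x))
    e = np.exp(alpha * (x - x.max()))   # shifting by max(x) cancels in the ratio
    return float((x * e).sum() / e.sum())

def softmax_agg(x, alpha):              # alpha >= 0: spans mean -> max
    return s_alpha(x, abs(alpha))

def softmin_agg(x, alpha):              # alpha >= 0: spans mean -> min
    return s_alpha(x, -abs(alpha))

x = np.array([0.1, 0.7, 0.3])
print(softmax_agg(x, 0.0), softmax_agg(x, 50.0))  # ~mean(x), ~max(x)
```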

lency between the binary hit-or-miss transform and the thresh- −ssmax,α2 ((−1)f(x − c, y − d)wm(c, d))) , olded correlation (the linear correlation operation followed by (9) thresholding) between a binary image and the binary hit-or- or miss transform. [33]. However, no direct equivalency can be g1 established between real-valued convolution (linear operation) (f ∗ w)(x, y) = n (ssmin,α1 (f(x − a, y − b)wh(a, b)) and grayscale hit-or-miss transform (non-linear operation). +ssmax,α (f(x − c, y − d)wm(c, d))) (10) Herein, we aim to draw an analogy between these two op- 2

erations w.r.t. the detection mechanism. While both consider where ssmin,α1 and ssmax,α2 are the softmax and softmin foreground and background of the target structure, they differ aggregation operations spanning between mean and max and in that an SE in the hit-or-miss encodes the absolute level of between min and mean, respectively. Note that we apply a target shape whereas a filter in convolution encodes relative multiplication factor n in Eqs. (9) and (10) so that it becomes importance/level. As a result, the hit-or-miss transform can convolution when α = 0. Eq. (10) has the computational provide an absolute measure of how the target shape fits advantage over Eq. (9) as it requires computation only of the in an image whereas convolution provides relative measure soft-min whereas Eq. 9 involves both soft-max and soft-min. of correlation or the degree of matching. The hit-or-miss We propose an alternative definition of the GC that instead transform tells us whether an image fits a target pattern and of decomposing the convolution operation analogous to the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 9 hit-or-miss transform, applies soft-max and soft-min directly and convolution, which also apply to standard operations. to the standard convolution and then takes their sum, TABLE III: Mini-VGG (4 layer NN)

Optimization: Recent advancements and key insights into the optimization process of a neural network, such as initialization, the skip connection in a residual network, and batch normalization, have contributed to achieving high performance. Kaiming He et al. [34] showed that initializing weights such that the variance of the output of a layer remains the same as the input helps to keep the distribution of gradients unvaried across all layers. This addresses vanishing gradients, enabling the training of deep neural networks. However, their analysis was limited to convolution with the ReLU activation function.

Convolution involves sum and product, both of which are linear operations and have closed form equations for variances (e.g., sum of variances for sum and product of variances for product). In contrast, there is no similar closed-form equation for max/min and the generalized mean. Therefore, we model the variance in the form of σ²_{s,α} = a n^b σ²_x for different values of α, where n is the number of elements in an SE, and a and b are learned. We used a synthetic dataset where x is generated pseudo-randomly from a Gaussian distribution with unit variance and n = [3 6 9 ... 24]². Table II lists the ratio of the input and output variances of Eq. (7) for different α.

TABLE II: Variance of the smooth-max function, s_α, vs. α

  α                          0     ±0.5          ±1            ±2            ±∞ (max/min)
  σ′²_{sα} = σ²_{sα}/σ²_x    1/n   1.32/n^0.95   1.44/n^0.74   0.82/n^0.32   0.60/n^0.24

Another challenge with finding exact criteria for initialization is the interdependency of terms. As we know, the variance of z = x ± y is σ²_z = σ²_x + σ²_y ± 2σ_xy, where σ_xy is the covariance between x and y. When x and y are independent, their covariance will be zero, and the variance of z can be obtained directly by summing up the variances of the individual components. However, this is not the case for the hit-or-miss transform (e.g., Eq. (4)) and extensions (e.g., Eq. (10)), where f exists in both the hit and miss terms. Since α changes the distribution, which in turn changes the covariance, the analysis is very complicated. Herein, we simplify the variance analysis by ignoring the covariance term and using the modeled equations for the generalized mean. The Appendix provides initialization criteria for extensions of the hit-or-miss transform and convolution, which also apply to the standard operations.
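The variance model σ²_{s,α} = a n^b σ²_x behind Table II can be reproduced with a short Monte-Carlo fit; the following sketch is our reconstruction of that procedure (sample counts and names are assumptions), fitting a and b by least squares in log space.

```python
import numpy as np

def fit_variance_model(alpha, trials=5000, rng=np.random.default_rng(0)):
    ns = np.arange(3, 25, 3) ** 2             # n = [3 6 9 ... 24]^2
    variances = []
    for n in ns:
        x = rng.normal(0.0, 1.0, (trials, n))  # unit-variance Gaussian inputs
        e = np.exp(alpha * x)
        s = (x * e).sum(axis=1) / e.sum(axis=1)
        variances.append(s.var())
    b, log_a = np.polyfit(np.log(ns), np.log(variances), 1)
    return np.exp(log_a), b                    # var(s_alpha(x)) ~ a * n**b * var(x)

print(fit_variance_model(1.0))  # compare with the ±1 column of Table II
```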
TABLE III: Mini-VGG (4 layer NN)

  Layer                                  Filter size
  Input layer                            28 × 28 × 1 (MNIST/Fashion-MNIST); 32 × 32 × 3 (Cifar-10)
  HMC(1) layer + BN(2) + ReLU(3)         3 × 3 × 32, padding = 1
  HMC layer + BN + ReLU                  3 × 3 × 32, padding = 1
  MaxPool                                2 × 2
  Dropout 25%
  HMC layer + BN + ReLU                  3 × 3 × 64, padding = 1
  HMC layer + BN + ReLU                  3 × 3 × 64, padding = 1
  MaxPool                                2 × 2
  Dropout 25%
  Fully Connected Layer + BN + ReLU      512
  Dropout 50%
  Softmax layer                          10

  (1) HMC denotes the basic operation specific to a particular network, e.g., convolution in a CNN and hit-or-miss in a morphological NN; (2) Batch-Normalization; (3) Rectified Linear Unit.

IV. EXPERIMENTS

In order to compare our proposed algorithms with their standard counterparts, for the sake of an apples-to-apples comparison in a controlled fashion where we can responsibly account for all the moving parts, we consider both synthetic and real datasets. The synthetic dataset consists of a simple classification task with two fixed-shape objects such that all the methods can correctly classify the objects using a single layer, thus allowing us to visualize and interpret the learned SEs to shed light onto the inner workings of these algorithms.

We also evaluate the performance of our proposed algorithms in terms of classification accuracy on two benchmark datasets with varying context, shape, and size—from approximately fixed-sized, rigid shaped objects with constant background in Fashion-MNIST to complex background and variably sized and shaped objects in Cifar-10. These datasets and CNNs were carefully selected to enable the fairest comparison possible. The goal is to understand the benefits and drawbacks of morphology relative to convolution. In future work, we will focus more on obtaining state-of-the-art global neural morphological architectures such as GoogLeNet, ResNet, NASNets, and similar. The focus here is the fundamental value of morphology and how to make it scale to deep contexts.

Since our focus is to compare different feature learning operations rather than other aspects of deep learning such as architecture or optimization algorithms, we select a small VGG-like [35] architecture with 4 layers, referred to as mini-VGG (see Table III for its architecture and the sketch below). This small NN also allows us to have the same setup (e.g., hyper-parameters and optimization algorithm) for all experiments, including convolution and the standard hit-or-miss. First, we provide an analysis of different initialization strategies, followed by experiments on the hit-or-miss transform and convolution and their extensions.
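For reference, here is a hypothetical PyTorch rendering of Table III with ordinary convolutions; a morphological variant would swap nn.Conv2d for a hit-or-miss layer. Channel widths, dropout rates, and padding follow the table, while everything else (names, the spatial argument) is our assumption.

```python
import torch.nn as nn

def mini_vgg(in_ch=1, n_classes=10, spatial=7):  # spatial=7 for 28x28 inputs, 8 for 32x32
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
        nn.MaxPool2d(2), nn.Dropout(0.25),
        nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.MaxPool2d(2), nn.Dropout(0.25),
        nn.Flatten(),
        nn.Linear(64 * spatial * spatial, 512), nn.BatchNorm1d(512), nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(512, n_classes),  # softmax applied implicitly by the loss
    )
```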
The filters for the solid circle U(-0.01,0.01) 89.33 32.54 Uniform class are just the opposite of those for the annular ring. In U(-1,1) 90.26 44.83 effect, convolution decides based on whether a ring is present N(0,0.1) 89.13 57.12 or absent in an input image, acting as a relative measure rather Normal N(0,1) 91.56 64.15 than finding a similarity measure with corresponding object HN(0,1) 92.4 67.76 shape. Contrast these filters against those SEs for the standard Half-normal According to Appendix. B 92.48 69.57 hit-or-miss transform. The learned shapes are now consistent with the class objects, e.g., circle and inverted circle for the hit and miss for solid circle object; and ring and inverted TABLE V: Results for hit-or-miss transforms and convolution ring for the annular ring object. Enforcing the non-intersecting condition helps to learn the solid circle better and the annular Methods Constraints Fashion-MNIST Cifar-10 ring worse. Adding DNC makes the filters sparse. The SF hit-or-miss transform yields very sparse SEs, e.g., SEs for the Hit (Erosion) 90.97 56.33 solid circle includes some dots close to the center in the hit and Miss (-Dilation) 88.33 53.10 on the outer-side in the miss, and are sufficient to detect a solid Dual SEs hit-or-miss None 93.31 72.49 circle. Note that due to the discriminatory nature of learning, Non-intersecting 93.25 72.72 exact matching is not required to obtain a peak classification DNC (th=0.0) 93.25 72.91 accuracy. Therefore, the speckles within the SEs/filter may be relevant and can be robust against noise and imperfection in Non-intersecting 93.25 72.72 + DNC (th=0.0) the input. A caveat of sparse SEs is that our model can easily Single SE hit-or-miss 93.09 72.90 be fooled with inputs artificially crafted or sampled from a distribution different from what the model is trained on. Convolution 94.60 87.59

B. Benchmark datasets

We first provide a brief description of the datasets used in this experiment. Fashion-MNIST: This dataset consists of fashion article images of 10 classes: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. This dataset is similar to MNIST in terms of the number of examples (70,000), image size (28 × 28), training-test partition (60,000/10,000), and number of classes (10). Cifar-10: This dataset consists of 60,000 32 × 32 colour images in 10 classes, with 6,000 instances per class. The dataset is partitioned into training and test sets with 50,000 and 10,000 examples, respectively.

1) Impact of initialization: We consider three distributions, uniform, normal, and half-normal, with different parameters. We optimize the mini-VGG with the standard hit-or-miss transform for 70 epochs using Adam optimization [36] with a learning rate of 0.001 and a batch size of 64. The best test classification accuracy for each experiment is reported in Table IV.

As seen in Table IV, initialization can make a big difference. For example, using a normal instead of a uniform distribution increases the accuracy by 20% on the Cifar-10 dataset. A half-normal distribution boosts the performance further by 3.61%. This improvement can be explained by the fact that the normal distribution has a high density around the mid-point. In contrast, the half-normal has a high density at the lower end, thus facilitating sparse optimization, as fewer elements will contribute to the error. Adopting the initialization condition put forth in Appendix B gives the best result.

TABLE IV: Results for different initialization strategies for the standard hit-or-miss transform

  Distribution   Parameter                 Fashion-MNIST   Cifar-10
  Uniform        U(-0.01, 0.01)            89.33           32.54
                 U(-1, 1)                  90.26           44.83
  Normal         N(0, 0.1)                 89.13           57.12
                 N(0, 1)                   91.56           64.15
  Half-normal    HN(0, 1)                  92.4            67.76
                 According to Appendix B   92.48           69.57

2) Hit-or-miss transform and convolution: The experiment setup is the same except that 150 epochs are used versus 70 in the prior experiments. We used Kaiming initialization [34] for convolution. For DNC, we used the threshold th = 0.0.

As seen in Table V, the foreground aka hit is a better predictor (56% accuracy on Cifar-10) than the background aka miss (53.1%). The standard hit-or-miss transform improves the performance further by 16.16%, demonstrating the importance of both foreground and background in object detection. However, accuracy remains more-or-less the same for the proposed method after incorporating the non-intersecting condition and DNC. Several factors affect the performance: (i) adding the non-intersecting condition makes the optimization problem more constrained, which weakens its approximation power to learn an arbitrary function, and (ii) the DNC space is discontinuous, where no updating occurs during optimization, limiting its ability to learn proper SEs.

TABLE V: Results for hit-or-miss transforms and convolution

  Methods                 Constraints                        Fashion-MNIST   Cifar-10
  Hit (Erosion)                                              90.97           56.33
  Miss (-Dilation)                                           88.33           53.10
  Dual SEs hit-or-miss    None                               93.31           72.49
                          Non-intersecting                   93.25           72.72
                          DNC (th=0.0)                       93.25           72.91
                          Non-intersecting + DNC (th=0.0)    93.25           72.72
  Single SE hit-or-miss                                      93.09           72.90
  Convolution                                                94.60           87.59

While the hit-or-miss transform enables learning interpretable SEs, convolution outperforms all variants of the hit-or-miss transform. This performance gain by convolution is due in part to its superior ability to approximate an arbitrary function, as stated by the universal approximation theorem. So, one can trade off between interpretability and accuracy and select an operation appropriate for a task.

[Fig. 3 graphic: learned hit and miss SEs/filters for the two synthetic objects, with panels (a) class object, (b) convolution, (c) convolution+ReLU, (d) standard hit-or-miss, (e) hit-or-miss with non-intersecting, (f) hit-or-miss with DNC, (g) hit-or-miss with non-intersecting and DNC, and (h) single SE hit-or-miss.]

Fig. 3: Visualization of the learned SEs and filters for synthetic objects. Convolution+ReLU learns the shape of one object, the annular ring, and uses it and its inverse to discriminate between the two objects. On the other hand, hit-or-miss learns the shapes of both objects. Speckles within the filters may be relevant, as exact matching is not required to obtain peak classification accuracy due to the discriminatory nature of learning, and can be robust against noise and imperfection in the input images.

3) SHM and GC: Table VI reports the results for extensions of the hit-or-miss transform and convolution. Relaxing max/min in the hit-or-miss transform with the softer average-weighting operator enhances SHM's performance, though it still lags behind standard convolution. GC1 results are at the same level as convolution. GC2 leads the scoreboard with an accuracy of 94.66 for Fashion-MNIST and 88.29 for Cifar-10. This indicates that the extensions in general boost the results, which reach a maximum somewhere between α = 0 and ±∞.

All these experiments share a common story: convolution and the hit-or-miss transform come very close in terms of accuracy for simpler classification tasks (Fashion-MNIST), but the gap becomes wider for challenging tasks with complex objects (Cifar-10). There are many factors behind this performance gap; however, the primary reasons can be attributed to (i) the hit-or-miss transform's underlying theory of measuring absolute fitness, which enables learning explainable filters but works as a hindrance to achieving top performance, and (ii) the difficulty of optimization with DNC.

TABLE VI: Results for extensions of the hit-or-miss transform and convolution

  Method                                                             α     Fashion-MNIST   Cifar-10
  Dual SEs SHM                                                       1.0   93.74           77.57
  Dual SEs SHM + non-intersecting                                    1.0   93.75           77.39
  Dual SEs SHM + DNC                                                 1.0   93.64           77.32
  Dual SEs SHM + non-intersecting + DNC                              1.0   93.78           77.7
  Single SE SHM                                                      1.0   93.46           76.95
  GC1                                                                0.5   94.28           87.28
                                                                     1.0   94.32           87.76
  Convolution with sum replaced by softmin (1st term of Eq. (11))    0.5   94.36           87.59
  Convolution with sum replaced by softmax (2nd term of Eq. (11))    0.5   94.44           86.49
  GC2                                                                0.5   94.58           88.29
                                                                     1.0   94.66           87.71

V. CONCLUSION AND FUTURE WORK

In this article, we provide an in-depth analysis of the theory of grayscale morphology relative to deep neural networks, shedding some critical insights into its limitations and strengths. We also explore an application of a morphological operation, the hit-or-miss transform, that takes into account both foreground and background in measuring the fitness of a target pattern in an image. Unlike binary morphology, conventional grayscale morphological operations consider all pixels regardless of their relevance to a target shape. Furthermore, the hit and miss SEs should semantically be non-intersecting. We provide an optimization-friendly neural network-based hit-or-miss transform that takes these properties into account.

Specifically, we outline an optimization problem to appropriately learn semantically meaningful and interpretable SEs. Following this formulation, we provide two algorithms for the hit-or-miss transform, with one and two SEs. Since max and min in the hit-or-miss equation are too restrictive and overly sensitive to variation and fluctuation in inputs, we relaxed these operators with a parametric generalized mean, yielding a flexible and more powerful transform that leads to better classification accuracy. In the same spirit, we also extend convolution, and this extension outperforms standard convolution on benchmark datasets.

Our analysis and experimental results show that both the hit-or-miss transform and convolution consider both background and foreground; however, they differ in the respect that the former provides an absolute measure while the latter gives a relative measure. These differences impact their abilities in terms of interpretability and robustness. As better interpretability comes from an absolute measure, morphology leads convolution in this regard. On the other hand, relative measures are more robust, so convolution outperforms morphology in classification accuracy. Last, quantitative experiments were presented that demonstrate the numeric potential of these networks, and qualitative results were demonstrated related to a single hit-or-miss layer.

We limit the focus of the current article to applying morphological operations in deep learning. In the future, we will study how to explain a morphological neural network solely based on the SEs, leveraging the shapes learned by them. Furthermore, we will study how to better handle the discontinuity for DNC. Specifically, we will explore other optimization techniques such as genetic algorithms (not stochastic gradient descent-based) with better constraint handling mechanisms that will be able to update elements in a bidirectional manner across disjointed spaces. Next, the initialization criteria developed herein were based on curve-fitting and simplified analysis. A future research direction can be toward conducting rigorous mathematical analysis to find exact closed-form equations for the variances and covariances involving the generalized mean to enhance the performance further. Last, our qualitative visualization of shape is currently only applicable to a single morphological layer. In future work, we will extend this analysis to multi-layer propagation of morphology to extract explicit shape descriptors for purposes like explainable deep neural shape analysis.

APPENDIX

A. GC2

Consider a NN layer consisting of GC2,

  y = f ∗g2 w = n (s_smax,α1(fw) + s_smin,α2(fw)),

followed by the ReLU activation function,

  z = max(y, 0).

Let σ²_f and σ²_w be the variances of f and w, respectively. Ignoring the covariance between the two terms, the variance of the output will approximately be

  σ²_y ≈ n² (σ′²_{smax,α1} σ²_f σ²_w + σ′²_{smin,α2} σ²_f σ²_w).

If we use symmetric soft-max and soft-min functions, then α1 = α2 = α and σ′²_{smax,α1} = σ′²_{smin,α2} = σ′²_{sα}. This gives

  σ²_y ≈ 2n² σ²_f σ²_w σ′²_{sα}.

Since σ²_z = 0.5 σ²_y, as shown in [34] for a symmetric distribution of y,

  σ²_z ≈ n² σ²_f σ²_w σ′²_{sα}.

The output variance σ²_z will be the same as σ²_f if

  σ²_w ≈ 1 / (n² σ′²_{sα}),

which gives us the variance with which to initialize the filter weights. GC1 is also initialized with this same variance, which we found to give better results.

B. Soft hit-or-miss transform

Consider a NN layer consisting of the softer extension of the standard hit-or-miss transform,

  y = f ⊛s (h, m) = s_smin,α(f − h) − s_smax,α(f + m),

followed by the ReLU activation function,

  z = max(y, 0).

Let σ²_h = σ²_m. Then σ²_z ≈ σ′²_{s,α}(σ²_f + σ²_h). The condition for σ²_z to be equal to σ²_f is

  σ²_h = σ²_m ≈ (1/σ′²_{s,α} − 1) σ²_f.

If initialized with a half-normal distribution, then the variance will be

  σ²_h = σ²_m ≈ (1/σ²_hn)(1/σ′²_{s,α} − 1) σ²_f,

where σ²_hn is the ratio of the half-normal to normal variances, σ²_hn = (1 − 2/π).

We use this variance to initialize both the standard and proposed hit-or-miss transforms. For |α| < ∞, the variance obtained using this equation is very high, causing exploding gradients. To alleviate this, we scale the hit-or-miss transform equation with σ_{s,∞}/σ_{s,α} and initialize the SEs with the variance for α = ±∞.

REFERENCES

[1] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness," 2018.
[2] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[3] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," arXiv preprint arXiv:1312.6034, 2013.
[4] D. Mellouli, T. M. Hamdani, J. J. Sanchez-Medina, M. B. Ayed, and A. M. Alimi, "Morphological convolutional neural network architecture for digit recognition," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[5] S. Halkiotis, T. Botsis, and M. Rangoussi, "Automatic detection of clustered microcalcifications in digital mammograms using mathematical morphology and neural networks," Signal Processing, vol. 87, no. 7, pp. 1559–1568, 2007.
[6] Y. Won and P. D. Gader, "Morphological shared-weight neural network for pattern classification and automatic target detection," in Proceedings of ICNN'95 - International Conference on Neural Networks, vol. 4. IEEE, 1995, pp. 2134–2138.
[7] Y. Won, P. D. Gader, and P. C. Coffield, "Morphological shared-weight networks with applications to automatic target recognition," IEEE Transactions on Neural Networks, vol. 8, no. 5, pp. 1195–1203, 1997.
[8] H. Zheng, L. Pan, and L. Li, "A morphological neural network approach for vehicle detection from high resolution satellite imagery," in International Conference on Neural Information Processing. Springer, 2006, pp. 99–106.
[9] H. K. Sulehria, Y. Zhang, D. Irfan, and A. K. Sulehria, "Vehicle number plate recognition using mathematical morphology and neural networks," WSEAS Transactions on Computers, vol. 7, no. 6, pp. 781–790, 2008.
[10] X. Jin and C. H. Davis, "Vehicle detection from high-resolution satellite imagery using morphological shared-weight neural networks," Image and Vision Computing, vol. 25, no. 9, pp. 1422–1431, 2007.
[11] B. Raducanu, M. Grana, and P. Sussner, "Morphological neural networks for vision based self-localization," in Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164), vol. 2. IEEE, 2001, pp. 2059–2064.
[12] P. D. Gader, M. A. Khabou, and A. Koldobsky, "Morphological regularization neural networks," Pattern Recognition, vol. 33, no. 6, pp. 935–944, 2000.
[13] A. K. Hocaoglu and P. D. Gader, "Domain learning using Choquet integral-based morphological shared weight neural networks," Image and Vision Computing, vol. 21, no. 7, pp. 663–673, 2003.
[14] M. A. Khabou, P. D. Gader, and J. M. Keller, "LADAR target detection using morphological shared-weight neural networks," Machine Vision and Applications, vol. 11, no. 6, pp. 300–305, 2000.
[15] N. Theera-Umpon, M. A. Khabou, P. D. Gader, J. M. Keller, H. Shi, and H. Li, "Detection and classification of MSTAR objects via morphological shared-weight neural networks," in Algorithms for Synthetic Aperture Radar Imagery V, vol. 3370. International Society for Optics and Photonics, 1998, pp. 530–540.
[16] A. Ouadou, "Vehicle detection using morphological shared-weight neural network in the multiple instance learning framework," Ph.D. dissertation, University of Missouri–Columbia, 2017.
[17] B. Perret, S. Lefèvre, and C. Collet, "A robust hit-or-miss transform for template matching applied to very noisy astronomical images," Pattern Recognition, vol. 42, no. 11, pp. 2470–2480, 2009.
[18] V. Chatzis and I. Pitas, "A generalized fuzzy mathematical morphology and its application in robust 2-D and 3-D object representation," IEEE Transactions on Image Processing, vol. 9, no. 10, pp. 1798–1810, 2000.
[19] V.-T. Ta, A. Elmoataz, and O. Lézoray, "Nonlocal PDEs-based morphology on weighted graphs for image and data processing," IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1504–1516, 2010.
[20] N. Bouaynaya and D. Schonfeld, "Theoretical foundations of spatially-variant mathematical morphology part II: Gray-level images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 5, pp. 837–850, 2008.
[21] L. Ji and J. Piper, "Fast homotopy-preserving skeletons using mathematical morphology," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 653–664, 1992.
[22] F. Zana and J.-C. Klein, "Segmentation of vessel-like patterns using mathematical morphology and curvature evaluation," IEEE Transactions on Image Processing, vol. 10, no. 7, pp. 1010–1019, 2001.
[23] E. R. Urbach, J. B. Roerdink, and M. H. Wilkinson, "Connected shape-size pattern spectra for rotation and scale-invariant classification of gray-scale images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 272–285, 2007.
[24] P. L. Palmer and M. Petrou, "Locating boundaries of textured regions," IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 5, pp. 1367–1371, 1997.
[25] R. M. Haralick, S. R. Sternberg, and X. Zhuang, "Image analysis using mathematical morphology," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 4, pp. 532–550, 1987.
[26] D. Sinha and E. R. Dougherty, "Fuzzy mathematical morphology," Journal of Visual Communication and Image Representation, vol. 3, no. 3, pp. 286–302, 1992.
[27] K. Nogueira, J. Chanussot, M. D. Mura, W. R. Schwartz, and J. A. dos Santos, "An introduction to deep morphological networks," arXiv preprint arXiv:1906.01751, 2019.
[28] P. D. Gader, Y. Won, and M. A. Khabou, "Image algebra networks for pattern classification," in Image Algebra and Morphological Image Processing V, vol. 2300. International Society for Optics and Photonics, 1994, pp. 157–168.
[29] M. A. Khabou, P. D. Gader, and J. M. Keller, "Morphological shared-weight neural networks: A tool for automatic target recognition beyond the visible spectrum," in Proceedings IEEE Workshop on Computer Vision Beyond the Visible Spectrum: Methods and Applications (CVBVS'99). IEEE, 1999, pp. 101–109.
[30] E. R. Dougherty, "An introduction to morphological image processing," SPIE, 1992.
[31] R. C. Gonzalez, R. E. Woods et al., "Digital image processing," 2002.
[32] Y. LeCun, "Constrained neural networks for unconstrained handwritten digit recognition," in Proc. Frontiers in Handwriting Recognition, 1990, pp. 145–151.
[33] P. Maragos, "Optimal morphological approaches to image matching and object detection," in [1988 Proceedings] Second International Conference on Computer Vision, 1988, pp. 695–699.
[34] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[35] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[36] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, 2015.

Muhammad Aminul Islam (M'18) received the B.Sc. degree in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2005 and the Ph.D. degree in Electrical and Computer Engineering from Mississippi State University, Starkville, MS, USA, in 2018. He is currently an Assistant Professor in the Department of Electrical & Computer Engineering and Computer Science at the University of New Haven (UNH). His research interests include deep learning, computer vision, information fusion, autonomous driving, and remote sensing.

Bryce Murray received his B.S. in Computer Science, Mathematics, and Physics from Mississippi College, Clinton, MS, USA, in 2015. He then received an M.S. in Electrical and Computer Engineering from Mississippi State University, Mississippi State, MS, USA, in 2018. He is a Ph.D. candidate at the University of Missouri, Columbia, MO, USA. His interests include data/information fusion, machine learning, deep learning, computer vision, remote sensing, and eXplainable AI.

Andrew Buck (S'11-M'18) received the B.S. degrees in electrical engineering and computer engineering in 2009, the M.S. degree in computer engineering in 2012, and the Ph.D. degree in electrical and computer engineering in 2018, all from the University of Missouri, Columbia, MO, USA. He is an Assistant Research Professor at the University of Missouri in the Electrical Engineering and Computer Science (EECS) department. His research interests include intelligent agents, deep learning, and computer vision.

Mihail Popescu received his B.S. degree in Nuclear Engineering from the Bucharest Polytechnic Institute in 1987. Subsequently, he received his M.S. degree in Medical Physics in 1995, his M.S. degree in Electrical and Computer Engineering in 1997, and his Ph.D. degree in Computer Engineering and Computer Science in 2003 from the University of Missouri. He is currently a Professor with the Department of Health Management and Informatics, School of Medicine, at the University of Missouri in Columbia, Missouri, USA. Dr. Popescu is interested in machine learning and medical decision support systems. His current research focus is developing decision support systems for early illness recognition in the elderly and investigating sensor data summarization and visualization methodologies. He has authored or coauthored more than 160 technical publications. He is a senior IEEE member.

Derek T. Anderson (SM'13) received the Ph.D. degree in electrical and computer engineering (ECE) from the University of Missouri, Columbia, MO, USA, in 2010. He is an Associate Professor in electrical engineering and computer science (EECS) at the University of Missouri and director of the Mizzou Information and Data Fusion Laboratory (MINDFUL). His research is information fusion in computational intelligence for signal/image processing, computer vision, and geospatial applications. Dr. Anderson has published over 150 articles. He received the best overall paper award at the 2012 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) and the 2008 FUZZ-IEEE best student paper award. He was the Program Co-Chair of FUZZ-IEEE 2019, an Associate Editor for the IEEE Transactions on Fuzzy Systems, Vice Chair of the IEEE CIS Fuzzy Systems Technical Committee (FSTC), and an Area Editor for the International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems.

James Keller is now the University of Missouri Curators Distinguished Professor Emeritus in the Electrical Engineering and Computer Science Department on the Columbia campus. Jim is an Honorary Professor at the University of Nottingham. His research interests center on computational intelligence with a focus on problems in computer vision, pattern recognition, and information fusion, including bioinformatics, spatial reasoning, geospatial intelligence, landmine detection, and technology for eldercare. Professor Keller has been funded by a variety of government and industry organizations and has coauthored over 500 technical publications. Jim is a Life Fellow of the IEEE, an IFSA Fellow, and a past President of NAFIPS. He received the 2007 Fuzzy Systems Pioneer Award and the 2010 Meritorious Service Award from the IEEE Computational Intelligence Society. He finished a full six-year term as Editor-in-Chief of the IEEE Transactions on Fuzzy Systems, followed by being the Vice President for Publications of the IEEE CIS from 2005-2008, and then an elected CIS AdCom member. He is VP Pubs for CIS again, and has served as the IEEE TAB Transactions Chair and as a member of the IEEE Publication Review and Advisory Committee from 2010 to 2017. Jim has had many conference positions and duties over the years.

Grant Scott (S'02-M'09-SM'17) received the B.S. and M.S. degrees in computer science and the Ph.D. degree in computer engineering and computer science from the University of Missouri, Columbia, in 2001, 2003, and 2008, respectively. He is a founding Director of the Data Science and Analytics Master's Degree program at the University of Missouri. He is an Assistant Professor with the Electrical Engineering and Computer Science Department, University of Missouri. Dr. Scott is exploring novel integrations of computational hardware and software to facilitate high performance advances in large-scale data science, computer vision, and pattern recognition. His current research efforts encompass areas such as: real-time processing of large-scale sensor networks, parallel/distributed systems, and Internet of Things (IoT) data; deep learning technologies applied to geospatial data sets for land cover classification and object recognition; extensions of enterprise RDBMS with HPC co-processors; crowd-source information mining and multi-modal analytics; high performance & scalable content-based retrieval (geospatial data, imagery, biomedical); imagery and geospatial data analysis, feature extraction, object-based analysis, and exploitation; pattern recognition databases and knowledge-driven high-dimensional indexing; and image geolocation. He has leveraged this experience in the development of innovative remote sensing (satellite and airborne) change detection technologies, resulting in 5 US Patents. He has participated in a variety of professional networking and academic events, as well as worked with a variety of groups to bring data science training to their people (MU Public Policy, USDA, IEEE international conferences).