QATM: Quality-Aware Template Matching For Deep Learning

Jiaxin Cheng    Yue Wu    Wael Abd-Almageed    Premkumar Natarajan
USC Information Sciences Institute, Marina del Rey, CA, USA
chengjia@{usc/isi}.edu    {yue wu,wamageed,pnataraj}@isi.edu

Abstract

Finding a template in a search image is one of the core problems in many computer vision applications, such as semantic image segmentation, image-to-GPS verification, etc. We propose a novel quality-aware template matching method, QATM, which is not only usable as a standalone template matching algorithm, but also as a trainable layer that can be easily embedded into any deep neural network. Here, quality can be interpreted as the distinctiveness of matching pairs. Specifically, we assess the quality of a matching pair using soft-ranking among all matching pairs, so that different matching scenarios such as 1-to-1, 1-to-many, and many-to-many are all reflected in different values. Our extensive evaluation on classic template matching benchmarks and deep learning tasks demonstrates the effectiveness of QATM. It not only outperforms state-of-the-art template matching methods when used alone, but also largely improves existing deep network solutions.

1. Introduction and Review

Template matching is one of the most frequently used techniques in computer vision applications, such as video tracking [35, 36, 1, 9], image mosaicing [25, 6], object detection [12, 10, 34], character recognition [29, 2, 21, 5], and 3D reconstruction [22, 23, 16]. Classic template matching methods often use sum-of-squared-differences (SSD) or normalized cross correlation (NCC) to calculate a similarity score between the template and the underlying image. These approaches work well when the transformation between the template and the target search image is simple. However, these methods start to fail when the transformation is complex or non-rigid, which is common in real life. In addition, other factors, such as occlusions and color shifts, make these methods even more fragile.

Numerous approaches have been proposed to overcome these real-life difficulties in applying standard template matching. Dekel et al. [11] introduced the Best-Buddies-Similarity (BBS) measure, which focuses on the nearest-neighbor (NN) matches to exclude potential bad matches caused by background pixels. Deformable Diversity Similarity (DDIS) was introduced in [26]; it explicitly considers possible template deformation and uses the diversity of NN feature matches between a template and a potential matching region in the search image. Co-occurrence based template matching (CoTM) was introduced in [14] to quantify the dissimilarity between a template and a potential matched region in the search image. These methods indeed improve the performance of template matching. However, they cannot be used in deep neural networks (DNNs) because of two limitations: (1) they use non-differentiable operations, such as thresholding, counting, etc., and (2) they use operations that are not efficient with DNNs, such as loops and other non-batch operations.

Existing DNN-based methods use simple operations to mimic the functionality of template matching [15, 30, 28, 27, 4], such as computing the tensor dot-product [18] (see numpy.tensordot and tensorflow.tensordot) between two batch tensors of sizes B × H × W × L and B × H' × W' × L along the feature dimension (i.e., L here), producing a batch tensor of size B × H × W × H' × W' containing all pairwise feature dot-product results. Of course, additional operations like max-pooling may also be applied [30, 31, 18, 7].
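To make this batched dot-product concrete, here is a minimal NumPy sketch; the shapes and array names are ours, chosen purely for illustration:

import numpy as np

# Two batches of features: template (B, H, W, L) and search
# (B, H', W', L); hypothetical shapes for illustration.
B, H, W, Hp, Wp, L = 2, 8, 8, 10, 12, 64
feat_t = np.random.randn(B, H, W, L).astype(np.float32)
feat_s = np.random.randn(B, Hp, Wp, L).astype(np.float32)

# Normalize along the feature dimension so that the dot product
# becomes a cosine similarity.
feat_t /= np.linalg.norm(feat_t, axis=-1, keepdims=True)
feat_s /= np.linalg.norm(feat_s, axis=-1, keepdims=True)

# All pairwise feature dot products in one batched contraction:
# the result has shape (B, H, W, H', W').
sim = np.einsum('bhwl,bxyl->bhwxy', feat_t, feat_s)
print(sim.shape)  # (2, 8, 8, 10, 12)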
In this paper, we propose the quality-aware template matching (QATM) method, which can be used as a standalone template matching algorithm, or in a deep neural network as a trainable layer with learnable parameters. It takes the uniqueness of pairs into consideration rather than simply evaluating the matching score. QATM is composed of differentiable and batch-friendly operations and is therefore efficient during DNN training. More importantly, QATM is inspired by assessing the matching quality of the source and target templates, and is thus able to handle different matching scenarios, including 1-to-1, 1-to-many, many-to-many and no-matching. Among these matching cases, only 1-to-1 matching is considered high quality, since it is more distinctive than the 1-to-many and many-to-many cases.

The remainder of the paper is organized as follows. Section 2 discusses motivations and introduces QATM. In Section 3, the performance of QATM is studied in the classic template matching setting. QATM is evaluated on both semantic image alignment and image-to-GPS verification problems in Section 4. We conclude the paper and discuss future work in Section 5.

2. Quality-Aware Template Matching

2.1. Motivation

In computer vision, regardless of the application, many methods implicitly attempt to solve some variant of the following problem: given an exemplar image (or image patch), find the most similar region(s) of interest in a target image. Classic template matching [11, 26, 14], constrained template matching [31], image-to-GPS matching [7], and semantic alignment [18, 19, 8, 13] methods all include some sort of template matching, despite differences in the details of each algorithm. Without loss of generality, we will focus the discussion on the fundamental template matching problem, and illustrate applicability to different problems in later sections.

One known issue in most existing template matching methods is that, typically, all pixels (or features) within the template and a candidate window in the target image are taken into account when measuring their similarity [11]. This is undesirable in many cases, for example when the background behind the object of interest changes between the template and the target image. To overcome this issue, the BBS [11] method relies on nearest neighbor (NN) matches between the template and the target, so that it can exclude most background pixels from matching. On top of BBS, the DDIS [26] method uses the additional deformation information in the NN field to further improve the matching performance.

Unlike previous efforts, we consider five different template matching scenarios, as shown in Table 1, where t and s are patches in the template T and search S images, respectively. Specifically, "1-to-1 matching" indicates exact matching, i.e. two matched objects; "1-to-N" and "M-to-1" indicate that s or t is a homogeneous or patterned patch causing multiple matches, e.g. a sky or a carpet patch; and "M-to-N" indicates many homogeneous/patterned patches in both S and T. It is important to note that this formulation is completely different from the previous NN-based formulation, because even when t and s are nearest neighbors, their actual relationship can still be any of the five cases considered. Among the four matching cases, only 1-to-1 matching is considered high quality. This is due to the fact that in the other three matching cases, even though pairs may be highly similar, the matching is less distinctive because of the multiple matched candidates, which in turn lowers the reliability of the pair.

Matching Cases   1-to-1   1-to-N   M-to-1   M-to-N    Not Matching
Quality          High     Medium   Medium   Low       Very Low
QATM(s, t)       1        1/N      1/M      1/(MN)    1/(‖T‖‖S‖) ≈ 0

Table 1: Template matching cases and ideal scores.

It is clear that the "1-to-1" matching case is the most important, while the "not-matching" case is almost useless. It is therefore not difficult to come up with the qualitative assessment for each case in Table 1. As a result, the optimal matched region in S can be found as the place that maximizes the overall matching quality. We can therefore state a quantitative assessment of the matching as shown in Eq. (1),

R* = argmax_R { Σ_{r∈R} max{ Quality(r, t) | t ∈ T } }    (1)

such that the region R in S that maximizes the overall matching quality is the optimal matching region. R is a fixed-size candidate window; we use the size of the object as the window size in the experiments.
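As an illustration of the exhaustive window search in Eq. (1), the following sketch (our own, assuming a precomputed per-pixel quality map) finds the fixed-size window with the maximal summed quality using an integral image:

import numpy as np

def best_window(quality, win_h, win_w):
    """Return the top-left corner of the window that maximizes the
    summed per-pixel quality, i.e. the argmax in Eq. (1)."""
    # Integral image with a zero border for easy box sums.
    ii = np.zeros((quality.shape[0] + 1, quality.shape[1] + 1))
    ii[1:, 1:] = quality.cumsum(0).cumsum(1)
    # Box sum for every valid window position.
    sums = (ii[win_h:, win_w:] - ii[:-win_h, win_w:]
            - ii[win_h:, :-win_w] + ii[:-win_h, :-win_w])
    return np.unravel_index(sums.argmax(), sums.shape)

# Toy example: a bright 3x4 blob should be recovered.
q = np.zeros((20, 30))
q[5:8, 10:14] = 1.0
print(best_window(q, 3, 4))  # (5, 10)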
2.2. Methodology

To make Eq. (1) applicable to template matching, we need to define Quality(s, t), i.e. how to assess the matching quality between (s, t). In the rest of this section, we derive the quality-aware template matching (QATM) measure, which is a proxy function for the ideal quality assessment Quality(s, t).

Let f_s and f_t be the feature representations of patches s and t, and let ρ(·) be a predefined similarity measure between two patches, e.g. cosine similarity. Given a search patch s, we define the likelihood that a template patch t is matched as shown in Eq. (2),

L(t|s) = exp{α · ρ(f_t, f_s)} / Σ_{t'∈T} exp{α · ρ(f_{t'}, f_s)}    (2)

where α is a positive number that will be discussed later. This likelihood function can be interpreted as a soft-ranking of the current patch t against all other patches in the template image in terms of matching quality. It can alternatively be considered a heated-up softmax embedding [38], i.e. a softmax activation layer with a learnable temperature parameter, α in our context.

In this way, we define the QATM measure simply as the product of the likelihoods that s is matched in T and that t is matched in S, as shown in Eq. (3):

QATM(s, t) = L(t|s) · L(s|t)    (3)

Any reasonable similarity measure ρ(·) that gives a high value when f_t and f_s are similar and a low value otherwise can be used. When t and s are truly matched, ρ(f_t, f_s) should be larger than in the unmatched cases ρ(f_t, f_{s'}). Equivalently, this means ρ(f_t, f_s) is the best match and thus the maximum score. This score will ideally be 1, after lifting by α and activating by the softmax function, when an appropriate α parameter is selected. Similarly, when t matches N patches of s, we should have N equally high matching scores, indicating L(s|t) = 1/N in the ideal case. Table 2 summarizes the ideal scores for all five cases; their values match the subjective quality assessment of the individual cases shown in Table 1.

Matching Case   L(s|t)   L(t|s)   QATM(s, t)
1-to-1          1        1        1
1-to-N          1        1/N      1/N
M-to-1          1/M      1        1/M
M-to-N          1/M      1/N      1/(MN)
Not Matching    1/‖S‖    1/‖T‖    ≈ 0

Table 2: Ideal QATM scores.

Once we have the pairwise QATM results between S and T, the matching quality of an ROI s can be found as shown in Eq. (4),

q(s) = max{ QATM(s, t) | t ∈ T }    (4)

where q(·) indicates the matching quality function. Eventually, we can find the best matched region R*, which maximizes the overall matching quality, as shown in Eq. (5):

R* = argmax_R { Σ_{r∈R} q(r) }    (5)
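A minimal NumPy sketch of Eqs. (2)–(4) follows; this is our illustration, assuming a precomputed raw similarity matrix and a fixed α:

import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def qatm_scores(rho, alpha=28.4):
    """rho: (Nt, Ns) raw similarities between template patches (rows)
    and search patches (columns). Returns QATM(s, t) per Eq. (3)."""
    l_t_given_s = softmax(alpha * rho, axis=0)  # soft-rank over template, Eq. (2)
    l_s_given_t = softmax(alpha * rho, axis=1)  # soft-rank over search
    return l_t_given_s * l_s_given_t            # Eq. (3)

# Quality of each search patch, Eq. (4): best QATM score over the template.
rho = np.random.uniform(-1, 1, size=(25, 100))
q = qatm_scores(rho).max(axis=0)
print(q.shape)  # (100,)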
2.3. QATM As An Algorithmic DNN Layer

The proposed QATM assesses the matching quality in a continuous way. Therefore, its gradients can easily be computed via the chain rule over the individual functions, all of which can be implemented through either a standard DNN layer, e.g. a softmax activation, or basic mathematical operators provided in most DNN frameworks.

Alg. 1 demonstrates how to compute the matching quality maps for both T and S. One can easily implement it in a DNN in roughly 30 lines of Python code using deep learning libraries such as TensorFlow or PyTorch. Specifically, using cosine similarity as an example to assess the raw patch-wise similarity, tf.einsum (line 4) computes all patch-wise similarity scores in a batch way. Once QATM(t, s) is computed, we can compute the template matching maps for the template image T and the target search image S, respectively, as shown in lines 9–10. As one can see, when the α parameter is not trainable, i.e. a fixed value, the proposed QATM layer degrades to a classic template matching algorithm.

Algorithm 1: Compute QATM and matching quality between two images
1: Given: template image I_T and search image I_S, a feature extracting model F, a temperature parameter α. Func(·|I) indicates performing the operation along the axes of I.
2: T ← F(I_T)
3: S ← F(I_S)
4: ρ_st ← Patch-wiseSimilarity(T, S)   ▷ easily obtained via off-the-shelf functions such as tensorflow.einsum or tensorflow.tensordot
5: ρ_st ← ρ_st × α
6: L(s|t) ← Softmax(ρ_st | T)
7: L(t|s) ← Softmax(ρ_st | S)
8: QATM ← L(s|t) × L(t|s)
9: S_map ← Max(QATM | T)   ▷ matching quality score
10: T_map ← Max(QATM | S)

2.4. Discussions on α

In this section, we discuss how α should be picked in a direct template matching scenario that does not involve training a DNN. We later show that QATM can easily be embedded as a trainable layer in DNNs to perform template matching without manually tuning its structure for each task.

When applying Eq. (2), α serves two purposes: (1) matched patches should have ranking scores as close to 1 as possible, and (2) unmatched patches should have ranking scores as close to 0 as possible. As one can see, as α increases, L(t|s)+, the likelihood of matched cases, also increases, and quickly reaches its maximum of 1 beyond some α. However, this does not mean we can simply pick a very large α, because a very large α will also push L(t|s)−, the likelihood of unmatched cases, away from 0. Therefore, a good α can be chosen as the one that provides the largest quality discernibility, as shown in Eq. (6):

α* = argmax_{α>0} ( L(t|s)+ − L(t|s)− )    (6)

In practice, it is difficult to set α properly by hand without knowing details of the similarity score distributions of both matched and unmatched pairs. If both distributions are known, however, we can simulate both L(t|s)+ and L(t|s)−. Without loss of generality, say there are N patches in T. L(t|s), whether or not (t, s) is a matched pair, can be obtained by simulating one f_t feature and N f_s features, or equivalently, by simulating N ρ(f_t, f_s) similarity scores according to the definition in Eq. (2). The major difference between the matched and unmatched cases is that for L(t|s)+ we need one score from the score distribution of matched pairs and N − 1 scores from the distribution of unmatched pairs, while for L(t|s)− all N scores come from the distribution of unmatched pairs.

Fig. 1 shows the difference between E[L(t|s)+] and max{L(t|s)−} for different α values, when the genuine and impostor scores follow the normal distributions N(µ+, 0.01) and N(0, 0.05), respectively, for N = 2200. As one can see, the difference plot is uni-modal, and the optimal α increases as the mean µ+ decreases. This figure is most meaningful when the feature used is from a DNN and the raw similarity measure is cosine similarity. Zhang et al. [37] provide the theoretical cosine similarity score distribution for unmatched pairs, whose mean is 0 and variance is 1/d, where d is the feature dimension. Our empirical studies show that many DNN features attain µ+ above 0.3, e.g. the VGG19 features. Consequently, a reasonable α for DNN features is roughly in [12.5, 33.7] when cosine similarity is used.

Figure 1: The quality discernibility for varying α.
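The simulation behind Fig. 1 can be reproduced in a few lines. The sketch below is our own, with the distribution parameters taken from the text (variances 0.01 and 0.05, i.e. standard deviations 0.1 and √0.05) and µ+ = 0.5 assumed for concreteness:

import numpy as np

rng = np.random.default_rng(0)
N, trials, mu_pos = 2200, 20, 0.5          # mu_pos is an assumed genuine mean
std_pos, std_neg = 0.1, np.sqrt(0.05)      # from variances 0.01 and 0.05

def soft_rank(scores, alpha):
    """Softmax of Eq. (2) over N simulated similarity scores."""
    e = np.exp(alpha * (scores - scores.max()))  # stabilized softmax
    return e / e.sum()

alphas = np.arange(1.0, 60.0, 1.0)
gaps = []
for a in alphas:
    l_pos, l_neg = [], []
    for _ in range(trials):
        # Matched case: one genuine score among N-1 impostor scores.
        pos = np.concatenate([[rng.normal(mu_pos, std_pos)],
                              rng.normal(0.0, std_neg, N - 1)])
        l_pos.append(soft_rank(pos, a)[0])
        # Unmatched case: all N scores are impostor scores.
        l_neg.append(soft_rank(rng.normal(0.0, std_neg, N), a).max())
    gaps.append(np.mean(l_pos) - np.max(l_neg))  # discernibility, Eq. (6)

print('alpha* =', alphas[int(np.argmax(gaps))])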
3. QATM Performance in Template Matching

We start by evaluating the proposed QATM performance on the classic template matching problem. Our code is released in the open repository https://github.com/cplusx/QATM.

3.1. Experimental Setup

To find the matched region in the search image S, we compute the matching quality map on S through the proposed NeuralNetQATM layer (without learning α; see Alg. 1), which takes a search image I_S and a template image I_T as inputs. One can then find the best matched region R* in S using Eq. (5).

We follow the evaluation process given in [24] and use the standard OTB template matching dataset [32], which contains 105 template-image pairs from 35 color videos. We use the 320-d convolutional features from a pretrained ImageNet-VGG19 network. The standard intersection over union (IoU) and area-under-curve (AUC) methods are used as evaluation metrics. QATM is compared against three state-of-the-art methods, BBS [11], DDIS [26] and CoTM [24], plus classic template matching using SSD and NCC.

3.2. Performance On The Standard OTB Dataset

In this experiment, we follow all the experiment settings from [14], and evaluate the proposed QATM method on the standard OTB dataset. The α value is set to 28.4, which is the peak of VGG's curve (see Fig. 1). The QATM performance as well as all baseline performances are shown in Fig. 2-(a). As one can see, the proposed QATM outperforms the state-of-the-art methods and leads the second best (CoTM) by roughly 2% in terms of AUC score, which is clearly a noticeable improvement compared to the 1% performance gap between BBS and its successor DDIS.

Since the proposed QATM method has the parameter α, we also evaluate the QATM performance under varying α values, as shown in Fig. 2-(b). It is clear that the overall QATM performance is not very sensitive to the choice of α when it is near the optimal value. As indicated by the horizontal dashed line in Fig. 2-(b), a range of α values (rather than a single value) leads to better performance than the state-of-the-art methods. More qualitative results can be found in Fig. 3.

Figure 2: Template matching performance comparisons. (a) QATM vs. SOTA methods on the OTB dataset. (b) QATM performance under varying α on the OTB dataset. (c) QATM vs. SOTA methods on the MOTB dataset.

3.3. Performance On The Modified OTB Dataset

One issue with the standard OTB dataset is that it does not contain any negative samples, whereas in real applications we have no idea whether a template of interest exists in a search image. We therefore create a modified OTB (MOTB) dataset. Specifically, for each pair of search image S and template T in OTB, we (1) reuse this pair (S, T) in MOTB as a positive sample, and (2) keep S untouched while replacing T with a new template T', where T' is from a different OTB video, and use this (S, T') as a negative sample. The negative template T' is chosen to be the same size as T and is randomly cropped from a video frame.

The overall goal of this study is to fairly evaluate template matching performance in the presence of negative samples. For each sample in MOTB, i.e. a pair of (template, search image), we feed it to a template matching algorithm and record the average response of the found region in the search image. For the proposed QATM method, we again use α = 28.4. These responses along with the true labels of each pair are then used to plot the ROC curves shown in Fig. 2-(c). Intuitively, a good template matching method should give much lower matching scores for a negative sample than for a positive sample, and thus attain a higher AUC score. The proposed QATM method clearly outperforms the three state-of-the-art methods by a large margin, roughly 9% in terms of AUC score. More importantly, the proposed QATM method attains a much higher true positive rate at low false positive rates. This result is not surprising, since the proposed QATM is quality aware.

For example, when a negative template is homogeneous, all methods will find a homogeneous region in the search image, since it is the most similar region. The difference is that our approach is quality-aware, so the matching score in this case will be much lower than that of a positive template, while other methods do not have this feature.

3.4. Discussions

Fig. 3 provides more qualitative results from the proposed QATM method and other state-of-the-art methods. These results confirm that QATM, which weights 1-to-1, 1-to-many, and many-to-many matching cases differently, not only finds more accurate matched regions in the search image, but also reduces the responses in unmatched cases. For example, in the last row, when a nearly homogeneous negative template is given, the proposed QATM method is the only one that tends to give low scores, while the others still return high responses.

Figure 3: Qualitative template matching performance. Columns from left to right are: the template frame, the target search frame with predicted bounding boxes overlaid (different colors indicate different methods), and the response maps of QATM, BBS, DDIS, CoTM, respectively. Rows from top to bottom: the top four are positive samples from OTB, while the bottom four are negative samples from MOTB. Best viewed in color and zoom-in mode.

Finally, the matching speed also matters. We thus estimate the processing speed (sec/sample) for each method using the entire OTB dataset. Evaluations are based on an Intel(R) Xeon(R) E5-4627 v2 CPU and a GeForce GTX 1080 Ti GPU, respectively. Table 3 compares the estimated time complexity of the different methods. Though QATM contains the relatively expensive softmax operation, its DNN-compatible nature makes GPU processing feasible, which is clearly the fastest option.

Methods           SSD    NCC    BBS     DDIS   CoTM    QATM    QATM
Backend           CPU    CPU    CPU     CPU    CPU     CPU     GPU
Average (sec.)    1.1    1.5    15.3    2.6    47.7    27.4    0.3
StandDev (sec.)   0.47   0.53   13.10   2.29   18.50   17.80   0.12

Table 3: Time complexity comparisons. (Time for feature extraction is excluded.)

4. Learnable QATM Performance

In this section, we focus on using the proposed QATM as a differentiable layer with learnable parameters in different template matching applications.

4.1. QATM for Image-to-GPS Verification

The image-to-GPS verification (IGV) task attempts to verify whether a given image was taken at the claimed GPS location through visual verification. IGV first uses the claimed location to find a reference panorama image in a third-party database, e.g. Google StreetView, and then takes both the given image and the reference as network inputs to verify the visual content via template matching and produce the verification decision. The major challenges of the IGV task compared to the classic template matching problem are (1) only a small, unknown portion of the visual content in the query image can be verified in the reference image, and (2) the reference image is a panorama, where the potential matching ROI might be distorted.

4.1.1 Baseline and QATM Settings

To understand the QATM performance in the IGV task, we use the baseline method [7] and repeat its network training, data augmentation, evaluation, etc., except that we replace its Bottom-up Pattern Matching (BUPM) module with the proposed NeuralNetQATM layer (blue box in Fig. 4).

Figure 4: The baseline network architecture from [7], and the QATM version. The dashed arrows indicate the replacement relationship.

The Bottom-up Pattern Matching module first computes the cosine similarity between two image features, and then pools out only the maximum response. More precisely, its matching score for a patch s given the template T relies on Eq. (7),

R(s|T) = max{ ρ(f_t, f_s) | t ∈ T }    (7)

while the QATM version relies on Eq. (4).
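The behavioral difference between the two scoring rules can be seen in a toy sketch (our own; rho plays the role of the patch-wise similarities ρ_st in Alg. 1):

import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bupm_score(rho):
    """Eq. (7): max-pooled cosine similarity per search patch."""
    return rho.max(axis=0)               # max over template patches t

def qatm_quality(rho, alpha=28.4):
    """Eq. (4): quality-aware score per search patch."""
    qatm = softmax(alpha * rho, 0) * softmax(alpha * rho, 1)
    return qatm.max(axis=0)

# A homogeneous (M-to-N) case gets a high BUPM score but a low QATM
# quality, because the softmax mass is split across all matches.
rho = np.full((4, 6), 0.9)  # every template patch matches every search patch
print(bupm_score(rho)[0], qatm_quality(rho)[0])  # 0.9 vs. ~1/24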

4.1.2 Performance Comparison

To evaluate QATM performance, we reuse the two datasets used by [7], namely the Shibuya and Wikimedia Common datasets, both of which contain balanced positive and negative samples. Comparison results are listed in Table 4. The proposed QATM solution outperforms the baseline BUPM method on the more difficult Shibuya dataset, while performing slightly worse on the Wikimedia Common dataset. This is likely attributable to the fact that the verification part of the baseline method (see Fig. 4) was proposed to optimize the BUPM performance rather than the QATM performance, and thus the advantage of using QATM has not fully transferred to the verification task.

                 Wikimedia Common   Shibuya
NetVLAD [3]      0.819 / 0.847      0.634 / 0.638
DELF [17]        0.800 / 0.802      0.607 / 0.621
PlacesCNN [39]   0.656 / 0.654      0.592 / 0.592
BUPM* [7]        0.864 / 0.886      0.764 / 0.781
QATM             0.857 / 0.886      0.777 / 0.801

Table 4: Image-to-GPS verification performance comparisons. Performance scores are reported in the (ROC-AUC / Avg. precision) format. (* indicates the baseline network.)

We therefore annotate the matched regions with polygon bounding boxes for the Wikimedia Common dataset, to better evaluate the matching performance. These annotations will be released. With the help of these ground truth masks, we are able to fairly compare the proposed QATM and BUPM on the localization task alone, which is to predict the matched region in a panorama image. These results are shown in Table 5; QATM improves the BUPM localization performance by 21% relatively for both the F1 and IoU measures. The superiority of QATM for localization is further confirmed by the qualitative results shown in Fig. 5, where the QATM-improved version produces much cleaner response maps than the baseline BUPM method.

Wikimedia Common   F1     IoU
BUPM               0.33   0.24
QATM               0.40   0.29

Table 5: Localization performance comparisons. Performance scores are averaged over the entire dataset.

Figure 5: Qualitative image-to-GPS results. Columns from left to right are: the query image, the reference panorama image with predicted bounding boxes overlaid (GT, the proposed QATM, and the baseline BUPM), and the response maps of the ground truth mask, the QATM-improved version, and the baseline, respectively.

4.2. QATM for Semantic Image Alignment

The overall goal of the semantic image alignment (SIA) task is to warp a given image such that it is aligned to a reference image in terms of category-level correspondence. A typical DNN solution for the SIA task takes two input images, one for warping and the other for reference, and commonly outputs a set of parameters for image warping. More detailed descriptions of the problem can be found in [18, 19, 13] (project pages: https://www.di.ens.fr/willow/research/cnngeometric/, https://www.di.ens.fr/willow/research/weakalign/, https://www.di.ens.fr/willow/research/scnet/).

4.2.1 Baseline and QATM Settings

To understand the QATM performance in the SIA task, we select the baseline method GeoCNN [18], and mimic all of its network-related settings, including the network architecture, training dataset, loss function, learning rates, etc., except that we replace the method's matching module (orange box in Fig. 6) with the NeuralNetQATM layer (yellow box in Fig. 6).

Unlike in template matching, the SIA task relies on the raw matching scores between all template and search image patches (such that geometric information is implicitly preserved) to regress the warping parameters. The matching module in [18] simply computes the cosine similarity between two patches, i.e. ρ(S, T) (see ρ_st in line 4 of Alg. 1), and uses this tensor as the input for regression. We therefore make the corresponding change and let the proposed NeuralNetQATM layer produce the raw QATM matching scores, i.e. QATM(S, T) (see QATM in line 8 of Alg. 1), instead of the matching quality maps.
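A compact PyTorch sketch of such a layer with a learnable α, outputting those raw scores, is shown below; this is our reconstruction from Alg. 1, not the authors' released NeuralNetQATM code, and the shapes are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QATMLayer(nn.Module):
    """Trainable QATM: raw scores QATM(S, T) of shape (B, H, W, H', W')."""
    def __init__(self, alpha=25.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(float(alpha)))  # temperature

    def forward(self, feat_t, feat_s):
        # feat_t: (B, H, W, L), feat_s: (B, H', W', L); cosine similarity.
        t = F.normalize(feat_t, dim=-1)
        s = F.normalize(feat_s, dim=-1)
        rho = torch.einsum('bhwl,bxyl->bhwxy', t, s) * self.alpha
        B, H, W, Hp, Wp = rho.shape
        flat = rho.reshape(B, H * W, Hp * Wp)
        l_s_given_t = F.softmax(flat, dim=2)  # soft-rank over search patches
        l_t_given_s = F.softmax(flat, dim=1)  # soft-rank over template patches
        return (l_s_given_t * l_t_given_s).reshape(B, H, W, Hp, Wp)

layer = QATMLayer()
scores = layer(torch.randn(2, 8, 8, 64), torch.randn(2, 10, 12, 64))
print(scores.shape)  # torch.Size([2, 8, 8, 10, 12])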

Figure 6: The baseline network architecture from [18], and the QATM version. The dashed arrows indicate the replace- ment relationship.

4.2.2 Performance Comparisons

To fairly compare SIA performance, we follow the evaluation protocol proposed in [13], which splits the standard PF-PASCAL benchmark into training, validation, and testing subsets with 700, 300, and 300 samples, respectively. System performance is reported in terms of the percentage of correct keypoints (PCK) [33, 13], which counts the percentage of keypoints whose distance to the ground truth, after being transformed, is under a threshold. The threshold is set to τ = 0.1 of the image size in the experiment. Table 6 compares different methods on this dataset. The proposed QATM method clearly outperforms all baseline methods, and is also the top-ranking method for 7 out of 20 sub-classes. Furthermore, SCNet [13] uses much more advanced features and matching mechanisms than our baseline GeoCNN method, and [19] fine-tuned GeoCNN on the training subset of PF-PASCAL with a very small learning rate. Our results confirm that simply replacing the raw matching scores with quality-aware scores can lead to a larger gain than using a more complicated network, without fine-tuning on the PF-PASCAL subset. A concurrent work [20] adopted an idea similar to QATM, re-ranking matching scores through a softmax function; it reassigns matching scores by finding soft mutual nearest neighbours, and outperforms QATM when trained on the PF-PASCAL subset. More qualitative results can be found in Fig. 7.

Class     UCN [8]   SCNet [13]   GeoCNN* [18]   WSup [19]   NC-Net [20]   QATM
plane     64.8      85.5         82.4           83.7        -             83.5
bike      58.7      84.4         80.9           88.0        -             86.2
bird      42.8      66.3         85.9           83.4        -             80.7
boat      59.6      70.8         47.2           58.3        -             72.2
bottle    47.0      57.4         57.8           68.8        -             78.1
bus       42.2      82.7         83.1           90.3        -             87.4
car       61.0      82.3         92.8           92.3        -             91.8
cat       45.6      71.6         86.9           83.7        -             86.9
chair     49.9      54.3         43.8           47.4        -             48.8
cow       52.0      95.8         91.7           91.7        -             87.5
d.table   48.5      55.2         28.1           28.1        -             26.6
dog       49.5      59.5         76.4           76.3        -             78.7
horse     53.2      68.6         70.2           77.0        -             77.9
m.bike    72.7      75.0         76.6           76.0        -             79.9
person    53.0      56.3         68.9           71.4        -             69.5
plant     41.4      60.4         65.7           76.2        -             73.3
sheep     83.3      60.0         80.0           80.0        -             80.0
sofa      49.0      73.7         50.1           59.5        -             51.6
train     73.0      66.5         46.3           62.3        -             59.3
tv        66.0      76.7         60.6           63.9        -             64.4
Average   55.6      72.2         71.9           75.8        78.9          75.9

Table 6: Semantic image alignment performance compari- son on PF-PASCAL. (∗ indicates the baseline network.)
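For completeness, a small sketch of the PCK metric described above follows; it is our own implementation of the stated definition, and normalizing by the larger image side is our reading of "τ = 0.1 of image size" (other PCK variants normalize by the object bounding box):

import numpy as np

def pck(pred_kpts, gt_kpts, img_size, tau=0.1):
    """Percentage of correct keypoints: a transformed keypoint is correct
    if it lies within tau * max(image side) of its ground-truth location.
    pred_kpts, gt_kpts: (N, 2) arrays of (x, y); img_size: (width, height)."""
    thresh = tau * max(img_size)
    dists = np.linalg.norm(np.asarray(pred_kpts, float)
                           - np.asarray(gt_kpts, float), axis=1)
    return float((dists <= thresh).mean())

print(pck([[10, 10], [50, 80]], [[12, 11], [90, 80]], img_size=(100, 100)))  # 0.5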

5. Conclusion

We introduced a novel quality-aware template matching method, QATM. QATM is inspired by the natural quality differences among different matching cases. It is designed so that its matching score accurately reflects the relative matching distinctiveness of the current matching pair against others. More importantly, QATM is differentiable with a learnable parameter, can easily be implemented with existing common deep learning layers, and can be directly embedded into a DNN model to fulfill the template matching goal.

Our extensive experiments show that, when used alone, QATM outperforms the state-of-the-art template matching methods, producing more accurate matches and fewer false alarms, with at least a 10x speedup with the help of a GPU. When plugged into existing DNN solutions for template matching related tasks, it noticeably improves the scores on both the semantic image alignment task and the image-to-GPS verification task.

Acknowledgement

This work is based on research sponsored by the Defense Advanced Research Projects Agency under agreement number FA8750-16-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency or the U.S. Government.

Figure 7: Qualitative results on the PF-PASCAL dataset. Columns from left to right represent the source image, the target image, and the transform results of QATM, GeoCNN [18], and [19]. Circles and crosses indicate key points on source images and target images.

References

[1] Amit Adam, Ehud Rivlin, and Ilan Shimshoni. Robust fragments-based tracking using the integral histogram. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 798-805. IEEE, 2006.
[2] Alireza Alaei and Mathieu Delalandre. A complete logo detection/recognition system for document images. In Proceedings of the International Workshop on Document Analysis Systems, pages 324-328. IEEE, 2014.
[3] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5297-5307, 2016.
[4] Min Bai, Wenjie Luo, Kaustav Kundu, and Raquel Urtasun. Exploiting semantic information and deep matching for optical flow. In European Conference on Computer Vision, pages 154-170. Springer, 2016.
[5] Raluca Boia, Corneliu Florea, Laura Florea, and Radu Dogaru. Logo localization and recognition in natural images using homographic class graphs. Machine Vision and Applications, 27(2):287-301, 2016.
[6] Matthew Brown and David G Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74(1):59-73, 2007.
[7] Jiaxing Cheng, Yue Wu, Wael AbdAlmageed, and Prem Natarajan. Image-to-GPS verification through a bottom-up pattern matching network. In Asian Conference on Computer Vision. Springer, 2018.
[8] Christopher B Choy, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. Universal correspondence network. In Advances in Neural Information Processing Systems, pages 2414-2422, 2016.
[9] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):564-577, 2003.
[10] James Coughlan, Alan Yuille, Camper English, and Dan Snow. Efficient deformable template detection and localization without user initialization. Computer Vision and Image Understanding, 78(3):303-319, 2000.
[11] Tali Dekel, Shaul Oron, Michael Rubinstein, Shai Avidan, and William T Freeman. Best-buddies similarity for robust template matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2021-2029, 2015.
[12] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645, 2010.
[13] Kai Han, Rafael S. Rezende, Bumsub Ham, Kwan-Yee K. Wong, Minsu Cho, Cordelia Schmid, and Jean Ponce. SCNet: Learning semantic correspondence. In Proceedings of the IEEE International Conference on Computer Vision, October 2017.
[14] Rotal Kat, Roy Jevnisek, and Shai Avidan. Matching pixels using co-occurrence statistics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1751-1759, 2018.
[15] Wenjie Luo, Alexander G Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5695-5703, 2016.
[16] Abed Malti, Richard Hartley, Adrien Bartoli, and Jae-Hak Kim. Monocular template-based 3D reconstruction of extensible surfaces with local linear elasticity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1522-1529, 2013.
[17] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3456-3465, 2017.
[18] Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. Convolutional neural network architecture for geometric matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2017.
[19] Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. End-to-end weakly-supervised semantic alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6917-6925, 2018.
[20] Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Neighbourhood consensus networks. In Advances in Neural Information Processing Systems, pages 1658-1669, 2018.
[21] Michael Ryan and Novita Hanafiah. An examination of character recognition on ID card using template matching approach. Procedia Computer Science, 59:520-529, 2015.
[22] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7-42, 2002.
[23] Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 519-528. IEEE, 2006.
[24] Denis Simakov, Yaron Caspi, Eli Shechtman, and Michal Irani. Summarizing visual data using bidirectional similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8. IEEE, 2008.
[25] Richard Szeliski et al. Image alignment and stitching: A tutorial. Foundations and Trends in Computer Graphics and Vision, 2(1):1-104, 2007.
[26] Itamar Talmi, Roey Mechrez, and Lihi Zelnik-Manor. Template matching with deformable diversity similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1311-1319, 2017.
[27] Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Multi-person tracking by multicut and deep matching. In European Conference on Computer Vision, pages 100-111. Springer, 2016.
[28] James Thewlis, Shuai Zheng, Philip HS Torr, and Andrea Vedaldi. Fully-trainable deep matching. In British Machine Vision Conference, 2016.
[29] Oivind Due Trier, Anil K Jain, Torfinn Taxt, et al. Feature extraction methods for character recognition - a survey. Pattern Recognition, 29(4):641-662, 1996.
[30] Yue Wu, Wael Abd-Almageed, and Prem Natarajan. Deep matching and validation network: An end-to-end solution to constrained image splicing localization and detection. In Proceedings of the ACM on Multimedia Conference, pages 1480-1502. ACM, 2017.
[31] Yue Wu, Wael Abd-Almageed, and Prem Natarajan. BusterNet: Detecting copy-move image forgery with source/target localization. In European Conference on Computer Vision, September 2018.
[32] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2411-2418, 2013.
[33] Yi Yang and Deva Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2878-2890, 2013.
[34] Alan L Yuille, Peter W Hallinan, and David S Cohen. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8(2):99-111, 1992.
[35] Tianzhu Zhang, Kui Jia, Changsheng Xu, Yi Ma, and Narendra Ahuja. Partial occlusion handling for visual tracking via robust part matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1258-1265, 2014.
[36] Tianzhu Zhang, Si Liu, Narendra Ahuja, Ming-Hsuan Yang, and Bernard Ghanem. Robust visual tracking via consistent low-rank sparse learning. International Journal of Computer Vision, 111(2):171-190, 2015.
[37] Xu Zhang, Felix X. Yu, Sanjiv Kumar, and Shih-Fu Chang. Learning spread-out local feature descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pages 4605-4613, 2017.
[38] Xu Zhang, Felix Xinnan Yu, Svebor Karaman, Wei Zhang, and Shih-Fu Chang. Heated-up softmax embedding. arXiv preprint arXiv:1809.04157, 2018.
[39] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.