Matchability Prediction for Full-Search Template Matching Algorithms

Adrian Penate-Sanchez1, Lorenzo Porzi2,3, and Francesc Moreno-Noguer1

1Institut de Robòtica i Informàtica Industrial, UPC-CSIC, Barcelona, Spain   2Fondazione Bruno Kessler, Trento, Italy   3University of Perugia, Italy

(Adrian Penate-Sanchez and Lorenzo Porzi have contributed equally to this work.)

Abstract

While recent approaches have shown that it is possible to do template matching by exhaustively scanning the parameter space, the resulting algorithms are still quite demanding. In this paper we alleviate the computational load of these algorithms by proposing an efficient approach for predicting the matchability of a template before the matching is actually performed, avoiding large amounts of unnecessary computation. We learn the matchability of templates using dense convolutional neural network descriptors that do not require ad-hoc criteria to characterize a template. By using these patch descriptions we are able to predict matchability over the whole image quite reliably. We also show that no scene-specific training data is required to solve problems like panorama stitching, which usually require data from the scene in question. Due to the highly parallelizable nature of the task, we offer an efficient technique with a negligible computational cost at test time.

1. Introduction

Template-based matching has been an important topic in Computer Vision for nearly as long as the field has existed. Recently it has been applied successfully to dense 3D reconstruction [8], image-based rendering [7], image matching [32], and even object detection from RGBD data [12].

Template matching algorithms can consider all possible transformations [18, 31] of the template or just a discrete subset of them [11, 21]. By sampling a subset of all candidate templates we are able to obtain algorithms that can operate at speeds of nearly 30 fps [11]. On the other hand, template matching algorithms that perform a full search over all possible transformations of the template are guaranteed to find the global maximum of the distance function, yielding more accurate results at the price of an increased computational cost.

Several approaches have focused on improving specific characteristics of the template matching framework, such as its speed [22, 24], its robustness to partial occlusions [4] or ambiguities [26], and its ability to select reliable templates for rapid visual tracking [2]. Yet most of these improvements have been designed for algorithms that do not perform a full search. In this work we bring these improvements to algorithms that perform full search, and in particular we consider FAst-Match [15], a recent method that performs near full search at reasonable speed.

Algorithms like FAst-Match, though, may still suffer from false positives. Furthermore, they are only able to estimate affine transformations: this is enough to approximate small projective deformations, but produces larger errors in the general case (as seen in the third experiment of [15]). We aim to provide a template selection approach that partially overcomes these problems. We propose to learn a matchable template detector by modeling the probability of a template being correctly matched. In particular, we combine dense CNN features, given the promising results they have obtained on a variety of tasks [30, 16], with a logistic regression objective. This allows for an extremely efficient GPU implementation of the proposed method, which makes its computational demands negligible when compared to the matching itself.

As finding a close approximation of the correct transformation is not difficult when performing full search, our main aim is to reduce the overlap error, as well as to decrease the computational effort by preemptively selecting good templates. We use the overlap error as it is the measure the community agrees on [15, 19, 20]; this is discussed extensively in [19, 20].

Figure 1. First row: first image from sequences 1, 2, 4 and 5. Second row: heat maps representing the response of our matchable template detector on the images in the first row (blue: low probability, red: high probability). Third row: original images filtered using the heat maps to highlight the regions most likely to be selected.

In a similar manner as [10] did for feature points, we will identify which areas of a scene give good matchability. The main difference when using templates is that, instead of focusing on a set of points, one has to define the matchability of the whole image. An example of how our approach is capable of obtaining the interesting parts of an entire image can be seen in Fig. 1. In this work we show that template selection is able to noticeably improve matching accuracy at the expense of an almost negligible additional computational cost. Furthermore, we show that the proposed approach has good generalization capabilities.

2. Related Work

As mentioned above, we may roughly classify template matching algorithms into those that perform a full or near full search of the transformation parameters [18, 31], and those that just locally or discretely refine the parameters. The latter allow for fast optimization but do not guarantee a global maximum [11].

In between, there exists a huge body of work that attempts to balance both. In [1] the relation between appearance distance and spatial overlap is exploited to give an upper bound on the appearance distance given the spatial overlap of two windows in an image. By doing this, it is then possible to provide a computationally efficient solution to the template matching problem. In order to handle general transformations, a grayscale template matching algorithm that considered variations of scale and rotation was presented in [14]. More recently FAst-Match [15] was proposed as an efficient way to handle general affine transformations; although it was not the first successful approach to do so [9], FAst-Match remains to this day the approach that shows the best results in the literature. FAst-Match lies between discrete template matching and full search approaches: it considers the whole affine transformation space but, using a tree search approach, avoids searching high-error subspaces.

2.1. Feature Selection

Feature selection approaches can enhance both the speed and the robustness of matching techniques. In [13] feature selection is done based on the upper bound of the average error, prioritizing templates that are more robust to small errors in the transformation; this is appropriate to avoid testing huge amounts of candidate templates. In [2], the reliability of a template is defined through a number of characteristics, namely the uniformity of texture, contrast and spatial locality, among others. These characteristics are then used to define a scoring scheme, using Support Vector Regression [25], that at runtime scores the different candidates.

Fk(·) Template Selection

Tθ1(·)

Fk(·) Tθ2(·)

Tθt(·)

Fk(·)

Figure 2. Schematic representation of our matchable template detection method. In a first step we apply a set of transformations

Tθ1 ,...,Tθt to the input image. Then we extract dense CNN features for each of the transformed images and use them to compute the matchability maps. Finally we select the matchable templates as the set of image patches corresponding to local maxima in the matchability maps. dates. [4] presents a method that selects template pixels that template. We can express this as an optimization problem: verify the approximation of the tracking algorithm. In [34] X method that selects an set of templates by learning a quality arg min kI1(p) − I2(Ap + t)k, (1) A,t measure and by optimizing coverage. The commented work p∈I1 is mainly applied to tracking and builds on the assumption 2×2 2 where A ∈ R and t ∈ R are the transformation’s that a full search is not performed on the test image, it just parameters, while the vector p denotes pixel coordinates needs the template to be stable to local transformations and on the images. FAst-Match describes a branch-and-bound does not have to assume free transformations through the scheme to find a good solution of (1) over the 6-DOF space whole space. of affine transformations in a computationally efficient way. Another relevant approach [24] enhances the perfor- In general, the template I1 can be a rectangular region of mance and the speed of [11] and [12] by learning what a larger image I, which we call the source image. In this parts of the template are worth analyzing and boosts the best case, selecting a good template from I can be crucial for parts of each template only testing those, thus in the process the effectiveness of the template matching algorithm. Con- reducing greatly the computational cost. This work focus sider e.g. a natural landscape image: intuitively, a template on texture-less objects and uses edges as features making it depicting a large portion of uniformly colored sky could be specific for such objects. It was also expanded to handle the especially hard to match. Similarly to the well-known ap- RGB-D inputs of [12] in a similar manner. proach adopted for feature point descriptors and detectors We have seen how learning algorithms have been pre- such as SIFT [17] or SURF [3], we propose a matchable viously applied to speedup and increase reliability of tem- template detector which is able to select good candidate plate matching when using discrete search template match- templates from an input image. We train our detector with ing algorithms [24, 2, 13]. But when dealing with full the objective of learning a function that assigns to each can- search [18, 31] or near full search algorithms [15] you are didate template a probability of it being correctly matched. guaranteed to obtain the global maximum of your distance In this work we only consider FAst-Match, but the measure so the same criteria does not apply. We will focus method we propose is completely general and can be used our work more on learning how to improve the precision of with any template matching algorithm. the yielded matching rather than on the recall that a template can provide. 3.1. Learning a template matchability predictor Given a template matching algorithm and a set of train- 3. Method ing source and target images, we sample random templates from the source images and match them to one of the tar- Given a template image I1 and a target image I2, tem- get images using the matching algorithm. 
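To make the objective in (1) concrete, the following is a minimal sketch, not the authors' code, of how a candidate affine transform (A, t) can be scored with the SAD of (1), using bilinear sampling of the target image; the function name and coordinate conventions are our own illustrative assumptions.

```python
# A minimal sketch of the SAD objective of Eq. (1). A candidate affine
# transform (A, t) is scored by warping the template's pixel grid into the
# target image and summing absolute differences (bilinear sampling).
import numpy as np
from scipy.ndimage import map_coordinates

def sad_score(template, target, A, t):
    """template, target: 2-D grayscale arrays; A: (2, 2); t: (2,)."""
    h, w = template.shape
    ys, xs = np.mgrid[0:h, 0:w]                    # template coordinates p
    pts = np.stack([xs.ravel(), ys.ravel()])       # 2 x (h*w), (x, y) order
    warped = A @ pts + t[:, None]                  # A p + t, target coordinates
    # Sample the target at the warped (row, col) locations, bilinearly.
    vals = map_coordinates(target, [warped[1], warped[0]],
                           order=1, mode='nearest')
    return float(np.abs(template.ravel() - vals).sum())
```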
In general, the template I_1 can be a rectangular region of a larger image I, which we call the source image. In this case, selecting a good template from I can be crucial for the effectiveness of the template matching algorithm. Consider e.g. a natural landscape image: intuitively, a template depicting a large portion of uniformly colored sky could be especially hard to match. Similarly to the well-known approach adopted for feature point descriptors and detectors such as SIFT [17] or SURF [3], we propose a matchable template detector which is able to select good candidate templates from an input image. We train our detector with the objective of learning a function that assigns to each candidate template a probability of it being correctly matched. In this work we only consider FAst-Match, but the method we propose is completely general and can be used with any template matching algorithm.

3.1. Learning a template matchability predictor

Given a template matching algorithm and a set of training source and target images, we sample random templates from the source images and match them to one of the target images using the matching algorithm. Using this procedure we collect a set of observations (x, y) \in S, where x = (I, p, w, h) denotes the w \times h pixel patch of image I \in \mathcal{I} centered at p, i.e. the template, and y \in \{-1, +1\} is a label that indicates whether the template has been correctly matched in a target image (y = +1) or not (y = -1). Given a template x, we model the posterior probability of it being matched as

    P(y \mid x, w) = \frac{e^{y w^\top \phi(x)}}{1 + e^{y w^\top \phi(x)}},    (2)

where w \in \mathbb{R}^m is a parameter vector to be learned and \phi(\cdot) is a feature function that will be defined in Sec. 3.2. We learn w by means of regularized maximum log-likelihood estimation, which yields the well-known L2-regularized logistic regression problem:

    \arg\min_{w} \; \frac{1}{2} w^\top w + C \sum_{(x,y) \in S} \log\left(1 + e^{-y w^\top \phi(x)}\right),    (3)

where C is a non-negative regularization parameter. We solve (3) using the trust region Newton method implemented in LIBLINEAR [6].
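As a rough illustration of this learning step, the sketch below fits the L2-regularized logistic regression of (3) using scikit-learn's wrapper around LIBLINEAR; the feature matrix `X` (one \phi(x) per row) and label vector `y` are assumed to have been collected as described above, and the function names are ours, not the paper's.

```python
# Sketch of the learning step of Eq. (3): L2-regularized logistic regression
# on the collected descriptors. The paper solves it with LIBLINEAR's trust
# region Newton method; scikit-learn's 'liblinear' solver wraps that library.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_matchability_predictor(X, y, C=1.0):
    """X: (n, m) feature vectors phi(x); y: labels in {-1, +1}."""
    clf = LogisticRegression(penalty='l2', C=C, solver='liblinear',
                             fit_intercept=False)
    clf.fit(X, y)
    return clf.coef_.ravel()                       # the weight vector w of Eq. (2)

def match_probability(w, phi_x):
    """P(y = +1 | x, w), the logistic sigmoid of w^T phi(x)."""
    return 1.0 / (1.0 + np.exp(-w @ phi_x))
```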
3.2. CNN Features

An n-layer convolutional neural network can be expressed as a function F : \mathcal{I} \to \mathbb{R}^{d_1 \times \cdots \times d_s} mapping images to s-dimensional tensors. F can be factorized as the composition of n functions F = f_n \circ \cdots \circ f_1, corresponding to the network's layers. We denote by F_k = f_k \circ \cdots \circ f_1, k \le n, the function corresponding to the first k layers of the network. Many common CNN architectures (e.g. those used in image recognition) can be subdivided into a convolutional and a fully connected part. That is, every F_k for k \le k' (the convolutional part) produces a 3-dimensional tensor output, while every F_k for k > k' (the fully connected part) produces a vector output. In these cases, the functions F_k(I) for k \le k' can be understood as dense feature maps over the input image I.

In the setting described above, F_k : \mathcal{I} \to \mathbb{R}^{r \times c \times m} associates a feature vector in \mathbb{R}^m to each of r \times c local regions of the input image. These regions are given by a sliding window, which is commonly referred to as the receptive field of the convolutional neurons of layer k. The size (w_s, h_s) and stride of the window depend on the hyperparameters of the network's bottom k layers. Given these definitions, we define the feature function \phi(x) of (3) as

    \phi(x) = \left[ Z_{i',j',1}, \ldots, Z_{i',j',m} \right]^\top, \qquad Z = F_k(T_\theta(I)),    (4)

where we use subscript notation A_{a,b,\ldots} to denote the element of the tensor A indexed by (a, b, \ldots). T_\theta(\cdot) is an asymmetric image scaling, parameterized by a tuple \theta = (\sigma_1, \sigma_2), such that

    J = T_\theta(I) \;\Rightarrow\; J(p) = I\left( \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} p \right).    (5)

The scaling parameters \theta are chosen to exactly match the size of the sliding window to that of the input template x in the transformed image, that is: \sigma_1 = w_s / w, \sigma_2 = h_s / h. Similarly, the indices (i', j') are chosen so that the overlap between the corresponding sliding window and the template is maximized.

In practice, instead of taking any template and looking for a transformation T_\theta(\cdot) and window coordinates i', j' that meet the criteria we just described, we follow the reverse approach. Given a network architecture, we select a finite set of transformations and consider only templates that exactly overlap a sliding window in one of the transformed images. This brings a considerable advantage in terms of computational complexity: a single application of the network function F_k to a transformed image T_\theta(I) immediately yields the value of \phi(\cdot) for a large set of templates.

In this paper we consider a pre-trained CNN. In particular, we adopt the AlexNet [16] model, which is trained on an object recognition task with millions of images and 1000 object categories. A common choice is to use the output of the first fully connected layer of AlexNet as a generic feature representation for an image. Instead, motivated by the recent work of Yosinski et al. [33] on CNN feature transferability and by the performance considerations detailed in the previous paragraph, we take the output of the third convolutional layer as our feature representation. This results in a feature vector of length m = 384 and a sliding window of dimensions w_s = h_s = 99 pixels.

3.3. Detecting matchable templates

Ideally, given the learned matching probability model (2) and an input image, we would apply the model to every template in the set T of all possible rectangular subregions of I. Since the number of these templates is very high, we choose instead to follow the approach sketched in the last paragraph of Sec. 3.2. We fix a set of scaling transformations T_{\theta_1}, \ldots, T_{\theta_t} and apply them to the input image I, obtaining the transformed images J_1 = T_{\theta_1}(I), \ldots, J_t = T_{\theta_t}(I). Applying F_k(\cdot) to the transformed images yields a set of feature maps Z^1 = F_k(J_1), \ldots, Z^t = F_k(J_t). Each vector z^t_{i,j} = [Z^t_{i,j,1} \ldots Z^t_{i,j,m}]^\top computed from the feature maps then corresponds to the value of the feature function \phi(x^t_{i,j}) for a particular template x^t_{i,j} \in T^t \subset T. Thus, the templates in each set T^t have fixed size and are sampled from a rectangular grid over I. The probability of x^t_{i,j} being matched correctly becomes

    p^t_{i,j} = P(y^t_{i,j} = 1 \mid x^t_{i,j}, w) = \frac{e^{w^\top z^t_{i,j}}}{1 + e^{w^\top z^t_{i,j}}}.    (6)

By fixing t and varying i and j, the probabilities p^t_{i,j} form "matchability maps" over the sets T^t. As a last step in our algorithm, we apply a simple non-maximal suppression algorithm to these matchability maps to locate their local maxima. The templates of each T^t corresponding to the local maxima in p^t_{i,j} are finally returned as the detected matchable templates. The proposed method's pipeline is schematized in Fig. 2.

It is interesting to note that the calculation of (6) can be seen as an additional convolutional layer with a 1 \times 1 \times m kernel and sigmoid non-linearity appended to the CNN used for feature computation. Indeed, our whole learning process could easily be cast in a traditional CNN setting, where both the parameters of F_k and w are learned by optimizing (3) with stochastic gradient descent and back-propagation.
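The following sketch illustrates Sec. 3.3 under the assumption of a hypothetical helper `conv3_features(image)` returning the (r, c, m) dense feature maps F_k(T_\theta(I)) of a pre-trained network (e.g. AlexNet conv3 with m = 384). Evaluating (6) over a whole map is exactly the 1 x 1 convolution noted above, which here reduces to a tensor contraction, followed by a simple non-maximal suppression; the NMS window and threshold are our illustrative choices.

```python
# Sketch of Sec. 3.3: matchability maps via Eq. (6) plus non-maximal
# suppression. conv3_features() is a hypothetical feature extractor.
import numpy as np
from scipy.ndimage import maximum_filter, zoom

def matchability_map(Z, w):
    """Eq. (6) over a feature map: a 1x1 'convolution' with kernel w
    followed by a sigmoid. Z: (r, c, m); w: (m,); returns an (r, c) map."""
    logits = np.tensordot(Z, w, axes=([2], [0]))   # w^T z_{i,j} at each cell
    return 1.0 / (1.0 + np.exp(-logits))

def detect_matchable_templates(image, w, thetas, conv3_features):
    """Rescale by each theta, compute the matchability map, and keep the
    local maxima found by a simple non-maximal suppression."""
    detections = []
    for (s1, s2) in thetas:
        scaled = zoom(image, (s2, s1))             # rescale by (sigma1, sigma2)
        pmap = matchability_map(conv3_features(scaled), w)
        peaks = (pmap == maximum_filter(pmap, size=3)) & (pmap > 0.5)
        for i, j in np.argwhere(peaks):
            detections.append(((s1, s2), i, j, pmap[i, j]))
    return detections
```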
4. Experimental Validation

We evaluate the performance of the proposed matchable template detector when used in conjunction with the FAst-Match algorithm on a challenging dataset. We adopt a methodology similar to that used in [15], where templates from a fixed source image are matched against a set of target images. We also show that our algorithm is able to generalize among several different outdoor scenes.

4.1. Dataset

Korman et al. [15] present results on three datasets: Pascal VOC 2010 [5], the Mikolajczyk dataset [19, 20] and the Zurich building dataset [29]. When evaluated by considering as correct every match showing less than 20% overlap error, the FAst-Match algorithm obtains near perfect results on the first two datasets, suggesting that using a good template selection procedure might not be necessary on that data. Conversely, only a qualitative evaluation of the performance of FAst-Match is presented in [15] for the third dataset, as it does not include any ground truth data. Consequently, we opt to evaluate our template selection algorithm on a different dataset which proves to be challenging for the FAst-Match approach and at the same time provides ground truth. In particular, we chose the Venturi Mountain dataset, previously employed in [23].

The Venturi Mountain dataset consists of 12 outdoor video sequences recorded from several locations in the Alps using a Sony Ericsson XPERIA Arc S smartphone. Each sequence contains between 150 and 500 frames with a resolution of 640 x 480 pixels, for a total of 3117 images. Camera calibration parameters and absolute orientation are also given for every image, making it trivial to derive a ground-truth homography between any pair of frames. Given that the images are extracted from video sequences, adjacent pairs are in general quite similar to each other. For this reason, we only consider one in fifty frames for evaluation and training. The sequences in the dataset cover several different illumination conditions, weather conditions and landscape characteristics, as well as different kinds of rotation-only camera movements. Fig. 3 shows some samples extracted from the dataset, which is publicly available online1.

Figure 3. Examples of images in each of the 12 outdoor scenes in the Venturi dataset. The variability is quite big due to clouds, lighting conditions, etc., and there are many zones that are clearly of no interest for matching (shadows, clear skies, etc.), making this a good benchmark on which to test our approach.

1https://venturi.fbk.eu/results/public-datasets/mountain-dataset/

4.2. Experimental Setup

As previously mentioned, in our evaluation we only consider one in fifty frames for each video sequence. For each sequence we fix the first frame as the source image and use all other images from the same sequence in turn as the target. Then, using one of the methods detailed below, we select a set of templates from the source and match them to the targets using FAst-Match. As a measure of the template matching error we use the "intersection over union" ratio of the matched template in the target image and its ground truth. Throughout the rest of this section we will refer to this measure as the "overlap error". As done in [15], only templates that are completely within the boundary of the target image are considered for evaluation.

Each time our approach is evaluated on a new sequence (the test sequence), the template matchability detector is trained using data from all sequences except the one under exam (the training sequences). In particular, we use the procedure described in the previous paragraph on all training sequences to collect a set of templates with their corresponding overlap errors. For each image pair 128 templates are chosen at random from the source image. Finally, all templates with an overlap error lower than the 25th percentile are labeled as good matches (y = +1) and all templates with an overlap error greater than the 75th percentile are labeled as bad matches (y = -1); a sketch of this labeling step is given below, after Table 1. For our matchable template detector we use a total of 16 scaling transformations (cf. Sec. 3.3), given by parameters \theta_t \in \{1, 0.75, 0.5, 0.25\} \times \{1, 0.75, 0.5, 0.25\}.

In the evaluation phase, our matchable template detector is applied to the source images, selecting the fifty templates corresponding to the local maxima in the matchability maps with the highest predicted matching probability. We refer to this approach as PROPOSED. We compare our method with two additional approaches: RANDOM SAMPLING (or RS) and DoG. In RS we randomly select fifty among all the candidate templates considered by PROPOSED. In DoG we select as templates the image regions that would be used by the SIFT [17] algorithm to compute keypoint descriptors: for each image the fifty regions with the highest DoG cornerness score are selected.

It must be noted that, similarly to the experiment performed by Korman et al. on the Zurich building dataset, the affine transformation estimated by FAst-Match is only a local approximation of the true homography that relates the images in the dataset. Thus, in general, we expect to measure some amount of unavoidable overlap error in all cases.

Figure 4. Average overlap errors on 50 patches with random sampling (orange), DoG (green) and our method (blue); one panel per sequence, with the overlap error in % on the vertical axis.

Table 1. Mean and standard deviation of the overlap error for all image pairs tested in each scene. We improve results for all scenes except scene 3. Template selection improves not only the mean error but also the standard deviation.

              Mean %                 Std %
    seq.   Ours   DoG    RS      Ours   DoG    RS
    1       5.3   15.0   15.2     2.4   23.5   21.0
    2       5.5    6.3    6.2     2.5    9.8    8.4
    3       8.7    6.6    6.7    11.7    8.0    7.6
    4       4.0    5.5    6.2     2.1    5.1    5.7
    5      11.3   18.5   18.2    18.0   26.2   24.2
    6      21.6   35.0   46.4    27.2   35.5   39.0
    7       6.7    7.2    7.5     2.2    3.3    5.8
    8       7.7   16.3   13.0    11.0   26.1   20.7
    9       8.4    8.9   10.3     3.1    3.2   11.7
    10     19.5   25.5   27.0    17.7   22.7   25.5
    11      8.6   14.6   14.1     3.5   22.0   16.0
    12      7.1   24.4   31.3     2.6   31.5   34.5
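As a rough sketch of the labeling procedure of Sec. 4.2: the percentile thresholds and the 16-scaling grid follow the text, while treating templates between the two percentiles as unused is our assumption.

```python
# Sketch of the training-label generation of Sec. 4.2. Templates below the
# 25th percentile of overlap error become positives, templates above the
# 75th percentile negatives; the middle band is left out (our assumption).
import itertools
import numpy as np

SCALES = (1.0, 0.75, 0.5, 0.25)
THETAS = list(itertools.product(SCALES, SCALES))   # the 16 (sigma1, sigma2) pairs

def label_by_percentile(overlap_errors):
    errs = np.asarray(overlap_errors, dtype=float)
    lo, hi = np.percentile(errs, [25, 75])
    labels = np.zeros(len(errs), dtype=int)
    labels[errs < lo] = +1      # good matches, y = +1
    labels[errs > hi] = -1      # bad matches, y = -1
    return labels               # 0 marks templates not used for training
```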
bad show approaches all to and becomes the coarse transform however, affine the big, by provided too pro- approximation the just When becomes pairs. image transformation to 3 first jective interesting the in also lot a is helps It selection scene in [15]. results the wrong in comment considered proposed been criterion have the would by matches than DoG greater and error RS overlap average an with Scene case: all. at valid not a or finding between match difference the match, making better cases some much in a provides generally selection template PROPOSED SOURCE RS SOURCE ogv nitiino htormthbetmlt de- template matchable our what of intuition an give To report- scene, each for results overall the shows 1 Table. 2 , 3 , 4 , 7 ,teipc fuigtmlt selection template using of impact the ), 12 e.g. 10 sago xml fti last this of example good a is sequences eosreta template that observe We . 1 , 5 , 6 , 20% 8 , 10 most , , 12 e.g. ), n epae eeeydgae h perneo h final the how of evident appearance the quite degrades severely is select- templates randomly It ing from stemming errors homogra- of algorithm. accumulation computed DLT the to the RS them using with used phies selected templates and 50 PROPOSED, the took and we of pair corners image matched form each the For the in stitching. photo method, panoramic our of of correctly of example application an then shows proof-of-concept are 6 a Fig. which probability. high image, with the matched most of the select parts proposed to informative able The consistently is match. contrast, correctly and in algorithm, to first are difficult sky the quite of of general portions images large in two containing last templates the row: in second illustrated is RS matching for template ex- six sequence our in shows on considered periments particular those and from in sampled 5 5 uniformly results Fig. Fig. in 6. PROPOSED Fig. and SAMPLING RANDOM dis- be consideration. agrees under should image This template the matchable for a tinctive weights. that smaller third intuition given the the in general with grass in the are as such scene, Similarly, patterns, skies template. repetitive clear the and and selecting noisy areas when dark considered how not row clearly are third shows The figure dataset. the the in of matchability images 16 4 the over of computed average maps the taking heat- some by shows obtained 1 maps Fig. template”, “good a considers tector eas rsn oeqaiaiecmaio between comparison qualitative some present also We is n hr row third and First eodrow Second 10 n ftetpclfiuecases failure typical the of One . orhrow Fourth rudtuhtmlt position template truth ground : epae eetdfrom selected templates : rudtuhtemplate truth ground : lecting Conclusion 5. laptop. the GTX750M-equipped sec- Nvidia Applying one a than on less average seconds. takes on few image ond a single testing a of on all order method proposed on the it in running take images and for detector needed the training template time both matchable that the observed Excluding we cores itself, GPU. Xeon matching template K40 32 with Nvidia exper- equipped an our machine and perform server We a on method. iments proposed the of ficiency an approximating when accurate images. as panoramic not are sampling templates random 50 through the obtained of experiment. panoramas third corners [15]’s panorama. that the the seen use in be We done of as can homography approaches. homography It the both Sampling. 
obtain using Random stitching to using panorama cases performing both of in results matched qualitative show We 6. Figure fEooyadCmeiieesudrpoetRobInstruct project under Competitiveness and Economy of Acknowledgments presented those like 28]. problems [27, in of reduce ambiguities and inherent algorithms pri- the matching build template to non-rigid methodology for this ors future exploiting our in of consists part tem- particular, work other In principle with approaches. in matching employed is plate be however, can algorithm, algo- and FAst-Match proposed “matcher-agnostic” the has The with dataset conjunction challenging rithm. a in validated on ef- The been approach ratios. our aspect of and fectiveness scales match- multiple extracts at reliably templates able which classifier, a regression proposed gistic we end this detector To template an- image. in matched target correctly other being of probability high have that RS PROPOSED nti ae etcldtepolmo uoaial se- automatically of problem the tackled we paper this In ef- computational the on report we remark, final a As hswr a enprilyfne ySaihMinistry Spanish by funded partially been has work This matchable ae ndneCNfaue n lo- a and features CNN dense on based , epae rma image, an from templates i.e. matchable templates o row Top References AEROARMS project EU the by ViSenH2020-ICT-2014-1-644271. and project Chistera PCIN-2013-047; ERA-Net and TIN2014-58178-R C ose,M izl,adD crmza Appearance- Scaramuzza. D. and Pizzoli, M. Forster, C. [8] A .Ftgbo,Y elr n .Zsemn Image-based Zisserman. A. and Wexler, Y. Fitzgibbon, W. A. [7] C.-J. and Wang, X.-R. Hsieh, C.-J. Chang, K.-W. Fan, R.-E. [6] and Winn, J. Williams, K. C. Gool, L. Everingham, M. [5] Lin- Navab. N. and Lepetit, V. Ladikos, A. Benhimane, S. [4] H a,A s,T utlas n .VnGo.Speeded-up Gool. Van L. and Tuytelaars, T. Ess, A. Bay, H. [3] of selection Rapid Navab. N. and Hinterstoisser, S. Alt, N. [2] over- spatial Exploiting Ferrari. V. and Petrescu, V. Alexe, B. [1] ae cie ooua,dnercntuto o micro RSS for Systems, In reconstruction dense vehicles. aerial monocular, active, based optrVso,IJCV Vision, Computer priors. image-based using rendering classification. Research linear Learning large Machine of for Journal library A Liblinear: Lin. chal- 2010. (voc) 88(2):303–338, classes object visual pascal lenge. The Zisserman. A. Recog- Pattern In and Vision CVPR nition, tracking. Computer template-based on for Conference IEEE subsets quadratic and ear tnig CVIU standing, (surf). features robust CVPR Recognition, Pattern In and Vision Computer tracking. on visual for templates reliable NIPS Systems, im- In between windows. distances age appearance compute efficiently to lap eut fuigorapproach. our using of Results : nentoa ora fCmue iin IJCV Vision, Computer of Journal International 2014. , 2007. , 2011. , 1()3639 ue2008. June 110(3):346–359, , dacsi erlIfrainProcessing Information Neural in Advances rceig fRbtc:Sineand Science Robotics: of Proceedings 32:4–5,2005. 63(2):141–151, , optrVso n mg Under- Image and Vision Computer nentoa ora of Journal International otnrow Botton :8117,2008. 9:1871–1874, , EEConference IEEE eut of Results : 2010. , The , [9] C.-S. Fuh and P. Maragos. Motion displacement estimation [25] B. Scholkopf,¨ A. J. Smola, R. C. Williamson, and P. L. using an affine model for image matching. Optical Engineer- Bartlett. New support vector algorithms. Neural Compu- ing, 30(7):881–887, 1991. tation, 12(5):1207–1245, May 2000. [10] W. Hartmann, M. Havlena, and K. Schindler. 
[10] W. Hartmann, M. Havlena, and K. Schindler. Predicting matchability. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2014.
[11] S. Hinterstoisser, V. Lepetit, S. Ilic, P. Fua, and N. Navab. Dominant orientation templates for real-time detection of texture-less objects. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2010.
[12] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In IEEE International Conference on Computer Vision, ICCV, 2011.
[13] T. Kaneko and O. Hori. Feature selection for reliable tracking using template matching. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2003.
[14] H. Y. Kim and S. A. de Araújo. Grayscale template-matching invariant to rotation, scale, translation, brightness and contrast. In Proceedings of the 2nd Pacific Rim Conference on Advances in Image and Video Technology, PSIVT, 2007.
[15] S. Korman, D. Reichman, G. Tsur, and S. Avidan. FAst-Match: Fast affine template matching. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2013.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, NIPS, 2012.
[17] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, IJCV, 60(2):91–110, 2004.
[18] S. Mattoccia, F. Tombari, and L. D. Stefano. Fast full-search equivalent template matching by enhanced bounded correlation. IEEE Transactions on Image Processing, TIP, 17(4):528–538, 2008.
[19] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 27(10):1615–1630, 2005.
[20] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool. A comparison of affine region detectors. International Journal of Computer Vision, IJCV, 65(1/2):43–72, 2005.
[21] F. Moreno-Noguer, V. Lepetit, and P. Fua. Pose priors for simultaneously solving alignment and correspondence. Volume 5303 of Lecture Notes in Computer Science, 2008.
[22] O. Pele and M. Werman. Accelerating pattern matching or how much can you slide? In Asian Conference on Computer Vision, ACCV, 2007.
[23] L. Porzi, S. R. Bulò, P. Valigi, O. Lanz, and E. Ricci. Learning contours for automatic annotations of mountains pictures on a smartphone. In Eighth ACM/IEEE International Conference on Distributed Smart Cameras, 2014.
[24] R. Rios-Cabrera and T. Tuytelaars. Discriminatively trained templates for 3D object detection: A real time scalable approach. In IEEE International Conference on Computer Vision, ICCV, 2013.
[25] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, May 2000.
[26] E. Serradell, M. Ozuysal, V. Lepetit, P. Fua, and F. Moreno-Noguer. Combining geometric and appearance priors for robust homography estimation. Volume 6313 of Lecture Notes in Computer Science, 2010.
[27] E. Serradell, M. Pinheiro, R. Sznitman, J. Kybic, F. Moreno-Noguer, and P. Fua. Non-rigid graph registration using active testing search. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 37(3):625–638, 2015.
[28] E. Serradell, A. Romero, R. Leta, C. Gatta, and F. Moreno-Noguer. Simultaneous correspondence and non-rigid 3D reconstruction of the coronary tree from single X-ray images. In IEEE International Conference on Computer Vision, ICCV, 2011.
[29] H. Shao, T. Svoboda, and L. V. Gool. ZuBuD — Zürich buildings database for image based recognition. Technical Report 260, Computer Vision Laboratory, Swiss Federal Institute of Technology, March 2003.
[30] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems, NIPS, 2013.
[31] F. Tombari, S. Mattoccia, and L. D. Stefano. Full-search-equivalent pattern matching with incremental dissimilarity approximations. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 31(1):129–141, 2009.
[32] Q. Wang and S. You. Real-time image matching based on multiple view kernel projection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2007.
[33] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, NIPS, 2014.
[34] K. Zimmermann, J. Matas, and T. Svoboda. Tracking by an optimal sequence of linear predictors. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 31(4):677–692, April 2009.