Working hard to know your neighbor’s margins: Local descriptor learning loss

Anastasiya Mishchuk1, Dmytro Mishkin2, Filip Radenović2, Jiří Matas2

1 Szkocka Research Group, Ukraine
[email protected]
2 Visual Recognition Group, CTU in Prague
{mishkdmy, filip.radenovic, matas}@cmp.felk.cvut.cz

Abstract

We introduce a loss for metric learning that is inspired by Lowe’s matching criterion for SIFT. We show that the proposed loss, which maximizes the distance between the closest positive and the closest negative example in the batch, is better than complex regularization methods; it works well for both shallow and deep convolutional network architectures. Applying the novel loss to the L2Net CNN architecture results in a compact descriptor named HardNet. It has the same dimensionality as SIFT (128) and shows state-of-the-art performance in wide baseline stereo, patch verification and instance retrieval benchmarks.

1 Introduction

Many computer vision tasks rely on finding local correspondences, e.g., image retrieval [1, 2], panorama stitching [3], wide baseline stereo [4], and 3D reconstruction [5, 6]. Despite the growing number of attempts to replace complex classical pipelines with end-to-end learned models, e.g., for image matching [7] and camera localization [8], the classical detectors and descriptors of local patches are still in use, due to their robustness, efficiency and tight integration. Moreover, reformulating a task that is solved by a complex pipeline as a differentiable end-to-end process is highly challenging.

As a first step towards end-to-end learning, hand-crafted descriptors like SIFT [9, 10] or detectors [9, 11, 12] have been replaced with learned ones, e.g., LIFT [13], MatchNet [14] and DeepCompare [15].
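To make the idea stated in the abstract concrete, the following is a minimal sketch of one possible reading of the loss: a triplet margin loss in which each matching pair is contrasted with the closest non-matching descriptor in the batch. The unit margin, the Euclidean distance, and the names `l2` and `hardest_in_batch_loss` are assumptions made for illustration and are not taken from this section.

```python
import math


def l2(u, v):
    """Euclidean distance between two descriptors given as lists of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hardest_in_batch_loss(anchors, positives, margin=1.0):
    """Mean triplet margin loss with the closest in-batch negative.

    anchors[i] and positives[i] are assumed to describe the same patch;
    every other descriptor in the batch is treated as a negative.
    The margin value is an illustrative assumption.
    """
    n = len(anchors)
    if n < 2 or n != len(positives):
        raise ValueError("need at least two matching pairs of equal count")

    # dist[i][j] is the distance between anchor i and positive j.
    dist = [[l2(a, p) for p in positives] for a in anchors]

    total = 0.0
    for i in range(n):
        d_pos = dist[i][i]
        # Closest non-matching positive to anchor i.
        row_min = min(dist[i][j] for j in range(n) if j != i)
        # Closest non-matching anchor to positive i.
        col_min = min(dist[k][i] for k in range(n) if k != i)
        d_neg = min(row_min, col_min)
        # Penalize pairs whose negative is not at least `margin` farther away.
        total += max(0.0, margin + d_pos - d_neg)
    return total / n


if __name__ == "__main__":
    batch = [[0.0, 0.0], [0.5, 0.0]]
    print(hardest_in_batch_loss(batch, batch))
```

In this toy batch the matching pairs coincide, so the loss depends only on how close the nearest non-matching descriptor is; moving the two descriptors farther apart than the margin drives the loss to zero.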