M-Tree: an Efficient Accessmethod for Similarity Search in Metric Spaces

Total Page:16

File Type:pdf, Size:1020Kb

M-Tree: an Efficient Accessmethod for Similarity Search in Metric Spaces M-tree: An Efficient Access Method for Similarity Search in Metric Spaces Pa010 Ciaccia Marco Patella Pave1 Zezula DEIS - CSITE-CNR DEIS - CSITE-CNR CNUCE-CNR Bologna, Italy Bologna, Italy Piss, Italy [email protected] mpatella@deis . unibo . it [email protected] increased and resulted in the development of multi- media database systems aiming at a uniform manage- Abstract ment of voice, video, image, text, and numerical data. Among the many research challenges which the multi- A new access method, called M-tree, is pro- media technology entails - including data placement, posed to organize and search large data sets presentation, synchronization, etc. - content-based re- from a generic “metric space”, i.e. where ob- trieval plays a dominant role. In order to satisfy the ject proximity is only defined by a distance information needs of users, it is of vital importance to function satisfying the positivity, symmetry, effectively and efficiently support the retrieval process and triangle inequality postulates. We detail devised to determine which portions of the database algorithms for insertion of objects and split are relevant to users’ requests. management, which keep the M-tree always In particular, there is an urgent need of index- balanced - several heuristic split alternatives ing techniques able to support execution of similarity are considered and experimentally evaluated. queries. Since multimedia applications typically re- Algorithms for similarity (range and k-nearest quire complex distance functions to quantify similari- neighbors) queries are also described. Re- ties of multi-dimensional features, such as shape, tex- sults from extensive experimentation with a ture, color [FEF+94], image patterns [VM95], sound prototype system are reported, considering as [WBKW96], text, fuzzy values, set values [HNP95], se- the performance criteria the number of page quence data [AFS93, FRM94], etc., multi-dimensional I/O’s and the number of distance computa- (spatial) access methods (SAMs), such as R-tree tions. The results demonstrate that the M- [Gut841 and its variants [SRF87, BKSSSO], have been tree indeed extends the domain of applica- considered to index such data. However, the applica- bility beyond the traditional vector spaces, bility of SAMs is limited by the following assumptions performs reasonably well in high-dimensional which such structures rely on: data spaces, and scales well in case of growing files. 1. objects are, for indexing purposes, to be repre- sented by means of feature values in a multi- 1 Introduction dimensional vector space; Recently, the need to manage various types of data 2. the (dis)similarity of any two objects has to be stored in large computer repositories has drastically based on a distance function which does not in- troduce any correlation (or “cross-talk”) between Permission to copy without fee all or part of this material is feature values [FEF+94]. More precisely, an L, granted provided that the copies are not made or distributed for metric, such as the Euclidean distance, has to be direct commercial advantage, the VLDB copyright notice and used. the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee Furthermore, from a performance point of view, SAMs and/or special permission from the Endowment. assume that comparison of keys (feature values) is a Proceedings of the 23rd VLDB Conference trivial operation with respect to the cost of accessing a Athens, Greece, 1997 disk page, which is not always the case in multimedia 426 applications. Consequently, no attempt in the design cific metric distance function d. of these structures has been done to reduce the number Formally, a metric space is a pair, M = (D,d), of distance computations. where V is a domain of feature values - the indexing A more general approach to the “similarity index- keys - and d is a total (distance) function with the ing” problem has gained some popularity in recent following properties:l years, leading to the development of so-called metric 1. d(O,, 0,) = d(C),, 0,) (symmetry) trees (see [UhlSl]). M et ric trees only consider relative distances of objects (rather than their absolute posi- 2. d(O,, 0,) > 0 (0, # 0,) and d(O,, 0,) = 0 tions in a multi-dimensional space) to organize and (non negativity) partition the search space, and just require that the function used to measure the distance (dissimilarity) 3. d(O,, 0,) 5 4% 0,) + d(Oz,O,) between objects is a metric (see Section 2), so that the (triangle inequality) triangle inequality property applies and can be used to In principle, there are two basic types of similarity prune the search space. queries: the range query and the k nearest neighbors Although the effectiveness of metric trees has been query. clearly demonstrated [Chi94, Bri95, B097], current de- signs suffer from being intrinsically static, which limits Definition 2.1 (Range) their applicability in dynamic database environments. Given a query object Q E 2) and a maximum search Contrary to SAMs, known metric trees have only tried distance r(Q), the range query range(Q, r(Q)) selects to reduce the number of distance computations re- all indexed objects Oj such that d(Oj, Q) 5 r(Q). q quired to answer a query, paying no attention to I/O Definition 2.2 (k nearest neighbors (LNN)) costs. Given a query object Q E 2) and an integer k 2 1, In this article, we introduce a paged metric tree, the k-NN query NN(&, k) selects the k indexed objects called M-tree, which has been explicitly designed to which have the shortest distance from Q. 0 be integrated with other access methods in database systems. We demonstrate such possibility by imple- There have already been some attempts to tackle the menting M-tree in the GiST (Generalized Search Tree) difficult metric space indexing problem. The FastMap [HNP95] framework, which allows specific access meth- algorithm [FL951 transforms a matrix of pairwise dis- ods to be added to an extensible database system. tances into a set of low-dimensional points, which can The M-tree is a balanced tree, able to deal with then be indexed by a SAM. However, FastMap as- dynamic data files, and as such it does not require sumes a static data set and introduces approximation periodical reorganizations. M-tree can index objects errors in the mapping process. The Vantage Point using features compared by distance functions which (VP) tree [Chi94] partitions a data set according to either do not fit into a vector space or do not use an L, distances the objects have with respect to a reference metric, thus considerably extends the cases for which (vantage) point. The median value of such distances efficient query processing is possible. Since the design is used as a separator to partition objects into two of M-tree is inspired by both principles of metric trees balanced subsets, to which the same procedure can re- and database access methods, performance optimiza- cursively be applied. The MVP-tree [B097] extends tion concerns both CPU (distance computations) and this idea by using multiple vantage points, and exploits I/O costs. pre-computed distances to reduce the number of dis- After providing some preliminary background in tance computations at query time. The GNAT de- Section 2, Section 3 introduces the basic M-tree prin- sign [Bri95] applies a different - so-called generalized ciples and algorithms. In Section 4, we discuss the hyperplane [Uhlgl] - partitioning style. In the basic available alternatives for implementing the split strat- case, two reference objects are chosen and each of the egy used to manage node overflows. Section 5 presents remaining objects is assigned to the closest reference experimental results. Section 6 concludes and suggests object. So obtained subsets can recursively be split topics for future research activity. again, when necessary. Since all of the above organizations build trees by 2 Preliminaries means of a top-down recursive process, these trees are not guaranteed to remain balanced in case of inser- Indexing a metric space means to provide an efficient tions and deletions, and require costly reorganizations support for answering similarity queries, i.e. queries to prevent performance degradation. whose purpose is to retrieve DB objects which are ‘In order to simplify the presentation, we sometimes refer to “similar” to a reference (query) object, and where the 0, as an object, rather than as the feature value of the object (dis)similarity between objects is measured by a spe- itself. 427 3 The M-tree file.3 In summary, entries in leaf nodes are structured as follows. The research challenge which has led to the design of M-tree was to combine advantages of balanced and Oj (feature value of the) DB object dynamic SAMs, with the capabilities of static metric oid(Oj) object identifier trees to index objects using features and distance func- d(Oj, P(Oj)) distance of Oj from its parent tions which do not fit into a vector space and are only constrained by the metric postulates. 3.2 Processing Similarity Queries The M-tree partitions objects on the basis of their Before presenting specific algorithms for building the relative distances, as measured by a specific distance M-tree, we show how the information stored in nodes function d, and stores these objects into fixed-size is used for processing similarity queries. Although per- nodes,2 which correspond to constrained regions of the formance of search algorithms is largely influenced by metric space. The M-tree is fully parametric on the the actual construction of the M-tree , the correctness distance function d, so the function implementation is and the logic of search are independent of such aspects.
Recommended publications
  • Lecture 04 Linear Structures Sort
    Algorithmics (6EAP) MTAT.03.238 Linear structures, sorting, searching, etc Jaak Vilo 2018 Fall Jaak Vilo 1 Big-Oh notation classes Class Informal Intuition Analogy f(n) ∈ ο ( g(n) ) f is dominated by g Strictly below < f(n) ∈ O( g(n) ) Bounded from above Upper bound ≤ f(n) ∈ Θ( g(n) ) Bounded from “equal to” = above and below f(n) ∈ Ω( g(n) ) Bounded from below Lower bound ≥ f(n) ∈ ω( g(n) ) f dominates g Strictly above > Conclusions • Algorithm complexity deals with the behavior in the long-term – worst case -- typical – average case -- quite hard – best case -- bogus, cheating • In practice, long-term sometimes not necessary – E.g. for sorting 20 elements, you dont need fancy algorithms… Linear, sequential, ordered, list … Memory, disk, tape etc – is an ordered sequentially addressed media. Physical ordered list ~ array • Memory /address/ – Garbage collection • Files (character/byte list/lines in text file,…) • Disk – Disk fragmentation Linear data structures: Arrays • Array • Hashed array tree • Bidirectional map • Heightmap • Bit array • Lookup table • Bit field • Matrix • Bitboard • Parallel array • Bitmap • Sorted array • Circular buffer • Sparse array • Control table • Sparse matrix • Image • Iliffe vector • Dynamic array • Variable-length array • Gap buffer Linear data structures: Lists • Doubly linked list • Array list • Xor linked list • Linked list • Zipper • Self-organizing list • Doubly connected edge • Skip list list • Unrolled linked list • Difference list • VList Lists: Array 0 1 size MAX_SIZE-1 3 6 7 5 2 L = int[MAX_SIZE]
    [Show full text]
  • The Fractal Dimension Making Similarity Queries More Efficient
    The Fractal Dimension Making Similarity Queries More Efficient Adriano S. Arantes, Marcos R. Vieira, Agma J. M. Traina, Caetano Traina Jr. Computer Science Department - ICMC University of Sao Paulo at Sao Carlos Avenida do Trabalhador Sao-Carlense, 400 13560-970 - Sao Carlos, SP - Brazil Phone: +55 16-273-9693 {arantes | mrvieira | agma | caetano}@icmc.usp.br ABSTRACT time series, among others, are not useful, and the simila- This paper presents a new algorithm to answer k-nearest rity between pairs of elements is the most important pro- neighbor queries called the Fractal k-Nearest Neighbor (k- perty in such domains [5]. Thus, a new class of queries, ba- NNF ()). This algorithm takes advantage of the fractal di- sed on algorithms that search for similarity among elements mension of the dataset under scan to estimate a suitable emerged as more adequate to manipulate these data. To radius to shrinks a query that retrieves the k-nearest neigh- be able to apply similarity queries over elements of a data bors of a query object. k-NN() algorithms starts searching domain, a dissimilarity function, or a “distance function”, for elements at any distance from the query center, progres- must be defined on the domain. A distance function δ() ta- sively reducing the allowed distance used to consider ele- kes pairs of elements of the domain, quantifying the dissimi- ments as worth to analyze. If a proper radius can be set larity among them. If δ() satisfies the following properties: to start the process, a significant reduction in the number for any elements x, y and z of the data domain, δ(x, x) = 0 of distance calculations can be achieved.
    [Show full text]
  • An Investigation of Practical Approximate Nearest Neighbor Algorithms
    An Investigation of Practical Approximate Nearest Neighbor Algorithms Ting Liu, Andrew W. Moore, Alexander Gray and Ke Yang School of Computer Science Carnegie-Mellon University Pittsburgh, PA 15213 USA ftingliu, awm, agray, [email protected] Abstract This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimen- sional perception areas such as computer vision, with dozens of publica- tions in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hash- ing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also intro- duce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same random- projection-based approximations that LSH enjoys, but with a simpler al- gorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels. 1 Introduction The k-nearest-neighbor searching problem is to find the k nearest points in a dataset X ⊂ RD containing n points to a query point q 2 RD, usually under the Euclidean distance. It has applications in a wide range of real-world settings, in particular pattern recognition, machine learning [7] and database querying [11].
    [Show full text]
  • Search Trees
    The Annals of Applied Probability 2014, Vol. 24, No. 3, 1269–1297 DOI: 10.1214/13-AAP948 c Institute of Mathematical Statistics, 2014 SEARCH TREES: METRIC ASPECTS AND STRONG LIMIT THEOREMS By Rudolf Grubel¨ Leibniz Universit¨at Hannover We consider random binary trees that appear as the output of certain standard algorithms for sorting and searching if the input is random. We introduce the subtree size metric on search trees and show that the resulting metric spaces converge with probability 1. This is then used to obtain almost sure convergence for various tree functionals, together with representations of the respective limit ran- dom variables as functions of the limit tree. 1. Introduction. A sequential algorithm transforms an input sequence t1,t2,... into an output sequence x1,x2,... where, for all n ∈ N, xn+1 de- pends on xn and tn+1 only. Typically, the output variables are elements of some combinatorial family F, each x ∈ F has a size parameter φ(x) ∈ N and xn is an element of the set Fn := {x ∈ F : φ(x)= n} of objects of size n. In the probabilistic analysis of such algorithms, one starts with a stochastic model for the input sequence and is interested in certain aspects of the out- put sequence. The standard input model assumes that the ti’s are the values of a sequence η1, η2,... of independent and identically distributed random variables. For random input of this type, the output sequence then is the path of a Markov chain X = (Xn)n∈N that is adapted to the family F in the sense that (1) P (Xn ∈ Fn) = 1 for all n ∈ N.
    [Show full text]
  • Similarity Search Using Metric Trees
    International Journal of Innovations & Advancement in Computer Science IJIACS ISSN 2347 – 8616 Volume 4, Issue 11 November 2015 Similarity Search using Metric Trees Bhavin Bhuta* Gautam Chauhan* Department of Computer Engineering Department of Computer Engineering Thadomal Shahani Engineering College Thadomal Shahani Engineering College Mumbai, India Mumbai, India Abstract: d( a, b) = 0 iff | a - b | = 0, i.e., iff a = b (positive- The traditional method of comparing values against definiteness) database might work well for small database but when 2. Symmetry the database size increases invariably then the efficiency d(a, b) ≡ | a - b | = | b – a | ≡ d(b, a) takes a hit. For instance, image plagiarism detection 3. Triangle Inequality software has large amount of images stored in the d(a, b) ≡ | a – b | = | a – c + c – b | ≤ database, when we compare the hash values of input query image against that of database then the | a – c | + | c – b | ≡ d(a, c) + d(c, b) comparison may consume a good amount of time. For All indexing techniques primarily exploit Triangle this reason we introduce the concept of using metric Inequality property. We‟ll discuss four of these trees which makes use metric space properties for indexing techniques BK tree, VP tree, KD tree and indexing in databases. In this paper, we have provided Quad tree in detail [5]. the working of various nearest neighbor search trees for speeding up this task. Various spatial partition and BK-tree indexing techniques are explained in detail and their Named after Burkhard-Keller(BK tree). Many performance in real-time world is provided. powerful search engines make use of this tree reason being it is fuzzy search algorithm.
    [Show full text]
  • Slim-Trees: High Performance Metric Trees Minimizing Overlap Between Nodes
    Slim-trees: High Performance Metric Trees Minimizing Overlap Between Nodes Caetano Traina Jr.1, Agma Traina2, Bernhard Seeger3, Christos Faloutsos4 1,2 Department of Computer Science, University of São Paulo at São Carlos - Brazil 3 Fachbereich Mathematik und Informatik, Universität Marburg - Germany 4 Department of Computer Science, Carnegie Mellon University - USA [Caetano|Agma|Seeger+|[email protected]] Abstract. In this paper we present the Slim-tree, a dynamic tree for organizing metric datasets in pages of fixed size. The Slim-tree uses the “fat-factor” which provides a simple way to quantify the degree of overlap between the nodes in a metric tree. It is well-known that the degree of overlap directly affects the query performance of index structures. There are many suggestions to reduce overlap in multi- dimensional index structures, but the Slim-tree is the first metric structure explicitly designed to reduce the degree of overlap. Moreover, we present new algorithms for inserting objects and splitting nodes. The new insertion algorithm leads to a tree with high storage utilization and improved query performance, whereas the new split algorithm runs considerably faster than previous ones, generally without sacrificing search performance. Results obtained from experiments with real-world data sets show that the new algorithms of the Slim-tree consistently lead to performance improvements. After performing the Slim-down algorithm, we observed improvements up to a factor of 35% for range queries. 1. Introduction With the increasing availability of multimedia data in various forms, advanced query 1 On leave at Carnegie Mellon University. His research has been funded by FAPESP (São Paulo State Foundation for Research Support - Brazil, under Grants 98/05556-5).
    [Show full text]
  • Języki Programowania Wstęp
    Waldemar Smolik Języki programowania Wstęp 1 Program komputerowy (ang. computer programme) - sekwencja symboli opisujących działania procesora zapisanych według reguł języka programowania. Program to ciąg instrukcji opisujących modyfikacje stanu maszyny. Program jest wykonywany przez komputer, czasami bezpośrednio – jeśli wyrażony jest w języku zrozumiałym dla danej maszyny lub pośrednio – gdy jest interpretowany przez inny program (interpreter). Formalne wyrażenie metody obliczeniowej w postaci języka zrozumiałego dla człowieka nazywane jest kodem źródłowym, podczas gdy program wyrażony w postaci zrozumiałej dla maszyny (to jest za pomocą ciągu liczb, a bardziej precyzyjnie zer i jedynek) nazywany jest kodem maszynowym bądź postacią binarną (wykonywalną). Programy komputerowe można zaklasyfikować według ich zastosowań. Wyróżnia się zatem systemy operacyjne, aplikacje użytkowe, aplikacje narzędziowe, kompilatory i inne. Programy wbudowane wewnątrz urządzeń określa się jako firmware. 2 W najprostszym modelu wykonanie programu (zapisanego w postaci zrozumiałej dla maszyny) polega na umieszczeniu go w pamięci operacyjnej komputera i wskazaniu procesorowi adresu pierwszej instrukcji. Po tych czynnościach procesor będzie wykonywał kolejne instrukcje programu, aż do jego zakończenia (instrukcja end lub return). Program może zakończyć się: o poprawnie (zgodnie z życzeniem twórcy programu i jego użytkownika); o błędnie (z powodu awarii sprzętu bądź wykonania przez program niedozwolonej operacji, np. dzielenia przez zero). Wykonywanie programu
    [Show full text]
  • Searching in Metric Spaces
    Searching in Metric Spaces EDGAR CHAVEZ´ Escuela de Ciencias F´ısico-Matem´aticas, Universidad Michoacana GONZALO NAVARRO, RICARDO BAEZA-YATES Depto. de Ciencias de la Computaci´on, Universidad de Chile AND JOSE´ LUIS MARROQU´IN Centro de Investigaci´on en Matem´aticas (CIMAT) The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of “intrinsic dimensionality.” We also present a unified Categories and Subject Descriptors: F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems—computations on discrete structures, geometrical problems and computations, sorting and searching; H.2.1 [Database management]: Physical design—access methods; H.3.1 [Information storage and retrieval]: Content analysis and indexing—indexing methods; H.3.2 [Information storage and retrieval]:
    [Show full text]
  • Algorithms for Efficiently Collapsing Reads with Unique Molecular Identifiers
    bioRxiv preprint doi: https://doi.org/10.1101/648683; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Liu RESEARCH Algorithms for efficiently collapsing reads with Unique Molecular Identifiers Daniel Liu Correspondence: [email protected] Abstract Torrey Pines High School, Del Mar Heights Road, San Diego, United Background: Unique Molecular Identifiers (UMI) are used in many experiments to States of America find and remove PCR duplicates. Although there are many tools for solving the Department of Psychiatry, University of California San Diego, problem of deduplicating reads based on their finding reads with the same Gilman Drive, La Jolla, United alignment coordinates and UMIs, many tools either cannot handle substitution States of America errors, or require expensive pairwise UMI comparisons that do not efficiently scale Full list of author information is available at the end of the article to larger datasets. Results: We formulate the problem of deduplicating UMIs in a manner that enables optimizations to be made, and more efficient data structures to be used. We implement our data structures and optimizations in a tool called UMICollapse, which is able to deduplicate over one million unique UMIs of length 9 at a single alignment position in around 26 seconds. Conclusions: We present a new formulation of the UMI deduplication problem, and show that it can be solved faster, with more sophisticated data structures.
    [Show full text]
  • Approximate Search for Textual Information in Images of Historical Manuscripts Using Simple Regular Expression Queries DEGREE FINAL WORK
    Escola Tècnica Superior d’Enginyeria Informàtica Universitat Politècnica de València Approximate search for textual information in images of historical manuscripts using simple regular expression queries DEGREE FINAL WORK Degree in Computer Engineering Author: José Andrés Moreno Tutor: Enrique Vidal Ruiz Experimental Director: Alejandro Hector Toselli Course 2019-2020 Acknowledgements I wanted to thank Enrique, for trusting in me and giving me the opportunity of devel- oping this project at the PRHLT. Also, I wanted to thank Alejandro, for helping me to go through the most technical parts of this project. Finally, I wanted to thank all the people who have helped me get to this point. « Mais non jamais, jamais je ne t’oublierai » - Edouard Priem iii iv Resumen Los archivos históricos así como otras instituciones de patrimonio cultural han estado digitalizando sus colecciones de documentos históricos con el fin de hacerlas accesible a través de Internet al público en general. Sin embargo, la mayor parte de las imágenes de los documentos digitalizados carecen de transcripción, por lo que el acceso a su contenido textual no es posible. En los últimos años, en el Centro PRHLT de la UPV se ha desarrollado una tecnolo- gía para la indexación probabilística de colecciones de estas imágenes (no transcritas). La principal aplicación de estos índices es facilitar la búsqueda de información textual en la colección de imágenes. El sistema de indexación desarrollado genera una tabla en la cual se indexa cada palabra con todas sus posibles localizaciones en el documento. Específi- camente, cada entrada de la tabla define una palabra con información de su localización: número de página y posición en la página, y una medida de certeza (o confianza) calcu- lada a partir de la probabilidad de aparición de dicha palabra en esa localización de la imagen.
    [Show full text]
  • Introduction to Cover Tree
    Introduction to Cover Tree Javen Qinfeng Shi SML of NICTA RSISE of ANU 2006 Oct. 15 Outline Goal Related works Cover tree definition Optimized Implement in C++ Complexity Further work Outline Goal Related works Cover tree definition Efficiently Implement in C++ Advantages and disadvantages Further work Goal Speed up Nearest-neighbour search[John Langford etc. 2006] ZPreprocess a dataset S of n points in some metric space X so that given a query point p ∈ X , a point q ∈ S which minimises d(p,q) can be efficiently found Also Kmeans Outline Goal Related works Cover tree definition Example Optimized Implement in C++ Advantages and disadvantages Further work Relevant works Kd tree [Friedman, Bentley & Finkel 1977] Ball tree family: ZBall tree [Omohundro,1991] ZMetric tree [Uhlmann, 1991] ZAnchors metric tree [Moore 2000] Z(Improved) Anchors metric tree [Charles2003] Navigating net [R.Krauthgamer, J.Lee 2004] ? Lots of discussions and proofs on its computational bound Kd tree Univariate axis-aligned splits Split on widest dimension O(N log N) to build, O(N) space Kd tree Kd tree y x (axis) Ball tree/metric tree A set of points in R2 [Uhlmann 1991], [Omohundro 1991] A ball-tree: level 1 [Uhlmann 1991], [Omohundro 1991] A ball-tree: level 2 A ball-tree: level 3 A ball-tree: level 4 A ball-tree: level 5 Outline Goal Related works Cover tree definition ZBasic definition ZHow to build a cover tree ZQuery a point/points Optimized Implement in C Advantages and disadvantages Further work Basic definition A cover tree
    [Show full text]
  • A Multicontext-Adaptive Query Creation and Search System for Large-Scale Image and Video Data
    Doctoral Dissertation Academic Year 2016 A Multicontext-adaptive Query Creation and Search System for Large-scale Image and Video Data Graduate School of Media and Governance Keio University Nguyen Thi Ngoc Diep Abstract of Doctoral Disseration Academic Year 2016 A Multicontext-adaptive Query Creation and Search System for Large-scale Image and Video Data by Nguyen Thi Ngoc Diep Abstract This thesis addresses the dynamic adaptability of a multimedia search system to the contexts of its users and proposes a new query creation and search system for large- scale image and video data. The contexts of a user in this system are defined as three types of preferences: content, intention, and response time. A content preference refers to the low-level or semantic representations of the data that a user is interested in. An intention preference refers to how the content should be regarded as relevant. And a response time preference refers to the ability to control a reasonable wait time. The important feature of the proposed system is the integration of context-based query creation functions with high-performance search algorithms into a unified search system. This system adapts the inverted list data structure to construct a disk- resident database model for large-scale data in high-dimensional feature space. The database model has an intrinsic property, the orthogonality of indexes, which facili- tates the functionality of the query creation and search system. The query creation process consists of two main operations: (1) context-based subspace selection and (2) subspace manipulation that includes several unary and binary functions to alter and combine subspaces in order to reflect content preferences of users.
    [Show full text]