Bilegsaikhan Naidan Bilegsaikhan Naidan

Total Page:16

File Type:pdf, Size:1020Kb

Bilegsaikhan Naidan Bilegsaikhan Naidan ISBN 978-82-326-2241-2 (electronic ver.) ISBN 978-82-326-2241-2(electronic ISBN 978-82-326-2240-5 (printed ver.) ISBN 978-82-326-2240-5(printed ISSN 1503-8181 ISSN Bilegsaikhan Naidan Bilegsaikhan Doctoral theses at NTNU, 2017:84 NTNU, at theses Doctoral NTNU Norwegian University of Science and Technology Thesis for the Degree of Philosophiae Doctor Faculty of Information Technology and Electrical Engineering Doctoral thesis Department of Computer Science Similarity Search Methods Similarity Search Engineering Efficient Naidan Bilegsaikhan thesesatNTNU,2017:84 Doctoral Bilegsaikhan Naidan Engineering Efficient Similarity Search Methods Thesis for the Degree of Philosophiae Doctor Trondheim, May 2017 Norwegian University of Science and Technology Faculty of Information Technology and Electrical Engineering Department of Computer Science NTNU Norwegian University of Science and Technology Thesis for the Degree of Philosophiae Doctor Faculty of Information Technology and Electrical Engineering Department of Computer Science © Bilegsaikhan Naidan ISBN 978-82-326-2240-5 (printed ver.) ISBN 978-82-326-2241-2 (electronic ver.) ISSN 1503-8181 Doctoral theses at NTNU, 2017:84 Printed by NTNU Grafisk senter Preface This thesis is submitted to the Norwegian University of Science and Technology (NTNU) in partial fulfillment of the requirements for the degree of philosophiae doctor. This doctoral work has been performed at the Department of Computer and Information Science, NTNU, Trondheim, with Associate Professor Magnus Lie Hetland as main supervisor and with co-supervisors Professor Svein Erik Bratsberg and Professor Arne Halaas. The thesis has been part of the research project Information Access Disruptions (iAd) project which is funded by the Norwegian Research Council. The main re- search of iAd project is to develop the next generation of search and the delivery technology in the information access domain. 3 Summary Similarity search has become one of the important parts of many applications in- cluding multimedia retrieval, pattern recognition and machine learning. In this problem, the main task is to retrieve most similar (closest) objects in a large data set to a query efficiently by using a function that gives the distance between two objects. This thesis is presented as a collection of seven papers with a tutorial on the topic and is mainly focused on efficiency issues of similarity search. • Paper I provides a survey on state-of-the-art approximate methods for met- ric and non-metric spaces. • Paper II proposes an approximate indexing method for the non-metric Breg- man divergences. • Paper III shows how to transform static indexing methods into dynamic ones. • Paper IV presents a learning method to prune in metric and non-metric spaces. • Paper V improves the existing approximate technique that can be applied to ball-based metric indexing methods. • Paper VI presents an open source cross-platform similarity search library and a toolkit called the “Non-Metric Space Library” (NMSLIB) for evalua- tion of similarity search methods. • Paper VII evaluates a metric index for approximate string matching. 5 Acknowledgements First and foremost, I would like to thank my supervisor Magnus Lie Hetland. He has always been available to discuss research ideas when I needed and has been supporting me throughout this PhD work and helping me overcome a personal hard time. He also taught me how to conduct experiments and write scientific papers. I would like to thank my supervisor Svein Erik Bratsberg for his support, research ideas and inspiring discussions. I would like also thank Professor Arne Halaas for being co-supervisor. Dr. Øystein Torbjørnsen has been an unofficial advisor to me. Despite being busy with his schedule, he led several fruitful meetings together with my supervisors when I struggled to produce publications. I would like to give a big thank to Leonid Boytsov for his collaboration and co- authoring several papers and co-authoring our research Non-Metric Space Li- brary and maintaining it. I would like to thank all people who contributed to the library. I would like to thank Professor Eric Nyberg for his support of our library. Finally, I would like to thank my family and they have been supporting me all time. 7 Contents Preface 3 Summary 5 Acknowledgements 7 I Introduction and Overview 11 1 Introduction 13 1.1 Motivation ................................. 13 1.2 Research goals ............................... 16 2 Background 17 2.1 Metric space model ............................ 17 2.2 Distance functions ............................. 18 2.3 Similarity queries ............................. 20 2.4 Problem description ........................... 21 2.5 Space partitioning principles ...................... 21 2.6 Pruning principles ............................ 22 2.7 Space decomposition methods ..................... 24 2.8 Approximate methods .......................... 26 2.9 Methods based on clustering principle ................. 30 2.10 Projection methods ............................ 30 2.11 Graph methods .............................. 37 2.12 Other methods ............................... 38 3 Research Summary 39 3.1 Research method ............................. 39 3.2 Included papers .............................. 40 3.2.1 Paper I ............................... 40 3.2.2 Paper II ............................... 41 3.2.3 Paper III .............................. 42 9 10 3.2.4 Paper IV .............................. 43 3.2.5 Paper V .............................. 43 3.2.6 Paper VI .............................. 44 3.2.7 Paper VII .............................. 45 3.3 Publication venue ............................. 45 3.4 Evaluation of contributions ....................... 46 3.5 NSMLIB in an approximate nearest neighbor benchmark ................................. 47 3.6 Future work ................................ 47 3.7 Conclusions ................................ 48 Bibliography 49 II Publications 55 A Paper I: Permutation Search Methods are Efficient, Yet Faster Search is Possible 57 B Paper II: Bregman hyperplane trees for fast approximate nearest neigh- bor search 73 C Paper III: Static-to-dynamic transformation for metric indexing struc- tures (extended version) 89 D Paper IV: Learning to Prune in Metric and Non-Metric Spaces 107 E Paper V: Shrinking data balls in metric indexes 121 F Paper VI: Engineering Efficient and Effective Non-metric Space Library 129 G Paper VII: An empirical evaluation of a metric index for approximate string matching 145 Part I Introduction and Overview 13 Chapter 1 Introduction Outline This thesis is a collection of papers with a basic introduction on similarity search. This chapter presents the motivation and research goals of this work. Chapter 2 describes the theoretical background and a tutorial on similarity search. Chapter 3 summarizes the results and reviews the papers. It also discusses future work and concludes the thesis. The research contribution of this thesis is defined by seven research papers which can be found in Part II. 1.1 Motivation Searching has become one of the crucial parts in our daily life and it allows us to find useful information. Traditional text-based search can not be directly ap- plied to more complex data types (such as multimedia data) which do not have any text information (for instance, caption, description and tags). So-called sim- ilarity search can be applied to those data types. In similarity search, domain objects (such as strings and multimedia data) are modeled in a metric or non- metric space and those objects are retrieved based on their similarity to some given query. For multimedia data, similarity search is often performed on the feature vectors extracted from the real data. Feature vectors are usually represented as high- dimensional data, so due to the so-called curse of dimensionality phenomenon, the efficiency of the indexing structures deteriorates rapidly as the number of di- mensions increases. This may be because there is a trend for the objects to be Some of the text and figures included in this thesis are based on my PhD plan which was partially included in a journal paper without my permission. 14 Introduction almost equidistant from the query objects in a high-dimensional space. Let us consider two use cases of similarity search in two different domains. Multimedia retrieval - In image retrieval, similar images to a query image are re- trieved by using image features such as shape, edge, position and color histogram. So-called the Signature Quadratic Form Distance [5] (SQFD) is shown to be effec- tive in image retrieval. In the method of Beecks [5], an image is mapped into a feature space and then its features are clustered with the standard k-means clus- tering algorithm. Each cluster is represented by a centroid and a cluster weight. Figure 1.1 shows three example images and the visualizations of their correspond- ing features as multiple circles. The centroid, weight and color of each cluster are represented by the center, diameter and color of a circle, respectively. We can see that the first two images are more similar to each other than the third one and that observation is also true for their feature signatures. Figure 1.1: Three example images at top and the corresponding visualizations of their feature signatures at bottom. Machine learning -Ink-nearest neighbor classification, a test instance needs to be assigned into one of the several classes. First, we need to find the k closest training instances of the test instance. Then, the most occurred
Recommended publications
  • The “Super Class” of Electronics 1991 and 1992 – What Was the Key? on Fostering Entrepreneurship Beyond Organized Support System
    Master thesis in Entrepreneurship, Innovation and Society The “super class” of electronics 1991 and 1992 – what was the key? On fostering entrepreneurship beyond organized support system. Trondheim, May 2017 Author: Karolina Lesniak Academic supervisor: Asbjørn Karlsen Cover figure: The “super class” of electronics of 1991 and 1992. Source: Adresseavisen Abstract The university gains constantly greater importance in the local and regional development policies. Apart from two traditional roles of university that is research and education, the third mission was added in recent decades – commercialization of science and technology by new firm formation. Although various institutionalized measures for fostering and supporting entrepreneurship were developed, like technology transfer offices, incubator, science parks, or formalized entrepreneurial education, the perfect formula for encouraging academic entrepreneurship has not yet been devised. The main aim of this study was to investigate the distinctive case of technology entrepreneurs from classes of 1991 and 1992 of electronics at NTNU. The case combines 9 entrepreneurs who established 6 highly successful technology companies. Among the companies 3 were categorized as corporate spin-offs and 3 as university spin-offs. The case stands out due to the fact that in the period in question the institutionalized support system for entrepreneurship was non-existent, and no forms of encouragement towards firm formation were noted. First research question asked in this study is concerned in the role that NTNU had in fostering entrepreneurial activities among the students in question. Through the case study it was argued that despite the lack of organized system for fostering commercial activities, NTNU indirectly influenced future entrepreneurs in several ways.
    [Show full text]
  • NTVA Review 2014 NORWEGIAN ACADEMY of TECHNOLOGICAL SCIENCES
    NTVA Review 2014 NORWEGIAN ACADEMY OF TECHNOLOGICAL SCIENCES 1 NTVA IN BRIEF The Norwegian Academy of Technological Sciences (NTVA) is an independent organization founded in 1955. NTVA is a member of the International Council of Academies of Engineering (CAETS) and of the European Council of Applied Sciences and Engineering (Euro-CASE). THE GOALS OF NTVA ARE TO: tatives of the management of companies and institutions – Promote research, education and development in Norway. The purpose of the council is to support NTVA of technology and natural sciences in fulfilling its missions and to strengthen its relations with – Encourage international cooperation within these fields society. In 2014, the Council had 39 members. – Further the understanding of technology and natural sciences among the authorities and the public to the THE MAIN ACTIVITIES benefit of Norwegian society and industrial development NTVA hosted 34 events in 2014. Two seminars were held in Norway. in Trondheim and two in Oslo. On 29 October a symposium on “Medical Technology – Meeting Tomorrow`s Health NTVA’s members are scientists and industrial leaders Care Challenges” was held at The Norwegian Academy recruited from academic institutions and industries in of Science and Letters in Oslo and had 165 participants. Norway and abroad. Individuals who have contributed Regular meetings open to the public were held in Bergen significantly to the technological sciences or in related (6), Oslo (6), Stavanger (6), Grimstad (1), Tromsø (1) and areas, or whose work has furthered practical applications Trondheim (7). Selected themes from these meetings are of technology are eligible for membership. The total reported in this Review.
    [Show full text]
  • Inventing in Norway Vol
    annual 49 • 2006/2007 research & development Inventing in Norway Vol. 2006 No. 3 ISSN 0029-3628 Published May 2006 by: facts: Norway in Brief table of contents Size: 385,155 sq. km (including the islands of Svalbard and Jan Mayen) Population 4.6 million Tvetenveien 4, NO-0661 Oslo, Norway Tel: +47 22 07 85 59 Main Cities Population (incl. suburbs) 2: Alphabetical Listing Fax: +47 22 07 85 51 Oslo 801,028 E-mail: [email protected] Bergen 212,626 Website: www.index.no Stavanger/Sandnes 171,342 3: Foreword in cooperation with Innovation Norway, the Trondheim 145,691 Confederation of Norwegian Enterprise, the 4: Index by Company/Organization & Industry Norwegian Ministry of Foreign Affairs, the GDP 2004 (estimate) NOK 1.69 trillion Research Council of Norway and the Norwegian Exchange Rate: 8–51: Norwegian R&D Articles Ministry of Trade and Industry. NOK/USD 6.74 (average 2004) NOK/EUR 8.37 (average 2004) Managing Director 8: Aquaculture Exports and Imports 2004 (Preliminary figures) Norvald M. Heidel Anticipating – and Designing for – Aquaculture’s Future NOK bill. USD bill. Amount of GDP Production Manager Total Exports 737 109.4 43.7% Frode Gulestøl Total Imports 499 74.1 29.6% 13: Biotechnology Synergies Breed Biotech Success Editor-in-Chief Main Export Commodities Scott LaHart Oil & gas, metals, machinery, chemical products, fish & fish products, Editorial Staff pulp & paper and ferro alloys 18: Centres of Excellence Robert Moses, Diane Oatley Mitigating Natural Disasters with Norwegian Expertise 13 Main Import Commodities International pharmaceutical
    [Show full text]
  • Cutting-Edge Research-Based Innovation
    AUTUMN 2007 THE COUNCIL RESEARCH OF NORWAY NEWS FROM Cutting-edge researCh-based innovation new search Fighting cancer Future Life saving technology with stem cells fish farms crashes page 10 page 12 page 20 page 32 Norway – more than deep fjords and rugged mountains MONA G. Rygh Photo: Susanne Moen Stephansen Norway’s academe is not quite as famous on the international arena as the country’s fjords and mountains. This is not so strange, really, considering that the country was a relatively poor agrarian society only about a century ago. However, times have changed and Norwegian research is now world-class in a number of fields. It has laid the foundation for prodigious economic development. The Research Council of Norway, its ‘owner’ (the Ministry of Ed- ucation and Research), and the Norwegian government have high ambitions for the further advancement of Norwegian research. Their targets include increased internationalisation and heavy emphasis on quality. One of the instruments available for attaining these targets is the establishment of special centres for outstanding research groups. The centres receive long-term funding which allows them to concentrate on carving out a position in the vanguard of in- ternational research. In 2002, Norway’s first Centres of Excellence (CoEs) were established (see: www.forskningsradet.no, and select English/Publications/Tell’Us 2003). A recent evaluation indicates a high success rate for these cen- tres, and eight more were added to the list this year after a new competition. The CoEs address basic research issues. In addition, the Research Council of Norway recently conferred special status on 14 Centres for Research-based Innovation (CRIs), which are presented in the current issue of this magazine.
    [Show full text]