
DEGREE PROJECT, SECOND LEVEL, STOCKHOLM, SWEDEN 2015


TAXONOMY BASED IMAGE RETRIEVAL USING DATA FROM MULTIPLE SOURCES

JIMMY LARSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION (CSC)



Master’s Thesis at CSC
KTH Royal Institute of Technology, Sweden
Supervisor: Hedvig Kjellström
Examiner: Danica Kragic


Acknowledgements

This page is dedicated to everyone who has been involved in this work. I would thus like to start by acknowledging and thanking my professor at the university, Hedvig Kjellström, for accepting yet another project while already having many other projects to supervise, as well as for the feedback and time that she has given me. I would like to thank Danica Kragic, who accepted the position of examiner for this work while already having many other tasks to attend to. I would like to thank my supervisors at Findwise, Martin Nycander, Birger Rydback and Simon Stenström, for their help and supervision during my time at the company. I would also like to thank Findwise for accepting this project. Finally, I would like to thank my family who supported me during this time and who kept pushing me forward.

Abstract

With a multitude of images available on the Internet, how do we find what we are looking for? This project tries to determine how much the precision and recall of search queries are improved by applying a word taxonomy to traditional Text-Based Image Search and Content-Based Image Search. By applying a word taxonomy to different data sources, a strong keyword filter and a keyword extender were implemented and tested. The results show that, depending on the implementation, either the precision or the recall can be increased. By using a similar approach in real-life implementations, it is possible to push images with higher precision to the front while keeping a high recall value, thus increasing the experienced relevance of image search.

Referat

Taxonomy-Based Image Search

With the amount of images now available on the Internet, how can we still find what we are looking for? This thesis tries to determine how much image precision and image recall can be increased by applying a word taxonomy to traditional Text-Based Image Search and Content-Based Image Search. By applying a word taxonomy to different data sources, a strong word filter as well as a module that extends word lists can be created and tested. The results indicate that, depending on the implementation, either the precision or the recall can be improved. By using a similar method in a real scenario, it is therefore possible to move images with high precision further forward in the result list while retaining high recall, thereby increasing the experienced relevance of image search.

Contents

Acknowledgements

List of Figures

List of Tables

1 Introduction 1
  1.1 Concept 2
  1.2 Abbreviations 4
  1.3 Problem Statement 4
    1.3.1 Research Question 4
    1.3.2 Hypothesis 5
  1.4 Contributions 5
  1.5 Delimitations 5

I Background 7

2 Image Retrieval 9
  2.1 Early History 9
  2.2 Search Users 10
  2.3 Presentation 10

3 Content-Based Image Retrieval 13
  3.1 Semantic Gap 13
  3.2 Features 14
    3.2.1 Global Features 15
    3.2.2 Local Features 15
    3.2.3 Image Segmentation 15
  3.3 Visual Signature 15
  3.4 Learning Approaches 16
    3.4.1 Relevance Feedback 16
    3.4.2 Support Vector Machines 16
    3.4.3 Artificial Neural Networks 17
    3.4.4 Convolutional Neural Networks 17
    3.4.5 Random Forest 18

4 Text-Based Image Retrieval 19
  4.1 Relevant Information in Text 19
    4.1.1 Metadata 20
  4.2 Current Techniques 21
    4.2.1 Term Frequency and Inverse Document Frequency 21
    4.2.2 Natural Language Processing 21
    4.2.3 Part-of-Speech Tagging 22
    4.2.4 Stop Words 22
    4.2.5 Stemming and Lemmatization 22
  4.3 Indexing 22
  4.4 Word Taxonomy 23

5 Related Work 25

II Method 27

6 Architecture 29
  6.1 Data Retrieval 30
  6.2 Extraction of Relevant Information 30
  6.3 Content-Based Image Retrieval 31
    6.3.1 Classification 31
  6.4 Text-Based Image Retrieval 32
    6.4.1 Natural Language Processing 32
    6.4.2 Data Clean-Up 32
  6.5 WordNet Evaluation 32
  6.6 Search Platform 33
    6.6.1 Filters, Stemming and Tokenizing 33

7 Evaluation Method 35
  7.1 Evaluation Data 35
  7.2 Classifier Evaluation 36
  7.3 Evaluation Formulas 37
  7.4 Baseline and Comparison 38

III Results and Discussion 39

8 Results 41
  8.1 Average Precision 42
  8.2 Average Recall 42
  8.3 Average F Measures 43

9 Discussion and Conclusion 47
  9.1 Conclusions 48
  9.2 Future Work 49

Bibliography 51

Appendices 56

A Tags 57

List of Figures

1.1 The system concept portraying a simplified view of the full system. 3

4.1 A taxonomy example. 23

6.1 An extended figure portraying the system architecture. 29

6.2 Example of data that is extracted from a post. (Original blog post from Bites @ Animal Planet) 31

7.1 An image containing a single animal. Tag: sloth. (Figure from Bites @ Animal Planet) 36

7.2 An image containing two or more animals. Tags: dog, fox. (Figure from Bites @ Animal Planet) 36

7.3 A figure about positives and negatives. (Original figure from http://en.wikipedia.org/wiki/Precision_and_recall) 37

8.1 A graph of the averages. 41
8.2 A graph of the average precision scores. 42
8.3 A graph of the recall scores. 43
8.4 A graph of the F2 scores. 44
8.5 A graph of the F0.5 scores. 45
8.6 A graph of the F1 scores. 45

List of Tables

8.1 Average precision scores. 42
8.2 Average recall scores. 43
8.3 Average F scores. 44

Chapter 1

Introduction

Within the multitude of images currently available on the Internet, how can we possibly find what we are looking for? With image retrieval now being applied to health and medical applications [29, 40] as well as military and traffic surveillance [25], rapid progress is not only inevitable but also fascinating. Image retrieval has for a couple of decades been a discipline with a constant flow of research being done. Image retrieval is about making the images that are part of a computer system easily accessible through means such as search engines. In the late 1970s the focus of image retrieval was found in what we today call Text-Based Image Retrieval (TBIR), also known as Context-Based Image Retrieval, Meta-Data Image Retrieval or Keyword-Based Image Retrieval. Initially, with the databases being quite small, methods focused on so-called Database Management Systems (DBMS) [42] in which a user or an administrator would determine appropriate keywords for the images and store the keywords in a database.

With the rapid growth of available data online, however, the problem of individual subjectivity became evident: the different people who were determining appropriate keywords had their own subjective views of which keywords were appropriate. Another problem was the sheer amount of data which had to be processed. Manually determining keywords was no longer a feasible option, and thus Content-Based Image Retrieval (CBIR) was proposed. Content-Based Image Retrieval is a discipline with its origin in the field of Computer Vision. Content-Based Image Retrieval works by determining what an image may portray by looking at the image, its features, colors, textures and so on, and comparing that to already known data. The Text-Based Image Retrieval of today, on the other hand, is a field in which the information that surrounds an image is used as a basis for different Natural Language Processing (NLP) algorithms in order to determine, at least to some degree, what the image may portray.

While Content-Based and Text-Based Image Retrieval might not be enough, a strict word taxonomy applied to the results of such modern image retrieval systems might drastically increase the image retrieval precision, and as such, the Extended Java WordNet Library [1] will be used for this work. The original WordNet [39], which in essence is a lexical database for several languages, has the capability to,

given an input word, output information about said word. The output includes a description of the word, direct hyponyms of the word, inherited hypernyms and sister terms. The inherited hypernyms specifically can be seen as a tree structure with nodes and leaves. Using the WordNet tree structure feature on the results from the Content-Based Image Retrieval component and the results from the Text-Based Image Retrieval component, it is possible to find words which have nodes in common. The common nodes can then be used to augment the existing data or to filter out data which could be considered noise. As such, increasing the image recall at the cost of image precision, or increasing the image precision at the cost of image recall, should be possible. This work will therefore test different implementations of the WordNet tree structure on the results of Content- and Text-Based output to measure the precision and recall of a Taxonomy-Based Image Retrieval (TaBIR) system.

Following Chapter 1, which contains the introductory sections, the image retrieval background is split into three chapters. This was done in order to avoid confusion when talking about image retrieval from the viewpoint of two similar, yet very different methodologies, namely Content-Based Image Retrieval as opposed to Text-Based Image Retrieval. As such, Chapter 2 will briefly cover the history of image retrieval, what kinds of users might use search-related systems, and how different inputs and presentations might be used, all of which are shared by the two approaches. Chapter 3 will cover the specifics related to Content-Based Image Retrieval, while Chapter 4 will cover the specifics related to Text-Based Image Retrieval. In Section 4.4, background and information related to Word Taxonomy and WordNet will be available. Chapter 5 will contain information about work similar to this work, while Chapter 6 explains the architecture and method used in this work. Chapter 7 explains how the evaluation is done, while Chapters 8 and 9 contain the results, discussion, conclusions and future work.

1.1 Concept

Figure 1.1 portrays the concept of the system which will be implemented and evaluated in this work. A brief explanation of each box in the figure follows:

• Blog: The blog is, in the case of this project, a blog called Bites @ Animal Blog [43]. Each post in the blog is a source of data used in the project. More about the data can be found in Section 7.1.

• Crawler: The crawler component will access each blog post and send the HTML data of said post to the next component.

• Extractor: The extractor will fetch image links from the received data, as well as extract the surrounding text and any existing metadata. More about the extractor can be found in Section 6.2.


• Content-Based IR: This component will classify an image referred to by a URL, received from the extractor. See Section 6.3 for more information about this component.

• Text-Based IR: This component will perform Part-of-Speech tagging (PoS tagging), clean up the received texts and attempt to find keywords in the text describing the image. Section 6.4 contains further information regarding the Text-Based IR component.

• WordNet Evaluation: The WordNet component will create word taxonomy hierarchies on the data received from the previous components. The WordNet component will then attempt to filter out, merge or enrich the data depending on the hierarchy results. Further reading about this component can be found in Section 6.5.

• Search Platform: The search platform will perform stemming and tf-idf on the data.

• Interface: The final component is a simple interface for presenting the data.

Figure 1.1: The system concept portraying a simplified view of the full system.


1.2 Abbreviations

• TaBIR - Taxonomy-Based Image Retrieval
• CBIR - Content-Based Image Retrieval
• CLUE - CLUster-based rEtrieval of images
• DBMS - Database Management Systems
• TBIR - Text-Based Image Retrieval, also known as Context-Based Image Retrieval, Meta-Data Image Retrieval and Keyword-Based Image Retrieval
• QBIC - Query by Image Content
• ANN - Artificial Neural Networks
• CNN - Convolutional Neural Networks
• HCI - Human-Computer Interaction
• IDF - Inverse Document Frequency
• NLP - Natural Language Processing
• SVM - Support Vector Machine
• PoS - Part-of-Speech
• IR - Information Retrieval
• RF - Relevance Feedback
• ST - Semantic Table
• TF - Term Frequency

1.3 Problem Statement

There are two main methodologies in image retrieval: Text-Based Image Retrieval and Content-Based Image Retrieval. In CBIR the main obstacle is known as the Semantic Gap [57], which will be explained in detail in Section 3.1. In TBIR the main problem is that of finding text that is actually relevant to an image. As each field has its problems, each also has its limitations; limitations which this work will attempt to narrow down.

1.3.1 Research Question

Using a combination of Text-Based Image Search results and Content-Based Image Search results, how much can the precision and/or the recall of search queries be improved with the use of a word taxonomy?


1.3.2 Hypothesis

By using the data retrieved from a TBIR system and the data retrieved from a CBIR system, the combination of the two into a Taxonomy-Based Image Retrieval system will increase the recall or the precision of image retrieval, depending on how the system is implemented.

1.4 Contributions

The purpose of this work is to evaluate how well a state-of-the-art CNN-based CBIR and a state-of-the-art TBIR, combined into a Taxonomy-Based Image Retrieval system, perform in image search on available online data. Should the resulting data determine that Taxonomy-Based Image Search can improve the precision or the recall of images, the same methodology may be used on other, similar systems. The contributions of this work will thus be a method for a Taxonomy-Based Image Retrieval system which uses data retrieved from TBIR and CBIR to improve the precision and/or recall of image search. Other contributions will include suggestions which may further the research of image search systems.

1.5 Delimitations

Certain limitations will be put in place so as not to overextend past the purpose of this work. These limitations may be based on time, hardware, software, available data as well as risks to the integrity of the work.

• This work will NOT be about the development of new computer vision or deep learning algorithms. Already finished components from these fields will be used, as the main focus is on Information Retrieval, not on Computer Vision.

• No monetary expenses will be made for the specific purpose of this work.

• Query precision and recall of images will be the main focus of this work.

• The evaluation dataset will be made from one or several blogs on the Internet.

• The dataset category will be mainly animals and data from blogs related to animals.

• The data will be in English.

• Only tags relevant to animals will be used in the evaluation data. An image will not be evaluated using tags such as sky, wall, tree, cute, beautiful but rather using tags such as cat, dog, bird for good evaluation in a specific area (see Section 3.1, Semantic Gap, and Section 7.1, Evaluation Data).


Part I

Background


Chapter 2

Image Retrieval

Image retrieval is an interesting area with roots in everything from Computer Vision, Databases and Information Retrieval, as mentioned by Smeulders et al. [57], to Semantic and Language fields where NLP plays a big role. This first background chapter will briefly cover the history of image retrieval as well as information regarding the users of these systems and how image query responses can be presented.

2.1 Early History

TBIR is an area which has been researched since the late 1970s [47]. In the early stages of TBIR, the use of image annotations in database management systems (DBMS) was proposed with relatively good success. That is, one would manually annotate images according to what they portrayed and then use the annotations as a basis for the image search [32, 53]. An example of such can be found in A Relational Database System for Images [5]. With the rapid expansion of the Internet, however, two very specific problems soon arose. The first problem was the matter of the subjectivity of each individual who was inputting the annotations into the system. While one individual might perceive an image in one way, another individual might perceive the image as something where other annotations would have been preferred. The second problem that arose was that of labour. To manually annotate each image is simply not feasible as systems grow larger. To cope with these problems, CBIR was proposed.

CBIR is a discipline originating from the field of Computer Vision. In the works of Smeulders et al. [57], CBIR is referred to as a discipline born from Computer Vision and Information Retrieval as well as databases. Other well-known names within the field, such as M.S. Lew, mention in Lew et al. [31] how the early years of CBIR were generally based upon works from the field of Computer Vision. In 1991, in the early years of CBIR, the efficiency of using colors and histograms was discovered and published in an article written by M.J. Swain and D.H. Ballard, known as Color Indexing [59]. Color histograms have since been considered the foundation of CBIR [66]. The publication [59] has, according to Google Scholar [20], been cited more than 6275 times since its publication in 1991. Within the discipline of CBIR several impressive advancements have since been made, ranging from normalized cuts and image segmentations [54] to improvements in visual signatures, feature extractions and relevance feedback [48, 63, 67] to convolutional neural networks [55, 60] and annual competitions within the field of CBIR [50].

2.2 Search Users

How a user perceives a system depends very much on the expectations of said user. Smeulders et al. [57] have tried to classify users into three groups depending on what the user's end goal might be:

• Users who browse by association. These users do not have a specific goal but rather work on constantly refining their search through several iterations of associated images. Associated images might be images taken from similar sources, images rated in a similar manner after going through relevance feedback, or simply images having similar keywords or labels.

• Users who search. In this case the user is looking for a specific image or something very similar to what the user has in mind.

• Users who categorise. These users might have a reference image and then try to look for images of the same category.

Datta et al., in Image Retrieval: Ideas, Influences, and Trends of the New Age [8], state that the intent of the user and the clarity of said intent affect whatever expectations a user might hold of a search system. Datta et al. then augment the categorisation of users made by Smeulders et al., but refer to them as browsers, surfers and searchers.

2.3 Presentation

One of the most important aspects of image retrieval is how the results are presented, as it does not matter how well the image retrieval works if the results are not shown in an easily accessible manner. As such, different presentation methods may be preferred in different situations. Below are some of the more relevant methods [8] used for presenting the images:

• Chronological order - Using this presentation method, images are shown in their chronological order. A chronological order might for example be according to the date on which the photos were taken or uploaded to a system. This order is seen more frequently in private albums, as photos are usually automatically time-stamped, or in services where a timeline is featured.


• Clustered ordering - This method presents the images based upon their clustering in the database. A good example of when clustering can be used is after utilising the CLUE approach, which will be mentioned in Section 3.1.

• Relevance ordered - The relevance ordered presentation is the most common method [8] used in search systems. In a relevance ordered presentation the relevance to a query is calculated and the most relevant responses are shown first.


Chapter 3

Content-Based Image Retrieval

The idea behind Content-Based Image Retrieval (CBIR) is that one should, with the use of different features within an image, be able to acquire visual signatures. Visual signatures can be used to find other, similar signatures, and thus similar images. One of the major problems common to all of the CBIR areas, however, is the so-called semantic gap, which is the gap between the high-level semantics used by humans and the low-level semantics used by machines. Even so, CBIR is an exciting discipline that has seen rapid advancements in recent years. An example of this is that the annual Large Scale Visual Recognition Challenge, performed for the first time in 2010, has seen a reduction in image classification error from 28.2% in 2010 to 6.7% in 2014 [50]. The reduction in classification errors can be seen in the switch from SVM, random forest and so on, to CNN. This chapter is split into different sections where each section describes an important factor in regards to CBIR. Section 3.1 will cover the semantic gap, one of the main issues in regards to CBIR. Section 3.2 will explain the importance of features, while Section 3.3 will explain visual signatures, the important part in which the features are used. Section 3.4 will cover the attractive concept of learning in CBIR systems and will also cover some of the different approaches related to the subject.

3.1 Semantic Gap

One of the most difficult obstacles within the field of CBIR is the obstacle known as the Semantic Gap. Smeulders et al. call it a critical problem [57] and it is one of the obstacles currently hampering the advancement of CBIR. The semantic gap refers to the gap between the high-level semantics that humans use on a daily basis and the low-level semantics used by computers. If a human wanted to describe an image of a dog playing on an open field of grass, it would be a simple matter of saying “The image portrays a dog playing on an open field of grass.”. The level of semantics used by the human is called high-level semantics. We use words such as dog, open field, playing and grass because, to a human, all of those concepts are well known and easily recognisable at a glance. A computer, however, does not know at a glance what a dog is, nor the concept of playing. To a computer, a dog might be a collection of ellipses in a certain order and maybe a few specific colors. That is, to a computer, a dog is a collection of features that in low-level semantics try to describe what a human would call a dog. Different techniques have been proposed and tested in order to diminish the semantic gap. Examples include:

• Query By Image Content (QBIC) - in which an example image is used as a reference to find other images with similar features [14].

• Text-based assistance - One paper [64] describes using Latent Semantic Indexing in order to find features written in text and then combining that with CBIR.

• CLUster-based rEtrieval of images (CLUE) - Chen et al. [6] use the proposed CLUE method in which features that are usually seen in combination are clustered together for increased precision.

In addition to these methods, Liu et al. suggest, in a survey [32], other methods such as:

• Object Ontology - High-level semantics being mapped to low-level semantics. An example could be, mapping the keyword football to a circular shape and perhaps some specific texture.

• Semantic Tables (ST) - The opposite of Object Ontology. ST is about mapping low-level semantics to high-level keywords.

• Machine Learning - Machine Learning is in its own right an extensive subject, containing approaches such as Inductive Logic Programming, Support Vector Machines, Bayesian Networks, Genetic Algorithms, Artificial Neural Networks and Convolutional Neural Networks. This report will briefly cover the subject of Convolutional Neural Networks (CNN) in a later section.

3.2 Features

Features is the term used to denote the low-level semantics used by computers. Features include the colors, whose use was introduced by M.J. Swain and D.H. Ballard in Color Indexing [59], as well as the general shapes and textures that make up the image. That is, in an image: what colors, shapes and textures are present? Several features, combined together, are what form an image. Various techniques have been proposed for the retrieval of specific features, such as the “blobworld” [2] representation or later the wavelet-based texture retrieval [11] which made use of generalised Gaussian densities and the Kullback-Leibler distance.
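A minimal sketch of global color feature extraction in the spirit of Swain and Ballard's color indexing [59], assuming the Pillow and NumPy Python packages are available; the choice of 8 bins per channel and the function names are illustrative, not taken from this work.

    import numpy as np
    from PIL import Image

    def color_histogram(path, bins_per_channel=8):
        # Joint RGB histogram used as a simple global feature vector.
        rgb = np.asarray(Image.open(path).convert("RGB"))
        hist, _ = np.histogramdd(
            rgb.reshape(-1, 3),
            bins=(bins_per_channel,) * 3,
            range=((0, 256),) * 3,
        )
        hist = hist.flatten()
        return hist / hist.sum()  # normalise so images of any size compare

    def histogram_intersection(h1, h2):
        # The similarity measure proposed in Color Indexing [59]:
        # 1.0 for identical histograms, lower for dissimilar ones.
        return np.minimum(h1, h2).sum()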


3.2.1 Global Features

Global features are a sub-domain belonging to the features domain. The global features consist of an accumulated set of features which form an overall “impression” of the image. The overall impression can be the average base color, average tone level or any other such overall impression.

3.2.2 Local Features

In comparison to global features, local features are, as the name suggests, features specific to local areas of the image. The image is divided into several smaller areas in which features are then examined. During this phase, features may include textures within certain areas, color differences, as well as other features that may differ from area to area.

3.2.3 Image Segmentation

Image Segmentation in CBIR is mainly used as a method to examine shapes within an image. By segmenting an image into thin segments, the general shape of an object within the image can be found. Techniques such as Normalized Cuts [54] exist for this purpose. Other techniques include:

• Chain coded string [15]

• UNL Fourier features [45]

• Zernike moments [26]

However, since the purpose of this report is not a study of segmentation techniques, further explanations will be omitted, but it should be noted that an extensive study on segmentation techniques is available in Babu M. Mehtre et al. [38].

3.3 Visual Signature

In CBIR, a visual signature is, in effect, a set of visual features of an image which can be used in order to find other signatures with similar features. Visual signatures can be extracted using features present in the image together with a segmentation technique. The visual signature is, by the techniques of today, a necessity in order to find similar images using CBIR. Step by step, an image is initially processed by dividing the image into different parts. Once the image has been divided into parts, features are examined as global or local features, which are then used in the acquisition of a visual signature. Once the signature has been acquired, several motivational factors exist in the choice of images to be returned. R. Datta et al. [8] summarize the motivational factors in five points which have been quoted below:

• Agreement with semantics


• Robustness to noise (invariant to perturbations)

• Computational efficiency (ability to work in real time and in large scale)

• Invariance to background (allowing region-based querying)

• Local linearity (i.e., following triangle inequality in a neighborhood)

3.4 Learning Approaches

In CBIR, the potential of letting a system learn what is right or wrong, or what the user wants and does not want, is seen as very attractive indeed. As mentioned in Section 3.1, one of the problems of CBIR is the semantic gap. If the system can learn what the high-semantic user is looking for in a query, then that allows for good improvements in the system precision when retrieving images in response to said query. Different options are available when it comes to teaching a system. Some of the possible options are mentioned below.

3.4.1 Relevance Feedback

Relevance Feedback (RF) is a widely used technique that spans several research fields. The basics of relevance feedback in regards to CBIR begin with a response to a query returning an image. The user may then give feedback to the system by letting the system know whether the image was relevant or not relevant. If the response was relevant to the query, a positive weight may be added for similar responses, while a negative weight may be added in the opposite case, that is, when the image was not relevant. Note, however, that certain techniques in RF do not make use of negative weights to the same extent as some other techniques. This can be seen in “Relevance feedback in content-based image retrieval: some recent advances” by X.S. Zhou and T.S. Huang [67]. The idea behind RF is that the system will eventually move towards results which are more relevant to the user, according to the feedback that is received.

3.4.2 Support Vector Machines

Support Vector Machines (SVM) belong to a category of learning models known as supervised learning. In SVM the goal is achieved by having a learning algorithm analyse data in order to learn to recognise patterns. The training data is essentially a set of examples where each example is marked as belonging to one of two categories. The SVM then creates a model according to learned patterns, and uses the model to classify which category new examples should end up in. An excellent introduction to SVM can be found in the book An Introduction to Support Vector Machines and Other Kernel-based Learning Methods written by N. Cristianini and J. Shawe-Taylor [7].


3.4.3 Artificial Neural Networks

Artificial Neural Networks (ANN) is a technique in which humans try to mimic the properties of the brain. Several important factors make ANN desirable in CBIR, where learning can heavily affect the performance of a system. Amongst several other factors, the properties of learning ability as well as adaptability are highly esteemed in CBIR. In an implementation, and in an attempt to mimic the human brain, the ANN can be viewed as a directed weighted graph in the sense that neurons are nodes with directed edges connecting them [22]. In 1943 McCulloch and Pitts [37] defined the artificial neuron architecture, the McCulloch-Pitts neuron, also known as the Threshold Logic Unit (TLU), in three steps (a minimal code sketch of such a unit follows at the end of this subsection):

1) One or more connections bring activation signals from other nodes.
2) A node, acting as a processing unit, sums the inputs and applies a function to the data.
3) The node sends the result through an output line to the other connected nodes.

F. Rosenblatt later introduced the Perceptron [46] in which a layer of McCulloch-Pitts neurons act as inputs and, from there, feed data forward to an output layer of McCulloch-Pitts neurons. This is also the basic idea of Single-Layer Neural Networks. In regards to ANN, different node patterns can be created in order to accomplish different goals. Depending on the pattern used, the ANN will belong to one of two main categories. If the ANN does not contain loops, it is called a feed-forward network. To the feed-forward category belong techniques such as:

• Single-layer perceptron
• Multilayer perceptron

On the other hand, if the ANN does contain loops, it is known as a feedback network, or recurrent network. To this category belong techniques such as:

• Competitive networks
• Hopfield network
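The sketch promised above: a McCulloch-Pitts style threshold logic unit in Python, following the three listed steps; the weights and threshold shown are illustrative values, not taken from this work.

    def tlu(inputs, weights, threshold):
        # Steps 1-2: gather incoming signals, sum the weighted inputs and
        # apply a step function. Step 3 is the returned output signal.
        activation = sum(x * w for x, w in zip(inputs, weights))
        return 1 if activation >= threshold else 0

    # With these weights the unit computes logical AND of two binary inputs.
    print(tlu([1, 1], [0.5, 0.5], 1.0))  # -> 1
    print(tlu([1, 0], [0.5, 0.5], 1.0))  # -> 0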

3.4.4 Convolutional Neural Networks

In recent years, advances in Convolutional Neural Networks (CNN) have led to rapid advancements in regards to CBIR. A CNN is a variant of a feed-forward ANN, and LeCun et al. [28] have for a while spoken highly of the uses of CNN and its advantages within CBIR. In a CNN, an input layer such as an image is analysed by several CNN layers, where each CNN layer is one of three types [17]:

• Convolutional - The convolutional layer is a layer consisting of neurons (nodes) in a rectangular grid-like pattern. Each neuron will take its input from a rectangular area of the previous layer.

• Max-Pooling - This layer may be used after a convolutional layer. Its task is to produce a single output by sub-sampling a rectangular block in the convolutional layer prior to it.


• Fully-Connected - This layer connects all the attained information from the previous layers. This is done by taking all of the neurons in the previous layer and connecting them to every neuron in the fully-connected layer.

By creating models consisting of these different types of layers, the CNN is able to extract features from the source data as well as to make predictions as to where new data might end up being classified. With the attention on CNN in recent years, other new interesting approaches such as the Region-CNN (R-CNN) [19] and different deep convolution methods [55, 60] have garnered interest. The deep convolution method by Karen Simonyan and Andrew Zisserman focuses on adding convolution depth by reducing the sizes of the convolution filters [55]. In the latest Large Scale Visual Recognition Challenge [50], the two top teams both used deep convolution in their entries. For CNN, an excellent framework known as Caffe [23] exists. Caffe is well documented and is constantly being updated, making it one of the top candidates for state-of-the-art CNN frameworks.

3.4.5 Random Forest

Random Forest, which was proposed by Leo Breiman in Random Forests [4], is an approach which improves the classification rate in comparison to traditional decision trees. The traditional decision tree is structured so that the leaves represent different classes while the branches that connect the leaves act as conjunctions leading to one leaf or another. It has been shown that decision tree algorithms have the potential to consistently perform better than other classification methods such as maximum likelihood and linear discriminant function classifiers [16]. In another publication, known as Bagging [3], Breiman states that the introduction of a voting system, called an ensemble in machine learning, when performing classifications can significantly improve the classification precision. Breiman then proceeds to use the ensemble reasoning as a part of his Random Forest. What Breiman in essence proposed with Random Forest was the following [4]: generate a large number of decision trees; the most popular classification is then decided by an ensemble vote.

Chapter 4

Text-Based Image Retrieval

In Text-Based Image Retrieval the interesting part is not an entire document, but rather the small part of a document, the Relevant Information in Text, which might describe an image. Different methods to find the relevant text can be found simply by inspecting several webpages. Some methods include looking in the HTML headers, title fields, ALT-fields and text in close proximity to the image, while other methods make use of the old DBMS models in which meta-data is stored in a database. Once potentially relevant text has been found, Natural Language Processing algorithms and other approaches, such as the use of Term Frequency and Inverse Document Frequency, can then be applied so that only the most relevant information is left behind for the Indexing stage.

In Section 4.1 there will be explanations regarding the Relevant Information in Text. Section 4.1 will also mention Metadata, while Section 4.2 is about different techniques used in Information Retrieval systems, such as NLP, as well as methods used in formal search systems, such as tf-idf. Section 4.3 will consist of an explanation regarding Indexing and different platforms used for Indexing in modern search systems.

4.1 Relevant Information in Text

In a traditional Information Retrieval system such as an Internet search engine, many problems exist but can be circumvented using different methods. Problems regarding spelling can be circumvented using spell correction algorithms, while problems regarding phonetics can be partially circumvented using phonetic algorithms. Since the problem of this report is not a problem of spelling or phonetic algorithms, however, more detailed explanations will be omitted but can be read about further in Techniques for Automatically Correcting Words in Text [27] as well as An Introduction to Information Retrieval [34].

While in a traditional Internet search engine the information that is relevant to a document, that is, the source of text used for indexing, may be found anywhere on the site, the same is not always true for a system whose purpose is to find images and the text that describes those images. In a website document containing 10000 words and two pictures, it is possible that only a few sentences' worth of text is related to the images. The main issue in TBIR is basically to find out what words or sentences are related to what pictures. For this there is no straightforward solution, even though different options and approaches are available. After inspecting the source code of a few webpages, some different possible approaches could be seen, such as:

• ALT-text - The intended purpose of the ALT-text feature is for users to add a descriptive text to an image. Most users do not use this feature but if used, it can be a good source of information regarding the image.

• Image as reference point - Using the image as a point of reference in the website, it is possible to look for k number of sentences before the image as well as after the image and assume that they are related to the image.

• Titles - The titles in a webpage might hint at what the images could be about. One common feature of most of the traditional search engines is that the title is used as the most important factor [18] affecting the relevance of a document.

• Headers - Some headers might in a few words describe a section in which an image is located.

• Nearby <p> tags - If the image is within a text <p> tag, or if there are text <p> tags in close proximity to the image in the HTML, then it is reasonable to assume that they might contain information regarding the image.

4.1.1 Metadata

In IR systems, Metadata or METAtags refers to the underlying information which can be used for various tasks such as sorting or searching. For search engines, metadata such as keywords and descriptions, which was popularised by the now shut-down search engine Altavista, has existed for a long time. Due to extensive abuse of metadata, however, such as listing false information in the metadata or simply repeating the same keyword hundreds of times to appear more relevant [18], many of the search engines that exist today have stopped using user- or author-written metadata and have instead moved toward automatic metadata generation, in which the metadata is generated based upon content. Many corporations, however, still make use of user- or author-written metadata in their internal systems, as they are not affected by the previously mentioned abuse. For more information in regards to metadata generation, a study of two metadata generators can be found in Metadata Extraction and Harvesting: A Comparison of Two Automatic Metadata Generation Applications by J. Greenberg [21].


4.2 Current Techniques

As mentioned in Section 4.1, the problem in TBIR is more a problem of finding out what text is relevant than finding information about the whole document. Even so, techniques such as tf-idf, removal of stop words, stemming and other NLP techniques still play a major role in the precision and performance of modern TBIR systems.

4.2.1 Term Frequency and Inverse Document Frequency

If one assumes the goal to be the following: find out which documents in a set of N documents are most relevant to the query “The White Rabbit”, there are some methods by which this can be accomplished. To begin with, one could disregard all of the documents not containing the query words “The”, “White” and “Rabbit”. One could then simply count the number of times that the words “The”, “White” and “Rabbit” occur in the documents, that is, the term frequency (tf) of each word, and return the documents with the highest tf. By doing so, however, certain words such as “The”, which is a very common word even when it is not used in the context of “White” and “Rabbit”, would cause some documents to seem more related to the query than they really are. For this, the inverse document frequency (idf) is used. The idf is a way of lowering the impact of words which are common across the document collection and raising the impact of words which are uncommon. The tf is, as mentioned, simply the frequency of a term, denoted tf_{t,d} where t is the term and d is a document. The idf for a term t can be calculated according to Formula 4.1, where N is the same as mentioned earlier and df_t is the number of documents in which the term t occurs. Thus, by using tf and idf in what is called tf-idf_{t,d}, calculated using Formula 4.2, it is possible to find documents which are related to the query. More about this subject can be read in an extensive study called Term Weighting Approaches in Automatic Text Retrieval [51], written by Gerard Salton and Chris Buckley, and in a book written by C. Manning et al. called An Introduction to Information Retrieval [34].

\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t} \qquad (4.1)

\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t \qquad (4.2)
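A minimal Python sketch of Formulas 4.1 and 4.2 on a toy collection; the three example documents are invented for illustration.

    import math

    docs = ["the white rabbit ran across the field",
            "the dog chased the white rabbit",
            "a dog played on the field"]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)

    def tf(term, doc):
        return doc.count(term)  # raw term frequency tf_{t,d}

    def idf(term):
        df = sum(1 for doc in tokenized if term in doc)  # df_t
        return math.log(N / df) if df else 0.0  # Formula 4.1

    def tf_idf(term, doc):
        return tf(term, doc) * idf(term)  # Formula 4.2

    # "the" occurs in every document, so idf = log(3/3) = 0 and its tf-idf
    # vanishes, while the rarer "rabbit" keeps a positive weight.
    print(tf_idf("the", tokenized[0]), tf_idf("rabbit", tokenized[0]))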

4.2.2 Natural Language Processing

NLP can be seen as a process in which the goal is to turn the natural language used by humans into a language which is easier for the computer to understand. NLP consists of different methods for the computer to understand human language, such as Part-of-Speech tagging (PoS tagging), stemming, stop words, and so on.


4.2.3 Part-of-Speech Tagging

PoS tagging, also known as grammatical tagging, is the process in which a computer may receive a sentence and assign a grammatical tag to each word in the sentence. A grammatical tag might be a tag such as verb, adjective, noun and so on. One such example of a PoS tagger is the Stanford PoS tagger, which is part of the Stanford NLP [35, 36].

4.2.4 Stop Words

Stop words are words which, from a query perspective, are of so little value that they are removed from the system. For example, if there exists a set of N documents and every one of these N documents contains hundreds of occurrences of a word w, and w itself does not add any additional information to a query, it might be considered a stop word and thus be removed from the system. Examples of words which are frequently considered stop words are “a”, “it”, “the”, “was” and so on [35].

4.2.5 Stemming and Lemmatization

In almost every language, different grammatical forms are used depending on the scenario in which they occur. The meaning of the words close, closing, closed, closes is the same in all of the cases, but the use depends on the situation or whether the conversation is about the past, present or future. Stemming is used in order to remove the variations of a word. An example could be the words cat, cats, cat's, cats'; after the stemming process they would all become cat. The use of stemming allows for reduced redundancy in the database as well as easier query creation. If a picture of a cat playing with a ball of strings was inserted into the database and the descriptive text The cat's playing with a ball of strings was found, then the stemming would change cat's into cat, thus allowing a user to find the image by entering the query cat.

One of the early stemming algorithms, known as the Lovins stemmer [33], was introduced in 1968 by Julie Beth Lovins. The stemming algorithm, which contained 294 suffixes, was mainly designed for stemming of scientific texts [65]. The original Porter stemmer [44], however, containing only approximately 60 suffixes, was evaluated and shown to perform at least as well as more complicated stemming algorithms [30], making it a good choice within the field of Information Retrieval. While stemming acts upon a set of rules and suffixes, lemmatization, on the other hand, also takes note of the context by performing more complex tasks. The words is, was, am and being would in lemmatization all turn into be [13].
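A minimal sketch of stemming and lemmatization using NLTK, assumed here purely for illustration; the nltk package and its wordnet corpus must be installed.

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    for word in ["cats", "closing", "closed", "closes"]:
        print(word, "->", stemmer.stem(word))  # -> cat, close, close, close

    # Lemmatization takes the part of speech into account:
    # "was" treated as a verb becomes "be".
    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("was", pos="v"))  # -> be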

4.3 Indexing

The action of indexing is one in which the input document, after going through different techniques such as tf-idf, word stemming, removal of stop words, application of NLP algorithms and so on, is stored into a database. Currently, many open platforms are available which handle the database and allow for easy and efficient storing and accessing of indexed data. Examples of such platforms are:

• Solr [58] - An open source search engine based on Lucene. The engine is written in Java and allows for easy, yet powerful, input and output manipulation.

• ElasticSearch [13] - An open source search engine that also comes with powerful analytic options. Similarly to Solr, ElasticSearch is based on Lucene.

• Dezi [10] - A project written in Perl, similar in functionality to Solr and ElasticSearch.

4.4 Word Taxonomy

Taxonomy is considered the practice of classification. One may, for example, classify the word “dog”. The word “dog” may belong to the “canine” class, which may not only contain other classes such as “wolf” and “fox” but may in turn also belong to the “carnivore” class, and so on. Figure 4.1 shows an example of a taxonomy.

Figure 4.1: A taxonomy example.

On the subject of taxonomy and information retrieval, the work known as WordNet [39] is of extra interest for this work. WordNet is a manually constructed taxonomy of words for, mainly, the English language. If an input word exists in the WordNet lexicon, an output in the form of a synonym set (synset) is returned. Each item in the synset contains useful information such as the sense in which a word might occur, its synonyms, and whether the word is a noun, verb, adjective etc. More importantly, however, the synset also contains the hyponyms, meronyms, holonyms and hypernyms of a given word (a code sketch of such a lookup follows after the list below).

• Hyponym - lower level classes of a word, i.e. more specific. “Dog” is a hyponym of “canine”.

• Meronym - something that is a part of something else. “Paw” is a meronym of “canine”.

• Holonym - a term to which another term may belong. “Canine” is a holonym of the “Canidae family (dogs; wolves; jackals; foxes)”.

• Hypernym - higher level classes of a word, i.e. more generalised. “Canine” is a hypernym of “dog”.
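The lookup sketch referred to above, using NLTK's WordNet interface rather than the Extended Java WordNet Library used in this work; the idea of walking the hypernym tree is the same.

    from nltk.corpus import wordnet as wn

    dog = wn.synsets("dog")[0]  # first sense of "dog"
    print(dog.hypernyms())      # direct hypernyms, e.g. the canine synset

    # Walk the inherited hypernyms up to the root of the taxonomy.
    node = dog
    while node.hypernyms():
        node = node.hypernyms()[0]
        print(node.name())

    # The smallest common node of two words, e.g. "dog" and "fox".
    fox = wn.synsets("fox")[0]
    print(dog.lowest_common_hypernyms(fox))  # e.g. [Synset('canine.n.02')]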

Chapter 5

Related Work

Since its introduction, a lot of work on Information Retrieval has been done using WordNet. In 2005, WordNet was used to prune CBIR features [24]. This was done by calculating several similarity measures which were combined using the Dempster-Shafer Evidence Combination. A work by G. Varelas et al. [61] is similar to this project. Their work made use of text descriptions and image descriptions which were then sent to a WordNet component. The text descriptions were, similarly to this work, alt text, titles, captions and image file names. The image descriptions, however, focused on image features such as frequency spectrum and moment invariants, in comparison to this work, which uses actual classifications for an approach more in line with information retrieval. In their WordNet component they looked for semantic similarities using similarity measures, while this work focuses on high precision through strict filtering. Their work also assumes that the most common WordNet sense is always the one of interest, thus ignoring other senses. In their future work section, however, it was discussed that it would be interesting to work on sense disambiguation to determine which of the senses is actually the one looked for. This project iterates through all of the senses and looks for results in all of them.

S. Zinger et al. [68] used WordNet in a pruning approach to create portrayable objects which would then represent a cluster of images that had been clustered together using CBIR. X. Wang et al. [62] used a contextual weighting approach on vocabulary tree based image retrieval, while J. Deng et al. [9] used hierarchical semantics, in the form of semantic attributes on images, to retrieve images similar to a query image. In 2011, M. Douze et al. [12] combined attributes with the image descriptor based on the Fisher vector [52]. The Fisher vector alone has been shown to outperform the Bag of Features (BoF) [56] approach, and M. Douze et al. managed to show improved performance with the attribute and Fisher combination.

At the time of writing, and to the best of our knowledge, there are no other works using the Caffe CBIR classifier in combination with traditional image retrieval techniques and a WordNet based taxonomy filter.


Part II

Method


Chapter 6

Architecture

This chapter will contain sections where each section is dedicated to one of the components in the implemented system architecture. The design of the system is as seen in Figure 6.1.

Figure 6.1: An extended figure portraying the system architecture.


6.1 Data Retrieval

For the crawling and indexing of data on the Internet, a tool known as Norconex HTTP Collector [41] was used upon recommendation from the supervisors at Findwise. Norconex is a tool that is easy to set up and allows for flexible crawling and indexing of webpages. In Norconex it is easy to set up filters, and it was therefore a simple matter of getting only the data which was of interest for this particular work. Henceforth, each set of data, such as a blog post, will be referred to using the Information Retrieval term document. For the storing of each raw document as well as of processed documents, the Findwise-created Indexsvc, short for Index Service, was used. Indexsvc works well with Norconex and supports a good interface for managing and accessing data. In terms of functioning as a location for storing information, however, it is no different from any other storage tool or method.

6.2 Extraction of Relevant Information

The document which the crawler has retrieved and stored in the Indexsvc has to be analyzed, during which only the relevant information should be extracted. First, a few criteria were set as to whether the document would be processed further or discarded:

• An image - If no images could be found then the document would be discarded as it is of no relevance to image retrieval.

• An unspecified image size or a size of at least 100x100 - Since many websites use images with specified dimensions of 1x1 pixels to pre-load data which is used later on the website, images with specified dimensions of less than 100x100 were ignored. Images with no specified size are accepted.

For each image I_i in a set of n acquired images, the following information was to be sent on for further processing in accordance with modern Information Retrieval techniques (a sketch of such an extraction step follows after the list):

• Title

• Alt-text

• Source URL

• <p> tags in “close” proximity to the image

• First x number of characters in the data field

• Image links
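A minimal sketch of such an extraction step using the BeautifulSoup Python package; the field names and the "close proximity" rule (here simply the two <p> tags following the image) are assumptions for illustration, not the exact Findwise implementation.

    from bs4 import BeautifulSoup

    def extract(html, n_chars=500):
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text() if soup.title else ""
        records = []
        for img in soup.find_all("img"):
            # Discard pre-load pixels according to the size criterion above.
            w, h = img.get("width"), img.get("height")
            if w and h and w.isdigit() and h.isdigit() \
                    and (int(w) < 100 or int(h) < 100):
                continue
            records.append({
                "title": title,
                "alt": img.get("alt", ""),
                "src": img.get("src", ""),
                "nearby_p": [p.get_text() for p in
                             img.find_all_next("p", limit=2)],
                "head": soup.get_text()[:n_chars],  # first x characters
            })
        return records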


Figure 6.2: Example of data that is extracted from a post. (Original blog post from Bites @ Animal Planet)

6.3 Content-Based Image Retrieval

For this work, the CBIR component is one of the core architectural components. As mentioned in Section 1.5, this work is not about the development of new computer vision or deep learning algorithms, and as such, the Caffe [23] framework is used in an off-the-shelf manner. The CBIR component classifies each image I_i, i ∈ [1...n], and returns the result for I_i as a vector of the form [(x_1, y_1)...(x_n, y_n)], where each (x, y) tuple contains a predicted class x and a certainty value y, y ∈ [0...1], with which the classifier thinks that the predicted class x is correct. In order to avoid too much clutter, only the top five predictions are used.
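A minimal sketch of such an off-the-shelf classification using the pycaffe interface; the model, prototxt and label file paths are hypothetical placeholders, and this is not the thesis code.

    import numpy as np
    import caffe

    net = caffe.Classifier("deploy.prototxt", "model.caffemodel")
    labels = [line.strip() for line in open("synset_words.txt")]

    def classify_top5(image_path):
        image = caffe.io.load_image(image_path)
        probs = net.predict([image])[0]     # one certainty value per class
        top5 = np.argsort(probs)[::-1][:5]  # keep the five best predictions
        return [(labels[i], float(probs[i])) for i in top5]  # [(x, y), ...]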

6.3.1 Classification

The classification can, as mentioned in Section 3.4, be done using different methods. In Section 3.4, SVM and CNN were mentioned as two machine learning approaches both capable of predicting classes. For this work, CNN was chosen. The reasons for using CNN lie partly with CNN being a very interesting approach that, as previously mentioned, has garnered great attention within the methodology of CBIR. Another reason for using CNN was the Caffe framework, which allows for quick, yet good, off-the-shelf use of feature extraction and classification.

Dataset

The classifier uses a pre-trained neural network, trained on the ILSVRC12 [49] dataset, as the dataset is included in the Caffe framework. As no actual training is done in this project, the only training that the CBIR component has is from the out-of-the-box Caffe framework. A different dataset is used for the evaluation and will be described further in Section 7.1.

6.4 Text-Based Image Retrieval

Similarly to the CBIR component, the TBIR component is also one of the core components of this work. The TBIR component performs NLP actions as well as clean-up and pruning of the extracted text.

6.4.1 Natural Language Processing

For the NLP functionality of the TBIR component, the Stanford NLP [36] was used. With reference to the delimitation “Only tags relevant to animals will be used in the evaluation data” in Section 1.5, the NLP component performs PoS tagging on the extracted text, discarding all non-noun words. Words such as running, beautiful and calm are therefore discarded.
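A minimal sketch of the noun filter, assuming NLTK's PoS tagger in place of the Stanford tagger used in this work; tags starting with "NN" mark nouns in the Penn Treebank tag set.

    from nltk import pos_tag, word_tokenize

    def nouns_only(text):
        tagged = pos_tag(word_tokenize(text))
        return [word for word, tag in tagged if tag.startswith("NN")]

    print(nouns_only("A beautiful dog running calmly across the green field"))
    # e.g. ['dog', 'field'] - running, beautiful and calmly are discarded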

6.4.2 Data Clean-Up For the remaining steps, it is also important that the data does not contain charac- ters which might interfere with the other components. As such, two main actions are taken:

• Non-letter characters are removed.

• Characters are made into lower-case.

Other clean-up actions such as removal of multi-spacing are also done in order to improve the chances of finding a word in the Word Taxonomy component.
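A minimal sketch of these clean-up actions using Python regular expressions; the exact rules of the thesis component are not reproduced here.

    import re

    def clean(text):
        text = text.lower()                       # characters to lower-case
        text = re.sub(r"[^a-z ]", " ", text)      # drop non-letter characters
        return re.sub(r" +", " ", text).strip()   # remove multi-spacing

    print(clean("The cat's playing -- with 2 balls!"))
    # -> "the cat s playing with balls"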

6.5 WordNet Evaluation

The Word Taxonomy component, in this architecture called WordNet Evaluation, is the last of the main components of this work. The WordNet Evaluation component is implemented in two versions. The first implementation focuses on recall, while the second focuses on precision.

For the first implementation, a taxonomy hierarchy of one CBIR classification and one TBIR keyword is retrieved. The component then, for each document, such as an image with its related text and metadata, iterates through both of the hierarchies to look for the smallest common node. The result list for the document is extended by the keyword, the classification, the common node and synonyms for all of the involved nodes.

The second implementation, instead of iterating through both of the hierarchies, only iterates through one of them, while keeping the other at input level, thus creating a filter that will ignore all non-matching nodes. The TBIR was chosen as the static hierarchy due to the probability of a higher precision and recall compared to the CBIR. Tests were performed in both directions, however, meaning that the CBIR was also kept as the static hierarchy, but this did, as expected, result in lower precision and recall and was thus discarded. As in the first implementation, the synonyms for all involved nodes are added to the result. A sketch of both variants is given below.
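A minimal sketch of the two variants using NLTK's WordNet in place of the Extended Java WordNet Library used in this work; the function names and details are illustrative.

    from nltk.corpus import wordnet as wn

    def extend_keywords(tbir_word, cbir_word):
        # Recall-oriented variant: iterate through both hierarchies, add the
        # smallest common node and the synonyms of all involved nodes.
        result = {tbir_word, cbir_word}
        for s1 in wn.synsets(tbir_word):
            for s2 in wn.synsets(cbir_word):
                for common in s1.lowest_common_hypernyms(s2):
                    result.update(common.lemma_names())
                    result.update(s1.lemma_names())
                    result.update(s2.lemma_names())
        return result

    def keep_classification(tbir_word, cbir_word):
        # Precision-oriented variant: iterate through the CBIR hierarchy
        # only, keeping the TBIR keyword at input level as a strict filter.
        for synset in wn.synsets(cbir_word):
            for path in synset.hypernym_paths():
                if any(tbir_word in node.lemma_names() for node in path):
                    return True
        return False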

6.6 Search Platform

The search platform used in this work is ElasticSearch. ElasticSearch was chosen due to its support for multiple indexes as well as its built-in tools for stemming, tokenizing and tf-idf. In addition, ElasticSearch is easy to set up, use and distribute, should very large databases become an issue.

6.6.1 Filters, Stemming and Tokenizing

Each index in the search platform used the following functionality (a configuration sketch follows after the list):

• Stop words - Stop words were removed from all of the index fields in order to reduce the amount of redundant data.

• Standard Tokenizer - The ElasticSearch standard tokenizer was chosen due to its good support and performance for the English language.

• Porter stemming - The Porter stemmer was chosen due to its good performance and on the basis of being the stemmer recommended by ElasticSearch for documents containing English data.
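A minimal sketch of matching index settings using the elasticsearch Python client; the index and analyzer names are illustrative, while "stop" and "porter_stem" are the built-in ElasticSearch token filters for stop word removal and Porter stemming.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.indices.create(index="images", body={
        "settings": {"analysis": {"analyzer": {
            "english_porter": {
                "type": "custom",
                "tokenizer": "standard",  # the standard tokenizer
                "filter": ["lowercase", "stop", "porter_stem"],
            }
        }}}
    })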


Chapter 7

Evaluation Method

Henceforth, for the sake of presenting the results, the Text-Based Image Retrieval component will be denoted TBIR. The Content-Based Image Retrieval component will be denoted CBIR, while the improved version that makes use of information from both sources as well as a taxonomy will be denoted TaBIR, short for Taxonomy Based Image Retrieval.

7.1 Evaluation Data

The evaluation data consists of approximately 900 blog posts which were accessed on 2015-03-20. The posts were accessed by crawling Bites @ Animal Planet [43] using a crawl depth of 30, resulting in 823 images. Each image I_i, i ∈ [1...n], n = 823, was manually examined and tagged by its content. Due to the many tagging possibilities, and in order to remain as objective as possible, a limitation (see Section 1.5) was put in place: the tags would strictly contain the animals that can be seen in the images. As such, tags such as wall, water and tree, and subjective tags such as cute, beautiful and weird, were not used. Images containing one animal were tagged with said animal (see Figure 7.1), while images containing two or more animals were tagged with all animals present in the image, as seen in Figure 7.2. The ground truth data of this project is thus the n images collected using the above-mentioned, manually tagged method.


Figure 7.1: An image containing a single animal. Tag: sloth. (Figure from Bites @ Animal Planet)

Figure 7.2: An image containing two or more animals. Tags: dog, fox. (Figure from Bites @ Animal Planet)

From the evaluation data, tags applied to fewer than 20 images are not evaluated, since such tags would have a high impact on the results without sufficient data to support that impact. See Appendix A for a full list of the tags used and the number of occurrences of each tag.

7.2 Classifier Evaluation

Considering the ground truth set of n images acquired according to the method described in Section 7.1, the system will, for each image I_i, i ∈ [1...n], attempt to classify I_i. Given a query, the result of a classification belongs to one of four categories. To explain this, consider Figure 7.3 and a search query for the term "dog" while reading the following bullet points; a categorisation sketch in code follows the list.

• True positive: A true positive is an image which would show up should the query be performed on the system classified data as well as on the ground truth data. The number of true positives is calculated according to Formula 7.1.


• False positive: A false positive is an image which would show up should the query be performed on the system classified data, but not on the ground truth data.

• True negative: A true negative is an image which would not show up, no matter if the query is performed on the system classified data, or the ground truth data.

• False negative: A false negative is an image which would not show up should the query be performed on the system classified data but would show up should the query be performed on the ground truth data.
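The sketch below restates the four categories as a function, assuming that each image carries a set of system-assigned keywords and a set of ground-truth tags; all names are illustrative.

def categorize(system_keywords, truth_tags, query):
    # Classify one image into tp/fp/tn/fn with respect to a query term.
    retrieved = query in system_keywords  # would the system return the image?
    relevant = query in truth_tags        # should it be returned?
    if retrieved and relevant:
        return 'tp'
    if retrieved and not relevant:
        return 'fp'
    if not retrieved and relevant:
        return 'fn'
    return 'tn'

print(categorize({'dog', 'canine'}, {'dog'}, 'dog'))  # -> 'tp'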

Figure 7.3: A figure about positives and negatives. (Original figure from http://en.wikipedia.org/wiki/Precision_and_recall)

7.3 Evaluation Formulas

If one assumes, according to Section 7.2, that for each image I_i a classification C_i exists, then ∀I_i ∃C_i, i ∈ [1...n]. As such, for each classification, given a query q, each image will end up in one of four categories C_{i,q} ∈ {tp, fp, tn, fn}, i ∈ [1...n]. The numbers of true positives, false positives, true negatives and false negatives are calculated according to Formulas 7.1 - 7.4.


tp_q = \sum_{i=1}^{n} C_{i,q}, \quad C_{i,q} = tp, \quad i \in [1 \ldots n]    (7.1)

tn_q = \sum_{i=1}^{n} C_{i,q}, \quad C_{i,q} = tn, \quad i \in [1 \ldots n]    (7.2)

fp_q = \sum_{i=1}^{n} C_{i,q}, \quad C_{i,q} = fp, \quad i \in [1 \ldots n]    (7.3)

fn_q = \sum_{i=1}^{n} C_{i,q}, \quad C_{i,q} = fn, \quad i \in [1 \ldots n]    (7.4)

The evaluation measures used in this work are specified below.

The precision (pr) of a query is calculated from the true positives (tp) and false positives (fp) of the response. The tp of the response, divided by the sum of tp and fp, yields a value ranging from 0 to 1, indicating how many of the returned documents are correct.

pr = \frac{tp}{tp + fp}    (7.5)

The recall (rc), on the other hand, also makes use of the false negatives (fn). The tp of the response, divided by the sum of tp and fn, is the resulting rc value, also ranging from 0 to 1. In this case, the value indicates how many of the correct images the query was able to retrieve.

rc = \frac{tp}{tp + fn}    (7.6)

The F-measures F1, F2 and F0.5, which put emphasis on balance, recall and precision respectively, are calculated according to the following formula:

F_\beta = (1 + \beta^2) \cdot \frac{pr \cdot rc}{\beta^2 \cdot pr + rc}    (7.7)

Since each formula operates on a specific search query, the average result is calculated for each measure respectively.
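For completeness, Formulas 7.5 - 7.7 transcribe directly into code; the counts used below are made up for illustration.

def precision(tp, fp):
    return tp / (tp + fp)  # Formula 7.5

def recall(tp, fn):
    return tp / (tp + fn)  # Formula 7.6

def f_measure(pr, rc, beta=1.0):
    return (1 + beta**2) * (pr * rc) / (beta**2 * pr + rc)  # Formula 7.7

pr = precision(tp=45, fp=5)   # illustrative counts
rc = recall(tp=45, fn=11)
print(pr, rc, f_measure(pr, rc, beta=2))  # F2 puts emphasis on recall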

7.4 Baseline and Comparison

For this work, the higher of two baselines will be used. The first baseline is the CBIR component on its own; the second is the TBIR component on its own. These two baselines will be compared to the results from TaBIR, which utilises information retrieved from both the TBIR and CBIR components.

Part III

Results and Discussion


Chapter 8

Results

This chapter contains all of the results from the implementations. Precision TaBIR is the implementation which focuses on precision over recall, while Recall TaBIR is the implementation which focuses on recall over precision. In Figure 8.1, the CBIR and TBIR implementations can be seen and compared to the precision and recall implementations of TaBIR. The following sections contain graphs and resulting values for each of the measures, while the results are discussed further in Chapter 9.

Figure 8.1: A graph of the averages.


8.1 Average Precision

The average precision was calculated using Formula 7.5. As can be seen in Figure 8.2 and Table 8.1, the TBIR component performed relatively well compared to the CBIR and Recall TaBIR, with an average precision of 0.77. The Precision TaBIR implementation, however, managed an average of 0.90. The CBIR and the Recall TaBIR implementations performed comparatively poorly, with precision values of 0.53 and 0.56 respectively.

Table 8.1: Average precision scores.

CBIR              0.53
TBIR              0.77
Precision TaBIR   0.90
Recall TaBIR      0.56

Figure 8.2: A graph of the average precision scores.

8.2 Average Recall

The average recall was, in contrast to the average precision, calculated using Formula 7.6. In Figure 8.3 and Table 8.2 one can clearly see that, once again, the TBIR performed well, with a recall score of 0.79. The Precision TaBIR performed worse in this case, with a score of 0.45, just slightly higher than the CBIR which scored 0.44. The Recall TaBIR, however, managed to score an average of 0.89, thus performing slightly better than the TBIR.

Table 8.2: Average recall scores.

CBIR              0.44
TBIR              0.79
Precision TaBIR   0.45
Recall TaBIR      0.89

Figure 8.3: A graph of the recall scores.

8.3 Average F Measures

Considering that image retrieval depends on both precision and recall, it was considered appropriate to also include the F measures, which translate into a single score based on precision and recall. Looking at Figure 8.4, one can see the average results of the F2 measure. The F2 measure, which puts emphasis on recall, shows that the top-performing TBIR and Recall TaBIR achieved F2 scores of 0.78 and 0.79 respectively, while the CBIR and Precision TaBIR only got average scores of 0.43 and 0.49 respectively.

In Figure 8.5 the F0.5 scores can be seen. The F0.5 measure, which puts more emphasis on precision, shows an average CBIR score of 0.45, a TBIR score of 0.77, and Precision TaBIR and Recall TaBIR scores of 0.68 and 0.60 respectively.

F1 puts emphasis neither on recall nor on precision but is rather a balanced measure between the two. Figure 8.6 shows the CBIR, TBIR, Precision TaBIR and Recall TaBIR at 0.43, 0.77, 0.56 and 0.68 respectively. Table 8.3 contains all of the previously mentioned F scores.

Table 8.3: Average F scores.

                  F2     F0.5   F1
CBIR              0.43   0.45   0.43
TBIR              0.78   0.77   0.77
Precision TaBIR   0.49   0.68   0.56
Recall TaBIR      0.79   0.60   0.68

Figure 8.4: A graph of the F2 scores.


Figure 8.5: A graph of the F0.5 scores.

Figure 8.6: A graph of the F1 scores.


Chapter 9

Discussion and Conclusion

Starting from the results of the previous chapter: if one assumes that the only goal is increased precision of the query result, then the Precision TaBIR was able to achieve a precision score of 0.90, compared to the TBIR which scored 0.77 and the Recall TaBIR which scored 0.56. If the goal is instead recall, then the previous chapter shows that the Recall TaBIR scored 0.89, compared to the TBIR which scored 0.79 and the Precision TaBIR which scored 0.45. Thus, the results show that depending on how the TaBIR is implemented, it is possible to increase either the recall at the cost of precision, or vice versa.

The results from the different implementations are not unexpected, as the recall-focused implementation adds keywords while the precision-focused implementation filters out keywords from the text-based implementation. The precision-focused implementation, which in simple terms uses the CBIR component to filter out keywords of the TBIR component, is thus directly affected by the recall of the CBIR component. If the TaBIR system were to be implemented in e.g. a company, the CBIR component should first and foremost be trained on data that the company is expected to use, rather than on general datasets, in order to maximize the recall of the CBIR component which is then used by the TaBIR.

Once again looking at the results from the previous chapter, one can see that although either the precision or the recall is high in the two implementations, the F scores are not too impressive. Only in the F2 measure was one of the TaBIR implementations able to outdo the TBIR component, and then only by a very small margin. While the F measure is a good measuring methodology, the results are not always as clear as the numbers might suggest. In the instance of the Internet, where the number of images might be beyond counting, not being able to recall some of the images might not be important, as long as the images that are shown are precise responses to the query. On a smaller scale, however, such as in the system of a small company, recall might instead be very important. With this reasoning, the following question may be asked: "what should be shown?". In Section 2.3, three methods were mentioned, with the most common being relevance-ordered presentation. Relevance-ordered presentation applied to the Precision TaBIR could show the true potential of TaBIR. In such a system, one could perform a query on both the Precision TaBIR data and the TBIR data, and simply apply a boosting value to the Precision TaBIR, which performed better than the TBIR precision-wise. In a relevance-ordered presentation, this would force images that are more relevant to the query to show up first, while other results are pushed to the back. The recall would thus go up through the use of the TBIR data, while the Precision TaBIR puts the more relevant results in front.
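As a sketch of this idea in ElasticSearch's query DSL, one could boost matches on the Precision TaBIR keywords over plain TBIR matches; the field names and the boost value are assumptions.

# Both fields are queried; matches on the (assumed) Precision TaBIR field are
# boosted so they rank first, while TBIR matches preserve recall.
query = {
    'query': {
        'bool': {
            'should': [
                {'match': {'tabir_precision_keywords': {'query': 'dog', 'boost': 3.0}}},
                {'match': {'tbir_keywords': 'dog'}}
            ]
        }
    }
}
# This body would be passed to the search endpoint,
# e.g. es.search(index='images', body=query).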

9.1 Conclusions

How much can the precision and/or recall of image search on the Internet be increased by applying a word taxonomy to data retrieved from a Text- and Content-Based Image Retrieval system individually?

The TBIR system extracts surrounding keywords and metadata, such as the title and alt-text, before applying NLP techniques such as PoS tagging, stemming and stop word lists. The CBIR component classifies the images before sending the classifications on to the component handling the taxonomy. The component handling the taxonomy was implemented in two versions. The first version focused on improving the recall of image search and was thus named Recall Taxonomy Based Image Retrieval (Recall TaBIR); the second focused on improving the precision and was thus called Precision Taxonomy Based Image Retrieval (Precision TaBIR). Recall TaBIR creates a hypernym hierarchy tree for every word received from the CBIR and the TBIR, and iterates through them in search of common taxonomy nodes. In Precision TaBIR, the same hierarchies are built, but the iteration is done in only one of the hierarchy trees, for the purpose of creating a strong filter which removes all dissimilar keywords.

The results in Chapter 8 show that while it is possible to increase either the precision or the recall depending on the implementation, the other factor will drop as a result. The F score (Table 8.3), which takes both recall and precision into account, shows that while the two implementations are able to compete in either precision or recall, they do not perform as well when both recall and precision are taken into account. Only one of the two TaBIR implementations was able to score higher than the TBIR component, and then only by a slight margin in the F2 measure (see Table 8.3 or Figure 8.4).

While the F score of TaBIR is not very good, the precision of Precision TaBIR was still able to outperform both the TBIR and CBIR baselines. One very important factor to consider is that while the F scores of the two TaBIR implementations are not as high as the baselines, the true power of TaBIR might be seen in a relevance-ordered search platform which utilises the TaBIR results as well as the TBIR and possibly also the CBIR results. By applying a boosting value to the TaBIR results, one would be able to retain a high recall while pushing high-precision results to the front, thus improving the image search experience.

9.2 Future Work

As for future work, increasing the size of the evaluation dataset would be interesting, in order to see how the system performs when there are thousands or even tens of thousands of images in the system.

Performing CBIR training on a specific type of data, to see how TaBIR could perform in a real-life corporate implementation, would also be interesting; e.g., assuming that the system would be used at a car retailer or a car enthusiast site, the CBIR component could be trained on different cars before running the system.

Another interesting approach would be to use a recent similarity measure algorithm, rather than filtering on the different data sources, to see how that affects the system.

A user study comparing the CBIR, TBIR and TaBIR components in a relevance-sorted presentation could potentially determine whether there is indeed more to the F scores than the numbers alone suggest, and whether the true potential of TaBIR can indeed be found in a combination of several methodologies, such as TaBIR with TBIR and possibly also CBIR.

It would be interesting to try out different types of architectures in the CBIR component, as the architecture of a deep learning network can drastically change the outcome of the system. Further research in word taxonomies might drastically improve the accuracy of Taxonomy Based Image Search systems.


Bibliography

[1] Aliaksandr Autayeu. extJWNL - Extended Java WordNet Library. http://extjwnl.sourceforge.net/. Accessed: 2015-04-03.

[2] Serge Belongie, Chad Carson, Hayit Greenspan, and Jitendra Malik. Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In Computer Vision, 1998. Sixth International Conference on, pages 675–682. IEEE, 1998.

[3] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.

[4] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[5] Ning-San Chang and King Sun Fu. A relational database system for images. Springer, 1980.

[6] Yixin Chen, James Ze Wang, and Robert Krovetz. An unsupervised learning approach to content-based image retrieval. In Signal Processing and Its Applications, 2003. Proceedings. Seventh International Symposium on, volume 1, pages 197–200. IEEE, 2003.

[7] Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.

[8] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR), 40(2):5, 2008.

[9] Jia Deng, Alexander C Berg, and Li Fei-Fei. Hierarchical semantic indexing for large scale image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 785–792. IEEE, 2011.

[10] Dezi. Dezi - REST search platform. http://dezi.org/. Accessed: 2015-03-07.

[11] Minh N Do and Martin Vetterli. Wavelet-based texture retrieval using generalized gaussian density and kullback-leibler distance. Image Processing, IEEE Transactions on, 11(2):146–158, 2002.


[12] Matthijs Douze, Arnau Ramisa, and Cordelia Schmid. Combining attributes and fisher vectors for efficient image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 745–752. IEEE, 2011.

[13] Elasticsearch. Elasticsearch. https://www.elastic.co. Accessed: 2015-03-07.

[14] Christos Faloutsos, Ron Barber, Myron Flickner, Jim Hafner, Wayne Niblack, Dragutin Petkovic, and William Equitz. Efficient and effective querying by image content. Journal of intelligent information systems, 3(3-4):231–262, 1994.

[15] Herbert Freeman and Larry S. Davis. A corner-finding algorithm for chain- coded curves. IEEE Transactions on Computers, 26(3):297–303, 1977.

[16] Mark A Friedl and Carla E Brodley. Decision tree classification of land cover from remotely sensed data. Remote sensing of environment, 61(3):399–409, 1997.

[17] Andrew Gibiansky. Convolutional neural networks. http://andrew.gibiansky.com/blog/machine-learning/convolutional-neural-networks/. Accessed: 2015-03-05.

[18] Tony Gill. Metadata and the world wide web. Introduction to metadata: pathways to digital information, Los Angeles, Calif.: Getty Information Institute, 9:18, 1998.

[19] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.

[20] Google. Google scholar. https://scholar.google.com/. Accessed: 2015-03-01.

[21] Jane Greenberg. Metadata extraction and harvesting: A comparison of two automatic metadata generation applications. Journal of Internet Cataloging, 6(4):59–82, 2004.

[22] Anil K Jain, Jianchang Mao, and KM Mohiuddin. Artificial neural networks: A tutorial. Computer, 29(3):31–44, 1996.

[23] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[24] Yohan Jin, Latifur Khan, Lei Wang, and Mamoun Awad. Image annotations by combining multiple evidence & wordnet. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 706–715. ACM, 2005.


[25] Young-Kee Jung, Kyu-Won Lee, and Yo-Sung Ho. Content-based event retrieval using semantic scene interpretation for automated traffic surveillance. Intelligent Transportation Systems, IEEE Transactions on, 2(3):151–163, 2001.

[26] Alireza Khotanzad and Yaw Hua Hong. Invariant image recognition by zernike moments. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(5):489–497, 1990.

[27] Karen Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4):377–439, 1992.

[28] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[29] Thomas M Lehmann, MO Gold, Christian Thies, Benedikt Fischer, Klaus Spitzer, Daniel Keysers, Hermann Ney, Michael Kohnen, Henning Schubert, and Berthold B Wein. Content-based image retrieval in medical applications. Methods of Information in Medicine, 43(4):354–361, 2004.

[30] Martin Lennon, David S Peirce, Brian D Tarry, and Peter Willett. An evaluation of some conflation algorithms for information retrieval. Journal of information Science, 3(4):177–183, 1981.

[31] Michael S Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2(1):1–19, 2006.

[32] Ying Liu, Dengsheng Zhang, Guojun Lu, and Wei-Ying Ma. A survey of content-based image retrieval with high-level semantics. Pattern Recognition, 40(1):262–282, 2007.

[33] Julie B Lovins. Development of a stemming algorithm. MIT Information Processing Group, Electronic Systems Laboratory, 1968.

[34] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval, volume 1. Cambridge university press, Cambridge, 2008.

[35] Christopher D Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT press, 1999.

[36] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, 2014.


[37] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.

[38] Babu M Mehtre, Mohan S Kankanhalli, and Wing Foon Lee. Shape measures for content based image retrieval: a comparison. Information Processing & Management, 33(3):319–337, 1997.

[39] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.

[40] Henning Müller, Nicolas Michoux, David Bandon, and Antoine Geissbuhler. A review of content-based image retrieval systems in medical applications – clinical benefits and future directions. International journal of medical informatics, 73(1):1–23, 2004.

[41] Norconex. Norconex. http://www.norconex.com/. Accessed: 2015-03-10.

[42] Virginia E Ogle and Michael Stonebraker. Chabot: Retrieval from a relational database of images. Computer, 28(9):40–48, 1995.

[43] Animal Planet. Bites @ Animal Planet. http://blogs.discovery.com/bites-animal-planet/. Accessed: 2015-03-20.

[44] Martin F Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[45] TW Rauber and AS Steiger-Garção. Shape description by unl fourier features - an application to handwritten character recognition. In Pattern Recognition, 1992. Vol. II. Conference B: Pattern Recognition Methodology and Systems, Proceedings., 11th IAPR International Conference on, pages 466–469. IEEE, 1992.

[46] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.

[47] Yong Rui, Thomas S Huang, and Shih-Fu Chang. Image retrieval: Current techniques, promising directions, and open issues. Journal of visual communication and image representation, 10(1):39–62, 1999.

[48] Yong Rui, Thomas S Huang, Michael Ortega, and Sharad Mehrotra. Relevance feedback: a power tool for interactive content-based image retrieval. Circuits and Systems for Video Technology, IEEE Transactions on, 8(5):644–655, 1998.

[49] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.


[50] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.

[51] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523, 1988.

[52] Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Compressed fisher vectors for large-scale image classification. Rapport de recherche RR-8209, INRIA, 2013.

[53] Ishwar K Sethi, Ioana L Coman, and Daniela Stan. Mining association rules between low-level image features and high-level concepts. In Aerospace/Defense Sensing, Simulation, and Controls, pages 279–290. International Society for Optics and Photonics, 2001.

[54] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, 2000.

[55] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[56] Josef Sivic and Andrew Zisserman. Video google: A text retrieval approach to object matching in videos. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1470–1477. IEEE, 2003.

[57] Arnold WM Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. Content-based image retrieval at the end of the early years. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(12):1349–1380, 2000.

[58] Apache Solr. Solr - popular open source platform built on apache lucene. http://lucene.apache.org/solr/. Accessed: 2015-03-07.

[59] Michael J Swain and Dana H Ballard. Color indexing. International journal of computer vision, 7(1):11–32, 1991.

[60] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.

[61] Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, Euripides GM Petrakis, and Evangelos E Milios. Semantic similarity methods in wordnet and their application to information retrieval on the web. 2005.


[62] Xiaoyu Wang, Ming Yang, Timothee Cour, Shenghuo Zhu, Kai Yu, and Tony X Han. Contextual weighting for vocabulary tree based image retrieval. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 209–216. IEEE, 2011.

[63] Yong Wang, Tao Mei, Shaogang Gong, and Xian-Sheng Hua. Combining global, regional and contextual features for automatic image annotation. Pattern Recognition, 42(2):259–266, 2009.

[64] Thijs Westerveld. Image retrieval: Content versus context. In RIAO, pages 276–284. Citeseer, 2000.

[65] Peter Willett. The porter stemming algorithm: then and now. Program, 40(3):219–223, 2006.

[66] John M Zachary Jr and Sitharama S Iyengar. Content based image retrieval systems. In Application-Specific Systems and Software Engineering and Technology, 1999. ASSET'99. Proceedings. 1999 IEEE Symposium on, pages 136–143. IEEE, 1999.

[67] Xiang Sean Zhou and Thomas S Huang. Relevance feedback in content-based image retrieval: some recent advances. Information Sciences, 148(1):129–137, 2002.

[68] Svitlana Zinger, Christophe Millet, Benoit Mathieu, Gregory Grefenstette, Patrick Hède, and Pierre-Alain Moëllic. Extracting an ontology of portrayable objects from wordnet. In MUSCLE/ImageCLEF workshop on Image and Video retrieval evaluation, pages 17–23, 2005.

Appendix A

Tags

This appendix contains the tags and the number of times each tag was used in the tagging process.

(’dog’, 296) (’cat’, 115) (’rabbit’, 23) (’lion’, 22) (’sloth’, 21) (’elephant’, 21) (’bird’, 18) (’polar bear’, 18) (’panda’, 16) (’fish’, 15) (’shark’, 13) (’alligator’, 11) (’hamster’, 10) (’rhino’, 8) (’horse’, 8) (’cheetah’, 7) (’hippo’, 6) (’monkey’, 6) (’catfish’, 6) (’goat’, 6) (’leopard’, 6) (’whale’, 6) (’bear’, 5) (’wolf’, 5) (’reindeer’, 5) (’tiger’, 5) (’seal’, 5) (’crocodile’, 4) (’owl’, 4) (’penguin’, 4) (’snake’, 4) (’eagle’, 4) (’monster’, 4) (’frog’, 4) (’chimpanzee’, 4) (’otter’, 4) (’bee’, 4) (’snow leopard’, 3) (’turtle’, 3) (’fox’, 3) (’ape’, 3) (’camel’, 2) (’lamprey’, 2) (’bat’, 2) (’walrus’, 2) (’orca’, 2) (’anaconda’, 2) (’red panda’, 2) (’kangaroo’, 2) (’orangutan’, 2) (’porpoise’, 2) (’lungfish’, 2) (’dolphin’, 2) (’osprey’, 2) (’deer’, 2) (’pig’, 2) (’beaver’, 2) (’lamb’, 2) (’koala’, 2) (’’, 2) (’groundhog’, 2) (’gorilla’, 2) (’white sturgeon’, 1) (’mongoose’, 1) (’blue lobster’, 1) (’snow monkey’, 1) (’snow fox’, 1) (’cougar’, 1) (’ferret’, 1) (’giraffe’, 1) (’llama’, 1) (’jaguar’, 1) (’parrot’, 1) (’spider’, 1) (’goose’, 1) (’zebra’, 1) (’leopard seal’, 1) (’cockroach’, 1) (’muskox’, 1) (’goldfish’, 1) (’slug’, 1) (’racoon’, 1) (’hellbender’, 1) (’tortoise’, 1) (’cow’, 1) (’octopus’, 1) (’donkey’, 1) (’weasel’, 1) (’duck’, 1) (’sheep’, 1) (’hen’, 1) (’squirrel’, 1) (’pangolin’, 1) (’skua’, 1) (’armadillo’, 1) (’coati’, 1) (’lizard’, 1) (’lynx’, 1) (’pelican’, 1) (’dragonfly’, 1) (’guinea pig’, 1) (’warthog’, 1) (’mouse’, 1) (’maggot’, 1) (’manatee’, 1) (’manta ray’, 1) (’buffalo’, 1) (’hedgehog’, 1)

Number of tags: 108
Total sum of tags: 823
