Automatic Recognition of Historical Buildings in Valletta using Smartphone Technology

Donna Agius

Supervisor: Dr. George Azzopardi

Faculty of ICT

University of Malta

May 2016

Submitted in partial fulfillment of the requirements for the degree of B.Sc. ICT in Artificial Intelligence (Hons.)

Abstract

Building classification is a widely researched area in computer vision. In this research project a smartphone application is developed which uses computer vision techniques. The idea is introduced by pointing out several advantages and disadvantages of such a system. The application allows the user to take a picture of a historical building, and then it automatically classifies it and gives information about the building. The technical terms used in this project are described in the following chapter, and previous work from the literature is highlighted. Building recognition is ultimately an object recognition problem, so in the second chapter we look at different approaches to object recognition algorithms applied to various building datasets. An overview of the system is described next, and the design of how the system is implemented is highlighted. The section describes how the user can upload a photo, receive information regarding the respective building and give feedback back to the system. The same chapter also describes the process of creating three distinct datasets using data from thirteen Maltese buildings. Furthermore, an "Unknown" category was included to make the project more realistic. The implementation is discussed in the fourth chapter, where the system to recognise a building is described in detail. The full operation is discussed, i.e. uploading the image, recognising the building, sending the data back to the user, and the user sending feedback back to the system. A flowchart is also drawn to show the full process of recognising the building in a query image. Next, several experiments were carried out, such as using different image processing techniques and applying the algorithm to various datasets. The application was also distributed to users for evaluation. Results show that the proposed system is very effective. The algorithm also achieved an accuracy equal to the state of the art on the Zurich Building benchmark dataset. Finally, ideas are discussed to further improve the application, such as implementing augmented reality and using deep learning algorithms.

Acknowledgements

I would like to thank my supervisor Dr. George Azzopardi for his advice, guidance and constant feedback throughout the creation of this project. I would also like to thank my parents, my sisters and Julian for their continuous support and encouragement.

Contents

1 Introduction
  1.1 Thesis Statement
  1.2 Motivation
  1.3 Scope
  1.4 Approach
  1.5 Aims and Objectives
  1.6 Report Layout

2 Background and Literature Review
  2.1 Background
    2.1.1 Feature detection and description
    2.1.2 Feature description: Scale Invariant Feature Transform (SIFT)
    2.1.3 Local Binary Patterns (LBP)
    2.1.4 Bag of visual words
    2.1.5 Machine Learning
      2.1.5.1 K-means
      2.1.5.2 Support Vector Machines
  2.2 Literature Review
    2.2.1 Application Context

3 Specification and Design
  3.1 Client-Server Model
    3.1.1 Dataset Acquisition
    3.1.2 Client Application

4 Implementation
  4.1 Client
  4.2 Server
    4.2.1 System configuration
    4.2.2 Application

5 Evaluation
  5.1 Evaluation Protocol
  5.2 Bag of Words and Vector of Locally Aggregated Descriptors
  5.3 Local Binary Patterns (LBP)
  5.4 Kernel Fusion of SIFT and LBP Features
  5.5 Normalised and unnormalised data
  5.6 Cropped and uncropped datasets
  5.7 Dataset containing images sourced online
  5.8 Confusion Matrix
  5.9 Investigating the "Unknown" Category
  5.10 Zurich Building Dataset (ZuBuD)
  5.11 Mobile Application
  5.12 Discussion

6 Future Work

7 Conclusion

Appendix: Discover Valletta - Manual

List of Figures

1  A photo of a building, captured using a smartphone.
2  Multiple buildings in the same view.
3  An example of flat, edge and corner regions in an image.
4  Example of local feature matching.
5  The figure shows how the basic LBP operator works.
6  Overview of the system data flow.
7  Some buildings from the dataset.
8  Three different categories from the three different datasets.
9  A diagram showing the application, and its different functions.
10 A spatial pyramid using three levels.
11 A flowchart that shows the required steps to label a test image.
12 Various images from the dataset, where images are sourced online.
13 Confusion matrix of the Valletta Buildings dataset.
14 True positive, false positive and false negative samples from the dataset.

List of Tables

1 An overview of the building data sets and their complexity.
2 The datasets acquired.
3 Performance across different image processing techniques.
4 Results when testing normalised and unnormalised data.
5 Accuracy garnered from cropped and uncropped datasets.
6 Accuracy gained from images sourced online, using both techniques.
7 Results obtained when omitting the "Unknown" category from the dataset.
8 Results on ZuBuD Dataset.

1 Introduction

1.1 Thesis Statement

When tourists wander around a city, they may not be familiar with every building. They may find an interesting building and want information about it as they walk along. In this research project, I develop a smartphone application which makes use of computer vision techniques. The application allows the user to take a picture of a historical building (Fig. 1), and then it automatically analyses it and gives information about the building. Valletta was chosen as the most suitable candidate for the application for multiple reasons: the city is rich in historical buildings within a rather small area. Moreover, Valletta is popular amongst tourists, and it will be the European Capital of Culture in 2018.

Figure 1: A photo of a building, captured using a smartphone.

1.2 Motivation

Building recognition and classification is an important task and an ongoing research topic, used in several applications such as video surveillance [1], navigation [2], robot localisation [3] and 3D city reconstruction [4] [5], among others. The application that I developed can be used by people of all ages, particularly tourists, and it will help users develop their cultural knowledge further, especially in a tourism context. Since Valletta is going to be the European Capital of Culture in 2018, the application is also an opportunity to increase the city's popularity in Europe.

1.3 Scope

In the proposed application, the user will be able to identify a single building for each picture submitted, i.e. multiple buildings in a single photo will not be considered. As a result, the user may crop any unnecessary clutter from the image, such as other buildings, trees, cars, etc. Finally, in order to predict the name of the building, the system will use computer vision techniques rather than the Global Positioning System (GPS). For a human, the identification of objects such as buildings is an effortless operation. It is also easy for a human to identify the same object from different angles, or when the object is skewed. For a machine, however, this is less simple, since the machine needs to identify the same building at different times of the day, under different weather conditions, and from different angles, and each scenario may present its own challenges. In this project, the data contains photos of buildings in different conditions, such as photos taken in the morning or at night. The buildings may also suffer from partial occlusions from trees, moving vehicles, or other buildings, which may interfere with the identification of the building as well.

1.4 Approach

The proposed project could be implemented as a desktop application, but I opted for a smartphone implementation because of its numerous advantages. By using a smartphone, the information about the building is given on the spot. Information is provided in real time, and therefore the user does not need to do any research beforehand. This is particularly beneficial for tourists, who can learn on the fly, and the application enhances their experience while wandering around the city. Furthermore, the proposed system is highly accessible, as it only requires a smartphone and internet access. An internet connection may be expensive for tourists; however, this issue will be minimised by the removal of roaming charges in the EU within the next few months.

The mobile platform offers several advantages over the desktop, which makes it the preferred choice of target platform. The user has access to the information on the building in real time, presented through a familiar interface with minimal distractions. Accessibility to the application also improves, since most mobile devices have application stores from which the software can be downloaded in a few steps. These stores also provide application version control, meaning updates can be delivered to the user in a more direct and convenient way. Adding more features, information and identifiable buildings to the application can be done through these updates.

In this research project, the building is identified using computer vision techniques. Identifying the building could also be done using the user's position and orientation; however, computer vision offers several advantages over a GPS-based approach. Firstly, the picture of the building can be taken from a distance and then zoomed into the building of interest, since the user is not always able to take the picture directly in front of the building itself. For example, suppose there is a river between the user and the building he wants to identify. The proposed application will be able to identify the building without the need for the user to cross the river. In such cases, using GPS is not suitable, since the user is expected to be in front of the building for his location to identify it. Another issue is that a given location may correspond to more than one building. For instance, Fig. 2 demonstrates an example of a scene with two buildings, which creates challenges for GPS-based solutions. When uploading the photo, the user can specify which building he wants to identify by zooming in on the building and cropping where necessary. In this way it is much easier to distinguish which building the user is trying to identify.

Figure 2: Multiple buildings in the same view.

Finally, in order not to overload the limited memory and battery life of the smartphone, the application will only serve as a medium to upload photos and receive information. The process that connects these two ends is a small program on a server, where the computer vision algorithms are implemented.

1.5 Aims and Objectives

The aim of this project is to build a smartphone application that allows a user to take a picture and select a region of interest, and in return automatically identifies the building within the region of interest (if any). The user is then notified with the name and description of the building. The following is a list of objectives:

1. To build a database of images of buildings from different angles, lighting conditions and weather conditions.

2. To implement a computer vision-based solution for building recognition.

3. To implement a user-friendly smartphone application that accepts an image, allows the user to enhance it, sends it to the server, and displays the information received from the server.

4. To ask users to evaluate the application.

5. To compare the proposed system with the state of the art on a public benchmark data set.

1.6 Report Layout

Chapter 2 consists of the background and literature review. The design and specifications are discussed in Chapter 3, and the implementation is discussed in Chapter 4. Results and experiments are evaluated in Chapter 5, and ideas for future work are examined in Chapter 6. Finally, the last chapter contains the conclusion.

2 Background and Literature Review

2.1 Background

Below is a description of the computer vision techniques that are used in this thesis.

2.1.1 Feature detection and description

To solve computational tasks in computer vision, such as image matching, visual features are used. These are pieces of information of an image. In images, features may have a specific form or structure, such as points, edges, or a combination of both. Features are used for neighbourhood operations, where the system compares a specific feature with neighbouring features. This allows the system to extract differences or similarities between the points. The process of detecting visual features is called feature detection. Feature detection involves methods that compute image information and make local decisions as to whether there is a feature at a particular point of the image. Feature detection is the first operation performed on an image; it is thus the basis of the algorithm, and its quality is reflected in the results of the algorithm itself. As highlighted in Figure 3, there are different types of image features. First, a feature is said to be flat when it consists of homogeneous intensity. Next, a feature is called an edge when it lies on the boundary between two image regions. A corner also lies between image regions, but additionally exhibits curvature in another direction. Another type is a blob, or region of interest, which is a complementary description of image structure in terms of regions rather than points; a region may still contain a preferred point, such as a local maximum. Finally, ridges can also be detected. A ridge can be thought of as a one-dimensional curve. Some applications of ridge detection are extracting roads from aerial images [6] and extracting blood vessels from medical images [7] [8].

Figure 3: An example of flat, edge and corner regions in an image [9].

2.1.2 Feature description: Scale Invariant Feature Transform (SIFT)

There are multiple algorithms that perform feature description, namely gradient location and orientation histogram (GLOH) [10], speeded up robust features (SURF) [11], local binary patterns (LBP) [12], histogram of oriented gradients (HOG) [13], scale invariant feature transform (SIFT) [14], Fast Retina Keypoint (FREAK) [15] and Biologically Inspired Local Descriptor (BILD) [16]. The SIFT algorithm is used to detect and describe local features in an image (see Figure 4). It is used in various applications such as object recognition and video tracking. The first step is the detection of keypoints. The SIFT algorithm uses a Difference-of-Gaussians function to locate keypoints; this function approximates the Laplacian of Gaussian with the advantage of being much faster. Secondly, the keypoint localisation step eliminates low-contrast points; the remaining points are the strong interest points. The next step is to assign an orientation to each keypoint. This is done to achieve invariance to image rotation. Using the neighbours of the current point and a histogram of local gradient directions, keypoints with the same location and scale but different orientations are created. The information around each keypoint is then represented as a high-dimensional vector, by computing the orientation of every pixel in its neighbourhood using a spatial tiling of 4 x 4 and, for each tile, constructing a histogram of 8 quantised orientations, yielding a 128-dimensional descriptor. In this project, I use the SIFT keypoint descriptor to extract information about features. Even though SURF provides a faster interface to extract and describe keypoints, SIFT is used because the process of keypoint extraction and description is implemented offline.

Figure 4: An example of local feature matching, where green lines represent correctly matched keypoints and red lines indicate mismatched keypoints [17].
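To make the toolchain concrete, the following is a minimal sketch of SIFT extraction and matching using the VLFeat MATLAB interface adopted later in this project; the image file names are hypothetical placeholders.

```matlab
% Minimal sketch: extracting and matching SIFT keypoints with VLFeat.
% The image file names are hypothetical placeholders.
im1 = im2single(rgb2gray(imread('building_view1.jpg')));
im2 = im2single(rgb2gray(imread('building_view2.jpg')));

% f is a 4 x N matrix of frames (x, y, scale, orientation);
% d is a 128 x N matrix of descriptors (4x4 tiles x 8 orientations).
[f1, d1] = vl_sift(im1);
[f2, d2] = vl_sift(im2);

% Tentative matches between the two views (vl_ubcmatch applies
% Lowe's ratio test between the best and second-best candidate).
matches = vl_ubcmatch(d1, d2);
fprintf('%d tentative matches\n', size(matches, 2));
```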

2.1.3 Local Binary Patterns (LBP)

LBP is a keypoint descriptor that extracts information about the texture of local patterns. To compute the local binary patterns of an image, a 3 × 3 sliding window is applied to the image. For each position, the 8 neighbours are compared to the pixel under consideration. The descriptor results in an 8-bit binary string, where one-bits correspond to the neighbours whose intensity is greater than that of the centre pixel, and zero-bits correspond to the remaining ones. The 8-bit binary string is then converted to a decimal value in the range [0, 255]. Figure 5 shows an illustration of this operator; a small code sketch follows the figure. Finally, an image is described with an L2-normalised histogram of 256 bins. In this project, I use a combination of LBP descriptors and SIFT descriptors in order to further improve the accuracy of the system.

Figure 5: The figure shows how the basic LBP operator works [18].
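Since the operator is simple, a minimal hand-coded sketch (assuming a hypothetical input image, and following the comparison rule described above) is:

```matlab
% Minimal sketch of the basic 3x3 LBP operator (hand-coded; the input
% file name is a hypothetical placeholder).
im = double(rgb2gray(imread('building.jpg')));
[h, w] = size(im);
centre = im(2:h-1, 2:w-1);
codes  = zeros(h-2, w-2);
% Row/column offsets of the 8 neighbours, visited in a fixed order;
% the k-th neighbour contributes the bit of weight 2^(k-1).
dy = [-1 -1 -1  0  1  1  1  0];
dx = [-1  0  1  1  1  0 -1 -1];
for k = 1:8
    nb = im(2+dy(k) : h-1+dy(k), 2+dx(k) : w-1+dx(k));
    codes = codes + (nb > centre) * 2^(k-1);
end
% Describe the image as an L2-normalised 256-bin histogram of codes.
hist256 = accumarray(codes(:) + 1, 1, [256 1]);
hist256 = hist256 / norm(hist256);
```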

2.1.4 Bag of visual words

The bag of visual words [19] is an image classification technique which follows a number of steps. First, the interest points from the images in the training set are detected and described, using for example SIFT or other techniques. The points are then quantised into a vocabulary of visual words by using a clustering technique such as K-means. Quantisation introduces generalisation and reduces redundant features. Next, each training image is described again, and for each detected keypoint we find the nearest (in terms of Euclidean distance) visual word in the vocabulary. The word frequencies are recorded in the form of a histogram with K bins, where K is equal to the size of the vocabulary of visual words. The histograms generated from the training images are then used to learn a classification model.
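A condensed sketch of this pipeline, using the VLFeat functions adopted in this project, might look as follows; the trainImages list and the vocabulary size are illustrative assumptions.

```matlab
% Minimal sketch of the bag-of-visual-words pipeline with VLFeat.
% trainImages (cell array of file names) and K are illustrative.
K = 300;
pool = [];
for i = 1:numel(trainImages)
    im = im2single(rgb2gray(imread(trainImages{i})));
    [~, d] = vl_sift(im);
    pool = [pool, single(d)]; %#ok<AGROW>  % 128 x N descriptor pool
end

% Step 1: quantise the descriptors into a vocabulary of K visual words.
vocab = vl_kmeans(pool, K);

% Step 2: re-describe each image as a K-bin histogram of word counts,
% assigning every keypoint to its nearest (Euclidean) visual word.
hists = zeros(K, numel(trainImages));
for i = 1:numel(trainImages)
    im = im2single(rgb2gray(imread(trainImages{i})));
    [~, d] = vl_sift(im);
    [~, words] = min(vl_alldist2(vocab, single(d)), [], 1);
    hists(:, i) = accumarray(words(:), 1, [K 1]);
end
% hists is then used to learn a classification model (e.g. an SVM).
```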

2.1.5 Machine Learning

Machine learning [20] is a field of artificial intelligence in which computer programs are able to learn from experience; such programs improve as they are exposed to more data. Some machine learning techniques require labels in order to train, and some do not. The former are called supervised learning algorithms, and their labelled training data consists of desired input and output examples. Given the training data, which consists of a series of inputs and outputs, a learning algorithm learns a function that maps input vectors to output labels. Some examples of supervised learning algorithms include Neural Networks [21], Learning Vector Quantisation [22] and Support Vector Machines [23]. On the other hand, unsupervised learning does not use any labelled data. Instead, given some unlabelled data, an unsupervised learning algorithm tries to identify hidden structure, for example by clustering. Examples of unsupervised learning algorithms include K-Means, Self-Organizing Maps [24], Neural Gas [25] and Gaussian Mixture Models [26]. Besides computer vision, machine learning is used in many non-visual applications, including adaptive websites [27], game playing [28], natural language processing [29] and search engines [30], among others.

2.1.5.1 K-means

K-means clustering is an unsupervised machine learning algorithm. Given a value of k and a set of data points, the algorithm determines k subsets of the data points. This is achieved by repeatedly assigning points to the cluster with the nearest centroid (initialised randomly) and then recomputing each centroid as the average of its assigned points. K-means clustering is used for vector quantisation, i.e. to reduce a large number of data points to a smaller number of representatives. In this project, k-means is used to construct a vocabulary of visual words. Hundreds of thousands of keypoints are extracted from all training images, and by using k-means this set of keypoints is represented with a vocabulary of k visual words. There are different alternatives for determining the best value of k. One of them is using a cluster analysis index such as the Dunn index [31], and another is cross validation. In this project, 10-fold cross validation is applied to test different values of k, because it prevents overfitting of the data; a sketch of this procedure follows.
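A minimal sketch of this model-selection step is shown below; the candidate values and the helper evaluateVocabulary, which stands in for the full vocabulary/histogram/SVM pipeline, are hypothetical.

```matlab
% Minimal sketch of selecting k by 10-fold cross validation. The
% candidate values and evaluateVocabulary (a stand-in for the full
% vocabulary/histogram/SVM pipeline) are hypothetical.
candidates = [100 200 300 400 500];
n = numel(trainLabels);                 % one label per training image
folds = mod(randperm(n), 10) + 1;       % random fold assignment, 1..10
meanAcc = zeros(size(candidates));
for c = 1:numel(candidates)
    acc = zeros(10, 1);
    for f = 1:10
        trainIdx = (folds ~= f);
        valIdx   = (folds == f);
        acc(f) = evaluateVocabulary(candidates(c), trainIdx, valIdx);
    end
    meanAcc(c) = mean(acc);             % average accuracy over folds
end
[~, best] = max(meanAcc);
fprintf('best k = %d\n', candidates(best));
```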

2.1.5.2 Support Vector Machines

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. It is a supervised algorithm: given labelled training data, the algorithm learns an optimal hyperplane which classifies new examples. In this project, SVMs are used to predict the correct label of a new instance.
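With the LIBSVM MATLAB interface used in this project, training and prediction reduce to two calls; the variable names below are illustrative.

```matlab
% Minimal sketch of classification with LIBSVM's MATLAB interface.
% trainHists/testHists hold one histogram per row (double precision,
% as LIBSVM expects); labels are numeric building IDs. Names are
% illustrative.
model     = svmtrain(trainLabels, trainHists, '-t 0 -c 1');  % linear SVM
predicted = svmpredict(testLabels, testHists, model);        % one-vs-one
```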

2.2 Literature Review

The main focus of the research is smartphone technology and computer vision techniques, and how they evolved throughout the years. A recent survey [32] reviews papers that use computer vision techniques and apply them to building recognition. Back in the year 2000, the first camera phone, with an embedded 0.3-megapixel camera, was introduced by Sharp; developing a project like this one would not have been possible at that time. In 2003, Hyper-Polyhedron with Adaptive Threshold Indexing (HPAT) [33] was proposed as a method to retrieve images from a database based on local features. To execute the method, intensity-based regions were first derived at multi-scale intensity extrema of a Gaussian scale space, and each region was described by a set of nine generalised colour moment invariants. As a result, the derived local features were robust to illumination and viewpoint changes. In conclusion, HPAT indexing was able to identify buildings over a wide range of viewpoints, occlusions, partial visibility and changes in lighting conditions, but lacked good performance for larger databases because of too-similar regions. As reviewed in the 2014 survey [32], the method scored a 77.3% accuracy rate on the Zurich Building Database (ZuBuD) [34]. A couple of years later, Robin J. Hutchings and Walterio W. Mayol [35] were able to match buildings using photographs from mobile phones, by using an external mobile application called Mobile Bristol. The matching process located interest points using SIFT descriptors [14] and the Harris-Stephens corner detector [36], and matched them against interest points in the database of images. GPS was also used to simplify the search space, as it eliminated buildings which were not close to the user, thus reducing the computational cost. Furthermore, a system of planar rectification was also developed to deal with perspective changes. Finally, the system was tested on a wide range of distinct buildings, conditions and viewpoints, and the buildings were identified correctly in most cases, except under extreme changes of lighting. The following year, Groeneweg et al. [37] implemented a fast offline building recognition method based on intensity-based region detection [38] and Principal Component Analysis (PCA) [39], which was also tested on a mobile platform (a Sony Ericsson K700i with a 0.3MP camera). To implement the method, every image in the database was downsampled, invariant regions were detected at local intensity extrema, and a parallelogram was fitted to each detected region. In order to make the parallelogram more distinctive, its size was doubled. The contents of each region were resized to a fixed size of 10 x 10, and the Red Green Blue (RGB) colour values of each pixel were computed. To make the characterised region invariant to illumination changes, the region was normalised by dividing each value by the sum of the intensities of all pixels in the region, and for a compact representation, PCA was applied. Next, to remove repeated regions, the features were grouped into clusters and each cluster was characterised by its centroid. For every image in the database, 100-bin histograms were built for the R and G channels. Each histogram was then normalised and stored in the database. Finally, the chi-squared distance was used to measure distances between the query images and the histograms stored in the database. On a mobile phone platform, this method reduces the computational cost and storage requirements, but it was not invariant to rotation and was sensitive to illumination changes. A 92% recognition rate was noted, based on ZuBuD [34]. Back in 2007, the famous iPhone was released, one of the first smartphones to be developed. With the arrival of smartphones, the development of better phone cameras stalled for a time; the cameras featured during that period ranged from 5MP to 8MP. In the same year, a couple of other methods were created for the building recognition task, but were not tested on mobile devices. Zhang and Košecká [40] proposed a hierarchical building recognition (HBR) system based on vanishing point detection and localised colour histograms. To achieve this, line segments were detected and grouped into dominant vanishing directions, and vanishing points were estimated using the expectation maximisation (EM) algorithm. Next, each image pixel with a gradient magnitude above a previously defined threshold was assigned to a particular group, and localised colour histograms were computed. Finally, the chi-squared distance was computed between histograms, and results were also refined by extracting SIFT [14] features and applying a simple probabilistic model to integrate the evidence from individual matches. This method tends to be quite efficient, but its limitations include identifying only one building per picture, a long processing time for extracting features, and good recognition only when the building is large enough and has a simple background. The HBR technique scored a 95% accuracy rate on ZuBuD [34]. In the same year, a ranking scheme [41] to search for building façades in a large corpus was proposed. This was done by first extracting affine-invariant Hessian regions and describing them by 128-dimensional SIFT descriptors. Using k-means, the descriptors are clustered into a visual vocabulary.

Consequently, each affine region is mapped to the closest visual word, and thus the image is represented as a bag of visual words. Next, the search engine implements a vector-space model and calculates the similarity between the query vector and each image vector in the database, using the visual words and their corresponding term frequency-inverse document frequency (tf-idf) [42] weighting. Finally, the best results are re-ranked by estimating a transformation between the query region and each target image, and the target images are then re-ranked based on the discriminability of the spatially verified visual words. The algorithm recorded one of the best performances on the Oxford Building Database (OBD) [43], of 95.3%, despite the spatial matching process, which adds to the computational burden. Trinh et al. [44] proposed a method to identify multiple buildings using an image database of Ulsan metropolitan city in South Korea. Windows, doors and walls were extracted based on line segments and vanishing point detection. Next, wall colour histograms were calculated on the pixels. As a result, the candidate model distinguishes between a multi-building recognition task and a single building containing several faces. In addition, SIFT [14] features are used to describe each building and, using the nearest neighbour rule, the closest model was selected for each test image. Currently, this is the only algorithm able to extract and recognise multiple buildings, and it is also claimed to outperform all other approaches. The multiple-buildings technique scored a 97.5% recognition rate. In 2009, a state-of-the-art biologically-plausible building recognition (BPBR) scheme [45], i.e. a biologically-inspired feature extraction with dimensionality reduction, was proposed. First, the images were divided into subregions, and the Global Invariant Scale Transform (GIST) features were extracted from each subregion. As a result, each column of the GIST feature matrix corresponded to a visual feature in the original image. Using an information-based dimensionality reduction, the method was able to determine whether two feature matrices can be neighbours when constructing a neighbour graph. To conclude, the approach delivered various advantages, including invariance to geometric and photometric transformations under different lighting conditions, and low computational cost. A performance of 85.3% was noted in the survey [32].

Finally, in 2013, a novel building recognition model, the Steerable filter-based building recognition (SFBR) model [46], was proposed. The method uses a combination of local oriented features, feature pooling, and dimensionality reduction for computational effectiveness and efficiency. During the feature representation part, both global features, i.e. features that represent the image as a whole, and local features, i.e. patterns that differ from their neighbours, were used. Next, feature pooling was used to achieve robustness to small shifts in position and changes in lighting conditions, and dimensionality reduction was used to remove redundant features, making the representation compact. Finally, for the classification of query images, a Support Vector Machine (SVM) [47] was used to discriminate between different buildings. As a result, SFBR is a practical, modular and easy-to-implement model which is very effective for building recognition. However, this model is not meant to substitute other building recognition algorithms, but can be used as an alternative solution, especially for many vision-based applications. In the survey [32], the SFBR technique scored 94.7% on the Sheffield Building Image Database (SBID) [45]. Table 1 shows the different datasets used in the literature. Depending on the user's mobile device processing power, the building identification process will need to be tuned to be more accurate than efficient, or vice versa. Tests will need to be performed to see the impact of the processing on the device and on the user experience. Experiments are needed to see whether the user prefers a more accurate but slower result, or a less accurate but faster result.

Dataset Name   Number of buildings   Training Images   Testing Images   Total
ZuBuD [34]     201                   1005              115              1120
SBID [45]      40                    1596              1596             3192
OBD [43]       17                    5062              55               5117

Table 1: An overview of the building data sets and their complexity.

Finally, introducing the image classification idea into a smartphone makes the application flexible and convenient for its users. Moreover, this work is needed because there is no system yet suitable for Maltese historical buildings. Even though there exist algorithms for different buildings, such as buildings in Zurich [34] and Oxford [43], the same algorithms may not be suitable for Maltese ones, since not every building has the same architectural structure. Moreover, I opt to use the bag-of-visual-words approach with handcrafted features (SIFT) in contrast to convolutional neural networks [48], because the latter technique requires learning millions of parameters and is therefore mostly suitable for applications with thousands of training examples; here we only deal with a few hundred training examples.

2.2.1 Application Context

When developing a mobile application, human-computer interaction needs to be taken into consideration. Developing context-aware computer applications should make it easier for users to interact with the device [49], while implementing automated data collection at the same time. Context has been defined on multiple occasions and by different people; G. D. Abowd et al. [49] summarise the different definitions. In short, context is any information that can be used to describe the situation of an entity, where an entity is a person, place or object relevant to the user-application interaction. In this project, context is very important to the application. The user-centric design of the app, i.e. an application revolving around the user, works only if the user is in Valletta. If pictures of other buildings are taken, they may result in an "Unknown" label, or another label by mistake. Furthermore, the context tends to change, since photos of a building can be taken at different times, i.e. morning, afternoon, evening or night. Moreover, the photos can be taken in different lighting conditions, as a result of the weather, i.e. rainy, sunny or cloudy. On the other hand, data collection happens after the user submits whether the result is correct, along with the actual name of the building. Apart from that, other data is collected, such as a timestamp and the GPS location.

3 Specification and Design

The following section describes aspects of the design of the project. This includes discussions about the interface of the mobile application and justifications for its proposed design.

3.1 Client-Server Model

In this section, the architecture of the client-server model is described. There are three major parts that make up this model: the client interface, the server, and the database. Fig. 6 illustrates the architecture of the proposed system.

Figure 6: Overview of the system data flow.

3.1.1 Dataset Acquisition

The dataset that I created contains 250 images for each of thirteen individual buildings¹, examples of which are shown in Figure 7. The images were taken using a variety of cameras, including modern smartphones. Two professional cameras were used, a Nikon D5100 and a Canon EOS 450D. These produced high-quality images with resolutions of 4160 × 3120 pixels, 4928 × 3264 pixels and 4608 × 3456 pixels. The rest of the images were taken using the following smartphones: Samsung Galaxy S4 Mini and OnePlus 2, with resolutions ranging from 1280 × 720 pixels to 3088 × 2056 pixels. For efficient processing, the images were downsized to a fixed width of 256 pixels, whilst maintaining the aspect ratio. Time of day and light conditions were factored in, to further improve the variety in the dataset. Furthermore, occlusions from different sources, like trees, cars, people and other buildings, were purposely introduced, to challenge the algorithms to identify the building despite these occlusions. During the testing and evaluation phase, the 250 images were randomly split 150/100 for training and testing respectively. Two other datasets were also created: the first consists of images from the original dataset but without occlusions, which I call "Cropped Valletta Buildings", and the second is a much smaller dataset consisting of images taken from online sources, called "Online Valletta Buildings". Table 2 shows the list of the datasets and Figure 8 shows images from the different datasets.

¹ 1) President's Palace, 2) National Library, 3) Law Courts, 4) St Catherine of Italy Church, 5) St Paul's Cathedral, 6) Gunpost, 7) Auberge de Castille, 8) Auberge de Bavier, 9) Upper Barrakka Lift, 10) Siege Bell Memorial, 11) Victoria Gate, 12) Parliament House, 13) Fort St. Elmo

Figure 7: Some buildings from the dataset: (a) National Library, (b) Auberge de Castille, (c) Upper Barrakka Lift, (d) Law Courts, (e) Siege Bell Memorial, (f) St Catherine of Italy Church.

Dataset                      Total no. of pictures
Valletta Buildings           3500
Cropped Valletta Buildings   3500
Online Valletta Buildings    554

Table 2: The datasets acquired.


Figure 8: Three different categories (5 - St. Paul's Cathedral, 12 - Parliament House, 11 - Victoria Gate) from the three different datasets: (a) Valletta Buildings, (b) Cropped Valletta Buildings, (c) Online Valletta Buildings.

3.1.2 Client Application

The first part of the design is an application that lets the user select photos and upload them. The application is able to take photos, edit them by cropping or rotating, and upload them to the server to be queried. When the processing on the server is done, the user is presented with the results for the photo he submitted. The results include the name and description of the building and a link for further reading. In order to enhance the parameters of the application, the user is also asked to give feedback regarding the results, if he knows the building. Implementing the application in this manner requires an internet connection, and even though this may be a disadvantage, roaming charges are soon to be removed in the EU, and there are various spots that offer free Wi-Fi in the city.

4 Implementation

The following section describes every aspect of the implementation of the project. This includes a detailed explanation of the computer vision techniques used to allow the application to recognise buildings.

4.1 Client

When the user launches the smartphone application, the home screen is presented. Pressing the Get Started Now! button will prompt the user to select a picture, by either taking a photo using the camera or selecting an existing one from the gallery. In the photo capture screen, the user has the option to use flash, focus on a particular region, and change camera modes. After a photo is selected, either from the gallery or from the photo capture screen, the user has the opportunity to crop the image. This is convenient for the user to remove any clutter surrounding the building, which could improve the recognition rate. In addition, a floating circular menu is present with the following options: changing the photo if needed, cropping a region of interest, and rotating clockwise or anti-clockwise. An upload progress notification is shown as soon as the upload starts. Finally, the image saved on the device is uploaded to the server, in order for the image to be processed. The processing done on the image on the server side is discussed in a later subsection. The results are displayed after they are received from the server. These include the name and description of the corresponding building, along with a hyperlink for further reading. Finally, the user is prompted to specify whether he is satisfied with the result, through the feedback form. If the user is not satisfied, for instance due to a result which is completely unrelated to the building, then the user is asked to input the name of the building, if known. For example, if the President's Palace is misclassified as the National Library, the user can input the correct building, i.e. the President's Palace. Figure 9 shows the different functions of the application mentioned in this section.

Figure 9: A diagram showing the application, and its different functions.

4.2 Server

This section describes all the processing required on the server in order to receive the image and give back results to the client. Once a connection is established between the client and the server, the image, saved as a bitmap on the device, is encoded into a Base64 string, and the image data is sent to the server using an HTTP POST request. The image data is received on the server side by a PHP script. The script creates a unique name for the image by concatenating "image" and the database record ID, so that no two images can have the same name on the filesystem. If the received image was cropped by the client, "CROP" is also concatenated to the unique image name. Using MATLAB, a User Datagram Protocol (UDP) interface was implemented, so that when the PHP script saves a received image, the name of the image is sent to the interface. When the MATLAB program receives this notification, it loads the new image into memory and starts the recognition algorithms. In order to use these computer vision algorithms, two libraries were used: the VLFEAT [50] library, an image processing library, and LIBSVM [51], a support vector machine library. Furthermore, the necessary files, such as the vocabulary, histograms and kernels, are stored beforehand, in order to avoid expensive recomputation. As mentioned beforehand, the dataset consists of 13 distinct buildings, and for each building there are 250 different images. Another label named "Unknown" was included in the dataset, in order to distinguish the buildings from other unknown objects. The "Unknown" category has the same number of images, so that there is no bias. To predict the label of the building, the bag of words approach is used, as mentioned in Section 2. This was chosen after comparing the bag of words with other techniques, such as the Vector of Locally Aggregated Descriptors (VLAD) [52] technique, which is discussed in the testing and evaluation part of this project. Using the bag of words approach introduces generalisation and reduces redundant features, thus improving the accuracy.

4.2.1 System configuration

The system configuration requires several stages in order to be able to process different images. The training process consists of keypoint detection, feature description, and vocabulary creation. The test images are then matched with the processed dataset by the use of Support Vector Machines. For the bag of words approach, a vocabulary of words is required to quantise the keypoints extracted from the training images. Since the images are large and vary in size, they are downsized to a width of 256 pixels so that the process becomes more efficient; the height is resized automatically to keep the same aspect ratio. Next, images are described using Dense SIFT, which bypasses the keypoint detection step and instead considers a set of keypoints uniformly sampled in intervals of n pixels; a descriptor is then computed for the local pattern around each sampled keypoint. As a result, Dense SIFT is slower, but a larger number of keypoints is extracted. The difference in speed between Dense SIFT and SIFT is negated by the fact that the processing is not done in real time.

A visual vocabulary is created by applying the K-means algorithm on all the SIFT keypoint descriptors extracted from the training images. A 10-fold cross validation on the training set is used in order to determine the best value of K, i.e. the size of the vocabulary. After numerous tests it was determined that the best value of K is 300, for both the cropped and non-cropped systems.

After the vocabulary is created, the next step in the bag of words approach is computed. The technique requires the image to be represented as a histogram of words; hence, this part discusses the process of turning every image into such a histogram, known as a bag of visual words. The histogram describes the frequency of the different visual words in an image; for example, a particular word may appear ten times in one image, but twice in another. In this way, a larger amount of information is gathered from an image than from raw pixels alone. In the most simplistic form of the bag of words, Dense SIFT keypoints are again described for each image, and the k-nearest neighbour algorithm is applied to find the closest word in the vocabulary established in the previous part. Thus, an image can be represented as a histogram of 1 × K bins, with K being the size of the vocabulary, in this case 300.

An improvement can be made by preserving spatial information to some extent. To preserve spatial information, the image is split into multiple spatial tiles, and a histogram is created for each tile. The total amount of information per image thus increases, which may improve results. Lazebnik et al. [53] describe a method to combine spatial information with the bag of words approach by using a spatial pyramid. Figure 10 shows how an image is dissected for different levels. Since the bag of words approach is applied to each tile, each tile is represented by a histogram of 1 × 300 bins. The system computes the first three levels of each image. Concatenating the three levels of each image results in a total of 21 × 300 = 6300 bins, a significantly larger representation than the initial 300 bins per image. A sketch of these steps for a single image is given after Figure 10.

Figure 10: A spatial pyramid using three levels. This example has a vocabulary of three visual words, indicated by the plus, circle and diamond markers. The bars at the bottom show the quantities of each word present in the respective tiles. Image taken from [53].
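The sketch below illustrates these steps for a single image: dense SIFT extraction on a width-256 image, quantisation against the vocabulary, and tiling for one pyramid level. The 'Step' and 'Size' values are illustrative, and vocab is assumed to be the 128 × 300 vocabulary described above.

```matlab
% Minimal sketch: dense SIFT + per-tile histograms for one pyramid level.
% 'Step'/'Size' values are illustrative; vocab is the 128 x K vocabulary.
K  = 300;
im = imresize(imread('query.jpg'), [NaN 256]);    % fixed width 256 px
im = im2single(rgb2gray(im));
[frames, descrs] = vl_dsift(im, 'Step', 4, 'Size', 8);

% Assign every dense keypoint to its nearest visual word.
[~, words] = min(vl_alldist2(single(vocab), single(descrs)), [], 1);

% Level-1 tiling (2 x 2): one K-bin histogram per tile, concatenated.
[h, w] = size(im);
tileHists = zeros(K, 4);
tileIdx = (frames(1,:) > w/2) + 2*(frames(2,:) > h/2) + 1;  % tile 1..4
for t = 1:4
    tileHists(:, t) = accumarray(words(tileIdx == t)', 1, [K 1]);
end
pyramidLevel1 = tileHists(:);    % 4*K-dimensional part of the pyramid
```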

One disadvantage of this technique is the computational cost, compared to the simple bag of words approach. As mentioned above, k-nearest-neighbour classification is used to assign each keypoint to a word in the vocabulary. As an improvement, this step was made faster by using a k-dimensional tree (k-d tree). A k-d tree is a special type of binary space partitioning tree, which allows searching in logarithmic time. In this case, the k-d tree was built from the vocabulary, and its fast search was used instead of plain k-nearest neighbour. Nevertheless, the computational time was not an issue, since most of the processing is done off-line. The mobile application only classifies one building at a time, so the classification time is reduced to seconds.

The final step in building identification is to classify the images. The histograms of both training and test images are used to identify the buildings in the test images. This is done by using a machine learning algorithm called Support Vector Machines (SVM). In this work I use the pyramid match intersection kernel with SVM, as it is the most appropriate kernel to compare histograms at different levels [54]. This type of kernel offers two advantages: it is very fast, and it allows for multi-scale features, as shown in Figure 11. The pyramid match kernel, A, is a similarity function made up by taking the sum of the minimum values at each level between two histograms, Ψ(x) and Ψ(y); a weighted sum of the number of feature matches at each level is calculated. The weight w_i, determined by w_i = 1 / 2^{L−i}, where L is the number of levels and i is the current level (starting from 0), is applied to the function. w_i is directly proportional to the level, i.e. the higher the level, the bigger the weighting. Finally, N_i represents the number of newly matched pairs at level i:

A(Ψ(x), Ψ(y)) = Σ_{i=0}^{L} w_i N_i     (1)

If A is large, the histograms are similar; if A is small, the histograms are different. The kernel A is computed for both training and testing images. To save time, the kernel for the training images is saved on the server; when a test image is received and its Dense SIFT features are described, a pyramid match kernel is computed for that particular image.
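A minimal MATLAB sketch of Eq. 1 is given below; it assumes the per-level histograms are stored in cell arrays, and obtains the per-level match counts N_i from histogram intersections, as is standard for this kernel [54].

```matlab
% Minimal sketch of the pyramid match kernel of Eq. 1 (assumptions:
% Hx{i+1} and Hy{i+1} hold the concatenated, unnormalised tile
% histograms of level i; new matches N_i come from intersections).
function a = pyramid_match(Hx, Hy)
    L = numel(Hx) - 1;                             % levels 0 .. L
    inter = zeros(L + 1, 1);
    for i = 0:L
        inter(i+1) = sum(min(Hx{i+1}, Hy{i+1}));   % matches at level i
    end
    a = inter(L+1);                                % w_L = 1 at finest level
    for i = 0:L-1
        w = 1 / 2^(L - i);                         % coarser levels weigh less
        a = a + w * (inter(i+1) - inter(i+2));     % N_i = newly matched pairs
    end
end
```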

4.2.2 Application

When an image is introduced, it is split into levels in order to be processed. For each level in the spatial pyramid, Dense SIFT is applied to describe uniformly sampled keypoints of the image. The descriptors are then quantised using a k-d tree to create a bag of words, i.e. a histogram. In addition, each histogram is normalised to unit length. The normalised histograms of the tiles are then concatenated into one long histogram. Finally, the image is classified using an SVM and the pyramid match kernel. Cropped test images are matched with the "Cropped Valletta Buildings" dataset discussed in Section 3.1, whilst for non-cropped images we use the original reference "Valletta Buildings" dataset. If a match is found, the corresponding unique building number is sent back in an HTTP response to the client, in order for the client to display the name, description, and further-information hyperlink of the corresponding building. The full process for each test image is shown in Figure 11.

Figure 11: A flowchart that shows the required steps to label a test image.

The user is encouraged to give feedback about the result. If the user realises that the result is incorrect and manages to learn the real name of the building from elsewhere, he or she may submit the correct name of the building. When the user submits a response, an HTTP POST request containing a number of parameters is sent to the server; these parameters are stored in the database when received. The parameters include: whether the prediction was correct, the prediction in question, the user-submitted name of the building if the prediction was wrong, a timestamp, the image name, and the GPS location if available. In this work, we use this feedback to measure the performance of the proposed system. In future, we intend to use this feedback to fine-tune the parameters of the system.

5 Evaluation

The evaluation process was carried out on different datasets, some of which I created and some of which belong to other sources. This section also includes a discussion of the results obtained.

5.1 Evaluation Protocol

In order to achieve proper evaluation and testing, different performance measurements are used. This building recognition problem is a multi-class problem, i.e. a problem of classifying instances into multiple classes, in this case 14. Accuracy is a statistical measure suited to such multi-class problems; it is computed from the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts. The accuracy rate is the ratio of correct matches to the total number of samples:

accuracy = (TP + TN) / (TP + FP + FN + TN)     (2)

Using the example of 7 - Auberge de Castille, a particular building in the dataset, and a test that predicts the building, the following terms describe how the test classifies the instances (a small worked sketch follows the list):

1. A true positive (TP) instance is when the system predicts the label Auberge de Castille and the building is really Auberge de Castille.

2. A true negative (TN) instance is when the system does not predict Auberge de Castille and the building is not Auberge de Castille.

3. A false positive (FP) instance is defined when the system predicts Auberge de Castille, but the building is not Auberge de Castille.

4. A false negative (FN) instance is defined when the system does not predict Auberge de Castille, and the building is really Auberge de Castille.
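As a small worked sketch of these definitions and of Eq. 2 (the predicted and actual label vectors are hypothetical):

```matlab
% Worked sketch of the four counts and of Eq. 2 for one class.
% 'predicted' and 'actual' are hypothetical label vectors (values 1..14).
c  = 7;                                    % 7 - Auberge de Castille
TP = sum(predicted == c & actual == c);    % predicted c, truly c
FP = sum(predicted == c & actual ~= c);    % predicted c, not c
FN = sum(predicted ~= c & actual == c);    % missed a true c
TN = sum(predicted ~= c & actual ~= c);    % correctly rejected
accuracy = (TP + TN) / (TP + FP + FN + TN);   % Eq. 2
```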

5.2 Bag of Words and Vector of Locally Aggregated Descriptors

A number of tests were performed on the original dataset, i.e. the dataset containing Valletta buildings along with occlusions such as people and trees. The first test applied the bag of words approach, by creating a vocabulary and building a histogram for each image using Dense SIFT, with a linear-kernel support vector machine to classify each test image. The technique resulted in an accuracy of 72.85%. An improvement to the bag of words technique was to include the spatial information of the image by dissecting the images into tiles. In fact, in the next iteration of tests, the images were split into 2 × 2 tiles, scoring an accuracy of 82.42%. A support vector machine with a linear kernel was again used to classify the test images. Finally, following the approach of Lazebnik et al. [53], a spatial pyramid was implemented as an improvement to the bag of words approach. The image was split into tiles for the first three levels, resulting in a hierarchy of regions. A support vector machine was used here as well, but with a pyramid match kernel instead of a linear kernel. Using these techniques, the accuracy gained was 84.64%. Complete results are shown in Table 3. Another set of tests was implemented to evaluate a different technique for extracting and describing the keypoints: the Vector of Locally Aggregated Descriptors (VLAD) technique [52]. This technique works by finding, for each visual word in the vocabulary, the nearest keypoints in the test image, and summing the residuals in each dimension. This means that, in our case, with 128 dimensions for each visual word, we end up with 128 accumulated residuals per word, i.e. a feature vector of 300 × 128 elements, and the vector is normalised using L2 normalisation [55]:

v_{ij} = Σ_{x : NN(x) = c_i} (x_j − c_{ij})     (3)

where v is the VLAD vector, c_i is a visual word, and x is a local descriptor associated with its nearest visual word c_i = NN(x). Similar to the tests above, three tests were performed on the images using the VLAD technique. First, the VLAD technique was applied to each image as a whole, and a support vector machine with a linear kernel was used for classification. Secondly, the images were split into 2 × 2 tiles and the VLAD technique was applied to each tile, again classified with a linear-kernel support vector machine. Finally, the spatial pyramid was implemented for the first three levels, the VLAD technique was executed on each tile of the three levels, and a support vector machine with a pyramid match kernel was used to classify the test images. The performance recorded was 85.78%, 84.71% and 85.78%, respectively. Even though the accuracy is slightly higher when using the VLAD technique compared with the bag of words, the processing time was noted to be significantly longer. Also, when using the VLAD technique, tiling is not necessary, as it does not improve the results. Table 3 reports the complete results for both the VLAD technique and the bag of words technique, together with their respective classification times for processing one image. When combined with LBP, VLAD's classification time for one image is 28.130 seconds, while for BOW combined with LBP the processing time is only 5.397 seconds.

Image Split              BOW Accuracy (%)   Classification Time (s)   VLAD Accuracy (%)   Classification Time (s)
No tiling                72.85              0.914                     85.78               2.421
Spatial Tiling (2 × 2)   80.42              1.009                     84.71               12.673
Spatial Pyramid          84.64              1.179                     85.78               10.780

Table 3: Performance across different image processing techniques.
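For completeness, a minimal sketch of the VLAD encoding evaluated above, using VLFeat's implementation and the same illustrative parameters as before, might be:

```matlab
% Minimal sketch of VLAD encoding for one image with VLFeat (parameter
% values are illustrative; vocab is the 128 x K k-means vocabulary).
im = im2single(rgb2gray(imread('query.jpg')));      % hypothetical image
[~, descrs] = vl_dsift(im, 'Step', 4, 'Size', 8);
descrs = single(descrs);
K = size(vocab, 2);

% Hard-assign every descriptor to its nearest visual word via a k-d tree.
kdtree = vl_kdtreebuild(vocab);
nn = vl_kdtreequery(kdtree, vocab, descrs);
assign = zeros(K, size(descrs, 2), 'single');
assign(sub2ind(size(assign), double(nn), 1:size(descrs, 2))) = 1;

% Accumulate per-word residuals as in Eq. 3; the encoding is
% L2-normalised (VLFeat's default). For K = 300 this gives the
% 300 x 128 = 38400-element vector described in the text.
enc = vl_vlad(descrs, vocab, assign);
```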

5.3 Local Binary Patterns (LBP)

LBP is a visual descriptor which, unlike SIFT, extracts information about the texture of local patterns. LBP descriptors were used to test whether extracting the texture of the image scores a higher accuracy than extracting the shape of the local patterns, i.e. using SIFT descriptors. LBP was applied using the spatial pyramid and the pyramid match kernel. By applying LBP to the original dataset, an accuracy of 85.5% was recorded, a result slightly higher than that obtained with SIFT descriptors. Complete results with LBP descriptors on several datasets are given in Table 5.

27 5.4 Kernel Fusion of SIFT and LBP Features

In order to achieve a higher accuracy rate, the next step was to combine the SIFT- and LBP-based kernels into a fused pyramid match kernel. The combined kernel, F, was created by summing the training kernel for the SIFT descriptors, A_SIFT, and the training kernel for the LBP descriptors, A_LBP, as shown in the equation below. A weight α is used to give the kernels different importance:

F = (1 − α) · A_SIFT + α · A_LBP     (4)

The combined kernel was used in both systems discussed in the previous sections, i.e. bag of words (BOW) and Vector of Locally Aggregated Descriptors (VLAD). Cross validation on the training set was used for both systems to determine the best weighting parameter α. From this cross validation we found that the best α parameters for BOW and VLAD are 0.98 and 0.58, respectively. Using this approach we then obtain accuracy rates of 88.00% with BOW and 88.36% with VLAD on the original dataset.
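A minimal sketch of forming the fused kernel of Eq. 4 and selecting α by cross validation with LIBSVM follows; the α grid is an illustrative assumption.

```matlab
% Minimal sketch: fusing precomputed SIFT and LBP kernels (Eq. 4) and
% selecting alpha by cross validation with LIBSVM. The alpha grid is
% illustrative. Ksift, Klbp: n x n training kernels; trainLabels: n x 1.
bestAcc = 0; bestAlpha = 0;
n = numel(trainLabels);
for alpha = 0 : 0.02 : 1
    F = (1 - alpha) * Ksift + alpha * Klbp;
    % '-t 4' selects a precomputed kernel; column 1 holds sample IDs.
    % '-v 10' makes LIBSVM report 10-fold cross-validation accuracy.
    acc = svmtrain(trainLabels, [(1:n)', F], '-t 4 -v 10');
    if acc > bestAcc, bestAcc = acc; bestAlpha = alpha; end
end
fprintf('best alpha = %.2f (CV accuracy %.2f%%)\n', bestAlpha, bestAcc);
```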

5.5 Normalised and unnormalised data

Several tests were executed to determine whether L2-normalised data achieves a higher accuracy than data which is not normalised. In this test the original dataset discussed in Section 3.1 was used, and both techniques were implemented: the bag of words using the spatial pyramid and pyramid match kernel, and VLAD using no tiling and a linear kernel. Only SIFT descriptors are used in this experiment, as the LBP descriptors are already normalised within the library function. As shown in Table 4, it does not matter whether the histograms are normalised or not, because the results are comparable.

Image State     BOW Accuracy (%)   VLAD Accuracy (%)
Normalised      84.64              85.78
Unnormalised    84.35              86.71

Table 4: Results when testing normalised and unnormalised data.

5.6 Cropped and uncropped datasets

Since the mobile application allows the user to crop occlusions out of a photo, tests were carried out to investigate the effectiveness of matching cropped images with a reference dataset of cropped buildings. To carry out these evaluations, a dataset was created consisting of the original images with occlusions cropped out, for both training and testing images. Tests were performed using both the bag of words with spatial pyramid and pyramid match kernel, and the VLAD technique with no spatial tiling and a linear kernel. Table 5 shows the results when comparing images from the two datasets.

                      Accuracy BOW (%)            Accuracy VLAD (%)
Training    Test      SIFT    LBP     SIFT + LBP  SIFT    LBP     SIFT + LBP
Original    Original  84.64   85.50   88.00       86.14   85.50   88.36
Original    Cropped   80.07   75.42   81.86       85.57   75.42   85.50
Cropped     Original  64.86   61.64   62.00       64.36   61.64   65.21
Cropped     Cropped   90.21   89.64   92.50       91.57   86.64   92.71

Table 5: Accuracy obtained from the cropped and uncropped datasets.

The first row shows the accuracy obtained when both the training and test images come from the original dataset. The second row shows the accuracy obtained when cropped test images are matched against the original training images, and the third row shows the result of matching original test images against the cropped training images. The final row shows cropped test images matched with cropped training images, where the accuracy increases. This increase is likely caused by cropping out occlusions, so that only the building itself is matched against the corresponding building. As a result, two configurations were created in the application: cropped test images are matched with the cropped training images, and uncropped test images are matched with the original training images, irrespective of the technique used, i.e. BOW or VLAD.

5.7 Dataset containing images sourced online

As mentioned briefly in Section 3.1, I created another dataset containing images of Maltese buildings sourced from Google Images. Ten of the thirteen categories contain 50 images, while the Gunpost category contains 24 images, the Auberge de Bavier category contains 18 images, and the Fort St. Elmo category contains 12 images. The images in this new test dataset were evaluated against both the original Valletta Buildings dataset that I collected and the Cropped Valletta Buildings dataset, using both the bag of words with spatial pyramid and pyramid match kernel, and VLAD with no spatial tiling and a linear kernel. The results, presented in Table 6, were poor. Many of the images contained watermarks, such as website names added for copyright protection. Some of the images were also old and contained occlusions different from those in the training images, and a few were heavily edited, resulting in intensities different from those in the training images. Figure 12 shows some images from the dataset.

Figure 12: Sample images from the dataset sourced online.

                              Accuracy BOW (%)            Accuracy VLAD (%)
Training            Test      SIFT    LBP     SIFT + LBP  SIFT    LBP     SIFT + LBP
Valletta Buildings  Google    49.28   39.17   49.10       52.17   39.17   51.62
Cropped Valletta
Buildings           Google    38.45   28.70   34.30       36.30   28.70   37.00

Table 6: Accuracy obtained with images sourced online, using both techniques.

5.8 Confusion Matrix

Figure 13 contains a confusion matrix which shows the correctly and incorrectly labelled categories when applying the spatial pyramid with bag of words and support vector machines using the pyramid match kernel, with both training and test images taken from the original dataset. The indices represent each building in the following order: 1) President's Palace, 2) National Library, 3) Law Courts, 4) St Catherine of Italy Church, 5) St Paul Cathedral, 6) Gunpost, 7) Auberge de Castille, 8) Auberge de Bavier, 9) Upper Barrakka Lift, 10) Siege Bell Memorial, 11) Victoria Gate, 12) Parliament House, 13) Fort Saint Elmo, and 14) Unknown.

Figure 13: Confusion matrix of the Valletta Buildings dataset, i.e. both training and test images from the original dataset. The values represent the correctly labelled buildings for each category, over a total of 100 images.


Figure 14: The three columns show (a) true positive samples, (b) false positive samples and (c) false negative samples, with respect to 7 - Auberge de Castille.

A cell (i, j) in the confusion matrix indicates how many test images of category i were automatically labelled as category j. Ideally, the confusion matrix contains non-zero values only on the diagonal, which indicates perfect classification. The matrix shows that most images of the Unknown category are misclassified as buildings, and vice versa. Figure 14 shows the true positive, false positive and false negative samples for this experiment, for the Auberge de Castille category.
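A minimal sketch of how such a matrix can be computed from ground-truth and predicted labels, using scikit-learn as an assumed stand-in for the project's own tooling; the label lists below are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted category indices (1-14 as listed above).
true_labels = [1, 1, 7, 7, 14, 8]
predicted   = [1, 1, 7, 8, 2, 8]

# cm[i, j] counts test images of category i+1 that were labelled as category j+1.
cm = confusion_matrix(true_labels, predicted, labels=list(range(1, 15)))
print(cm.diagonal().sum() / cm.sum())  # overall accuracy
```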

5.9 Investigating the “Unknown” Category

As shown in Figure 13, the “Unknown” category is the category with the most mislabelled images. Other building datasets, such as ZuBuD, do not contain a similar category; in this case, having such a category makes the experiments more realistic. Experiments were also carried out with the category omitted entirely, which yields slightly better results. Both the VLAD and BOW systems were used in these experiments, and the results are shown in Table 7.

                              Accuracy BOW (%)            Accuracy VLAD (%)
Dataset                       SIFT    LBP     SIFT + LBP  SIFT    LBP     SIFT + LBP
Valletta Buildings            87.85   87.08   90.08       90.46   87.08   91.31
Cropped Valletta Buildings    92.92   91.31   94.31       94.31   91.31   94.38

Table 7: Results obtained when omitting the “Unknown” category from the dataset.

5.10 Zurich Building Dataset (ZuBuD)

Tests were also carried out on ZuBuD [34] in order to compare the algorithm with other state-of-the-art techniques. Both the bag of words with spatial pyramid and pyramid match kernel, and VLAD with no tiling and a linear kernel, were implemented. LBP was also applied on ZuBuD, achieving an accuracy of 89.56%, which is lower than the accuracy obtained with SIFT descriptors. Furthermore, combining the SIFT and LBP descriptors also scores a high accuracy, matching the best BOW result of 94.61%. Complete results are given in Table 8.

Accuracy BOW (%)              Accuracy VLAD (%)
SIFT    LBP     SIFT + LBP    SIFT    LBP     SIFT + LBP
94.61   89.56   94.61         88.69   89.56   89.57

Table 8: Results on the ZuBuD dataset.

5.11 Mobile Application

The mobile application was also distributed to various users to gather their feedback. In total 167 images were submitted, with eight cases of unusable results. In this experiment we obtain an overall accuracy of 62.89%. The users submitted 102 non-cropped photos, of which 62.74% were correctly labelled. On the other hand, 57 of the submitted photos were cropped, and 63.16% of these gave a correct result. Following the collection of results, some observations were noted. Cropping the photo can improve the recognition rate. If a test image contains people, the image is sometimes classified as “Unknown”. Lighting, intensity and similar architecture may also result in a misclassification. Occlusions in the images, such as trees and road signs, proved to be the most problematic, often resulting in the image being classified as “Unknown”.
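As a sanity check on the figures above (and assuming the eight unusable cases are excluded from the 167 submissions), the per-group percentages imply 64 correct non-cropped photos and 36 correct cropped photos, which reproduces the overall accuracy:

\[
\frac{64 + 36}{102 + 57} \;=\; \frac{100}{159} \;\approx\; 62.89\%.
\]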

5.12 Discussion

After implementing the tests and evaluating the results, a number of points arise. In the literature review I referenced several datasets that other authors used to evaluate their algorithms; in order to compare the algorithm created for this project, it was applied to one of these datasets, ZuBuD [34]. The accuracy gained is equal to that of the state-of-the-art hierarchical building recognition (HBR) approach [40].

Several benefits of this mobile application were noted during the evaluation of the project. First, the application takes around 15 seconds to upload a photo and return the result to the user on the spot. The accuracy obtained in the tests was good across the different datasets. Furthermore, the application provides utilities to overcome the limitations mentioned below: the ability to crop the clutter out of the building, which ultimately gives a higher chance of a good result, and to rotate the image, so that it is submitted in the correct orientation. These points show that users benefit from the application, since it provides instant information about what they are looking at, especially for landmarks which are less popular than others. A system which is specific to Maltese buildings may also generate better results than a general-purpose system such as Google Goggles, which is trained on a much broader set of training images.

On the other hand, a number of limitations were noted during the evaluation phase. The system assumes that the given image is in an upright position, which means that to recognise the building correctly, the image should be submitted in the correct orientation. As a future improvement, the image could be rotated automatically using the sensor information from the smartphone. Another limitation is that, for a building to be recognised correctly, a clear background is sometimes needed; if there is clutter in the surroundings, such as people, the building may be mislabelled as another building or as the unknown category. Furthermore, some buildings in the dataset are similar to each other; for example, Auberge de Castille and Auberge de Bavier share similar architectural features. Finally, there are challenges regarding the time it takes to retrieve the result for a particular image. Complex calculations can take a long time, and time is limited since this is a mobile application. As pointed out in the evaluation, VLAD has a long classification time, which is not optimal in this case, even though it achieves a higher accuracy. Similarly, combining the two kernels, i.e. one using LBP descriptors and one using SIFT descriptors, scores a greater accuracy, but results in a slightly longer calculation time.

The dataset may also be improved by taking photos at different times of day, such as in the evening, at sunset or at night. Compared with the dataset I created, ZuBuD offers more buildings, while the dataset of Maltese buildings offers more photos per category (250, as opposed to 5 in ZuBuD). Experiments were also carried out to investigate the “Unknown” category. As shown previously, removing the “Unknown” category increases the accuracy; however, introducing such a category makes the experiments more realistic. A similar category is not found in other building recognition datasets, including ZuBuD.

6 Future Work

In this section, numerous ideas are discussed for extending this work further. First, the number of buildings that the application can recognise can be extended; currently, the application recognises thirteen buildings, and more categories can be added. Another idea is an augmented reality description [56]. In the current project, the description for a building consists of text and a link for further reading; this could be extended with augmented reality content based on the recognised building. For example, if a given image contains the national library of Malta, the augmented reality description may show historians talking, reading, and so on; for churches, people may be shown going in and out, wearing clothes that signify different time periods.

Another direction is to apply a pre-trained convolutional neural network [57] to the approach that I proposed here, as well as to train a convolutional neural network on the training images, as convolutional neural networks are currently the state-of-the-art in computer vision (a minimal sketch follows below). In this project, I use SIFT to extract the shape information of local patterns and LBP to extract texture information; besides these techniques, experiments can be carried out to extract colour information [58] from the images. The algorithm may also be extended to recognise multiple buildings [44] in the same image. As mentioned in Section 3, the mobile application requires an internet connection, so another direction is to investigate an embedded solution, i.e. one that runs on the smartphone itself without an internet connection. GPS data from the smartphone may also be used to remove unlikely candidates and provide a more accurate result. Finally, this project is ultimately an object recognition problem, and even though the algorithm was applied to buildings, it may be applied to different scenarios.
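As an illustration of the pre-trained CNN direction, the following is a minimal sketch of using a pre-trained network as a fixed feature extractor with PyTorch and torchvision; the model choice, file name and preprocessing values are standard torchvision assumptions, not part of this project.

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Load a pre-trained network and drop its classification layer,
# so the model outputs a 512-dimensional feature vector instead.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    img = preprocess(Image.open("building.jpg").convert("RGB")).unsqueeze(0)
    feature = model(img).squeeze(0)  # could replace the SIFT/LBP features
```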

7 Conclusion

To conclude this project, I summarise the investigations made throughout my research. In order to satisfy the first objective, a dataset of 13 Maltese buildings was built by taking photos from different angles and under different lighting conditions. A computer vision-based solution was investigated, and tests were made using two systems. The most effective system uses VLAD with no spatial tiling and a support vector machine with a linear kernel, resulting in an accuracy of 92.50% across various experiments. On the other hand, the best trade-off between effectiveness and efficiency is achieved with a system that uses a bag of visual words and the combination of SIFT and LBP descriptors. A smartphone application that is able to take photos was also created, and the computer vision techniques were applied to the submitted photos; as a result, the application returns the description of the corresponding building. The application was then evaluated by distributing it to a number of users, and some observations were made. Finally, the algorithm was applied to a public benchmark dataset called ZuBuD, and the accuracy of 94.61% that I obtain is equal to that of the state-of-the-art.

References

[1] D. Bruckner, C. Picus, R. Velik, W. Herzner, and G. Zucker. Hierarchical semantic processing architecture for smart sensors in surveillance networks. Industrial Informatics, IEEE Transactions on, 8(2):291–301, May 2012. 1

[2] Haider Ali, Gerhard Paar, and Lucas Paletta. Semantic indexing for visual recognition of buildings. In 5th Int. Symp. on Mobile Mapping Technology. Citeseer, 2007. 1

[3] I. Ulrich and I. Nourbakhsh. Appearance-based place recognition for topological localization. In Robotics and Automation, 2000. Proceedings. ICRA ’00. IEEE International Conference on, volume 2, pages 1023–1029 vol.2, 2000. 1

[4] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Brian Curless, Steven M. Seitz, and Richard Szeliski. Reconstructing Rome. Computer, 43(6):40–47, 2010. 1

[5] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3d. ACM Trans. Graph., 25(3):835–846, July 2006. 1

[6] Ivan Laptev, Helmut Mayer, Tony Lindeberg, Wolfgang Eckstein, Carsten Steger, and Albert Baumgartner. Automatic extraction of roads from aerial images based on scale space and snakes. Machine Vision and Applications, 12(1):23–31, 2000. 5

[7] Joes Staal, Michael D. Abràmoff, Meindert Niemeijer, Max A. Viergever, and Bram Van Ginneken. Ridge-based vessel segmentation in color images of the retina. Medical Imaging, IEEE Transactions on, 23(4):501–509, 2004. 5

[8] George Azzopardi, Nicola Strisciuglio, Mario Vento, and Nicolai Petkov. Trainable COSFIRE filters for vessel delineation with application to retinal images. Medical image analysis, 19(1):46–57, 2015. 5

[9] Tinne Tuytelaars and Krystian Mikolajczyk. Local invariant feature detectors: a survey. Foundations and Trends® in Computer Graphics and Vision, 3(3):177–280, 2008. 6

[10] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(10):1615–1630, 2005. 6

[11] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf). Computer vision and image understanding, 110(3):346– 359, 2008. 6

[12] Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. Face description with local binary patterns: Application to face recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(12):2037–2041, 2006. 6

[13] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005. 6

[14] David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision - Volume 2, ICCV '99, pages 1150–1157, Washington, DC, USA, 1999. IEEE Computer Society. 6, 10, 11, 12

[15] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. Freak: Fast retina keypoint. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 510–517. IEEE, 2012. 6

[16] Yun Zhang, Tian Tian, Jinwen Tian, Junbin Gong, and Delie Ming. A novel biologically inspired local feature descriptor. Biological cybernetics, 108(3):275– 290, 2014. 6

[17] James Hays. Local feature matching. http://cs.brown.edu/courses/cs143/ proj2/matches.jpg, 2013. 7

[18] Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. Face description with local binary patterns: Application to face recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(12):2037–2041, 2006. 7

[19] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV, volume 1, pages 1–2. Prague, 2004. 8

[20] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1 edition, 1997. 8

[21] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989. 8

[22] Teuvo Kohonen. Learning vector quantization. Springer, 1995. 8

[23] Marti A. Hearst, Susan T. Dumais, Edgar Osuna, John Platt, and Bernhard Schölkopf. Support vector machines. Intelligent Systems and their Applications, IEEE, 13(4):18–28, 1998. 8

[24] Teuvo Kohonen and Panu Somervuo. Self-organizing maps of symbol strings. Neurocomputing, 21(1):19–30, 1998. 8

[25] Bernd Fritzke et al. A growing neural gas network learns topologies. Advances in neural information processing systems, 7:625–632, 1995. 8

[26] Douglas A Reynolds and Richard C Rose. Robust text-independent speaker identification using gaussian mixture speaker models. Speech and Audio Processing, IEEE Transactions on, 3(1):72–83, 1995. 8

[27] Mike Perkowitz and Oren Etzioni. Towards adaptive web sites: Conceptual framework and case study. Artificial intelligence, 118(1):245–275, 2000. 8

[28] Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of research and development, 3(3):210–229, 1959. 8

[29] Christopher D Manning and Hinrich Schütze. Foundations of statistical natural language processing, volume 999. MIT Press, 1999. 8

[30] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133–142. ACM, 2002. 8

[31] Malay K Pakhira, Sanghamitra Bandyopadhyay, and Ujjwal Maulik. Validity index for crisp and fuzzy clusters. Pattern recognition, 37(3):487–501, 2004. 9

[32] Jing Li, Wei Huang, Ling Shao, and Nigel Allinson. Building recognition in urban environments: A survey of state-of-the-art and future challenges. Information Sciences, 277:406 – 420, 2014. 9, 10, 12, 13

[33] Hao Shao, Tomáš Svoboda, Tinne Tuytelaars, and Luc Van Gool. HPAT indexing for fast object/scene recognition based on local appearance. In Image and Video Retrieval, pages 71–80. Springer, 2003. 9

[34] Hao Shao, Tomáš Svoboda, and Luc Van Gool. ZuBuD - Zurich buildings database for image based recognition. Computer Vision Lab, Swiss Federal Institute of Technology, Switzerland, Tech. Rep, 260, 2003. 10, 11, 13, 32, 33

[35] R Hutchings and Walterio Mayol-Cuevas. Building recognition for mobile devices: incorporating positional information with visual features. Comput. Sci., Univ. Bristol, Bristol, UK, Tech. Rep. CSTR-06-017, 2005. 10

[36] Chris Harris and Mike Stephens. A combined corner and edge detector. In Alvey vision conference, volume 15, page 50. Citeseer, 1988. 10

[37] N.J.C. Groeneweg, B. de Groot, A.H.R. Halma, B.R. Quiroga, M. Tromp, and F.C.A. Groen. A fast offline building recognition application on a mobile telephone. In Jacques Blanc-Talon, Wilfried Philips, Dan Popescu, and Paul Scheunders, editors, Advanced Concepts for Intelligent Vision Systems, volume 4179 of Lecture Notes in Computer Science, pages 1122–1132. Springer Berlin Heidelberg, 2006. 10

[38] Tinne Tuytelaars and Luc J Van Gool. Wide baseline stereo matching based on local, affinely invariant regions. In BMVC, volume 412, 2000. 10

[39] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002. 10

[40] Wei Zhang and Jana Košecká. Hierarchical building recognition. Image and Vision Computing, 25(5):704–716, 2007. 11, 33

[41] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on, pages 1–8, June 2007. 11

[42] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523, 1988. 12

[43] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007. 12, 13

[44] Hoang-Hon Trinh, Dae-Nyeon Kim, and Kang-Hyun Jo. Facet-based multiple building analysis for robot intelligence. Applied Mathematics and Computation, 205(2):537 – 549, 2008. Special Issue on Advanced Intelligent Computing Theory and Methodology in Applied Mathematics and Computation. 12, 35

[45] Jing Li and Nigel M. Allinson. Subspace learning-based dimensionality reduction in building recognition. Neurocomputing, 73(1–3):324–330, 2009. Timely Developments in Applied Neural Computing (EANN 2007) / Some Novel Analysis and Learning Methods for Neural Networks (ISNN 2008) / Pattern Recognition in Graphical Domains. 12, 13

[46] Jing Li and N. Allinson. Building recognition using local oriented features. In- dustrial Informatics, IEEE Transactions on, 9(3):1697–1704, Aug 2013. 13

[47] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. Springer, 1998. 13

[48] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 14

[49] Gregory D Abowd, Anind K Dey, Peter J Brown, Nigel Davies, Mark Smith, and Pete Steggles. Towards a better understanding of context and context-awareness. In Handheld and ubiquitous computing, pages 304–307. Springer, 1999. 14

[50] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008. http://www.vlfeat.org/. 20

[51] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. 20

[52] Relja Arandjelovic and Andrew Zisserman. All about VLAD. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1578–1585, 2013. 20, 26

[53] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2169–2178. IEEE, 2006. 21, 22, 26

[54] Subhransu Maji, Alexander C Berg, and Jitendra Malik. Classification using intersection kernel support vector machines is efficient. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008. 22

[55] Bruce D Lucas, Takeo Kanade, et al. An iterative image registration technique with an application to stereo vision. In IJCAI, volume 81, pages 674–679, 1981. 26

[56] Ronald T Azuma. A survey of augmented reality. Presence: Teleoperators and virtual environments, 6(4):355–385, 1997. 35

[57] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995. 35

[58] Alaa E Abdel-Hakim and Aly A Farag. CSIFT: A SIFT descriptor with color invariant characteristics. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1978–1983. IEEE, 2006. 35

Discover Valletta - Manual

1 What is Discover Valletta?

Discover Valletta is an application that is able to recognise buildings using image processing techniques. The application was developed in partial fulfilment of a B.Sc. ICT Final Year Project (2016). The idea is to let the user take a photo of a building and submit it to the application; in return, the application provides a description of that particular building. The following is a list of buildings known to the application:

- Grandmaster's Palace
- National Library
- Law Courts
- St. Catherine of Italy Church
- St. Paul's Cathedral
- A World War II gunpost
- Auberge de Castille
- Auberge de Bavier
- Upper Barrakka Lift
- Siege Bell Memorial
- Victoria Gate
- Parliament House
- Fort Saint Elmo

Figure 1: The first screen is shown.

2 Selecting an Image

First, the user needs to select the photo to be processed, either by taking a picture using the camera or selecting an existing one from the gallery.

Figure 2: A figure showing a user taking a photo of a building in Valletta.


Figure 3: A figure showing the picture taken, inside the application, along with a circular menu.

3 Image Adjustments

Using the app, the user also has the ability to crop part of the image. This can be useful to remove any clutter that surrounds the building. Moreover, the user can rotate the image clockwise or anticlockwise.

Figure 4: An example of cropping a building.

4 Uploading the Image

By selecting the check option, the image begins uploading.

Figure 5: The loading screen shown when the image is uploading.

5 The Result

After the image is uploaded, the application shows the user the result for that particular building. The result includes the name and description of the building, and a hyperlink is provided for further reading. The user is asked whether the result given is correct, in order to further tune the system.


Figure 6: The result screen, showing the name and description of the building.

6 Feedback

Feedback on the system is much appreciated, especially if the result is not correct. In that case, the application asks the user to enter the actual name of the building, if they know it.

Figure 7: If the result is not correct, the user is asked if they know the name of the building.

Figure 8: If the user knows the name, the user is asked to provide it.
