DEGREE PROJECT IN THE FIELD OF TECHNOLOGY MEDIA TECHNOLOGY
AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Content based filtering for application software

DAVID LINDSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: February 27, 2018
Supervisor: Jeanette Hellgren Kotaleski
Examiner: Anders Lansner
Principal: ISOFT Services AB
Swedish title: Innehållsbaserad filtrering för applikationsprogramvara
School of Computer Science and Communication

Abstract

In this study, two methods for recommending application software were implemented and evaluated based on their ability to recommend alternative applications with functionality related to the application that a user is currently browsing. One method was based on Term Frequency–Inverse Document Frequency (TF-IDF) and the other on Latent Semantic Indexing (LSI). The dataset used was a set of 2501 articles from Wikipedia, each describing a distinct application.

Two experiments were performed to evaluate the methods. The first consisted of measuring to what extent the recommendations for an application belong to the same software category, and the second was a set of structured interviews in which the recommendations for a subset of the applications in the dataset were evaluated in more depth.

The results from the two experiments showed only a small difference between the methods, with a slight advantage to LSI for smaller sets of retrieved recommendations and an advantage to TF-IDF for larger sets. The interviews indicated that the recommendations produced with LSI more often had functionality similar to that of the evaluated applications, while the recommendations produced with TF-IDF contained a higher fraction of applications whose functionality complemented or enhanced that of the evaluated applications.

Sammanfattning

I studien implementerades och utvärderades två alternativa implementationer av ett rekommendationssystem för applikationsprogramvara. Implementationerna utvärderades baserat på deras förmåga att föreslå alternativa applikationer med relaterad funktionalitet till den applikation som användaren av ett system besöker eller visar. Den ena implementationen baserades på Term Frequency-Inverse Document Frequency (TF-IDF) och den andra på Latent Semantic Indexing (LSI). Det data som användes i studien bestod av 2501 artiklar från engelska Wikipedia, där varje artikel bestod av en beskrivning av en applikation.

Två experiment utfördes för att utvärdera de båda metoderna. Det första experimentet bestod av att mäta till vilken grad de rekommenderade applikationerna tillhörde samma mjukvarukategori som den applikation de rekommenderats som alternativ till. Det andra experimentet bestod av ett antal strukturerade intervjuer, där rekommendationerna för en delmängd av applikationerna utvärderades mer djupgående.

Resultaten från experimenten visade endast en liten skillnad mellan de båda metoderna, med en liten fördel till LSI när färre rekommendationer hämtades, och en liten fördel för TF-IDF när fler rekommendationer hämtades. Intervjuerna visade att rekommendationerna från den LSI-baserade implementationen till en högre grad hade liknande funktionalitet som de utvärderade applikationerna, och att rekommendationerna från när TF-IDF användes till en högre grad hade funktionalitet som kompletterade eller förbättrade de utvärderade applikationerna.

Contents

1 Introduction
  1.1 Definitions
    1.1.1 Application
    1.1.2 Synonymy
    1.1.3 Polysemy
    1.1.4 Hyponymy
    1.1.5 Hypernymy
    1.1.6 Semantic similarity and relatedness
    1.1.7 Structured and Semi-structured text
    1.1.8 Recommender systems
    1.1.9 Wikipedia
  1.2 Research question
  1.3 Objective
  1.4 Delimitation

2 Background
  2.1 Vector space model
    2.1.1 Term frequency-inverse document frequency
    2.1.2 Latent Semantic Indexing
  2.2 Cosine similarity
  2.3 Text pre-processing
    2.3.1 Tokenization
    2.3.2 Stop words
    2.3.3 Stemming
  2.4 Evaluation metrics
    2.4.1 Precision at K
    2.4.2 Mean average precision

3 Method
  3.1 Data collection and labelling
  3.2 Index creation
  3.3 Recommendation process
  3.4 Experiments
  3.5 Used software
    3.5.1 Natural Language Toolkit (NLTK)
    3.5.2 Gensim
    3.5.3 Wikipedia (Python package)
    3.5.4 Django REST framework
    3.5.5 React
  3.6 Platform

4 Results
  4.1 Comparison by software category
    4.1.1 Precision at K
    4.1.2 Mean average precision
  4.2 Interviews
    4.2.1 Evaluated applications
    4.2.2 Interview Process
    4.2.3 Interview process delimitation
    4.2.4 Interview results

5 Discussion and conclusion
  5.1 Discussion
    5.1.1 Comparison by software category
    5.1.2 Interviews
  5.2 Obstacles
  5.3 Conclusion
  5.4 Future work

Bibliography

A List of stop words

B Software categories

C Evaluated applications

D Interview scoring criteria

Chapter 1

Introduction

The rapid growth of the internet, together with accelerating processes of digitalization, has led to an increasing number of software applications becoming available over the last years. A problem with this is that for most people, only a tiny part of the software in the world is of interest. If faced with a specific need for functionality, the information that is of relevance is at risk of being obscured by the growing amount of data available. There is a general need for help in filtering out the information that is of relevance, and numerous services of different types either incorporate this as a part of their functionality or base their whole functionality around it.

One common mechanism for doing so is to provide recommendations for content that might be of interest to a user. Today we see the presence of recommender systems in many parts of the internet, including when we buy or browse for books, movies or read the news online. The purpose of this study is to investigate and evaluate the quality of a recommender system for application software, where recommendations of alternative software are given based on which application a user is currently browsing. As there are numerous technologies and algorithms for recommending content, two specific methods have been selected for this study. The first is a method based on Term Frequency–Inverse Document Frequency (TF-IDF) and the second is a method based on Latent Semantic Indexing (LSI). The dataset on which they are evaluated is a set of articles from English Wikipedia, each describing an application.

1.1 Definitions

The following section contains definitions for terms that are central to the report.

1.1.1 Application

An application is a program (such as a word processor or spreadsheet) that performs a particular task or set of tasks (Merriam-Webster’s online dictionary, n.d.).


Throughout the report, the terms application, software and application software will be used interchangeably.

1.1.2 Synonymy

A synonym is a word or a phrase that means exactly or nearly the same as another word or phrase in the same language. For example, shut is a synonym of close (OED Online, n.d.-a).

1.1.3 Polysemy

Polysemy is the coexistence of many possible meanings for a word or phrase (OED Online, n.d.-b). An example is the word book, which can have multiple meanings. It can refer to a physical book that you can read, or it can refer to registering or scheduling some event (e.g. to book a hotel room).

1.1.4 Hyponymy

A hyponym is a word of more specific meaning than a general or superordinate term applicable to it. For example, spoon is a hyponym of cutlery (OED Online, n.d.-c).

1.1.5 Hypernymy

A hypernym is a word with a broad meaning, constituting a category into which words with more specific meanings fall; a superordinate. For example, colour is a hypernym of red (OED Online, n.d.-d).

1.1.6 Semantic similarity and relatedness

Semantic similarity is defined as a subset of the more general notion of semantic relatedness. Semantic relatedness between terms or texts refers to any type of relation between the two, whereas the more specific notion of semantic similarity refers to them being related by synonymy, hyponymy or hypernymy. In this sense, the words train and bus are semantically similar, as they are both means of transportation. On the other hand, the words bus and road would be considered semantically related but not semantically similar, as they often co-occur, but with different roles in the context in which they appear (Ballatore, Bertolotto, & Wilson, 2014). In the report, the use of similarity and relatedness between articles or words will refer to semantic similarity and semantic relatedness.

1.1.7 Structured and Semi-structured text

Structured text refers to text that resides in a fixed structure, so that sets of tuples of the same kind can be stored and processed in a database. This includes spreadsheets, table oriented text as in a relational model or sorted-graph as in object databases (Arasu & Garcia-Molina, 2003). Unstructured text is raw text that does not have any pre-defined data model or structure (Abiteboul, 1997).

Semi-structured text is text that is neither raw data nor strictly typed, but is normally associated with a schema that is contained within the text. Such text is often called self-describing, and includes tagged text such as HTML, XML or JSON documents (Buneman, 1997).

1.1.8 Recommender systems

Recommender systems have the purpose of suggesting content that might be of interest to a user. The purpose of this is generally to help the user find the information that is of relevance in a large space of possible options. Today we see the presence of recommender systems in many parts of the internet, including when we buy or browse for books, movies or read the news online. The systems themselves are often divided into groups based on which mechanism or mechanisms are used to provide recommendations. Below is a brief overview of two of the most commonly used categories (Ricci, Rokach, & Shapira, 2011).

Content based filtering Content based filtering systems recommend items based on their content, or on a set of attributes describing the items. The decision on which items to recommend is typically based on finding the items with content similar to, for example, the item that the user is currently browsing, or has browsed or liked before. Similarity between items is computed as a function of their contents, and the items with the highest similarity are returned as recommendations (Pazzani & Billsus, 2007).

Collaborative filtering In collaborative filtering methods, the items that are recommended to a user are items that have been rated highly by other users with a similar profile. Depending on what the system is designed for, the similarity of profiles can be based on earlier ratings and preferences, items bought in the past or other information available on the user profiles (Debnath, Ganguly, & Mitra, 2008).

1.1.9 Wikipedia

In March 2017, there were over 5 million articles in the English version of the web based encyclopaedia Wikipedia, and this number has grown by hundreds of thousands of new articles every year since 2004 (Wikipedia, n.d.). The content of Wikipedia is user created, which means that anyone can edit or create an article. This has led to some debate on the quality of the information it contains. In a study from 2005, a comparison was made between Wikipedia and Encyclopaedia Britannica, which is a more traditional and well-established encyclopaedia with around 100 full time editors and over 4000 external contributors, including 110 Nobel prize winners and 5 American presidents. The study compared 50 science related entries in both encyclopaedias, and showed that there was little to no difference in quality between the two (Giles, 2005). Other studies, and the nature of Wikipedia itself, suggest that the articles are not of uniformly good quality, since anyone can edit or create an article and can therefore, consciously or unconsciously, contribute false or fake content (Hu, Lim, Sun, Lauw, & Vuong, 2007).

1.2 Research question

Given a dataset of Wikipedia articles, each about a distinct application, which item representation, TF-IDF weighting or Latent Semantic Indexing (LSI), is the most suitable for a content based filtering system for recommending applications with functionality related to the one that a user is currently browsing?

1.3 Objective

The desired outcome is an evaluation of how well a content based filtering system using Wikipedia as the source of information works for application software. More specifically, the objective is to investigate two possible implementations of such a system, one using TF-IDF weighting and one using LSI. The two will be evaluated both individually and in comparison with each other, to reach a conclusion on which implementation is best able to recommend applications with related functionality, if either of them is capable of doing so.

1.4 Delimitation

Related functionality between applications is limited to two types of relations. The first relation is similarity in functionality and features, which refers to applications that are able to perform the same or a similar set of tasks, in a similar manner. The second type of relation is functional synergy, which refers to recommendations that in some way complement or enhance the functionality of the evaluated application. Examples include software integration, applications from the same suite or a relation through available plugins.

In the implementation phase of the study, the Wikipedia articles will be collected once, and then handled offline. Articles will be fetched from the English version of Wikipedia, and are all assumed to be in English. Libraries for natural language processing and machine learning will be used, to the extent that this is possible. The runtimes of the algorithms evaluated will not be part of the evaluation, as focus will be put solely on the quality of the recommendations.

Chapter 2

Background

This section contains the background and theory behind the methods that were implemented and evaluated.

2.1 Vector space model

Many of the content based systems researched for the study share a common foundation in how the documents or texts are represented, including the two that were selected for implementation and evaluation. This representation is commonly found in information retrieval systems, and ignores the order in which the terms in a document or text appear. Because of this, models using this representation are often referred to as bag-of-words models.

Given the set of terms that each document in a collection contains, each of the documents can be represented as a vector, with each element in the vector representing a term, or a set of terms, that occurs within the collection (Salton, Wong, & Yang, 1975). Representing each document as such a vector is normally referred to as using the vector space model. The process of transforming the documents in a collection to vector representations is called indexing, and the set of pre-calculated document vectors is called an index (Jurafsky & Martin, 2009). The value of each feature in the vector representation of a document is called the term weight, and is normally a function of the frequencies of the terms in the document. In the following subsections, two common heuristic techniques for constructing document vectors are presented.

2.1.1 Term frequency-inverse document frequency

A common heuristic technique for calculating the term weights is the term frequency-inverse document frequency (TF-IDF) weighting scheme. As seen in equation 2.1, the term weight for term $i$ in document $j$ is calculated by multiplying its term frequency $tf_{i,j}$ with what is referred to as the inverse document frequency, $idf_i$, for that term. The term frequency is a measure of how many times term $i$ appears in document $j$. The purpose of the inverse document frequency is to give higher weight to terms that appear in fewer documents in the collection, and lower weight to terms that occur in a larger number of documents. The idea behind this is that terms that occur in fewer documents in the collection have higher discriminative power, and should get a higher weight than words that appear in a larger number of documents. As shown in equation 2.2, the $idf_i$ for a word $i$ is commonly calculated by dividing the total number of documents in the collection, $N$, by the number of documents in which the term $i$ appears, $n_i$. To avoid overvaluing the impact of words that appear only in a very small number of documents, a common heuristic is to normalize the idf by taking the logarithm of this quotient. For a vector space representation using the TF-IDF weighting scheme, each document in a collection is represented as a vector where each value corresponds to the TF-IDF value for a term appearing in that document.

w_{i,j} = tf_{i,j} \times idf_i \qquad (2.1)

idf_i = \log \frac{N}{n_i} \qquad (2.2)
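As an illustration of equations 2.1 and 2.2, the sketch below computes TF-IDF weights for a small collection of tokenized documents. It is a minimal example written for this report, not the implementation used in the study (which relied on Gensim); the toy documents are placeholders.

import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute TF-IDF weights (equations 2.1 and 2.2) for tokenized documents."""
    n_docs = len(documents)
    # Number of documents in which each term appears (n_i in equation 2.2).
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # Raw term frequency tf_{i,j}.
        vectors.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in tf.items()})
    return vectors

docs = [["word", "processor", "text"],
        ["spreadsheet", "cell", "text"],
        ["word", "processor", "document"]]
print(tfidf_vectors(docs)[0])

Note that a term occurring in every document gets the weight zero, reflecting its lack of discriminative power.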

Despite its strengths and wide-spread use in information retrieval, the TF-IDF representation has its limitations. It is unable to model relationships between terms, such as synonymy, polysemy, hyponymy or hypernymy. When calculating similarity between two documents, this could have consequences both in documents being considered more similar than they actually are (false positives) and documents being considered less similar than they actually are (false negatives) (Ramos, 2003).

2.1.2 Latent Semantic Indexing

Latent semantic indexing (LSI) is a method used within many areas of natural language processing and information retrieval. In addition to recommender systems, it has been used in areas such as text summarization (Gong & Liu, 2001), spam filtering (Gee, 2003), automatic grading of essays (Foltz, Laham, & Landauer, 1999) and gene-clustering (Homayouni, Heinrich, Wei, & Berry, 2004).

The method, which is sometimes also referred to as latent semantic analysis (LSA), makes use of singular value decomposition (SVD) with the purpose of uncovering an underlying semantic structure partially obscured by randomness in the author's choice of words. The method uses SVD to cast documents and queries into a low-rank vector representation, where each element in the vector represents an abstract concept consisting of words that appear in similar contexts throughout the document collection. The idea behind this is that words that appear in similar contexts tend to have a similar meaning (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990).

More specifically, the singular value decomposition is used to create a lower-rank representation of what is normally referred to as the term-document matrix of a document collection. A term-document matrix, such as $C$ shown in equation 2.3, is an $M \times N$ matrix where each row represents a distinct term in the collection, and each column represents a document. An element $x_{i,j}$ in the matrix denotes how many times the term $i$ appears in document $j$.

C = \begin{pmatrix} x_{1,1} & \cdots & x_{1,N} \\ \vdots & \ddots & \vdots \\ x_{M,1} & \cdots & x_{M,N} \end{pmatrix} \qquad (2.3)

The dot product between two term vectors in $C$ can be used to give the correlation between the two terms over the set of documents. The full set of dot products between term vectors is given by the matrix product $CC^T$, where the entry $(i, j)$ in the product represents the overlap between terms $i$ and $j$ based on their co-occurrences in the documents. Similarly, the correlations between the documents over the terms are given by the product $C^T C$.

As shown in equation 2.4, there is a singular value decomposition of the matrix $C$ into three matrices $U$, $\Sigma$ and $V$ such that $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix. The columns of the matrix $U$ are the orthogonal eigenvectors of $CC^T$ and the columns of $V$ are the orthogonal eigenvectors of $C^T C$. The eigenvalues $\lambda_1, \ldots, \lambda_r$ of $CC^T$ are the same as the eigenvalues of $C^T C$, and the diagonal values $\sigma_i$ of the matrix $\Sigma$ are set to $\sigma_i = \sqrt{\lambda_i}$. The diagonal values $\sigma_1, \ldots, \sigma_r$ of the matrix $\Sigma$ are called singular values, and appear in decreasing order so that $\sigma_i \geq \sigma_{i+1}$.

C = U \Sigma V^T \qquad (2.4)

A low-rank approximation of the matrix $C$ of, at most, rank $k$ can be constructed by first replacing all singular values in the matrix $\Sigma$ with zeroes, except for the $k$ largest values. This new matrix $\Sigma_k$ is then used to compute the rank-$k$ approximation of $C$ by inserting it into equation 2.4, so that $C_k = U \Sigma_k V^T$. Since the effect of small eigenvalues on matrix products is small, replacing the smallest set of singular values with zeroes in this way minimizes the error of the low-rank approximation of $C$ compared to the original matrix (Manning, Raghavan, & Schütze, 2008b).
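To make the truncation step concrete, the sketch below builds a small term-document matrix and computes a rank-2 approximation with NumPy's SVD. The matrix values and k = 2 are placeholders chosen for the example; the study itself used Gensim's LSI model rather than this explicit decomposition.

import numpy as np

# Toy term-document matrix C (rows: terms, columns: documents).
C = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 2.0]])

# Full SVD: C = U * Sigma * V^T (equation 2.4).
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)

# Keep only the k largest singular values to obtain the rank-k approximation.
k = 2
sigma_k = np.diag(np.concatenate([sigma[:k], np.zeros(len(sigma) - k)]))
C_k = U @ sigma_k @ Vt

# Document vectors in the k-dimensional concept space are given by the first
# k rows of Sigma * V^T.
doc_concepts = np.diag(sigma[:k]) @ Vt[:k]
print(np.round(C_k, 2))
print(np.round(doc_concepts, 2))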

LSI is generally better at dealing with some of the issues previously described as drawbacks of a TF-IDF document representation, such as the inability to incorporate synonymy, hyponymy and hypernymy into document similarity comparisons. Words that have the same or a similar meaning are assumed to be more likely to appear in similar contexts, which makes it more likely for them to be placed in the same concept. However, as the method only allows a term to belong to a single concept, LSI is unable to properly represent polysemy in similarity comparisons (Deerwester et al., 1990).

2.2 Cosine similarity

When documents are represented in a vector space, a common way to measure the similarity between two documents is by the cosine similarity between their vector representations. Equation 2.5 shows how the cosine similarity between two documents, $d_1$ and $d_2$, is calculated. Geometrically, the cosine similarity is interpreted as the cosine of the angle between the vector representations of the two documents, $\vec{V}(d_1)$ and $\vec{V}(d_2)$. An important consequence of this is that the similarity becomes independent of the document length (Huang, 2008).

\mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{\lvert \vec{V}(d_1) \rvert \, \lvert \vec{V}(d_2) \rvert} \qquad (2.5)
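A direct implementation of equation 2.5, included here for illustration rather than taken from the study's code, which used Gensim's similarity routines:

import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors (equation 2.5)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# A vector compared to a scaled copy of itself gives similarity 1.0,
# illustrating the independence of document length.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))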

2.3 Text pre-processing

As a step towards transforming a text or a document to a vector space representation, a set of pre-processing operations is commonly applied to the documents before indexing. The purpose of these is to increase the quality of information retrieval operations on the indexed dataset. The following subsections describe some pre-processing operations common to information retrieval.

2.3.1 Tokenization

Tokenization is the process of separating out the individual words in running text. In English, the most common word separator is whitespace, but simply splitting the text at every whitespace character can also cause problems for information retrieval. Words or terms such as word processor or Royal Institute of Technology include one or more whitespace characters, and by splitting them into multiple terms some of their compound meaning is lost. There are also contracted terms such as I’m that ideally should be split into two words: I and am. The purpose of these examples is to illustrate the difficulties of proper tokenization, and how the decision on which rules to use should depend on factors such as the type of the text and the language (Jurafsky & Martin, 2009).

2.3.2 Stop words

A common technique in information retrieval is to remove stop words from the documents in the collection. Stop words are words that carry little semantic weight due to appearing with high frequency throughout the collection. It is therefore common to remove them altogether from the index, as this saves considerable space and removes any impact that they otherwise would have, no matter how small. Normally, a list of common stop words is kept, and any word from a document that, at the time of indexing, is present in this list is not put into the index (Fox, 1989).

2.3.3 Stemming

Another common technique to improve the quality of text similarity comparisons is stemming, which is the process of collapsing words into their word stem, base or root form. The major advantage of stemming is that it allows matching a word to any morphological variant of it contained in other documents. For example, if we have the word edit in a document, stemming will allow it to match other variants of the same word, such as editing or edits (Willett, 2006).
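The three pre-processing steps above can be combined as in the sketch below, which uses NLTK's tokenizer, English stop word list and Porter stemmer. This is a simplified illustration; the study used its own stop word list (appendix A) and also stripped Wikipedia-specific formatting.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, lower-case, drop stop words and punctuation, then stem."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The editor allows editing and edits documents quickly."))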

2.4 Evaluation metrics

To evaluate a recommendation system, and to be able to compare different implementations to each other, a set of evaluation metrics is normally used. Depending on the type of system, these metrics can differ. Below is a description of some of the metrics most commonly used to evaluate recommendations or sets of recommendations, found in the field of ranked information retrieval.

2.4.1 Precision at K

Precision is the fraction of retrieved documents that are considered relevant. In the case of ranked retrieval, it is common to calculate precision for the top K documents retrieved (Kishida, 2005). In the case of a recommender system for applications, precision at K is a measure of how many of the top K recommendations for an application are relevant, as shown in formula 2.6.

\mathrm{Precision\ at\ } K = \frac{\#(\mathrm{relevant\ recommendations\ retrieved})}{K} \qquad (2.6)
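A minimal function implementing formula 2.6, assuming for the sake of the example that relevance judgements are available as a set of relevant item identifiers:

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant (formula 2.6)."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

# Example: 3 of the top 5 recommendations are relevant -> precision 0.6.
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e", "x"}, 5))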

2.4.2 Mean average precision

Another common measure in ranked retrieval is the mean average precision (MAP), which is a single score of quality over multiple queries. In the case of this study, if we have a set of queries $q_j \in Q$ where each $q_j$ corresponds to a Wikipedia article about an application, a set of relevant articles $\{d_1, \ldots, d_{m_j}\}$ and a set of ranked recommendations $R_{j,k}$ from the top result until we get to article $d_k$, then the mean average precision is calculated as shown in equation 2.7.

\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{j,k}) \qquad (2.7)

If no relevant recommendations are retrieved, the precision value in equation 2.7 is taken to be 0. The measure is biased towards the top of the ranking, meaning that the score will be higher in cases where relevant recommendations are returned higher up in the list as opposed to lower down (Manning, Raghavan, & Schütze, 2008a).
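A sketch of equation 2.7, under the same assumption as above that relevance judgements are given as sets of identifiers; this is an illustration, not the study's evaluation code.

def average_precision(recommended, relevant):
    """Average of the precision values at each rank where a relevant item appears.
    Relevant items that are never retrieved contribute a precision of 0."""
    if not relevant:
        return 0.0
    hits, precisions = 0, []
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant)

def mean_average_precision(all_recommended, all_relevant):
    """Equation 2.7: the mean of the average precision over all queries."""
    scores = [average_precision(rec, rel)
              for rec, rel in zip(all_recommended, all_relevant)]
    return sum(scores) / len(scores)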

Chapter 3

Method

The following section describes the method used to evaluate the two approaches for recommending applications. The section starts with a description of the dataset that was used in the evaluation, and how it was collected and labelled. This is followed by a description of the implementation of the TF-IDF and LSI models and how they were used to return recommendations, and a description of the experiments that were performed to evaluate and compare them to each other. In the final sections of the chapter, the software and libraries used for the implementation are listed and described.

3.1 Data collection and labelling

The data used in the evaluation consisted of 2501 distinct articles from English Wikipedia, each article about an application. The content of each article was fetched from the Wikipedia REST API [1], using hyperlinks to articles that were manually gathered from the Wikipedia web interface [2]. All articles were collected through available list-type Wikipedia articles, which contain lists of links to articles about applications, sorted by their software category. Examples include the Wikipedia articles titled List of word processors [3] and List of spreadsheet software [4]. The total number of hyperlinks manually collected was 3943. This number was reduced to the final 2501 distinct articles by removing empty, invalid or duplicate articles after fetching their contents from the API. The lists previously mentioned in this section were also used to add software category labels to the articles according to which list their hyperlink was collected from. Thus, an article fetched from the page List of word processors was labelled a word processor.

[1] https://en.wikipedia.org/api/rest_v1/
[2] https://en.wikipedia.org
[3] https://en.wikipedia.org/wiki/List_of_word_processors
[4] https://en.wikipedia.org/wiki/List_of_spreadsheet_software


If an article was present in more than one list, it was labelled with one category label from each list that it appeared in.

Of the 2501 articles collected, 2169 were labelled with only a single software category label and 332 with multiple software category labels. The articles were collected from 115 different lists, which means that a total of 115 different labels were used to label the dataset. The number of applications in each software category ranges from a single application to 139 in the largest group. The median number of applications labelled with each software category label was 20. Some of these categories overlap in the functionality of the software type that they describe, which is further discussed in section 5.1.1. For a full list of the software categories, see appendix B.

All the articles collected were in English, but they varied in length, style and narrative focus. As they are Wikipedia articles, an article can have multiple authors and an author can have written parts of multiple articles.
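As an illustration of the collection step, the sketch below fetches a couple of articles with the Wikipedia Python package described in section 3.5.3. The article titles and the category label are placeholders chosen for the example; the study worked from 3943 manually gathered hyperlinks.

import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

# Placeholder titles; the study used hyperlinks gathered from list articles
# such as "List of word processors".
titles = ["AbiWord", "LibreOffice Writer"]
category = "Word processor"

articles = []
for title in titles:
    try:
        page = wikipedia.page(title, auto_suggest=False)
        articles.append({"title": page.title,
                         "text": page.content,
                         "categories": [category]})
    except (PageError, DisambiguationError):
        # Skip empty, invalid or ambiguous articles, as was done in the study.
        continue

print(len(articles))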

3.2 Index creation

The following section describes how the TF-IDF and LSI indexes were created from the articles. The process is shown in figure 3.1, and was the same for both models, except for how the documents were represented. Before creating each index, the articles in the dataset were pre-processed in three consecutive steps. The same steps were applied to both the TF-IDF index and the LSI index, with the purpose of increasing the quality of semantic comparisons between words and texts.

[Figure 3.1 is a flow chart of the indexing pipeline: the articles are loaded, tokenized, stripped of stop words, stemmed, transformed to vector space, and saved to an index.]

Figure 3.1: The process of creating the index from the Wikipedia articles.

The articles were tokenized and stripped of punctuation, and all words were made lower-case. Wikipedia-specific formatting was removed, such as certain characters or sequences of characters used to indicate the different sections of an article. In the second step, stop words were removed using a list of common English stop words, listed in appendix A. Finally, all remaining words were stemmed using a Porter stemmer, which is a relatively simple stemming algorithm for mapping English words to their stems. This is accomplished by applying a set of sequential rules and transformations to each word (Willett, 2006).

After pre-processing, the articles were transformed to LSI and TF-IDF vector representations, and put in a separate index for each method. As the process of creating these indexes was time consuming, the indexes were saved to file for re-use in the evaluation.
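A condensed sketch of the indexing step using Gensim follows. The article texts and the pre-processing function are toy stand-ins, the number of LSI topics is a placeholder, and the report does not state whether LSI was trained on raw counts or TF-IDF weights; the chaining below simply follows common Gensim practice.

from gensim import corpora, models, similarities

# Toy stand-ins; in the study, `articles` held the 2501 Wikipedia article texts
# and pre-processing was done as sketched in section 2.3.
articles = ["A word processor for editing text documents.",
            "A spreadsheet application for calculations in cells."]

def preprocess(text):
    # Stand-in for the tokenize / stop word / stemming pipeline of section 2.3.
    return text.lower().split()

tokenized = [preprocess(text) for text in articles]
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# TF-IDF index.
tfidf = models.TfidfModel(bow_corpus)
tfidf_index = similarities.MatrixSimilarity(tfidf[bow_corpus],
                                            num_features=len(dictionary))

# LSI index; 2 topics is a placeholder for this toy corpus.
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)
lsi_index = similarities.MatrixSimilarity(lsi[tfidf[bow_corpus]], num_features=2)

# Save the indexes to file for re-use, as described above.
tfidf_index.save("tfidf.index")
lsi_index.save("lsi.index")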

3.3 Recommendation process

As shown in figure 3.2, the process of recommending applications was the same for both methods, except for how the articles were represented. In the same manner as when the Wikipedia articles were used to create the indexes, the article of the application for which the system should give a set of recommendations first went through a series of steps to transform it to either TF-IDF or LSI vector representation. The article was first tokenized, stemmed and stripped of stop words and Wikipedia-specific formatting. It was then transformed to either TF-IDF or LSI vector space representation, depending on which index the query for recommendations was aimed at. After the article had been transformed to vector space, it was compared to all other articles in the index by calculating the cosine similarity between its vector and each of the vectors in the index. The articles in the index were then sorted by their similarity in descending order, so that the articles at the top of the list were the ones with the highest cosine similarity. Depending on how many recommendations, K, were asked for, the top K applications from the sorted list were then returned as recommendations.
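A sketch of the retrieval step for a single query article, assuming the dictionary, models and indexes from the previous sketch; the function name and arguments are illustrative only.

def recommend(query_text, k, dictionary, tfidf_model, lsi_model, index,
              use_lsi=True):
    """Return (article position, similarity) for the K most similar articles."""
    bow = dictionary.doc2bow(preprocess(query_text))
    # Transform the query to the same vector space as the index.
    query_vec = lsi_model[tfidf_model[bow]] if use_lsi else tfidf_model[bow]
    sims = index[query_vec]  # Cosine similarity to every article in the index.
    ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Example call against the LSI index built above:
print(recommend("An application for editing documents.", 1,
                dictionary, tfidf, lsi, lsi_index))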

[Figure 3.2 is a flow chart of the retrieval pipeline: the query application's article is loaded, tokenized, stripped of stop words, stemmed and transformed to vector space; cosine similarities to the loaded index are calculated, and the K most similar applications in the index are returned as recommendations.]

Figure 3.2: The process of retrieving a set of recommendations for an application.

3.4 Experiments

The evaluation was performed in two parts. First, an experiment was performed in which the percentage of recommendations that belong to the same software category as the query application was measured for both implemented methods. The dataset of Wikipedia articles was split into a training set and a test set for the purpose of evaluating on data previously unseen by the models. Of the 2501 labelled articles available, 70% were randomly selected for the training set and the remaining 30% were used in the test set. The resulting sets contained 1750 applications in the training set and 751 in the test set. The software categories used were all relatively broad descriptors of functionality, and the purpose of the experiment was to get a broad measure of the fraction of recommendations that have a similar functionality.

The second experiment consisted of four structured interviews, in which the test persons were asked to assess the quality of the top five recommendations from each implemented method for 20 randomized applications. The purpose of this was to get a more fine-grained measure of relevance, and to capture potential factors in the relevance of a recommendation that the software categories from Wikipedia might not, such as functional synergy.
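A minimal sketch of the 70/30 split; the seed is only there to make the example reproducible and is not taken from the study.

import random

random.seed(0)  # Placeholder seed for reproducibility of the example.
labelled_articles = list(range(2501))  # Stand-in for the 2501 labelled articles.
shuffled = random.sample(labelled_articles, len(labelled_articles))
split_point = int(0.7 * len(shuffled))
training_set, test_set = shuffled[:split_point], shuffled[split_point:]
print(len(training_set), len(test_set))  # 1750 751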

3.5 Used software

Below is a description of the software tools and libraries that were used in the implementation and evaluation process. All software used was available free and open source at the time of the study.

3.5.1 Natural Language Toolkit (NLTK)

NLTK [5] is a suite of Python libraries and programs for various natural language processing tasks. It was used to tokenize, stem and remove stop words from the articles.

3.5.2 Gensim

Gensim [6] is a Python library for vector space and topic modelling, which uses the NumPy [7] and SciPy [8] Python libraries for scientific calculations. It was used for creating the TF-IDF and LSI models, for cosine similarity calculations and for loading and saving the models to files.

3.5.3 Wikipedia (Python package)

The Python package named Wikipedia [9] sends requests to the Wikipedia REST API, and was used to fetch the contents of the articles.

[5] http://www.nltk.org/
[6] https://radimrehurek.com/gensim/
[7] http://www.numpy.org/
[8] https://www.scipy.org/
[9] https://pypi.python.org/pypi/wikipedia

3.5.4 Django REST framework

Django REST framework [10] is an extension to the Python web framework Django, and includes a toolkit for building REST APIs. It was used in combination with the other tools to build the back-end of the platform used in the evaluation. It was mainly used to load from and save to the MySQL database in which the Wikipedia articles were stored locally, and to serve requests for article data from the evaluation interface.

3.5.5 React

React [11] is a library for the JavaScript programming language, primarily used for building user interfaces for web applications. It was used to build the user interface for the evaluation platform, and to facilitate browsing the articles of the applications.

3.6 Platform

The experiments were performed on a laptop with an Intel Core i7-6700HQ 2.60 GHz processor and 8 GB of RAM.

[10] http://www.django-rest-framework.org/
[11] https://facebook.github.io/react/

Chapter 4

Results

In this chapter, the results from the two experiments are presented. The first section contains the results from the experiment in which the software category labels were used, and the second section the results from the interviews.

4.1 Comparison by software category

All software categories gathered from Wikipedia were related to the functionality of the applications they contained. A relatively broad measure of how well a system is able to recommend similar applications in terms of functionality could therefore be how highly the system scores the articles of the applications that belong to the same software category. In the following subsections, the results from such an experiment are presented.

4.1.1 Precision at K

Table 4.1 shows the mean precision for the top K recommendations for TF-IDF and LSI respectively. The scores are the mean over all applications in the test set, which means a total of 751 recommendation queries performed for each K and for each method. Each of the top K recommendations that had one or more software categories in common with the query application was considered relevant, and the ones that did not were considered irrelevant. For each method and each K, this was then averaged over all applications using the arithmetic mean, to get the mean precision at K for the full test set.

For the smallest number of recommendations retrieved, K=5, the LSI based recommendations had a slightly higher mean precision than the TF-IDF based recommendations. For values of K higher than 5, the TF-IDF recommender had the higher mean precision of the two, although the absolute difference between the methods remains small.


Table 4.1: The mean precision at K recommendations retrieved for the applications in the test set, for TF-IDF and LSI respectively.

           Mean precision
K          TF-IDF    LSI
5          0,527     0,533
10         0,481     0,471
15         0,444     0,427
20         0,413     0,393
25         0,384     0,363

4.1.2 Mean average precision

Table 4.2 shows the mean average precision for the same category-based measure of relevance applied to the same set of articles as in section 4.1.1. The mean average precision was somewhat higher when TF-IDF was used. This means that the TF-IDF representation of the articles on average made the system rank the articles about applications from the same software category slightly higher in the full list compared to LSI.

Table 4.2: The mean average precision for the applications in the test set, for TF-IDF and LSI respectively.

Method     Mean average precision
TF-IDF     0,368
LSI        0,332

4.2 Interviews

In this section, the results from the interviews are presented. Four structured interviews were performed, in which the subjects were asked to evaluate the top five recommendations from the TF-IDF based and the LSI based system respectively.

4.2.1 Evaluated applications

A set of 20 applications was randomized from five software categories that the subjects had confirmed experience of working with, and an understanding of the major features and functionality that such systems commonly possess. The decision not to randomize from the full set of applications was based on the difficulty of asking people to evaluate types of software they have no prior experience of working with, and of finding enough people able to evaluate the more specialized types of software present in the list.

The articles were randomized from the software categories Web conferencing software, Word processor, Spreadsheet software, Raster graphics editor and Email client. Four articles were randomized from each category, and the articles varied in length, style and narrative focus. Table 4.3 shows the abbreviations used in place of the software category names. Instead of the actual names of the applications evaluated, the report refers to them by a combination of their abbreviated software category name and a distinct number 1 to 4. Thus, EC1 and EC3 refer to two specific email clients from the evaluation set. For a full list of which application name corresponds to which abbreviated name, see appendix C.

Table 4.3: Abbreviations used in place of the software category names.

Software category             Abbreviation
Email client                  EC
Raster graphics editor        RG
Spreadsheet software          SS
Web conferencing software     WC
Word processor                WP

The mean precision for the top five recommendations for the subset of applications from only these five categories was 0,509 for the TF-IDF recommender and 0,507 for the LSI recommender, when using matching categories as the relevance indicator in the same way as in sections 4.1.1 and 4.1.2. This is slightly lower for both recommenders compared to the mean precision for the full set of applications in the test set, shown in table 4.1. Table 4.4 shows the number of applications in each of the five categories, the median length of their content and the mean precision for the top 5 recommendations for each category when matching software categories were used.

Table 4.4: Details for the full set of applications from the five categories from which the set of applications for the interviews were randomized.

Software category    Number of       Median article       Mean precision at K=5
                     applications    length [# words]     TF-IDF    LSI
EC                   45              438                  0,476     0,458
RG                   82              465                  0,605     0,644
SS                   12              393                  0,383     0,250
WC                   29              467                  0,413     0,262
WP                   31              576                  0,439     0,542

As seen in table 4.5, the median number of words in the 20 articles selected for the interviews was 520, with the longest article containing 5507 words and the shortest one 68 words. The mean precision for the top five recommendations was 0,48 for the TF-IDF recommender and 0,66 for the LSI recommender. The potential consequences of this are discussed in section 5.1.2.

Table 4.5: Details for the evaluation set that was used in the interviews, compared to the test set and the full set of applications.

Measure                          Interview set    Test set    Full set
Number of applications           20               751         2501
Max [# words]                    5507             10009       13567
Min [# words]                    68               29          29
Median [# words]                 520,5            450         449
TF-IDF – Mean precision (K=5)    0,480            0,527       0,546
LSI – Mean precision (K=5)       0,660            0,533       0,568

4.2.2 Interview Process

For each of the 20 applications, the subjects were presented with the top five recommendations from each method, and asked to evaluate their quality. The subjects were given, and asked to read, the Wikipedia articles of the 20 applications evaluated, along with the articles belonging to their corresponding sets of recommendations. A decision was made to hide the software category labels of each article, to prevent them from influencing the scores given. This meant that any information on software category had to be read from the content of the Wikipedia articles by the subjects themselves and not from the user interface of the evaluation platform.

The subjects were asked to give three separate scores for each recommended application. The first score was based on how similar they judged the recommended application to be in terms of functionality and features. A high value on this score indicates a high similarity in functionality, meaning that the two applications to a large extent are able to perform the same tasks, in a similar manner. A low value indicates little or no shared functionality. The second score was based on the extent to which the functionality of the recommended application complements or enhances the functionality of the application for which the recommendations were evaluated. A high second score indicates a strong functional synergy between the evaluated application and the recommended application, and a low score indicates little or no presence of such. Finally, the subjects were asked to give a general score based on how satisfied they would be on the whole with the recommendation, based on the two relations combined, along with any comments they might have.

A scale of {0, 1, 2, 3} was used for the three scores for each recommendation, with 0 being the lowest and 3 the highest. In addition to the scale, the subjects were given a description text for each score, to help them set the scores and to keep the scores at the same magnitude over all interviews. For the full list of criteria given to the subjects, see appendix D.

4.2.3 Interview process delimitation

To reduce the complexity of the interview process, the subjects were asked not to put any weight on the order in which the top 5 recommendations appeared. This could otherwise have expressed itself in subjects penalizing a recommendation on the general quality score for not appearing higher in the list. In addition to this, the subjects were asked not to put any weight on to what extent the applications were supported or maintained at the time of the evaluation. The main consequence of this is that applications which are no longer available or maintained should still be considered good recommendations if they fit any of the criteria for being so. The reason for this delimitation was that such applications would ideally have been removed from the dataset beforehand, which was something that time did not allow for in this study. Finally, the subjects were not allowed to use any external source of information other than the Wikipedia articles of the applications evaluated and the recommendations. This included outgoing links from the articles, which were made non-clickable in the evaluation platform.

4.2.4 Interview results

Table 4.6 shows the mean precision for the top 5 recommendations for the 20 applications rated in the interviews. On average over the full evaluation set and all interviews, the LSI based recommendations were scored higher when it came to similarity in terms of features and functionality between the applications described in the articles. However, the mean score of the TF-IDF based recommendations was over twice as high as the score for the LSI based recommendations when functional synergy was assessed. For the final score on how generally satisfied they were with the recommendation in terms of related functionality, the LSI based recommendations were once again scored higher than the ones from TF-IDF. In the following sections, the results are elaborated further, together with the most prominent comments expressed during the interviews.

Table 4.6: The mean precision over all recommendations for the 20 applications evaluated in the interviews.

                                      Mean precision at K=5
Assessed relation                     TF-IDF    LSI
Features and functionality            0,563     0,676
Functional synergy                    0,278     0,128
General quality of recommendation     0,624     0,700

Features and functionality Figure 4.1 shows the mean precision from the four interviews for the top five recommendations for each application in terms of similarity in features and functionality. The LSI based recommendations were on average scored higher for 15 of the 20 evaluated applications. The TF-IDF based recommendations were, on average, scored higher for the remaining 5 applications. For 15 of the evaluated applications, the four interview subjects unanimously gave a higher score to the recommendations from one of the methods. The five applications where the highest scoring method differed were EC3, RG2, RG3, RG4 and SS2. These were also the five applications where the absolute difference in the mean score over all interviews was the lowest.

Figure 4.1: The mean precision for the top five recommendations for each of the 20 applications evaluated in the interviews when similarity in features and functionality was assessed. The blue bars in the diagram show the precision for the top five recommendations for an application when TF-IDF was used, averaged over the four interviews. The red bars show the same measure for when LSI was used.

Functional synergy Figure 4.2 shows the mean precision from the four interviews for the top five recommendations for each application when functional synergy was assessed. For 12 of the applications evaluated, the TF-IDF based recommendations received a higher mean score from the subjects. For two of the evaluated applications, the LSI based recommendations were on average rated higher. Unlike the previously described evaluation of features and functionality, a majority of the applications did not receive a unanimously higher rating from all the subjects. For six of the applications, the recommendations from when TF-IDF was used were scored higher in all four interviews. For the two applications where the LSI based recommendations were scored higher, the rating was never unanimous.

Figure 4.2: The mean precision for the top five recommendations for each of the 20 applications evaluated in the interviews when functional synergy was assessed. The blue bars in the diagram show the precision for the top five recommendations for an application when TF-IDF was used, averaged over the four interviews. The red bars show the same measure for when LSI was used.

General quality of recommendations Figure 4.3 shows the mean precision for how the subjects personally assessed the quality of each recommendation in terms of the two types of relationships in functionality used in the study. The LSI based recommendations were scored higher on average for 15 of the 20 evaluated applications. For the remaining five applications, the TF-IDF based recommendations were scored higher. The distribution of scores is similar to the one where only features and functionality were assessed (see figure 4.1), and the method that received the highest average score for each application is the same between the two. For the recommendations scored higher on functional synergy and integrations, an increase in the general score compared to features and functionality can be seen. This was especially true for the applications where the articles more thoroughly described an integration, which was confirmed in the motivations expressed by the subjects when setting the scores.

Figure 4.3: The mean precision for the top five recommendations for each of the 20 applications evaluated in the interviews when the general quality of the recommendations was assessed. The blue bars in the diagram show the precision for the top five recommendations for an application when TF-IDF was used, averaged over the four interviews. The red bars show the same measure for when LSI was used.

General comments from the subjects All subjects expressed that similarity in features and functionality is what they would generally value the most in an individual recommendation, if forced to select one of the criteria. This is also reflected in the scores they set. How much a strong functional synergy or integration contributes to the general quality of a recommendation was expressed as a more complex matter to evaluate, due to a higher dependency on the user's intention for using the system.

Something that varied across the subjects was how they scored the quality of the recommendations in some of the cases where they had an application with a more basic functionality. In multiple cases, the recommendations for such applications were for applications capable of the same (or very similar) functionality, but with a wider range of additional functionality as well. Two of the subjects expressed that they considered the recommendation to still be equally relevant both in terms of features and functionality and in terms of general quality. One of the subjects generally gave a high score in terms of features and functionality but a lower score on general quality. The motivation in the cases where this gap was large was that an extended range of features would most likely mean higher complexity in using the application, reducing the general quality of the recommendation. The fourth person generally gave a lower score both on features and functionality and on the general quality of the recommendation, with a similar explanation. The difference in his motivation was that he considered the increased complexity from additional features to influence the similarity in features and functionality as well.

One problem expressed by all subjects was that some of the articles were too short for them to be able to comprehend the main features and functionality of the application described. The problem was present both in the articles of the 20 applications evaluated and in some of the articles of the recommended applications. This in turn led to confusion in how to assess the quality of the recommendations, and led to a higher variance between the subjects in how the scores were set, compared to the other applications in the evaluation set where the explanations of functionality were more thorough.

Another thing that emerged during the interviews was that the subjects felt it was hard to know how to set the score on functional synergy and integrations, especially for applications that they had no experience of working with themselves. Some of these articles only briefly mentioned integration or another type of cooperation between the applications, which made it hard to know what score to set without any prior knowledge or additional information.

In addition to this, the subjects expressed that they felt the general quality of a recommendation to be something that depends on what the user of the system is interested in. Some users might be looking for alternative applications with the same or similar functionality, while others might value recommendations for applications that provide additional functionality through integrations or by other means. It was also noted by several of the subjects that the recommendations from when TF-IDF was used had a higher representation of recommendations from the same manufacturer than those from when LSI was used. It was suggested that such recommendations might also be of interest. One of the subjects suggested that the system that generally provided the most variation between the individual recommendations could be the best, since this would satisfy a wider range of use-cases and information needs.

Chapter 5

Discussion and conclusion

In the following section, the outcome and limitations of the results presented in chapter 4 are discussed. After this, the conclusion drawn from the experiments is presented, followed by a final section on possible extensions for future work.

5.1 Discussion

In the following subsections, the outcome and limitations of the two experiments are discussed.

5.1.1 Comparison by software category

In the first part of the evaluation, where matching software categories were used to indicate a relevant recommendation, the difference between the two recommenders is so small that no real conclusion can be drawn on which to prefer over the other. The tests indicated a slight advantage to the LSI recommender for the smallest number of recommendations retrieved (K = 5), but only with about half a percentage point higher mean precision compared to the TF-IDF recommender. For larger sets of retrieved recommendations, TF-IDF had the slightly higher mean precision of the two.

A higher precision at smaller sets of retrieved recommendations could however be preferable to a higher precision at larger sets. This depends on factors such as the intended use of the recommender system as well as the dataset of possible recommendations. For the application dataset used in this experiment, the median number of applications in each software category was 20. If the purpose of the recommender were to retrieve recommendations with a matching software category label, returning more than 20 recommendations would make it highly likely that irrelevant recommendations are returned for applications from at least half of the software categories. The exception to this is the cases where applications have multiple software category labels, which is a little over 13% of the applications in the full dataset.

A big problem with using only matches in software categories for evaluation is that they only provide a binary distinction between relevant and not relevant. Many of the categories capture a very broad range of applications, and there are bound to be groupings within them that are more, or less, similar to each other in terms of functionality. An example of this is the most frequent software category label in the evaluation dataset, Collaborative software, which was assigned to 139 applications. This subset was examined, and found to contain applications primarily targeted towards word processing, web servers, email clients, web conferencing and other areas. This is problematic, as applications that would not be considered similar in functionality could still be counted as relevant and yield a false positive.

Another side of the problem was that of overlapping, or very similar, software categories. Going back to the example of the collaborative software category, it contained subsets of applications from other software categories. One of these groups consisted of applications that were clearly stated to be web conferencing software when browsing their individual articles. Many of these were correctly given multiple labels during the data collection, as they also appeared in the Wikipedia list of web conferencing software. However, a number of them did not appear in this list even though their individual articles clearly state that they are web conferencing software. This means that they are only labelled as Collaborative software in the evaluation set, which will lead to false negatives in the evaluation if they appear as recommendations for an application labelled Web conferencing software.

5.1.2 Interviews

The second part of the evaluation was the four structured interviews performed. The purpose of the interviews was to get a more fine-grained evaluation of similarity in features and functionality, compared to only using matches between software categories as the relevance indicator. They also had the purpose of assessing the relevance of recommendations whose functionality synergizes with that of the evaluated application, without the two necessarily being of the same software type.

The people interviewed were between the ages of 27 and 58, with three being male and one female. All subjects were purposefully selected to have experience in using software from the categories in the interview set in their everyday work, so that they would have an understanding of the features and functionality commonly found in such applications, which facilitates similarity comparisons between them. It is however quite possible that the end user of a recommendation system such as the one evaluated is not a person with good knowledge of the type of application that he or she wants recommendations for.

It could be imagined that a person who already has insight into a certain type of application also has some knowledge of the alternatives available on the market, which would decrease the need for recommendations. In contrast, a person researching alternative applications with only limited knowledge of the application type might benefit more from having recommendations. However, to facilitate the evaluation process, it was decided to use people with more insight into the software categories used in the evaluation.

The people interviewed scored the recommendations from the LSI-based recommender higher in terms of similarity in features and functionality and in terms of general quality for the evaluated applications. The recommendations from the TF-IDF-based recommender were on average scored higher in terms of complementary functionality and integrations. For several of the articles evaluated, integrations were only briefly discussed in one or a few sentences. The LSI-based recommender did not return any such articles in its top five recommendations for any of the applications evaluated, while the TF-IDF-based recommender did so on several occasions. The fact that a TF-IDF representation of an article explicitly represents every individual word remaining after the pre-processing, while an LSI representation transforms the article into a set of abstract concepts, might contribute to this. If only a few words are present that reference the relation between two applications, a representation by concepts might abstract these away.

A problem with using a relatively small subset for evaluation is that the results risk becoming very dependent on which applications are selected. Many of the software categories in the dataset were either very specific or demanded prerequisite knowledge in order to comprehend the information in the articles of the individual applications. Due to limited experience of working with such applications among the people available for interviewing, a choice was made to use applications more related to everyday office work. The applications used in the evaluation were randomly selected from five such categories, all of which the subjects had confirmed experience of working with.

The problem with selecting applications from only five categories is that the results risk not being representative of the whole set of applications from all categories. This is an especially large risk since the selected categories shared the property of being related to everyday office work, which most of the software categories in the dataset are not. Ideally, the evaluation would have been performed on a randomized set of applications drawn from all software categories. However, having people without any prerequisite knowledge evaluate applications from some of the more specialized categories, such as electronic circuit simulators or molecular mechanics software, was considered too great a threat to the reliability of the results.

Another problem with using a small set of 20 applications for the interviews is the risk of obtaining results that are unrepresentative compared to the full set.

The randomized applications might heavily favour one of the recommenders, even though this might not be the case for the full dataset. As seen in Table 4.5, the randomized applications were not representative in terms of how large a fraction of the top five recommendations fell within the same software category, compared to the average over the whole dataset. This might have had an impact on the results from the interviews, and it increases the difficulty of drawing conclusions on which recommender produces the best results for the whole application dataset. However, since no data were available to evaluate to what extent the software category correlates with the features and functionality of an application, a choice was made to proceed with the randomized set. An alternative would have been to hand-pick a set of applications that better represents the qualities of the full application dataset, but at the risk of introducing a bias in the selection process.

Another option would be to use a larger evaluation set. The problem with this is that the quality of the assessments might be reduced if the interviews become too long. Even with only 20 applications in the evaluation set, the subjects had to read and assess five potentially distinct articles per recommender and application. When including the article of the application for which the recommendations are evaluated, this sums to a worst case of 220 distinct articles (20 applications × (2 × 5 recommendations + 1)) that a subject would have to read and understand during each interview. For the evaluation set used, this number was 128 distinct articles, as several recommendations either re-appeared for multiple applications or appeared in the recommendations from both recommenders. Even though this number is significantly lower than the worst case, the interviews still took several hours each to perform, with multiple breaks needed.

It was noted that the time put into reading each article in many cases decreased as the interviews progressed, and several of the subjects stated that the work was tiring when asked about it. This might have negatively influenced the quality of the scores given, especially during the latter part of each interview. A possible improvement to the interview process would be to restructure or randomize the order in which the applications are evaluated in each interview. If an application is evaluated early on in some interviews and towards the end in others, the risk of a decrease in quality would be spread out over all applications instead of affecting the same subset in each interview. It could also help negate any other potential effects of the order in which the applications are evaluated.

In addition to this, the subjects had personal experience of working with some of the applications evaluated. In these cases, they tended to rely more on their own experience of the application than on the description of it in the article. This means that for some applications, the basis on which the quality of the recommendations was assessed was not the same for all subjects. These cases are simply one expression of the fact that different people bring different knowledge and experience into how they judge the quality of a recommendation, which could influence the scores in both directions and would be hard to avoid.

5.2 Obstacles

The biggest obstacle encountered during the project was setting up a process for evaluating the recommendations. The problem with assessing the relevance of a set of recommendations is that relevance depends to a large extent on what the individual user is looking for, which was confirmed by the comments from the interviews.

The initial idea was to base the evaluation on functionality, using one of the commercially available feature databases for applications, such as Capterra (http://www.capterra.com), GetApp (https://www.getapp.com) or Software Advice (http://www.softwareadvice.com). These services contain lists of high-level features for a large set of applications, and the idea was to use these lists as a basis for quantitatively measuring functional similarity between applications. The problem was that, upon closer examination, most of the applications in the collected Wikipedia dataset did not have such lists available on any of these sites. A decision was therefore made to drop this idea and instead use a broader functional indicator in the form of software categories, complemented with a set of in-depth interviews for a subset of the applications. In addition to being time-consuming, the interviews were limited by the availability of people with the right prerequisite knowledge to evaluate some types of applications, especially the more specialized ones. To summarize, both the process and the quality of the results would have benefited strongly from better evaluation data, such as structured information on application functionality, more fine-grained software categories or, possibly, user information such as likes or ratings of applications.
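Had such feature lists been available, one straightforward way to quantify functional similarity would have been the overlap (Jaccard similarity) between the feature sets of two applications. The sketch below only illustrates that dropped idea; the feature names are invented, and no such data was collected in the study.

```python
def jaccard_similarity(features_a, features_b):
    """Overlap between two sets of high-level feature labels."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Invented example of the kind of feature lists found on sites such as
# Capterra or GetApp.
app_a = {"contact management", "email marketing", "reporting"}
app_b = {"contact management", "reporting", "lead scoring", "quoting"}
print(jaccard_similarity(app_a, app_b))  # 2 shared features / 5 in total = 0.4
```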

5.3 Conclusion

The results gave no clear indication of which of the two methods is preferable for recommending applications with related functionality. The LSI-based recommender returned a slightly higher fraction of applications from the same software category when the requested set of recommendations was relatively small. For larger sets of requested recommendations, the TF-IDF-based recommender returned a higher fraction of applications from the same software category.

The mean precision for the recommendations to the 20 applications rated in the interviews was 0.624 for TF-IDF and 0.700 for LSI when general quality in terms of related functionality was assessed. The recommendations from the LSI-based recommender were rated higher by the subjects in terms of similarity in functionality and features, and in terms of the general quality of the recommendations with respect to functional relations. The recommendations from the TF-IDF-based recommender were rated higher in terms of integrations and functional synergy.


However, the recommendations for the 20 applications in the evaluation set used were found to belong to the same software category to a higher extent for the LSI-based recommender than its average over the whole dataset, while for the TF-IDF-based recommender this number was slightly lower than its average over the whole dataset. This could indicate a bias in the interview evaluation set towards the LSI-based recommender, and it makes it difficult to draw any conclusion from the results of the experiment on which method is preferable.

5.4 Future work

To conclude which of the two methods evaluated in this study produces the highest-quality recommendations, further studies are needed, with both a larger dataset of application articles and more people evaluating the recommendations. The availability of test subjects with more extensive knowledge of applications from different software categories would also increase the quality of such a result, as more types of applications could be evaluated.

There are also several other methods, as well as extensions to the TF-IDF and LSI methodologies, that could be evaluated. Examples include topic modelling methods such as Latent Dirichlet Allocation (LDA) and methods that make use of an external knowledge base, such as Explicit Semantic Analysis (ESA). Both were partially researched for this project and have been successfully applied in similar contexts (Gabrilovich & Markovitch, 2007). There are also more Wikipedia-specific methods that could be evaluated, such as the Wikipedia Link-based Measure, which uses the hyperlink structure between articles to calculate their similarity (Witten & Milne, 2008). A sketch of how one such alternative could be slotted into the existing pipeline is given below.
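As an illustration only, the following sketch shows how an LDA-based recommender could replace the LSI model in a future study. The use of the gensim library, the function name and the parameter values are assumptions made for this example and do not describe the implementation used in this thesis.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.similarities import MatrixSimilarity


def build_lda_recommender(documents, num_topics=100):
    """documents: one list of pre-processed tokens per application article."""
    dictionary = Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    # Index every article by its topic distribution for cosine-similarity lookups.
    index = MatrixSimilarity(lda[corpus], num_features=num_topics)

    def recommend(app_id, k=5):
        """Return the indices of the k articles most similar to article app_id."""
        similarities = index[lda[corpus[app_id]]]
        ranked = sorted(enumerate(similarities), key=lambda pair: pair[1], reverse=True)
        return [i for i, _ in ranked if i != app_id][:k]

    return recommend
```

A TF-IDF or LSI recommender built in the same way would differ only in the model class used, which is what makes such an extended comparison straightforward to set up.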

Bibliography

Abiteboul, S. (1997). Querying semi-structured data. Database Theory—ICDT '97, 1–18.
Arasu, A., & Garcia-Molina, H. (2003). Extracting structured data from web pages. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (pp. 337–348).
Ballatore, A., Bertolotto, M., & Wilson, D. C. (2014). An evaluative baseline for geo-semantic relatedness and similarity. GeoInformatica, 18(4), 747–767.
Buneman, P. (1997). Semistructured data. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 117–121).
Debnath, S., Ganguly, N., & Mitra, P. (2008). Feature weighting in content based recommendation system using social network analysis. In Proceedings of the 17th International Conference on World Wide Web (pp. 1041–1042).
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391.
Foltz, P. W., Laham, D., & Landauer, T. K. (1999). Automated essay scoring: Applications to educational technology. In World Conference on Educational Multimedia, Hypermedia and Telecommunications (Vol. 1, pp. 939–944).
Fox, C. (1989). A stop list for general text. In ACM SIGIR Forum (Vol. 24, pp. 19–21).
Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI (Vol. 7, pp. 1606–1611).
Gee, K. R. (2003). Using latent semantic indexing to filter spam. In Proceedings of the 2003 ACM Symposium on Applied Computing (pp. 460–464).
Giles, J. (2005). Internet encyclopaedias go head to head. Nature Publishing Group.
Gong, Y., & Liu, X. (2001). Creating generic text summaries. In Proceedings of the Sixth International Conference on Document Analysis and Recognition (pp. 903–907).
Homayouni, R., Heinrich, K., Wei, L., & Berry, M. W. (2004). Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics, 21(1), 104–115.
Hu, M., Lim, E.-P., Sun, A., Lauw, H. W., & Vuong, B.-Q. (2007). Measuring article quality in Wikipedia: models and evaluation. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (pp. 243–252).


Huang, A. (2008). Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand (pp. 49–56).
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition (2nd ed.). Upper Saddle River, NJ, USA: Prentice Hall PTR.
Kishida, K. (2005). Property of average precision and its generalization: An examination of evaluation indicator for information retrieval experiments. National Institute of Informatics, Tokyo, Japan.
Manning, C. D., Raghavan, P., & Schütze, H. (2008a). Evaluation in information retrieval. Introduction to Information Retrieval, 151–175.
Manning, C. D., Raghavan, P., & Schütze, H. (2008b). Matrix decompositions and latent semantic indexing. Introduction to Information Retrieval, 403–417.
Merriam-Webster's online dictionary. (n.d.). Merriam-Webster Dictionary. Retrieved 2017-09-04, from https://www.merriam-webster.com/dictionary/application
OED Online. (n.d.-a). Oxford University Press. Retrieved 2017-09-04, from https://en.oxforddictionaries.com/definition/synonym
OED Online. (n.d.-b). Oxford University Press. Retrieved 2017-09-04, from https://en.oxforddictionaries.com/definition/polysemy
OED Online. (n.d.-c). Oxford University Press. Retrieved 2017-09-04, from https://en.oxforddictionaries.com/definition/hyponymy
OED Online. (n.d.-d). Oxford University Press. Retrieved 2017-09-04, from https://en.oxforddictionaries.com/definition/hypernym
Pazzani, M. J., & Billsus, D. (2007). Content-based recommendation systems. In The adaptive web (pp. 325–341). Springer.
Ramos, J. (2003). Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning (Vol. 242, pp. 133–142).
Ricci, F., Rokach, L., & Shapira, B. (2011). Introduction to recommender systems handbook. In Recommender systems handbook (pp. 1–35). Springer.
Salton, G., Wong, A., & Yang, C.-S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
Wikipedia. (n.d.). Wikimedia Foundation. Retrieved 2017-09-08, from https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
Willett, P. (2006). The Porter stemming algorithm: then and now. Program, 40(3), 219–223.
Witten, I. H., & Milne, D. N. (2008). An effective, low-cost measure of semantic relatedness obtained from Wikipedia links.

Appendix A

List of stop words

The following is the list of English stop words that was used during the pre-processing of the articles. The list is included in the NLTK library for the Python programming language.

he, were, only, again, are, once, after, ourselves, then, shouldn, weren, between, she, each, while, don, didn, until, into, in, hadn, a, theirs, both, as, an, below, herself, haven, by, them, if, isn, they, mightn, his, through, before, hasn, being, should, was, this, on, there, same, some, when, hers, to, its, that, is, but, at, will, just, you, d, their, ve, yours, further, my, doing, y, myself, nor, does, we, have, up, too, off, how, here, needn, itself, other, be, any, no, couldn, doesn, where, aren, am, ma, m, so, shan, having, for, had, these, ll, than, from, over, very, ours, and, all, yourselves, our, ain, me, with, who, down, it, most, the, himself, i, now, re, what, of, or, more, did, those, whom, yourself, her, your, why, has, own, themselves, wasn, him, not, because, s, can, wouldn, such, during, out, which, been, about, t, won, mustn, o, do, few, above, against, under
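As an illustration only, the list can be loaded and applied during pre-processing roughly as follows. The tokenization shown is a simplification, and the use of the Porter stemmer (cf. Willett, 2006) is an assumption about the surrounding pipeline rather than a description of the exact implementation used in this thesis.

```python
import re

from nltk.corpus import stopwords   # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()


def preprocess(article_text):
    """Lowercase, keep alphabetic tokens, drop stop words and stem the rest."""
    tokens = re.findall(r"[a-z]+", article_text.lower())
    return [stemmer.stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("Skype is an application that provides video chat and voice calls."))
# -> ['skype', 'applic', 'provid', 'video', 'chat', 'voic', 'call']
```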

Appendix B

Software categories

The following is the list of Wikipedia software category labels used to label the application dataset. The list contains 115 distinct labels.

Project management software, Enterprise bookmarking platform, License manager, Accounting software, Web conferencing software, Enterprise Resource Planning, Kanban software, Password manager, Time-tracking software, OLAP Server, Backup software, Source code hosting facility, Reporting software, System dynamics software, Word processor, DNS Server software, Download manager, Content management system (CMS), Optimization software, Computer assisted qualitative data analysis software (CAQDAS), Haskell web framework, Business integration software, Molecular mechanics modeling software, Deep learning software, Application virtualization software, Packet analyzer, Computer algebra system (CAS), CAD, Relational database management system, Object-relational mapping software, SSH Server, Solaris antivirus software, Java web framework, HTML editor, 3D Computer graphics software, Video editing software, Vector graphics editor, Virtualization software, macOS antivirus software, FTP client software, Graphical FTP server software, Application server, Numerical analysis software, File synchronization software, Geographic information system, Reference management software, Scala web framework, CFML web framework, Web server software, Decision-making software, File archiver, VoIP software, Terminal based FTP server software, Continuous integration software, Calendaring software, SVN client, Remote desktop software, BlackBerry antivirus software, Data modeling tool, Computer simulation software, Electronic design automation (EDA), Database tool, Content control software, Disk authoring software, ASP.NET web framework, Shopping cart software, Ruby web framework, Symbian antivirus software, Email client, Electronics circuit simulator, Windows mobile antivirus software, antivirus software, C++ web framework, Desktop publishing software, Object-relational database management system, Version control software, Image viewer, Mobile CRM, Digital audio workstation, IDE, Image organizers, Wiki software, Scrum software, Raster graphics editor, PHP web framework, Webmail client, Mind-mapping software, Collaborative software, D web framework, Discrete event simulation software, iOS antivirus software, Perl web framework,


Learning management system, 3D modeling software, Lisp web framework, Spreadsheet software, JavaScript web framework, Workflow management system, SSH client, Web analytics software, Development estimation software, UML tool, Configuration management software, Database management system, Issue-tracking system, Windows antivirus software, Other web framework, Android antivirus software, Disk encryption software, server, Python web framework, Media player, Information graphics software, CRM

Appendix C

Evaluated applications

Table C.1 contains the names and corresponding abbreviations for the applications that were evaluated in the interviews.

Table C.1: Abbreviations used in place of the application names.

Abbreviation   Application
EC1            Eureka Email
EC2
EC3
EC4            GNUMail
RG1            Adobe Photoshop
RG2            iPhoto
RG3            GimPhoto
RG4            F-Spot
SS1            StarOffice
SS2            Quattro Pro
SS3            PlanMaker
SS4            Microsoft Excel
WC1            Zoom Video Communications
WC2            Skype
WC3            BigBlueButton
WC4            WizIQ
WP1            TextMaker
WP2            LibreOffice Writer
WP3            EZ Word
WP4            OpenOffice.org

Appendix D

Interview scoring criteria

Tables D.1 to D.3 contain the criteria given to the subjects during the interviews, to help them set the scores.

Table D.1: The scores and criteria used for similarity in features and functionality during the interviews.

Score   Criteria
3       Very similar product features and functionality. The two applications are to a large extent able to perform the same tasks, in a similar manner.
2       Fairly similar product features and functionality. The recommendation might either miss some key functionality, or have a significantly different way of performing tasks.
1       A few similar features/features within the same area, but key features are missing.
0       The recommended application has very little, or no shared features and functionality with the evaluated application.

Table D.2: The scores and criteria used for functional synergy during the interviews.

Score   Criteria
3       Strong functional synergy. The functionality of the recommendation strongly enhances and/or complements the functionality of the evaluated application.
2       Fairly strong functional synergy. The functionality of the recommendation enhances and/or complements the functionality of the evaluated application.
1       Weak functional synergy. The functionality of the recommendation weakly enhances and/or complements the functionality of the evaluated application.
0       None, or insignificant functional synergy.


Table D.3: The scores and criteria used for general relevance from related functionality.

Score   Criteria
3       Strong relevance from functional relations.
2       Fairly strong functional relation.
1       Weak functional relation.
0       None, or insignificant functional relation.