DEGREE PROJECT IN THE FIELD OF TECHNOLOGY MEDIA TECHNOLOGY
AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Content based filtering for application software

DAVID LINDSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: February 27, 2018
Supervisor: Jeanette Hellgren Kotaleski
Examiner: Anders Lansner
Principal: ISOFT Services AB
Swedish title: Innehållsbaserad filtrering för applikationsprogramvara
School of Computer Science and Communication

Abstract

In this study, two methods for recommending application software were implemented and evaluated based on their ability to recommend alternative applications with functionality related to the application that a user is currently browsing. One method was based on Term Frequency–Inverse Document Frequency (TF-IDF) and the other on Latent Semantic Indexing (LSI). The dataset used was a set of 2501 articles from Wikipedia, each describing a distinct application.

Two experiments were performed to evaluate the methods. The first consisted of measuring to what extent the recommendations for an application belong to the same software category, and the second was a set of structured interviews in which the recommendations for a subset of the applications in the dataset were evaluated in more depth.

The results from the two experiments showed only a small difference between the methods, with a slight advantage to LSI for smaller sets of retrieved recommendations and an advantage to TF-IDF for larger sets. The interviews indicated that the recommendations produced with LSI more often had functionality similar to that of the evaluated applications, while the recommendations produced with TF-IDF contained a higher fraction of applications whose functionality complemented or enhanced that of the evaluated applications.

Sammanfattning

I studien implementerades och utvärderades två alternativa implementationer av ett rekommendationssystem för applikationsprogramvara. Implementationerna utvärderades baserat på deras förmåga att föreslå alternativa applikationer med relaterad funktionalitet till den applikation som användaren av ett system besöker eller visar. Den ena implementationen baserades på Term Frequency-Inverse Document Frequency (TF-IDF) och den andra på Latent Semantic Indexing (LSI). Det data som användes i studien bestod av 2501 artiklar från engelska Wikipedia, där varje artikel bestod av en beskrivning av en applikation.

Två experiment utfördes för att utvärdera de båda metoderna. Det första experimentet bestod av att mäta till vilken grad de rekommenderade applikationerna tillhörde samma mjukvarukategori som den applikation de rekommenderats som alternativ till. Det andra experimentet bestod av ett antal strukturerade intervjuer, där rekommendationerna för en delmängd av applikationerna utvärderades mer djupgående.

Resultaten från experimenten visade endast en liten skillnad mellan de båda metoderna, med en liten fördel till LSI när färre rekommendationer hämtades, och en liten fördel för TF-IDF när fler rekommendationer hämtades. Intervjuerna visade att rekommendationerna från den LSI-baserade implementationen till en högre grad hade liknande funktionalitet som de utvärderade applikationerna, och att rekommendationerna från när TF-IDF användes till en högre grad hade funktionalitet som kompletterade eller förbättrade de utvärderade applikationerna.

Contents

1 Introduction
  1.1 Definitions
    1.1.1 Application
    1.1.2 Synonymy
    1.1.3 Polysemy
    1.1.4 Hyponymy
    1.1.5 Hypernymy
    1.1.6 Semantic similarity and relatedness
    1.1.7 Structured and Semi-structured text
    1.1.8 Recommender systems
    1.1.9 Wikipedia
  1.2 Research question
  1.3 Objective
  1.4 Delimitation

2 Background
  2.1 Vector space model
    2.1.1 Term frequency-inverse document frequency
    2.1.2 Latent Semantic Indexing
  2.2 Cosine similarity
  2.3 Text pre-processing
    2.3.1 Tokenization
    2.3.2 Stop words
    2.3.3 Stemming
  2.4 Evaluation metrics
    2.4.1 Precision at K
    2.4.2 Mean average precision

3 Method
  3.1 Data collection and labelling
  3.2 Index creation
  3.3 Recommendation process
  3.4 Experiments
  3.5 Used software
    3.5.1 Natural Language Toolkit (NLTK)
    3.5.2 Gensim
    3.5.3 Wikipedia (Python package)
    3.5.4 Django REST framework
    3.5.5 React
  3.6 Platform

4 Results
  4.1 Comparison by software category
    4.1.1 Precision at K
    4.1.2 Mean average precision
  4.2 Interviews
    4.2.1 Evaluated applications
    4.2.2 Interview Process
    4.2.3 Interview process delimitation
    4.2.4 Interview results

5 Discussion and conclusion
  5.1 Discussion
    5.1.1 Comparison by software category
    5.1.2 Interviews
  5.2 Obstacles
  5.3 Conclusion
  5.4 Future work

Bibliography

A List of stop words

B Software categories

C Evaluated applications

D Interview scoring criteria

Chapter 1

Introduction

The rapid growth of the internet, together with accelerating processes of digitalization, has led to an increasing number of software applications becoming available over the last years. A problem with this is that for most people, only a tiny part of the software in the world is of interest. If faced with a specific need for functionality, the information that is of relevance is at risk of being obscured by the growing amount of data available. There is a general need for help in filtering out the information that is of relevance, and numerous services of different types either incorporate this as a part of their functionality or base their whole functionality around it.

One common mechanism for doing so is to provide recommendations for content that might be of interest to a user. Today we see the presence of recommender systems in many parts of the internet, including when we buy or browse for books, movies or read the news online. The purpose of this study is to investigate and evaluate the quality of a recommender system for application software, where recommendations of alternative software are given based on which application a user is currently browsing. As there are numerous technologies and algorithms for recommending content, two specific methods have been selected for this study. The first is a method based on Term Frequency–Inverse Document Frequency (TF-IDF) and the second is a method based on Latent Semantic Indexing (LSI). The dataset on which they are evaluated is a set of articles from English Wikipedia, each describing an application.

1.1 Definitions

The following section contains definitions for terms that are central to the report.

1.1.1 Application

An application is a program (such as a word processor or spreadsheet) that performs a particular task or set of tasks (Merriam-Webster’s online dictionary, n.d.).


Throughout the report, the terms application, software and application software will be used interchangeably.

1.1.2 Synonymy

A synonym is a word or a phrase that means exactly or nearly the same as another word or phrase in the same language. For example, shut is a synonym of close (OED Online, n.d.-a).

1.1.3 Polysemy

Polysemy is the coexistence of many possible meanings for a word or phrase (OED Online, n.d.-b). An example is the word book, which can have multiple meanings. It can refer to a physical book that you can read, or it can refer to registering or scheduling some event (e.g. to book a hotel room).

1.1.4 Hyponymy

A hyponym is a word of more specific meaning than a general or superordinate term applicable to it. For example, spoon is a hyponym of cutlery (OED Online, n.d.-c).

1.1.5 Hypernymy

A hypernym is a word with a broad meaning, constituting a category into which words with more specific meanings fall; a superordinate. For example, colour is a hypernym of red (OED Online, n.d.-d).

1.1.6 Semantic similarity and relatedness

Semantic similarity is defined as a subset of the more general notion of semantic relatedness. Semantic relatedness between terms or texts refers to any type of relation between the two, whereas the more specific notion of semantic similarity refers to them being related by synonymy, hyponymy or hypernymy. In this sense, the words train and bus are semantically similar, as they are both means of transportation. On the other hand, the words bus and road would be considered semantically related but not semantically similar, as they often co-occur, but with different roles in the context in which they appear (Ballatore, Bertolotto, & Wilson, 2014). In the report, the use of similarity and relatedness between articles or words will refer to semantic similarity and semantic relatedness.

1.1.7 Structured and Semi-structured text

Structured text refers to text that resides in a fixed structure, so that sets of tuples of the same kind can be stored and processed in a database. This includes spreadsheets, table oriented text as in a relational model or sorted-graph as in object databases (Arasu & Garcia-Molina, 2003). Unstructured text is raw text that does not have any pre-defined data model or structure (Abiteboul, 1997).

Semi-structured text is text that is neither raw data nor strictly typed, but is normally associated with a schema that is contained within the text. Such text is often called self-describing, and includes tagged text such as HTML, XML or JSON documents (Buneman, 1997).

1.1.8 Recommender systems

Recommender systems have the purpose of suggesting content that might be of interest to a user. The purpose of this is generally to help the user find the information that is of relevance in a large space of possible options. Today we see the presence of recommender systems in many parts of the internet, including when we buy or browse for books, movies or read the news online. The systems themselves are often divided into groups based on which mechanism or mechanisms are used to provide recommendations. Below is a brief overview of two of the most commonly used categories (Ricci, Rokach, & Shapira, 2011).

Content based filtering Content based filtering systems recommend items based on their content, or on a set of attributes describing the items. The decision on which items to recommend is typically based on finding the items with content similar to, for example, the item that the user is currently browsing, or has browsed or liked before. Similarity between items is computed as a function of their contents, and the items with the highest similarity are returned as recommendations (Pazzani & Billsus, 2007).

Collaborative filtering In collaborative filtering methods, the items that are recommended to a user are items that have been rated highly by other users with a similar profile. Depending on what the system is designed for, the similarity of profiles can be based on earlier ratings and preferences, items bought in the past or other information available on the user profiles (Debnath, Ganguly, & Mitra, 2008).

1.1.9 Wikipedia

In March 2017, there were over 5 million articles in the English version of the web based encyclopaedia Wikipedia, and this number has grown by hundreds of thousands of new articles every year since 2004 (Wikipedia, n.d.). The content of Wikipedia is user created, which means that anyone can edit or create an article. This has led to some debate on the quality of the information it contains. In a study from 2005, a comparison was made between Wikipedia and Encyclopaedia Britannica, which is a more traditional and well-established encyclopaedia with around 100 full time editors and over 4000 external contributors, including 110 Nobel prize winners and 5 American presidents. The study compared 50 science related entries in both encyclopaedias, and showed that there was little to no difference in quality between the two (Giles, 2005). Other studies, and the nature of Wikipedia itself, suggest that the articles are not of uniformly good quality, since anyone can edit or create an article and can therefore, consciously or unconsciously, contribute false or fake content (Hu, Lim, Sun, Lauw, & Vuong, 2007).

1.2 Research question

Given a dataset of Wikipedia articles, each about a distinct application, which item representation, TF-IDF weighting or Latent Semantic Indexing (LSI), is the most suitable for a content based filtering system for recommending applications with functionality related to the one that a user is currently browsing?

1.3 Objective

The desired outcome is an evaluation of how well a content based filtering system using Wikipedia as the source of information works for application software. More specifically, the objective is to investigate two possible implementations of such a system, one using TF-IDF weighting and one using LSI. The two will be evaluated both individually and in comparison with each other, to reach a conclusion on which implementation is best able to recommend applications with related functionality, if either of them is capable of doing so.

1.4 Delimitation

Related functionality between applications is limited to two types of relations. The first relation is similarity in functionality and features, which refers to applications that are able to perform the same or a similar set of tasks, in a similar manner. The second type of relation is functional synergy, which refers to recommendations that in some way complement or enhance the functionality of the evaluated application. Examples include software integration, applications from the same suite or a relation through available plugins.

In the implementation phase of the study, the Wikipedia articles will be collected once, and then handled offline. Articles will be fetched from the English version of Wikipedia, and are all assumed to be in English. Libraries for natural language processing and machine learning will be used, to the extent that this is possible. The runtimes of the algorithms evaluated will not be part of the evaluation, as focus will be put solely on the quality of the recommendations.

Chapter 2

Background

This section contains the background and theory behind the methods that were implemented and evaluated.

2.1 Vector space model

Many of the content based systems researched for the study share a common foundation in how the documents or texts are represented, including the two that were selected for implementation and evaluation. This representation is commonly found in information retrieval systems, and ignores the order in which the terms in a document or text appear. Because of this, models using this representation are often referred to as bag-of-words models.

Given the set of terms that each document in a collection contains, each of the documents can be represented as a vector, with each element in the vector representing a term, or a set of terms, that occurs within the collection (Salton, Wong, & Yang, 1975). Representing each document as such a vector is normally referred to as using the vector space model. The process of transforming the documents in a collection to vector representations is called indexing, and the set of pre-calculated document vectors is called an index (Jurafsky & Martin, 2009). The value of each feature in the vector representation of a document is called the term weight, and is normally a function of the frequencies of the terms in the document. In the following subsections, two common heuristic techniques for constructing document vectors are presented.

2.1.1 Term frequency-inverse document frequency

A common heuristic technique for calculating the term weights is the term frequency-inverse document frequency (TF-IDF) weighting scheme. As seen in equation 2.1, the term weight for term $i$ in document $j$ is calculated by multiplying its term frequency $tf_{i,j}$ with what is referred to as the inverse document frequency, $idf_i$, for that term. The term frequency is a measure of how many times term $i$ appears in document $j$. The purpose of the inverse document frequency is to give higher weight to terms that appear in fewer documents in the collection, and lower weight to terms that occur in a larger number of documents. The idea behind this is that terms that occur in fewer documents in the collection have higher discriminative power, and should get a higher weight than words that appear in a larger number of documents. As shown in equation 2.2, the $idf_i$ for a word $i$ is commonly calculated by dividing the total number of documents in the collection, $N$, by the number of documents in which the term $i$ appears, $n_i$. To avoid overvaluing the impact of words that appear only in a very small number of documents, a common heuristic is to normalize the idf by taking the logarithm of this quotient. For a vector space representation using the TF-IDF weighting scheme, each document in a collection is represented as a vector where each value corresponds to the TF-IDF value for a term appearing in that document.

w_{i,j} = tf_{i,j} \times idf_i \qquad (2.1)

idf_i = \log \frac{N}{n_i} \qquad (2.2)
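As an illustration of equations 2.1 and 2.2, the sketch below computes TF-IDF weights for a small collection of tokenized documents. It is a minimal example written for this report, not the implementation used in the study (which relied on Gensim); the toy documents are placeholders.

import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute TF-IDF weights (equations 2.1 and 2.2) for tokenized documents."""
    n_docs = len(documents)
    # Number of documents in which each term appears (n_i in equation 2.2).
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # Raw term frequency tf_{i,j}.
        vectors.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in tf.items()})
    return vectors

docs = [["word", "processor", "text"],
        ["spreadsheet", "cell", "text"],
        ["word", "processor", "document"]]
print(tfidf_vectors(docs)[0])

Note that a term occurring in every document gets the weight zero, reflecting its lack of discriminative power.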

Despite its strengths and wide-spread use in information retrieval, the TF-IDF representation has its limitations. It is unable to model relationships between terms, such as synonymy, polysemy, hyponymy or hypernymy. When calculating similarity between two documents, this could have consequences both in documents being considered more similar than they actually are (false positives) and documents being considered less similar than they actually are (false negatives) (Ramos, 2003).

2.1.2 Latent Semantic Indexing

Latent semantic indexing (LSI) is a method used within many areas of natural language processing and information retrieval. In addition to recommender systems, it has been used in areas such as text summarization (Gong & Liu, 2001), spam filtering (Gee, 2003), automatic grading of essays (Foltz, Laham, & Landauer, 1999) and gene-clustering (Homayouni, Heinrich, Wei, & Berry, 2004).

The method, which is sometimes also referred to as latent semantic analysis (LSA), makes use of singular value decomposition (SVD) with the purpose of uncovering an underlying semantic structure partially obscured by randomness in the author's choice of words. The method uses SVD to cast documents and queries into a low-rank vector representation, where each element in the vector represents an abstract concept consisting of words that appear in similar contexts throughout the document collection. The idea behind this is that words that appear in similar contexts tend to have a similar meaning (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990).

More specifically, the singular value decomposition is used to create a lower-rank representation of what is normally referred to as the term-document matrix of a document collection. A term-document matrix, such as $C$ shown in equation 2.3, is an $M \times N$ matrix where each row represents a distinct term in the collection, and each column represents a document. An element $x_{i,j}$ in the matrix denotes how many times the term $i$ appears in document $j$.

C = \begin{pmatrix} x_{1,1} & \cdots & x_{1,N} \\ \vdots & \ddots & \vdots \\ x_{M,1} & \cdots & x_{M,N} \end{pmatrix} \qquad (2.3)

The dot product between two term vectors in $C$ can be used to give the correlation between the two terms over the set of documents. The full set of dot products between term vectors is given by the matrix product $CC^T$, where the entry $(i, j)$ in the product represents the overlap between terms $i$ and $j$ based on their co-occurrences in the documents. Similarly, the correlations between the documents over the terms are given by the product $C^T C$.

As shown in equation 2.4, there is a singular value decomposition of the matrix $C$ into three matrices $U$, $\Sigma$ and $V$ such that $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix. The columns of the matrix $U$ are the orthogonal eigenvectors of $CC^T$ and the columns of $V$ are the orthogonal eigenvectors of $C^T C$. The eigenvalues $\lambda_1, \ldots, \lambda_r$ of $CC^T$ are the same as the eigenvalues of $C^T C$, and the diagonal values $\sigma_i$ of the matrix $\Sigma$ are set to $\sigma_i = \sqrt{\lambda_i}$. The diagonal values $\sigma_1, \ldots, \sigma_r$ of the matrix $\Sigma$ are called singular values, and appear in decreasing order so that $\sigma_i \geq \sigma_{i+1}$.

C = U \Sigma V^T \qquad (2.4)

A low-rank approximation of the matrix $C$ of, at most, rank $k$ can be constructed by first replacing all singular values in the matrix $\Sigma$ with zeroes, except for the $k$ largest values. This new matrix $\Sigma_k$ is then used to compute the rank-$k$ approximation of $C$ by inserting it into equation 2.4, so that $C_k = U \Sigma_k V^T$. Since the effect of small eigenvalues on matrix products is small, replacing the smallest set of singular values with zeroes in this way minimizes the error of the low-rank approximation of $C$ compared to the original matrix (Manning, Raghavan, & Schütze, 2008b).
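To make the truncation step concrete, the sketch below builds a small term-document matrix and computes a rank-2 approximation with NumPy's SVD. The matrix values and k = 2 are placeholders chosen for the example; the study itself used Gensim's LSI model rather than this explicit decomposition.

import numpy as np

# Toy term-document matrix C (rows: terms, columns: documents).
C = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 2.0]])

# Full SVD: C = U * Sigma * V^T (equation 2.4).
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)

# Keep only the k largest singular values to obtain the rank-k approximation.
k = 2
sigma_k = np.diag(np.concatenate([sigma[:k], np.zeros(len(sigma) - k)]))
C_k = U @ sigma_k @ Vt

# Document vectors in the k-dimensional concept space are given by the first
# k rows of Sigma * V^T.
doc_concepts = np.diag(sigma[:k]) @ Vt[:k]
print(np.round(C_k, 2))
print(np.round(doc_concepts, 2))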

LSI is generally better at dealing with some of the issues previously described as drawbacks of a TF-IDF document representation, such as the inability to incorporate synonymy, hyponymy and hypernymy into document similarity comparisons. Words that have the same or a similar meaning are assumed to be more likely to appear in similar contexts, which makes it more likely for them to be placed in the same concept. However, as the method only allows a term to belong to a single concept, LSI is unable to properly represent polysemy in similarity comparisons (Deerwester et al., 1990).

2.2 Cosine similarity

When documents are represented in a vector space, a common way to measure the similarity between two documents is by the cosine similarity between their vector representations. Equation 2.5 shows how the cosine similarity between two documents, $d_1$ and $d_2$, is calculated. Geometrically, the cosine similarity is interpreted as the cosine of the angle between the vector representations of the two documents, $\vec{V}(d_1)$ and $\vec{V}(d_2)$. An important consequence of this is that the similarity becomes independent of the document length (Huang, 2008).

\mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{\lvert \vec{V}(d_1) \rvert \, \lvert \vec{V}(d_2) \rvert} \qquad (2.5)
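A direct implementation of equation 2.5, included here for illustration rather than taken from the study's code, which used Gensim's similarity routines:

import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors (equation 2.5)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# A vector compared to a scaled copy of itself gives similarity 1.0,
# illustrating the independence of document length.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))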

2.3 Text pre-processing

As a step towards transforming a text or a document to a vector space representation, a set of pre-processing operations is commonly applied to the documents before indexing. The purpose of these is to increase the quality of information retrieval operations on the indexed dataset. The following subsections describe some pre-processing operations common to information retrieval.

2.3.1 Tokenization

Tokenization is the process of separating out the individual words in running text. In English, the most common word separator is whitespace, but simply splitting the text at every whitespace character can also cause problems for information retrieval. Words or terms such as word processor or Royal Institute of Technology include one or more whitespace characters, and by splitting them into multiple terms some of their compound meaning is lost. There are also contracted terms such as I’m that ideally should be split into two words: I and am. The purpose of these examples is to illustrate the difficulties of proper tokenization, and how the decision on which rules to use should depend on factors such as the type of the text and the language (Jurafsky & Martin, 2009).

2.3.2 Stop words

A common technique in information retrieval is to remove stop words from the documents in the collection. Stop words are words that carry little semantic weight due to appearing with high frequency throughout the collection. It is therefore common to remove them altogether from the index, as this saves considerable space and removes any impact that they otherwise would have, no matter how small. Normally, a list of common stop words is kept, and any word from a document that, at the time of indexing, is present in this list is not put into the index (Fox, 1989).

2.3.3 Stemming

Another common technique to improve the quality of text similarity comparisons is stemming, which is the process of collapsing words into their word stem, base or root form. The major advantage of stemming is that it allows matching a word to any morphological variant of it contained in other documents. For example, if we have the word edit in a document, stemming will allow it to match other variants of the same word, such as editing or edits (Willett, 2006).
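The three pre-processing steps above can be combined as in the sketch below, which uses NLTK's tokenizer, English stop word list and Porter stemmer. This is a simplified illustration; the study used its own stop word list (appendix A) and also stripped Wikipedia-specific formatting.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, lower-case, drop stop words and punctuation, then stem."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The editor allows editing and edits documents quickly."))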

2.4 Evaluation metrics

To evaluate a recommendation system, and to be able to compare different implementations to each other, a set of evaluation metrics is normally used. Depending on the type of system, these metrics can differ. Below is a description of some of the metrics most commonly used to evaluate recommendations or sets of recommendations, found in the field of ranked information retrieval.

2.4.1 Precision at K

Precision is the fraction of retrieved documents that are considered relevant. In the case of ranked retrieval, it is common to calculate precision for the top K documents retrieved (Kishida, 2005). In the case of a recommender system for applications, precision at K is a measure of how many of the top K recommendations for an application are relevant, as shown in formula 2.6.

\mathrm{Precision\ at\ } K = \frac{\#(\mathrm{relevant\ recommendations\ retrieved})}{K} \qquad (2.6)
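A minimal function implementing formula 2.6, assuming for the sake of the example that relevance judgements are available as a set of relevant item identifiers:

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant (formula 2.6)."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

# Example: 3 of the top 5 recommendations are relevant -> precision 0.6.
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e", "x"}, 5))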

2.4.2 Mean average precision

Another common measure in ranked retrieval is the mean average precision (MAP), which is a single score of quality over multiple queries. In the case of this study, if we have a set of queries $q_j \in Q$ where each $q_j$ corresponds to a Wikipedia article about an application, a set of relevant articles $\{d_1, \ldots, d_{m_j}\}$ and a set of ranked recommendations $R_{j,k}$ from the top result until we get to article $d_k$, then the mean average precision is calculated as shown in equation 2.7.

\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{j,k}) \qquad (2.7)

If no relevant recommendations are retrieved, the precision value in equation 2.7 is taken to be 0. The measure is biased towards the top of the ranking, meaning that the score will be higher in cases where relevant recommendations are returned higher up in the list as opposed to lower down (Manning, Raghavan, & Schütze, 2008a).
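A sketch of equation 2.7, under the same assumption as above that relevance judgements are given as sets of identifiers; this is an illustration, not the study's evaluation code.

def average_precision(recommended, relevant):
    """Average of the precision values at each rank where a relevant item appears.
    Relevant items that are never retrieved contribute a precision of 0."""
    if not relevant:
        return 0.0
    hits, precisions = 0, []
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant)

def mean_average_precision(all_recommended, all_relevant):
    """Equation 2.7: the mean of the average precision over all queries."""
    scores = [average_precision(rec, rel)
              for rec, rel in zip(all_recommended, all_relevant)]
    return sum(scores) / len(scores)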

Chapter 3

Method

The following section describes the method used to evaluate the two approaches for recommending applications. The section starts with a description of the dataset that was used in the evaluation, and how it was collected and labelled. This is followed by a description of the implementation of the TF-IDF and LSI models and how they were used to return recommendations, and a description of the experiments that were performed to evaluate and compare them to each other. In the final sections of the chapter, the software and libraries used for the implementation are listed and described.

3.1 Data collection and labelling

The data used in the evaluation consisted of 2501 distinct articles from English Wikipedia, each article about an application. The content of each article was fetched from the Wikipedia REST API [1], using hyperlinks to articles that were manually gathered from the Wikipedia web interface [2]. All articles were collected through available list-type Wikipedia articles, which contain lists of links to articles about applications, sorted by their software category. Examples include the Wikipedia articles titled List of word processors [3] and List of spreadsheet software [4]. The total number of hyperlinks manually collected was 3943. This number was reduced to the final 2501 distinct articles by removing empty, invalid or duplicate articles after fetching their contents from the API. The lists previously mentioned in this section were also used to add software category labels to the articles according to which list their hyperlink was collected from. Thus, an article fetched from the page List of word processors was labelled a word processor.

[1] https://en.wikipedia.org/api/rest_v1/
[2] https://en.wikipedia.org
[3] https://en.wikipedia.org/wiki/List_of_word_processors
[4] https://en.wikipedia.org/wiki/List_of_spreadsheet_software


If an article was present in more than one list, it was labelled with one category label from each list that it appeared in.

Of the 2501 articles collected, 2169 were labelled with only a single software category label and 332 with multiple software category labels. The articles were collected from 115 different lists, which means that a total of 115 different labels were used to label the dataset. The number of applications in each software category ranges from a single application to 139 in the largest group. The median number of applications labelled with each software category label was 20. Some of these categories overlap in the functionality of the software type that they describe, which is further discussed in section 5.1.1. For a full list of the software categories, see appendix B.

All the articles collected were in English, but they varied in length, style and narrative focus. As they are Wikipedia articles, an article can have multiple authors and an author can have written parts of multiple articles.
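As an illustration of the collection step, the sketch below fetches a couple of articles with the Wikipedia Python package described in section 3.5.3. The article titles and the category label are placeholders chosen for the example; the study worked from 3943 manually gathered hyperlinks.

import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

# Placeholder titles; the study used hyperlinks gathered from list articles
# such as "List of word processors".
titles = ["AbiWord", "LibreOffice Writer"]
category = "Word processor"

articles = []
for title in titles:
    try:
        page = wikipedia.page(title, auto_suggest=False)
        articles.append({"title": page.title,
                         "text": page.content,
                         "categories": [category]})
    except (PageError, DisambiguationError):
        # Skip empty, invalid or ambiguous articles, as was done in the study.
        continue

print(len(articles))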

3.2 Index creation

The following section describes how the TF-IDF and LSI indexes were created from the articles. The process is shown in figure 3.1, and was the same for both models, except for how the documents were represented. Before creating each index, the articles in the dataset were pre-processed in three consecutive steps. The same steps were applied to both the TF-IDF index and the LSI index, with the purpose of increasing the quality of semantic comparisons between words and texts.

[Figure 3.1 is a flow chart of the indexing pipeline: the articles are loaded, tokenized, stripped of stop words, stemmed, transformed to vector space, and saved to an index.]

Figure 3.1: The process of creating the index from the Wikipedia articles.

The articles were tokenized and stripped of punctuation, and all words were made lower-case. Wikipedia-specific formatting was removed, such as certain characters or sequences of characters used to indicate the different sections of an article. In the second step, stop words were removed using a list of common English stop words, listed in appendix A. Finally, all remaining words were stemmed using a Porter stemmer, which is a relatively simple stemming algorithm for mapping English words to their stems. This is accomplished by applying a set of sequential rules and transformations to each word (Willett, 2006).

After pre-processing, the articles were transformed to LSI and TF-IDF vector representations, and put in a separate index for each method. As the process of creating these indexes was time consuming, the indexes were saved to file for re-use in the evaluation.
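A condensed sketch of the indexing step using Gensim follows. The article texts and the pre-processing function are toy stand-ins, the number of LSI topics is a placeholder, and the report does not state whether LSI was trained on raw counts or TF-IDF weights; the chaining below simply follows common Gensim practice.

from gensim import corpora, models, similarities

# Toy stand-ins; in the study, `articles` held the 2501 Wikipedia article texts
# and pre-processing was done as sketched in section 2.3.
articles = ["A word processor for editing text documents.",
            "A spreadsheet application for calculations in cells."]

def preprocess(text):
    # Stand-in for the tokenize / stop word / stemming pipeline of section 2.3.
    return text.lower().split()

tokenized = [preprocess(text) for text in articles]
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# TF-IDF index.
tfidf = models.TfidfModel(bow_corpus)
tfidf_index = similarities.MatrixSimilarity(tfidf[bow_corpus],
                                            num_features=len(dictionary))

# LSI index; 2 topics is a placeholder for this toy corpus.
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)
lsi_index = similarities.MatrixSimilarity(lsi[tfidf[bow_corpus]], num_features=2)

# Save the indexes to file for re-use, as described above.
tfidf_index.save("tfidf.index")
lsi_index.save("lsi.index")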

3.3 Recommendation process

As shown in figure 3.2, the process of recommending applications was the same for both methods, except for how the articles were represented. In the same manner as when the Wikipedia articles were used to create the indexes, the article of the application for which the system should give a set of recommendations first went through a series of steps to transform it to either TF-IDF or LSI vector representation. The article was first tokenized, stemmed and stripped of stop words and Wikipedia-specific formatting. It was then transformed to either TF-IDF or LSI vector space representation, depending on which index the query for recommendations was aimed at. After the article had been transformed to vector space, it was compared to all other articles in the index by calculating the cosine similarity between its vector and each of the vectors in the index. The articles in the index were then sorted by their similarity in descending order, so that the articles at the top of the list were the ones with the highest cosine similarity. Depending on how many recommendations, K, were asked for, the top K applications from the sorted list were then returned as recommendations.
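A sketch of the retrieval step for a single query article, assuming the dictionary, models and indexes from the previous sketch; the function name and arguments are illustrative only.

def recommend(query_text, k, dictionary, tfidf_model, lsi_model, index,
              use_lsi=True):
    """Return (article position, similarity) for the K most similar articles."""
    bow = dictionary.doc2bow(preprocess(query_text))
    # Transform the query to the same vector space as the index.
    query_vec = lsi_model[tfidf_model[bow]] if use_lsi else tfidf_model[bow]
    sims = index[query_vec]  # Cosine similarity to every article in the index.
    ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Example call against the LSI index built above:
print(recommend("An application for editing documents.", 1,
                dictionary, tfidf, lsi, lsi_index))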

[Figure 3.2 is a flow chart of the retrieval pipeline: the query application's article is loaded, tokenized, stripped of stop words, stemmed and transformed to vector space; cosine similarities to the loaded index are calculated, and the K most similar applications in the index are returned as recommendations.]

Figure 3.2: The process of retrieving a set of recommendations for an application.

3.4 Experiments

The evaluation was performed in two parts. First, an experiment was performed in which the percentage of recommendations that belong to the same software category as the query application was measured for both implemented methods. The dataset of Wikipedia articles was split into a training set and a test set for the purpose of evaluating on data previously unseen by the models. Of the 2501 labelled articles available, 70% were randomly selected for the training set and the remaining 30% were used in the test set. The resulting sets contained 1750 applications in the training set and 751 in the test set. The software categories used were all relatively broad descriptors of functionality, and the purpose of the experiment was to get a broad measure of the fraction of recommendations that have a similar functionality.

The second experiment consisted of four structured interviews, in which the test persons were asked to assess the quality of the top five recommendations from each implemented method for 20 randomized applications. The purpose of this was to get a more fine-grained measure of relevance, and to capture potential factors in the relevance of a recommendation that the software categories from Wikipedia might not, such as functional synergy.
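A minimal sketch of the 70/30 split; the seed is only there to make the example reproducible and is not taken from the study.

import random

random.seed(0)  # Placeholder seed for reproducibility of the example.
labelled_articles = list(range(2501))  # Stand-in for the 2501 labelled articles.
shuffled = random.sample(labelled_articles, len(labelled_articles))
split_point = int(0.7 * len(shuffled))
training_set, test_set = shuffled[:split_point], shuffled[split_point:]
print(len(training_set), len(test_set))  # 1750 751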

3.5 Used software

Below is a description of the software tools and libraries that were used in the implementation and evaluation process. All software used was available free and open source at the time of the study.

3.5.1 Natural Language Toolkit (NLTK)

NLTK [5] is a suite of Python libraries and programs for various natural language processing tasks. It was used to tokenize, stem and remove stop words from the articles.

3.5.2 Gensim

Gensim [6] is a Python library for vector space and topic modelling, which uses the NumPy [7] and SciPy [8] Python libraries for scientific calculations. It was used for creating the TF-IDF and LSI models, for cosine similarity calculations and for loading and saving the models to files.

3.5.3 Wikipedia (Python package)

The Python package named Wikipedia [9] sends requests to the Wikipedia REST API, and was used to fetch the contents of the articles.

[5] http://www.nltk.org/
[6] https://radimrehurek.com/gensim/
[7] http://www.numpy.org/
[8] https://www.scipy.org/
[9] https://pypi.python.org/pypi/wikipedia

3.5.4 Django REST framework

Django REST framework [10] is an extension to the Python web framework Django, and includes a toolkit for building REST APIs. It was used in combination with the other tools to build the back-end of the platform used in the evaluation. It was mainly used to load from and save to the MySQL database in which the Wikipedia articles were stored locally, and to serve requests for article data from the evaluation interface.

3.5.5 React

React [11] is a library for the JavaScript programming language, primarily used for building user interfaces for web applications. It was used to build the user interface for the evaluation platform, and to facilitate browsing the articles of the applications.

3.6 Platform

The experiments were performed on a laptop with an Intel Core i7-6700HQ 2.60 GHz processor and 8 GB of RAM.

[10] http://www.django-rest-framework.org/
[11] https://facebook.github.io/react/

Chapter 4

Results

In this chapter, the results from the two experiments are presented. The first section contains the results from the experiment in which the software category labels were used, and the second section the results from the interviews.

4.1 Comparison by software category

All software categories gathered from Wikipedia were related to the functionality of the applications they contained. A relatively broad measure of how well a system is able to recommend similar applications in terms of functionality could therefore be how highly the system scores the articles of the applications that belong to the same software category. In the following subsections, the results from such an experiment are presented.

4.1.1 Precision at K

Table 4.1 shows the mean precision for the top K recommendations for TF-IDF and LSI respectively. The scores are the mean over all applications in the test set, which means a total of 751 recommendation queries performed for each K and for each method. Each of the top K recommendations that had one or more software categories in common with the query application was considered relevant, and the ones that did not were considered irrelevant. For each method and each K, this was then averaged over all applications using the arithmetic mean, to get the mean precision at K for the full test set.

For the smallest number of recommendations retrieved, K=5, the LSI based recommendations had a slightly higher mean precision than the TF-IDF based recommendations. For values of K higher than 5, the TF-IDF recommender had the higher mean precision of the two, although the absolute difference between the methods remains small.


Table 4.1: The mean precision at K recommendations retrieved for the applications in the test set, for TF-IDF and LSI respectively.

           Mean precision
K          TF-IDF    LSI
5          0,527     0,533
10         0,481     0,471
15         0,444     0,427
20         0,413     0,393
25         0,384     0,363

4.1.2 Mean average precision

Table 4.2 shows the mean average precision for the same category-based measure of relevance applied to the same set of articles as in section 4.1.1. The mean average precision was somewhat higher when TF-IDF was used. This means that the TF-IDF representation of the articles on average made the system rank the articles about applications from the same software category slightly higher in the full list compared to LSI.

Table 4.2: The mean average precision for the applications in the test set, for TF-IDF and LSI respectively.

Method     Mean average precision
TF-IDF     0,368
LSI        0,332

4.2 Interviews

In this section, the results from the interviews are presented. Four structured interviews were performed, in which the subjects were asked to evaluate the top five recommendations from the TF-IDF based and the LSI based system respectively.

4.2.1 Evaluated applications

A set of 20 applications was randomized from five software categories that the subjects had confirmed experience of working with, and an understanding of the major features and functionality that such systems commonly possess. The decision not to randomize from the full set of applications was based on the difficulty of asking people to evaluate types of software they have no prior experience of working with, and of finding enough people able to evaluate the more specialized types of software present in the list.

The articles were randomized from the software categories Web conferencing software, Word processor, Spreadsheet software, Raster graphics editor and Email client. Four articles were randomized from each category, and the articles varied in length, style and narrative focus. Table 4.3 shows the abbreviations used in place of the software category names. Instead of the actual names of the applications evaluated, the report refers to them by a combination of their abbreviated software category name and a distinct number 1 to 4. Thus, EC1 and EC3 refer to two specific email clients from the evaluation set. For a full list of which application name corresponds to which abbreviated name, see appendix C.

Table 4.3: Abbreviations used in place of the software category names.

Software category             Abbreviation
Email client                  EC
Raster graphics editor        RG
Spreadsheet software          SS
Web conferencing software     WC
Word processor                WP

The mean precision for the top five recommendations for the subset of applications from only these five categories was 0,509 for the TF-IDF recommender and 0,507 for the LSI recommender, when using matching categories as the relevance indicator in the same way as in sections 4.1.1 and 4.1.2. This is slightly lower for both recommenders compared to the mean precision for the full set of applications in the test set, shown in table 4.1. Table 4.4 shows the number of applications in each of the five categories, the median length of their content and the mean precision for the top 5 recommendations for each category when matching software categories were used.

Table 4.4: Details for the full set of applications from the five categories from which the set of applications for the interviews were randomized.

Software category    Number of       Median article       Mean precision at K=5
                     applications    length [# words]     TF-IDF    LSI
EC                   45              438                  0,476     0,458
RG                   82              465                  0,605     0,644
SS                   12              393                  0,383     0,250
WC                   29              467                  0,413     0,262
WP                   31              576                  0,439     0,542

As seen in table 4.5, the median number of words in the 20 articles selected for the interviews was 520, with the longest article containing 5507 words and the shortest one 68 words. The mean precision for the top five recommendations was 0,48 for the TF-IDF recommender and 0,66 for the LSI recommender. The potential consequences of this are discussed in section 5.1.2.

Table 4.5: Details for the evaluation set that was used in the interviews, compared to the test set and the full set of applications.

Measure                          Interview set    Test set    Full set
Number of applications           20               751         2501
Max [# words]                    5507             10009       13567
Min [# words]                    68               29          29
Median [# words]                 520,5            450         449
TF-IDF – Mean precision (K=5)    0,480            0,527       0,546
LSI – Mean precision (K=5)       0,660            0,533       0,568

4.2.2 Interview Process

For each of the 20 applications, the subjects were presented with the top five recommendations from each method, and asked to evaluate their quality. The subjects were given, and asked to read, the Wikipedia articles of the 20 applications evaluated, along with the articles belonging to their corresponding sets of recommendations. A decision was made to hide the software category labels of each article, to prevent them from influencing the scores given. This meant that any information on software category had to be read from the content of the Wikipedia articles by the subjects themselves and not from the user interface of the evaluation platform.

The subjects were asked to give three separate scores for each recommended application. The first score was based on how similar they judged the recommended application to be in terms of functionality and features. A high value on this score indicates a high similarity in functionality, meaning that the two applications to a large extent are able to perform the same tasks, in a similar manner. A low value indicates little or no shared functionality. The second score was based on the extent to which the functionality of the recommended application complements or enhances the functionality of the application for which the recommendations were evaluated. A high second score indicates a strong functional synergy between the evaluated application and the recommended application, and a low score indicates little or no presence of such. Finally, the subjects were asked to give a general score based on how satisfied they would be on the whole with the recommendation, based on the two relations combined, along with any comments they might have.

A scale of {0, 1, 2, 3} was used for the three scores for each recommendation, with 0 being the lowest and 3 the highest. In addition to the scale, the subjects were given a description text for each score, to help them set the scores and to keep the scores at the same magnitude over all interviews. For the full list of criteria given to the subjects, see appendix D.

4.2.3 Interview process delimitation

To reduce the complexity of the interview process, the subjects were asked not to put any weight on the order in which the top 5 recommendations appeared. This could otherwise have expressed itself in subjects penalizing a recommendation on the general quality score for not appearing higher in the list. In addition to this, the subjects were asked not to put any weight on to what extent the applications were supported or maintained at the time of the evaluation. The main consequence of this is that applications which are no longer available or maintained should still be considered good recommendations if they fit any of the criteria for being so. The reason for this delimitation was that such applications would ideally have been removed from the dataset beforehand, which was something that time did not allow for in this study. Finally, the subjects were not allowed to use any external source of information other than the Wikipedia articles of the applications evaluated and the recommendations. This included outgoing links from the articles, which were made non-clickable in the evaluation platform.

4.2.4 Interview results

Table 4.6 shows the mean precision for the top 5 recommendations for the 20 applications rated in the interviews. On average over the full evaluation set and all interviews, the LSI based recommendations were scored higher when it came to similarity in terms of features and functionality between the applications described in the articles. However, the mean score of the TF-IDF based recommendations was over twice as high as the score for the LSI based recommendations when functional synergy was assessed. For the final score on how generally satisfied they were with the recommendation in terms of related functionality, the LSI based recommendations were once again scored higher than the ones from TF-IDF. In the following sections, the results are elaborated further, together with the most prominent comments expressed during the interviews.

Table 4.6: The mean precision over all recommendations for the 20 applications evaluated in the interviews.

                                      Mean precision at K=5
Assessed relation                     TF-IDF    LSI
Features and functionality            0,563     0,676
Functional synergy                    0,278     0,128
General quality of recommendation     0,624     0,700

Features and functionality Figure 4.1 shows the mean precision from the four interviews for the top five recommendations for each application in terms of similarity in features and functionality. The LSI based recommendations were on average scored higher for 15 of the 20 evaluated applications. The TF-IDF based recommendations were, on average, scored higher for the remaining 5 applications. For 15 of the evaluated applications, the four interview subjects unanimously gave a higher score to the recommendations from one of the methods. The five applications where the highest scoring method differed were EC3, RG2, RG3, RG4 and SS2. These were also the five applications where the absolute difference in the mean score over all interviews was the lowest.

Figure 4.1: The mean precision for the top five recommendations for each of the 20 applications evaluated in the interviews when similarity in features and functionality was assessed. The blue bars in the diagram show the precision for the top five recommendations for an application when TF-IDF was used, averaged over the four interviews. The red bars show the same measure for when LSI was used.

Functional synergy Figure 4.2 shows the mean precision from the four interviews for the top five recommendations for each application when functional synergy was assessed. For 12 of the applications evaluated, the TF-IDF based recommendations received a higher mean score from the subjects. For two of the evaluated applications, the LSI based recommendations were on average rated higher. Unlike the previously described evaluation of features and functionality, a majority of the applications did not receive a unanimously higher rating from all the subjects. For six of the applications, the recommendations from when TF-IDF was used were scored higher in all four interviews. For the two applications where the LSI based recommendations were scored higher, the rating was never unanimous.

Figure 4.2: The mean precision for the top five recommendations for each of the 20 applications evaluated in the interviews when functional synergy was assessed. The blue bars in the diagram show the precision for the top five recommendations for an application when TF-IDF was used, averaged over the four interviews. The red bars show the same measure for when LSI was used.

General quality of recommendations Figure 4.3 shows the mean precision for how the subjects personally assessed the quality of each recommendation in terms of the two types of relationships in functionality used in the study. The LSI based recommendations were scored higher on average for 15 of the 20 evaluated applications. For the remaining five applications, the TF-IDF based recommendations were scored higher. The distribution of scores is similar to the one where only features and functionality were assessed (see figure 4.1), and the method that received the highest average score for each application is the same between the two. For the recommendations scored higher on functional synergy and integrations, an increase in the general score compared to features and functionality can be seen. This was especially true for the applications where the articles more thoroughly described an integration, which was confirmed in the motivations expressed by the subjects when setting the scores.

Figure 4.3: The mean precision for the top five recommendations for each of the 20 applications evaluated in the interviews when the general quality of the recommendations was assessed. The blue bars in the diagram show the precision for the top five recommendations for an application when TF-IDF was used, averaged over the four interviews. The red bars show the same measure for when LSI was used.

General comments from the subjects All subjects expressed that similarity in features and functionality is what they would generally value the most in an individual recommendation, if forced to select one of the criteria. This is also reflected in the scores they set. How much a strong functional synergy or integration contributes to the general quality of a recommendation was expressed as a more complex matter to evaluate, due to a higher dependency on the user's intention for using the system.

Something that varied across the subjects was how they scored the quality of the recommendations in some of the cases where they had an application with a more basic functionality. In multiple cases, the recommendations for such applications were for applications capable of the same (or very similar) functionality, but with a wider range of additional functionality as well. Two of the subjects expressed that they considered the recommendation to still be equally relevant both in terms of features and functionality and in terms of general quality. One of the subjects generally gave a high score in terms of features and functionality but a lower score on general quality. The motivation in the cases where this gap was large was that an extended range of features would most likely mean higher complexity in using the application, reducing the general quality of the recommendation. The fourth person generally gave a lower score both on features and functionality and on the general quality of the recommendation, with a similar explanation. The difference in his motivation was that he considered the increased complexity from additional features to influence the similarity in features and functionality as well.

One problem expressed by all subjects was that some of the articles were too short for them to be able to comprehend the main features and functionality of the application described. The problem was present both in the articles of the 20 applications evaluated and in some of the articles of the recommended applications. This in turn led to confusion in how to assess the quality of the recommendations, and led to a higher variance between the subjects in how the scores were set, compared to the other applications in the evaluation set where the explanations of functionality were more thorough.

Another thing that emerged during the interviews was that the subjects felt it was hard to know how to set the score on functional synergy and integrations, especially for applications that they had no experience of working with themselves. Some of these articles only briefly mentioned integration or another type of cooperation between the applications, which made it hard to know what score to set without any prior knowledge or additional information.

In addition to this, the subjects expressed that they felt the general quality of a recommendation to be something that depends on what the user of the system is interested in. Some users might be looking for alternative applications with the same or similar functionality, while others might value recommendations for applications that provide additional functionality through integrations or by other means. It was also noted by several of the subjects that the recommendations from when TF-IDF was used had a higher representation of recommendations from the same manufacturer than those from when LSI was used. It was suggested that such recommendations might also be of interest. One of the subjects suggested that the system that generally provided the most variation between the individual recommendations could be the best, since this would satisfy a wider range of use-cases and information needs.

Chapter 5

Discussion and conclusion

In the following section, the outcome and limitations of the results presented in chapter 4 are discussed. After this, the conclusion drawn from the experiments is presented, followed by a final section on possible extensions for future work.

5.1 Discussion

In the following subsections, the outcome and limitations of the two experiments are discussed.

5.1.1 Comparison by software category

In the first part of the evaluation, where matching software categories were used to indicate a relevant recommendation, the difference between the two recommenders is so small that no real conclusion can be drawn on which to prefer over the other. The tests indicated a slight advantage to the LSI recommender for the smallest number of recommendations retrieved (K = 5), but only with about half a percentage point higher mean precision compared to the TF-IDF recommender. For larger sets of retrieved recommendations, TF-IDF had the slightly higher mean precision of the two.

A higher precision at smaller sets of retrieved recommendations could however be preferable to a higher precision at larger sets. This depends on factors such as the intended use of the recommender system as well as the dataset of possible recommendations. For the application dataset used in this experiment, the median number of applications in each software category was 20. If the purpose of the recommender were to retrieve recommendations with a matching software category label, returning more than 20 recommendations would make it highly likely that irrelevant recommendations are returned for applications from at least half of the software categories. The exception to this is the cases where applications have multiple software category labels, which is a little over 13% of the applications in the full dataset.

A big problem with using only matches in software categories for evaluation is that they only provide a binary distinction between relevant and not relevant. Many of the categories capture a very broad range of applications, and there are bound to be groupings within them that are more, or less, similar to each other in terms of functionality. An example of this is the most frequent software category label in the evaluation dataset, Collaborative software, which was assigned to 139 applications. This subset was examined, and found to contain applications primarily targeted towards word processing, web servers, email clients, web conferencing and other areas. This is problematic, as applications that would not be considered similar in functionality could still be counted as relevant and yield a false positive.

Another side of the problem was that of overlapping, or very similar, software categories. Going back to the example of the collaborative software category, it contained subsets of applications from other software categories. One of these groups consisted of applications that were clearly stated to be web conferencing software when browsing their individual articles. Many of these were correctly given multiple labels during the data collection, as they also appeared in the Wikipedia list of web conferencing software. However, a number of them did not appear in this list even though their individual articles clearly state that they are web conferencing software. This means that they are only labelled as Collaborative software in the evaluation set, which will lead to false negatives in the evaluation if they appear as recommendations for an application labelled Web conferencing software.

5.1.2 Interviews

The second part of the evaluation was the four structured interviews performed. The purpose of the interviews was to get a more fine-grained evaluation of similarity in features and functionality, compared to only using matches between software categories as the relevance indicator. They also had the purpose of assessing the relevance of recommendations whose functionality synergizes with that of the evaluated application, without the two necessarily being of the same software type.

The people interviewed were between the ages of 27 and 58, with three being male and one female. All subjects were purposefully selected to have experience in using software from the categories in the interview set in their everyday work, so that they would have an understanding of the features and functionality commonly found in such applications, which facilitates similarity comparisons between them. It is however quite possible that the end user of a recommendation system such as the one evaluated is not a person with good knowledge of the type of application that he or she wants recommendations for.

It could be imagined that a person who already has insight into a certain type of application also has some knowledge of the alternatives available on the market, which would decrease the need for recommendations. In contrast, a person researching alternative applications with only limited knowledge of the application type might benefit more from having recommendations. However, to facilitate the evaluation process, it was decided to use people with more insight into the software categories used in the evaluation.

The people interviewed scored the recommendations from the LSI-based recommender higher in terms of similarity in features and functionality and in terms of general quality for the evaluated applications. The recommendations from the TF-IDF-based recommender were on average scored higher in terms of complementary functionality and integrations. For several of the articles evaluated, integrations were only briefly discussed in one or a few sentences. The LSI-based recommender did not return any such articles in its top five recommendations for any of the applications evaluated, while the TF-IDF-based recommender did so on several occasions. The fact that a TF-IDF representation of an article explicitly represents every individual word remaining after the pre-processing, while an LSI representation transforms the article into a set of abstract concepts, might contribute to this. If only a few words are present that reference the relation between two applications, a representation by concepts might abstract these away.

A problem with using a relatively small subset for evaluation is that the results risk becoming very dependent on which applications are selected. Many of the software categories in the dataset were either very specific or demanded prerequisite knowledge in order to comprehend the information in the articles of the individual applications. Due to limited experience of working with such applications among the people available for interviewing, a choice was made to use applications more related to everyday office work. The applications used in the evaluation were randomly selected from five such categories, all of which the subjects had confirmed experience of working with.

The problem with selecting applications from only five categories is that the results risk not being representative of the whole set of applications from all categories. This is an especially large risk since the selected categories shared the property of being related to everyday office work, which most of the software categories in the dataset are not. Ideally, the evaluation would have been performed on a randomized set of applications drawn from all software categories. However, having people without any prerequisite knowledge evaluate applications from some of the more specialized categories, such as electronic circuit simulators or molecular mechanics software, was considered too great a threat to the reliability of the results.

Another problem with using a small set of 20 applications for the interviews is the risk of obtaining results that are unrepresentative compared to the full set.

The randomized applications might heavily favour one of the recommenders, even though this might not be the case for the full dataset. As seen in Table 4.5, the randomized applications were not representative in terms of how large a fraction of the top five recommendations fell within the same software category, compared to the average over the whole dataset. This might have had an impact on the results from the interviews, and it increases the difficulty of drawing conclusions on which recommender produces the best results for the whole application dataset. However, since no data were available to evaluate to what extent the software category correlates with the features and functionality of an application, a choice was made to proceed with the randomized set. An alternative would have been to hand-pick a set of applications that better represents the qualities of the full application dataset, but at the risk of introducing a bias in the selection process.

Another option would be to use a larger evaluation set. The problem with this is that the quality of the assessments might be reduced if the interviews become too long. Even with only 20 applications in the evaluation set, the subjects had to read and assess five potentially distinct articles per recommender and application. When including the article of the application for which the recommendations are evaluated, this sums to a worst case of 220 distinct articles (20 applications × (2 × 5 recommendations + 1)) that a subject would have to read and understand during each interview. For the evaluation set used, this number was 128 distinct articles, as several recommendations either re-appeared for multiple applications or appeared in the recommendations from both recommenders. Even though this number is significantly lower than the worst case, the interviews still took several hours each to perform, with multiple breaks needed.

It was noted that the time put into reading each article in many cases decreased as the interviews progressed, and several of the subjects stated that the work was tiring when asked about it. This might have negatively influenced the quality of the scores given, especially during the latter part of each interview. A possible improvement to the interview process would be to restructure or randomize the order in which the applications are evaluated in each interview. If an application is evaluated early on in some interviews and towards the end in others, the risk of a decrease in quality would be spread out over all applications instead of affecting the same subset in each interview. It could also help negate any other potential effects of the order in which the applications are evaluated.

In addition to this, the subjects had personal experience of working with some of the applications evaluated. In these cases, they tended to rely more on their own experience of the application than on the description of it in the article. This means that for some applications, the basis on which the quality of the recommendations was assessed was not the same for all subjects. These cases are simply one expression of the fact that different people bring different knowledge and experience into how they judge the quality of a recommendation, which could influence the scores in both directions and would be hard to avoid.

5.2 Obstacles

The biggest obstacle encountered during the project was setting up a process for evaluating the recommendations. The problem with assessing the relevance of a set of recommendations is that relevance depends to a large extent on what the individual user is looking for, which was confirmed by the comments from the interviews.

The initial idea was to base the evaluation on functionality, using one of the commercially available feature databases for applications, such as Capterra (http://www.capterra.com), GetApp (https://www.getapp.com) or Software Advice (http://www.softwareadvice.com). These services contain lists of high-level features for a large set of applications, and the idea was to use these lists as a basis for quantitatively measuring functional similarity between applications. The problem was that, upon closer examination, most of the applications in the collected Wikipedia dataset did not have such lists available on any of these sites. A decision was therefore made to drop this idea and instead use a broader functional indicator in the form of software categories, complemented with a set of in-depth interviews for a subset of the applications. In addition to being time-consuming, the interviews were limited by the availability of people with the right prerequisite knowledge to evaluate some types of applications, especially the more specialized ones. To summarize, both the process and the quality of the results would have benefited strongly from better evaluation data, such as structured information on application functionality, more fine-grained software categories or, possibly, user information such as likes or ratings of applications.
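Had such feature lists been available, one straightforward way to quantify functional similarity would have been the overlap (Jaccard similarity) between the feature sets of two applications. The sketch below only illustrates that dropped idea; the feature names are invented, and no such data was collected in the study.

```python
def jaccard_similarity(features_a, features_b):
    """Overlap between two sets of high-level feature labels."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Invented example of the kind of feature lists found on sites such as
# Capterra or GetApp.
app_a = {"contact management", "email marketing", "reporting"}
app_b = {"contact management", "reporting", "lead scoring", "quoting"}
print(jaccard_similarity(app_a, app_b))  # 2 shared features / 5 in total = 0.4
```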

5.3 Conclusion

The results gave no clear indication of which of the two methods is preferable for recommending applications with related functionality. The LSI-based recommender returned a slightly higher fraction of applications from the same software category when the requested set of recommendations was relatively small. For larger sets of requested recommendations, the TF-IDF-based recommender returned a higher fraction of applications from the same software category.

The mean precision for the recommendations to the 20 applications rated in the interviews was 0.624 for TF-IDF and 0.700 for LSI when general quality in terms of related functionality was assessed. The recommendations from the LSI-based recommender were rated higher by the subjects in terms of similarity in functionality and features, and in terms of the general quality of the recommendations with respect to functional relations. The recommendations from the TF-IDF-based recommender were rated higher in terms of integrations and functional synergy.


However, the recommendations for the 20 applications in the evaluation set used were found to belong to the same software category to a higher extent for the LSI-based recommender than its average over the whole dataset, while for the TF-IDF-based recommender this number was slightly lower than its average over the whole dataset. This could indicate a bias in the interview evaluation set towards the LSI-based recommender, and it makes it difficult to draw any conclusion from the results of the experiment on which method is preferable.

5.4 Future work

To conclude which of the two methods evaluated in this study produces the highest-quality recommendations, further studies are needed, with both a larger dataset of application articles and more people evaluating the recommendations. The availability of test subjects with more extensive knowledge of applications from different software categories would also increase the quality of such a result, as more types of applications could be evaluated.

There are also several other methods, as well as extensions to the TF-IDF and LSI methodologies, that could be evaluated. Examples include topic modelling methods such as Latent Dirichlet Allocation (LDA) and methods that make use of an external knowledge base, such as Explicit Semantic Analysis (ESA). Both were partially researched for this project and have been successfully applied in similar contexts (Gabrilovich & Markovitch, 2007). There are also more Wikipedia-specific methods that could be evaluated, such as the Wikipedia Link-based Measure, which uses the hyperlink structure between articles to calculate their similarity (Witten & Milne, 2008). A sketch of how one such alternative could be slotted into the existing pipeline is given below.
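As an illustration only, the following sketch shows how an LDA-based recommender could replace the LSI model in a future study. The use of the gensim library, the function name and the parameter values are assumptions made for this example and do not describe the implementation used in this thesis.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.similarities import MatrixSimilarity


def build_lda_recommender(documents, num_topics=100):
    """documents: one list of pre-processed tokens per application article."""
    dictionary = Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    # Index every article by its topic distribution for cosine-similarity lookups.
    index = MatrixSimilarity(lda[corpus], num_features=num_topics)

    def recommend(app_id, k=5):
        """Return the indices of the k articles most similar to article app_id."""
        similarities = index[lda[corpus[app_id]]]
        ranked = sorted(enumerate(similarities), key=lambda pair: pair[1], reverse=True)
        return [i for i, _ in ranked if i != app_id][:k]

    return recommend
```

A TF-IDF or LSI recommender built in the same way would differ only in the model class used, which is what makes such an extended comparison straightforward to set up.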

Bibliography

Abiteboul, S. (1997). Querying semi-structured data. Database Theory—ICDT '97, 1–18.
Arasu, A., & Garcia-Molina, H. (2003). Extracting structured data from web pages. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (pp. 337–348).
Ballatore, A., Bertolotto, M., & Wilson, D. C. (2014). An evaluative baseline for geo-semantic relatedness and similarity. GeoInformatica, 18(4), 747–767.
Buneman, P. (1997). Semistructured data. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 117–121).
Debnath, S., Ganguly, N., & Mitra, P. (2008). Feature weighting in content based recommendation system using social network analysis. In Proceedings of the 17th International Conference on World Wide Web (pp. 1041–1042).
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391.
Foltz, P. W., Laham, D., & Landauer, T. K. (1999). Automated essay scoring: Applications to educational technology. In World Conference on Educational Multimedia, Hypermedia and Telecommunications (Vol. 1, pp. 939–944).
Fox, C. (1989). A stop list for general text. In ACM SIGIR Forum (Vol. 24, pp. 19–21).
Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI (Vol. 7, pp. 1606–1611).
Gee, K. R. (2003). Using latent semantic indexing to filter spam. In Proceedings of the 2003 ACM Symposium on Applied Computing (pp. 460–464).
Giles, J. (2005). Internet encyclopaedias go head to head. Nature Publishing Group.
Gong, Y., & Liu, X. (2001). Creating generic text summaries. In Proceedings of the Sixth International Conference on Document Analysis and Recognition (pp. 903–907).
Homayouni, R., Heinrich, K., Wei, L., & Berry, M. W. (2004). Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics, 21(1), 104–115.
Hu, M., Lim, E.-P., Sun, A., Lauw, H. W., & Vuong, B.-Q. (2007). Measuring article quality in Wikipedia: models and evaluation. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (pp. 243–252).


Huang, A. (2008). Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand (pp. 49–56).
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition (2nd ed.). Upper Saddle River, NJ, USA: Prentice Hall PTR.
Kishida, K. (2005). Property of average precision and its generalization: An examination of evaluation indicator for information retrieval experiments. National Institute of Informatics, Tokyo, Japan.
Manning, C. D., Raghavan, P., & Schütze, H. (2008a). Evaluation in information retrieval. Introduction to Information Retrieval, 151–175.
Manning, C. D., Raghavan, P., & Schütze, H. (2008b). Matrix decompositions and latent semantic indexing. Introduction to Information Retrieval, 403–417.
Merriam-Webster's online dictionary. (n.d.). Merriam-Webster Dictionary. Retrieved 2017-09-04, from https://www.merriam-webster.com/dictionary/application
OED Online. (n.d.-a). Oxford University Press. Retrieved 2017-09-04, from https://en.oxforddictionaries.com/definition/synonym
OED Online. (n.d.-b). Oxford University Press. Retrieved 2017-09-04, from https://en.oxforddictionaries.com/definition/polysemy
OED Online. (n.d.-c). Oxford University Press. Retrieved 2017-09-04, from https://en.oxforddictionaries.com/definition/hyponymy
OED Online. (n.d.-d). Oxford University Press. Retrieved 2017-09-04, from https://en.oxforddictionaries.com/definition/hypernym
Pazzani, M. J., & Billsus, D. (2007). Content-based recommendation systems. In The adaptive web (pp. 325–341). Springer.
Ramos, J. (2003). Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning (Vol. 242, pp. 133–142).
Ricci, F., Rokach, L., & Shapira, B. (2011). Introduction to recommender systems handbook. In Recommender systems handbook (pp. 1–35). Springer.
Salton, G., Wong, A., & Yang, C.-S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
Wikipedia. (n.d.). Wikimedia Foundation. Retrieved 2017-09-08, from https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
Willett, P. (2006). The Porter stemming algorithm: then and now. Program, 40(3), 219–223.
Witten, I. H., & Milne, D. N. (2008). An effective, low-cost measure of semantic relatedness obtained from Wikipedia links.

Appendix A

List of stop words

The following is the list of English stop words that was used during the pre-processing of the articles. The list is included in the NLTK library for the Python programming language.

he, were, only, again, are, once, after, ourselves, then, shouldn, weren, between, she, each, while, don, didn, until, into, in, hadn, a, theirs, both, as, an, below, herself, haven, by, them, if, isn, they, mightn, his, through, before, hasn, being, should, was, this, on, there, same, some, when, hers, to, its, that, is, but, at, will, just, you, d, their, ve, yours, further, my, doing, y, myself, nor, does, we, have, up, too, off, how, here, needn, itself, other, be, any, no, couldn, doesn, where, aren, am, ma, m, so, shan, having, for, had, these, ll, than, from, over, very, ours, and, all, yourselves, our, ain, me, with, who, down, it, most, the, himself, i, now, re, what, of, or, more, did, those, whom, yourself, her, your, why, has, own, themselves, wasn, him, not, because, s, can, wouldn, such, during, out, which, been, about, t, won, mustn, o, do, few, above, against, under
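As an illustration only, the list can be loaded and applied during pre-processing roughly as follows. The tokenization shown is a simplification, and the use of the Porter stemmer (cf. Willett, 2006) is an assumption about the surrounding pipeline rather than a description of the exact implementation used in this thesis.

```python
import re

from nltk.corpus import stopwords   # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()


def preprocess(article_text):
    """Lowercase, keep alphabetic tokens, drop stop words and stem the rest."""
    tokens = re.findall(r"[a-z]+", article_text.lower())
    return [stemmer.stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("Skype is an application that provides video chat and voice calls."))
# -> ['skype', 'applic', 'provid', 'video', 'chat', 'voic', 'call']
```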

Appendix B

Software categories

The following is the list of Wikipedia software category labels used to label the application dataset. The list contains 115 distinct labels.

Project management software, Enterprise bookmarking platform, License manager, Accounting software, Web conferencing software, Enterprise Resource Planning, Kanban software, Password manager, Time-tracking software, OLAP Server, Backup software, Source code hosting facility, Reporting software, System dynamics software, Word processor, DNS Server software, Download manager, Content management system (CMS), Optimization software, Computer assisted qualitative data analysis software (CAQDAS), Haskell web framework, Business integration software, Molecular mechanics modeling software, Deep learning software, Application virtualization software, Packet analyzer, Computer algebra system (CAS), CAD, Relational database management system, Object-relational mapping software, SSH Server, Solaris antivirus software, Java web framework, HTML editor, 3D Computer graphics software, Video editing software, Vector graphics editor, Virtualization software, macOS antivirus software, FTP client software, Graphical FTP server software, Application server, Numerical analysis software, File synchronization software, Geographic information system, Reference management software, Scala web framework, CFML web framework, Web server software, Decision-making software, File archiver, VoIP software, Terminal based FTP server software, Continuous integration software, Calendaring software, SVN client, Remote desktop software, BlackBerry antivirus software, Data modeling tool, Computer simulation software, Electronic design automation (EDA), Database tool, Content control software, Disk authoring software, ASP.NET web framework, Shopping cart software, Ruby web framework, Symbian antivirus software, Email client, Electronics circuit simulator, Windows mobile antivirus software, antivirus software, C++ web framework, Desktop publishing software, Object-relational database management system, Version control software, Image viewer, Mobile CRM, Digital audio workstation, IDE, Image organizers, Wiki software, Scrum software, Raster graphics editor, PHP web framework, Webmail client, Mind-mapping software, Collaborative software, D web framework, Discrete event simulation software, iOS antivirus software, Perl web framework,


Learning management system, 3D modeling software, Lisp web framework, Spreadsheet software, JavaScript web framework, Workflow management system, SSH client, Web analytics software, Development estimation software, UML tool, Configuration management software, Database management system, Issue-tracking system, Windows antivirus software, Other web framework, Android antivirus software, Disk encryption software, server, Python web framework, Media player, Information graphics software, CRM

Appendix C

Evaluated applications

Table C.1 contains the names and corresponding abbreviations for the applications that were evaluated in the interviews.

Table C.1: Abbreviations used in place of the application names.

Abbreviation   Application
EC1            Eureka Email
EC2
EC3
EC4            GNUMail
RG1            Adobe Photoshop
RG2            iPhoto
RG3            GimPhoto
RG4            F-Spot
SS1            StarOffice
SS2            Quattro Pro
SS3            PlanMaker
SS4            Microsoft Excel
WC1            Zoom Video Communications
WC2            Skype
WC3            BigBlueButton
WC4            WizIQ
WP1            TextMaker
WP2            LibreOffice Writer
WP3            EZ Word
WP4            OpenOffice.org

Appendix D

Interview scoring criteria

Tables D.1 to D.3 contain the criteria given to the subjects during the interviews, to help them set the scores.

Table D.1: The scores and criteria used for similarity in features and functionality during the interviews.

Score   Criteria
3       Very similar product features and functionality. The two applications are to a large extent able to perform the same tasks, in a similar manner.
2       Fairly similar product features and functionality. The recommendation might either miss some key functionality, or have a significantly different way of performing tasks.
1       A few similar features/features within the same area, but key features are missing.
0       The recommended application has very little, or no shared features and functionality with the evaluated application.

Table D.2: The scores and criteria used for functional synergy during the interviews.

Score   Criteria
3       Strong functional synergy. The functionality of the recommendation strongly enhances and/or complements the functionality of the evaluated application.
2       Fairly strong functional synergy. The functionality of the recommendation enhances and/or complements the functionality of the evaluated application.
1       Weak functional synergy. The functionality of the recommendation weakly enhances and/or complements the functionality of the evaluated application.
0       None, or insignificant functional synergy.


Table D.3: The scores and criteria used for general relevance from related functionality.

Score   Criteria
3       Strong relevance from functional relations.
2       Fairly strong functional relation.
1       Weak functional relation.
0       None, or insignificant functional relation.