
21026 Degree project 30 credits, May 2021

Design and training of a recommender system on an educational domain using Topic & Term-Frequency modeling

Max Netterberg & Simon Wahlström

Civilingenjörsprogrammet i system i teknik och samhälle




Abstract

This thesis investigates the possibility of creating a machine learning powered recommender system from educational material supplied by a media provider company. By limiting the investigation to a single company's data, the thesis provides insight into how a limited data supply can be utilized to create a first-iteration recommender system. The methods include semi-structured interviews with system experts, the construction of a model-building pipeline, and testing of the models on system experts via a web interface. The study gives a clear picture of which actions can be taken when designing a content-based filtering recommender system and which actions to take when moving on to further iterations. The study showed that user preferences may be decisive for the relevance of the recommendations provided for a specific piece of media content. Furthermore, the study showed that Term Frequency - Inverse Document Frequency (TF-IDF) modeling was significantly better than using an Elasticsearch database to serve recommendations. Testing also indicated that TF-IDF created a better model than topic modeling techniques such as Latent Dirichlet Allocation (LDA). However, as testing was only conducted on system experts in a controlled environment, further iterations of testing are necessary to statistically conclude that these models would lead to an increase in user experience.

Faculty of Science and Technology, Uppsala University. Place of publication: Uppsala/Visby

Supervisor: Thomas Ingeborn, Subject reader: Filip Malmberg, Examiner: Elísabet Andrésdóttir

Populärvetenskaplig sammanfattning (Popular science summary)

Some of the biggest tech giants, Netflix and YouTube, have extremely large amounts of content for users to filter through on their sites. The sheer volume of content means that most users find it hard to navigate the sites, even with a search function. YouTube and Netflix solve this problem by constructing sophisticated recommendation models powered by machine learning. The models learn from the user's behaviour and estimate what type of content the user is most interested in. The recommendation models can be based on similar products the user has interacted with, which is called content-based filtering, or on the preferences of similar users, which is called collaborative filtering. The model is a decisive part of a recommender system.

Creating a recommender system may seem simple at first glance, but defining what a good recommendation is can be a hard problem to solve. Many popular applications, such as Netflix or YouTube, use complicated algorithms to determine what should be recommended to the user next. The question then becomes: how does the algorithm know what to recommend and when to recommend it? The short answer is that it does not know. Based on the user's actions, it estimates which content the user is most likely to want to watch next.

This thesis investigates the possibility of creating a machine learning powered recommender system from educational material, in the form of films and podcasts, supplied by a media provider company. By limiting the investigation to the data of a single company, the thesis provides insight into how a limited data supply can be used to create a first-iteration recommender system.

The methods include semi-structured interviews with system experts, the construction of a model-building pipeline, and testing of the models on system experts via a web interface. The study gives a good picture of what kind of actions can be taken when designing a content-based filtering recommender system and which actions to take when moving on to further iterations. The study focuses on comparing two types of machine learning models for a specific system: TF-IDF (Term Frequency - Inverse Document Frequency) and LDA (Latent Dirichlet Allocation). The study showed that user preferences may be decisive for the relevance of the recommendations provided for a specific piece of media content.

Furthermore, the study showed that TF-IDF modeling was significantly better than using an Elasticsearch engine to serve recommendations. Testing also showed that TF-IDF created a better model than topic modeling techniques such as LDA. Since testing was only conducted on system experts in a controlled environment, further iterations of testing, such as A/B testing, are necessary to statistically conclude that these models would lead to an improved user experience.

Distribution of work

This thesis was written by Max Netterberg and Simon Wahlström. Both authors contributed equally during the course of the project. Max Netterberg had overall responsibility for developing the test site and Simon Wahlström had overall responsibility for developing future suggestions for the developed models. Both authors have contributed equally to the codebase of the model-building pipeline as well as to the thesis text.

Contents

1 Introduction 10

1.1 Problematization ...... 10

1.2 Purpose ...... 11

1.3 Delimitations ...... 11

2 Background 12

2.1 Skolfilm ...... 12

2.2 Skolfilm’s Data ...... 12

2.3 How streaming media is used in an educational setting ...... 13

3 Theory 14

3.1 Levels of recommender systems ...... 14

3.2 Expert recommendations ...... 15

3.3 Collecting information ...... 15

3.3.1 Explicit data collection ...... 15

3.3.2 Implicit data collection ...... 16

3.4 Preprocessing data ...... 17

3.5 Content based filtering ...... 18

3.6 Term Frequency - Inverse Document Frequency ...... 19

3.6.1 Term Frequency ...... 19

3.6.2 Inverse Document Frequency ...... 20

3.6.3 Term Frequency & Inverse Document Frequency ...... 20

3.7 Topic modeling ...... 21

3.8 Latent Dirichlet Allocation ...... 22

4 Method 24

4.1 Project structure - Using an iterative process ...... 24

4.2 Preparing development environment ...... 24

4.3 Choosing a model for building recommendation systems ...... 25

4.4 System Overview ...... 25

4.5 Pre-processing data ...... 26

4.6 Interviews ...... 27

4.7 Tuning Hyperparameters ...... 28

4.8 Testing models ...... 29

4.8.1 Manual testing ...... 30

4.8.2 Online evaluation ...... 30

4.9 Test evaluation ...... 32

4.9.1 Wilcoxon signed-rank test ...... 32

4.10 Performing online evaluation ...... 33

4.10.1 Subtitles ...... 35

5 Results 37

5.1 Interviews ...... 37

5.2 Model test results ...... 39

5.2.1 Comments ...... 40

5.3 SRT-files ...... 42

6 Analysis 43

6.1 Low ratings ...... 43

6.2 Diversity ...... 45

6.3 Number of keywords ...... 45

6.4 TF-IDF vs LDA ...... 46

7 Proposed Future Directions 48

7.1 Collaborative filtering ...... 48

7.2 Feature combination hybrid ...... 50

7.3 Session based recommendations ...... 53

7.4 Additional data collection ...... 57

7.5 Recommending AV-centrals ...... 57

8 Conclusions 59

References 60

Appendix 63

Wordlist

Data frame

A data structure made popular by the Python library pandas. The structure resembles a matrix but has named rows and columns.

TF-IDF

Machine learning model; acronym for Term Frequency - Inverse Document Frequency.

LDA

Machine learning model; acronym for Latent Dirichlet Allocation.

Corpus

Large collection of documents (not the same as MongoDB documents) used in scientific language processing analysis. The documents can be read by machines and have generally gone through some kind of processing. The documents can consist of sentences or other types of text.

Pickle

File format, generally used to persist data frames.

REST-API

Acronym for Representational State Transfer - Application Programming Interface. A communication interface paradigm, generally used on the web to transfer data between the client interface and the backend.

JSON

Data structure primarily used on the web to send data between two applications. Acronym for JavaScript Object Notation.

Document

Can mean two things in this thesis:

1. Data structure used in databases such as MongoDB and Elasticsearch. It is based on the JSON data structure and is therefore very similar to JSON.

2. Document representing an entity in a corpus. In the case of this thesis a document represents a media file (film or audio).

MongoDB

Database paradigm that uses documents instead of tables.

CSV

File format, acronym for comma-separated values. Used to store values in a table format.

CRUD

Acronym for Create, Read, Update, Delete. Common data actions.

BoW

Acronym for Bag of Words. Words collected and stored in no specific order.

Term

Same thing as "word". In this thesis "word" and "term" are used interchangeably.

1 Introduction

As part of the digitalization of society, education is increasingly conducted on digital platforms and with online resources. However, with an ever-increasing flow of information on the Internet and more products to choose from, new problems arise. An emergent problem is the difficulty for users to find what they are interested in within this abundance of information. One solution that has had success in addressing this issue is the use of machine learning to create recommendation systems offering personalized recommendations. Recommendation systems have successfully been implemented by media streaming platforms, resulting in increased user engagement and sales (Jannach and Jugovac 2019). The business goal of a recommendation system is usually to keep users on the website for a longer time, rendering increased user engagement for streaming services like YouTube and increased sales for webshops like Amazon. According to Netflix, 75 % of what people watch is based on recommendations (Amatriain and Basilico 2012), and Amazon reports that 35 % of their sales originate from a recommendation (Jannach and Jugovac 2019). These sites let the majority of their home screen be populated by recommendations in favor of general content, showing the importance of recommendations for many online businesses. For a recommendation system in an educational setting, the goal of the recommendation would not be the same. The next recommended item should be optimized to enhance students' learning process in a given subject, not to keep the user on the website by offering items that are interesting but, from an educational perspective, irrelevant.

1.1 Problematization

Teachers are an occupational group with limited time and stressful schedules. Often the time to find relevant educational content is greatly limited, leading to teachers reusing the same material without researching whether more relevant content could be found. With thousands of educational video and audio resources, a treasure trove of high-quality content exists for teachers. However, accessing it is in practice much more problematic than desired. A system for recommending educational materials to teachers is much needed to overcome this obstacle. A recommendation system could potentially save time for teachers as well as facilitate the search for new and improved content.

1.2 Purpose

The purpose of this project is to investigate recommendation systems that aid teachers in finding relevant educational material. This leads to the following research questions:

1. According to experts of the system and using available data, what should recom- mendations in the system consist of?

2. What recommender system and data collection methods are relevant to use on the system?

The first research question will be answered by developing a recommendation system using available data on the educational streaming service Skolfilm. To develop the system, two interviews with system expert groups as well as a user study with system experts will be conducted. Based on these results, and by performing a literature study investigating potential recommender system methods to use on Skolfilm's system, the second research question will be answered.

1.3 Delimitations

The online educational streaming service Skolfilm, developed by the company Skolmedia, will be used as the platform on which the research questions of this study are investigated. The media used consists of films and podcasts available on the service. This thesis considers only data that was available at the start of the project. Data concerning user favorites could not be made available in time for the project and is therefore not included.

2 Background

The project was realised in cooperation with the company Skolmedia and their streaming service Skolfilm. The company's preconditions are presented to provide context for the recommendation system that was developed and for the methods investigated in the literature study. Moreover, the section presents a background on how educational media is used and the importance of AV-centrals.

2.1 Skolfilm

Skolmedia is a company based in Stockholm that offers a service called Skolfilm. Skolfilm is an online media streaming platform providing educational resources in the form of films and podcasts for elementary and upper secondary school. The service can be subscribed to directly by schools, by the local region or by AV-centrals that in turn provide the content for schools (Skolfilm 2021). For this thesis the service Skolfilm, provided directly by Skolmedia, is used. AV-centrals subscribe to Skolfilm's media service and provide educational media content such as films and podcasts to schools in their domain. Their main task is to provide high quality educational content as well as to purchase new media content for schools. The AV-central can also remove, add and modify metadata for all the media content that is available on their domain. Search queries performed by users in the system rely on the media's metadata to serve relevant content; therefore AV-centrals can directly affect how content is exposed on their platform (ibid.). Based on their high involvement in media content and with end users, employees at AV-centrals are considered experts of the system. The system expert group used for interviews and user tests of the developed system therefore consists of employees from AV-centrals.

2.2 Skolfilm’s Data

Skolmedia has data in different databases and structures, and it is necessary to extract the data relevant for creating a recommendation model. The system's data structure consists of an ElasticSearch database that is used to find films when users create search queries on the website. The ElasticSearch database contains all relevant media content metadata, such as title, description, keywords for most films, and subtitle (SRT) files. All media content metadata that Skolmedia hosts is available in this database, in total more than 23,000 items. Each AV-central has a varied supply of media content that is a subset of the total media content in the ElasticSearch database. All AV-centrals have access to the full range of UR's (Utbildningsradion's) content as well as their own collection of purchased content.

Users can have a personal user account or a joint account that is shared with others. A MySQL database exists which contains user-specific information regarding favorites as well as an anonymous search history. The school the data is generated from is stored in the MySQL database, but no specific user information, making users anonymous. Favorites is a feature in which the user can mark media content as a favorite and save it on their user profile for future logins. The MySQL database also contains information about every time a user has consumed media content; however, this information is not connected to any specific user. Apart from favorites, no data is stored that can be connected to a specific account.

2.3 How streaming media is used in an educational setting

As faster broadband connections and an increased range of educational films have become available, the use of digital teaching materials has grown (Bhosale, Pottigar, and Chavan 2015). Increased resources through government investments have also contributed to an increased digitalization of the educational environment. In the last year alone, the proportion of schools with a publicly accessible strategy for digital teaching materials has increased from 47 to 66 %, based on a survey of Swedish teachers (Clio 2020). Studies have shown that streaming films can facilitate students' learning and contribute to increased satisfaction. Different students benefit from different learning styles, and by broadening teachers' teaching style with increased visual learning, more students can be reached. Today, students are also more receptive to teaching through streaming as they are used to that medium from other parts of their lives (Cisco 2012). A comparative survey between students who used streaming video for math instruction and a control group showed a 4.8 percent increase in score for the streaming video group (Boster et al. 2006). However, the effectiveness of educational film depends on it being of sufficiently high quality. A survey of primary school teachers showed that educational films in schools can have a positive impact on students' performance, but that only 13 percent of the films surveyed were of high quality (Ludewig and Jannach 2018a).

However, the current situation regarding digitalization in Swedish schools shows problems with implementing this knowledge in practice. According to a survey conducted among Swedish teachers in 2020, only 13 percent use digital teaching aids in a majority of their teaching, and 63 percent teach completely without digital aids. The corresponding figures for Denmark are 36 and 29 percent, respectively. The trend in both countries, however, is increasing digitalization and increased confidence in digital solutions in the classroom. The main reasons teachers do not use digital teaching materials are stated to be poor internet connection and insufficient IT equipment (45%), login problems (32%) and high costs (28%) (Clio 2020).

3 Theory

Users generate large amounts of data whenever they interact with websites on the Internet. As data storage capabilities improve, the potential for storing this data increases. With more available data it is possible to create sophisticated filters among the increasing number of alternatives users are presented with. This is the basis of a recommender system. Some of these recommendation systems utilize machine learning models to make accurate predictions. Generally, these recommendation models can be categorized into three different categories: content-based filtering, collaborative filtering and hybrid methods. Which method is chosen depends on attributes of the domain system and the availability of data. Content-based filtering uses available metadata of items that a user has interacted with to provide recommendations that are similar to consumed items. Collaborative filtering does not need any data about items but relies on similarities between users to make recommendations. Hybrid methods utilize both these strategies in various ways to make recommendations (Ricci, Rokach, and Shapira 2015). For an overview of collaborative filtering compared to content-based filtering, see figure 1. Models built with these methods are popular and well tested. In recent years, research suggests that the use of DNNs (deep neural networks) to solve recommendation problems may also be a viable option (Zhang et al. 2019).

Figure 1: Comparison between content-based and collaborative filtering

3.1 Levels of recommender systems

Depending on the data used in a recommendation model, a recommendation system can have three levels of personalization. The most basic level consists of non-personalized recommendations. This type of recommendation system does not differentiate between users in any way and will therefore produce the same recommendations to all users given the same activity on the page. The recommendations can be based on, e.g., popular items or items suitable for the time or season (Falk 2019).

The second level consists of semi-personalized recommendation systems. In such a system, users are divided into specific groups or segments of the total group. Depending on which group a user belongs to, recommendations relevant to that group will be displayed. Which group a user belongs to may depend on the user's profile, e.g. place of residence or age. It can also be based on undefined attributes of the user, such as where the user is located at the moment or whether the user is in movement or still. For example, if a user is abroad, the range of products changes, as it probably means that the user is on holiday. The recommendation system knows that you belong to a specific group but nothing about you as an individual user, and will therefore give the same recommendations to everyone in the same group (ibid.).

The last level of recommendation system uses personalized data on how the user has interacted with the system in the past to create a unique user profile for each individual user. A prerequisite for a personalized recommendation system to be functioning is therefore that data from individual user behavior is stored (ibid.).

3.2 Expert recommendations

Expert recommendations are a type of recommendation that uses data based on expert opinions. This type of recommendation is mainly suitable in contexts where expert opinions are generally regarded as reliable, e.g., wines. This type of recommendation was more common in recommendation systems in the past and now most recommendation systems use data generated by large user groups (ibid.).

3.3 Collecting information

There are multiple ways to collect data from users. Generally, they can be grouped into two methods: implicit data collection and explicit data collection.

3.3.1 Explicit data collection

Explicit data collection covers all aspects where the user is asked to provide information about their preferences for an item or aspect of a system. Examples of this are ratings of items, product reviews and information about your user profile (Ricci, Rokach, and Shapira 2015).

Rating systems are usually designed as a star system or using thumbs up / down. Both YouTube and Netflix have previously had five-star rating systems but have now switched to a thumbs up / down system. Netflix made this change in 2017, when repeated testing showed that the thumbs system more than doubled user engagement with the rating system (Johnson 2017). A possible conclusion to draw is that greater user coverage results in an improved user experience compared to more granular data for fewer users.

3.3.2 Implicit data collection

Implicit data collection is the inverse of explicit data collection, meaning the user is not making an active choice in expressing their preferences. Implicit ratings can be used to infer user preferences by tracking user actions on the site (Ricci, Rokach, and Shapira 2015). For example, this might be which search queries a user has entered or whether a user is reading the description of a specific product. This kind of data can be collected to show trends, which can thereafter be interpreted (Falk 2019). Which data is collected depends on the domain in which the recommender system is deployed. One of the most influential actors in the market of collecting big data from user activity is Netflix. With over 200 million paying customers (Statista 2021), huge amounts of data are created that Netflix then uses to analyze its customers and their behavior. A major part of that data is implicitly collected and can therefore be generated in large quantities for each user. In addition to the user's watch history, a lot of other data is also collected, such as:

• If a film is stopped before it finishes playing and, if so, when

• If film content is paused and, if so, when

• If film content is fast-forwarded and, if so, when

• If parts of the content were rewatched

• Time of day when a film is played

• Location of user

• What type of device the user is watching on

• User browsing pattern. What content is clicked on, read more about, etc.

• User search history

(Maddodi et al. 2019)

3.4 Preprocessing data

When large amounts of data are collected from descriptions of items, a significant part of the information usually needs some level of preprocessing before being used in a model. It is important to remove unnecessary characters found in a text apart from words, such as HTML tags and punctuation marks; this procedure is called bleaching. Lowercasing all words ensures that equal words will be considered equal regardless of sentence casing. Words that do not add meaning or help explain the meaning of a text should also be removed. One such type of words is called stop-words. These are common words that are usually found in most texts irrespective of the content of the text. Stop-words are typically articles, prepositions, conjunctions or pronouns, as these are parts of speech which do not add semantic value to the text they are found in. For most languages, standardized lists of stop-words can be found (Schütze, Manning, and Raghavan 2008). Furthermore, words that do not contribute meaning given the specific context of the content should be removed. To exemplify, the word "nature" can be removed if texts from outdoor magazines are used, because "nature" is expected to be found in almost all documents in the corpus and thus adds no value in explaining the meaning of a text. Removing stop-words and contextual words prevents high-frequency words from being included and introducing background noise that weakens the model (Falk 2019).

Since a model comparing words between texts cannot distinguish between past tenses or plural forms of the same word, it will consider them as separate words with different meanings. Stemming and lemmatization can be used to reduce the number of variations of the same word by removing its conjugations. Stemming does this by decomposing each word to a root form. The root form usually does not have inherent meaning, e.g. "love" and "lovingly" may become "lov". It is worth considering that the stemmed word itself may have a conflicting meaning with another word. While stemming simply chops off the ending of a word to find a common base, a lemmatizer derives the base word using a vocabulary (Schütze, Manning, and Raghavan 2008). An example of this can be seen in figure 2.

Some words completely change meaning when they appear in sequence as opposed to separately in a text. An example of such words is ice cream. Dissected word by word, the meaning is different compared to when the words are combined. A method called phrase extraction can be used to connect these words by using manually or automatically generated dictionaries. This can also be used for names: inspecting the name George Bush word by word will mean something else than combining George-Bush into one word. The resulting word is called an n-gram, where n represents the number of words put together to form a concatenated word (Aggarwal et al. 2016).

Figure 2: Before and after the string has been passed through the data-transformer. The transformations applied were: bleaching, lowercasing, punctuation removal, stop-word removal and lemmatization.
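To make the stemming/lemmatization difference concrete, the sketch below contrasts the two with nltk. It is an illustration only: the English SnowballStemmer and WordNetLemmatizer are assumed here (the thesis itself processes Swedish text), and the printed outputs are indicative.

```python
# Stemming vs. lemmatization with nltk (English models assumed; the
# lemmatizer requires the nltk "wordnet" data package to be downloaded).
from nltk.stem import SnowballStemmer, WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

words = ["dogs", "loving", "studies"]

# Stemming chops word endings down to a common base, not always a real word.
print([stemmer.stem(w) for w in words])          # e.g. ['dog', 'love', 'studi']

# Lemmatization derives the base word using a vocabulary.
print([lemmatizer.lemmatize(w) for w in words])  # e.g. ['dog', 'loving', 'study']
```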

3.5 Content based filtering

Content-based recommender systems create recommendations by comparing how similar items are. This is then used with user profiles to produce recommendations containing items which are as close to the user's preferences as possible. A score is calculated based on how well the features of the user profile match the attributes of an item. This score is then ranked, and the items most relevant to a specific user's profile will be displayed as recommendations (De Gemmis et al. 2015). E.g. a user who has watched Batman will be recommended other superhero movies such as Superman, or other action movies directed by Christopher Nolan such as Inception. The attributes of an item can consist of metadata that describes an item's properties and of the item's data in text form. A news article, for example, has metadata about who wrote the article, when it was published, etc. However, this type of data may be too short or not sufficient to base recommendations on. The second type of data in this example consists of the actual text of the article. It is a larger amount of information, but can be more difficult for a model to interpret as it depends on natural language processing to derive its meaning (ibid.). User profiles can consist of both explicitly and implicitly collected information about the user's behavior on the page (Khusro, Z. Ali, and Ullah 2016), as described in section 3.3. In order for a recommendation in a content-based recommendation system to be produced, a process of three steps is needed:

Content analyzer. First, a model needs to be created over the content, where each item in the content is described based on its attributes. This requires that the item data is processed as described in section 3.4, leaving only relevant information. Then a bag-of-words (BoW) model can be created. In a BoW, all words are split up and treated separately. Some information is then lost, because the meaning of words sometimes depends on which adjacent words are present; the word order in the text is also lost. The information that remains after removing irrelevant words from the BoW model can be represented by keyword vectors (Falk 2019).

Profile learner. The result of the content analyzer is set against a user profile if personalized recommendations are desired. As described earlier, the user profile can consist of explicit or implicit data, depending on what data is available and what is possible to collect. Each user profile is represented as a user profile vector containing data describing the user (Falk 2019).

Filtering component. To get recommendations, the model created in the content analyzer needs to be combined with the user profiles in the profile learner. The filtering component matches user profiles with the items that best match the profile. This is done by comparing the mathematical similarity between the vector created from the user profile's collected data and the keyword vectors of items (ibid.).

Sections 3.6 and 3.7 introduce two commonly used content analyzer methods.

3.6 Term Frequency - Inverse Document Frequency

The Term Frequency - Inverse Document Frequency (TF-IDF) algorithm compares the frequency of terms in documents to determine the similarities between the documents. Depending on their frequency, some terms add more meaning than others. A document containing a high frequency of a specific term indicates that this term is important for understanding the meaning of the document. For example, say you would like to compare different articles and the term weather frequently appeared in one article. This would be important information, since the appearance of a single term multiple times indicates that the article is about some weather event. However, if all articles you are comparing concern weather events, the term weather does not add meaning since it will appear frequently in most of the documents (Aggarwal et al. 2016). As the name implies, the TF-IDF algorithm contains two parts: Term Frequency and Inverse Document Frequency.

3.6.1 Term Frequency

Term frequency is an algorithm that determines how frequently a word appears in a document. It does so by dividing the number of times a term appears in a document by the total number of terms in the document. Since documents vary in length, the term weather may appear more often in a long document than in a short one; dividing by the total number of terms rectifies this (Qaiser and R. Ali 2018). This is expressed in equation 1, where:

• n(t) is the number of times the term t appears in the document

• N is the total number of terms in the document

$$tf(t) = \frac{n(t)}{N} \tag{1}$$

3.6.2 Inverse Document Frequency

The term frequency algorithm is not concerned with what type of word it is being asked to examine. When analyzing a document, words like "downpour" and "of" would be treated the same. To rectify this problem, the inverse document frequency algorithm is applied. It assigns heavier weights to words that appear infrequently in the corpus and lower weights to words that appear more frequently, essentially lowering the score for more common words (Qaiser and R. Ali 2018). This algorithm can be seen in equation 2, where $N$ is the total number of documents and $n_d(t)$ is the number of documents in which the term $t$ appears. To exemplify, consider 30 documents where the word "for" appears in 23 of those 30 documents. The inverse document frequency for the word "for" would be $\ln(30/23) = 0.2657$. If the word "rain" appeared in 5 out of those 30 documents, then the inverse document frequency for "rain" would be $\ln(30/5) = 1.7917$. Since the word "rain" receives a heavier weight, it will be prioritized higher than a word like "for": in this example, "for" does not imply a strong correlation between documents the way "rain" does.

$$idf(t) = \ln\left(\frac{N}{n_d(t)}\right) \tag{2}$$

3.6.3 Term Frequency & Inverse Document Frequency

By combining the algorithms described in sections 3.6.1 and 3.6.2, the final form of the term frequency-inverse document frequency algorithm can be described. The last step of the algorithm is multiplying the results of the two previous algorithms together, as can be seen in equation 3. Terms with higher values contribute to a higher degree when comparing documents, thus being more important when explaining the meaning of the document. Each document is then stored as a vector containing TF-IDF weights for each of its terms (ibid.).

$$tfidf(t) = tf(t) \times idf(t) \tag{3}$$

By using a similarity function, documents can then be compared to find which document vectors contain terms with similar TF-IDF weights. Equation 4 is the cosine similarity between two document vectors $d_1$ and $d_2$ (Qaiser and R. Ali 2018).

$$sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert\, \lVert d_2 \rVert} \tag{4}$$
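As a concrete illustration of equations 1-4, the sketch below uses scikit-learn (one of the libraries later listed in table 3) to compute TF-IDF vectors and pairwise cosine similarities for a toy corpus. Note that scikit-learn's TfidfVectorizer applies a smoothed variant of the idf in equation 2 by default; the corpus and settings are illustrative, not those used in the thesis.

```python
# Toy TF-IDF example: weight the terms (eq. 1-3), then compare documents (eq. 4).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "heavy rain and thunder expected",
    "rain showers expected for the weekend",
    "the election results are in",
]

vectorizer = TfidfVectorizer()            # tokenizes and computes tf-idf weights
tfidf = vectorizer.fit_transform(corpus)  # one weight vector per document

# Pairwise cosine similarity between all document vectors.
print(cosine_similarity(tfidf).round(2))  # the two weather documents score highest
```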

3.7 Topic modeling

While algorithms like term frequency-inverse document frequency (equation 3) are very powerful in pairing documents together, they are limited when it comes to actually understanding the meaning of the context in a document. For example, say there is a document that contains a sentence such as: "Stella was listening to the song 'Castles in the Sand' on her way to the YouTube conference". The sentence implies that the YouTube conference is the main subject; however, the TF-IDF algorithm may assign heavy weights to words like "castle" and "sand" and therefore try to match this document with other documents about "castles", such as ones about "knights" or "kings", which in this case would be a bad match since YouTube was the main context of the sentence. This is also an issue with documents containing non-contextual words such as proverbs, names, or acronyms.

To combat this issue, one can use a modeling technique called topic modeling. Topic modeling is an unsupervised machine learning technique that generates topics which encapsulate the different meanings of words and their contexts. It does so by calculating statistical relationships between the terms and the documents in the corpus. Topic modeling is an extension of Latent Semantic Indexing (LSI); however, LSI is not a proper topic model since it does not utilize probabilities in its modeling. Probabilistic Latent Semantic Analysis (PLSA), however, is considered the first proper topic modeling algorithm (Blei, Ng, and Jordan 2003). From PLSA a more complete topic modeling algorithm was created: Latent Dirichlet Allocation (LDA), a more refined version of PLSA that has been applied in many different research fields since its establishment (Liu et al. 2016).

Since LDA is used in this study, it will be used to further explain topic modeling by providing a foundation for examples. However, LDA is only one type of topic modeling, one which primarily uses probability to facilitate relationships. There are several other algorithms that implement topic modeling in different ways, such as using a linear algebraic approach (Lee and Seung 1999).

3.8 Latent Dirichlet Allocation

The goal of Latent Dirichlet Allocation is to find topics (or themes) that suit the provided documents. For example, say we have four documents and three topics. These can be seen in table 1 and table 2, respectively.

Table 1: Documents

Doc 1      Doc 2        Doc 3    Doc 4
ball       government   planet   planet
ball       government   planet   galaxy
ball       planet       galaxy   ball
planet     planet       ball     government
galaxy     government   planet   planet

Table 2: Topics

science
politics
sports

If any one of the documents were classified according to these topics, it would not necessarily gravitate towards a single topic. For example, take document 1: it contains the word ball three times, but it also contains the words planet and galaxy. Therefore, document 1 could belong 40% to the first topic, science, and 60% to the third topic, sports. The question then becomes: which topics should be assigned to these documents? This is where LDA comes in. The LDA algorithm takes as input a number of documents and a number K, where K is the number of topics the algorithm should create. The output from the algorithm is a list of K topics and a list of vectors, one for each document, containing the probability of how strongly each topic is represented in the document (Falk 2019).
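A minimal gensim sketch of this input/output contract follows: documents and a topic count K go in, and per-document topic distributions come out. The corpus is taken from table 1, but K and all settings are toy values, not the configuration used in the thesis.

```python
# Toy LDA run with gensim: K topics in, per-document topic mixtures out.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["ball", "ball", "ball", "planet", "galaxy"],                    # Doc 1
        ["government", "government", "planet", "planet", "government"],  # Doc 2
        ["planet", "planet", "galaxy", "ball", "planet"],                # Doc 3
        ["planet", "galaxy", "ball", "government", "planet"]]            # Doc 4

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(bow_corpus, num_topics=3, id2word=dictionary,
               passes=10, random_state=0)

# Topic probabilities for Doc 1, e.g. [(0, 0.07), (1, 0.86), (2, 0.07)].
print(lda.get_document_topics(bow_corpus[0]))
```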

To critically assess the models, it is necessary to explain LDA more mathematically. Observing figure 3 below, it is possible to identify the generative process of LDA. First, priors are generated from the Dirichlet distribution, one for each of the specified topics (denoted by η in figure 3) as well as one for each specified document (denoted by α in figure 3). These priors are conjugate priors to the multinomial distribution, which is convenient since it reduces the calculation complexity later in the algorithm. Thereafter, the document priors are used to generate a multinomial topic distribution for each document (denoted by θ in figure 3), and the topic priors are used to generate several multinomial word distributions, one for each topic (denoted by β in figure 3). Samples of topics are then picked from the multinomial distribution θ. These sampled topics, denoted by z in figure 3, are then used to generate words from the multinomial distributions β, forming a sample of words w. N_d corresponds to a document, while N corresponds to the whole corpus. Therefore, the final probability from the algorithm covers the entire corpus (Blei, Ng, and Jordan 2003).

Figure 3: Graphical Model for the LDA algorithm

The task then becomes to maximize the posterior of equation 5. This turns out to be a complicated problem with exponential growth; therefore the solution needs to be approximated. Typically, two types of inference methods are used for LDA: sampling algorithms and optimization algorithms (ibid.). This thesis uses an LDA implementation that utilizes Gibbs sampling for inference. Gibbs sampling is a technique for obtaining a sample of observations from a distribution when regular sampling is too difficult (George and McCulloch 1993).

$$p(\beta, \theta, z, w \mid \alpha, \eta) = \prod_{d=1}^{N} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \prod_{k=1}^{K} p(\beta_k \mid \eta) \tag{5}$$

Equation 5. The joint distribution of the topics θ, the set of K topics z, the word mixture β and a set of N words w, with the input parameters α and η.

4 Method

This chapter aims to cover how the project was executed and explains decisions about methodology. The development consisted of two parts: the development of recommendation system models and the development of a user interface to test the models.

4.1 Project structure - Using an iterative process

The project was realised through two iterations. Each iteration had the purpose of improving the recommendation model by refining it and evaluating the changes through user interviews and testing. In every new iteration, new models were also examined and tested to investigate whether they could compete with earlier models.

4.2 Preparing development environment

When choosing a programming language, several requirements were considered. Since the aim of the project was to develop machine learning models, the ease of creating models was one requirement. A second requirement was that the language should be a high-level programming language, so that most of the time would not be spent on syntax errors or memory handling bugs. Since the project had quite a short time span, choosing a language that was well documented and easy to use was also a requirement. A programming language that fits these requirements is Python, which has extensive support for machine learning libraries and is very well documented. The language is also a staple in scientific computing because of libraries such as numpy and pandas.

The machine learning libraries used in the project can be seen in table 3. The project could have been completed with only one of the libraries; however, the inclusion of one does not interfere with the inclusion of the other, and the combined API of both libraries was more intuitive than forcing the use of only one. More information about the libraries and their versions can be found in the repository of the project.

Table 3: Machine learning libraries used in the project

ML Libraries
Gensim
Scikit Learn

4.3 Choosing a model for building recommendation systems

Since the project has access to content-based data, a content-based modeling algorithm was preferred. It was decided to use topic modeling and term frequency modeling as content analyzers. These modeling algorithms can create content-based recommendations without any influence from personalized user data if only the current item the user is interacting with is considered. However, including user data at a later stage can make the recommendations more personalized.

4.4 System Overview

The system pipeline developed during this project is implemented in Python. Figure 4 gives a simplified overview of the pipeline used to create the recommendation models and of the interface they were tested on.

Figure 4: Flowchart describing the recommendation evaluation system

The system pipeline starts with a query from the Skolmedia Client to the ElasticSearch database, as can be seen in the top right corner of figure 4. The client can decide the number of media items to retrieve, but usually all are fetched. The client then does a rough strip of all unnecessary data before it is persisted to a CSV file, to avoid redundant database requests. The SRT files are then read, parsed and mapped to their media, and afterwards added to the CSV file as a separate field containing the subtitle text.

The Datatransformer then reads the CSV file into a pandas data frame and applies the specified data transformations. For details on what type of transformations the transformer executes, see section 4.5. Depending on the corpus size, the specified transformations and the computing power, this step can take from a couple of minutes to several hours. After the transformations have completed, the transformed data is persisted as a pickle (a common data frame file format) to prevent redundant computation time.

Once transformed data exists, a model can be computed. Several models can be computed concurrently; however, the code is designed to utilize several cores, so it is usually more efficient to compute one model at a time. After a model has been trained on the transformed data, it is persisted to disk so it can be shared between multiple devices and to reduce redundant computation. The models are then fed into a Base Model class that implements basic model functions such as "predict". These are in turn fed into a Model Holder class that internally holds all models in a list and provides additional functionality for handling several models at the same time, which is required by the Flask REST API.
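A small sketch of this train-then-persist step, using gensim's save/load, is given below; the file names and the assumed "tokens" column are illustrative, not the project's actual layout.

```python
# Train a model on the transformed data frame, then persist it to disk so it
# can be shared between devices without recomputation; names are illustrative.
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

transformed = pd.read_pickle("transformed.pkl")  # output of the Datatransformer
texts = transformed["tokens"].tolist()           # assumed token-list column

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

model = TfidfModel(corpus)
model.save("tfidf.model")               # persisted for reuse

model = TfidfModel.load("tfidf.model")  # reloaded later, e.g. by the REST API
```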

The Flask REST API exposes endpoints to make predictions with specific models, authenticate and submit feedback. All of the CRUD endpoints are guarded by an authentication middleware which checks whether the user requesting the resource is authenticated. The API also exposes endpoints for the user to log in and check whether their auth-token is valid. The prediction endpoints accept a media id and a model id and, in response, return all media the model predicted for that specific media id. The endpoints responsible for submitting feedback pass the data to MongoDB, where it is persisted.
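To illustrate the contract of the prediction endpoint (a model id and a media id in, the predicted media out, behind an authentication check), a simplified Flask sketch follows. The route, the MODELS stand-in and the header check are assumptions for illustration, not the project's actual code.

```python
# Simplified sketch of the prediction endpoint; all names are illustrative.
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

# Stand-in for the Model Holder: maps a model id to a prediction function.
MODELS = {"tfidf": lambda media_id: [{"uid": "42", "title": "Example film"}]}

def is_authenticated(req):
    # Placeholder for the auth-token middleware described above.
    return req.headers.get("Authorization") is not None

@app.route("/predict/<model_id>/<media_id>")
def predict(model_id, media_id):
    if not is_authenticated(request):
        abort(401)
    if model_id not in MODELS:
        abort(404)
    # Respond with all media the chosen model predicts for this media id.
    return jsonify(MODELS[model_id](media_id))
```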

4.5 Pre-processing data

The metadata in the ElasticSearch database consisted of a multitude of fields for each media content item. For this project only the relevant metadata was chosen and stored in a separate CSV file for further processing. The metadata chosen can be seen in table 4.

Table 4: Metadata contained in the media pulled from the ElasticSearch database

Name              Description
Uid               Unique id for the media
Title             Title of the media
Surtitle          Specifies which series the media file is part of, if any
Language          Which language is used
Streaming format  Which type of media file it is: film or audio
Description       A description of the contents of the media file
Summary           A summarised description of the content
Query freetext    Condensed words for easier freetext searching with ElasticSearch
Keywords          Self-written keywords
Keyword tags      Self-written keyword tags

In order to give the modeling algorithms the best possible conditions to perform well, the data went through a preprocessing step. This step involved several sub-steps that iterated through each media item and removed unnecessary elements of the text; a condensed code sketch follows after the list.

• HTML-tags were removed from the text corpus.

• Punctuation was removed.

• All text was set to lowercase in order to treat the same words with different capitalization as equal.


• Stop words were removed using a Swedish stop word list. The Python library used for this was nltk.

• Additional custom stop words were also removed. These include words that are repeated in a large part of the documents and therefore do not add much meaning to topics. These custom stop words can be found in appendix 8.

• Words were also lemmatized. In lemmatization, words are transformed to their base form, preventing the same word conjugated differently from being interpreted as different words by the models. For example, the words Dogs and dog would both result in the word dog. This was done with a Python library.
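A condensed sketch of these sub-steps is given below. It assumes nltk's Swedish stop-word list and an illustrative custom stop-word set; the HTML stripping is simplified and the lemmatization step is omitted, since the library used for it in the project is not named here.

```python
# Condensed sketch of the preprocessing sub-steps; stop-word sets are
# illustrative (requires the nltk "stopwords" data package).
import re
import string

from nltk.corpus import stopwords

SWEDISH_STOPWORDS = set(stopwords.words("swedish"))
CUSTOM_STOPWORDS = {"film"}  # illustrative; the real list is in the appendix

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
    text = text.lower()                   # lowercase everything
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split()
              if t not in SWEDISH_STOPWORDS and t not in CUSTOM_STOPWORDS]
    # The project also lemmatizes the tokens here; that step is omitted.
    return tokens

print(preprocess("<p>Hundarna springer i den stora skogen.</p>"))
# e.g. ['hundarna', 'springer', 'stora', 'skogen']
```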

4.6 Interviews

To gain increased insight into how users on the platform interact with the content, two interviews with employees at AV-centrals were conducted. The AV-centrals distribute their media content via Skolmedia's system. This means that they decide which content should be on the platform and are therefore well acquainted with which media content works well and why. They can thus be considered expert users of the system, whereas teachers and students in the region in which the AV-central operates are the end users. The end users interact directly with the content but may lack a broader perspective on their user behaviour compared to employees at the AV-central.

Both interviews were conducted via Zoom and were one hour in length. The interview with Medienavet had three participants and the interview with Mediepoolen had two. The interviews were semi-structured, with prepared questions, but the respondents were allowed to deviate from the questions as long as they remained on the subject of the interview. The reason for semi-structured interviews was to allow the interviewees to evaluate and deviate from the topic into interesting points about what a good recommendation would look like (Kvale and Brinkmann 2014). The interview questions can be found in appendix 8.

4.7 Tuning Hyperparameters

The LDA model has a number of hyperparameters that need to be tuned. By tuning hyperparameters one can determine which values produce the best model. In the case of LDA, a topic coherence score is used to determine how good the model is. Topic coherence is a metric that determines to what degree topics overlap by measuring the semantic similarities of top-scoring words between topics (Stevens et al. 2012). Using the gensim library, a number of parameters were identified as significant for influencing the output of the model: Alpha, Beta and the number of topics. Alpha determines the distribution of topics over documents and Beta determines the distribution of words over topics (Welbers n.d). The hyperparameters can be seen on the X-axis of figure 5. To measure how the model changed when tuning different parameters, the model's coherence score was calculated. A common method for measuring topic coherence is Cv; it ranges from 0 to 1, where a score closer to 1 is considered better as it indicates less overlapping topics (Stevens et al. 2012). The values used in the tuning can be seen in table 5. The number of calculations grows rapidly when adding more values; therefore limiting the scope of the test was necessary.

Table 5: Values used to test the topic coherence score of the LDA model

Topics   Alpha        Beta
10       0.01         0.01
20       0.3          0.3
30       0.6          0.6
40       0.9          0.9
50       symmetric    symmetric
60       asymmetric
70
80
90
100
110

The results of the coherence score calculations can be seen in figure 5. The majority of calculations gave a coherence score of around 0.4-0.5. A higher coherence score means less overlapping topics, indicating a stronger model. The highest coherence score was obtained with the parameters 20 topics, Alpha=0.9 and Beta=0.9. However, the highest coherence score does not necessarily imply a better model; this depends on the model's use case.

Figure 5: Coherence analysis of LDA. The value 1 for Alpha and Beta corresponds to symmetric and 0 corresponds to asymmetric.

The results from the coherence score calculations indicated that the optimal LDA model had the hyperparameters 20 topics, Alpha=0.9 and Beta=0.9. The LDA model included in the testing, however, had 118 topics, Beta=0.07 and Alpha=symmetric, because manual testing determined that the model with the best coherence score did not give the best recommendations.
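A sketch of how such a tuning loop can be set up with gensim's CoherenceModel follows. The corpus is a toy stand-in and the grid is truncated; table 5 lists the full value grid, and Cv scores are only meaningful on a real corpus.

```python
# Sketch of a coherence-score (Cv) grid search over LDA hyperparameters.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

texts = [["ball", "planet", "galaxy"],
         ["government", "planet", "government"]]  # toy stand-in corpus
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

def cv_score(num_topics, alpha, beta):
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
                   alpha=alpha, eta=beta, passes=5, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence="c_v")
    return cm.get_coherence()

# Truncated grid; the full grid of tested values is given in table 5.
for num_topics in (2, 3):
    for alpha in (0.01, 0.9, "symmetric"):
        for beta in (0.01, 0.9):
            print(num_topics, alpha, beta, cv_score(num_topics, alpha, beta))
```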

4.8 Testing models

A crucial part of developing a recommendation system is to evaluate whether it functions as intended and whether users find it useful enough to be worth implementing. A recommendation system can be tested on several different aspects depending on the context of the specific case (Falk 2019). If the data is labelled, meaning a correct answer for a recommendation can be deduced, it is possible to perform offline testing to determine model predictability. E.g. when user ratings exist, it is possible to test whether a user receives relevant recommendations by comparing them to what the user has rated before. Since no labels existed that could determine the quality of recommendations, an online evaluation was deemed most appropriate. The advantage of not relying heavily on offline testing is that the system is tested on users and that the models can be tested for more than their predictive abilities. With offline testing there is also the issue of biases in the data, making conclusions drawn from the results less likely to be valid (Gunawardana and Shani 2015). Two types of testing were conducted, one in each iteration. The first iteration consisted only of manual internal testing, while the second iteration utilized a web interface and was tested on AV-centrals.

4.8.1 Manual testing

Due to the lack of labels in the data set, the most reasonable way to test the models in the first iteration was manual testing. This meant designing different versions of the models and loading them into an early version of the interface used for the user tests. The models were then tested repeatedly on different media content, and their recommendations were evaluated and scored. This was repeated around 100 times between two models to determine whether there were significant differences in model usability. The testing process continued until two models had been isolated for external testing. Online evaluation is resource demanding, and for this project not enough participants could be recruited to enable testing of more than two models.

4.8.2 Online evaluation

For the second iteration of the project, online evaluation was performed. Historically, the usefulness of recommendation systems was assessed largely on the basis of their predictive ability. This is still considered a key evaluation metric, but not one that alone can determine what constitutes a good recommendation engine (Knijnenburg and Willemsen 2015). Online evaluation consists of studies performed directly with users in order to evaluate how well the system works from the users' point of view. This can be done in a controlled environment where users are asked to provide feedback on how they experienced the system. There is a great deal of freedom around the type of data that can be collected, both around implicit user behavior and by asking direct questions. Since this type of experiment is performed on users who are aware that they are in a test situation, it is important to be aware of possible biases that arise. This type of experiment is also relatively time consuming, requiring more resources to perform (Gunawardana and Shani 2015).

To test a hypothesis, it is often sensible to include a reference model against which the tested model can be compared (Knijnenburg and Willemsen 2015). For the user test performed in this project, the reference model consisted of a query to the ElasticSearch database with the selected media's title and surtitle concatenated together. This behaviour directly mirrors Skolfilm's current search engine and was therefore deemed an appropriate reference. In online experiments that take place in a controlled environment, regardless of whether a completely representative test group has been recruited, bias can arise as users are aware that they are in a test situation. If they are also aware of what the test aims to answer, there is a risk that users will adapt their answers to fit the test's hypothesis. This risk also arises when users are paid to participate. To avoid this, it should not be revealed in too much detail what the study intends to answer (Gunawardana and Shani 2015).

When designing the test itself, it is also important to adjust for factors that may otherwise affect the test result. The order in which the recommendations are displayed affects the user's assessment, as what is placed at the top or far left is generally valued as slightly better. This is best remedied by randomizing the order in which the tested items are shown (Gunawardana and Shani 2015). The models behind the recommendations displayed to the user should be masked in a similar way: if different algorithms are given different labels, it may cause test subjects to favor certain labels, corrupting the test results. This is of particular importance for the reference model, which users may rate lower in order to please the testers if they are aware of seeing it. If a scale is used for users to rate some aspect of the experience, it has been shown that users can be influenced by how it is designed; a five-point star scale is generally preferred (Knijnenburg and Willemsen 2015). When several recommendation models are tested, it is possible to set them directly against each other for the user to compare. The main advantage of this is that it makes it easier for the user to determine how good a recommendation is by having something to compare with, which means that significantly smaller differences between the test objects can be found. A bias that is important to avoid is users figuring out which model is the reference model. This can be avoided, as mentioned earlier, by randomizing the order in which the models are displayed (ibid.).

A user study needs a group of test users and a list of tasks for the test subjects to perform. While users perform their tasks, quantitative data can be collected about their behavior, for example which parts of the task were performed, how well the task was done, or the time required. It is also possible to ask the users direct questions to get more qualitative results that cannot be achieved by only observing them (Gunawardana and Shani 2015). There is no standardized questionnaire for this; the choice depends on which dimension of the recommendation system one wishes to examine (Said 2013). A crucial limitation of testing on users in a controlled environment is that it is far from a real scenario, which might influence test subjects and test results (Knijnenburg and Willemsen 2015). As previously mentioned, these experiments are resource-intensive to carry out, and as much data as possible should be collected per test occasion to ensure having enough data for analyzing the results afterwards (Gunawardana and Shani 2015).

Online evaluation can also be carried out in the form of a so-called A\B test, where a small proportion of the page’s users are led to a version of the page that contains a new function, for example a new recommendation algorithm. By then comparing user behavior on versions A and B, conclusions can be drawn about the effectiveness of the new function. A risk with testing directly on end users is that a reduction in quality can cause the site to lose users. The risks associated with not testing against end users are often still seen as significantly greater; Netflix runs A\B tests for every small change before it is put into production (Falk 2019).

4.9 Test evaluation

When evaluating the tests, statistical methods are needed to interpret the results. To test whether users experience any difference between two algorithms, independent pairwise measurements can be made between the algorithms. By having randomly selected users from the population repeatedly test an algorithm, an overall assessment from each user can be obtained for each algorithm. An average of each user’s assessments should be used rather than treating each test case as an independent assessment. This is because several assessments by the same person cannot be seen as independent of each other, and the validity of the test could otherwise not be guaranteed (Gunawardana and Shani 2015).

4.9.1 Wilcoxon signed-rank test

Both Student’s paired t-test and Wilcoxon’s signed-rank test can be used to investigate whether there is any statistical difference between algorithms. Both tests use the magnitude of the difference between two paired values. The t-test assumes that the differences are normally distributed. A Wilcoxon test does not require this, as it is a nonparametric test. Another factor that can determine which test is most appropriate is the amount of data; Wilcoxon can handle smaller sample sizes than the t-test (ibid.). However, one consequence of requiring fewer assumptions is that the statistical power of the results is generally lower compared to t-tests (Conover 1998). The general procedure of performing a Wilcoxon signed-rank test is as follows. For paired data, a null hypothesis is set stating that the median difference between the sets is zero. Each paired difference is calculated and ranked according to the magnitude of the difference. The ranked paired differences are then labeled with their sign, minus or plus. A sum $W^{+}$ is calculated over the ranks of the positive differences and a sum $W^{-}$ over the ranks of the negative differences. The minimum of $W^{+}$ and $W^{-}$ is chosen and compared to a critical W-value that determines whether the null hypothesis can be rejected or not (Shier 2004).
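As an illustration of how such a test can be run in practice, the sketch below uses SciPy’s implementation on hypothetical per-user mean ratings for two models (the numbers are made up, not the thesis data):

```python
from scipy.stats import wilcoxon

# Hypothetical per-user mean ratings for two models (one value per test user).
model_a = [4.1, 3.8, 4.5, 3.2, 4.0, 3.9, 4.4, 3.6, 4.2]
model_b = [3.5, 3.9, 3.8, 2.9, 3.6, 3.4, 4.0, 3.1, 3.7]

# Null hypothesis: the median paired difference is zero. SciPy ranks the
# absolute differences and, for a two-sided test, reports the smaller of
# the positive and negative rank sums as the statistic W.
w_value, p_value = wilcoxon(model_a, model_b)
print(f"W = {w_value}, p = {p_value:.3f}")
```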

When tests are done, situations can sometimes arise when there is a draw between two algorithms. Depending on the hypothesis made, a draw can be solved in two ways. If the hypothesis is that algorithm A is better than algorithm B, a draw should be in B’s advantage. If instead the hypothesis is that A is not worse than B, a draw should go to A’s advantage (Gunawardana and Shani 2015).

4.10 Performing online evaluation

After extensive internal testing, two models were selected for inclusion in iteration two. The models and their data transformations can be seen in table 6. The LDA model did not seem to improve when using n-grams in the internal testing, so this transformation was excluded in the external test. 15 items were included in the set of recommendations to ensure that enough items were presented to the test users for them to form an opinion regarding the recommendation quality.

Table 6: Models used in external testing with their data transformations. X indicates that the transformation was applied to the entire corpus.

Model  | bleach | no punctuation | lowered | no stop words | lemmed | n-grams
TF-IDF | X      | X              | X       | X             | X      | X
LDA    | X      | X              | X       | X             | X      | -

The purpose of the test was to decide which model performed best according to users, and to determine whether that model is better than the reference model. By letting a group of users test the media several times, a ranking of the models can be produced. From this ranking it is possible to conclude which model gave the most relevant recommendations. However, this does not by itself indicate that the recommendations are of good quality; therefore the user was also asked to rate the recommendations. The test flow is described in figure 7. Steps 2-5 constitute the testing, while step 1 describes how the user has to log in to be able to use the testing interface. The test that the user performs can be broken down into four steps, which can be repeated as many times as the user desires. On the first visit to the website, the user is prompted with a dialog that describes the testing and requests 20-30 tests from the user. The dialog can be seen in figure 6.

After the user has accepted the first-time visit dialog, it is no longer shown; this step is therefore not part of the testing flow diagram.

Step one In the first step, denoted by 2 in figure 7, the user clicks a button to generate a recommendation. This sends several requests to the Flask API. The first request generates a new feedback document in the connected MongoDB and returns the feedback id to the client. The second request generates predictions based on the models and the provided media. Since every iteration of the testing flow compares two recommendation models at the same time, the models are evenly distributed in their selection, meaning there will be an even distribution of models set against each other. The selected media is randomly drawn from the ElasticSearch database.
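The thesis does not list the endpoint code; the sketch below is a minimal illustration of the two requests described, with hypothetical route names and an in-memory store standing in for MongoDB:

```python
from uuid import uuid4
from flask import Flask, jsonify, request

app = Flask(__name__)
feedback_store = {}  # stand-in for the MongoDB feedback collection


def recommend(model_name, media_id, k=15):
    # Stub: in the real service this would query the trained model
    # for the k items most similar to media_id.
    return [f"{model_name}-item-{n}" for n in range(k)]


@app.route("/feedback", methods=["POST"])
def create_feedback():
    # First request: create a new feedback document and return its id.
    feedback_id = str(uuid4())
    feedback_store[feedback_id] = {"winner": None, "ratings": {}}
    return jsonify({"feedback_id": feedback_id})


@app.route("/predictions", methods=["GET"])
def predictions():
    # Second request: return the two models' recommendations for a media item.
    media_id = request.args.get("media_id")
    models = request.args.get("models", "tfidf,lda").split(",")
    return jsonify({m: recommend(m, media_id) for m in models})
```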

Figure 6: First time visit dialog

Step two In the second step, denoted by 3 in figure 7, the user decides which recommendation they prefer, which is then considered the winner, and selects it by clicking on the blue button. This step is also illustrated by figure 8, which is a screenshot of the actual testing interface. When the user selects the recommendation they thought was best, the button is replaced by five stars. The models are denoted as Recommendation 1 and Recommendation 2 so as not to introduce biases where users associate a model name with a preference. This also prevents the user from knowing whether one of the models was the reference model (Knijnenburg and Willemsen 2015).

Step three In step three, denoted by 4 in figure 7, the user selects the number of stars they think each recommendation deserves, ranging from 1-5 stars. For the user to proceed, they have to rate both recommendations. When both recommendations are rated, the user is automatically moved to step four.

Step four Step four is the last step, denoted by 5 in figure 7. Here, the user can leave a comment motivating their choice of recommendation and click a button to move on. A free-text comment function was chosen for explaining user ratings so as not to limit the users, who are experts on the system, to predefined explanations. Diversity has been shown to be an important factor for recommendation quality (Gunawardana and Shani 2015). Therefore, for each model, comments were divided according to whether or not they concerned diversity and, if so, in what way according to the user.

When the user moves on, they are automatically routed back to step one, restarting the process. The test users are employees working at the AV-centrals, which means they are familiar with most of the movies in Skolfilm’s catalog. With this approach to choosing test users, we ensure that they have adequate knowledge to judge the quality of the recommendations.

Figure 7: Flowchart describing a typical testing user’s flow

Figure 8: Screenshot of the testing interface

4.10.1 Subtitles

SRT-files were available for 67 % of the total number of media items available from Skolfilm. To test whether SRT-files improved the model, all available SRT-files were parsed and included in the corpus. Subtitle files are normal text files with text and timestamps separated by line breaks. At the top of the file there is usually some metadata about the subtitles.

The text is paired with timestamps that determine when it is displayed during playback of the corresponding media content. An example of the structure can be seen in figure 9. The parser was designed to extract all text, remove the metadata and the timestamps, and compile the result into a single concatenated string. After the parsing succeeded, the parsed text was used to compute several new models. To test whether the inclusion of SRT-files improved the model, manual testing was performed.
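A minimal sketch of such a parser, assuming standard SRT structure (numeric cue indices, timestamp lines, text blocks); header metadata that is neither a number nor a timestamp would need extra handling:

```python
import re


def parse_srt(srt_text: str) -> str:
    # Drop cue indices and timestamp lines, keep only the subtitle text,
    # and join everything into one string for the corpus.
    timestamp = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->")
    kept = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or timestamp.search(line):
            continue
        kept.append(line)
    return " ".join(kept)


example = """1
00:00:01,000 --> 00:00:04,000
Hello and welcome.

2
00:00:05,000 --> 00:00:08,000
Today we talk about evolution."""
print(parse_srt(example))  # "Hello and welcome. Today we talk about evolution."
```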

Figure 9: SRT file structure example

5 Results

Interviews and a user study on the developed system were conducted in order to find preferences regarding recommendations amongst experts.

5.1 Interviews

The following sections present the results of the two group interviews that were conducted. The first interview was done with the AV center Medienavet in Sundsvall and the second with Mediapoolen in Borås.

Something both groups mentioned, but primarily Medienavet, was the importance of rich keyword labeling. Some films lack sufficient keywords or are incorrectly labeled, which often depends on the extent to which the media producer has labeled them. Even if the data is well labeled by the producer, the groups experience that certain labels can suffer from inflation, losing importance as they are overused (Medienavet 2021; Mediepoolen 2021).

The most active user group is teachers for grades 1-6, both in terms of searching for content on the site and showing the films to their students. According to both interviewed groups, many teachers reuse films they have used before and which they feel have worked in their teaching. This is often done by saving them among their favorites, which according to Mediapoolen some teachers use almost exclusively, not searching for new material at all. One consequence of this is that older films often do better than newer ones and that new films in the catalog are overlooked, not uncommonly for over a year before users discover them. The contracts for the films are also often fixed-term, which means that films sometimes have to be removed after a certain time. Therefore, Mediapoolen believes it would be good if a smooth transition to newer materials could be made before this happens (Mediepoolen 2021; Medienavet 2021).

Mediapoolen mentions a number of criteria that often determine the quality of a film. Good picture and sound quality, correct subtitles, and age-group labeling that matches the content all indicate a good film. In addition, it is important that media are updated on current events, e.g., a film about the UK and the EU should be produced after Brexit. A good film also matches the curriculum well. Regarding the time aspect, films around 15 minutes suit many teachers particularly well: the subject has time to be presented properly in the film while there is still enough time to work with the corresponding material during the same lesson. Even if a film is good in other respects, it is not relevant if it cannot be included in the curriculum. A preference for newer films over older ones is something that Medienavet also expresses. Medienavet also draws attention to the fact that it can be difficult to know whether a film should be considered bad or whether it simply cannot be found in the catalog. In this context, they also bring up media from UR, which is narrow in its labeling of films, as UR has a policy to only label material as directed at one specific grade. A film can thus be highly relevant for high schools even though it is marked for middle school only. Due to the way UR labels its movies, many users risk missing them when they filter by age while searching (Medienavet 2021; Mediepoolen 2021).

Regarding the type of recommendations that the surveyed groups think works best, Medienavet replied that they believe recommendations linked to the specific film that the user is visiting are preferable to recommendations based on the user’s profile displayed on the home page. Medienavet suggested that recommendations should not be based solely on the type of teacher the user’s profile describes; instead, recommendations should be based on users’ individual activity on the page. Regarding the type of metadata that recommendations should be based on, Mediapoolen believes that the content of the curriculum to which the film corresponds should be considered. They also mention that the recommendations may depend on which grade a teacher has: teachers for lower grades teach more subjects and can therefore receive broader recommendations, while high school teachers often teach only a few subjects and would then need narrower recommendations. Medienavet also points out that what constitutes a relevant recommendation depends on the domain, e.g. for language teachers the language itself is central and the subject of the film is not as important. The time of the semester can also influence the types of films that users search for; e.g. towards the end of the semester, the watching of feature films increases, Mediapoolen mentions (Mediepoolen 2021; Medienavet 2021).

Introducing a grading system is believed by both groups to benefit the service as long as it is easy to use. Both groups also agree that students should not be allowed to rate movies; their ratings are too unreliable and difficult to interpret, as they are driven by interests other than the educational aspect. Mediapoolen goes a step further and suggests that a film started by a teacher-type user should be weighted heavier, as film starts from such accounts can mean the film is screened in front of an entire class of students. Both groups also mention the introduction of expert recommendations, in the form of editors at the AV-centrals rating films. Medienavet notes in this regard that recommendations should be weighted equally between the users’ and the editors’ judgment, as the editorial board also consists of people with their own opinions and interests that influence their assessment (Mediepoolen 2021; Medienavet 2021).

5.2 Model test results

The results of the test are reported by presenting the average rating and preference for each model. With the help of other data collected during the test, such as comments and individual ratings, the test result is supplemented to provide a basis for further analysis. In total there were 11 participants, who together made 268 submissions on the web interface. The distribution of submissions was not equal across participants: the participant with the largest number of submissions had 84, whereas the participant with the smallest number had 7. The mean number of submissions was 24. In figure 10 the mean score of all users is plotted. The mean was first calculated per user so that the result is not dominated by individual users; this prevents user bias, as the number of submissions differed greatly.
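A minimal pandas sketch of the per-user averaging, with hypothetical column names:

```python
import pandas as pd

# Each row is one rating submission (toy data, hypothetical column names).
df = pd.DataFrame({
    "user":   ["u1", "u1", "u1", "u2", "u2", "u3"],
    "model":  ["tfidf", "tfidf", "lda", "tfidf", "lda", "lda"],
    "rating": [4, 5, 3, 4, 2, 3],
})

# Average per user first, then across users, so a user with 84 submissions
# does not dominate a user with 7.
per_user = df.groupby(["model", "user"])["rating"].mean()
print(per_user.groupby("model").mean())
```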

Figure 10: Mean model scores

In total, the reference model was put against TF-IDF 83 times, the reference against LDA 83 times and TF-IDF against LDA 85 times. A summary of the results can be seen in figure 11, where the number of times each model was preferred over the other is shown. As can be deduced from the figures, TF-IDF and LDA are selected more often than the reference model. To determine if there is a statistically significant difference between the models based on the results when they were compared, a Wilcoxon signed-rank test was performed.

The Wilcoxon test determines whether there is a significant difference between two data sets. The test was performed on each pairwise combination of the models. The results of the Wilcoxon signed-rank test can be seen in table 7. For each model comparison, the minimum of $W^{+}$ and $W^{-}$ was calculated and is shown as the W-value in table 7. W-crit is the critical value that the W-value has to be less than for the null hypothesis to be rejected at significance level alpha. In table 7 we can see that only the case TF-IDF vs Reference showed a significant difference. As the W-value was 0 for this case, TF-IDF vs. Reference was significant at the 99 % level, meaning the null hypothesis that the two models produce the same results can be rejected. For Reference vs. LDA and TF-IDF vs. LDA no significant difference could be seen using alpha = 0.05.

Figure 11: Wins between models: (a) Reference vs LDA, (b) Reference vs TF-IDF, (c) LDA vs TF-IDF

Table 7: Result of Wilcoxon’s ranked test. Number of values are the number of participants that were included in the test. They differ since some participants did not have enough submissions to reliably incorporate them into the test.

Model               | Number of values | W-crit | W-value | alpha
TF-IDF vs Reference | 9                | 3      | 0       | 0.01
LDA vs Reference    | 8                | 3      | 5       | 0.05
LDA vs TF-IDF       | 7                | 5      | 12      | 0.05

5.2.1 Comments

A total of 119 comments were received during the test. As each comment belongs to two models, the per-model totals mentioned in this section sum to more than 119.

Figure 12 shows the percentage of these comments by model. In total, TF-IDF had 55, LDA had 53 and the reference model had 60 comments that could be linked to the diversity of the recommendation. All models had a similar proportion of low-diversity recommendations that the user experienced as negative. Both LDA and the reference model had many recommendations with high diversity that were seen as negative by the user. Positive low-diversity recommendations accounted for a smaller proportion of comments for both of these models. For TF-IDF there is an opposite relationship, where positive low-diversity recommendations are instead in the majority.

Figure 12: Diversity TF-IDF, LDA and reference

Other results that emerged from compiling the comments concern target group, language and UR:

• 16 comments described that too much of the recommendation consisted of films from the same series as the reference film.

• 19 comments described that the recommendation contained films that did not suit the target group of the reference film in terms of age.

• 4 comments noted that recommendations in another language were not relevant.

• 4 comments specifically noted that UR movies were not desirable.

There were also comments from a small number of users where the model with the most films from the same series was chosen as the winner.

5.3 SRT-files

The models created by parsing and incorporating subtitles into the text corpus did not yield significant test results to build upon. The models created with the subtitles included in the corpus were manually tested against models with the same parameters but without the subtitles. Comparing the recommendations, the subtitles were deemed not to provide any significant improvement, and it was decided that further testing had to be done to understand whether the difference was big enough to incorporate into the pipeline.

6 Analysis

This section contains analysis of the user tests and interviews conducted with experts.

6.1 Low ratings

The goal of the project is to create as useful a model as possible for the users based on the data provided by Skolfilm. The results of the testing were therefore analyzed on the basis of various parameters to create as much insight as possible into what influences how useful a recommendation is according to experts.

From the results it can be seen that in total 78 recommendations were rated a score of one. Further analysis of low ratings is interesting, as this can provide information on the usability of the models as well as on the factors leading to low model performance.

Table 8: Comparing low rating scores between models

Model     | Rating 1    | Rating 5
Reference | 34,9 % (54) | 7,5 % (11)
LDA       | 19,6 % (29) | 10,1 % (15)
TF-IDF    | 12,2 % (18) | 14,2 % (21)

As can be seen in table 8, the reference model had a higher proportion of low ratings, whereas TF-IDF and LDA had considerably fewer. For the high ratings, the proportions are nearly mirrored. Comparing low and high ratings, the models seem above all to produce fewer very low quality recommendations and, to a lesser extent, more high quality ones. The low recommendation quality seen in the reference model can be explained by its lower prediction accuracy. For the developed models, a prominent factor explaining low ratings was whether films in the same series as the reference film were recommended: 20,7 % of the low ratings for LDA and 27,8 % for TF-IDF can be explained by this. In the user interface on Skolfilm’s website, episodes from the same series are already displayed, which several users pointed out during the test as a reason for not including these in the recommendation. There were also comments indicating that episodes from the same series may be relevant as long as it is only the next episode or a limited number. This is consistent with Said (2013) regarding the factor of novelty in recommendations. In that study, there is a negative correlation between user satisfaction with the system and finding obvious items in the set of recommendations. For the list to be diversified, episodes of the same series should be filtered out from the set of items that the model recommends, which was also repeatedly mentioned during testing. Based on this, the TF-IDF model could be significantly improved by filtering media from the same series out of the recommendation list that the user receives. However, as mentioned, the user might still be interested in episodes from the same series as long as that item is relevant and the rest of the list is diversified. It can also be argued that media from the same series may be relevant, as a series may contain very different content in itself. To include both of these factors, a recommendation should be able to contain media content from the same series, but to a limited extent, to ensure adequate diversity.
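A minimal sketch of such a post-filter, assuming each recommended item carries a hypothetical 'series' field:

```python
def limit_same_series(recommendations, reference, max_same_series=2):
    # Cap the number of items sharing the reference item's series,
    # preserving the original ranking order otherwise.
    kept, same_series = [], 0
    for item in recommendations:
        if reference.get("series") and item.get("series") == reference["series"]:
            if same_series >= max_same_series:
                continue  # drop surplus same-series items
            same_series += 1
        kept.append(item)
    return kept


ref = {"id": "m1", "series": "s1"}
recs = [{"id": f"m{n}", "series": "s1" if n < 5 else None} for n in range(2, 9)]
print(limit_same_series(recs, ref))  # at most two items from series "s1"
```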

A factor behind low ratings of LDA was recommendations with deviating target groups. In 13,8 % of the cases with a low rating for LDA, the reference film and the recommendations were aimed at different target groups. In total, 20 comments concerned recommendations where the wrong target group was part of the recommendation, which led users to lower their ratings. This mainly applied to films labeled with the target group preschool. The reason that preschool media content in particular was overrepresented is probably that this age group differs from other age groups. Children in other grades can use materials adapted for both younger and older grades; for example, a high school teacher may benefit from media adapted for both middle school and upper secondary school. A preschool teacher can find relevant material among media content adapted for students in grades 1-3, but there is no media for younger children, as preschool is the lowest grade. This leads to a narrower window of relevant media content. A filter applied to the recommendations that excludes media more than one step away from the reference target group may accommodate users with this issue. However, some media that do not belong to a nearby age classification may still be of interest for the user to be recommended. This is supported by comments such as "UR-samtiden är för en helt annan målgrupp men kan ju vara intressant för lärare." ("UR-samtiden is for a completely different target group but can still be interesting for teachers.") (Medienavet 2021). If the user is a teacher, media content targeted at teachers has some relevancy regardless of which target group the other items in the recommendation belong to.

Several comments mention that too much media of the wrong type was shown in the recommendations, e.g. "Rek 1 var konstig då den rek radio högt, trots att det gällde rek för en film..." ("Rec 1 was strange since it recommended radio highly, even though the recommendation concerned a film...") (Mediepoolen 2021). However, there were also comments requesting that media content of different types be shown. To meet both of these needs, different media types should be included in the recommendation, but to a limited extent, so that the set of items is diversified while still containing similar content. The same filtering strategy might be applied to content belonging to the producer UR Samtiden, as a large portion of the total media catalog belongs to UR Samtiden. UR Samtiden is mentioned in four comments, all indicating that media content from this specific producer may take over recommendations. Therefore this is another type of recommendation that should be considered for limiting, based on the feedback given by test users.

6.2 Diversity

Regarding the diversity of the models as experienced by the users, TF-IDF received a similar number of comments for each type of diversity measure. This differs from the other two models, which had a predominance of negative comments regarding high-diversity recommendations. That users prefer narrow and precise recommendations over high diversity can be one factor explaining why TF-IDF was preferred over the other models. The high diversity may stem from the models not being accurate enough, so that the diversified recommendations are not relevant enough for users to notice the useful ones. TF-IDF instead provides more low-diversity recommendations, both negative and positive, compared to the other models, as can be seen in figure 12. Having a high dominance of low-diversity recommendations may have led to more accurate predictions, thereby keeping down the number of low quality recommendations, as seen in table 8. According to Knijnenburg and Willemsen (2015), high diversity leads to higher user satisfaction; regarding the tested models, however, it seems that model predictability influenced the recommendations in a negative way.

One problem that the AV-centrals addressed during the interviews was the difficulty of getting users to discover new films. A major contributing factor is that many users almost exclusively use media content from their saved favorites. These are stored in a separate folder, which means that these users rarely find new content. This and other reasons that lead to users not finding new material make it difficult to determine whether new media content is not appreciated by users and therefore not consumed, or whether users simply do not find the content. A recommendation system should therefore favor newer media content over older. This can help users who search for media content to find recently added material to a greater extent.

6.3 Number of keywords

The results from the user test were analyzed to see how the amount of metadata for a film, and specifically the number of keywords, affected the ratings users set. In figure 13a, a plot shows how the proportion of media with a rating of 1 changes depending on the number of keywords the film has. A corresponding plot for rating five is seen in figure 13b. There is an opposite relationship between the graphs, which indicates that an increased number of keywords increases the probability that the user sets a higher rating. At about 15 keywords there is a cut-off point, beyond which an increased number of keywords no longer means a larger proportion of films with a high rating.

Figure 13: Proportion of results with rating 1 and 5 depending on number of keywords. (a) Proportion of feedback with rating 1 vs number of keywords for the media the recommendation was based on; (b) the same for rating 5.

Based on the results of the testing, the most suitable recommendation system is a TF-IDF model. TF-IDF performed significantly better than the reference model, which the LDA model did not. The other results in the test also pointed in this direction, even though they did not show with statistical significance that TF-IDF outperformed LDA.

6.4 TF-IDF vs LDA

Even though TF-IDF is a simpler algorithm than LDA, the results indicate that the recommendations produced by TF-IDF were of higher quality. One reason could be the complexity of LDA compared to TF-IDF. Even though LDA is able to create topics and should therefore, in theory, be able to make more advanced recommendations, the time frame of this project allowed for only two iterations of model refinement. This gives simpler model building techniques an edge, since more complex algorithms generally need further research and refinement.
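To illustrate how little machinery a content based TF-IDF recommender needs, the sketch below ranks items by cosine similarity with scikit-learn (a toy corpus, not the thesis pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Metadata text per media item (toy corpus).
documents = [
    "evolution animals galapagos darwin",
    "evolution scientists carl von linne botany",
    "world war history europe",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

# Recommend the items most similar to item 0 by cosine similarity.
similarities = cosine_similarity(tfidf[0], tfidf).ravel()
ranking = similarities.argsort()[::-1][1:]  # skip the item itself
print(ranking)  # most similar items first
```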

To determine the overlap between topics, a coherence score can be used. The coherence score of the LDA model was acceptable at 0.55; however, this score was measured using the C_v measure. There are currently seven coherence score measures in total (Röder, Both, and Hinneburg 2015); the different measures can be seen in table 9.

Optimally, the method should have taken these scores into consideration, but due to time constraints C_v was the only coherence score measure used.

Table 9: Coherence Score Measures (Röder, Both, and Hinneburg 2015)

C_v, C_p, C_UCI, C_UMass, C_UMass (one-any), C_NPMI, C_A

Furthermore, this LDA model utilizes Gibbs sampling for inference. Exact inference in LDA is theoretically possible but computationally intractable, so sampling and optimization algorithms are used to approximate the solution. The time frame of this thesis did not allow a comparison between sampling-based and optimization-based inference; this should be addressed in the future, since an LDA model with optimization-based inference could achieve a better coherence score or provide better recommendations.
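As an illustration, coherence can be computed with gensim as below. Note that gensim's LdaModel uses variational inference rather than Gibbs sampling, so this only illustrates the coherence measurement itself (toy data, not the thesis corpus):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy tokenized documents; the real corpus would be the preprocessed metadata.
texts = [
    ["evolution", "animals", "galapagos", "darwin"],
    ["evolution", "scientists", "botany", "linne"],
    ["war", "history", "europe", "battle"],
    ["history", "europe", "king", "battle"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=0)

# The C_v measure; swapping the `coherence` argument gives other measures.
cm = CoherenceModel(model=lda, texts=texts,
                    dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())
```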

7 Proposed Future Directions

The second research question was stated as follows: What recommender system and data collection methods are relevant to use on the system? To answer this, a study was conducted investigating relevant recommender system methods for Skolfilm. The available data allowed for content based filtering, using the current media content as the basis for recommendations. By extending data collection, this method or other methods can be further developed to increase usability for end users.

7.1 Collaborative filtering

In contrast to content-based filtering, collaborative filtering does not need information about the items it recommends. The phrase was coined by a developer who worked on one of the first recommendation systems, ’Tapestry’, and has since been used extensively in studies that implement the same type of filtering techniques (Goldberg et al. 1992). Collaborative filtering is a recommendation algorithm that utilizes user data to identify similarities between users. The algorithm then uses these similarities to pair users and recommend items between them. Pairing users can be done in numerous ways and often depends on what kind of user-specific data the algorithm has access to (Falk 2019).

By using an implementation of collaborative filtering called user-user, relevant items are recommended by finding users that are similar. Similar users are identified by comparing their preferences; users with overlapping preferences are evaluated as similar. A common example of how user-user collaborative filtering is used is movie recommendations based on user ratings. As can be observed in table 10, User1 and User2 have similar ratings on similar movies. Therefore, we can assume that they are interested in the same type of movies, and they will be classified as such in the produced model. The model may then recommend Movie5 to User1, since User2 has seen it and rated it highly. Conversely, User1 and User4 have incompatible ratings and will therefore not be assumed to have similar preferences (Isinkaye, Folajimi, and Ojokoh 2015).

Table 10: Table that shows mock data of users and the movies they have rated. The ratings range from 1-10. Null indicates that a user has not rated the movie

Ratings | Movie1 | Movie2 | Movie3 | Movie4 | Movie5
User1   | 3      | 4      | 5      | 5      | null
User2   | 3      | 3      | null   | 6      | 8
User3   | 1      | 8      | null   | 2      | 2
User4   | 1      | null   | 2      | 1      | null
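As an illustration of the neighbourhood idea, the sketch below computes Pearson correlations over co-rated movies from table 10 (the similarity measure is an assumption for illustration; the thesis does not specify one for this example):

```python
import numpy as np

# Ratings from table 10; NaN marks movies a user has not rated.
ratings = np.array([
    [3, 4, 5, 5, np.nan],       # User1
    [3, 3, np.nan, 6, 8],       # User2
    [1, 8, np.nan, 2, 2],       # User3
    [1, np.nan, 2, 1, np.nan],  # User4
])


def pearson(u, v):
    # Pearson correlation over co-rated movies only.
    mask = ~np.isnan(u) & ~np.isnan(v)
    if mask.sum() < 2:
        return 0.0
    a = u[mask] - u[mask].mean()
    b = v[mask] - v[mask].mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# User1 correlates strongly with User2 and weakly with User4, so Movie5
# (rated 8 by User2, unseen by User1) becomes a candidate recommendation.
print(pearson(ratings[0], ratings[1]))  # ~0.87
print(pearson(ratings[0], ratings[3]))  # ~0.50
```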

Collaborative filtering can also use so-called item-to-item filtering, wherein items are bundled together based on user reviews and an item is then recommended to a user based on the similarities between items (Isinkaye, Folajimi, and Ojokoh 2015).

There are some drawbacks to collaborative filtering, such as the ’cold start problem’: if a user has no ratings, no conclusions about similarity with other users can be drawn. There is also the sparsity problem, where the number of ratings is usually small relative to the number of users and items, causing recommendations to be heavily biased towards the few existing ratings (ibid.). To create a model that can produce recommendations of higher quality and usability, collaborative filtering could be used. Based on a user profile, users are matched with others with similar user behaviors and profiles. In the current model, only non-personalized data is used, which means that all users receive the same recommendations for the same activity on the page. Although these recommendations are accurate insofar as the user is recommended the media content most similar to what they are currently watching, personal preferences may come into play that make them more or less relevant for the specific user. A personalized recommendation can therefore be highly valuable in order to create useful recommendations for a wider range of users. This is also supported by experts of the system, who during the interviews asked for a personalized recommender system. As the system has more items than users, a user-user based collaborative filtering method would be most appropriate (Ricci, Rokach, and Shapira 2015).

A recommendation system that uses collaborative filtering needs access to user data to create personal recommendations. At present, data on users’ activity on the site is collected implicitly, but without any connection to specific accounts. A measure of how much a user appreciates a particular media item can be obtained if the collected watch statistics and user engagement data can be linked to a user. As for explicitly collected data, it is currently available in the form of media that users can add to their favorites. One way to collect more explicit data, ensuring greater user coverage, is to create a media rating system that teachers can use. Experts in both interviews stated that ratings should be provided only by teachers, to avoid students contributing noisy data. They also suggested a simple rating system to ensure a high degree of usage (Medienavet 2021; Mediepoolen 2021). To determine whether recommendations based on collaborative filtering are more useful than the content based recommendations shown so far, online evaluation would need to be performed.

The user’s activity on the page needs to be collected and stored in order for a collaborative filtering model to be developed. As it can be sensitive to collect too much personal data about Skolfilm’s users, given the school environment, the integrity aspect needs to be carefully considered. A semi-personalized model, where each user is part of a predefined group with similar preferences, is a compromise in which personal integrity is protected while some personalization of the recommendations can still be provided. When the user receives recommendations, these are then based on the overall behavior of other users in the same group (Falk 2019). All data generated in the group is thus compiled, and it will not be possible to track the activity of individual users. One consequence of a semi-personalized recommendation system is that all users belonging to the same group will receive the same recommendations. It is therefore important that these predefined groups do not have unnecessarily large internal differences. The division that most closely resembles reality, and which should thus work best in practice, is to divide users according to which year groups and subjects they teach. To avoid groups with too few users, the range of year groups and subjects a teacher can choose should be limited. Based on the interviews, several assumptions can be made about the users depending on the type of teacher and the grade taught, which can also affect recommendations. Medienavet mentions that language teachers will be more interested in media content for the language and grades they are teaching, and disregard the exact subject of the media content. They also mention that teachers for lower grades usually have many more subjects compared to teachers for higher grades (Medienavet 2021). More tests should be performed to find what differences in preferences exist depending on teacher type, as well as other external factors such as the time of the semester or the type of school.

7.2 Feature combination hybrid

To deal with common complications associated with collaborative filtering, such as cold start problems and data sparsity, different hybrid filtering models have been proposed in the research field. One solution that has been shown to produce positive results is a method combining various types of user-generated data to ensure the most appropriate data is used to make recommendations. This system, proposed by Zanker and Jessenitschnig (2009), turned out to increase recommendation prediction accuracy as well as user coverage. Usually either implicit or explicit data is used, but in this type of hybrid model both types of data are used based on their availability from users. The predictive power of user ratings for items can in general be considered much higher than that of implicitly collected information about the user, such as page views. However, explicit data is usually less available, as users do not necessarily generate enough of this data to base recommendations upon. Conversely, rating domains such as page views have an abundance of data but suffer from noise, making accurate predictions more difficult. The system, called feature combination hybrid, is described below (ibid.).

$$\operatorname{rec}_{fch}(i, u, d_{rec}) = \frac{\sum_{v \in N} score_{i,v}}{N} \qquad (6)$$

$$score_{i,v} = \begin{cases} \operatorname{sim}(u, v) & \text{if } i \in d_{rec}(v) \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

The input of equation 6 consists of an item i, a user u and a recommendation domain d_rec. A recommendation domain can be defined as the type of data that is used for comparing items between users, such as user ratings, page views or search queries. The output is the average score for i, taken over u's N closest neighbours v (Zanker and Jessenitschnig 2009). Equation 7 states that the similarity-based score is used if the item can be found in the neighbour's rating domain; otherwise the score is set to 0.

As feature combination hybrid is a user-user collaborative filtering method, the similarity between users must be calculated. This is done by comparing the cosine similarity between users in each domain. As opposed to when only one rating domain is used to determine the similarity between users, all necessary rating domains are utilized and weighted according to their importance. A rating domain is deemed necessary if it has to be deployed to find enough similar users to base recommendations for a user u on. Rating domains are therefore ranked by their predictive power; the highest ranked rating domain is prioritized, and if not enough data is available, the next rating domain in the hierarchy is recruited. The weights associated with each rating domain can be set equal at first and later adjusted based on specific domain knowledge or offline testing on the current data set (ibid.). The following example is set up to further clarify how this method works.

Table 11: Example of feature combination hybrid

User             | R_search   | R_visit            | R_view     | R_rate
Alice            | q1, q3, q4 | i2, i3, i4, i6, i7 | i2, i4     | i2, i4
Bob              | q2, q4     | i1, i2, i4         | i1, i2, i4 | i1, i4
Charlie          | q1, q5     | i1, i4, i5         | i5         | i5
David (new user) | q2, q5     | i2, i3             | i2         | null

Table 11 shows the current users in the system. A new user, David, has just begun using the system, resulting in few interactions. In a traditional collaborative system, no recommendations could be given to David in this scenario. With FCH, a list of personal recommendations for David is produced by enabling several rating domains when comparing user similarities. If enough neighbours N are found and item i is present in the rating domain, a recommendation score is given for i. The weight for each rating domain is set equal to simplify.

$$score_{i_4,\text{Alice}} = \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot \frac{1}{\sqrt{2}} = 0.35, \qquad score_{i_4,\text{Bob}} = \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot \frac{1}{\sqrt{3}} = 0.29 \qquad (8)$$

$$\operatorname{rec}_{fch}(i_4, \text{David}, \{R_{rate}, R_{view}\}) = \frac{score_{i_4,\text{Alice}} + score_{i_4,\text{Bob}}}{2} = 0.32 \qquad (9)$$

If N = 2 is chosen, rating domain R_rate cannot be used on its own, but with R_view added, two neighbours are found for whom i4 is also in the rating domains R_rate and R_view. Now scores for i4 from Alice and Bob can be calculated as seen in equation 8. The combined score is then seen in equation 9. To exemplify how multiple rating domains are included when not enough neighbours with the chosen item are found directly, the recommendation score for i3 for David is calculated using N = 3.

$$score_{i_3,v} = \sum_{d} \frac{1}{4} \cdot \cos_d(\text{David}, v): \qquad score_{i_3,\text{Alice}} = 0.33, \quad score_{i_3,\text{Bob}} = 0.14, \quad score_{i_3,\text{Charlie}} = 0.13 \qquad (10)$$

$$\operatorname{rec}_{fch}(i_3, \text{David}, \{R_{search}, R_{visit}, R_{view}, R_{rate}\}) = \frac{score_{i_3,\text{Alice}} + score_{i_3,\text{Bob}} + score_{i_3,\text{Charlie}}}{3} = 0.22 \qquad (11)$$

As can be seen in equation 10, enough neighbours cannot be found for David using the rating domains R_rate and R_view alone, as was possible in equation 8. Rating domains lower in the hierarchy need to be utilized to find 3 similar users who also have i3 in some rating domain. Using all available rating domains, a score can be calculated as seen in equation 11. If a collaborative filtering system relying only on explicit data were used, David would have no similar users, since he has not rated any content, and the recommender could not produce any recommendations for him.
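A minimal sketch of equations 6-7 in Python, assuming binary interaction profiles and equal domain weights; it reproduces the numbers of equations 8-9 for the toy data (the data structures and function names are illustrative, not the authors' implementation):

```python
import math

# Binary interaction profiles per rating domain, taken from table 11 and
# restricted to the R_rate and R_view domains used in equations 8-9.
profiles = {
    "Alice": {"rate": {"i2", "i4"}, "view": {"i2", "i4"}},
    "Bob":   {"rate": {"i1", "i4"}, "view": {"i1", "i2", "i4"}},
    "David": {"rate": set(),        "view": {"i2"}},
}
weights = {"rate": 0.5, "view": 0.5}  # equal domain weights, as in the example


def cosine(a, b):
    # Cosine similarity between two binary item sets.
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def similarity(u, v):
    # Weighted sum of per-domain cosine similarities (feature combination).
    return sum(w * cosine(profiles[u][d], profiles[v][d])
               for d, w in weights.items())


def rec_fch(item, user, domains, neighbours):
    # Equations 6-7: average the similarity scores over neighbours,
    # counting a neighbour only if it has the item in an enabled domain.
    scores = []
    for v in neighbours:
        has_item = any(item in profiles[v][d] for d in domains)
        scores.append(similarity(user, v) if has_item else 0.0)
    return sum(scores) / len(scores)


print(rec_fch("i4", "David", ["rate", "view"], ["Alice", "Bob"]))  # ~0.32
```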

In Zanker and Jessenitschnig’s (2009) study, it was shown that using product views as the recommendation domain provided recommendations with better accuracy scores than products bought. This may seem counterintuitive but could be explained by products bought being sparser than product views, which contained considerably more data (ibid.). Data collection in the form of user ratings and expert ratings is requested by experts on the system (Medienavet 2021; Mediepoolen 2021). If a collaborative model were based solely on explicit data collection, there is a risk that not enough data would be available to base recommendations on. Implicit data is already collected in the system but needs to be connected to individual users. If that connection can be set up, feature combination hybrid may be used to ensure a system that does not suffer from cold start problems due to insufficient data on user or expert ratings.

7.3 Session based recommendations

The previously described types of recommendation systems, content-based and collaborative filtering, use the complete history of the users’ interactions with the system to create a profile. This comes with the assumption that all historical user data is equally valuable for the next recommendation. However, users can change their preferences over time, which means that historical data can contribute to an incorrect picture of the user’s current preferences. Using the full history also means that recommendations are created based on the user’s complete profile and not on what they are currently searching for. Session based recommender systems (SBRS) are instead based on the items a user interacts with during the current session on the site, and recommend the most relevant items for the user based on that information. An SBRS consists of a number of different entities whose properties determine what kind of SBRS it is (Wang et al. 2019). It may be known which user belongs to which session, but often an SBRS handles data where the user is anonymous. What the user gets recommended is called items, which can be movies, hotels, services, etc. On each item, a user can perform an action in the form of clicking, starting a movie, leaving feedback, etc. An interaction then consists of a user performing an action on an item, and a list of interactions forms a session, which in turn has a set of attributes. Based on this metadata it can be decided which algorithms are suitable to apply and what kind of recommendations the system can produce. A session contains a series of user-item interactions; for example, a user looking for a movie on Netflix visits a number of movies, reading their descriptions. A common application of an SBRS is to let the system predict how a session can be supplemented given its current state, i.e. which item or items should be recommended to the user based on their activity in the current session. This is done mainly by modeling patterns of dependencies between interactions within the session. E.g., depending on which items a user has put in their basket on a shopping site, the user is recommended other, similar items that they might also buy (ibid.).

Attribute 1: The length of a session, i.e. how many interactions it consists of. Long sessions, typically over 10 interactions, provide richer information on which to base more accurate recommendations. However, there is a risk that irrelevant interactions are included, which give rise to noise and make prediction more difficult. For short sessions, the problem is reversed (ibid.).

Attribute 2: The arrangement of interactions within a session. Depending on the nature of the session, a system may consist of either ordered or unordered interactions. In an ordered session, the recommendation depends on which item is estimated to come next in the sequence. In an unordered structure, it is instead the co-occurrences of items in the same session, regardless of order, that are of importance (Wang et al. 2019). E.g., a user buying train tickets follows a set order of procedure when deciding dates, times, train, passenger class etc., while a user watching films on a streaming platform has no order in their interactions with the system.

Attribute 3: Which types of actions are included in a session. A distinction is made between whether a session has homogeneous or heterogeneous intra-session dependencies. Homogeneous means a single action type, for example every time a user adds a product to the shopping cart. Heterogeneous means that several action types are included, e.g. in addition to what the user added to the shopping cart, the products the user clicked on during the session. This means increased complexity, which can make modeling more difficult (ibid.).

Attribute 4: Information about the user associated with the session. A session can have an associated user, which provides the opportunity to learn the user’s personal preferences. However, it is a challenge to map a user’s preferences with certainty, as they usually change over time. Anonymous sessions have no user linked to the session and thus produce recommendations solely based on the anonymous user’s current context (ibid.).

The output an SBRS produces depends on the attributes of the system’s sessions as well as the type of SBRS that has been built. The recommendations may consist of the items most relevant for continuing the current session, a list of the items considered to complete the session, or a recommendation about the user’s next session based on the current one (ibid.).

The conventional methods used to find dependencies in sessions are pattern/rule mining, K-nearest neighbour (KNN) and Markov chains. In Ludewig and Jannach (2018b), both conventional methods such as KNN and more complex, computationally heavy methods, e.g. neural networks, were compared. With regard to prediction accuracy, it turned out that the simpler methods generally performed better than the newer, more complicated methods. They also require less computing and memory capacity, which simplifies implementation (ibid.). In Wang et al. (2019) a summary of the conventional approaches for SBRS can be found; it is shown in table 12.

Table 12: Comparison of SBRS approaches (Wang et al. 2019)

Approach | Pros | Cons | Applicable scenarios
Pattern/rule mining based approaches | Intuitive, simple and effective on session data where dependencies are easy to learn | Information loss, cannot handle complex data (e.g., imbalanced or sparse data) | Simple, balanced and dense, ordered or unordered sessions
KNN based approaches | Intuitive, simple and effective, quick response | Information loss, hard to select K, limited ability for complex sessions (e.g., noisy sessions) | Simple, ordered or unordered sessions
Markov chain based approaches | Good at modelling short-term and low-order sequential dependencies | Usually ignore long-term and higher-order dependencies, the rigid order assumption is too strong | Short and ordered sessions with short-term and low-order dependencies

Medienavet mentions that the user’s current watch history might be more important than modelling a full user profile (Medienavet 2021). A recommender system type that uses near-time preferences is the session based recommender system. Based on the items a user has interacted with during the current session, recommendations on what the user might be interested in next are given (Wang et al. 2019). E.g., if a user is searching for content about evolution and animals on Skolfilm, a film about Galapagos might be recommended. If another user is also searching for evolution, but for scientists instead of animals, a film about Carl von Linné might be recommended. The next time the user visits the site this information is forgotten, and the recommendations are based solely on the user’s current activities. By investigating which types of sessions, users, items and actions exist in Skolfilm, the relevant type of algorithm for an SBRS can be suggested. The length of a typical session is not possible to measure using the available data, as no session data is collected. By collecting session data it is possible to investigate whether the full session is interesting to use for mapping dependencies, or whether the sessions are too long for this, thus generating too much noise (ibid.). If that is the case, only the latest items could be used, which should reduce noise. Sessions on Skolmedia will be of unordered type; there is no predetermined order for users when they are browsing the content. This means that dependencies will be based on co-occurrences in session lists and not on the order in which these interactions appear. Several different action types can be considered, and it is also possible to include multiple types in one session. This includes searches a user has done during a session, which media content the user has clicked on, or what the user has watched. As session complexity increases with multiple action types, it is important to consider whether an added action type adds useful data for the algorithm (ibid.). User searches are the most relevant action type for users on Skolmedia. These are valuable as they contribute information about what the user is looking for but has not yet found, something that was requested during the interviews with experts at the AV-centrals. If data is collected on which media content a user visits, this could be used instead, depending on how much of such data is produced by users. If information about the user is available, it can be used to personalize recommendations by learning the user’s long-term preferences from previous sessions (Wang et al. 2019). However, according to ibid., this is hard to capture, which is why anonymous sessions might as well be used. Using anonymous data also means that specific user data does not need to be stored, which is favourable from an integrity point of view, as the users are in a school environment.

As conventional algorithms perform equivalently or better compared to more complex methods according to Ludewig and Jannach (2018b), a less complex algorithm is preferred, as long as the session data collected by Skolfilm proves to be simple enough. In table 12, three conventional methods are compared. Markov chains should not be applicable, as they require ordered sessions, which is not the case on Skolfilm. To apply KNN or pattern/rule mining, simple sessions are required, meaning that the attributes of a session do not contribute to increased model complexity. Having only one action type and anonymous users would make the sessions on Skolfilm less complex. The available data does not contain any session data, so to conclude whether the conventional methods are relevant, session data needs to be collected and investigated to determine its complexity. The weaknesses mentioned in table 12 are mainly sparse or imbalanced data and noisy session data. Data will naturally be sparse when session data collection starts, as not many sessions have been created. After this initial phase, the data can be inspected to find out whether it is dense and balanced enough for dependencies to be created by KNN or pattern/rule mining; in other words, whether enough searches are performed by users to make connections between these interactions and find the next probable item. Furthermore, are some items overrepresented in sessions, or is the spread of items balanced? Finding how much noisy data is included in the sessions will also determine whether conventional methods are useful or not. If investigation of the collected session data concludes that the data is too complex for conventional methods, more advanced methods using neural networks might be applicable (Wang et al. 2019).
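As a toy sketch of the session-KNN idea discussed above (made-up sessions and a deliberately simple overlap measure, not a production design):

```python
from collections import Counter

# Toy sessions: unordered sets of item interactions (hypothetical data).
sessions = [
    {"evolution", "animals", "galapagos"},
    {"evolution", "scientists", "linne"},
    {"evolution", "animals", "darwin"},
]


def recommend(current_session, all_sessions, k=2, top_n=3):
    # Rank past sessions by overlap with the current one (session-KNN),
    # then suggest items that co-occurred in the k most similar sessions.
    ranked = sorted(all_sessions,
                    key=lambda s: len(s & current_session), reverse=True)
    counts = Counter()
    for s in ranked[:k]:
        counts.update(s - current_session)
    return [item for item, _ in counts.most_common(top_n)]


print(recommend({"evolution", "animals"}, sessions))
```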

To conclude whether any of the aforementioned methods are actually useful for end users, A\B testing is required. A\B testing is the most reliable method of testing, as it tests directly on end users (Falk 2019).

7.4 Additional data collection

In order to develop the current model further, ongoing data collection needs to be broadened, or a completely new type of data needs to be collected. There is some metadata for media files that has the potential to improve a future recommendation system if the data were supplemented.

SRT-files are currently available for 67 % of the movies and showed no improvement of the model when their inclusion was tested. If all media content had associated SRT-files, however, they have the potential to improve the model by providing more data for each media file. When only a proportion of media files have corresponding SRT-files, this leads to a high variance in metadata length, which might decrease model predictability. Metadata about which course objectives the media content fulfils is also available for a portion of the catalog today but is missing for the majority; 15 % of media files have this metadata, which has the potential to improve a model if the data were complete. Information about course objectives can be used so that recommendations with the same or similar course objectives are prioritized over other media. Medienavet as well as Mediepoolen mention during the interviews that it is crucial for teachers that the media content they find also belongs to a relevant curriculum. Mediapoolen also suggests that curriculum data should be used as one of the main factors for basing recommendations on. Expert recommendations should be considered to complement user ratings, as this is suggested by both Mediapoolen and Medienavet during the interviews (Medienavet 2021; Mediepoolen 2021). It is a type of recommendation that works when expert opinions are regarded as reliable and fulfil a purpose (Falk 2019). Editors at Skolfilm have extensive knowledge of the media content, which might be useful for basing recommendations to end users on.

7.5 Recommendations for AV centrals

During this work, insights have been gained into how the area of application of a recommendation system can be broadened. In addition to being useful for direct end users in the form of teachers, it can also serve a purpose for editors at AV centrals. An important factor when recommendations were of low quality in the testing was the number of keywords for the selected media. If a recommendation system can facilitate keyword labeling, it would form a self-reinforcing system. By using LDA topic modeling on SRT files, topics in media files can be found, which can aid in adding keywords to media or in finding media with insufficient or incorrect labeling.
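
A minimal sketch of this idea follows, using scikit-learn's LatentDirichletAllocation as one possible implementation (the thesis pipeline itself is not reproduced here). The toy transcripts and parameter values are purely illustrative.

    # Sketch: LDA on SRT transcripts to surface candidate keywords.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Toy corpus; in practice one cleaned SRT transcript per media file.
    transcripts = [
        "cellen delar sig och kromosomerna kopieras",
        "kromosomerna bär arvsmassan i cellen",
        "floden rinner mot havet och vattnet avdunstar",
        "regnet faller och vattnet rinner mot floden",
    ]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(transcripts)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    # The highest-weighted terms per topic can be shown to editors as
    # candidate keywords for under-labeled media.
    terms = vectorizer.get_feature_names_out()
    for idx, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[-5:][::-1]]
        print(f"Topic {idx}: {', '.join(top)}")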

Another way the current recommendation system might help editors at AV centrals is by recommending media content to them based on content whose license is about to expire. At present, there is no standardized method for deciding whether a film should remain in the catalog or be discontinued. The recommendation system could suggest movies that are similar to the outgoing movie, together with performance indices that support deciding which movie performs best among the end users.
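
One possible shape for such a feature is sketched below: given a precomputed item-item cosine-similarity matrix, the titles most similar to the expiring one are retrieved and then ranked by a performance index. The play counts used here are an invented stand-in for whatever indices Skolfilm would actually collect.

    # Sketch: replacement candidates for an expiring title.
    import numpy as np

    def replacement_candidates(expiring_idx, similarities, plays, n=3):
        scores = similarities[expiring_idx].copy()
        scores[expiring_idx] = -1.0               # exclude the title itself
        most_similar = np.argsort(scores)[::-1][:n]
        # Rank the most similar titles by how well they perform with users.
        return sorted(most_similar, key=lambda i: plays[i], reverse=True)

    # Toy data: 4 titles, pairwise cosine similarities and play counts.
    sim = np.array([[1.0, 0.8, 0.3, 0.6],
                    [0.8, 1.0, 0.2, 0.5],
                    [0.3, 0.2, 1.0, 0.1],
                    [0.6, 0.5, 0.1, 1.0]])
    plays = [120, 40, 300, 95]
    print(replacement_candidates(0, sim, plays))  # -> [2, 3, 1]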

8 Conclusions

Regarding the first research question, "According to experts of the system and using available data, what should recommendations in the system consist of?", the following conclusions can be made. By interviewing experts and testing a content based recommendation system designed using available data, it was discovered that the number of keywords associated with media content is an important factor in deciding the quality of a recommendation model. Recommendations benefit from being of low diversity, with the exception of media content belonging to the same series as the reference media. Furthermore, recommendations in the current system benefit from being of a similar target group and media type. Using a Wilcoxon signed-rank test, it was also concluded that the TF-IDF model outperformed the reference model. LDA did not seem to generate satisfactory recommendations but might become useful after additional testing.
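
For reference, the paired comparison behind this conclusion can be expressed in a few lines with SciPy. The rating arrays below are invented example data, not the study's actual measurements.

    # Sketch: Wilcoxon signed-rank test on paired expert ratings of the
    # same reference items under the TF-IDF and Elasticsearch models.
    from scipy.stats import wilcoxon

    tfidf_ratings     = [4, 5, 3, 4, 5, 4, 3, 5]   # invented example data
    reference_ratings = [2, 3, 3, 2, 4, 3, 2, 3]

    stat, p = wilcoxon(tfidf_ratings, reference_ratings)
    print(f"W = {stat}, p = {p:.3f}")  # small p -> the models differ systematically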

The second research question was stated as "What recommender system and data collection methods are relevant to use on the system?" From developing a recommender system for Skolfilm, it can be concluded that more data collection is needed to further understand user behaviour and create recommendations that predict user preferences with higher accuracy. By initiating additional data collection, Skolfilm would be able to further develop a recommender system that provides recommendations based on more information about users. Collaborative filtering applied to user favorites, a feature combination hybrid method, and session-based recommendations are possible paths for future recommendation systems, based on data that Skolfilm can collect and on what expert users of the system have suggested. Further testing with expert users as well as end users, in the form of A/B testing, needs to be conducted to determine what an optimal recommender system on Skolfilm's domain would be.

References

Aggarwal, Charu C et al. (2016). Recommender systems. Vol. 1. Springer.
Amatriain, Xavier and Justin Basilico (2012). "Netflix recommendations: Beyond the 5 stars (part 1)". In: Netflix Tech Blog 6.
Bhosale, Supriya, Vinayak Pottigar, and Vijaysinh Chavan (2015). "A Review on Video Streaming in Education". In: International Journal of Computer Science and Information Technologies 6.2, pp. 1088–1091.
Blei, David M, Andrew Y Ng, and Michael I Jordan (2003). "Latent dirichlet allocation". In: Journal of Machine Learning Research 3, pp. 993–1022.
Boster, Franklin J et al. (2006). "Some effects of video streaming on educational achievement". In: Communication Education 55.1, pp. 46–62.
Cisco (2012). The Impact of Broadcast and Streaming Video in Education. url: https://www.cisco.com/c/dam/en_us/solutions/industries/docs/education/ciscovideowp.pdf.
Clio (2020). Digitization in Swedish schools 2020. url: https://www.clio.me/wp-content/uploads/2020/06/clio-market-research-se.pdf.
Conover, William Jay (1998). Practical nonparametric statistics. Vol. 350. John Wiley & Sons.
De Gemmis, Marco et al. (2015). "Semantics-aware content-based recommender systems". In: Recommender systems handbook. Springer, pp. 119–159.
Falk, Kim (2019). Practical Recommender Systems. Shelter Island, NY: Manning Publications Company. isbn: 9781617292705.
George, Edward I and Robert E McCulloch (1993). "Variable selection via Gibbs sampling". In: Journal of the American Statistical Association 88.423, pp. 881–889.
Goldberg, David et al. (1992). "Using collaborative filtering to weave an information tapestry". In: Communications of the ACM 35.12, pp. 61–70.
Gunawardana, Asela and Guy Shani (2015). "Evaluating recommender systems". In: Recommender systems handbook. Springer, pp. 265–308.
Isinkaye, F.O., Y.O. Folajimi, and B.A. Ojokoh (2015). "Recommendation systems: Principles, methods and evaluation". In: Egyptian Informatics Journal 16.3, pp. 261–273. issn: 1110-8665.
Jannach, Dietmar and Michael Jugovac (2019). "Measuring the business value of recommender systems". In: ACM Transactions on Management Information Systems (TMIS) 10.4, pp. 1–23.
Johnson, Cameron (2017). Goodbye Stars Hello Thumbs. url: https://about.netflix.com/en/news/goodbye-stars-hello-thumbs.
Khusro, Shah, Zafar Ali, and Irfan Ullah (2016). "Recommender systems: issues, challenges, and research opportunities". In: Information Science and Applications (ICISA) 2016. Springer, pp. 1179–1189.
Knijnenburg, Bart P and Martijn C Willemsen (2015). "Evaluating recommender systems with user experiments". In: Recommender Systems Handbook. Springer, pp. 309–352.
Kvale, Steinar and Svend Brinkmann (2014). Den kvalitativa forskningsintervjun. Dimograf, pp. 138–157.
Lee, Daniel D and H Sebastian Seung (1999). "Learning the parts of objects by non-negative matrix factorization". In: Nature 401.6755, pp. 788–791.
Liu, Lin et al. (2016). "An overview of topic modeling and its current applications in bioinformatics". In: SpringerPlus 5.1, pp. 1–22.
Ludewig, Malte and Dietmar Jannach (2018a). "Evaluation of session-based recommendation algorithms". In: User Modeling and User-Adapted Interaction 28.4-5, pp. 331–390.
– (2018b). "Evaluation of session-based recommendation algorithms". In: User Modeling and User-Adapted Interaction 28.4-5, pp. 331–390.
Maddodi, Srivatsa and Krishna Prasad, K. (2019). "Netflix bigdata analytics: the emergence of data driven recommendation". In: International Journal of Case Studies in Business, IT, and Education (IJCSBE) 3.2, pp. 41–51.
Medienavet (2021). AV-central, 2 participants in the interview. Interview conducted by Max Netterberg & Simon Wahlström.
Mediepoolen (2021). AV-central, 3 participants in the interview. Interview conducted by Max Netterberg & Simon Wahlström.
Qaiser, Shahzad and Ramsha Ali (2018). "Text mining: use of TF-IDF to examine the relevance of words to documents". In: International Journal of Computer Applications 181.1, pp. 25–29.
Ricci, Francesco, Lior Rokach, and Bracha Shapira (2015). "Recommender systems: introduction and challenges". In: Recommender systems handbook. Springer, pp. 1–34.
Röder, Michael, Andreas Both, and Alexander Hinneburg (2015). "Exploring the space of topic coherence measures". In: Proceedings of the eighth ACM international conference on Web search and data mining, pp. 399–408.
Said, Alan (2013). "Evaluating the accuracy and utility of recommender systems". PhD thesis. Universitätsbibliothek der Technischen Universität Berlin.
Schütze, Hinrich, Christopher D Manning, and Prabhakar Raghavan (2008). Introduction to information retrieval. Vol. 39. Cambridge University Press Cambridge.
Shier, Rosie (2004). "Statistics: 2.2 The Wilcoxon signed rank sum test". url: https://www.statstutor.ac.uk/resources/uploaded/wilcoxonsignedranktest.pdf.
Skolfilm (2021). Om oss. url: https://skolfilm.se/om-oss.
Statista (2021). Number of Netflix paid subscribers worldwide from 1st quarter 2013 to 1st quarter 2021. url: https://www.statista.com/statistics/250934/quarterly-number-of-netflix-streaming-subscribers-worldwide/.
Stevens, Keith et al. (2012). "Exploring topic coherence over many models and many topics". In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pp. 952–961.
Wang, Shoujin et al. (2019). "A survey on session-based recommender systems". In: arXiv preprint arXiv:1902.04864.
Welbers, Kasper (n.d.). "Dirichlet distributions, the alpha hyperparameter, and LDA". url: http://i.amcat.nl/lda/understanding_alpha.html.
Zanker, Markus and Markus Jessenitschnig (2009). "Collaborative feature-combination recommender exploiting explicit and implicit user feedback". In: 2009 IEEE Conference on Commerce and Enterprise Computing. IEEE, pp. 49–56.
Zhang, Shuai et al. (2019). "Deep learning based recommender system: A survey and new perspectives". In: ACM Computing Surveys (CSUR) 52.1, pp. 1–38.

Appendix

Custom stop words få, nan, film, se, en, finnas, börja, nana, göra, komma, gå, ta, också, nana, vara, både, bara, sig, i, in, genom, vi, sätt, egentligen, två, -, ca, serie, annan, del, medan, ge, skola, ur, inspelat, bra, mycket, många, visa, rätt, tid, dag, år, språk, berätta, inspelat, ny, arrangöra, in, ska, ta, se, -, ur, tar, ser, nan, gör, få, ger, del, mer, andra, sa, ner, kom, vet, ja, sen, vill, tre, väl, kommer, lite, finns, går, igen, lika, måste, fram, gång, kanske, heter, olika, nåt, hela, tack, står, prata, nån, även, tiden, väldigt, sitt, saker, fick, titta, fler, flera, säger, behöver, säga, tror, kunna, hitta, helt, annat, ofta, nog, innan, gjort, såg, nej, alltid, gjorde, exempel, just, ihop, gick, fått, händer, inget, ganska, tillbaka, aldrig, vissa, ibland, alltså, hos, hej, den, de, dem
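
As one possible way to apply this list, the sketch below merges an excerpt of it with a standard Swedish stop-word list before vectorization. The use of NLTK here is an assumption for illustration, not necessarily the tooling used in the thesis.

    # Sketch: combine custom and standard Swedish stop words for TF-IDF.
    from nltk.corpus import stopwords   # assumes nltk.download("stopwords") has been run
    from sklearn.feature_extraction.text import TfidfVectorizer

    custom = {"få", "nan", "film", "se", "finnas", "börja", "nana", "göra"}  # excerpt of the full list above
    swedish = set(stopwords.words("swedish"))
    vectorizer = TfidfVectorizer(stop_words=list(swedish | custom))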

Interview questions

What is the purpose of your organization, specifically regarding streaming? What does the work look like?

In what context is your service used by the users? Consider times, days, places, etc. In what way is the service used today? Is the usage goal-oriented or aimless? For example, does the teacher largely decide what the student should watch, or does the student get to click around on their own?

What is a good film? How does it stand out? (Ask for concrete examples.) Are there particular attributes you would say are important? Does it have good factual content, an appropriate tone of address, etc.? How relevant is the age of films? Are more recently produced films generally preferable to older films?

Are there films that you consider to be of high quality but that you feel do not reach teachers and students? If so, why do you think that is, and is there anything that characterizes these films? (If they answer UR:) What value is there in recommending purchased publisher films over UR programs?

How would you describe your organization's recommendations at present? What do you think works well with them, and where do you see potential for improvement?

What do you wish you were recommended? How could a recommendation make things easier for students/teachers? What characterizes a good recommendation, and what would a bad recommendation be?

What type of format would you like the recommendations to have? Do you believe in broader recommendations based on the entire user profile (getting users to discover new material), or narrower recommendations based on what the user is currently looking for (a way to refine a search already in progress)?

Should a distinction be made depending on the user's role, teacher or student?

How could information about what a user thought of a film be collected? Currently it is possible to save films as favorites, but what could a rating system look like?

How do you view basing recommendations on expert assessments (exemplify with wine recommendations and sommeliers; in this case, e.g., editors) versus user reviews? How should different assessments be weighted in a recommendation?

Could both teachers and students be allowed to give ratings? Does that affect the reliability of the aggregated rating?

What information could help create better recommendations? Consider both data that is already being collected and new data collection. Or is there a risk that this becomes a barrier for users, a disruptive element in the user experience?

Would you want to fill in information and be recommended content based on what you have filled in? E.g., a user profile and ratings?

Would you want to receive implicit recommendations based on what you have already watched?

Is transparency about why a recommendation is given important? Should the system only give the recommendation, or e.g. explain it: "because you liked this, this is recommended"?

Imagine that a user receives 5 recommendations when they click on a film. Should these recommendations then be presented as they are, or organized by certain predefined categories? If so, what could these categories consist of? E.g., in English, harder, easier, longer, shorter, another film type, etc.
