UNIVERSITY OF MANCHESTER School of Computer Science Project Report 2016

Recommendation web service

Author: George Andrei Ceaus Supervisor: Sean Bechhofer


Abstract

Recommendation engines are becoming an essential component of most web and mobile applications. Algorithms such as collaborative filtering allow content providers to recommend useful content to their users and maximize how much content those users consume. However, implementing such a system is complex and requires detailed knowledge of how recommendation engines work. The aim of this project was to abstract the inner workings of a recommender system by implementing it behind a web service and providing an API that developers can integrate into their applications.

This report will start by going through some context about the domain of this project and the technologies used. It will then go into detail about the research I did before the project began, cover what functionality the system currently implements and how it can be used, and explain what design decisions I took, why I took them, and how I implemented the system. Finally, it will show the steps I took to evaluate the recommender system and its most important algorithms, and give a brief summary of what other features will be implemented after the end of the project.


1 TABLE OF CONTENTS

2 Context
  2.1 The topic area to which the work applies
  2.2 What else has been done in the area and by whom
  2.3 Why the work is being done
  2.4 What the work was aiming to achieve
  2.5 Technologies context
3 Research
  3.1 Graph databases
  3.2 Querying graph databases
4 Functionality
  4.1 Gathering information
    4.1.1 Content
    4.1.2 Tags
    4.1.3 Users
    4.1.4 User behaviour
  4.2 Recommendations
    4.2.1 Recommending tags
    4.2.2 Content-based filtering
    4.2.3 Collaborative filtering
    4.2.4 Other types of recommendations
5 Design
  5.1 Recommender systems approaches
  5.2 Graph database
    5.2.1 Interfacing with the graph database
6 Development
  6.1 Approach
  6.2 Setting up
    6.2.1 Graph database
    6.2.2 Web client
    6.2.3 Recommender service backend
  6.3 Recommendation algorithms
7 Evaluation
8 Future work
  8.1 Tags ontologies
  8.2 Trending content
  8.3 Subscriptions
9 Reflection and Conclusion
10 References

2 CONTEXT

2.1 THE TOPIC AREA TO WHICH THE WORK APPLIES

The amount of digital content consumed through the internet is increasing rapidly year on year. Some examples of digital content are video, audio, images, software or news. The most popular mediums through which content is consumed are web applications and, more recently and increasingly, mobile applications. Due to the vast amount of content available, content publishers are developing ways to organize content and make it more discoverable to their consumers. This has led to research in domains such as information retrieval, search engines and recommendation engines.

In fact, the most successful applications are those that have come up with innovative solutions to support this requirement and maximize how much content their users consume.

However, developing such a system or even using an existing one requires expertise, which some developers do not have. The software product that I have developed during this project tries to help developers integrate a recommendation engine into their application, without requiring much knowledge in this field.

2.2 WHAT ELSE HAS BEEN DONE IN THE AREA AND BY WHOM

Recommender systems attempt to present to the user content that he or she might be interested in. They can therefore be considered a companion to search engines, since they can recommend content that the user was not searching for but still finds useful.

The two main approaches to recommendations are collaborative filtering and content-based filtering; a combination of the two is known as a hybrid approach [1]. In collaborative filtering, the user's past behaviour is taken into account to find users with similar preferences. Because of this, the system needs a large amount of historical information on the user and is susceptible to the cold start problem, where the system does not have enough data to make useful recommendations.

The latter, content-based filtering, focuses on content features such as tags, together with the user's preferences, to recommend other similar content. Since this method does not rely on historical information as much as collaborative filtering, it is less susceptible to cold start problems and can recommend content right away, given that content features have been specified and the user has consumed at least one item.


There are numerous research papers and resources on recommender algorithms online [2]. It is a heavily discussed and researched topic, and it has even been the subject of a competition sponsored by Netflix [3] to drive research into finding new algorithms with better accuracy.

The application development community is moving towards a world of microservices [4], modularity, and web and cloud services. Developers focus only on their specific business logic and use other services for extra functionality, usually via a REST API [5], which has become an industry standard for web services.

However, when it comes to full-blown solutions that allow developers to simply plug a recommender system into their existing content publishing applications, there is a scarcity of offerings, including open source packages, web services and even commercial solutions.

On the other hand, there are quite a few more general solutions, including libraries such as Apache Mahout [6] and cloud services such as Amazon Machine Learning [7] or Google Cloud Machine Learning [8]. Although these are very good, they are not tailored towards content publishing applications and recommendations, and have a much broader domain. Moreover, they require knowledge of machine learning and can be quite complex to integrate into existing applications.

2.3 WHY THE WORK IS BEING DONE

My project and web service aim to simplify the process described above and to provide web and mobile developers with a web service that helps them with the entire workflow, from annotating their content and registering user behaviour to getting recommendations for a user. This project was born from my own need for such a service and my inability to find a suitable solution.

2.4 WHAT THE WORK WAS AIMING TO ACHIEVE

The goal of this project is to make it easier for developers who do not have knowledge of machine learning or recommendation systems to integrate such a system into their web or mobile applications.

Another goal is to allow developers to use this service from any platform (web or mobile) and independently of programming language. The only requirement is an internet connection.

Further, being a web service, it will be used by multiple users and therefore needs to grow and scale together with the users' applications. High scalability is a first-class requirement.

2.5 TECHNOLOGIES CONTEXT

Although the following chapters will go into more detail about how I have designed and developed the system, I will now give some context on the technologies that I have used.

There are many ways to implement a recommender service; as part of this project, however, I have built the system around a graph database. A graph database is one that represents data using graph structures such as nodes, edges and properties. Graph databases have become increasingly popular due to the fact that most user data can be naturally structured as a graph.

The other defining feature of my system is that it is a web service. Web services are distributed systems that abstract the functionality of a system behind a web API, usually REST. Representational State Transfer (REST) is an architectural style for distributed communication between two different systems over standard HTTP. It makes use of URLs to identify resources, and uses HTTP methods such as POST, GET, PATCH or DELETE for creating, reading, updating and deleting (CRUD) resources. Its simplicity and stateless design have made it a popular choice for web service APIs.

3 RESEARCH

This chapter will cover the research I did before starting the project in order to select the technologies and tools best suited to it. It will also cover the decisions I took and the reasoning behind them.

3.1 GRAPH DATABASES

Since my service has a graph database at its core, selecting the right database to meet my requirements was the most important decision I had to take.

To begin with, there are two types of graph databases that I’ve considered: triple stores and generic graph databases.

Triple stores are optimized to store RDF (Resource Description Framework) triples which are mainly used for the semantic web. A triple is of the form subject-predicate-object, for example “Alice likes ice-cream”.

Generic graph databases store graph-structured data in the form of nodes, edges and, optionally, properties. There are numerous graph databases available, each optimised for a specific set of requirements in terms of query performance, availability, scalability, consistency or storage model.

3.2 QUERYING GRAPH DATABASES

Advanced querying of the data is another very important requirement of my application. Triple stores use a semantic query language called SPARQL (SPARQL Protocol and RDF Query Language), which is loosely inspired by SQL and is used to retrieve RDF data. SPARQL is a widely used standard in the world of triple stores.

On the other hand, as generic graph databases are a much broader category, there is no single standard for querying them. Most popular databases come with their own proprietary query languages and usually support a few others. However, there is one language that is gaining in popularity and quickly becoming a standard: Gremlin. Gremlin is a graph traversal language and is part of the Apache TinkerPop graph computing framework [9]. It is supported by many graph databases and allows for more complex traversals than SPARQL, with the downside of being more difficult to learn and understand.

Although my data can be modelled as triples ("user consumed content", "content tagged tag", etc.), because of the limited querying capabilities of triple stores I have decided to use a generic graph database. Moreover, because of its growing popularity, and because it can act as an abstraction layer between my application and the database, I am using Gremlin as the query/traversal language.


4 FUNCTIONALITY

In this chapter we will go over the features and functions the implemented software has, why I think these are important, who might benefit from using them and how they can be used.

4.1 GATHERING INFORMATION

The obvious feature of a recommender system is to output a list of content items that the user might be interested in. However, there are some things to be done before a content publisher can show its users recommendations. Firstly, content has to be created and then tagged, after which we can start recording user history. Using this information, we can then offer end-users recommendations.

4.1.1 Content

This web service does not store the publisher's content. Instead, after the publisher has created new content and saved it in their own database, they make a call to my service to record the newly created item as well. Therefore, my service only contains metadata such as the original content ID, title or a short description. Everything else, such as images, videos or articles, is up to the content publisher to save using any kind of persistent storage. The content ID is the key that links the two records that represent the same entity but are stored in two separate places: my web service and the publisher's application. When a recommendation is requested, or any other function that should return content is called, I return a list of IDs, which the content publisher can then use to query their own database to retrieve the original items.
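To make this flow concrete, the short sketch below shows how a content publisher might register an item with the service and keep only the shared content ID locally. It is an illustration only: it reuses the POST /content/ endpoint from Code snippet 1, but the service address and the form field names (contentId, title, description) are assumptions not confirmed by the report.

package main

import (
    "fmt"
    "net/http"
    "net/url"
)

// registerContent records an item's metadata with the recommender service
// after the item has been saved in the publisher's own database.
// The endpoint path comes from Code snippet 1; the field names are assumed.
func registerContent(serviceURL, contentID, title, description string) error {
    form := url.Values{}
    form.Set("contentId", contentID) // the key shared between both systems
    form.Set("title", title)
    form.Set("description", description)

    resp, err := http.PostForm(serviceURL+"/content/", form)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("service returned status %s", resp.Status)
    }
    return nil
}

func main() {
    // Hypothetical service address and item; real values depend on the deployment.
    err := registerContent("http://localhost:8080", "article-42", "Snow leopards", "A short photo essay")
    if err != nil {
        fmt.Println("failed to register content:", err)
    }
}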

Figure 1 Tagging example. Image Source: National Geographic

4.1.2 Tags

Tagging works slightly differently, as the content publisher is not required to keep a copy of the tag in their own database, but can do so if they prefer. The service keeps a list of all the tags content has been tagged with so far, and can return this list so that the publisher can reuse existing tags to annotate new content, thus minimizing tag duplication. In the future, I will also support full-text search for tags, so that the publisher can get suggestions of tags that already exist.

Creating a tag only requires a title, after which the service will return a unique tag ID. Using this ID, the publisher can tag content by making a call to an endpoint with the tag ID and the content ID.
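As an illustration of this two-step flow, the sketch below first creates a tag and then attaches it to an existing content item. It reuses the POST /tags/ and POST /content/:cid/tags/ endpoints from Code snippet 1, but the form field names (title, tagId) and the shape of the JSON response are assumptions made for the example.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
)

// createTag asks the service to create a new tag and returns the generated tag ID.
// The response shape ({"tagId": "..."}) is assumed for illustration.
func createTag(serviceURL, title string) (string, error) {
    resp, err := http.PostForm(serviceURL+"/tags/", url.Values{"title": {title}})
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var body struct {
        TagID string `json:"tagId"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
        return "", err
    }
    return body.TagID, nil
}

// tagContent links an existing tag to an existing content item.
func tagContent(serviceURL, contentID, tagID string) error {
    _, err := http.PostForm(serviceURL+"/content/"+contentID+"/tags/", url.Values{"tagId": {tagID}})
    return err
}

func main() {
    service := "http://localhost:8080" // hypothetical deployment address
    tagID, err := createTag(service, "wildlife")
    if err == nil {
        err = tagContent(service, "article-42", tagID)
    }
    if err != nil {
        fmt.Println("tagging failed:", err)
    }
}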

4.1.3 Users

The user entity is similar to content, in that the original user information is stored by the content publisher however they choose, with the user ID being the key that links the two applications. The user ID and content ID are provided by the content publisher when a new entity is created in my service. The reason behind this is that the publisher is very likely to already have IDs linked to users and content in their own database. The alternative would be for me to generate the IDs and have the publisher save them, making the operation a little more complicated for the publisher.

4.1.4 User behaviour

When a user consumes an item, we need to record that action. The act of consuming an item has different meanings depending on the publisher's domain: for example, the user can download a photo, like an article, watch a movie and so on. The goal of my service is to be generic enough to accommodate as many domains as possible, so I have adopted a broad terminology: users consume content, and content is tagged using tags. The entity that owns a web or mobile application needing a recommender service is called a content publisher, and its users (who consume content) are end-users. Tagging can also have different meanings depending on the content publisher's domain: tags can represent categories, annotations, features and so on, but are known generically as tags as far as my service is concerned.

Going back to consuming content, it is recorded through a simple call to the service, sending through the user ID and content ID.

4.2 RECOMMENDATIONS

So far, I have described the ways to add information into my service. Using this information, the system can now build a model and start making inferences to be used by applications.

4.2.1 Recommending tags

One such useful feature is tag recommendation. Because the service has knowledge of content-tag relationships, it can use them to recommend tags when new content is being added. All the publisher has to do is add at least one tag, after which the service can start recommending new ones. It does this by finding all items that have the existing tags in common, seeing what other tags those items have been tagged with, and returning the most common ones. Because it returns the most common tags, this feature has limited usefulness, as the tags returned will usually be very broad, such as "nature" or "animals". To best describe content, the author should use a mix of broad and specific tags. Specific tags are much harder to recommend without actually looking at what the content represents, using artificial intelligence or another method. Since my service is designed to accommodate all sorts of external domains, designing such a broad artificial intelligence tool is very hard and out of the scope of this project. It is up to the content publisher to choose the best tags to describe the content; I only provide the tools to record the tag selection.
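The report does not show the traversal behind recommended tags, but based on the description above and the patterns used later in Code snippets 3 and 4, it might look roughly like the sketch below: collect the item's existing tags, step to other content sharing them, step to that content's tags, exclude tags already applied, and count the rest. This is an assumed reconstruction, not the actual implementation.

package recommender

// Hypothetical Gremlin Groovy traversal for recommended tags, written in the
// style of Code snippets 3 and 4. contentId is supplied as a parameter.
const recommendedTagsQuery = `
    g.V().hasLabel('content').has('contentId', contentId)
        .out('tagged').aggregate('t')
        .in('tagged')
        .out('tagged')
        .where(without('t'))
        .groupCount()
        .order(local).by(valueDecr)
        .select(keys).unfold()
        .limit(5)
        .valueMap()
`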

4.2.2 Content-based filtering

Once content has been tagged and saved, the service can recommend other items that have tags in common with the original item, known as similar content. These recommendations are per content item, and can be used, for example, on a product page to show similar products. Because this sort of recommendation does not require any user history, it is not affected by insufficient-data problems such as cold start. On the other hand, if the content pool is not big enough, or if there is not a good balance of broad and specific tags associated with content, these recommendations will be poor in terms of perceived similarity or relevance. When making this kind of recommendation, specific or narrow tags are the most important.

4.2.3 Collaborative filtering

The most important type of recommendation is user-based, personalized recommendation, known as collaborative filtering. This takes into account the historical behaviour of users to find other users with a similar record and recommend what they have consumed and the current user has not. The method is based on the observation that people behave predictably when part of large crowds. This type of recommendation is the most accurate among the ones analysed, and the most widely used in industry. The downside of the method is that it needs considerable amounts of historical data for all the end-users of the application, as well as for the current user to whom we are trying to show recommendations. What this means for the content publisher is that it will take some time before their application gathers enough historical data to use this method. Moreover, the current user has to consume some content before the method becomes effective; the more content consumed, the better the accuracy of the results.

4.2.4 Other types of recommendations

Another type of per-item recommendation, not yet implemented, takes into account the historical data explored earlier to give further recommendations for each item. For example, given an item, it checks which items are most often consumed together with the current one. This method is very popular on e-commerce websites and takes the form of "people who have bought this item have also bought…".

There are many types of recommendations similar to the one above that can be implemented, and since the system is already built to handle the fundamental methods (e.g. collaborative filtering, content-based filtering), implementing new methods is fairly trivial and only requires developing new traversal algorithms; a sketch of one such traversal is shown below. Another use-case example would be "frequently bought together".
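As an illustration of how such a method could be expressed, the following is a hypothetical "consumed together" traversal in the same style as the algorithms in Section 6.3: start from an item, step to the users who consumed it, step to everything else they consumed, and rank by frequency. It is a sketch only and not part of the implemented service.

package recommender

// Hypothetical Gremlin Groovy traversal for "people who consumed this item also
// consumed...". contentId is supplied as a parameter; the edge labels match the
// schema described in Section 5.2.
const alsoConsumedQuery = `
    g.V().hasLabel('content').has('contentId', contentId).as('x')
        .in('consumed').out('consumed')
        .where(neq('x'))
        .hasLabel('content')
        .groupCount()
        .order(local).by(valueDecr)
        .select(keys).unfold()
        .limit(5)
        .valueMap()
`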

As seen so far, all methods have advantages and disadvantages in terms of accuracy and cold start. Since content publishers know their application and domain best, they can decide which methods to use, or even combine several methods.

5 DESIGN

This chapter will cover the design decisions I took to implement the required functionality and the reasons behind them. We will also go through how I have set up the graph database, including the schema and the REST API endpoints, and finally take a look at the distributed architecture and how the service will scale.

5.1 RECOMMENDER SYSTEMS APPROACHES

There are many ways to develop a recommendation engine. One method is to use a big data platform such as Hadoop or Spark and, using machine learning algorithms, calculate the probability that a user will also like another item. These are large batch jobs that take into account all the data and calculate recommendations for all users at once. The advantage of this method is that it supports complex algorithms that yield results with the best accuracy. On the other hand, as mentioned earlier, it involves large batch jobs which usually take a long time. For a real-time system such as a web or mobile application, the job would have to be run at regular intervals and the results indexed so that they can be queried faster.

Another method is to model the data as a graph and implement the recommendation algorithms using graph traversal steps. This has the advantage of being less complex than the above method, and it is better suited to the real-time scenario, since recommendations are computed on demand and for individual users or items. Using graph traversals, one can implement much broader use cases than just recommendations, as seen in the chapter above, which are essential for an application. The disadvantage of this method is that query times slow down proportionally with data complexity and the degree of connectedness of the graph. Moreover, it is much more complicated to implement complex recommendation algorithms, and slower when it comes to running those traversal algorithms.

Due to its simplicity, its potential to implement a much wider range of features to help content publishers, and its better suitability for a real-time web service environment, I have decided to use a graph database as the core of my recommendation service.

5.2 GRAPH DATABASE

The graph schema is designed to have three types of entities or nodes: users, content and tags. These nodes have additional properties such as a publisher-provided ID, title or description. A user can consume content, and content can be tagged with a tag. These actions are modelled as edges with an additional property representing the created date. In the future, the "consumed" edge will also have a property representing the strength of the action, for example a rating.

Figure 2 Graph database nodes and edges (nodes: Users, Content, Tags; edges: Consumed, Tagged)
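To make the schema concrete, the sketch below shows how a "user consumed content" action might be written as an edge carrying a created timestamp. The report does not show the actual write path, so the Groovy script, the timestamp representation and the parameter names are illustrative assumptions, kept as a script constant in the style of Code snippet 2.

package recommender

// Hypothetical Gremlin Groovy script that records a "user consumed content"
// action, matching the schema in Figure 2. userId and contentId would be
// supplied as parameters by the service.
const consumeQuery = `
    def u = g.V().hasLabel('user').has('userId', userId).next()
    def c = g.V().hasLabel('content').has('contentId', contentId).next()
    u.addEdge('consumed', c, 'created', System.currentTimeMillis())
`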

The graph database that I am using is Titan. Its main characteristic is its ability to scale horizontally to billions of vertices and edges [10], largely due to how it stores data. It uses a pluggable storage backend architecture and is therefore as scalable as the chosen backend. I am using Apache Cassandra as the storage backend, which is known for its scalability [11].

Since Titan does not store the data itself but requires another database to do so, it can be considered a middleware layer on top of a conventional database, that adds a graph data model.

5.2.1 Interfacing with the graph database

Titan integrates with the Apache TinkerPop framework, which includes the Gremlin graph traversal language. There are two main ways to use Gremlin: use the language directly from Java 8 or any other JVM language (Groovy, Scala etc.), or remotely via the Gremlin Server. The first method requires that the code is co-located with the TinkerPop framework and thus the graph database.

The second method uses the Gremlin Server to execute Gremlin traversals on the database by sending instructions remotely. Because Gremlin Server is just a regular web server, it can be used from virtually any programming language, and even from browsers. It works by exposing a REST API as well as a WebSocket endpoint, and can serialize the response data as JSON or other formats, including binary formats such as Kryo.
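As a minimal illustration of the REST route, the sketch below sends a Gremlin script to a Gremlin Server over HTTP and prints the JSON response. It assumes a server listening on the default localhost:8182 address and the standard {"gremlin": "..."} request body; the actual deployment details are not given in the report.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net/http"
)

func main() {
    // Gremlin Server's HTTP endpoint accepts a JSON body containing the script to run.
    // localhost:8182 is the server's default address (assumed for this deployment).
    body, _ := json.Marshal(map[string]string{
        "gremlin": `g.V().hasLabel("tag").valueMap()`,
    })

    resp, err := http.Post("http://localhost:8182", "application/json", bytes.NewReader(body))
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()

    result, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(string(result)) // the traversal result is returned as JSON
}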

The traversal algorithms are sent to the Gremlin Server as Groovy, which is the most popular and the recommended Gremlin language dialect. The string-encoded algorithms are then compiled by the server and executed. Gremlin Groovy is the standard flavour of the language, and it is used in the documentation and in the Gremlin Console as well. The Gremlin Console is a way to run ad-hoc queries against the database without requiring compilation, and it is the fastest way to try out new algorithms and learn the language.

Figure 3 Apache TinkerPop framework

After my first design iteration, I decided to use the traversal language directly, via Gremlin-Scala. I chose Scala for its functional programming capabilities and its strong support for building web applications and REST APIs. However, due to the language differences between Scala and Groovy, Gremlin-Scala was slightly different from Gremlin-Groovy, and all the examples from the documentation took time to port to Scala, which ended up consuming most of my development time. Moreover, the Gremlin Console only supports Groovy, so I was not able to quickly test my Scala algorithms.

Because of all these downsides, I took the decision to change my strategy: use the Gremlin Server to interact with the database, and rewrite my API in the Go programming language.

The final system architecture can be seen in Figure 4. All the components - the recommender service, Titan and Cassandra - can be scaled horizontally and independently of each other. However, during development I have chosen to run Cassandra embedded within Titan for simplicity. This means that when the Titan database starts, it also automatically starts Cassandra on the same machine, instead of Cassandra having its own cluster that can scale independently.


The recommender service implements a stateless architecture, and can therefore be scaled up or down seamlessly. It is composed of a client-facing REST API which defines all the supported functions of the service, the business logic containing all the Gremlin traversal algorithms, and the Gremlin Server client which interacts with the database.

Figure 4 Recommender service architecture (Content Publisher with REST API client → HTTP/JSON → Recommender Service REST API → Gremlin Server client → WebSockets/JSON → Gremlin Server → Titan graph database with embedded Cassandra)

6 DEVELOPMENT

This chapter will go into detail regarding the development effort undertaken to implement the required functionality and realise the design laid out in the preceding chapter. It will include the initial plan and the changes adopted along the way, the development approach and methodologies used, the technologies I have used and the difficulties I have run across. Moreover, I will talk about the shortcomings of the current design and what I will do to overcome them.

6.1 APPROACH

As mentioned in the previous chapter, the system architecture is composed of three components: the recommender service, the Titan database and the Cassandra backend. However, during development I have approached this architecture slightly differently: I have embedded Cassandra into Titan to cut down the overhead of having to maintain and configure multiple Cassandra instances as well.

Moreover, since the recommender service's client-facing interface is a REST API, which is hard to test in a real-world situation and difficult to demonstrate to people, I have also developed a content publishing web application to demonstrate and test the recommender service. The demo app has also helped me model the REST API according to a real use case and get an idea of how it might be used. As a result, the development effort has been split between developing the web application, developing the recommender system, and setting up and configuring the graph database. All these subprojects have been developed in parallel and have grown together.

Working on three projects at once was very challenging. I adopted the thin end-to-end slices agile methodology, which requires implementing features that add real value to the user in one go, even if that means modifying all the components in parallel. Some examples of end-to-end slices include "recommended tags" or "similar content", which required modifying the database schema, adding a REST endpoint, developing the traversal algorithms and modifying the web client. This theme was very common in implementing most of the functionality of the system.


However, although implementing a feature meant working on all components at once, the work on any single component would usually take a few days, taking into account all the unexpected issues. Switching from project to project after a few days, and the context change associated with it, was very difficult, accounting for a period of lowered productivity whenever I started work on another component.

6.2 SETTING UP

Although the end-to-end slices covered how I implemented the features, at the beginning of the project I spent a few months comparing and experimenting with the different technologies and setting up the software stack and development environment.

6.2.1 Graph database

I started the development stage of the project by setting up Titan and Cassandra. This took much longer than I expected, as Titan 1.0 had just been released (September 2015) and TinkerPop had just joined the Apache Foundation and undergone a major refactoring and restructuring with its 3.0 release. This meant that resources on the newly released packages were scarce, and I had a hard time understanding how they all fit together and how I might use them.

In fact, this was the reason behind my first design choice that proved to be less than ideal, and for which I spent some time in the middle of the project doing a major redesign: moving away from using Gremlin directly from Scala to using the Gremlin Server, and rewriting the recommender service in Go. In reality, this was not as bad as it sounds, thanks to the other design choices I had made, which proved to be much better. By using TinkerPop and Gremlin, I had abstracted the database layer, so switching to another programming language was trivial. The traversal algorithms only needed to be changed from Scala to Groovy, which was much easier than the other way around, and the REST API rewritten in Go, which again was simpler to implement than the Scala solution. This refactoring took me a few days, but saved me many more days, or even weeks, down the line.

In hindsight, this original poor design decision was due to the fact that I did not have a good understanding of how Titan and Gremlin work, and I was unaware of the benefits of the Gremlin Server; this knowledge I gained in time by using the products, but it could have been acquired earlier with better research.

6.2.2 Web client

The final component that I set up was the web application. This was also the source of many delays at the beginning of the project, due to the technologies that I wanted to use. Angular is a framework for building single page applications (SPAs) on the web, which have been gaining popularity in recent years with advancements in browser technologies such as JavaScript and HTML5. At the start of the project, Angular 2 was under development in the alpha stage. Angular 2 is a major redesign compared to the previous version and is not backwards compatible. Among many other features, the major change in Angular 2 was the adoption of TypeScript, which is a superset of JavaScript. TypeScript allows you to use many features from future versions of JavaScript that are not yet supported by browsers, such as classes, modules and arrow functions, as well as features that will not be implemented in JavaScript, such as optional static typing. It works on current and older browsers by transpiling to plain JavaScript, with the ability to select the target JavaScript version, thus making the code future-proof.


The problem with Angular 2 was that documentation was, expectedly, poor, the third-party component ecosystem was very small, and I had concerns about TypeScript as I did not fully understand the future implications of using it. Moreover, bootstrapping Angular 2 and automating the compilation and all the other processes that need to be executed was very challenging.

After a period of trial and error and going back and forth between the two Angular versions, I decided to stick with Angular 2 and TypeScript, which proved to be a good decision in the end.

6.2.3 Recommender service backend

As mentioned earlier, the recommender service is developed in the Go language. Go is a modern, open-source language created at Google and designed with concurrency as a first-class citizen. Moreover, it has a really powerful web server in its standard library, and is becoming a favourite language for building web services and REST APIs.

The REST API endpoints reflect the functionality of the service and are defined in Go as seen in Code snippet 1.

router, err := rest.MakeRouter(
    rest.Get("/info/", wrs.Info),
    rest.Get("/content/", wrs.Content.All),
    rest.Post("/content/", wrs.Content.New),
    rest.Delete("/content/", wrs.Content.Delete),
    rest.Get("/content/:cid/recommendations/similar/", wrs.Content.Recommendations.Similar),
    rest.Get("/content/:cid/tags/", wrs.Content.Tags),
    rest.Post("/content/:cid/tags/", wrs.Content.NewTag),
    rest.Get("/content/:cid/tags/recommended/", wrs.Content.RecommendedTags),
    rest.Get("/tags/", wrs.Tags.All),
    rest.Post("/tags/", wrs.Tags.New),
    rest.Get("/tags/:tid/", toBeImplemented),
    rest.Get("/users/", wrs.Users.All),
    rest.Post("/users/", wrs.Users.New),
    rest.Get("/users/:uid/content/consumed/", wrs.Users.Content.Consumed),
    rest.Post("/users/:uid/content/consumed/", wrs.Users.Content.Consume),
    rest.Get("/users/:uid/content/recommended/", wrs.Users.Content.Recommended),
)

Code snippet 1. REST API definition

rest.Get, rest.Post and rest.Delete represent the HTTP methods used to manipulate the data; the string represents the path of the resource, with cid, tid and uid representing the content ID, tag ID and user ID respectively; and each entry ends with the function to execute for that endpoint, known as a handler function. The handlers are structured hierarchically to mirror the resource paths.

Each handler function is called with a request and a response parameter. From the request parameter it can read the URL path variables or, in the case of a POST operation, the form values. It then sends the Gremlin traversal steps to the Gremlin Server, along with any parameters. Once the algorithm has been executed and the result returned, the handler can either deserialize the data and process it, or just forward it to the client, since the returned result is JSON as well. A simplified example that returns all the tags can be seen in Code snippet 2.

func (t *TagsHandler) All(w rest.ResponseWriter, r *rest.Request) {
    data, err := gremlin.Query(`
        g.V().hasLabel("tag").valueMap()
    `).Exec()
    if err != nil {
        // Report a server error to the client if the traversal fails.
        rest.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    if len(data) == 0 {
        w.WriteJson([]int{})
    } else {
        w.(http.ResponseWriter).Write(data)
    }
}

Code snippet 2. All tags handler function

The Gremlin algorithm is specified as a string parameter to the gremlin.Query function. The starting "g" represents the graph traversal source and is the starting point for all traversals. "V()" returns all vertices; an optional parameter can specify the ID of a single vertex if we want to retrieve just that vertex. "hasLabel" is a filtering step and lets through only those vertices that have the specified label. "valueMap" returns a mapping of all the properties and their values; in the case of tags, title is the only property they contain.
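For endpoints that take a path parameter, the handler reads the variable from the request and passes it along with the traversal. The sketch below shows what the similar-content handler might look like, written as an excerpt in the same style as Code snippet 2. It assumes that the Gremlin client supports parameter bindings through a Bindings call and reuses the traversal from Code snippet 3; the type and method names are illustrative, not the actual implementation.

// Hypothetical handler for GET /content/:cid/recommendations/similar/.
// r.PathParam reads the :cid variable from the URL; the contentId binding is
// substituted into the traversal on the Gremlin Server side.
func (c *ContentHandler) Similar(w rest.ResponseWriter, r *rest.Request) {
    cid := r.PathParam("cid")

    data, err := gremlin.Query(`
        g.V().hasLabel('content').has('contentId', contentId).as('x')
            .out('tagged').in('tagged')
            .where(neq('x'))
            .groupCount()
            .order(local).by(valueDecr)
            .select(keys).unfold()
            .limit(5)
            .hasLabel('content')
            .valueMap()
    `).Bindings(gremlin.Bind{"contentId": cid}).Exec()
    if err != nil {
        rest.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // The Gremlin Server already returns JSON, so it can be forwarded directly.
    w.(http.ResponseWriter).Write(data)
}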

6.3 RECOMMENDATION ALGORITHMS

Next I will go over two traversals that represent fundamental recommendation algorithms: content-based filtering and collaborative filtering. In content-based filtering, the tags of an item are followed to find other content with the same tags, sorted by how many tags they have in common, as seen in Code snippet 3.

g.V().hasLabel('content').has('contentId', contentId).as('x')
    .out("tagged").in("tagged")
    .where(neq('x'))
    .groupCount()
    .order(local).by(valueDecr)
    .select(keys).unfold()
    .limit(5)
    .hasLabel('content')
    .valueMap()

Code snippet 3. Content-based filtering Gremlin Groovy traversal algorithm

This traversal starts at the content item with the specified contentId, which is labelled "x" for later use. The out("tagged") step instructs the traversal to follow the outgoing edges labelled "tagged" from the originating item. This results in the list of tags that the content has been tagged with. To get the other content tagged with the same tags, we follow the incoming "tagged" edges, which gives us all the content nodes that have at least one tag in common. If an item has n tags in common, it will appear n times in the resulting list. This allows us to apply a groupCount step, which is essentially a mapping from each node to its number of occurrences. We can then sort the list and return the top 5 items. Figure 5 illustrates how the traversal executes.


Figure 5. Content-based filtering traversal illustration

Collaborative filtering is slightly more complex. It uses the user's history to find other users with a similar history and see what else they have consumed. In traversal steps, this translates to following the outgoing "consumed" edges to get what the user has consumed, then the incoming "consumed" edges to find who else has consumed the same items, then again the outgoing "consumed" edges to see what else those users have consumed. This algorithm is expressed in Gremlin Groovy in Code snippet 4.

g.V().hasLabel('user').has('userId', userId)
    .out('consumed').aggregate('c')
    .in('consumed').out('consumed')
    .hasLabel('content')
    .where(without('c'))
    .groupCount()
    .order(local).by(valueDecr)
    .select(keys).unfold()
    .limit(5)
    .valueMap()

Code snippet 4. Collaborative filtering Gremlin Groovy traversal algorithm

As we can see, this algorithm is very similar to the previous one, with a few added steps. The aggregate('c') step stores all the nodes visited so far, which in this case is the content the user has consumed; it is used later to filter out of the resulting list the items that the user has already consumed. As we might notice from Figure 6, the number of nodes analysed grows at every step. If users have consumed many items and some content is very popular, this algorithm can easily run into memory issues. This problem can be mitigated by placing a limit at every step, a solution that has not been included in this algorithm; a possible variant is sketched after Figure 6.


Figure 6. Collaborative filtering traversal illustration
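A possible way to add the per-step limits mentioned above is sketched below: each expansion of the frontier is capped before moving on, at the cost of potentially missing some candidates. The specific limit values are arbitrary, and this variant is not part of the implemented service.

package recommender

// Hypothetical variant of Code snippet 4 that caps the number of traversers at
// each expansion step to bound memory usage. The caps (100, 1000) are arbitrary.
const limitedCollaborativeQuery = `
    g.V().hasLabel('user').has('userId', userId)
        .out('consumed').limit(100).aggregate('c')
        .in('consumed').limit(1000)
        .out('consumed').limit(1000)
        .hasLabel('content')
        .where(without('c'))
        .groupCount()
        .order(local).by(valueDecr)
        .select(keys).unfold()
        .limit(5)
        .valueMap()
`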

7 EVALUATION

In order to evaluate my recommender system, I used multiple datasets from Yahoo Webscope [12] and MovieLens [13], which contain user movie ratings from Yahoo Movies and MovieLens, a movie recommendation website.

The datasets ranged in size from 10,000 to 1 million movie ratings given by users. Each line of a dataset contained a user ID, a movie ID and a rating from 1 to 5 representing how much the user had enjoyed the film. The datasets were split into training and test data, with the test data gathered chronologically after the training data.

In order to use these datasets with my recommender system, I developed an evaluation tool that reads the given dataset line by line and inserts the records from the training set into the database. To make this as quick as possible, I developed some optimizations that made importing and looking up data in the graph database as fast as possible. These included defining an explicit schema, instead of using the schema-less model, and adding indexes.
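The report does not reproduce the schema definition, but in Titan 1.0 an explicit schema and a composite index over the lookup keys are typically declared through the management API, roughly as in the Groovy sketch below (kept as a script constant in the style of Code snippet 2). The property, label and index names here are illustrative assumptions.

package recommender

// Hypothetical Titan 1.0 schema and index definition, approximating the
// optimizations described in the evaluation chapter. Names are illustrative.
const schemaScript = `
    mgmt = graph.openManagement()
    userId    = mgmt.makePropertyKey('userId').dataType(String.class).make()
    contentId = mgmt.makePropertyKey('contentId').dataType(String.class).make()
    mgmt.makeVertexLabel('user').make()
    mgmt.makeVertexLabel('content').make()
    mgmt.makeEdgeLabel('consumed').make()
    mgmt.buildIndex('byUserId', Vertex.class).addKey(userId).buildCompositeIndex()
    mgmt.buildIndex('byContentId', Vertex.class).addKey(contentId).buildCompositeIndex()
    mgmt.commit()
`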

Moreover, it quickly became obvious that managing multiple instances of the graph database, to accommodate the multiple datasets as well as the development version, was very difficult. To solve this problem, I used a containerization technology called Docker [14], which allowed me to isolate each instance of the database. Docker works similarly to a virtual machine, except that it is much more lightweight because it shares the host OS kernel.

Each dataset was imported inside a separate Docker image, which was then saved. This gave me a pool of ready-to-use graph databases with the different datasets already imported, which I could start or stop whenever needed, in a few seconds.


Once a Docker container has been started, I can run the evaluation tool, which reads the test dataset line by line. For each line, the tool requests content recommendations from the recommender service for the given user ID. The recommender service uses the graph database inside the running Docker instance, into which the training set has been imported. Using these recommendations, the tool then checks whether any of the returned content IDs match what the user actually consumed, and records the hit or miss. This process outputs a percentage value, which signifies how accurate the current algorithm was at predicting what the user would consume.
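The core of that measurement is a simple hit-rate calculation over the test set, along the lines of the sketch below. The type and function names, and the way recommendations are fetched, are placeholders; only the counting logic is implied by the description above.

package main

import "fmt"

// testRecord mirrors one line of the test set: a user and the item they went on to consume.
type testRecord struct {
    UserID    string
    ContentID string
}

// hitRate returns the percentage of test records for which the consumed item
// appeared in the recommendations produced for that user.
// recommend is a placeholder for a call to the recommender web service.
func hitRate(records []testRecord, recommend func(userID string) []string) float64 {
    if len(records) == 0 {
        return 0
    }
    hits := 0
    for _, rec := range records {
        for _, id := range recommend(rec.UserID) {
            if id == rec.ContentID {
                hits++
                break
            }
        }
    }
    return 100 * float64(hits) / float64(len(records))
}

func main() {
    // Toy data; in the real evaluation the records come from the MovieLens/Yahoo
    // test files and recommend() calls the web service.
    records := []testRecord{{"u1", "m3"}, {"u2", "m9"}}
    fake := func(userID string) []string { return []string{"m3", "m7"} }
    fmt.Printf("accuracy: %.1f%%\n", hitRate(records, fake))
}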

Running the evaluation using the collaborative filtering algorithm defined above resulted in an accuracy of about 10%. Multiple variations were tried, including pre-processing the dataset to only include records with a positive rating (greater than or equal to 3), and modifying the algorithm to take all ratings into account. All of them failed to raise the accuracy measure above 10-11%.

However, as a result of being able to run evaluation tests and modify algorithms quickly, I discovered a number of optimizations and improvements to my existing algorithms. These optimizations resulted in a significant reduction in graph traversal execution time, and the web service can therefore now handle more requests per second than before.

8 FUTURE WORK

As a result of the features already implemented, this project is currently best described as a recommender web service. However, in the future I want to extend its functionality with features that do not necessarily fall into this category, but are meant to support a wider range of requirements that help content publishers with their applications.

8.1 TAGS ONTOLOGIES

Currently, tags are very simple and support only one level of organization. Content is directly tagged using textual labels that have no relationships between them, and little can be inferred from the content-to-tag relationships. In the future, I plan to implement functionality that allows content publishers to create tag ontologies. As with all features of this service, it has to be implemented to support a wide range of content publishing domains and requirements. To achieve this, I will base my design on the Simple Knowledge Organization System (SKOS) [15], which defines standards and specifications to support building thesauri, classification schemes and taxonomies, which is exactly what I am trying to achieve.

Translated to my service and graph database schema, supporting this functionality involves creating relationships between tags, which content publishers will be able to use to create taxonomies or classifications for their content. SKOS defines vocabulary such as "broader", "narrower" and "related", which I can use as labels for the relationships between tags to model an abstracted view of the publisher's organization system.

This is an important feature: not only will it allow content publishers to define their content organization using my service, but with this model stored in my graph database I can design much more complex traversal algorithms that take into account the relationships between tags. For example, I can take related or broader tags into account when finding similar content or recommending tags, as sketched below.
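One possible shape for such a traversal is sketched below: when collecting an item's tags, the tag set is widened with tags reachable over "broader" or "related" edges before stepping back to other content. This is a design sketch for the future feature described above, not existing functionality, and the edge labels simply mirror the SKOS terms.

package recommender

// Hypothetical similar-content traversal that also follows SKOS-style
// 'broader' and 'related' edges between tags. Future-work sketch only.
const skosSimilarQuery = `
    g.V().hasLabel('content').has('contentId', contentId).as('x')
        .out('tagged')
        .union(identity(), out('broader'), out('related'))
        .in('tagged')
        .where(neq('x'))
        .groupCount()
        .order(local).by(valueDecr)
        .select(keys).unfold()
        .limit(5)
        .valueMap()
`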


8.2 TRENDING CONTENT

Another benefit of storing user behaviour is that I can use that data to identify trends in content usage. Whenever a user consumes an item, the service stores the timestamp of that action. Using these timestamps as time series, I can find patterns and identify content that is becoming increasingly popular, known as trending content. There are two approaches to doing this: one is to identify a trend while it is happening, and the other is to predict a trend just before it happens.

The first approach requires statistical analysis of the time series data, such as linear trend estimation or regression analysis. In rough terms, this means fitting a line through the time series data; its slope gives the trend. Implementing this method will require statistical and mathematical libraries such as Apache Commons Math [16].
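To illustrate the idea, the sketch below computes a least-squares slope over hourly consumption counts; a clearly positive slope would suggest rising interest. This is a generic illustration of linear trend estimation, not code from the project, which, as noted above, would more likely lean on a statistics library.

package main

import "fmt"

// slope returns the least-squares slope of a series of equally spaced counts
// (e.g. consumptions per hour). A positive slope indicates a rising trend.
func slope(counts []float64) float64 {
    n := float64(len(counts))
    if n < 2 {
        return 0
    }
    var sumX, sumY, sumXY, sumXX float64
    for i, y := range counts {
        x := float64(i)
        sumX += x
        sumY += y
        sumXY += x * y
        sumXX += x * x
    }
    return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
}

func main() {
    // Toy hourly consumption counts for one content item.
    hourly := []float64{2, 3, 5, 8, 13, 21}
    fmt.Printf("trend slope: %.2f consumptions/hour\n", slope(hourly))
}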

The second method is slightly more complex and requires machine learning or other prediction techniques. Although in this case I am predicting trends in content usage, these methods have much broader use cases, such as predicting stock prices or road traffic levels. As a result, prediction algorithms are implemented in most machine learning libraries and cloud services, such as Amazon Machine Learning [7]. A more focused approach is to use manually identified patterns that precede trends and use those directly to decide whether a time series is going to become a trend. Such a method has been developed by Stanislav Nikolov and is described in detail in the article "Early detection of Twitter trends explained" [17].

This feature breaks the mould in that it is not implemented using graph traversals. Developing it will require pulling the data out of the graph database and processing it separately.

8.3 SUBSCRIPTIONS

Another feature that is very popular among content publishers and social media is subscriptions. A user may want to subscribe to a topic, a category or another user in order to receive the latest content published under these sources. My service can help content publishers add subscriptions to their applications by extending the graph schema to include subscription relationships between users and other users, tags, or even content items. Traversing these edges to find the most recent content is straightforward and will make use of the created-date timestamps to filter the most recent items.

Due to its low implementation complexity and its great added value for content publishers, this feature is likely to be implemented in the near future.

9 REFLECTION AND CONCLUSION

This project set out to make it easier for application developers to use a recommender system. Throughout the project I had difficulties in defining what "making it easier" means and implies, while trying to keep the system as general as possible. Abstracting a recommender system was not easy; however, using a graph model helped define the entities involved and the interactions between them, which together made up the recommender system. Exposing this functionality for developers to use was made easier by using a REST API, which was the final piece in abstracting the recommender system.


Developing a demo application was a good decision that allowed me to model my web service based on a real world use case. It has helped me identify required functionality and get an idea of how it should be exposed for content publishers to use.

Throughout the project, I identified numerous limitations of the design, and specifically of using a graph database as the core of the recommender system. By traversing a graph, only simple recommendation algorithms can be implemented, and the accuracy of the results is rather low. However, I do consider that using a graph database to model and store the data was the right choice, and in the future I can use other methods to get more accurate recommendations, such as exporting the data from the graph and using it as the input to a more conventional machine learning approach.

In conclusion, I believe that this project has been a success in both delivering a recommender web service and also as a source of countless learning experiences for myself.

10 REFERENCES

[1] “Recommender systems,” [Online]. Available: http://recommender-systems.org/.

[2] “recommender systems - Google Scholar,” [Online]. Available: https://scholar.google.co.uk/scholar?hl=en&q=recommender+systems&btnG=&as_sdt=1%2C5&as_sdtp=.

[3] “Netflix Prize,” [Online]. Available: http://www.netflixprize.com.

[4] “Microservices Architecture pattern,” [Online]. Available: http://microservices.io/patterns/microservices.html.

[5] “What Is REST?,” [Online]. Available: http://www.restapitutorial.com/lessons/whatisrest.html.

[6] “Apache Mahout: Scalable machine learning and data mining,” [Online]. Available: http://mahout.apache.org/.

[7] “Amazon Machine Learning - Predictive Analytics with AWS,” [Online]. Available: https://aws.amazon.com/machine-learning/.

[8] “Google Cloud Machine Learning at Scale,” [Online]. Available: https://cloud.google.com/products/machine-learning/.

[9] “Apache TinkerPop,” [Online]. Available: http://tinkerpop.apache.org/.

[10] “Titan: Distributed Graph Database,” [Online]. Available: http://thinkaurelius.github.io/titan/.

[11] “Benchmarking Cassandra Scalability on AWS - Over a million writes per second,” [Online]. Available: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html.


[12] “Yahoo Webscope,” [Online]. Available: https://webscope.sandbox.yahoo.com/.

[13] “MovieLens,” [Online]. Available: http://grouplens.org/datasets/movielens/.

[14] “Docker - Build, Ship, and Run Any App, Anywhere,” [Online]. Available: https://www.docker.com/.

[15] “SKOS Simple Knowledge Organization System,” [Online]. Available: https://www.w3.org/2004/02/skos/.

[16] “Commons Math: The Apache Commons Mathematics Library,” [Online]. Available: https://commons.apache.org/proper/commons-math/.

[17] “Early detection of Twitter trends explained,” [Online]. Available: https://snikolov.wordpress.com/2012/11/14/early-detection-of-twitter-trends/.
