UNIVERSITY of CALIFORNIA, SAN DIEGO Bag-Of-Concepts As a Movie

UNIVERSITY OF CALIFORNIA, SAN DIEGO Bag-of-Concepts as a Movie Genome and Representation A Thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Computer Science by Colin Zhou Committee in charge: Professor Shlomo Dubnov, Chair Professor Sanjoy Dasgupta Professor Lawrence Saul 2016 The Thesis of Colin Zhou is approved, and it is acceptable in quality and form for publication on microfilm and electronically: Chair University of California, San Diego 2016 iii DEDICATION For Mom and Dad iv TABLE OF CONTENTS Signature Page........................................................................................................... iii Dedication.................................................................................................................. iv Table of Contents....................................................................................................... v List of Figures and Tables.......................................................................................... vi Abstract of the Thesis................................................................................................ vii Introduction................................................................................................................ 1 Chapter 1 Background............................................................................................... 3 1.1 Vector Space Models............................................................................... 3 1.1.1. Bag-of-Words......................................................................... 4 1.1.2. TF-IDF................................................................................... 4 1.1.3. Latent Semantic Indexing...................................................... 5 1.2 Neural Net Models................................................................................... 5 1.2.1. Neural Net Architectures....................................................... 7 1.3 Bag-of-Concepts Model........................................................................... 9 1.4 Recommender Systems............................................................................ 10 1.4.1. Collaborative Recommenders................................................ 10 1.4.2. Content-based Recommenders............................................... 11 1.4.3. Recommendation Through “Genes”...................................... 11 Chapter 2 Concept Navigation and Tuning............................................................... 14 2.1 Methodology............................................................................................ 14 2.2 Concepts................................................................................................... 15 2.3 Recommendation Tuning......................................................................... 17 2.4 Discussion................................................................................................ 19 Chapter 3 Genre Classification.................................................................................. 22 3.1 Methodology............................................................................................ 22 3.2 Results...................................................................................................... 23 3.3 Discussion................................................................................................ 24 Chapter 4 Future Work.............................................................................................. 26 Appendix.................................................................................................................... 28 References.................................................................................................................. 30 v LIST OF FIGURES AND TABLES Figure 1: Ten BOC Vectors for the Nearest Neighbors of Cold Thoughts (2013)................................................................................................. 16 Table 1: Some words from selected word clusters and a human “best guess” for the cluster topic................................................................. 15 Table 2: Precision, recall, and F-score averages over all genres...................... 23 Table 3: Precision, Recall, and F-1 Scores broken down by genre.................. 28 vi ABSTRACT OF THE THESIS Bag-of-Concepts as a Movie Genome and Representation by Colin Zhou Master of Science in Computer Science University of California, San Diego, 2016 Professor Shlomo Dubnov, Chair As online retailers, media providers, and others have learned, the ability to provide quality recommendations for users is a strong profit multiplier: it drives sales, maintains retention, and provides value for users. Video streaming, retail, and video databases particularly have had extensive effort and research devoted to finding effective ways to make representations. One of these is MovieLens.org, a video recommender website that uses a representation called the tag genome to understand movies and make tuned recommendations. The tag genome representation is built from user input on the website, where users can apply tags to movies and rate them in order to provide the vii information needed to make recommendations. In this work, we draw from research in information retrieval to implement a bag- of-concepts representation for movies and make tuned recommendations in the same manner as MovieLens. Our implementation is fully unsupervised and does not require the user data needed in the implementation of MovieLens while still having similar properties to the tag genome that enable interesting tuned recommendations. viii Introduction Online movie/video information databases and video streaming services often come with a recommendation feature that finds new videos that users may enjoy. These recommendations can keep users engaged and improve their experience with the site or service, which makes these recommendation systems an important feature for designers to focus their efforts on. The Netflix Prize challenge [20] shows the interest in and importance placed on recommender systems by researchers and companies. The competition's goal was to find a recommender system that could beat the accuracy of Netflix's own system by ten percent. Netflix offered a prize of one million dollars and gathered significant media attention, and more than forty thousand teams submitted their own recommender systems. Many recommender systems for movies that are discussed in literature require both detailed information about the films themselves as well as a body of user data and feedback in order to make their recommendations. This work is an attempt at implementing and evaluating a recommender system that has much lower data requirements: we use only text descriptions of each movie's plot and no user data in order to build representations for our movies and make recommendations. The data set used in this paper comes from IMDB.com, accessed through its public FTP interface [3]. Plot descriptions are contributed by its userbase, and contain a large amount of variation. 1 2 Users differ in their writing style, tone, descriptiveness, quality, and more when writing plot summaries, which makes the dataset a challenging one to work with. In Chapter 1 we review relevant literature from the areas of document representation and recommender systems. Models of representation discussed are the vector space model and neural net derived distributed representations. The two approaches of collaborative and content-based recommenders are explained, and relevant previous work is reviewed. In Chapter 2 we describe the methods of manipulating bag- of-concept vectors with meaningful semantics in their composition as an analogue of tag genome navigation pioneered by the GroupLens research lab [27]. Chapter 3 evaluates the performance of bag-of-concepts in comparison to other representation methods in the task of genre classification. The paper concludes with a discussion of possible future work in chapter 4. Chapter 1 Background Document representation has always been a important area of work for fields such as information retrieval and data mining. Once computer systems were put to use in storing and retrieving large bodies of information, researchers began looking for ways to automatically search for documents in a large collection. One influential idea that emerged in this early era, published by H.P. Luhn in 1957 [4], was that the encoding of a document could be made automatic, and that it could capture the “concepts” in a document without needing to understand it in the same way that a human cataloger would. Luhn proposed that the conceptual makeup of documents could be indicated by the words in the document, and that documents could be retrieved based on the similarity of their word distributions. 1.1 Vector Space Models One key development of the 1960's was the SMART information retrieval system, developed by a team led by Gerard Salton[5,6]. The SMART system introduced the vector space model [7, 8] for representing documents, which has become a very widespread model for representing documents and objects in general. In the vector space model, documents are represented as vectors with each dimension corresponding to a 3 4 different term. In most cases a term would be a single word, but in different applications a term can be a phrase or other unit. If a term appears in the document, then its corresponding dimension in the representation has a non-zero value [41].

UNIVERSITY of CALIFORNIA, SAN DIEGO Bag-Of-Concepts As a Movie

The Netflix Prize Contest

Lessons Learned from the Netflix Contest Arthur Dunbar

Movie Genome: Alleviating New Item Cold Start in Movie Recommendation

Netflix and the Development of the Internet Television Network

Document and Topic Models: Plsa

Introduction

Redalyc.Latent Dirichlet Allocation Complement in the Vector Space Model for Multi-Label Text Classification

Evaluating Vector-Space Models of Word Representation, Or, the Unreasonable Effectiveness of Counting Words Near Other Words

Gensim Is Robust in Nature and Has Been in Use in Various Systems by Various People As Well As Organisations for Over 4 Years

Deconstructing Word Embedding Models

Vector Space Models for Words and Documents Statistical Natural Language Processing ELEC-E5550 Jan 30, 2019 Tiina Lindh-Knuutila D.Sc

ŠKODA AUTO VYSOKÁ ŠKOLA O.P.S. Valuation of Netflix, Inc. Diploma