

Universidad Politécnica de Madrid

Movie recommender based on visual content analysis using deep learning techniques

MÁSTER UNIVERSITARIO EN INGENIERÍA DE TELECOMUNICACIÓN

TRABAJO FIN DE MÁSTER

Lucía Castañeda González

2019

MÁSTER UNIVERSITARIO EN INGENIERÍA DE TELECOMUNICACIÓN

TRABAJO FIN DE MÁSTER

Título: Movie recommender based on visual content analysis using deep learning techniques.

Autor: Lucía Castañeda González

Tutor: Alberto Belmonte Hernández

Ponente: Federico Álvarez García

Departamento: Señales, Sistemas y Radiocomunicaciones (SSR)

MIEMBROS DEL TRIBUNAL

Presidente:

Vocal:

Secretario:

Suplente:

Los miembros del tribunal arriba nombrados acuerdan otorgar la calificación de:

......

Madrid, a de de 2019


Summary

Nowadays there is a growing interest in the artificial intelligence sector and its varied applications, which allow solving problems that are very intuitive and nearly automatic for humans but very complicated for machines. One of these problems is the automatic recommendation of multimedia content.

In this context, the proposed work exploits Computer Vision and Deep Learning techniques for content analysis in video. Based on the extracted intermediate information, a recommendation engine will be developed that incorporates learning algorithms and uses film trailers as its base data.

This project is divided into two main parts. After obtaining the dataset of movie trailers, the first part of the project consists of the extraction of characteristics from the different trailers. For this purpose, computer vision techniques and deep learning architectures will be used. The set of algorithms ranges from computer vision tasks, such as the analysis of colour histograms and optical flow, to complex action analysis and object detectors based on Deep Learning algorithms.

The second part of the project is the recommender engine. For the recommender, different machine learning and Deep Learning methods will be put into practice in order to learn efficiently the correlations between the data. This recommender will be trained using neural networks over the selected dataset.

Three options with three different architectures will be implemented for the recommender engine. The first is a simple sequential neural network, the second an autoencoder and the third a double autoencoder. To compare the results of the three options, objective metrics (MSE, MAE, precision) and subjective metrics (polls) will be used.

The final output of the project is, for one input trailer, the ten best matches based only on the content analysis and the trained recommender.

Resumen

Hoy en día existe un interés creciente en el sector de la inteligencia artificial y sus variadas aplicaciones, que permiten resolver problemas que para los humanos son muy intuitivos y casi automáticos, pero para las máquinas son muy complicados. Uno de estos problemas es la recomendación automática de contenido multimedia.

En este contexto, el trabajo propuesto trata de explotar las técnicas de visión artificial y Deep Learning para el análisis de contenido en vídeo. Basándose en la información extraída, se desarrollará un motor de recomendación que permite la inclusión de algoritmos de aprendizaje que utilizan como base de datos tráileres de películas.

Este proyecto se divide en dos partes principales. Tras obtener el conjunto de datos de tráileres de películas, la primera parte del proyecto consiste en la extracción de características de dichos tráileres. Para este propósito, se utilizarán técnicas de visión artificial y arquitecturas de aprendizaje profundo. El conjunto de algoritmos va desde tareas de procesamiento de imágenes, como el análisis de histogramas de color y flujo óptico, hasta análisis complejos de acciones o detectores de objetos basados en algoritmos de Deep Learning.

La segunda parte del proyecto es el motor de recomendación. Para el recomendador, se pondrán en práctica diferentes métodos de aprendizaje automático y aprendizaje profundo para aprender de manera eficiente las correlaciones entre los datos. Este recomendador se entrenará utilizando redes neuronales sobre el conjunto de datos seleccionado.

Se realizarán tres opciones diferentes con tres arquitecturas distintas para el motor de recomendación. La primera será una red neuronal secuencial simple, la segunda un autoencoder y la tercera un doble autoencoder. Para comparar los resultados de las tres opciones, se utilizarán métricas objetivas (MSE, MAE y precisión) y métricas subjetivas (encuestas).

El resultado final del proyecto proporciona, a partir de un tráiler de entrada, las diez mejores coincidencias basadas únicamente en el análisis de contenido y el recomendador entrenado.

Keywords

Machine learning, deep learning, recommender, neural network, autoencoder, image processing, computer vision, Python, TensorFlow, Keras, PyTorch.

Palabras clave

‘Machine Learning’, aprendizaje profundo, recomendador, red neuronal, autoencoder, procesamiento de imágenes, visión artificial, Python, TensorFlow, Keras, PyTorch.

Gracias a mi familia, por el apoyo incondicional a una hija que, cuando les contaba sobre su TFM, parecía hablar en klingon. Y a mi tutor, por su ayuda inagotable y por contagiarme su entusiasmo.

Index

1 Introduction and objectives
1.1 Introduction
1.2 Objectives
2 State of the art
2.1 Recommendation systems
2.1.1 Deep learning basics
2.1.2 Deep learning and recommendation systems
2.2 Deep learning and visual content based recommendation systems
2.2.1 Computer vision
2.2.2 Action recognition
2.2.3 Object detector
3 Development
3.1 Machine Learning and Deep Learning process chain
3.2 Proposed architecture
3.3 Feature extraction
3.3.1 Dataset
3.3.2 Features
3.4 Embedding
3.5 Distances
3.6 Deep Learning Recommender System Architectures
3.6.1 Deep Neural Network
3.6.2 Autoencoder
3.6.3 Double autoencoder
4 Results
4.1 Feature extraction
4.1.1 Action recognition
4.1.2 RGB Histogram Feature
4.1.3 Object detector
4.1.4 Optical flow
4.1.5 Joined Feature
4.2 Embedding
4.2.1 Embedding training
4.2.2 Embedding prediction
4.2.3 Comparison between using or not embedding
4.3 Distances
4.3.1 Euclidean distances
4.3.2 Cosine distances
4.4 Recommender Objective Evaluation Metrics
4.5 Deep Neural Network Recommender
4.5.1 Neural Network training
4.5.2 Deep Neural Network prediction
4.6 Autoencoder
4.6.1 Autoencoder training
4.6.2 Autoencoder prediction
4.7 Double autoencoder
4.7.1 Double autoencoder training
4.7.2 Double autoencoder prediction
4.8 Subjective comparison between solutions
4.8.1 Surveys
5 Conclusions and future lines
5.1 Conclusions
5.2 Future lines
References
Appendices
A Ethical, social, economic and environmental aspects
A.1 Introduction
A.2 Description of relevant impacts related to the project
A.2.1 Ethic impact
A.2.2 Social impact
A.2.3 Economic impact
A.2.4 Environmental impact
A.3 Conclusions
B Economic budget
C Survey results
C.1 Euclidean distance recommendations
C.2 Artificial Neural Network recommendations
C.3 Autoencoder recommendations
C.4 Double Autoencoder recommendations
D Survey template
E Detectable classes by object detector
F Detectable classes by the action recogniser

Index of figures

2.1 Youtube Machine
2.2 LRCN architecture [1]
2.3 3D CNN example from [2]
2.4 Faster R-CNN architecture
2.5 YOLO working scheme [3]
2.6 SSD working scheme [4]
2.7 RetinaNet working scheme [5]
2.8 Mask R-CNN working scheme [6]
3.1 Proposed architecture
3.2 Gradient descent function
3.3 Classification overfitting
3.4 Classification underfitting
3.5 Classification compromise between underfitting and overfitting
3.6 Proposed architecture
3.7 Multi-genres distribution
3.8 Action recognition training
3.9 ResNet50
3.10 Action recognition prediction
3.11 Histogram process chain
3.12 Action film colour histogram
3.13 Action film colour histogram
3.14 Object detector training
3.15 YOLO architecture [3]
3.16 Object detection architectures comparison
3.17 Object prediction
3.18 Object prediction example
3.19 Optical flow extraction process
3.20 Embedding training
3.21 Embedding prediction
3.22 Artificial neural network
3.23 Autoencoder
3.24 Double autoencoder
4.1 Example outside image
4.2 Example inside image
4.3 Outside example results
4.4 Inside example results
4.5 Example day image
4.6 Example night image
4.7 Day example results
4.8 Night example results
4.9 Example mountain image
4.10 Example sea image
4.11 Mountain example results
4.12 Sea example results
4.13 "Batman & Robin" histogram results
4.14 "Someone Marry Barry" histogram results
4.15 "17 Again" histogram results
4.16 "Night at the Museum" histogram results
4.17 "A Resurrection" histogram results
4.18 "Say It Isn't So" histogram results
4.19 "It's Complicated" histogram results
4.20 Animals detection
4.21 Vehicle detection
4.22 Sport equipment detection
4.23 Weapon detection
4.24 Not all objects detected
4.25 Wrong detection
4.26 Dark place person
4.27 Cartoon person
4.28 Blurry image person
4.29 Semi-transparent person
4.30 Person detection
4.31 Not all objects detected
4.32 Human face detection
4.33 Burning car
4.34 Boat
4.35 Sci-Fi ship
4.36 "Harry Potter" clothes detection
4.37 "Star Wars" clothes detection
4.38 Object detections in cartoons
4.39 Dancing optical flow representation
4.40 Dancing optical flow HSV representation
4.41 Talking optical flow
4.42 Talking optical flow HSV representation
4.43 Fighting optical flow representation
4.44 Fighting optical flow HSV representation
4.45 Joined feature with PCA scatter
4.46 Joined feature with two-dimensional TSNE
4.47 Embedding loss
4.48 Embedding accuracy
4.49 Embedding feature representations
4.50 Without embedding
4.51 Embedded
4.52 PCA action representation with 1 (red), 2 (yellow) and 3 (green) for action genre
4.53 Without embedding
4.54 Embedded
4.55 TSNE action representation with 1 (red), 2 (yellow) and 3 (green)
4.56 Without embedding
4.57 Embedded
4.58 PCA three genres representation: action (red), science-fiction (yellow) and horror (green)
4.59 Without embedding
4.60 Embedded
4.61 TSNE three genres representation: adventure (red), crime (yellow) and thriller (green)
4.62 Evolution along epochs
4.63 Evolution from epoch 100000
4.64 Neural Network RMSE
4.65 Evolution along epochs
4.66 100000-300000 epochs
4.67 Autoencoder RMSE
4.68 Evolution along epochs
4.69 100000-300000 epochs
4.70 Double autoencoder, first autoencoder RMSE
4.71 Evolution along epochs
4.72 100000-300000 epochs
4.73 Double autoencoder, second autoencoder RMSE
B.1 TFM budget
D.1 Survey Template

Glossary

ML – Machine Learning

DL – Deep Learning

CV – Computer Vision

RGB - Red, Green, Blue

HSV - Hue, Saturation, Value

NMS – Non-Maximum Suppression

NN – Neural Network

ANN – Artificial Neural Network

CNN – Convolutional Neural Network

RPN - Region Proposal Network

LR - Learning Rate

SGD - Stochastic Gradient Descent

ReLU - Rectified Linear Unit

ResNet - Residual Neural Network

Faster R-CNN – Faster Region Based Convolutional Neural Network

YOLO – You Only Look Once

KNN - K-Nearest-Neighbours

GMM - Gaussian Mixture Models

PCA - Principal Component Analysis

TSNE - t-Distributed Stochastic Neighbour Embedding


Chapter 1

Introduction and objectives

1.1 Introduction

Nowadays, multimedia content recommenders are in high demand, and there are many advantages in using them in on-demand multimedia services. They directly influence the users' evaluation of the service and, therefore, their permanence in it and their purchases.

The most innovative recommenders are based on deep learning, a technology that has also boomed in recent years and that is key to the future of Artificial Intelligence and Big Data.

This project makes use of these trending technologies to create a film recommender, combining image processing, computer vision and machine/deep learning techniques. This work describes the development and results of a movie recommender based on visual content analysis using deep learning techniques.

The work is divided into three large blocks: the extraction of content, an embedding and the artificial networks. Each block produces a solution that can be applied in different areas.

The extraction of content generates information from a movie. This information can be used for multiple purposes: in this case it is used to recommend, but it could also be used to classify the content or to find peculiarities of the films. What feature extraction does is create a database with visual information about the movies.

The embedding block trains a model that essentially allows two things. The first, which is the one used in this project, is to project the extracted content into another subspace that facilitates its subsequent training for recommendation. The second possible use is as a classifier of film genres.

Finally, there is the block of artificial networks that generate a recommendation. In this block, three different deep learning architectures have been tested to find the best way to recommend a movie from a dataset. Each network can be trained for any movie dataset, after first going through the previous blocks. In addition, it not only generates a recommendation but also indicates how recommendable each film in the dataset is with respect to the film for which the recommendation is sought.

1.2 Objectives

The main objectives of this work are presented in the following list.

• Learn the use of deep learning techniques and programming tools.

• The creation of a method for extracting visual content from a movie set.

• Use the concept of embedding to project the data into a different subspace.

• Generation of a movie recommender using different deep learning architectures with the same purpose.

• Evaluation with objective and subjective metrics.

The work includes different types of techniques, starting from computer vision tools and finishing with state-of-the-art deep learning techniques to extract hidden knowledge from the films. Different deep learning architectures of increasing complexity have been tested.

The results include both general metrics that measure the performance of the final trained algorithms and subjective tests carried out with real people to gauge how the proposed recommendations are perceived.

Chapter 2

State of the art

2.1 Recommendation systems

Nowadays, due to several reasons, among which are the increase in broadband Internet access and the proliferation of smartphones, multimedia content is growing rapidly. As a result, both traffic and multimedia consumption in the network have grown exponentially in recent years. This rise is the key to the success of multimedia platforms such as YouTube, Netflix or Spotify. However, this rapid growth of multimedia information in our daily lives has created an information overload and a greater complexity in decision making. Therefore, due to the large amount of multimedia content that exists, it is very important to filter it, with two main objectives.

The first objective is to provide users with a specialised service that allows them to easily access the contents that interest them and enables a better user experience. The second follows from the first: offering users content according to their interests leads to greater consumption and, therefore, increases the profits of the company.

A recommender system is a technology that filters content in order to improve access and proactively recommends relevant items to users by considering the content information and/or the users' preferences and behaviours.

In order to implement a recommender with machine learning, it is necessary to use technology based on algorithms.

The algorithms used for recommendation are usually divided into two categories, content-based methods and collaborative filtering methods, or a combination of both. Content-based methods do not involve other users; they only need the user's likes to find a recommendation. They are based on analysing the content characteristics using different techniques such as NLP, computer vision or audio processing. Once the content has been analysed, the recommender suggests multimedia material whose content resembles the items that the user has indicated they like.

Collaborative filtering bases its recommendations on users' past behaviour and on the idea that similar users will have similar interests.
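To make the content-based idea concrete, the following minimal sketch (illustrative only, with hypothetical titles and feature vectors, not the pipeline developed later in this work) ranks items by cosine similarity of their content features:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical content features: one row per movie, one column per visual descriptor.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
features = np.array([
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.4],
    [0.1, 0.9, 0.7],
    [0.2, 0.8, 0.6],
])

def recommend(query_index, top_n=2):
    # Cosine similarity between the query movie and every movie in the catalogue.
    sims = cosine_similarity(features[query_index:query_index + 1], features)[0]
    sims[query_index] = -1.0            # exclude the query itself
    ranked = np.argsort(sims)[::-1]     # most similar first
    return [(titles[i], float(sims[i])) for i in ranked[:top_n]]

print(recommend(0))   # movies whose content is closest to "Movie A"
```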

Nowadays there are several companies working exclusively on recommenders with machine learning, such as Think Analytics [7], Gravity R&D [8] and Recombee [9]. There are also many companies dedicated to the broadcasting of multimedia content that are improving their recommendations with machine learning algorithms, such as Netflix [10] or Spotify [11].

This recommendation technique has been in high demand in recent years, since it has proven to be a powerful tool to satisfy user expectations.

The best algorithms and methods have been implemented by private companies, so their code has not been released. But there are some datasets and public code that are also interesting.

Regarding open code, for each of the datasets already mentioned there is a lot of open code already developed. In the GitHub repository of the LMTD dataset [12] there are two notebooks with examples of how to use the dataset. Apart from these dataset examples, a lot of open source code can also be found on GitHub and Kaggle. However, while many projects exist for music recommendation, complete projects for movie recommendation are not so easy to find.

Although open source code is not abundant, there are two examples of the scheme that important video broadcasting companies follow. The two examples described below are the YouTube and Netflix mechanisms:

The YouTube system [13] consists of two neural networks: one for candidate generation and one for ranking.

The first network, candidate generation, takes events from the user's YouTube activity history as input and reduces the whole set of videos that make up the corpus to only a small set of candidate videos. This first neural network provides a personalised recommendation per user through collaborative filtering.

The ranking network is responsible for scoring each video based on an objective function established over a series of parameters; this score allows presenting to the user those videos considered the best recommendations.

Figure 2.1: Youtube Machine

The Netflix recommendation system [10] starts from the first moment: when a user creates a Netflix account, or adds a new profile to the account, the user is asked to choose some titles they like. These titles are used to start the recommendations and connect with the user's preferences. If the user skips this step, the first recommendations provided will be content that is popular and relevant among most Netflix users, and later the content will become more personalised.

In the second step, the Netflix recommendation system observes the user's interactions with the service, other members with similar tastes and preferences (collaborative filtering), and information about the titles, such as genre, actors, etc. In addition to knowing what the user has watched on Netflix, to best personalise the recommendations it also looks at things like the time of day they watch, how long they watch and the devices on which Netflix is watched.

The system chooses which titles to include in the rows of the homepage; in addition, it also ranks each title within a row and then ranks the rows themselves (neural networks).

2.1.1 Deep learning basics

These days, deep learning provides the state-of-the-art solutions in a wide range of fields. Artificial neural networks (ANN) were the start of this field, changing the way machine learning techniques learn by introducing non-linear layers that break the linearity between data.

Several types of deep learning networks have appeared, surpassing the results obtained with traditional techniques in areas such as feature extraction and selection, computer vision, or time series analysis. Artificial neural networks are able to learn complex behaviours from feature vectors in a different way than classical machine learning techniques. Convolutional Neural Networks (CNN) are one of the most important advances in computer vision due to their ability to automatically extract feature vectors learned during training, avoiding manual feature selection. Finally, Recurrent Neural Networks (RNN) can work with temporal data, learning complex patterns to predict future behaviours in time.

2.1.1.1 Deep learning advantages and disadvantages

Below are presented the most significant advantages and disadvantages of the use of deep learning algorithms. Due to the intrinsic complexity of this type of algorithm, some tasks can be performed efficiently but others suffer from heavy computational requirements. Deep learning is only a new tool, and traditional algorithms are able to solve some concrete problems efficiently without introducing this complexity into the system. Some advantages and disadvantages of deep learning techniques in real-world applications, in contrast to traditional techniques (computer vision, well-known algorithms for concrete operations, among others), are enumerated below.

Deep learning advantages:

• Non-linearity: Unlike classical models, which are basically linear, deep learning models are non-linear. Using non-linear activations (ReLU, sigmoid, tanh, etc.), deep neural networks are able to model non-linearity in data. With this property, deep learning algorithms can find complex and intricate interaction patterns in the data.

• Representation learning: This advantage is due to the fact that deep neural networks are effective in learning helpful representations of data. In the case of recommendations, there is a large amount of data available with information regarding the relationship between items and users. Making use of these data can expand the knowledge we have about items and users, which improves the recommender. For this reason, using deep neural networks to learn representations is a good choice to improve recommendations. Using representation learning presents two main advantages:

– The difficulty of hand-crafted feature design decreases. Deep neural networks make it possible to treat feature engineering as an automatic activity, with supervised or unsupervised approaches.

– Representation learning makes it possible to include different content such as text, audio, images or video. It has been shown that deep learning improves when representations are learned from different sources.

• Sequential modelling: Sequential models can deal with the temporal dynamics of users' behaviour with good results. Both RNNs and CNNs are deep learning techniques that can be applied to sequential modelling.

• Flexibility: In general, not only in the recommendation field, deep learning techniques have high flexibility, especially when working with the most popular frameworks such as TensorFlow, Keras, PyTorch, Theano, etc. These frameworks work in a modular way and have good support from a very active community of professionals. Their modularity makes development efficient; one example is the ease of combining different neural networks into hybrid models, or of replacing a module. This ease reduces the complexity of capturing different characteristics and factors simultaneously.

Deep learning disadvantages:

• Interpretability: One of the main problems of deep learning is that it acts as a black box. Not providing explanations of the predictions is a significant disadvantage, since the hidden weights and activations are non-interpretable. Nevertheless, models are nowadays starting to offer some interpretability, which makes explainable recommendations possible.

• Hyperparameter tuning: This disadvantage arises because, in order to obtain good results, the correct choice of hyperparameters is essential. But this choice is very complex, since there is no exact way to calculate them, there are many hyperparameters and a very large tuning range. This problem is not exclusive to deep learning; it already appears in machine learning, although deep learning usually adds more hyperparameters. Much research has been done to find the ideal selection of hyperparameter values, but an optimal solution has not yet been found. Other investigations pursue working with a single hyperparameter instead of several, to facilitate the tuning.

• Data need: This last disadvantage is associated with the fact that deep learning in general, not only for recommendations, is data hungry. That is, datasets large enough are needed for it to work correctly. On the other hand, in the field of recommendations there is plenty of data, so this problem is less of a concern.

2.1.2 Deep learning and recommendation systems

At this moment, deep learning (DL) enjoys great popularity. In the last few decades it has achieved considerable success in many domains, like speech recognition or computer vision, and both academia and industry are in constant search to improve deep learning techniques, investigating how to apply them to a wider range of applications in which this discipline can help thanks to its ability to solve complex tasks.

Recommendation architectures have changed drastically in recent times since deep learning has been applied to them. Deep learning provides more opportunities to enhance recommendation efficiency. The interest in the latest advances in recommendation systems based on deep learning has increased considerably because it has overcome obstacles that conventional models were not able to solve, obtaining recommendations of great quality.

For the industry, a recommendation system is very important to improve the user experience, which promotes sales. Some interesting examples are the recommendations of Netflix and YouTube. In the case of Netflix, 80 percent of the movies that users watch come from recommendations. For YouTube, 60 percent of the videos that are clicked come from recommendations.

In the YouTube case, the paper [13] explains how a recommendation algorithm based on deep neural networks has been used for video recommendation. In [14] the Google Play recommender system, which uses a wide and deep model, can be seen. The last example is the Yahoo News recommender, which uses a recommender system based on RNNs, as explained in [15]. All these examples have shown an important improvement over traditional models. Another sign of the rise of deep learning in recommendation systems is that since 2016 RecSys, the leading international conference on recommender systems, has run a regular workshop on deep learning for recommender systems.

Deep learning is a subfield of machine learning that makes use of artificial neural networks. Deep learning learns deep representations, meaning multiple levels of abstractions and representations from data. The different deep learning algorithms are based on techniques such as Convolutional Neural Networks, Recurrent Neural Networks, Multilayer Perceptrons, Autoencoders, Restricted Boltzmann Machines, Neural Autoregressive Distribution Estimation, Adversarial Networks, Attentional Models and deep reinforcement learning, among others.

Deep learning has many advantages for recommendations. One of the most interesting properties for this field is that these models are end-to-end differentiable and provide adequate inductive biases for the class of data; that is, if some kind of inherent structure can be found in the data, deep neural networks will be adequate for that case. In addition, in the case of content-based recommendation, deep learning has the advantage of being compositional: multiple neural building blocks can be combined into a single differentiable function and trained end-to-end. An example of this is that, to work with textual or image data, CNNs and RNNs are practically indispensable building blocks.

2.2 Deep learning and visual content based recommendation systems

In order to build a content-based recommender in the visual field, we will analyse the classification of visual concepts that must be performed. This classification is complicated due to its complexity and the variability of appearance. In [16] it is proposed, for example, to analyse objects, sites, scenes, personalities, events or activities as visual concepts. In another article [17], a standardisation, which the authors call a lexicon, is proposed when looking for these concepts in order to avoid a semantic gap. It consists in categorising a series of general concepts into five categories: "who", "what", "where", "when" and "how", and a definition is proposed for each category. "Who" corresponds to the number of people or animals that appear in the scene, "what" indicates the actions or events, "where" the location or places, "when" indicates whether it is day or night, and finally "how" gives information about shot sizes, since they strongly correlate with specific actions. The article [16] also indicates the importance of defining a minimum number of positive samples per concept.

Generally, in classic image recognition only a single concept per image has been considered ("single label"). But nowadays there are other proposals, such as [16], in which "multilabels" are used, extending the CNN architecture with a sigmoid layer.

Next, the state of the art of the most interesting features to analyse for a content-based recommendation system will be detailed. For two of them deep learning is used, and for the other two computer vision techniques are applied.

2.2.1 Computer vision

Computer vision is a field that acquires, processes, analyses and tries to understand images or sequences of images. This discipline seeks to quantify and produce information from images that a computer is able to understand and deal with. In order to acquire such information, a huge variety of techniques exist.

Computer vision is closely linked with artificial intelligence, as the computer must interpret what it sees, and then perform appropriate analysis or act accordingly.

But there are important challenges in computer vision. Initially, it was believed to be a trivially simple problem that could be solved by a student connecting a camera to a computer. After decades of research, computer vision remains unsolved, at least in terms of meeting the capabilities of human vision.

One reason is that we don’t have a strong grasp of how human vision works.

Studying biological vision requires an understanding of the perception organs like the eyes, as well as the interpretation of the perception within the brain. Much progress has been made, both in charting the process and in terms of discovering the tricks and shortcuts used by the system, although like any study that involves the brain, there is a long way to go.

Another reason why it is such a challenging problem is because of the complexity inherent in the visual world.

A given object may be seen from any orientation, in any lighting conditions, with any type of occlusion from other objects, and so on. A true vision system must be able to “see” in any of an infinite number of scenes and still extract something meaningful.

Computers work well for tightly constrained problems, not open unbounded problems like visual perception.

Some examples of techniques used in computer vision are colour histograms, background extraction, optical flow, surface and shape estimation, and depth maps. All these techniques have been very useful to extract knowledge from the image in order to perform other tasks such as classification, regression, or detection and recognition. One of the most important parts is feature extraction. This task consists in extracting different vectors that can accurately represent different situations, scenes or parts of the images.

These features or image descriptors can be obtained by applying several different techniques. For example, the Histogram of Oriented Gradients (HOG) is used to extract knowledge about the size and shape of objects in images. Local Binary Patterns (LBP) is a descriptor that is very useful to detect different textures. Several other techniques, such as keypoint extractors, exist in the literature (Harris detector, Sobel mask, FAST, SURF, BRIEF, ORB, among others).
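As an illustrative sketch of classical descriptor extraction (assuming OpenCV is installed and using a placeholder frame path), ORB keypoints and a HOG vector can be computed from a single frame as follows:

```python
import cv2

# Load one video frame in grayscale ("frame.jpg" is a placeholder path).
img = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)

# ORB: a fast binary keypoint detector and descriptor.
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(img, None)
print(len(keypoints), "keypoints,", None if descriptors is None else descriptors.shape)

# HOG: captures the distribution of local gradient orientations (size/shape cues).
hog = cv2.HOGDescriptor()
hog_vector = hog.compute(cv2.resize(img, (64, 128)))  # default HOG window is 64x128
print("HOG feature length:", hog_vector.shape[0])
```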

Combining all these techniques with machine and deep learning algorithms, more complex solutions can be proposed, and nowadays Convolutional Neural Networks are replacing these techniques due to their ability to automatically extract rich feature vectors learned during training. But computer vision techniques remain a good choice in several works due to their availability, easy implementation and low time consumption.

Many tasks take computer vision as a fundamental part of their development. Here are just a handful of them:

• Face recognition: Face-detection algorithms are applied and, in combination with filters, it is possible to recognise people in pictures.

• Image retrieval: Content-based queries are used to search for relevant images. The algorithms analyse the content in the query image and return results based on best-matched content.

• Gaming and controls: Several commercial gaming products exist that use stereo vision or other types of cameras.

• Surveillance: Surveillance cameras are ubiquitous at public locations and are used to detect suspicious behaviours.

• Biometrics: Fingerprint, iris and face matching remain common methods in biometric identification.

• Smart cars: Vision remains the main source of information to detect traffic signs and lights and other visual features.

It may be helpful to zoom in on some of the simpler computer vision tasks that are of interest to solve, given the vast number of digital images and videos publicly available in datasets.

Many popular computer vision applications involve trying to recognise things in images, for example:

• Object Classification: What broad category of object is in this image?

• Object Identification: Which type of a given object is in this image?

• Object Verification: Is the object in the image?

• Object Detection: Where are the objects in the image?

• Object Landmark Detection: What are the key points for the object in the image?

• Object Segmentation: What pixels belong to the object in the image?

• Object Recognition: What objects are in this image and where are they?

Other common examples are related to information retrieval, for example, finding images similar to a given image, or images that contain a certain object.

2.2.2 Action recognition

Action recognition is a complex task. It requires identifying the different actions that happen in a video clip, where such actions may or may not develop throughout the entire video. The video also needs to be analysed entirely in context, not just frame by frame.

The biggest challenges that the action recogniser must overcome are the following:

• Computational cost: Large architectures and probable overfitting

• Long context: In order to recognise actions, it is necessary to capture a certain spatiotemporal context throughout the frames. Another problem also appears: the movement of the camera has to be compensated for.

• High complexity architectures: The architectures needed to capture the spatiotemporal information are highly complex. In them, a series of parameters must be chosen that are complicated to select and evaluate and are computationally expensive.

• Non-standardized datasets: There is a lack of standardization in action datasets.

The current basis of action recognition comes from two studies, [18] and [2]. In [18], multiple ways to join the temporal information of consecutive frames using pre-trained 2D convolutions are attempted. In the case of [2], instead of using a single network, the architecture is separated into two networks: one of them, pre-trained, for the spatial context, and the other for the context of the movement.

Based on these two studies arise those that are currently the most novel. These new studies are LRCN [1], C3D [19], Conv3D & Attention [20], TwoStreamFusion [21], TSN [22], ActionVLAD [23], HiddenTwoStream [24], I3D [25] and T3D [26].

LRCN [1] uses LSTM networks after applying convolutions to the images, with end-to-end training of the entire architecture. The use of LSTM networks is interesting for this type of data since they are recurrent neural networks with feedback connections; in this way the input data are processed separately but also considered as data sequences. The network architecture is presented in Figure 2.2, where the initial convolutional part that extracts features and the recurrent network that learns over time are drawn.

Figure 2.2: LRCN architecture [1]
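A minimal LRCN-style sketch in Keras is shown below; it is an illustrative architecture under assumed frame size, sequence length and number of classes, not the exact model of [1]. A small CNN is applied to every frame through TimeDistributed and an LSTM aggregates the per-frame features over time:

```python
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C, N_ACTIONS = 16, 112, 112, 3, 101  # assumed dimensions

# Per-frame convolutional feature extractor.
frame_cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(H, W, C)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])

# LRCN idea: the same CNN is applied to each frame, then an LSTM models the sequence.
model = models.Sequential([
    layers.TimeDistributed(frame_cnn, input_shape=(SEQ_LEN, H, W, C)),
    layers.LSTM(256),
    layers.Dense(N_ACTIONS, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```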

C3D [19], Conv3D & Attention [20], I3D [25] and T3D [26] all use 3D convolutions. The use of 3D convolution in action recognition is very widespread since this technique allows finding patterns in data with three dimensions. In the case of the action recogniser, these dimensions are time, height and width. The architecture of this type of network is drawn in Figure 2.3.

Figure 2.3: 3D CNN example from [2]

TwoStreamFusion [21], TSN [22] and ActionVLAD [23] are modifications of the two-stream architecture. With this architecture, the frame input is considered by two different streams. The first one analyses only the frame (spatial stream net), and the second one analyses the frame in the context of a sequence of frames (temporal stream net). As in the case of 3D convolution, this technique is very successful at recognising actions, since both the image and the movement along several images are analysed.

HiddenTwoStream [24] analyses the optical flow of the video. The optical flow can be used to measure the quantity of movement. In the case of recognising activities, optical flow is useful to relate the amount of movement to the different activities.
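As a hedged sketch of how the amount of movement can be quantified (placeholder frame paths, Farnebäck dense optical flow from OpenCV), the flow field between two consecutive frames can be summarised by its mean magnitude:

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (placeholder paths).
prev = cv2.imread("frame_t0.jpg", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_t1.jpg", cv2.IMREAD_GRAYSCALE)

# Dense optical flow with the Farneback method: one (dx, dy) vector per pixel.
# Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# Convert the flow vectors to magnitude/angle and summarise the motion.
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("mean motion magnitude:", float(np.mean(magnitude)))
```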

Table 2.1 shows a comparative summary of all the studies described above with different configurations.

Network              Score   Score note
LRCN                 82.92   With flow and RGB inputs
                     71.1    Only with RGB
C3D                  82.3    C3D (1 net) + linear SVM
                     85.2    C3D (3 nets) + linear SVM
                     90.4    C3D (3 nets) + iDT + linear SVM
Conv3D & Attention   -       For video description prediction
Two Stream Fusion    92.5    TwoStreamFusion
                     94.2    TwoStreamFusion + iDT
TSN                  94.0    TSN (input RGB + Flow)
                     94.2    TSN (input RGB + Flow + Warped flow)
ActionVLAD           92.7    ActionVLAD
                     93.6    ActionVLAD + iDT
Hidden Two Stream    89.8    Hidden Two Stream
                     92.5    Hidden Two Stream + TSN
I3D                  93.4    Two Stream I3D
                     98.0    ImageNet + Kinetics pre-training
T3D                  90.3    T3D
                     91.7    T3D + Transfer
                     93.2    T3D + TSN

Table 2.1: Action recognition state-of-the-art comparison

2.2.3 Object detector

Object detection is an area that is improving very quickly. The most important reason for this improvement is the application of deep learning to object detection. Each year new algorithms appear that considerably improve on the previous ones. There are many object detection algorithms with high efficiency, and many of them are already pre-trained on well-known datasets, so it is not necessary to train them to start detecting objects.

Among the most famous models, the most interesting ones are detailed below. These algorithms are listed in order of effectiveness, from the least to the most effective (usually the newest ones).

• R-CNN (Region-based Convolutional Neural Networks) [27]. The methodology of this network begins with a given image, from which a series of regions of interest are generated. For each region, a neural network extracts characteristics, and each region is classified according to a series of classes. Among the disadvantages of R-CNN, the computational cost of training is worth mentioning.

• Fast R-CNN emerged as a direct improvement of R-CNN. In [28] Ross Girshick describes the disadvantages of R-CNN and proposes a new methodology to reduce them. Fast R-CNN performs training in a single stage, improving detection rates. But its biggest disadvantage is the cost of generating regions of interest, which is very high.

• Faster R-CNN was created to mitigate the cost of generating the regions of interest in Fast R-CNN [29] [30]. It simultaneously provides regions of interest and classification results. The Faster R-CNN architecture uses the Region Proposal Network (RPN). The RPN is a fully convolutional network that simultaneously predicts object bounding boxes and objectness scores at each position. The detection network shares full-image convolutional features with the RPN, which yields nearly cost-free region proposals. RPN networks are trained end-to-end to achieve high quality region proposals, which are then used by Fast R-CNN for detection. RPN and Fast R-CNN can also be trained to share convolutional features. The architecture of Faster R-CNN can be found in Figure 2.4.

Figure 2.4: Faster R-CNN architecture

• OHEM (Online Hard Example Mining) is an algorithm for training region-based ConvNet detectors [31]. It emerged as a proposal to solve the problem of the great imbalance between the number of annotated objects and the background examples. Shrivastava proposes an online mining algorithm for automatic selection of the hard examples. This new method increases the effectiveness and efficiency of training.

• YOLO v1 (You Only Look Once) is Redmon's proposal for an object detector [3]. Previous proposals repurpose classifiers to perform the detection. YOLO proposes to address object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. To carry this out with one neural network, bounding boxes and class probabilities are predicted directly from the complete image to evaluate. The detection is optimised end-to-end thanks to the fact that only one neural network is used. Image 2.5 presents the working methodology of YOLO. The architecture of YOLO consists of 24 convolutional layers and 2 fully connected layers. The performance is an improvement over previous proposals: images can be processed at 45 frames per second, which means they can be processed in real time using the proposed image sizes. Another version with a smaller network exists, called Fast YOLO, which can process up to 155 frames per second while losing accuracy in the predicted bounding boxes. It also has another improvement: it produces fewer false positives in the background.

Figure 2.5: YOLO working scheme [3]

• SSD (Single Shot MultiBox Detector) emerged as an improvement over YOLO, since YOLO had certain problems when detecting small objects in a group, due to the strong spatial constraints imposed on bounding box predictions. To solve this problem, SSD is proposed in [4]. From a given feature map, SSD takes advantage of a set of default anchor boxes with different aspect ratios and scales that allows discretising the output space of bounding boxes. In order to detect objects with different sizes, the network fuses the predictions of several feature maps that have different sizes. The architecture of an SSD network can be found in Figure 2.6.

Figure 2.6: SSD working scheme [4]

• R-FCN (Region-based Fully Convolutional Networks) [32] is a fully convolutional network that attempts to improve the accuracy and efficiency of prior region-based object detectors, such as Fast R-CNN and Faster R-CNN. While those other detectors ran a costly per-region sub-network hundreds of times, this new approach is fully convolutional, with practically all computation shared on the entire image. To carry out the detection, it uses position-sensitive score maps to address the tension between translation invariance in image classification and translation variance in object detection. It resembles ResNet in that it can naturally adopt fully convolutional image classifier backbones. The results show that it takes 170 ms to process an image, which is 2.5-20x faster than Faster R-CNN.

• YOLO v2 [33] is an improvement of YOLO v1. It develops new strategies such as batch normalization (now used on all convolutional layers), convolution with anchor boxes (removing all fully connected layers and using anchor boxes to predict bounding boxes), dimension clusters, direct location prediction and multi-scale training. In [34] a more exhaustive comparison of the YOLOv2 improvements can be found.

• FPN (Feature Pyramid Network) [35] exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to create feature pyramids with marginal extra cost. A top-down architecture with lateral connections is built to construct high-level semantic feature maps at all scales. FPN can work together with other object detection architectures, improving their results. It allows processing images at a speed of 5 FPS on a GPU.

• RetinaNet [5] is composed of a backbone network and two task-specific subnets, forming a single unified network. The backbone network, an independent convolutional network, is used to compute a convolutional feature map over the complete input image. The first subnet performs classification on the output of the backbone network, and the second subnet carries out the bounding box regression. The proposed loss function shows an improvement in training and in the final estimated bounding boxes. Figure 2.7 shows the RetinaNet architecture.

• YOLO v3 [36] is the latest version of YOLO, which presents many small improvements. The network is somewhat larger and therefore slower, but more effective. For example, a 320x320 image on YOLOv3 runs in 22 ms at 28.2 mAP, with an accuracy similar to SSD under the same conditions, but SSD would be much slower.

Figure 2.7: RetinaNet working scheme [5]

• Mask R-CNN [6] allows detecting objects efficiently in an image while generating a high-quality segmentation mask for each instance. Mask R-CNN is an extension of Faster R-CNN: a branch is added that predicts an object mask in parallel to the branch that recognises the bounding boxes. Mask R-CNN only adds a small overhead to Faster R-CNN, with an image processing speed of 5 fps. The new branch added to Faster R-CNN to create the Mask R-CNN network is shown in Figure 2.8.

Figure 2.8: Mask R-CNN working scheme [6]

• RefineDet [37] is made up of two interconnected modules: the anchor refinement module and the object detection module. The first module filters out negative anchors, so that the search space for the classifier is reduced, coarsely adjusting the locations and sizes of anchors; this improves the initialisation for the subsequent regressor. The second module uses the anchors found in the first module as input to improve the regression and predict multi-class labels. The whole network is trained in an end-to-end way thanks to a multi-task loss function.

A comparison of the technologies explained above can be found in reference [38].
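As a hedged illustration of how a pre-trained detector can be used without any training, the sketch below loads YOLOv3 weights through OpenCV's DNN module (file paths are placeholders and the post-processing is deliberately simplified, without non-maximum suppression):

```python
import cv2
import numpy as np

# Pre-trained YOLOv3 files (placeholder paths; obtainable from the official YOLO site).
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
classes = open("coco.names").read().splitlines()

img = cv2.imread("frame.jpg")                     # placeholder input frame
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), (0, 0, 0),
                             swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Keep the most confident class per predicted box (simplified: no NMS here).
detected = []
for out in outputs:
    for det in out:                               # det = [cx, cy, bw, bh, obj, class scores...]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:
            detected.append(classes[class_id])

print("objects seen in the frame:", detected)
```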

Chapter 3

Development

This chapter describes all the steps taken to obtain the recommendation engine, starting with the description of the architecture of this project and continuing with a detailed description of each of the blocks involved in the system. Each block consists of a series of algorithms: at the beginning the purpose is to extract visual features, and in the middle and last parts to adapt to and learn from the data. The Deep Learning paradigm is used in order to learn complex representations of the data and provide a more accurate output. The aim of the project is to provide, from a set of input features (extracted from a movie trailer), the ten best recommendations that exist in the database.

3.1 Machine Learning and Deep Learning process chain

When implementing a 'Machine Learning' or 'Deep Learning' project, the chain shown in Figure 3.1 is usually followed.

In this project, Supervised Learning has been applied because the categories/labels are available for the data used. In the last part of the project another type of Supervised Learning is applied, using regression optimisation to fit label values during training. First, it is necessary to acquire data for the creation of three datasets: one for training ('training set'), another for model validation ('validation set') and another to test the model ('test set').

Figure 3.1: Proposed architecture

Secondly, it is sometimes advisable to pre-process the data before it is fed to the algorithm for training. In the case of this project, pre-processing is very important since features are extracted from the trailers, and these features will be the input to the proposed neural networks. These features also need a standardisation process.
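A common choice for this standardisation (assumed here as an illustration: zero mean and unit variance, fitted on the training split only) is scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 32)        # placeholder feature vectors (100 trailers, 32 features)
X_val = np.random.rand(20, 32)

scaler = StandardScaler().fit(X_train)   # statistics learned from training data only
X_train_std = scaler.transform(X_train)
X_val_std = scaler.transform(X_val)      # validation/test reuse the same statistics
```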

The third step is to train the model. For this, samples of the training set are introduced in batches in order to adjust the parameters that define the 'Machine Learning' and 'Deep Learning' models/architectures. The parameters are adjusted trying to minimise a cost/loss function that measures how well the model predicts the category to which the entered data belong, compared to the true label.

To adjust the parameters, two steps are carried out [39]: 'forward propagation' and 'backward propagation'. The first consists in feeding the training samples forward to compute the output and compare it with the true value of the label; the difference between both values is the error. The second step propagates in the opposite direction, applying the backpropagation algorithm to calculate the parameter values that bring us towards the minimum of the loss function. To do this, an optimisation algorithm calculates the slope at each point and steps are taken proportional to the negative gradient, as shown in Figure 3.2. The gradient update can be seen in equation 3.1.

$$f(x+1) = f(x) - \alpha \cdot \frac{\partial f(x)}{\partial x} \tag{3.1}$$

Figure 3.2: Gradient descent function
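Equation 3.1 can be illustrated with a tiny toy example of gradient descent on a one-dimensional quadratic function (not the project's actual optimiser):

```python
# Toy gradient descent: minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
alpha = 0.1          # learning rate
x = 10.0             # initial value

for step in range(50):
    gradient = 2.0 * (x - 3.0)
    x = x - alpha * gradient     # step proportional to the negative gradient (eq. 3.1)

print(x)             # converges towards the minimum at x = 3
```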

Through the validation set, some interesting metrics [40, 41] of the behaviour of the models can be obtained: confusion matrix, precision, recall, F1 score, accuracy, loss, etc.

The precision metric measures the success of the algorithm, that is, the number of samples of a class that have been correctly identified out of the total number of samples classified as belonging to that class. The equation to calculate the precision is presented in equation 3.2.

$$\text{precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}} \tag{3.2}$$

The recall metric measures completeness, that is, the number of samples of a class that the algorithm has been able to identify out of the total number of samples of that class. The equation to calculate the recall can be found in equation 3.3.

$$\text{recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}} \tag{3.3}$$

A well-functioning algorithm is one that finds a balance between recall and precision, that is, it detects all the samples of a class without confusing them with other classes.

The F1 score metric is the harmonic mean of precision and recall, which aims to provide with a single value an intuition of the behaviour of the algorithm, showing the balance between recall and precision that must exist. As its formula in equation 3.4 shows, a high value of precision is not desirable if it is linked to a low value of recall (and vice versa); ideally both metrics are as high as possible.

$$F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{3.4}$$

Finally, other metrics used to evaluate the model are the accuracy and the loss. The accuracy is the fraction of predictions that the model made correctly with respect to the total, and the loss is the sum of the errors committed on the training and validation sets. The accuracy can be formulated as shown in equation 3.5, where tp is true positive, tn is true negative, fp is false positive and fn is false negative.

$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{3.5}$$
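These metrics can be computed directly from true and predicted labels, for example with scikit-learn on toy data:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy model predictions

print("precision:", precision_score(y_true, y_pred))  # eq. 3.2
print("recall:   ", recall_score(y_true, y_pred))     # eq. 3.3
print("F1 score: ", f1_score(y_true, y_pred))         # eq. 3.4
print("accuracy: ", accuracy_score(y_true, y_pred))   # eq. 3.5
```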

These metrics are very important in order to detect a very common phenomenon in Machine Learning and Deep Learning, overfitting, which occurs when the built model is excessively complex and captures the noise of the data instead of its trend. As a consequence, the model does not generalise sufficiently and, with new information, will behave inappropriately and misclassify samples with high probability. An example of this behaviour can be seen in Figure 3.3.

Figure 3.3: Classification overfitting

A very simple way to identify this phenomenon, besides using the previous metrics, is to compare the behaviour of the model in terms of accuracy (in Machine/Deep Learning) and loss (Deep Learning) on the training set and on the validation set, hence the importance of differentiating between both datasets. If the loss on the training set is very low (the model commits very few errors on known information) while the error on the validation set is very high (it commits many errors on new information), the model is overfitting. A compromise must be reached between both errors. The loss curves should decrease in a similar way in both sets along the epochs, while the accuracy curves should grow in a similar way in both sets too.

To prevent this overfitting problem, it is convenient to reduce the complexity of the model or modify the value or number of parameters. For example, in a Deep Learning neural network this would mean reducing the number of hidden layers or hidden neurons. Another aspect to take into account when avoiding overfitting is that the size of the dataset necessary to train an algorithm grows exponentially with the size of the model: more complex models require more samples for their correct operation. As it is sometimes expensive to obtain that much training information, it is necessary to simplify the models.
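Besides shrinking the model, two common practical remedies are dropout and early stopping on the validation loss; the following Keras sketch (illustrative layer sizes, not necessarily the exact configuration used in this work) shows both:

```python
from tensorflow.keras import layers, models, callbacks

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(64,)),
    layers.Dropout(0.5),                 # randomly disables neurons to reduce overfitting
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Stop training when the validation loss stops improving and keep the best weights.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=500, callbacks=[early_stop])   # placeholder data, shown for context
```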

The opposite phenomenon, underfitting, can also happen, as shown in Figure 3.4. It occurs when the model is too simple and is not able to capture the trend that the data follow; therefore, it will behave badly both on the training set and on the validation set.

Figure 3.4: Classification underfitting

The ideal is to find a compromise such that the loss curves on the training set and validation set decrease in a similar way on both sets, or that the accuracy behaves similarly on both sets. An example of this good behaviour can be found in Figure 3.5.

Figure 3.5: Classification compromise between underfitting and overfitting

The selected model will depend on the algorithm used to train. In Machine Learning there are numerous classification algorithms, growing in complexity or adapting better to certain conditions depending on the data used. Some examples are Logistic Regression, K-Nearest Neighbours, Decision Trees and Random Forests, among others. In this work, K-Means Clustering, Hierarchical Clustering and Gaussian Mixture Models have been used in order to get some intermediate results in the feature selection block.

Regarding the Deep Learning paradigm, the model depends on the creation of a network architecture using a series of available layers that perform different functions on the input data. Deep learning algorithms are now improving on the results presented by machine learning solutions and will be used in this work to create the final recommendation engine.

3.2 Proposed architecture

Figure 3.6: Proposed architecture

The architecture that has been proposed to perform the recommendation system is represented in Figure 3.6. The architecture consists of four differentiated blocks in which different image processing and machine/deep learning techniques are used.

The first block is the feature extraction (Section 4.1). In it, an analysis of the trailers from the selected dataset (Section 3.3.1) is carried out using both Computer Vision and Deep Learning techniques. Four different features have been extracted from each trailer. These features have been selected with the criterion of capturing the most relevant characteristics to describe each movie trailer successfully. The feature extractors considered in this work have been a deep learning action recogniser, colour histograms, a deep learning object detector and optical flow. Each characteristic is processed to obtain a vector of values per feature. These four vectors are joined together forming the final vector of characteristics of each film, which will be the input to the second block of the architecture.
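A minimal sketch of this concatenation step, assuming the four per-trailer vectors are already available as NumPy arrays (the variable names are illustrative):

```python
import numpy as np

def build_feature_vector(actions, histogram, objects, optical_flow):
    """Join the four per-trailer feature vectors into the final characteristics vector."""
    return np.concatenate([actions, histogram, objects, optical_flow])
```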

The next block is an embedding of the feature vectors. The embedding allows finding another sub-dimension to represent the feature vectors in a different sub-space that can separate vectors in the space and fit the values better to the final purpose. To perform the embedding, a neural network was used; this network has as input the feature vectors and as labels the labelled genres of the film trailers. The output of this block is the prediction of all the films once the model is trained, giving a vector per film with a dimension equal to the layer before the classification layer of the network. The network was trained as a multi-label classification problem since each film is categorised by more than one genre.

The next block solves an optimisation problem where a distance function is optimised to output the final distances between the embeddings. This block calculates a distance value of each film trailer with the rest of the films in the dataset. The distances have been calculated with different algorithms to check which one best fits the problem. The output of this block is a vector per film with a length equal to the number of film trailers.

In the last block a final training over all the trailers is carried out. Different network architectures were used in order to compare the performance between them. Firstly, an Artificial Neural Network (ANN) was used, taking as input the embeddings and as labels the distances. This is solved as a regression problem in order to learn the distances between film trailers. The second approach takes advantage of deep learning autoencoder architectures. A first autoencoder was used to learn the distances between films in order to reproduce the input in the output. A second autoencoder takes the decoder part from the previous autoencoder and includes a new encoder part that takes as input the embedding vectors. After the training, a model is generated that allows making recommendations. When a prediction is performed, the output is a vector with a length equal to the number of movies, where values lie in the 0-1 range, 1 representing the most similar film trailer and 0 the least similar one. So, the positions with the 10 highest values are the final recommendations from the proposed recommender system engine.
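As a sketch of how the final recommendation could be read from a predicted vector, assuming it is a NumPy array with one similarity score per film in the dataset (the helper below is illustrative, not the exact code of this work):

```python
import numpy as np

def top_k_recommendations(predicted_scores, film_titles, k=10):
    """Return the k films with the highest predicted similarity (1 = most similar)."""
    best = np.argsort(predicted_scores)[::-1][:k]  # indices of the k highest values
    return [film_titles[i] for i in best]
```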

Through the complete architecture we obtain a final model that allows us to carry out a content-based recommendation of a movie trailer.

3.3 Feature extraction

As the previous section explains, the first step carried out in the proposed recommendation engine system is the feature extraction process. To achieve this, firstly it is necessary to choose the dataset of movie trailers used (Section 3.3.1). Next, an analysis of the features to be extracted is carried out. Different feature extractors were used (Section 3.3.2.1) to get the input values, which were normalised in order to get a common representation for each movie trailer.

Dataset | Domain | Content Feature | Number of items | Number of users | Number of ratings
LDOS-CoMoDa dataset [42] | movie | M+context | 1K | 1K | 2K
Million Song Dataset [43] | music | A,M | 1M (track) | 1M | 48M
Million Musical Tweets [44] | music | A,M | 134K (track), 25K (artist) | 214K | 1M
LFM-1b [45] | music | M | 32M (track), 3M (artist) | 120K | 1.1B
MovieLens 20M (ML-20M) [46] | movie | M,A,V | 26.7K | 138.5K | 20M
MMTF-14K [47] | movie | M,A,V | 13.6K | 138.5K | 12.4M
Labeled Movie Trailer Dataset [12] | movie | M,A,V | 4K | IMDB | 4K

Table 3.1: Comparison between datasets, based on [47] research

3.3.1 Dataset

In order to begin the process of the recommendation system, the trailers and their information are needed. The database or dataset is an essential part of any machine learning and deep learning project. It is necessary to have a good set of information that can represent the major part of the possible cases that can appear in the problem. It is of vital importance that the data are appropriate for each problem and also that the information they provide is reliable (the labels they offer are correct, the information has good quality...).

To select the dataset that has been used, a deep search of the datasets available as open source has been made. This comparison can be checked in Table 3.1. The type of content feature is denoted as M (metadata), V (video) and A (audio).

The most outstanding datasets are the last three (MovieLens, Multifaceted Movie Trailer Feature Dataset and Labeled Movie Trailer Dataset), since they include video information. A description of what they offer, the quality of their information and the dimensions of each one is given below.

The first dataset is MovieLens [46]. This is a set of different dataset parts which differ in the number of movies. For each set they offer a list of movies and their ids on YouTube, which facilitates the download of the dataset. The recommended dataset for research is called MovieLens 20M. The 20M indicates the number of ratings it has in metadata. The database contains 27,000 movies. The dataset is not updated, so most of the links to YouTube are out of date.

Another dataset of movie trailers of great interest is the Multifaceted Movie Trailer Feature Dataset [47]. This dataset provides 14,000 movie trailers, in addition to a series of audio and video descriptors, metadata and ratings. The visual descriptors include aesthetic features and AlexNet features, and the audio descriptors include block-level features and i-vector features.

Finally, another common dataset is the Labeled Movie Trailer Dataset [12]. It is oriented to multi-label movie genre classification providing 9 different classes. This dataset offers 4021 trailers of tagged movies. In addition to the multi-genres of the films, one of its biggest advantages is its metadata, which offers all the data stored in IMDB for those films. That includes the genres indicated by IMDB, name, director, film awards, main actors, plot summary, image URL of the film cover, among much other information.

After this exploration, the dataset used in this work is the Labeled Movie Trailer Dataset. This dataset has been selected taking into account the importance of the genre labels and the quality and quantity of the trailers available to download from YouTube, since in our work only the genres and the video trailers will be used from all the provided data.

3.3.1.1 Genres

In a deep learning project, an exploratory data analysis of the dataset used is usually carried out. But in this case, the only metadata that must be checked are the genres. For all the films it is verified that genre data exist. Other relevant information about the genres of the dataset is shown next.

The genres per film offered by this database are of two different types. On one hand, IMDB provides 24 different genres (Table 3.2). On the other hand, the genres of the LMTD dataset are divided into 9 classes (Table 3.3).

Genre | Number of movies
Action | 856
Drama | 2032
Thriller | 693
SciFi | 313
Comedy | 1562
Romance | 651
Crime | 659
Adventure | 593
Horror | 436
Mystery | 296
Biography | 226
History | 97
Fantasy | 300
Musical | 41
Documentary | 62
Music | 121
War | 62
Sport | 110
Family | 209
 | 144
Western | 26
Short | 60
GameShow | 1
News | 2

Table 3.2: Genres with IMDB

The number of genres in the metadata information from IMDB is much higher than the number of genres offered by LMTD. As the genres are going to be used as labels for the embedding, we will use the LMTD ones, since they summarise the IMDB genres in a simpler way, increasing the number of data per label.

From the number of films per genre it can be observed that half of the films have the drama genre. It is the genre with the largest number of films. This must be taken into account, since the model will tend to train drama films better. Comedy films are behind the drama films, also with a large number of films. From there, the rest of the genres have a balanced number of films, except for the Sci-Fi films.

Genre | Number of movies
Action | 856
Adventure | 593
Comedy | 1562
Crime | 659
Drama | 2032
Horror | 436
Romance | 651
Sci-Fi | 313
Thriller | 693
Total | 4021

Table 3.3: Genres of LMTD

As mentioned before, the films have been considered multi-genre. Figure 3.7 shows the number of movies with different numbers of labelled genres. The figure shows that the maximum number of genres assigned to a film is three.
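Since each film carries between one and three genres, the labels can be encoded as binary vectors over the 9 LMTD classes. A possible sketch using scikit-learn's MultiLabelBinarizer (the tool choice is an assumption, the thesis does not state which one was used):

```python
from sklearn.preprocessing import MultiLabelBinarizer

lmtd_genres = ['Action', 'Adventure', 'Comedy', 'Crime', 'Drama',
               'Horror', 'Romance', 'Sci-Fi', 'Thriller']
encoder = MultiLabelBinarizer(classes=lmtd_genres)

# Each film keeps between one and three genres.
labels = encoder.fit_transform([('Drama', 'Romance'), ('Action', 'Sci-Fi', 'Thriller')])
print(labels)  # one binary row of length 9 per film
```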

3.3.2 Features

In this section, the selection of the features is carried out, along with their description, their extraction and finally the normalisation of the feature vectors. Feature selection is one of the most important parts (and one of the most complex) of a learning problem, since the selected data will be the input of the algorithms that will learn hidden patterns between these values. A careful analysis should be done in order to select features that are a good representation of the input data and can be separable in the multidimensional space during the training process.

3.3.2.1 Features selection

The first step is to perform an analysis of the possible features to be extracted. In Table 3.4 the characteristics that were first considered are presented. Once we found the features that could be interesting to characterise a trailer, the necessary tools to extract them properly were analysed.

Figure 3.7: Multi-genres distribution

It can be observed that there is a pattern in the tools used for the extraction of these characteristics. Computer Vision and Machine/Deep Learning (ML/DL) techniques are the most common tools currently used to extract these types of features. The analysis in time can also include, inside the complete algorithms, some Computer Vision and ML/DL techniques such as scene change and shot detectors. With these tools the vast majority of characteristics that define a film can be covered.

The RGB image colour histogram (extracted with computer vision techniques) will allow us to differentiate between a wide range of characteristics such as the most outstanding colours, whether the action takes place in an exterior or an interior location, and to distinguish between landscapes and even weather conditions. The RGB colour histogram has been used for several applications; for example, [48] explains how to build a simple classifier using only the colour histograms, and another example uses the colour histogram for video retrieval [49].

The RGB colour histogram has also been used in other movie classification solutions such as [50], or taking colour into account as in [51]; this last one uses the colour histogram and the colour variance.

On the other hand, the object detector allows us to detect any object for which it has been trained, with highly accurate performance with respect to the current state-of-the-art algorithms [36]. Therefore we would be covering all those characteristics that account for the number of objects, animals, people or meals. Using an object detector as a feature for movie classification or recommendation has already been researched, as in [52], which detects objects in movie posters to classify those movies.

Besides these tools, it is also observed that the use of metadata from the selected database/dataset can be very convenient, but in this work this information has been reserved to be used as labels in one block of the recommendation system engine. In cases like [53], the genre metadata correlation is used for movie recommendations. In [54] a wide range of metadata information is used, such as genre, rating, actor names, directors, producers and writers. In this particular case they use IMDB metadata, like the one given by the Labeled Movie Trailer Dataset used in this TFM.

In this way, the remaining characteristics are emotions, activities and optical flow. For the emotions, different novel deep learning algorithms were tried. The biggest problem was that the facial expressions had to be forced for the success rate to be considered appropriate. Also, the generated vector contained only 7 values (one probability value per emotion), which compared to the length of the other features meant that it would not be significant. Therefore, in this work, this option was rejected due to the need to combine an accurate face detector with the short time that faces appear in each shot of the trailer. However, the research in [55] used emotion extraction and its output as a feature to improve movie recommendations.

Furthermore, for action recognition, the state of the art shows quite good results when recognising a wide range of activities with a high level of accuracy, as for example in [24] and [21]. For this reason, and because some researchers have used action recognition for movies with good results [56] [50], it was selected as one of the features to be analysed.

Lastly, the optical flow. This feature has been used for many applications, for example for face tracking [57], for autonomous vehicles [58] or even for action recognition [2]. It is a good descriptor of a film, since it indicates the pattern of the apparent movement. Also, for its calculation there are many computer vision algorithms that allow an accurate implementation. There are even some deep learning techniques such as FlowNet [59]. The optical flow as a movie feature has been used in research such as [51] and [50]. For these reasons, the optical flow was included in the list of features of our recommendation engine system.

ID | Characteristics | Tool
1 | Shot duration | Shot detector
2 | Time of inside and outside shot | Shot detector, computer vision
3 | Colours | Computer vision
4 | Colours of each scene | Scene detector, computer vision
5 | Landscape | Computer vision, ML/DL
6 | Meteorological condition | Computer vision, ML/DL
7 | Objects | ML/DL object detector
8 | Animals | ML/DL object detector
9 | Number of humans per scene | ML/DL object detector, scene detector
10 | Meals | ML/DL object detector
11 | Emotions | ML/DL emotions detector
12 | Activities | ML/DL action recogniser
13 | Film date | Database metadata
14 | Rate | Database metadata
15 | Film genre | Database metadata
16 | Optical flow | Computer Vision, ML/DL

Table 3.4: Features analysis

3.3.2.2 Action recognition

The action recognition feature is a vector that indicates the probability that a series of activities are happening during the film trailer. In this work, this vector has the length of the number of activities that can be detected by the algorithm used, in this case 339 values. Each position of the vector corresponds to one of those activities, so each position indicates the probability of that activity happening in the trailer.

To perform the action recognition, the research in [60] has been used. In order to perform the recognition of the movie dataset activities, three different tasks are required. First, train a model, then make the prediction for all movies with that model and finally perform the normalisation.

To create this vector, a detector of actions is necessary. The selected action detector is based on deep learning (Figure 3.8). In this work, a dataset of activities is fed through a ResNet50 network.

The dataset used for training is the Moments in Time Dataset [60]. The dataset contains one million videos. Those videos have a length of less than 3 seconds. This dataset has 339 action classes (Appendix F). The actions are performed by humans, objects and animals, and there are also natural phenomena. The components of the dataset are visual and auditory events.

The labels are vectors with the length of all the actions for which the model is trained. This length is 339. Each position of the vector represents an action, and the value in it is the percentage of visualisation time of that action in the video. For each input video there is a label vector. The output of the network is a layer with the number of activities. With this methodology a model that allows detecting actions is generated.

Figure 3.8: Action recognition Training

Before this training the ResNet50 network was initialised with ImageNet (ResNet50-ImageNet) [61].

The network is formed by eight layers with weights. The first five layers are convolutional layers and the last three are fully connected. The output of the last layer ends with a Softmax of length 1000 that is responsible for producing the distribution over the 1000 classes per label. The loss function uses a logarithmic function, seeking to maximise, over the training cases, the average log-probability of the correct label under the prediction distribution.

The dimensions of the architecture are the following:

• First layer filters (convolutional): 224x224x3 input image, 96 kernels of 11x11x3 with stride 4 pixels.

• Normalization and pooling layers.

• Second layer (convolutional): 256 kernels of 5x5x48.

• Normalization and pooling layers.

• Third layer (convolutional): 384 kernels of 3x3x256.

• Fourth layer (convolutional): 384 kernels of 3x3x192.

• Fifth layer (convolutional): 256 kernels of 3x3x192.

• Last three layers (fully-connected): 4096 neurons per layer.

The architecture of the ResNet50 is shown in Figure 3.9.

Figure 3.9: ResNet50

The second task is to predict each film (Figure 3.10). Using the action prediction model we obtain a vector with the length of the number of activities, 339 values, with a probability in each position. This tells us the probability of each activity happening in the video that has been introduced.

Figure 3.10: Action recognition prediction

For example, if a prediction is made on the trailer of the movie 'Dirty Dancing', a vector of length 339 is obtained. In each position of the vector we will have the probability of the activity that represents that position. Using a vector that tells us the name of the activity of each position, we can analyse which activities have been detected and are the most likely to be happening. Table 3.5 presents the results for the 'Dirty Dancing' film trailer.

Action | Probability
Dancing | 0.073 %
Climbing | 0.073 %
Kissing | 0.066 %
Rocking | 0.022 %
Adult+female+singing | 0.020 %
Spinning | 0.019 %
Performing | 0.018 %
Smoking | 0.014 %
Hugging | 0.014 %
Smiling | 0.013 %

Table 3.5: Action recognition prediction example

The results of the prediction for the 'Dirty Dancing' trailer are very interesting, since, with the exception of climbing, all the activities detected by the model happen in the trailer. In addition, these activities are likely to be a good summary of what the film is. So using this feature we would have a good first description of the movie. The exception of climbing may be due to the fact that in many frames of the trailer the characters are observed practising portés in a forest, in nature.

The last task is the normalisation. The normalisation of the action recognition feature vector consists of using the probability vector as it comes out of the prediction but rounding the probabilities to 8 decimals.
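A minimal sketch of this prediction and normalisation step, assuming the trained model returns one probability per class and that the list with the 339 class names is available (names are illustrative):

```python
import numpy as np

def action_feature(probabilities, class_names, top=10, decimals=8):
    """Round the 339 class probabilities and list the most likely actions."""
    feature = np.round(probabilities, decimals)      # normalised action feature vector
    best = np.argsort(feature)[::-1][:top]           # indices of the most likely actions
    return feature, [(class_names[i], feature[i]) for i in best]
```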

3.3.2.3 RGB Color Histogram

The histogram feature vector is a vector that contains three colour histograms, Red, Green and Blue, consecutively. A histogram is a representation of the amount of colour in an image: the number of pixels is calculated and displayed for each of the fixed colour ranges. The final length of the histogram feature vector in this work is 768, which corresponds to the three colour histograms of 256 bins per histogram.

The histogram can be calculated for any type of colour space, commonly HSV and RGB. It can also be calculated for monochromatic images by calculating the intensity. In the case of this project, the RGB colour space was the selected one.

It is also possible to calculate the histogram by separating the image into each of the colour channels and calculating the intensity of each one. This is what was done in the histogram feature extraction.

To perform the extraction of the histogram feature, the average histogram of each scene is calculated. In this way the scene-change characteristic is being integrated, as commented in Section 3.3.2.1. To differentiate between different scenes, a scene change detector is necessary.

As explained in section 3.3.2.1, scene changes can be a descriptive feature of a trailer. There- fore it is necessary to develop a tool that is able to find the scene changes.

The scene detector analyses a video looking for scene changes and cuts. To perform scene recognition, another tool is needed to divide the video into frames from the same scene, to be able to make a frame-by-frame analysis of the video that makes it possible to detect patterns of a specific video.

The process of extracting the histogram, which can be seen in Figure 3.11, starts with the scene detection. This detection gives the first and last frames of each scene. Once the scenes are detected, the video is captured frame by frame and the histogram of each colour (Red, Green and Blue) of each frame is calculated. When all the frames of a scene have been read, the average of the histograms of the scene is stored. Each scene histogram is normalised to the 0-1 range. The normalisation of the histograms is done independently, not with respect to the others, capturing the trend of colours but not the difference in values between histograms of different movies.

Figure 3.11: Histogram process chain

Finally, the average of all histograms per colour is computed. So, the final histogram feature will have three histograms, one per colour, consecutively.
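A possible sketch of this per-scene averaging with OpenCV, assuming the frames of a single scene are already available as images loaded by OpenCV (which stores channels in BGR order); it is an illustration of the chain in Figure 3.11, not the exact implementation:

```python
import cv2
import numpy as np

def scene_rgb_histogram(frames):
    """Average the 256-bin histograms of the three colour channels over one scene,
    normalised independently to the 0-1 range."""
    per_frame = []
    for frame in frames:  # frames of a single scene (BGR images)
        channels = [cv2.calcHist([frame], [c], None, [256], [0, 256]).flatten()
                    for c in range(3)]
        per_frame.append(np.concatenate(channels))   # length 768 per frame
    scene = np.mean(per_frame, axis=0)
    return scene / scene.max()                       # independent 0-1 normalisation
```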

For example, performing the histogram process on two film trailers with different genres gives different results. An example of this are the films 'The Delta Force' and '#Horror'. The genres of the first film are action, adventure and drama, and the genres of the second film are horror and drama. The results of plotting the histograms for each colour for the movie 'The Delta Force' can be seen in Figure 3.12, and those of the film '#Horror' in Figure 3.13. At first glance it can be observed how the action and adventure movie has a wider colour range, while the horror movie has most of its colours in the range of dark colours.

Figure 3.12: Action film colour histogram

Figure 3.13: Horror film colour histogram

3.3.2.4 Object detector

Object detection is one of the main tasks in computer vision problems. Many problems in the image field start by applying computer vision or machine/deep learning techniques in order to detect and recognise objects of interest in images. In this work a deep learning object detector has been used to extract a binary feature vector, where a position with value 1 represents that the object appears in the trailer and 0 represents an object not present during the trailer.

Object feature extraction uses a methodology similar to the action recogniser. First it is necessary to train a deep learning model and then predict the objects per film. To do this, instead of using all the frames in the trailer (redundant information), key frames were detected and the object detector is applied over these frames.

For the first part it is necessary to train a network with a dataset of different object classes (Figure 3.14). The network used is YOLOv3 [36] and the dataset is the Open Images Dataset [62].

Figure 3.14: Object detector training

The Open Images Dataset is composed of 9.2M images, of which 1.2M have bounding boxes. Each image has an average of 8 objects. The Open Images Dataset provides 600 different categories of objects, so the trained model will be able to detect whether any of those 600 types of objects (Appendix E) are present in an image, given a proper algorithm training. These 600 classes form a hierarchy. This means that, for example, an object can belong to the class apple at one level, to fruit at a higher level and to food at the top level.

For the creation of this dataset, images with a Creative Commons Attribution (CC-BY) license were collected. As the number of images and the number of classes was very high, the technique used to label all the images was to make a prediction of the possible objects. This gave a first intuition, which then had to be checked manually by people.

There are other datasets that have been used on many occasions, such as the PASCAL VOC Dataset [63] or the COCO dataset [64]. But the PASCAL dataset only has 20 classes and COCO 80. Although these are many classes, not all the objects that were considered important if present in a trailer were among them. That is why we searched for a dataset with a large number of classes that would cover a large number of objects.

The YOLOv3 [36] architecture is the evolution of YOLO. YOLO ('You Only Look Once') is based on CNNs and was proposed in [3]. This architecture is able to provide bounding boxes as well as class identifiers directly, for each image, in a single evaluation. The sliding window technique, a rectangle that goes through the image in search of objects, was a widely used technique, but it consumes too much time and does not always produce the most precise bounding boxes. YOLO is committed to the creation of a grid with cells that are narrow enough to obtain a good coverage of the image and applies the classification and location algorithm on each of the cells. In each cell of the grid, some rectangles (anchor boxes) of different sizes and orientations are defined. This technique differs from the sliding window in that it has a convolutional implementation that helps to make this algorithm efficient.

As loss function, YOLO uses the sum of three loss functions: the classification loss, the localisation loss and the confidence loss, resulting in equation 3.6.

\[
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\tag{3.6}
\]

Despite its advantages, YOLO is not without limitations: it has difficulties discerning between very close objects, or detecting small objects that appear in groups, such as birds. However, it is one of the most promising algorithms in that it achieves high reliability and real-time processing. YOLOv2 [33] proposes an improved version of YOLO, and YOLOv3 proposes a series of further improvements with respect to YOLOv2, among which the increase in detection speed stands out.

The YOLO architecture is shown in Figure 3.15.

Figure 3.15: YOLO architecture [3]

In order to choose this architecture, the different possible architectures that nowadays present results similar to those of YOLOv3 were analysed. In Figure 3.16 a comparison among architectures can be found. Mask R-CNN [6] and RetinaNet [5] achieve better results, but they are much slower than YOLOv3. Taking into account that a prediction has to be made over a dataset of 4021 trailers with 600 categories, YOLOv3 was the most reasonable option: it gives good results in an acceptable time.

Figure 3.16: Object detection architectures comparison

Once the model is trained, it is necessary to predict for each film the objects that can be seen in the trailer (Figure 3.17). To achieve this, the I-frames (the frames with all the information among I, P and B frames) of all the films have been obtained. The reason for using I-frames is that the computational cost of making a detection for all the frames of all the films was very high, and the results did not vary significantly with respect to only using the I-frames (high redundancy between nearby frames, no new information). Using YOLOv3, each I-frame passes through the network and the objects detected by the algorithm are obtained. This generates a vector of 600 positions per frame (the number of classes in the dataset).

In the vector it can be observed that there are a series of objects that are not usually present in the trailers. Therefore, these objects were removed from the vectors, now with a length of 96 positions, acting as a film object filter that deletes objects not used in the film trailers of the dataset. This has been done to avoid having a very large number of 0 values in the joint feature vector, since the vector of objects would otherwise have a length of 600, where 504 positions would be 0. This could worsen the training considerably.

There was one last change. In the first analysis of the objects, the usual detection threshold was used and the detected objects were mostly correct. However, in this case the number of detectable object classes is much greater than usual, and when training this type of dataset it is very difficult for the probabilities to be high for all objects. Therefore, due to the complex learning process, lowering the threshold makes the detector more permissive and more objects are found. The threshold went down to 0.3 and, as expected, the number of detected objects increased. In addition, the range of categories of detected objects increased, and the vector remained at 138 classes of objects, with a length of 130 after removing the always absent objects.
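A hedged sketch of how the binary object vector could be built and filtered, assuming one list of (class id, confidence) detections per I-frame; the function names and data layout are illustrative:

```python
import numpy as np

def object_feature(detections_per_frame, num_classes=600, threshold=0.3):
    """Binary vector: 1 if an object class appears in any I-frame above the threshold."""
    vector = np.zeros(num_classes, dtype=int)
    for detections in detections_per_frame:          # one list per I-frame
        for class_id, confidence in detections:
            if confidence >= threshold:
                vector[class_id] = 1
    return vector

def remove_absent_classes(feature_matrix):
    """Drop object classes that never appear in any trailer of the dataset."""
    present = feature_matrix.sum(axis=0) > 0         # feature_matrix: films x classes
    return feature_matrix[:, present]
```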

Figure 3.17: Object prediction

An example of the object detection can be seen in Figure 3.18. It shows a frame from a 'Harry Potter' film where the shot contains different boys sitting in class. The detector finds all the people present in the image.

Figure 3.18: Object prediction example

3.3.2.5 Optical flow

Optical flow allows estimating the amount of movement from one frame to another in a sequence of images. Normally the optical flow describes the movement by means of vectors that indicate which pixel has moved, in what quantity and in what direction.

The applications are many and very diverse, such as image segmentation, object classification, time to collision and driver assistance. In video recommendations it can be useful to indicate the quantity of motion.

The process for extracting the optical flow can be seen in Figure 3.19. Firstly, it is necessary to find the scene changes; secondly, to perform the calculation of the optical flow; and finally, to calculate the flow representation and find the values with the most movement among those calculated from the trailer.

Figure 3.19: Optical flow extraction process

To calculate the optical flow, the scene change detector was also used. This must be taken into account since, at each scene change, the optical flow values increase even though there is no real movement, because the changes from one frame to another are counted to find the movement of the scene. Therefore, it is necessary to know in which frames the scene changes occur so as not to take them into account when calculating the optical flow.

Once the scene changes have been detected, the optical flow between the frames of the entire trailer is calculated. The Farneback method [65] has been used to calculate the optical flow. The Farneback method is a two-frame motion estimation algorithm. To perform the calculation, it uses polynomial expansion, approximating a neighbourhood of each image pixel by a polynomial. The equation of the Farneback method can be found in equation 3.7.

\[ \text{prev}(y, x) \sim \text{next}(y + \text{flow}(y, x)[1],\; x + \text{flow}(y, x)[0]) \tag{3.7} \]

Once the optical flow is calculated, the HSV representation, usually used to visualise the optical flow of a video, is computed. This HSV image is transformed to black and white and the sum of the pixel intensities is counted per frame. The hue and the value (HSV) depend on the apparent movement: the hue describes the angle of the optical flow vector, while the value determines the magnitude of the optical flow vector.

After making these calculations, a vector with a length equal to the number of scenes of the trailer is obtained, and it is necessary to normalise it so all films have a vector of the same length. Per film, a vector with the 200 highest values of pixel intensity per frame is stored as the final feature vector.
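A simplified sketch of this extraction with OpenCV, assuming grayscale frames and a set of frame indices where a scene change occurs; it sums the flow magnitude directly instead of going through the full HSV representation described above:

```python
import cv2
import numpy as np

def optical_flow_feature(gray_frames, scene_change_frames, top=200):
    """Per-frame sum of Farneback optical-flow magnitudes, skipping scene changes,
    keeping the `top` highest values as the final feature vector."""
    motion = []
    for i in range(1, len(gray_frames)):
        if i in scene_change_frames:                 # do not measure flow across a cut
            continue
        flow = cv2.calcOpticalFlowFarneback(gray_frames[i - 1], gray_frames[i], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        motion.append(magnitude.sum())
    return np.sort(motion)[::-1][:top]
```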

For example, applying the optical flow to the movie 'Titanic', a vector of 200 positions is obtained in which the first 3 values are 2256408, 1969240 and 1904171. For the film 'The Tribe', a vector of 200 positions is obtained in which the first 3 values are 2653206, 2398636 and 1953803. The genres of the movie 'Titanic' are drama and romance, while for the film 'The Tribe' it is comedy. The optical flow results are higher for the comedy film than for the romantic film, and they also remain higher throughout the whole vector.

3.4 Embedding

The use of deep neural networks has increased significantly in recent years, expanding the range of applications where they can be applied, such as image and audio analysis, natural language processing or time-series forecasting. One of the most successful applications of neural networks is embedding, used to represent discrete variables as continuous vectors. It allows mapping the input information into another subspace where the samples are as separated as possible, which allows a better characterisation of the information. Through this technique, great advances have been made in word embedding for machine translation and entity embedding for categorical variables.

To perform the embedding of the trailers dataset it is necessary to have the feature vector and the genres of each film. First it is necessary to train the network and then to predict the embedded feature vectors. Film genres have been selected as the labels for training because similar films should have similar genres in order to make a recommendation. This assumption is not always true, but it is interesting because the films are labelled with one, two or three labels and it is possible to train a network as a multi-label classification problem, mixing labels and learning inter-dependencies between genres.

For the first step, the process that has been carried out can be observed in Figure 3.20. A deep neural network is used, taking as input the feature vectors, with a multi-label classification purpose. The network configuration was the following:

• Two fully connected layers of size 2048 neurons.

• ReLU activation function between neuron layers.

• Final 9-neuron layer with Softmax output representing the genres in the labels.

• Input size of 1445 values (concatenation of all independent features extracted).

• One hot representation of the labels (binary vector with ones in true labels).

• Dropout during training to prevent overfitting at the beginning.

• Categorical Cross Entropy Loss Function.

Figure 3.20: Embedding training

Once the trained model is obtained, the prediction of all the embedding vectors is carried out. But this time, to make the prediction, the last neurons and Softmax layers are removed. The main idea is to obtain, not the prediction of the genres, but the mapping of the original features to a new subspace that facilitates the subsequent training, taken as the output of the second fully connected layer. In this way, the output obtained by predicting is a vector with the dimension of that last layer, that is, 2048 values. This will be the embedded feature vector used to perform the final recommendation using a different deep learning network trained for this purpose. This process is represented in Figure 3.21.
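A minimal Keras sketch of this embedding step, under the assumption that Keras is the framework; the layer sizes, activations and loss follow the configuration above, while the dropout rate, its placement and the optimiser are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Training network: multi-label classification over the 9 genres.
inputs = keras.Input(shape=(1445,))                  # concatenated feature vector
x = layers.Dense(2048, activation='relu')(inputs)
x = layers.Dropout(0.5)(x)                           # rate and placement are assumptions
x = layers.Dense(2048, activation='relu')(x)
outputs = layers.Dense(9, activation='softmax')(x)

classifier = keras.Model(inputs, outputs)
classifier.compile(optimizer='sgd', loss='categorical_crossentropy')
# classifier.fit(feature_vectors, one_hot_genres, ...)

# Embedding extractor: the same network without the final classification layer.
embedder = keras.Model(inputs, x)                    # output of the second 2048-neuron layer
# embeddings = embedder.predict(feature_vectors)     # one 2048-value vector per film
```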

Figure 3.21: Embedding prediction

3.5 Distances

In this section, the process to get the labels used to train a network for film recommendation, taking as input the embedded vectors, is presented. This block receives the embedded feature vectors and outputs a matrix of distances.

The distance matrix consists of one vector per film, in which the vector of each film contains the distances from that film to the rest of the films. In this way, each vector has a length equal to the number of films in the dataset, that is 4021, and the matrix is made up of one of these vectors per film. The values of the distances are normalised from 0 to 1 along each vector, with 0 being the minimum distance, that of the film with itself. The normalisation has been performed vector by vector instead of using the entire set of films, to maintain closer distance values between films independently. In order to be more coherent with the numbers, the normalised distances have been changed to 1 − distance, so that 1 is the best recommendation (the film trailer with the smallest distance) and 0 the worst one (the farthest film trailer in the dataset).

An example of the final label distance matrix can be seen in Table 3.6. The example represents the distance vector of each film, where D(a, b) denotes the function calculating the distance between film a and film b. The diagonal of this matrix will be full of ones, since the distance between a film and itself is zero.

Film 1 | Film 2 | Film 3 | ... | Film Nf
1 | 1 − D(film1, film2) | 1 − D(film1, film3) | ... | 1 − D(film1, filmNf)
1 − D(film2, film1) | 1 | 1 − D(film2, film3) | ... | 1 − D(film2, filmNf)
1 − D(film3, film1) | 1 − D(film3, film2) | 1 | ... | 1 − D(film3, filmNf)
... | ... | ... | ... | ...
1 − D(filmNf, film1) | 1 − D(filmNf, film2) | 1 − D(filmNf, film3) | ... | 1

Table 3.6: Labels Distance Matrix

Two methods have been used to calculate the distances: the cosine similarity and the euclidean distance. Two distance matrices have been generated, one per method, to later compare the results and choose which one to use for the recommendation engine.

Cosine similarity is a metric used to check how similar two vectors are. For the calculation of the distance, it measures the cosine of the angle between the two vectors projected in a multi-dimensional space. The smaller this angle is, the greater the cosine similarity will be (Equation 3.8).

\[ \cos(p, q) = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert} = \frac{\sum_{i=1}^{N} p_i q_i}{\sqrt{\sum_{i=1}^{N} p_i^2} \, \sqrt{\sum_{i=1}^{N} q_i^2}} \tag{3.8} \]

A high cosine value indicates that an entry is closely related to another. The parameters p and q are the vectors under comparison and N is the length of these vectors. The final distance is calculated as follows (Equation 3.9):

d(p, q) = 1 − cos(p, q) (3.9)

The euclidean distance is the straight-line distance between two points in Euclidean space. Deriving the Euclidean distance between two data points involves computing the square root of the sum of the squares of the differences between corresponding values (Equation 3.10):

\[ d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_N - q_N)^2} = \sqrt{\sum_{i=1}^{N} (p_i - q_i)^2} \tag{3.10} \]

Where p_i and q_i represent the values of the vectors at the same position and N is the total length of the vectors under comparison. In summary, the cosine similarity captures the orientation of the angle and the euclidean distance the magnitude. The final matrices are obtained by applying the distance metric to the embedding matrix.
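A possible sketch of how the two label matrices could be computed, assuming SciPy and NumPy; it follows the row-by-row normalisation and the 1 − distance flip described above:

```python
import numpy as np
from scipy.spatial.distance import cdist

def label_distance_matrix(embeddings, metric='cosine'):
    """Per-film label vectors: distances row-normalised to 0-1 and flipped to 1 - distance,
    so 1 marks the closest film (the film itself) and 0 the farthest one."""
    distances = cdist(embeddings, embeddings, metric=metric)  # 'cosine' or 'euclidean'
    row_max = distances.max(axis=1, keepdims=True)
    return 1.0 - distances / row_max                          # vector-by-vector normalisation
```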

3.6 Deep Learning Recommender System Architectures

This section describes the last block of the recommendation system engine. This block is a deep neural network, trained with the embedded feature vectors as input and the distance matrix as labels for each film. The problem is solved as a regression in order to learn the distance values and finally perform the recommendation using the top K predicted values. Three main architectures have been used to compare the performance between them.

Firstly, an Artificial Neural Network is used to learn directly from the embeddings and predict the label distance values to get the final recommendation. Secondly, a very interesting approach is the use of autoencoder architectures. This type of network tries to encode the information and decode the encoded vector in order to learn the input representation or a different type of modality. The first architecture is an autoencoder that takes as input the embedding vectors and tries to reproduce the distances directly in the output. This network tries to mix two different modalities in the same network. The second idea was to create a double autoencoder. In this case, a first autoencoder tries to reproduce the same input in the output. This learning is easier than the learning between different modalities and the network can achieve better performance. As a second step, the encoder part of the first autoencoder is removed and a new encoder part is added. This second part is trained using as input the embedding vectors and fixing the weights of the previous decoder so that they do not learn during this training phase.

3.6.1 Deep Neural Network

One of the initial common solutions is the use of Artificial Neural Networks with several neuron layers, turning the network into a "deep" one. These networks have demonstrated that they can be used for several purposes, like classification problems and also regression. In this work, a regression problem is solved in order to have a neural network able to take the embedding vectors as inputs and learn to predict directly the distances between film trailers.

The proposed network architecture is represented in Figure 3.22. Through this neural network a model is generated, which is the one that will be used to make the recommendations.

Figure 3.22: Artificial neural network

The input of the neural network is the feature embedding vectors, and the labels are the generated distances. The input has a dimension of 2048 (embedding length) and the labels 4021 (number of film trailers in the dataset).

During training, for each embedded vector of a movie, the network tries to learn what distances it should generate, so that once the prediction of a film is produced, it is a vector of 4021 positions with the distances from that film to all the films in the dataset, 1 being the best film to recommend and 0 the worst. The configuration of the network was as follows:

1. Two fully connected layers with 4098 neurons.
2. ReLU activation function after each of the previous layers.
3. Final neuron layer with size 4021 (number of films in the dataset).
4. Sigmoid function at the end to get values in the 0-1 range.

The hyperparameter selection for this network configuration will be presented in the next section. With this final configuration, a deep neural network will learn from the data to create a model that predicts the distances, which will be the output used to take the final decisions about the recommendations.
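A minimal Keras sketch of this regression network, assuming Keras as the framework; the layer sizes, activations, loss, optimiser, learning rate and batch size follow the configuration and hyperparameters reported in this chapter:

```python
from tensorflow import keras
from tensorflow.keras import layers

embedding_dim, num_films = 2048, 4021

recommender = keras.Sequential([
    keras.Input(shape=(embedding_dim,)),
    layers.Dense(4098, activation='relu'),
    layers.Dense(4098, activation='relu'),
    layers.Dense(num_films, activation='sigmoid'),   # distances mapped to the 0-1 range
])
recommender.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss='mse')
# recommender.fit(embeddings, distance_labels, batch_size=256, epochs=50000)
```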

3.6.1.1 Neuronal network hyperparameters

One of the most important parts in the world of deep neural networks is the hyperparameter selection. There exists a trade-off between these parameters and the network performance during learning. The number of layers, the number of neurons, the optimiser, the learning rate and the batch size (when high computational capabilities are available) are very important parameters during the training process.

• Neural network dimension

The selection of the dimensions of a neural network is done by making a first estimation and checking its results. There exist some techniques in the literature to select the number of layers and the number of neurons in each layer [66]. The method used here was trial and error, modifying the number of layers and neurons in each layer in an incremental way.

In the first attempts, networks with the same number of layers but with a lower dimension in the first and second layers were used. After several tests it was found that a larger dimension, such as 4096 neurons, learned more efficiently but with an increase in the computational cost. Beyond these experiments, further increasing the number of neurons did not improve the performance of the network.

The rest of the dimensions are given by the data that has been generated. That is, the input of the first layer must have the dimension of the embedded feature vectors, and the output layer must have the dimension of the number of movies in the dataset.

• Epochs

The selection of the number of epochs is done by pursuing the accuracy the project wants to reach. In the case that concerns us, with 50000 epochs the loss value was already sufficiently low for the purposes of the project. The metric values obtained during training will be presented in Section 4.

• Learning rate

For the selection of the learning rate it is necessary to observe the progress of the training. As long as the loss continues to fall, the learning rate is adequate. But if the loss does not fall, or falls slowly, the learning rate will have to be reduced, either manually or using some of the metrics extracted during training.

Another method is to use a learning rate that decreases gradually from the beginning, so that at the start it has a high value, which is the most convenient in the first epochs since the general traits are learned very quickly, and as the epochs increase, the learning rate goes down progressively to find the details that are more difficult to discover.

In the case of this project, it was not necessary to modify the learning rate since the loss continued to fall until reaching the desired value. Therefore a fixed learning rate was used.

The selected learning rate, after several tests, was 0.01. The tested learning rates were in a range from 0.1 to 0.0001. With 0.1 the network got stuck at the beginning, because such a high value does not allow the optimiser to reach a good performance during the optimisation process. Very small learning rates, in the range 10^-3 to 10^-4, present a good behaviour but take more time to train the network to get the same results as those achieved with the selected learning rate, at the cost of increasing the number of epochs by more than 1.5 times.

• Batch size

For the batch size, as happens with almost all parameters, it is necessary to vary it to find the value that best fits the problem. The batch size has another peculiarity: depending on the computational power of the system executing the training, it can be increased or not, since the batch size is the number of inputs that are trained at the same time. Batch size is a parameter that is very useful when high computational capabilities are available because, depending on its size, the variance during training can be reduced, also reducing the training time.

To find the right value, it must be noted that, at each epoch, if the learning rate is adequate, the loss falls adequately and the accuracy rises. If the epochs are too slow and the loss takes too many epochs to show any decrease, it means that the batch size can be increased, if possible, to speed up the process.

Finally, the batch size used in this training was 256. The tested batch sizes were in a range from 16 to 512.

• Optimiser

The optimiser used is SGD. Although at the beginning the Adam optimiser was tried, after a few epochs the loss was stagnating. This may be because the Adam optimiser takes bigger jumps, so that it arrives very fast near the optimum loss but, if it does not hit it first, it is not able to approach it, since it keeps taking big jumps from one side of the minimum to the other. Therefore it was found that the SGD optimiser did improve the loss continuously, and did not stall as it did with the Adam optimiser.

• Loss Function

The mean squared error function has been used as the loss function. Through the loss function, the result after the last layer of the network, in an epoch, is compared with the labels of the films trained in that epoch.

The mean squared error measures the average of the squared errors, that is, the difference between the estimation and what is estimated. This measurement fits what this neural network is intended to learn and therefore it was selected as the loss function.

3.6.2 Autoencoder

The architecture of the autoencoder network can be seen in Figure 3.23. It is composed of two main parts, the encoder and the decoder. The encoder takes the input of the network and produces an embedded representation used by the second part, the decoder, to reproduce the input or change between modalities. Autoencoders are used in several tasks like data generation [67], data compression [68] or modality conversion [69]. The configuration normally reduces the input size sequentially in the encoder down to the desired encoded/embedded size, and the decoder performs the opposite operations to recover the initial input size. Through this neural network a model is generated, which is the one used to make the recommendations.

Figure 3.23: Autoencoder

This network has as input the embedding vectors and tries to reproduce the distances directly in the output. As previously said, this network tries to mix two different modalities in the same network. It is related to the artificial neural network proposed in the last section, since it has the features as input and the labels as the comparison vector for the loss function, but the architecture of an autoencoder first compacts the input information by encoding the data and then a decoder decodes the information to recover it. Mixing both models seeks better results than those of the Artificial Neural Network.

The network is divided into encoder and decoder. The encoder consists of three fully connected layers, and after each layer of neurons a ReLU function is applied. The decoder has the output of the encoder as input (the encoded vector). It has three other fully connected layers; the ReLU function is used after the first two layers and the Sigmoid function after the last layer to fix the output values to the 0-1 range (the same as the values in the distance matrix).
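A hedged Keras sketch of this autoencoder, assuming Keras as the framework and taking the layer sizes from Section 3.6.2.1; the size of the very last layer is an assumption here, set to the length of the distance vector so the MSE loss can compare it with the labels:

```python
from tensorflow import keras
from tensorflow.keras import layers

embedding_dim, num_films = 2048, 4021

inputs = keras.Input(shape=(embedding_dim,))
# Encoder: progressively compress the embedded feature vector.
x = layers.Dense(2048, activation='relu')(inputs)
x = layers.Dense(1024, activation='relu')(x)
encoded = layers.Dense(512, activation='relu')(x)
# Decoder: expand the code towards the distance vector (0-1 values).
x = layers.Dense(1024, activation='relu')(encoded)
x = layers.Dense(2048, activation='relu')(x)
outputs = layers.Dense(num_films, activation='sigmoid')(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss='mse')
# autoencoder.fit(embeddings, distance_labels, batch_size=256, epochs=60000)
```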

3.6.2.1 Autoencoder hyperparameters

As mentioned above, the selection of hyperparameters is one of the most important parts in the world of neural networks. In this section the selection of these parameters for the autoencoder is explained.

• Autoencoder dimension

The autoencoder dimensions selected are those of a simple architecture. The architecture of an autoencoder already implies a complexity that allows training a small dataset easily, as is the case here. Therefore, it does not need a large number of layers or neurons per layer.

The number of neurons of the encoder is 2048 in the first layer, 1024 in the second layer and 512 in the third layer. For the decoder it is the opposite, the number of neurons increases: 512 in the first layer, 1024 in the second layer and 2048 in the third layer.

Readjusting the dimensions of the autoencoder was not necessary, since the training process and the results obtained worked properly, as will be explained in Section 4. The network did not stop learning and the losses kept falling throughout the training. These results are explained in more detail in Section 4.6.

• Epochs

As for the artificial neural network, the ideal number of epochs is one in which the network has a very low loss and cannot learn anything else (preventing overfitting). In this case 60000 epochs have been used. Although the loss at epoch 60000 is not zero, it is very close to 0, giving a good performance of the network.

• Learning rate

For the selection of the learning rate, the procedure followed for the artificial neural network is repeated: observing the behaviour of the training and the loss at each epoch, and checking that it continues to fall without the network stopping learning.

In this case, the method of gradually decreasing the learning rate could also be used, so that as the epochs increase, and more level of detail is needed in the learning, a learning rate that facilitates this new learning would be applied.

The learning rate used for the autoencoder training was 0.01. A range of learning rates was tested, from 0.1, with which the network gets stuck, down to 0.0001, which gave good performance but slowed down the learning process, taking longer to achieve the same result as with 0.01.

• Batch size

The batch size was selected by varying it to find the value that best fits the problem. The tests were carried out using batch sizes in a range from 16 to 512, and the value finally selected was 256. To choose it, it was checked that at each epoch the loss fell properly and the accuracy increased. If an epoch took a long time and the loss needed many epochs to decrease by a significant value, it meant that the batch size was too low.

• Optimiser

As with the previous network, the SGD optimiser has been used. The Adam optimiser was also tested, but the training stalled after a few epochs. On the other hand, the SGD optimiser, even if slower, is safer and it allowed the training to keep learning.

• Loss Function

The same loss function has been used as for the artificial neural network, the mean squared error, which minimises the difference between the predicted output of the network and the calculated distance vectors for each film.

3.6.3 Double autoencoder

The last proposal is a double autoencoder. This architecture fuses two different autoencoders, one that learns the initial data and another that learns a different purpose. The first autoencoder takes as input the distance vectors and learns to reproduce these vectors in the output, since it is trained to reproduce data of the same modality. This process is easier than the process used in the other autoencoder, where a change between different modalities (different input and output values) is carried out. This first autoencoder therefore learns to reproduce the distances at the output, achieving a good performance due to the low complexity of the problem. The second autoencoder follows this procedure:

• Firstly, the decoder part of the first autoencoder is combined with a new encoder part with the same dimensions as the previous encoder but with a different input size.

• Secondly, the weights learned by the first autoencoder are copied into the decoder part and frozen so that they do not change during training.

• Finally, the autoencoder takes as input the embedding vectors and learns during training to encode the information properly to fit the decoder input learned in the first autoencoder.

Figure 3.24: Double autoencoder

In this way, the decoder part of the second autoencoder learns more efficiently to reproduce the distance vector at the output and the encoder part learns only to fit the input of the decoder in order to obtain these distance vectors. When this training is done, two models are obtained, one per autoencoder. The second autoencoder model allows predicting the distances of a film whose embedded feature vector has been used as input.
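A hedged Keras sketch of the two training stages, under the same assumptions as the previous sketches (Keras as framework, decoder output sized to the distance vector); the key point it illustrates is freezing the decoder weights before training the second autoencoder:

```python
from tensorflow import keras
from tensorflow.keras import layers

embedding_dim, num_films = 2048, 4021

def build_encoder(input_dim):
    """Encoder 1 and encoder 2 share the same layer sizes but not the input size."""
    inputs = keras.Input(shape=(input_dim,))
    x = layers.Dense(2048, activation='relu')(inputs)
    x = layers.Dense(1024, activation='relu')(x)
    return keras.Model(inputs, layers.Dense(512, activation='relu')(x))

def build_decoder():
    inputs = keras.Input(shape=(512,))
    x = layers.Dense(1024, activation='relu')(inputs)
    x = layers.Dense(2048, activation='relu')(x)
    return keras.Model(inputs, layers.Dense(num_films, activation='sigmoid')(x))

decoder = build_decoder()

# Autoencoder 1: distances -> distances (same modality), trained with RMSprop.
encoder1 = build_encoder(num_films)
auto1 = keras.Model(encoder1.input, decoder(encoder1.output))
auto1.compile(optimizer='rmsprop', loss='mse')
# auto1.fit(distance_labels, distance_labels, batch_size=256, epochs=60000)

# Autoencoder 2: embeddings -> distances, reusing the decoder with frozen weights.
decoder.trainable = False                            # keep the learned decoder fixed
encoder2 = build_encoder(embedding_dim)
auto2 = keras.Model(encoder2.input, decoder(encoder2.output))
auto2.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss='mse')
# auto2.fit(embeddings, distance_labels, batch_size=256, epochs=60000)
```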

The architecture of the double autoencoder network is presented in Figure 3.24. Through this neural network a model is generated, which is the one that will be used to make the recommendations.

3.6.3.1 Double autoencoder hyperparameters

In this section the hyperparameters of the double autoencoder are defined, chosen to obtain a proper result during the training process.

• Double autoencoder dimension

The dimensions of encoder 1 and encoder 2 are the same except for the input size, and those of the decoder are the same but in the opposite order. These dimensions have been chosen as for the simple autoencoder presented previously, verifying that the network was capable of learning. The bigger the network is, the longer it takes to train and the more computationally costly it is; moreover, beyond a certain size it does not matter how much the dimensions are increased, the result is the same. Therefore the dimensions should be as small as possible as long as they provide the same result as a higher-dimensional network.

The number of neurons of encoder 1 and encoder 2 is 2048 in the first layer, 1024 in the second layer and 512 in the third layer. For the decoder it is the opposite, with the number of neurons increasing: 512 in the first layer, 1024 in the second layer and 2048 in the third layer.

As was done for the previous networks, the behaviour of the network is observed to decide whether the layers need resizing. The network did not stop learning and the loss kept falling throughout the training. These results are explained in more detail in Section 4.6.

• Epochs

The number of epochs in each autoencoder is 60000. With this number of epochs, in both autoencoders, a very small loss is reached. In fact, as will be seen in the results, the loss values are much smaller than in all previously trained neural networks. With 60000 epochs the loss is considered practically null, so this is a sufficient number of epochs.

• Learning rate

Repeating the same procedure and tests as with the artificial neural network and with the simple autoencoder, a learning rate of 0.01 has been selected for both autoencoder 1 and autoencoder 2. This learning rate allows the networks to learn throughout all the epochs without getting stuck.

• Batch size

The batch size in both autoencoders has been selected by running the same tests as with the previous networks. That is, batch sizes in a range from 16 to 512 have been tested, finding that the most appropriate value is 256. A detailed explanation can be found in Section 3.6.1.1.

• Optimiser

The optimiser used in autoencoder 1 is RMSprop. The RMSprop optimiser is a method similar to the gradient descent algorithms, but with momentum. This optimiser seeks to restrict oscillations in the vertical direction, which makes it possible to increase the learning speed and helps the algorithm converge faster. The equations used for RMSprop are:

v_{dw} = \beta \cdot v_{dw} + (1 - \beta) \cdot dw^2 \qquad (3.11)

v_{db} = \beta \cdot v_{db} + (1 - \beta) \cdot db^2 \qquad (3.12)

W = W - \alpha \cdot \frac{dw}{\sqrt{v_{dw}} + \epsilon} \qquad (3.13)

b = b - \alpha \cdot \frac{db}{\sqrt{v_{db}} + \epsilon} \qquad (3.14)

Beta represents the momentum value, normally set to 0.9. Using this optimiser the network reaches a low loss value quickly, faster than with SGD. Again, the Adam optimiser fails to converge due to the nature of this problem, which requires fitting very low values with a regression solution.
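
As a worked illustration of Equations 3.11 to 3.14, the snippet below applies one RMSprop update to a toy weight matrix and bias using NumPy; the gradients are random placeholders.

```python
import numpy as np

beta, alpha, eps = 0.9, 0.01, 1e-8            # momentum, learning rate, stabilising epsilon

# Toy parameters and gradients for a single layer.
W, b = np.random.randn(4, 3), np.random.randn(3)
dW, db = np.random.randn(4, 3), np.random.randn(3)

# Running averages of the squared gradients (Eq. 3.11 and 3.12).
v_dW = np.zeros_like(W)
v_db = np.zeros_like(b)
v_dW = beta * v_dW + (1 - beta) * dW ** 2
v_db = beta * v_db + (1 - beta) * db ** 2

# Parameter updates damped by the root of the running averages (Eq. 3.13 and 3.14).
W = W - alpha * dW / (np.sqrt(v_dW) + eps)
b = b - alpha * db / (np.sqrt(v_db) + eps)
```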

The SGD optimiser has been used again for autoencoder 2, since it provides good results, although it is slower. More information can be found in Section 3.6.1.1.

• Loss function

As for the previous networks, the mean squared error has been used as the loss function, minimising the distance between the predicted values and the labels in both autoencoders.

Chapter 4

Results

In this section all the results obtained from each block of the proposed architecture are presented. Firstly, several examples and results over the dataset for each feature extractor selected during the feature selection stage are shown. Secondly, representations of the embedding block are given, using dimensionality reduction techniques to project the multidimensional data into 2D space and visualise how the features are distributed; results of the training process used to learn the embedding are also included. Finally, training results for the networks selected to create the recommendation engine and the final recommendations produced by the system are shown.

4.1 Feature extraction

This section collects several relevant results about the performance of the selected feature extractors, which produce the information that will be the input of the recommendation engine. The feature extractors have been tested on several cases to compare the differences between films of different genres and to check whether the detections based on deep learning networks work properly.

4.1.1 Action recognition

Several tests have been carried out to check the accuracy of the action recogniser. A first test with short videos of predominant actions shows interesting results and high detection accuracy for various types of actions. A second wave of tests with movie trailers was carried out in order to evaluate the network in a more complex, real scenario with more than one action happening at the same time.

4.1.1.1 Short videos action recognition

This first test consists of introducing as input a very short video or a GIF file in which only one action is observed. To carry out this first test, videos from the GIPHY [70] web page have been used: an action is searched for on the GIPHY web page, the video is downloaded, and finally a prediction per video is made. Table 4.1 shows the results of the first four predictions. In order to access the videos used, their identifiers have been included; they should be inserted into the URL https://media.giphy.com/media/IDENTIFIER/giphy.mp4.

These results show how, for a video of very few frames, the detector is able to find the essence of what is being done in it. The percentages represent the amount of each activity in the video, so the sum of all the percentages is 1. Thus, if a video has one clearly outstanding action and few others, that action will have a very high percentage and the rest very low ones, while if the video contains many main actions each will be predicted with a lower percentage. For example, for the bowling action in video 6 the percentage is 82.4%, which is very high; analysing the video, it can be seen that there is no action other than a person bowling. In the rocking video, on the other hand, the percentage of rocking is 28.9%, much lower than in the bowling video, because many other actions appear in the video.

Now the predicted actions are analysed one by one. In the first video two people can be seen practising a martial art. The main prediction, with a probability that clearly stands out from the following actions, is fighting. The fight is the main action of the video, so this can be considered a good result. In addition, the activities that follow are also quite accurate: within this fight one of the people is attacking the other, which is the second most likely action, followed by the kicking action, which is the way the person is attacking. Finally, the fourth prediction is boxing, which, although it is not the discipline they are practising, is a melee fight similar to what can be seen in the video.

Action | Main action | Predictions output | Identifier
1 | Fighting | 56.9% Fighting, 8.8% Attacking, 5.0% Kicking, 2.1% Boxing | 13NGnjve8XVTC8
2 | Rocking | 28.9% Rocking, 10.8% Dancing, 10.4% Performing, 4.0% Adult+male+singing | MC3df27bJs80REF8rn
3 | Singing | 37.8% Adult+female+singing, 13.7% Performing, 9.5% Adult+female+speaking, 7.2% Child+singing | Bj4YPTkrTfYoE
4 | Chatting | 30.4% Smiling, 9.0% Adult+female+speaking, 7.2% Interviewing, 6.4% Grinning | iOe7vzW2UibR0u9dS3
5 | Crying | 38.6% Crying, 5.6% Frowning, 4.7% Coughing, 3.2% Pointing | 2WxWfiavndgcM
6 | Bowling | 82.4% Bowling, 1.5% Bowing, 0.9% Walking, 0.6% Playing+videogames | 3o6Ztj7flGHKPbaeME
7 | Running | 85.4% Sprinting, 9.8% Running, 2.8% Racing, 0.7% Competing | ctOePPWSU91mM

Table 4.1: Action recognition results in short videos

Video 2 shows a performance by the Beatles. The most likely predicted action is rocking, which can be considered the predominant action of the video, so this prediction is also accurate. The following actions are also important for describing the video: first dancing, which they are clearly doing; then performing, which is basically the global action of the video; and the fourth action is adult male singing, which is what the Beatles are doing on stage. In this case, as mentioned above, the probabilities are lower, since the video has several fundamental actions.

Video 3 shows Lady Gaga singing in a close shot. The first predicted action is adult female singing, which is basically the description of the video, and therefore a good prediction. The following prediction, which also has a high probability, is performing, the general activity taking place in the video. With much lower probabilities there are two actions that are not entirely correct. The first one is adult female speaking; this confusion makes sense, since she sings very slowly and vocalises a lot, so she could be mistaken for speaking even by a person. The problem is the fourth action, child singing: although the action is correct, it does not describe the singer. This is an error to be careful with, because although its probability is not very high, it appears in a high position.

Video 4 shows two women, facing the camera, talking. The first predicted action is smiling. Although it is an action being carried out, it is not the action that defines the video. Even so, the action that does define the video, adult female speaking, is the second prediction, so it can be considered a reasonably good prediction, although not as good as the previous ones. The third action is interviewing, an action that at first sight does not seem to be happening in the video; however, the fact that both women look at the camera instead of at each other makes the scene resemble an interview, which could explain the confusion. Finally there is the prediction of grinning, which is indeed happening. In general, the prediction is close to what happens in the video, although it is not the best prediction that could be made.

Video 5 shows a man up close, in dismay and crying. The first action found in the prediction is crying, which can be considered very accurate since it is the description of the video. The second action is frowning; if the video is analysed carefully, it can be seen that the man is indeed frowning as a gesture of pain. The next predicted action is coughing, which does not happen in the video; the prediction was probably confused by the man's gestures, since his constricted expression and his frowning could resemble a gesture that precedes coughing. The confusion may also arise from the fact that the man brings his closed fist to his mouth, a gesture usually made before coughing. Finally, the fourth predicted action is pointing, which may be due to the movement of the hand, but it is the least accurate prediction for this video.

In video 6 a man can be seen bowling, making a run-up. This case is more particular, since the first prediction has a very high percentage, which leaves the other predictions with lower percentages. The prediction considers bowling the predominant action, with a very high percentage. The second and fourth actions, bowing and playing video games, have nothing to do with the video, but their percentage is very low; therefore, when used as a feature in the vector, they will be treated as unimportant. Nevertheless, the third action, walking, is related to what is happening in the video, because before throwing the ball the man walks a couple of steps.

In the last video, video 7, a group of runners can be seen leaving their starting marks and beginning a sprint. This case is similar to the previous one in that the first action has a very high percentage, so the rest of the actions barely have any weight in the feature vector. However, in this case the first four actions are all correct. The first action is sprinting, which is essentially what happens in the video. The second predicted action is running, which also describes a sprint. The third predicted action is racing, and considering they are in a race the prediction is quite accurate. Finally, the fourth action is competing, and in the video they are clearly competing. The predictions for this video turn out to be very good and very descriptive of what happens in it.

In conclusion, this first test shows that the action recogniser generally makes fairly accurate predictions that describe the predicted video very well. On certain occasions, however, as in video 3, although the action itself is correct, it confuses the type of person who performs it.

4.1.1.2 Trailers action recognition

In this section the action recogniser is tested on trailers. For this, four films have been analysed. In this test, the ten actions with the highest predicted percentage are examined, because while in the short videos the four highest-ranked actions already defined the video, in the trailers the complexity increases and interesting actions can appear beyond the first four.

The first film analysed was ”When Harry Met Sally”. The genres of this film are romance, drama and comedy, and its plot summary is: ”Harry and Sally have known each other for years, and they are very good friends, but they fear sex would ruin the friendship.” The prediction of actions for the trailer can be seen in Table 4.2.

Percentage | Action
6.1% | Cuddling
5.3% | Driving
5.0% | Laughing
4.3% | Walking
4.2% | Kissing
3.0% | Adult+female+speaking
2.5% | Adult+male+speaking
1.9% | Smiling
1.7% | Slapping
1.6% | Hugging

Table 4.2: Action recognition results in film ”When Harry Met Sally”

All predicted actions appear in the trailer except slapping, which may have been predicted because of a scene in which two characters hit a ball with a baseball bat. In addition, the percentage of appearance of these actions seems to be quite correct. Moreover, these actions are typical of a film of these genres. Therefore, not only is the prediction correct, but knowing the actions that appear most can be a good description of a film, and therefore an accurate feature.

The second film analysed was ”Captain America: Civil War”, whose genres are action, adventure and Sci-Fi, and whose plot summary is: ”Political interference in the Avengers’ activities causes a rift between former allies Captain America and Iron Man”. The prediction of actions is shown in Table 4.3.

As happened with the previous film, the recognised actions are characteristic of an action or adventure movie, through predicted actions like jumping, fighting or attacking, and the dropping and adult male speaking actions are predicted with a correct appearance percentage.

But other predicted activities decrease the accuracy of the prediction. The raining action, although it appears during a short scene, is not the predominant action of the trailer.

Percentage | Action
13.3% | Raining
3.6% | Jumping
3.6% | Playing+videogames
2.9% | Building
2.2% | Adult+male+speaking
2.0% | Fighting
1.5% | Dropping
1.5% | Attacking
1.4% | Tattooing
1.4% | Mopping

Table 4.3: Action recognition results in film ”Captain America: Civil War”

The actions of playing video games and mopping do not happen during the trailer. The action of building does not appear either, but there could be confusion because the action recogniser can perceive scenes of destroyed buildings, which could be mistaken for a building under construction; as will be seen in the other predictions, building tends to appear as a false positive. The tattooing action can be related to the action of exhibiting military art; this prediction is partly justified since during the trailer a group of people is shown fighting with weapons, which closely resembles a military group using weapons.

Finally, it should be noted that several explosions happen during the trailer that the action detector could have labelled as a burning action, which would be very symbolic of an action or adventure movie, but it does not. In this case, however, the fire and explosions are reflected in the colour histogram. In general, it has been observed that for action or adventure films the action recognition is not as successful as for the rest of the films.

The third film analysed is ”Dealing with Idiots”, whose genre is comedy and whose plot summary is: ”Faced with the absurd competitiveness surrounding his son’s youth league baseball team, Max Morris, a famous comedian, decides to get to know the colourful parents and coaches of the team”. The prediction of actions for this film can be observed in Table 4.4.

In this case the predicted actions and their percentages are very accurate. All the detected actions, except fencing, are present in the trailer. The predominant action of the film is adult men speaking, which is the predicted action with the highest percentage of presence.

Percentage | Action
5.9% | Adult+male+speaking
5.8% | Fencing
5.3% | Punting
3.6% | Pitching
3.1% | Kicking
3.0% | Sitting
3.0% | Hitting
2.2% | Discussing
2.1% | Playing+sports
1.8% | Throwing

Table 4.4: Action recognition results in film ”Dealing with Idiots”

The rest of the actions are related to the baseball games that appear in the trailer, except sitting, which is the second action, after talking, that the protagonist performs.

Furthermore, pointing is an action that occurs several times in the trailer, when the parents point at the players while commenting on them. Therefore, the prediction of the actions is highly accurate. In addition to being accurate, the actions present in the trailer summarise what the film is about in a very successful way.

The fourth film analysed is ”Sex and the City”, whose genres are comedy, drama and romance. The plot summary of this movie is: ”A New York writer on sex and love is finally getting married to her Mr. Big. But her three best girlfriends must console her after one of them inadvertently leads Mr. Big to jilt her”. The prediction is shown in Table 4.5.

This prediction is also very accurate, since all the detected actions appear in the trailer and the percentages of appearance are very precise, although for the prediction to be perfect the first and second actions should be swapped; in any case, their percentages do not differ much. The rest of the actions, except building and playing video games, do appear in the trailer. Building and playing video games reappear here and, like fencing, show up as false positives in many predictions. Among the actions that do appear, speaking, in all its variants, and kissing stand out, and they seem to be repeated in romantic films. The prediction for this trailer is accurate and descriptive.

Percentage | Action
13.2% | Adult+female+singing
10.0% | Adult+female+speaking
3.4% | Talking
3.2% | Adult+male+speaking
2.8% | Kissing
2.7% | Adult+male+singing
2.4% | Playing+videogames
2.4% | Standing
2.0% | Building
1.9% | Sitting

Table 4.5: Action recognition results in film ”Sex and the City”

From these results, several conclusions can be drawn. To begin with, it can be seen that the percentages are much lower than when the short videos were analysed. This is because, in longer videos, more actions are executed, and therefore the percentage of appearance of any one action in the trailer is lower.

It is also observed that the defining activities are not only among the first ones, but go beyond them. Therefore it is convenient, as described in Section 3.3.2.2, to introduce into the action recognition feature the complete array of activities and their percentage of presence in the trailer. Since these are percentages, the feature has the same length for the different trailers, which facilitates the normalisation of the vector.
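
As a hedged sketch of how such a fixed-length action feature could be assembled, the snippet below maps the predicted percentages of one trailer onto a common action vocabulary and renormalises the vector so it sums to one; the vocabulary and the example percentages are illustrative placeholders, not the real label set.

```python
import numpy as np

# Illustrative subset of the global action vocabulary shared by all trailers.
ACTIONS = ["cuddling", "driving", "laughing", "kissing", "adult+female+speaking"]
ACTION_INDEX = {name: i for i, name in enumerate(ACTIONS)}

def action_feature(predictions):
    """Build a fixed-length vector of action presence percentages for one trailer.

    predictions: dict mapping action name -> predicted percentage.
    """
    vec = np.zeros(len(ACTIONS), dtype=np.float32)
    for name, pct in predictions.items():
        if name in ACTION_INDEX:
            vec[ACTION_INDEX[name]] = pct
    total = vec.sum()
    return vec / total if total > 0 else vec    # re-normalise so the entries sum to 1

# Example with some of the top actions of Table 4.2.
feature = action_feature({"cuddling": 6.1, "driving": 5.3, "laughing": 5.0})
```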

Another advantage of this feature is that the actions that happen in the trailers are very representative of the type of film that it is.

Finally, the reliability of the detector should be highlighted: it finds the actions that appear in the videos, with only certain errors in low-percentage actions. Examples are the erroneous prediction of the fencing action, which tends to appear as a false positive in movies where people perform fast-moving actions, and the building and playing video games actions, which arise as false positives on many occasions. The cause may be due to several circumstances: the number of training videos for these actions may have been insufficient, the actions may be very complex, or certain habitual actions in the trailers may be systematically confused with fencing, building and playing video games. A possible improvement would be to remove those actions from the predictions, since they are not necessary to describe a movie but can introduce errors into the predictions.

4.1.2 RGB Histogram Feature

In order to analyse the results of the histograms, the tests have been divided into two. A first test makes use of symbolic images with very concrete situations, easy to analyse and to check visually whether the histograms are working properly. A second test with movie trailers has been performed to check how the proposed histogram extraction works in the real scenario.

4.1.2.1 Symbolic images

In this section, the colour histograms of a series of very symbolic images with different characteristics have been computed to analyse the histogram behaviour. The aim is to differentiate between inside and outside, day and night, and between landscape scenarios. The results of this test are shown below.
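
The per-channel histograms analysed below can be reproduced with a short OpenCV routine. The following is a minimal sketch; the 256-bin resolution follows the plots, while normalising by the maximum bin (so the highest peak equals 1) is an assumption made to match the values discussed in the text, and the file names are placeholders.

```python
import cv2
import numpy as np

def rgb_histogram(image_path, bins=256):
    """Return one normalised histogram per colour channel (OpenCV loads images as BGR)."""
    image = cv2.imread(image_path)
    histograms = []
    for channel in range(3):
        hist = cv2.calcHist([image], [channel], None, [bins], [0, 256]).ravel()
        histograms.append(hist / hist.max())      # highest peak scaled to 1
    return np.stack(histograms)                   # shape: (3, bins)

# Example: compare a bright outdoor frame with a darker indoor one.
# outside = rgb_histogram("outside_example.jpg")
# inside = rgb_histogram("inside_example.jpg")
```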

Inside and outside

The image used for the outside case is Image 4.1, and for the inside case it is Image 4.2.

Figure 4.1: Example outside image

Figure 4.2: Example inside image

The resulting histograms for the outside scenario are shown in Figure 4.3. As can be seen in the results, for the blue and green colours there is a peak that almost reaches the maximum at the brightest intensity, 255. For the red colour there is also a peak at the brightest intensity, but it stays below the value of 0.8. This means that there is a lot of light, which characterises an outside image.

In addition, although the histogram has values all along it, most values tend to be on the bright side of the histogram. There is no peak at the lowest, dark values: below the intensity value 70, the histogram values do not exceed the normalised value of 0.3, except for the red colour, which has a peak of 0.6 in the first dark values.

Figure 4.3: Outside example results.

The results for the inside image can be seen in Figure 4.4. The result for the interior image is somewhat more diffuse, since although it is an inside scene it is a well illuminated area. Even so, the histogram values of the three colours tend to be in the left, dark area; the largest amount of colour is below the intensity of 120. Peaks can also be found in the bright values of the three colours, with values lower than 0.75 for blue and green but a very high value, of almost 1, for the red colour.

Figure 4.4: Inside example results.

In general, the results for differentiating inside and outside images are very good, since with the histogram it can easily be distinguished whether the image was taken in an indoor area or outside. The values of the colour histogram for the outside image are concentrated above the value 70, while for the interior they are concentrated below the value of 120. In the outside image the peaks at the brightest intensity are very high, except for the red colour, while in the interior image only the red peak is high.

Day and night

The image used for the day case is Image 4.5, and for the night case it is Image 4.6. These images represent a sunny day in a city and a normal midnight in the same city.

Figure 4.5: Example day image

Figure 4.6: Example night image

The results for the day image are drawn in Figure 4.7 and present relevant characteristics for differentiating between day and night. Although there are certain values along the histograms, they are not significant, since they do not even reach the intensity value of 0.05. And there is a peak, in the three histograms, on the right side of each one. This indicates that a large amount of brightness has been perceived, which was the desired result.

It is a simple but very effective result. It is clearly recognised, both visually and from the single peak at high values, that the image is one in which luminosity predominates.

Figure 4.7: Day example results.

The results for the night image can be seen in Figure 4.8. In this case, the opposite situation appears in the histograms: there is a series of very low values, which do not reach the intensity value of 0.05, but the peak is on the left, dark side. So clearly a situation in which darkness is predominant is being described.

Figure 4.8: Night example results.

The comparison between the histogram of a day image and the histogram of a night image shows important differences between them, which are clearly observable and easily computable. It does not give rise to confusion and can be used to tell a day image from a night one. In addition, the compared images are taken from a similar perspective of the Eiffel Tower.

Mountain and sea

The images used in this test are Image 4.9 and Image 4.10. Both present very concrete colour patterns that can be very useful to differentiate films related to the sea from films more related to war, sports and similar settings. The analysis is presented in the next paragraphs.

Figure 4.9: Example mountain image

Figure 4.10: Example sea image

The results for the mountain image are shown in Figure 4.11 and the results for the sea image in Figure 4.12.

For the mountain image the histograms show that the blue colour appears in a dark range, practically as a single peak, without much luminosity spread towards the highest intensity values, while the green and red colours appear in an intermediate range of luminosity, spanning a large range of brightness with high intensity values.

Figure 4.11: Mountain example results.

The extracted histograms for the sea image show, for the blue and green colours, a peak of intensity at a centred luminosity value tending to the right, while for the red colour there is hardly any intensity.

Figure 4.12: Sea example results.

The comparison between the mountain and sea histograms is very interesting, since each image shows a very different colour histogram.

In general, the results of the histograms for symbolic images are very interesting. They show how the histograms help to differentiate between situations, which can be applied to describe settings and scenarios, and that is what is needed to create a good feature vector.

4.1.2.2 Trailers colour histogram

This section presents and analyses the results obtained after applying the colour histogram feature extraction process used to obtain the histogram feature, which is explained in Section 3.3.2.3.

The first film analysed is ”Batman & Robin”, whose genres are action and Sci-Fi. Its colour histograms are represented in Figure 4.13.

Figure 4.13: ”Batman & Robin” histogram results.

The result shows a very high intensity peak at a very dark value of the luminosity spectrum. In addition, from that peak towards the brighter values the histogram presents null or practically null values. Batman movies are usually recognised for a very dark environment, which is why this result is very indicative of the type of film being analysed.

The next film analysed is ”Someone Marry Barry”, a comedy. The results are shown in Figure 4.14.

Figure 4.14: ”Someone Marry Barry” histogram results.

The results show a very broad luminosity range with significant intensity values. Although there is still an intensity peak in the dark values, there are many values tending towards brightness. All these values are significant, since many frames of the trailer have been analysed. It is important to note that there are intensity values, although low, up to the lightest luminosity, which was not the case with ”Batman & Robin”. The trailer is bright and many colours can be appreciated.

The third trailer analysed is ”17 Again”, whose genres are comedy and drama. The results of the colour histogram of this film can be seen in Figure 4.15.

Figure 4.15: ”17 again” histogram results.

The results for this movie behave similarly to the previous one: the range of intensities is relatively large but the intensity peak at the dark luminosity value is maintained. Even with a behaviour similar to ”Someone Marry Barry”, the values are easily distinguishable. It can be seen that the green colour has more intensity at high luminosity values. It is also observed that there is some intensity, even if small, throughout all the luminosity values, which seems to be the mark of a luminous film. The trailer is bright, with a wide range of colours.

The fourth film is ”Night at the Museum”, an action and adventure comedy. The results of the histogram are shown in Figure 4.16.

Figure 4.16: ”Night at the Museum” histogram results.

In this film the same behaviour appears as for the previous two. It can be observed that luminosity values somewhat higher than the dark values have been detected. In addition, it still maintains a certain level of intensity throughout the entire luminosity range, low but not 0. In this case, the higher intensity range is concentrated at low luminosity values, which may be due to the fact that the film takes place indoors. For these reasons the histogram gives a good result, since it reveals a film with light but shot in interiors.

The fifth film analysed is ”A Resurrection”, a horror and thriller movie. The results for this movie can be seen in Figure 4.17.

Figure 4.17: ”A resurrection” histogram results.

In this case the results are somewhat similar to those of ”Batman & Robin”, but with an important difference: although a dark film is perceived, many different values can be seen in the low luminosity range, indicating that although it is dark it can have many colours and details. In addition, there are intensity values throughout the whole luminosity range which, although nearly null, do not reach 0. This is a big difference with respect to films like ”Batman & Robin”, where almost the entire luminosity range has intensity equal to 0. The key difference between both films is that ”Batman & Robin” is an exterior, night film, while this one takes place in indoor environments at night.

The sixth film analysed is ”Say It Isn’t So”, a romantic comedy. The results can be seen in Figure 4.18.

Figure 4.18: ”Say It Isn’t So” histogram results.

This film is a clear example that not all comedy films are bright. The graph shows an intensity peak at a low brightness value, and the rest of the intensity values are null, except for a couple of luminosity values. The result corresponds to a dark film, as can be confirmed by watching the trailer.

The last example is the movie ”It’s Complicated”, a romantic comedy. This movie has been selected to explain the intensity peaks at dark values that appear in all movies. The results for this movie are shown in Figure 4.19.

Figure 4.19: ”It’s Complicated” histogram results.

In general it behaves like a luminous film, with values throughout the luminosity spectrum. But in this last case the appearance of a peak at a high luminosity value stands out. By analysing the trailer of the movie it can quickly be found where that peak comes from: between certain scenes of the trailer there are transitions with complementary or explanatory texts. The background of these transitions is usually black, but in this case it is white, which makes these peaks appear.

In conclusion, the results are very interesting and show that the histograms are very useful as a feature vector. The histograms are almost like an identifier of the movie, and they show certain patterns. In addition, Section 4.1.2.1 has shown how the histogram is able to differentiate between certain situations. It is also important to emphasise that every value of the histogram, however small, is of great importance as long as it is not 0, since it carries information from many frames.

Another important aspect is the systematic appearance of peaks at dark values of the luminosity spectrum. Both the transitions with explanatory text and the fade-out or fade-in transitions explain those black peaks in all movies.

From the results it can also be concluded that, in general, the comedy, adventure and action genres are colourful, although there are many cases in which they are dark, while the horror and thriller genres are almost always dark, with only some movies that are not purely dark but have nuances.

4.1.3 Object detector

To analyse the results of the object detector, the tests have been divided into three. Firstly, detection is run on a series of images containing the different types of objects that the detector is able to find. Secondly, the result of processing the intraframes of the trailers with the object detector is analysed. Finally, the final object features are examined in order to check that the objects are correctly represented and detected.

4.1.3.1 Symbolic images

This section shows the result of applying the object detector to symbolic movie images, that is, images containing objects that usually appear in films and that the detector should therefore be able to find.

To begin with, the detection on a series of images containing animals has been verified. A sample of some of the tested images can be seen in Figure 4.20.

Figure 4.20: Animals detection

The detection of animals is very accurate, with a high probability that the detected object is the correct one. The deep learning object detector is capable of detecting a wide range of animals; even when they are only partially visible, as in the image of the horses, the detector finds the objects accurately.

The following objects to test are vehicles. The detection on a vehicle sample can be seen in Image 4.21.

Figure 4.21: Vehicle detection

As with animals, the detection of vehicles is very accurate. The detector is capable of recognising a wide range of vehicles of all kinds, even, as will be shown in Section 4.1.3.2, science-fiction spaceships; in this result the detection of old ships can also be observed.

Below is a sample of the tested images with items of sports equipment (Figure 4.22).

Figure 4.22: Sport equipment detection

The trained object detector is also able to detect many types of sports equipment, from helmets to balls of all kinds or rackets. Other objects of interest to detect are weapons and other objects related to war, crime and murders. In Figure 4.23 two examples of weapon detection are shown.

Figure 4.23: Weapon detection

The object detector is capable of detecting guns, which will be very convenient for action or crime movies.

But the object detector also makes mistakes. For example, in the image of two picture frames shown in Figure 4.24 it can be seen that, although it has detected one of the frames, the other is not found. This problem does not make a difference in the object feature vector, since the vector only indicates whether an object is present during the trailer or not. But the error of false positives, like the one seen in Figure 4.25, is a problem for the feature vector, since it indicates the presence of an object that is not there. This last type of error appears infrequently and should not be a problem in a vector of such large dimensions, but it is important to check that it does not distort the feature vectors excessively.
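
A hedged sketch of the object feature referred to here is shown below: a binary presence vector over the detector's label set, where duplicates change nothing and a false positive wrongly switches one position on. The label list is an illustrative placeholder, not the real detector vocabulary.

```python
import numpy as np

# Placeholder subset of the detector's label vocabulary.
OBJECT_LABELS = ["person", "human face", "vehicle", "animal", "clothing", "picture frame"]
LABEL_INDEX = {name: i for i, name in enumerate(OBJECT_LABELS)}

def object_presence_vector(detections):
    """Mark with 1 every object class detected at least once in a trailer."""
    vec = np.zeros(len(OBJECT_LABELS), dtype=np.float32)
    for label in detections:
        if label in LABEL_INDEX:
            vec[LABEL_INDEX[label]] = 1.0     # a missed second instance changes nothing
    return vec

vector = object_presence_vector(["person", "person", "picture frame"])
```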

Figure 4.24: Not all objects detected

Figure 4.25: Wrong detection

4.1.3.2 Trailer object test

To test the application of the object detector to the intraframes of the trailers, it has been applied to a series of trailers and the most notable findings have been analysed.

To begin with, the detector's great ability to find people stands out. Figure 4.30 shows a series of frames where the figure of a person could give rise to confusion and the detector still finds them. Figure 4.26 shows a silhouette in a very dark place. Figure 4.27 shows a cartoon person. Figure 4.28 shows a face in a very blurry image. Finally, Figure 4.29 shows a group of people in a meeting in which one of them appears semi-transparent. In all these cases the detector is able to find the people on the screen and identify them as such.

Figure 4.26: Dark place person

Figure 4.27: Cartoon person

Figure 4.28: Blurry image person

Figure 4.29: Semi-transparent person

Figure 4.30: Person detection

Another curiosity observed in the trailers is the detection pattern of the object called human face. When a human face is detected it is usually because it is a fairly close-up shot. Two examples of this can be seen in Figure 4.32.

Figure 4.31: Not all objects detected

Figure 4.32: Human face detection

This feature is very interesting, since the feature vector will mark whether or not there are close-up shots in a movie.

The trailers also show the detector's great ability to find vehicles. In Figure 4.33 a burning car is observed; it is not fully visible and it is still detected. In Figure 4.34 a ship is observed from a perspective in which it is not fully seen either from the front or from the side, and the detector still discovers the vehicle without problems. Finally, Figure 4.35 shows a science-fiction spaceship that the detector is able to recognise.

Figure 4.33: Burning car

Figure 4.34: Boat

Figure 4.35: Sci-Fi ship

Another interesting detection is the clothing of the people. In general, the detector only finds clothes that are very sharp, but when applied to the trailers it finds clothes when they differ from everyday garments. For example, it recognises clothes such as robes or outfits typical of science-fiction movies. Examples of this are the ”Harry Potter” and ”Star Wars” films. Figure 4.36 shows two examples from a ”Harry Potter” movie where the detector puts the clothing label on the boys dressed in the film's own outfit, and Figure 4.37 shows the same behaviour in two frames of a movie from the ”Star Wars” saga. Through this peculiarity of the object detector it is easier to find science-fiction movies.

Finally, two peculiar cases are shown in which the object detector is able to find objects in cartoons, as was already shown in Figure 4.27, where it detected a person. Figure 4.38 shows two images: in the first the object detector finds the animal that appears in the frame and in the second it finds the toy in the frame. In the case of the toy, on certain occasions it is labelled as an animal; when analysing the trailer, both the toy object and the animal object would be indicated in the vector.

Figure 4.36: ”Harry Potter” clothes detection

Figure 4.37: ”Star Wars” clothes detection

Figure 4.38: Object detections in cartoons

4.1.3.3 Trailer object results

Once the extraction of objects from all the trailers is done, this section offers a summary of the results.

To begin with, the 10 most frequently found objects in the movies have been analysed. Table 4.6 shows the most detected objects and the number of movies in which they appear.

Objects detected | Number of films
Person | 4020
Clothing | 4007
Human Face | 4004
Vehicle | 3900
Plant | 3419
Poster | 3350
Building | 3269
Human Head | 3035
Animal | 2550
Personal care | 2408

Table 4.6: Most detected objects

As can be seen, there is only one movie in which no people are detected. That movie is ”Misery”, in whose trailer no person appears.

It is also observed that clothing has been detected in a large number of films, so the previously mentioned characteristic of detecting clothing only in films with very characteristic garments does not seem to hold.

The object labelled ”Human face” appears very often, and it has previously been shown to be detected in close-up shots. This makes sense, since it is a type of shot that usually appears in movies and trailers.

The objects that appear least, without counting the objects that never appear, are shown in Table 4.7. These objects appear in only one movie each, which is also indicated in the table.

The appearance of these objects in the corresponding films has been checked, and they do appear in the trailers.

Objects detected | Film
Ski | Everest
Office building | Hardcore Henry
Swimwear | Couples Retreat
Rocket | Around the World in 80 Days
Wine | The Heavy
Luggage and bags | The Tourist
Tablet computer | Paranoia
Drums | Coffee Town
Tie | Limitless
Shelf | This Is the End

Table 4.7: Least detected objects

For example, the trailer for the film ”Hardcore Henry” largely takes place in a scientific facility, which can be considered an office building.

In the movie ”The Heavy” a bottle of wine appears. It is important to note that it is a bottle and not a glass, since, as seen in Section 4.1.3.1, wine glasses are labelled as drink and appear more often across different movies.

In the trailer of the movie ”The Tourist” suitcases appear. In this case the object is descriptive of much of the trailer's settings, which take place on a train and in a hotel.

The fact that the objects describe a setting of the trailer also happens in the movie ”This Is the End”, where a shelf appears in a supermarket.

Other data of interest: the number of objects that appear in at least one movie is 138; the average number of objects per film is 15; the maximum number of objects in a movie is 31, in ”Resident Evil”; and the minimum is 3, which happens in ”Jaws: The Revenge” and ”Mozart’s Sister”.

4.1.4 Optical flow

In this section the results of applying optical flow to the trailers are examined. First, a section with examples in short videos is presented, and then a section with tests showing the behaviour of the optical flow on the trailers, together with data from the optical flow already computed for all the trailers.

4.1.4.1 Short video test

This section shows the visual results of applying the optical flow designed for the trailers to short videos.

The first example is the Beatles dancing. The result of the three stages of the optical flow can be seen in Figures 4.39 and 4.40. The first figure is the representation of the optical flow calculated with Farneback, the second is the HSV representation, and the last is that HSV representation converted to black and white. This last representation is the one that will be used to count the intensity of the pixels in a frame.

Figure 4.39: Dancing optical flow representation

Figure 4.40: Dancing optical flow HSV representation

This is an example of intermediate activity. In the first figure it can be seen how the arrows indicate the direction of the movement and their length its amount. In the second, the colours indicate the amount of movement and their position where it is happening; this representation does not indicate the direction of the movement. The third figure is equal to the second but in black and white, giving the intensity of the movement in a scene.
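
The three representations described above can be produced with OpenCV's dense Farneback optical flow. The following is a minimal sketch for a pair of consecutive frames; the Farneback parameters are the common OpenCV example values, not necessarily those used in the thesis.

```python
import cv2
import numpy as np

def flow_representations(prev_frame, next_frame):
    """Return the HSV visualisation and the greyscale magnitude of the dense optical flow."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

    hsv = np.zeros_like(prev_frame)
    hsv[..., 0] = angle * 180 / np.pi / 2                                   # hue: direction
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX)   # value: amount
    hsv_view = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

    gray_view = hsv[..., 2]      # black and white image used to count pixel intensities
    return hsv_view, gray_view
```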

The second example shows two women talking. In this case there is hardly any movement, so some movement can only be appreciated in the arrow-based Farneback optical flow representation, in Figure 4.41. The movement in the other two representations is so minimal that it is not appreciable; their output is shown in Figure 4.42.

Figure 4.41: Talking optical flow

Figure 4.42: Talking optical flow HSV representation

As explained, in this case the activity in the video is very low and the optical flow captures it perfectly. The visual result of the HSV representation and of the same representation in black and white is an image that appears black.

The last example contains a large amount of movement: two people practising martial arts. As for the dancing action, the Farneback result is presented first in Figure 4.43, and then the HSV representation and the same representation in black and white are presented in Figure 4.44.

Figure 4.43: Fighting optical flow representation

Figure 4.44: Fighting optical flow HSV representation

It can be observed that in the last two representations much more movement is appreciated than for the dancing action. This result is very descriptive: although the dancing video has movement, the movement of the fight is much bigger and more abrupt, and the optical flow captures and represents it as such.

4.1.4.2 Trailers optical flow test

To evaluate the optical flow, two parameters are taken into account: the maximum value of the normalised optical flow vector and the sum of the whole vector. To compare two vectors, the difference between them is computed. The maximum value indicates the strongest movement in the trailer, but it can also happen that the maximum value is not very high while there is a certain amount of movement throughout the trailer, with many frames showing high values; the sum of all the positions evaluates this. The difference between vectors is also used to check which films have the farthest optical flow vectors and which the closest.
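
A small sketch of these three measures follows, with random placeholder vectors standing in for the 200-position normalised optical flow features; the absolute difference used here is one plausible choice, since the text does not fix the exact distance.

```python
import numpy as np

# Placeholder normalised optical flow vectors for two trailers (200 positions each).
flow_a = np.random.rand(200); flow_a /= flow_a.sum()
flow_b = np.random.rand(200); flow_b /= flow_b.sum()

maximum_a = flow_a.max()                     # strongest single burst of movement
addition_a = flow_a.sum()                    # overall amount of movement in the trailer
difference = np.abs(flow_a - flow_b).sum()   # how far apart the two whole vectors are
```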

Table 4.8 shows the maximum and addition values for 10 movies. The values of the second column of this table (addition) are not normalised; normalisation is done by putting all the vector features together, and once normalised the sum of all the values of a vector must be 1. To show the difference between the sums of the vectors in the table, it was decided to show only the non-normalised values in the addition column.

Film | Maximum value | Addition
Behind Enemy Lines II: Axis of Evil | 0.0729 | 3.1471
Akeelah and the Bee | 0.0528 | 2.3665
Abandon | 0.0743 | 4.7191
Birthday Girl | 0.0429 | 3.9232
Awful Nice | 0.0404 | 3.9514
The Longest Ride | 0.0426 | 3.1023
Dear White People | 0.0476 | 3.8509
Jimmy P. | 0.0352 | 2.8429
Colors | 0.0536 | 5.7143
Half of a Yellow Sun | 0.0390 | 3.4867

Table 4.8: Optical flow comparison

The highest results correspond to movies with a lot of movement, and the lowest to films with little movement.

There are also cases like the trailer of ”Birthday Girl”, in which the maximum optical flow value is not very high but the sum is, because the trailer has no scenes of excessive movement yet maintains a medium level of movement throughout. There are also examples like ”Colors”, which has a high maximum value and a very high sum: it maintains high values along almost the entire vector, which produces a very high addition.

A high maximum but a low addition would indicate that there are one or a few scenes with a lot of movement while the rest has little movement, but this situation is hard to find.

From these films, an analysis has been made focusing on three of them: ”Behind Enemy Lines II: Axis of Evil” (action and thriller), ”Akeelah and the Bee” (drama) and ”Abandon” (drama), to show the importance of using the whole vector and not just the maximum value and the sum.

The difference between the vectors of ”Behind Enemy Lines II: Axis of Evil” (an action movie with a lot of movement) and ”Abandon” (a drama with a lot of movement) is 0.1495. The difference between the vectors of ”Akeelah and the Bee” (a drama with little movement) and ”Abandon” (a drama with a lot of movement) is 0.2005. Finally, the difference between the vectors of ”Akeelah and the Bee” (a drama with little movement) and ”Behind Enemy Lines II: Axis of Evil” (an action movie with a lot of movement) is 0.0728.

It can be observed that there is less difference between the action film and the slow drama than between the action film and the fast drama, or even than between the two dramas. This is because, while ”Abandon” maintains very high movement values across its 200 positions, the values of ”Behind Enemy Lines II: Axis of Evil” drop sharply from position 10 onwards, resembling the vector of the slow drama more closely. These values appear because in the action movie there are a few frames with a lot of movement and the rest with only some movement, while in the drama with a lot of movement there is high movement in many frames. This shows that it is important to include in the vector not only the maximum or the sum of values, but all the values. It can also be seen that action movies are not always the ones with the most movement.

In conclusion, optical flow does not have to be related to the genre: a comedy can have a lot of movement, just like a horror or an action film. But it is interesting that, if someone likes a film with a lot of movement, the recommender suggests a film with a similar amount of movement. And it is important to keep the whole vector in order to have a good optical flow feature.

4.1.4.2.1 Trailer optical flow results

This section shows the results of analysing the optical flow of all the trailers. The film with the largest sum is ”The Repo Man”, with a sum of 11.0297. The film with the lowest is ”A Very Harold & Kumar 3D Christmas”, with a sum of 0.1414.

This result is very interesting, since it shows the effectiveness of the scene change detector. In this trailer there is hardly any movement: three people are talking, but there are many scene changes, since each person is filmed with a separate camera that is cut to each time he speaks. Despite those scene changes, which without the detector would produce peaks of optical flow, the optical flow correctly reflects that there is very little movement in the trailer.

The average value of the sum of optical flows over all films is 3.2510. The film with the highest maximum optical flow is ”The Repo Man”, the same one that has the highest sum, with a maximum of 0.2245. The average of the highest maximum of each film is 0.04091.

4.1.5 Joined Feature

Once the extraction of the four features of all the films is normalised, a vector per film is created. In this section, the joined features are analysed. To this end, a series of machine learning tools have been applied to visualise the data, and clusters have been computed that offer a first classification of the films. This step has only been done to check the features, since after joining the features they are passed directly to the embedding, without intermediate steps.

In order to represent the data in a way accessible to human understanding, it has been processed with PCA, which projects the data into another subspace where the principal components of the multidimensional data are ordered in a new vector representation. In this case, a reduction from the 1445 dimensions of the original train set to 512 has been used. To represent the data in a graph, the first two components of the PCA have been used. The results are drawn in Figure 4.45.

Figure 4.45: Join feature with PCA scatter

The two main components of the PCA give an idea of the separation and disposition of the data. This is not a realistic scenario, but it makes it possible to see how the data is structured. Although PCA would allow working with the data, it would not be enough.

Finally, the TSNE tool has been used to display the data in a two-dimensional subspace. It is a dimensionality reduction technique particularly well suited to the visualisation of high-dimensional datasets. The technique can be implemented via Barnes–Hut approximations, allowing it to be applied to large real-world datasets. The results of using TSNE on our feature dataset are shown in Figure 4.46.
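
Both projections can be obtained with scikit-learn; the following is a minimal sketch in which a random matrix stands in for the joined feature matrix (the 1445 input dimensions and the 512 PCA components are taken from the text, the number of films is a placeholder).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Placeholder joined feature matrix: one 1445-dimensional vector per film.
features = np.random.rand(1000, 1445)

# PCA: project onto 512 ordered principal components, then plot the first two.
pca_features = PCA(n_components=512).fit_transform(features)
pca_2d = pca_features[:, :2]

# TSNE with the Barnes-Hut approximation for a 2D visualisation.
tsne_2d = TSNE(n_components=2, method="barnes_hut").fit_transform(features)
```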

Figure 4.46: Join feature with two dimensional TSNE

The TSNE results show a significant differentiation between films, but the films are still very close together and mixed with respect to genres. To improve the separation taking the genres into account, the trained embedding is the proposed solution to improve the data separation for training.

4.2 Embedding

To help find a better representation of the data in a subspace that facilitates the differentiation of the films, an embedding is used. This section presents the results of using deep learning networks to create an embedding of the data extracted from the trailers, learning a new representation based on the genres of the films.

4.2.1 Embedding training

Using the deep neural network described in Section 3.4, and taking as input the normalised feature vector, the network was trained as a multi-label classification problem using the genres of the films. During the training process, the loss and accuracy values and their evolution can be monitored. The binary cross-entropy function was used as the loss function, since it is a good choice for classification problems where labels can be converted to one-hot vectors. For accuracy, two metrics have been defined (a sketch of one possible implementation follows the list):

1. The first accuracy metric measures whether at least one of the nine predictions (nine genres) is among the correct labels.

2. The second accuracy metric measures whether all the labels (if more than one exists in the multi-label case) have been predicted correctly.
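
A minimal sketch of one possible implementation of both metrics is given below, assuming multi-hot label vectors and sigmoid-like scores thresholded at 0.5; the threshold, shapes and example data are illustrative, and the exact-match reading of the second metric is one interpretation of the definition above.

```python
import numpy as np

def accuracy_at_least_one(y_true, y_pred, threshold=0.5):
    """Fraction of films where at least one predicted genre is among the true genres."""
    hits = ((y_pred >= threshold) & (y_true == 1)).any(axis=1)
    return hits.mean()

def accuracy_all_labels(y_true, y_pred, threshold=0.5):
    """Fraction of films where the thresholded prediction matches every genre label."""
    exact = ((y_pred >= threshold) == (y_true == 1)).all(axis=1)
    return exact.mean()

# Toy example: 3 films, 9 genres, multi-hot ground truth and random scores.
y_true = np.zeros((3, 9))
y_true[0, [0, 2]] = 1
y_true[1, 4] = 1
y_true[2, [1, 5, 8]] = 1
y_pred = np.random.rand(3, 9)
print(accuracy_at_least_one(y_true, y_pred), accuracy_all_labels(y_true, y_pred))
```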

The training loss and the two accuracy metrics defined above can be seen in Figure 4.47 and Figure 4.48 respectively. In the second figure, the accuracy over a single genre (in orange) and the accuracy when trying to predict all the labels correctly are presented.

Figure 4.47: Embedding loss

Figure 4.48: Embedding accuracy

It can be seen in Figures 4.47 and 4.48 that the loss reaches a value of 0 around epoch 300 and that the accuracy is 100%. This means that the network has learnt to predict all the possible situations correctly, that is, to predict all genres correctly.

At this point of the project a practical result is already available: by means of the embedding prediction, the network can find the genres of a film. Once the network was trained, new films could be introduced, their features extracted and the prediction of their genres made.

100 film trailers have been downloaded and used as a validation set. These films are completely different from the films in the dataset. The accuracy of the genre prediction has been tested using this validation set: the Rank 3 accuracy (at least one correct prediction among the correct labels) was 85% and the Rank 1 accuracy (all the correct labels obtained by the predictions) was 76%. These results show that 100% accuracy in training can be a good result as long as no overfitting is present.

In this case, some overfitting appears when training beyond 500 epochs. To reduce this problem, the model at epoch 300 has been used. The results with this model are 88% Rank 3 accuracy and 79% Rank 1 accuracy. Since these results are better, this model is the one used to extract the final embeddings fed to the recommendation engine networks.

4.2.2 Embedding prediction

As explained in Section 3.4, after training the network the last layer is removed and a prediction is made for all the features, obtaining them in a new subspace of 2048 dimensions (the number of neurons in that network layer).
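
A minimal sketch of this step is shown below, assuming a Keras model: a second model is built that stops at the 2048-neuron layer preceding the genre output, and its predictions are used as the embedded features. The toy classifier, layer names and data are placeholders standing in for the network of Section 3.4.

```python
import numpy as np
from tensorflow.keras import layers, models

# Toy stand-in for the trained genre classifier (1445 joined features -> 9 genres).
classifier = models.Sequential([
    layers.Input(shape=(1445,)),
    layers.Dense(2048, activation="relu", name="embedding_layer"),
    layers.Dense(9, activation="sigmoid", name="genre_output"),
])
# classifier.fit(...) would be run here with the multi-hot genre labels.

# Remove the last layer: predict only up to the 2048-dimensional hidden representation.
embedding_model = models.Model(inputs=classifier.input,
                               outputs=classifier.get_layer("embedding_layer").output)

joined_features = np.random.rand(10, 1445).astype("float32")   # placeholder input
embeddings = embedding_model.predict(joined_features)          # shape: (10, 2048)
```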

To get an intuitive representation, PCA and TSNE were used to project the data into 2D space. Figure 4.49 presents the embedded features using PCA on the left and using TSNE on the right.

Figure 4.49: Embedding feature representations
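For reference, these 2D projections can be obtained with scikit-learn; this is a minimal sketch assuming the 2048-dimensional vectors are stored in an array called embeddings, with a purely illustrative perplexity value:

    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    pca_2d = PCA(n_components=2).fit_transform(embeddings)
    tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

    fig, (ax_pca, ax_tsne) = plt.subplots(1, 2, figsize=(12, 5))
    ax_pca.scatter(pca_2d[:, 0], pca_2d[:, 1], s=4)
    ax_pca.set_title("PCA")
    ax_tsne.scatter(tsne_2d[:, 0], tsne_2d[:, 1], s=4)
    ax_tsne.set_title("TSNE")
    plt.show()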

4.2.3 Comparison between using and not using the embedding

This section presents some representations of both the joined feature vectors and the embedded feature vectors, in order to find out whether the embedding results in an improvement in data separation.

The first representation made for the comparison shows where in the dataset the films with a specific genre lie: the films that have only that genre, the films with that genre plus one more, and finally the films with that genre plus two more, since, as previously mentioned, the maximum combination of genres is three.

The test has been made for action-genre films. The results are shown in Figure 4.52 and Figure 4.55: the films that only have the action genre are shown in red, the films with the action genre plus one more in yellow, and the films with the action genre plus two more in green.

Figure 4.50: Without embedding Figure 4.51: Embedded

Figure 4.52: PCA action representation with 1 (red), 2 (yellow) and 3 (green) for action genre

Figure 4.53: Without embedding Figure 4.54: Embedded

Figure 4.55: TSNE action representation with 1 (red), 2 (yellow) and 3 (green)

The results for both PCA and TSNE show that the films are now more spread out. The optimal result would be all the red points grouped together, surrounded by the yellow ones and finally by the green ones, meaning that action-only movies are very close, then action plus one other genre, and finally action plus two other genres. This does not happen in such a clear way because of the complexity of the different genre combinations. As can be observed, the films with two or three genres that include action are distributed throughout the space, since their position also depends on the other genres that compose them. The same effect happens for the rest of the nine genres.

The second test represents, in three dimensions, movies that have only one genre, taking three genres at a time. Only three genres are used simultaneously to better visualise the result, since plotting all the single-genre films at once was not clear enough. The first test uses single-genre films from action, science fiction and horror. The result can be seen in Figure 4.58.

Figure 4.56: Without embedding Figure 4.57: Embedded

Figure 4.58: PCA three genres representation action (red), science-fiction (yellow) and horror (green)

These results are not much better. It can be observed that without the embedding the single-genre films of different genres were close together, so much so that in Figure 4.56 they appear to lie almost in two dimensions, except for a couple of horror films that move away. On the other hand, in Figure 4.57 the films are distributed across the three dimensions, and differentiation between them can be appreciated, which is what was sought with the embedding.

The test has been repeated for three different genres: thriller, crime and adventure. To better see the result, the 3D perspective has been changed. The representation can be seen in Figure 4.61.

Figure 4.59: Without embedding Figure 4.60: Embedded

Figure 4.61: TSNE three genres representation adventure (red), crime (yellow) and thriller (green)

The results shown by these new graphs are a little clearer, although still difficult to interpret. It is observed again that without the embedding, in Figure 4.59, the films are much closer together and more clustered than with the embedding in Figure 4.60.

From these results it can be concluded that the embedding has improved the layout of the feature vectors and is able to separate features of the same nature in the space. It can also be stated that representing the features in a 2D or 3D space is very complex, since a lot of information is lost during the projection and the values used for the representation may not be very representative of the entire feature vector.

4.3 Distances

After extracting the distances, a vector of length 4021 is obtained per film, that is, a 4021 x 4021 matrix. In this matrix the diagonal is composed of ones, indicating that each movie is closest to itself. The rest of the values indicate the distance between that film and the film at each position of the vector, so the closer the value is to 1, the more similar the movies are.

Two types of distances have been considered: the cosine and the Euclidean distances. Both matrices would serve for training; the difference between the two is that the cosine distances have somewhat lower values, but the distribution is very similar.
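A sketch of how both matrices can be built with scikit-learn follows; the per-row rescaling of the Euclidean distances (so that a film compared with itself scores 1 and the farthest film in that row scores 0) is an assumption, since the exact normalisation is not detailed here, while the cosine similarity has ones on the diagonal by construction:

    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

    # embeddings: (4021, 2048) embedded feature vectors of the dataset films
    d_euc = euclidean_distances(embeddings)                    # raw distances, diagonal = 0
    sim_euc = 1.0 - d_euc / d_euc.max(axis=1, keepdims=True)   # assumed per-row rescaling to [0, 1]
    sim_cos = cosine_similarity(embeddings)                    # diagonal = 1 by construction

    # Ten most similar films to film i, skipping the film itself (first position after sorting)
    i = 0
    top10 = np.argsort(sim_euc[i])[::-1][1:11]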

From these matrices a series of peculiarities can be drawn, the most important being which movies differ most from the rest of the dataset.

4.3.1 Euclidean distances

In the matrix of Euclidean distances, a series of films can be observed that appear in many distance vectors with a value of 0, that is, they are as different as possible from the movie for which they obtained that 0.

There are five such films. The films that are furthest from the whole dataset are "Citizen Gangster" (distance 0 on 2049 occasions), "Death Becomes Her" (distance 0 on 885 occasions), "Paul Blart: Mall Cop 2" (distance 0 on 44 occasions), "Storm Surfers 3D" (distance 0 on 1015 occasions) and "All Roads Lead to Rome" (distance 0 on 26 occasions).

In general, this abrupt distance is probably because the characterisation that has been made does not fit these five films well. It could also be because they are very unusual movies that are difficult to recommend. However, the movie "Death Becomes Her", for example, is a well-known film, with a 6.4 rating and 76307 votes on IMDb; it should be possible to reach this movie from others, yet in the feature space it is very distant.

It is also interesting to analyse the films that are recommended according to the distances, that is, the film with the second-highest value after the 1 associated with the movie itself. After analysing these movies, the opposite of what happens with the most different films is observed: there is no pattern, nor a short list of films that repeat themselves. The most repeated film appears only 8 times. This is very interesting, because it means that the first recommendations are not drawn from a small range of films, so the problem of hiding movies that are never recommended is avoided. More than half of the films would be recommended at least once according to the distance vectors.

As a first subjective comparison, the same film was used to obtain 10 recommendations from each of the proposed solutions. The film is "E.T. the Extra-Terrestrial" and the recommendations made using only the distances can be seen in Table 4.9.

The recommendations include many adventure and comedy films; considering that most movies in the dataset are dramas, this shows that the embedding has worked properly.

Position | Name | Genres
1 | La mujer de mi hermano | Drama
2 | Space Cowboys | Action, Adventure, Thriller
3 | Layover | Drama, Romance
4 | Sing Street | Drama, Music
5 | White Vengeance | Action, Drama, History
6 | Hail, Caesar! | Comedy, Mystery
7 | Cat People | Drama, Fantasy, Horror
8 | Half Baked | Comedy, Crime
9 | She-Devil | Comedy
10 | Apartment 1303 3D | Horror

Table 4.9: Euclidean distances E.T recommendations

Despite this, there are still recommendations of genres that have nothing to do with the film. Therefore, genre is not taken into account sufficiently.

Finally, a survey was conducted to check the response of several people to the recommendations. The surveys are analysed in Section 4.8, which compares the results of the surveys for all the proposed solutions.

4.3.2 Cosine distances

As with the matrix of Euclidean distances, in the cosine matrix there is a series of films that appear in many distance vectors with a value of 0, that is, they are as different as possible from the movie for which they obtained that 0.

There are five such films. The films that are furthest from the whole dataset with the cosine matrix are the same as with the Euclidean matrix: "Citizen Gangster" (distance 0 on 2049 occasions), "Death Becomes Her" (distance 0 on 885 occasions), "Paul Blart: Mall Cop 2" (distance 0 on 44 occasions), "Storm Surfers 3D" (distance 0 on 1015 occasions) and "All Roads Lead to Rome" (distance 0 on 26 occasions).

As explained for the Euclidean distances, this probably happens because the extracted characteristics are not the most appropriate for these films, or because they are unusual and difficult-to-recommend movies, or both.

Just as the results for the furthest films are the same for both distance matrices, the same happens for the closest films. The movies that are closest to each other have been analysed, and these films do not follow a pattern; they are not repeated as much as the most distant ones. The number of distinct closest movies is again 2372, exactly the same value as for the Euclidean distances. The conclusions are the same as with the Euclidean distances: there is not a small range of films that dominate the first recommendation, and the recommendations are varied.

As was done with the Euclidean distances, the 10 closest films to "E.T." are presented below, in Table 4.10.

Position | Name | Genres
1 | La mujer de mi hermano | Drama
2 | The Mummy: Tomb of the Dragon Emperor | Action, Adventure, Fantasy
3 | Space Cowboys | Action, Adventure, Thriller
4 | A Serious Man | Comedy, Drama
5 | Saw IV | Horror, Mystery
6 | White Vengeance | Action, Drama, History
7 | Edtv | Comedy, Drama
8 | Layover | Drama, Romance
9 | Batman | Action, Adventure
10 | Deadfall | Crime, Drama, Thriller

Table 4.10: Cosine distances E.T. recommendations

The results of this test show that the first recommendation is the same as the first recommendation with the Euclidean distances. The film "Space Cowboys" is also repeated, in a different but very close position, and the films "Layover" and "White Vengeance" appear in both recommendations as well. The recommendations include some comedy and adventure films, which could be due to the embedding bringing films of similar genres closer together. Even so, the genres are not being taken into account enough, since there are many recommendations of genres that have nothing to do with the movie "E.T.", such as drama, crime, thriller, horror or romance.

These results show that the cosine and Euclidean distances are very similar, with only minimal differences. For that reason, only the Euclidean distances have been used in the training process of the recommendation engine.

4.4 Recommender Objective Evaluation Metrics

To properly evaluate the performance of the recommender, different objective metrics have been used. Firstly, the precision for different numbers of recommended elements has been calculated. The equation used to obtain this precision is the following:

precision = \frac{|TopKRecommended \cap PredictedRecommendations|}{K} \qquad (4.1)

where K is the number of recommendations taken into account, "TopKRecommended" is the set of indices of the K largest values in the distance labels and "PredictedRecommendations" is the set of indices of the top K predicted values.
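A direct translation of Equation 4.1, assuming the label and predicted similarity vectors of a film are available as NumPy arrays, could look as follows:

    import numpy as np

    def precision_at_k(true_similarities, predicted_similarities, k):
        # Indices of the K films with the highest label similarity and the highest predicted similarity
        top_true = set(np.argsort(true_similarities)[::-1][:k])
        top_pred = set(np.argsort(predicted_similarities)[::-1][:k])
        return len(top_true & top_pred) / k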

The Root Mean Square Error (RMSE) is another objective metric, used to get an intuition of how far the predicted distances are from the real labels. This metric is calculated as follows:

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (d_i - l_i)^2} \qquad (4.2)

where N is the total number of elements, d_i represents, in the case of this work, the distances predicted by the neural networks and l_i are the true distance values.

The second error metric used is the Mean Absolute Error (MAE), a measure of the difference between two continuous variables. Like the RMSE, it measures the difference between the predicted distances and the real labels, and it can be calculated using the following equation:

MAE = \frac{1}{N} \sum_{i=1}^{N} |d_i - l_i| \qquad (4.3)

After training, all the networks have been tested using these metrics on both the training set and the validation set. Precision was applied over the final predicted distances using the top K values of the predicted vector.
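For completeness, Equations 4.2 and 4.3 reduce to two short NumPy functions, where d are the predicted distances and l the true labels:

    import numpy as np

    def rmse(d, l):
        # Root mean square error between predicted distances and labels
        return float(np.sqrt(np.mean((np.asarray(d) - np.asarray(l)) ** 2)))

    def mae(d, l):
        # Mean absolute error between predicted distances and labels
        return float(np.mean(np.abs(np.asarray(d) - np.asarray(l))))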

4.5 Deep Neural Network Recommender

To analyse the results of the neural network, both the training phase and the prediction after training have been examined.

In the training part, the RMSE value is analysed as a metric of how successfully the network is learning. In the prediction part, the precision value is used as an objective measure and a recommendation example as a subjective test.

The use of MAE and MSE metrics to check training results is a common practice, as explained in [71]. This reference also justifies the use of surveys to check the results, as will be done in Section 4.8. On the other hand, the use of precision as an evaluation measure is explained in reference [72].

4.5.1 Neural Network training

In the training process the loss function is the MSE, and the RMSE gives an intuitive value of how the model is improving: it measures how far, on average, the predicted values are from the real values. The smaller, the better. Figure 4.62 shows the evolution of the RMSE for each epoch.

Figure 4.62: Evolution along epochs Figure 4.63: Evolution from epoch 100000

Figure 4.64: Neural Network RMSE

To see the evolution of the RMSE more clearly after the initial drop, Figure 4.63 shows the curve of Figure 4.62 from epoch 100000 onwards, between RMSE values of 0.0045 and 0.008.

The RMSE values are close to zero, but this does not by itself mean that the result is good. The curve descends steeply until epoch 8000 and from there continues decreasing, but at a much lower rate. Even so, the important fact is that it does not get stuck and the loss keeps falling.

Although from epoch 100000 the curve seems flat, it keeps decreasing continuously until the final epoch: at epoch 100000 the RMSE is 0.0074, at epoch 200000 it is 0.0059 and at epoch 300000 it is 0.0052, so the lowest RMSE value is 0.0052. Analysing these values, it can be concluded that this is not a very good final error. Taking two of the top films to recommend, it can be observed that the difference between them is approximately 0.01 on average. This means that the achieved RMSE is not small enough to distinguish between the top films and can confuse films in the range 0.993-1. Instead of recommending the top film, the recommender may therefore recommend other films that are not very far from the top ones.

After making the prediction on the validation set, a series of checks are made. First, the RMSE and the MAE are calculated; the results can be seen in Table 4.11.

Metric | Value (training set) | Value (validation set)
RMSE | 0.0052 | 0.0083
MAE | 0.0043 | 0.0077

Table 4.11: NN metrics results

These values will be compared with the rest of the solutions in the comparison section, but at first glance they are low MAE and MSE values. They are quite good, considering that 4021 values are being predicted, which is a complicated task.

The precision results for different values of K can be seen in Table 4.12. The results for the validation set are also in the table; since its size is 100, there is no value for K = 500.

The results show that the larger the range of the vector taken into account, the higher the percentage of coincidence between the prediction and the labels, since by taking more movies it is more likely to have the top recommendations among them. It also happens that the farther a recommendation is, the better it is predicted.

Number of elements | Precision (%) | Precision Validation Set (%)
10 | 2.2345 | 1.3643
30 | 5.7731 | 3.8979
50 | 8.4078 | 5.5763
100 | 16.649 | 10.0936
500 | 24.2458 | NA

Table 4.12: NN precision results

It can also be observed that the percentage of matching films among the 10 closest according to the labels and to the prediction is 2.1345%. This value may seem low, but it is very complicated for the network to predict numbers that are so close together (the top K distances differ by about 0.01 on average). The closest movies have very similar values, which means that even for the film in position 50 the distance to the predicted film is still very small; therefore, if the prediction proposes a film that is among the 50 closest labels, it will still be a good prediction. The precision for 50 movies is already 5.6763%, a better result. Even so, the following sections aim to improve these results using more complex deep learning network architectures.

4.5.2 Deep Neural Network prediction

Once the prediction and the metrics have been obtained, several prediction examples with trailers outside the dataset are checked. It has been observed that the furthest films, the least recommended, are the ones that are best predicted. For this reason, it has been checked how good the 30 closest ones are. To perform this check, the 30 nearest films according to the labels are compared with the 30 closest predictions of each film, and the number of coincidences between those two sets is counted, so a film does not need to be in the same position, only within that vector of 30 values.

This procedure has been carried out on all the films and the average number of hits has been computed. On average, 1.92 films appear among the first 30 films both in the distance labels and in the prediction.

Among the predictions, going back to the saga examples, for the movie "Harry Potter and the Deathly Hallows: Part 2" the film "Harry Potter and the Order of the Phoenix" is recommended in position 28, and for "Star Wars: Episode I - The Phantom Menace" the film "Star Wars: Episode VII - The Force Awakens" is recommended in position 9. For the "Star Trek" saga there was no case in which a film of the saga led to another film of the same saga being recommended.

Finally, the same prediction has been made with the three models so that their outputs can be subjectively compared. The prediction was made with the film "E.T. the Extra-Terrestrial" and the results for the NN are presented in Table 4.13.

Position | Name | Genres
1 | Florence Foster Jenkins | Biography, Comedy, Drama
2 | Gorillas in the Mist | Biography, Drama
3 | The Mummy: Tomb of the Dragon Emperor | Action, Adventure, Fantasy
4 | Planes: Fire & Rescue | Animation, Adventure, Comedy
5 | The We and the I | Drama
6 | The Greatest Song | Romance
7 | Dracula 2000 | Action, Fantasy, Horror
8 | Lara Croft Tomb Raider: The Cradle of Life | Action, Adventure, Fantasy
9 | Welcome to the Jungle | Action, Adventure, Comedy
10 | Saw IV | Horror, Mystery

Table 4.13: Neural Network E.T. recommendations

These recommendations do not seem very appropriate for a family movie such as "E.T.", although it can be guessed that the inclination has been towards adventure films, a genre that "E.T." also has. Even so, these recommendations should be improved, since the recommender engine suggests horror films for this movie, which are far from appropriate. This is a subjective analysis and depends on the personal tastes of the users, so, as for the distances, a poll has been conducted for the Deep Neural Network model. The results of these tests are presented in Section 4.8.

4.6 Autoencoder

As for the Deep Neural Network, the results are divided into training results and prediction results. For training, the evolution of the RMSE through the epochs is shown; for prediction, the precision is shown as an objective result and a recommendation with the trained model as a subjective result.

4.6.1 Autoencoder training

The RMSE gives an intuitive value of how the model is improving; the smaller, the better. Figure 4.65 shows the evolution of the RMSE for each epoch of the autoencoder training.

Figure 4.65: Evolution along epochs Figure 4.66: 100000-300000 epochs

Figure 4.67: Autoencoder RMSE

The training RMSE values are lower than those of the Deep Neural Network model: the lowest RMSE value for the Deep Neural Network was 0.00523375, while for the autoencoder it is 0.00007266.

In Figure 4.65 it seems as if the RMSE gets stuck around epoch 20000, so a zoom has been made to see what happens from there (Figure 4.66). In this graph it can be seen that, although the rate of decline is lower, the RMSE value continues to fall.

The final RMSE and MAE results of this training, for the training and validation sets, are shown in Table 4.14.

Compared with the Deep Neural Network, however, these final RMSE and MAE values are somewhat higher, so in terms of prediction error alone the autoencoder does not improve on it.

Next, the precision values of the predictions are checked against the labels. These results are shown in Table 4.15.

Metric | Value (training set) | Value (validation set)
RMSE | 0.008524 | 0.01203
MAE | 0.006154 | 0.00923

Table 4.14: Autoencoder metrics results

Number of elements | Precision (%) | Precision Validation Set (%)
10 | 3.6193 | 1.843
30 | 7.0569 | 4.8965
50 | 15.4523 | 10.5631
100 | 21.209 | 15.908
500 | 36.9785 | NA

Table 4.15: Autoencoder precision results

The precision results are better for both the training and validation sets. The improvement over the deep neural network is considerable, showing that this solution learns to predict the distances from the data more efficiently.

The average number of films that appear among the first 30 both in the distance labels and in the prediction is 0.94, lower than the 1.92 obtained by the simple neural network.

These objective results, the higher errors and the lower number of top-30 coincidences, suggest that on these measures the autoencoder is a worse solution than the simple neural network, even though its precision values are higher.

4.6.2 Autoencoder prediction

In the saga test, no case was found in which a movie from the "Harry Potter", "Star Wars" or "Star Trek" sagas was recommended when a movie of the same saga was introduced.

The film "E.T." is then tested, and the results are shown in Table 4.16.

Position | Name | Genres
1 | Planes: Fire & Rescue | Animation, Adventure, Comedy
2 | Tenderness | Crime, Drama, Thriller
3 | Back in the Day | Drama
4 | Dennis the Menace | Comedy, Family
5 | Lust for Love | Comedy, Romance
6 | Alice Through the Looking Glass | Adventure, Family, Fantasy
7 | The Killing Fields | Drama, History, War
8 | Florence Foster Jenkins | Biography, Comedy, Drama
9 | Bad Influence | Crime, Drama, Thriller
10 | Lara Croft Tomb Raider: The Cradle of Life | Action, Adventure, Fantasy

Table 4.16: Autoencoder E.T. recommendations

Despite the worse results in the objective analysis, this recommendation seems more accurate than the one made by the NN. The NN leaned towards adventure films, but those films, subjectively, were not good recommendations for a film like "E.T.". In this autoencoder recommendation, on the other hand, three of the films shown are clearly of interest to someone who has liked "E.T.": the film in position 1 ("Planes: Fire & Rescue"), the film in position 4 ("Dennis the Menace") and the film in position 6 ("Alice Through the Looking Glass"). The movie in position 10 ("Lara Croft Tomb Raider: The Cradle of Life") might also be of interest, but this one is less clear and more subjective.

4.7 Double autoencoder

In this section the results of the training and the prediction of the double autoencoder are shown.

4.7.1 Double autoencoder training

As in the previous sections, the evolution of the RMSE throughout the epochs is shown as a result of the training. In the double autoencoder, however, two models have been used, so the results of both models are shown: Figure 4.68 shows the RMSE of autoencoder 1 and Figure 4.71 the RMSE of autoencoder 2.

Figure 4.68: Evolution along epochs Figure 4.69: 100000-300000 epochs

Figure 4.70: Double autoencoder, first autoencoder RMSE

Figure 4.71: Evolution along epochs Figure 4.72: 100000-300000 epochs

Figure 4.73: Double autoencoder, second autoencoder RMSE

These two graphs do not allow the evolution of the RMSE to be seen clearly, although it can already be intuited that it is very low. Figure 4.69 and Figure 4.72 show the graphs from epoch 100000 to epoch 300000, as an example of how the RMSE continues to go down. For the first autoencoder it can be seen that the RMSE shows peaks that rise considerably, but globally the loss decreases. This may be because the optimisation algorithm is in a local minimum of the function and keeps jumping from one side to the other. In the second autoencoder the RMSE reaches very low values and does not get stuck. The RMSE values of the first autoencoder are even lower than those of the second: the lowest RMSE value for the first autoencoder is 2.6877e-06 and for the second autoencoder it is 9.5616e-05.

Table 4.17 shows the RMSE and MAE results for the training and validation sets. These results are lower than for the other networks, which makes sense because this network seeks to improve the result, and the decreasing losses are a sign of that improvement.

Metric | Value (training set) | Value (validation set)
RMSE | 0.00977 | 0.00756
MAE | 0.004951 | 0.008109

Table 4.17: Double autoencoder metrics results

The precision for different numbers of recommended elements has also been calculated. The results can be observed in Table 4.18.

Number of elements | Precision (%) | Precision Validation Set (%)
10 | 5.1907 | 3.929
30 | 9.1530 | 7.8451
50 | 18.7545 | 16.9081
100 | 29.0433 | 25.0904
500 | 57.3077 | NA

Table 4.18: Double autoencoder precision results

The precision values increase again, which represents an improvement in the results. The first autoencoder learnt to predict the labels more efficiently than the previously proposed solutions, and the second autoencoder is able to learn the encoded version of the data, improving the final results by more than 50% with respect to the previous autoencoder.
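The layer sizes, activations and epoch counts below are not taken from this work; they are illustrative values in a minimal Keras-style sketch of the two-stage idea just described, in which the first autoencoder reconstructs the distance labels and the second network maps the embeddings onto the codes produced by the first encoder:

    from tensorflow.keras import layers, models

    # distance_labels: (4021, 4021) similarity matrix used as labels
    # embeddings:      (4021, 2048) trailer embeddings
    EMB_DIM, DIST_DIM, CODE_DIM = 2048, 4021, 512   # CODE_DIM is an assumption

    # First autoencoder: learns to reconstruct the distance label vectors.
    dist_in = layers.Input(shape=(DIST_DIM,))
    code = layers.Dense(CODE_DIM, activation="relu")(dist_in)
    dist_out = layers.Dense(DIST_DIM, activation="sigmoid")(code)
    ae1 = models.Model(dist_in, dist_out)
    ae1.compile(optimizer="adam", loss="mse")
    ae1.fit(distance_labels, distance_labels, epochs=1000, batch_size=64)  # epoch count illustrative

    # Targets for the second stage: the encoded version of the labels.
    encoder1 = models.Model(dist_in, code)
    codes = encoder1.predict(distance_labels)

    # Second network: maps the embeddings to those codes.
    emb_in = layers.Input(shape=(EMB_DIM,))
    hidden = layers.Dense(1024, activation="relu")(emb_in)
    code_pred = layers.Dense(CODE_DIM, activation="relu")(hidden)
    ae2 = models.Model(emb_in, code_pred)
    ae2.compile(optimizer="adam", loss="mse")
    ae2.fit(embeddings, codes, epochs=1000, batch_size=64)

    # Prediction: embedding -> code -> decoder of the first autoencoder -> distances.
    decoder_in = layers.Input(shape=(CODE_DIM,))
    decoder1 = models.Model(decoder_in, ae1.layers[-1](decoder_in))
    predicted_distances = decoder1.predict(ae2.predict(embeddings))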

The average number of films that appear among the first 30 both in the distance labels and in the prediction is 2.45.

4.7.2 Double autoencoder prediction

Predictions are then made for several films as a first subjective test, returning to the saga technique.

On this occasion the model recommends movies from the "Star Trek" saga when one of them is given as input, which represents an improvement in the recommendations. For the movie "Star Trek: Nemesis", "Star Trek III: The Search for Spock" is recommended in position 22, and for "Star Trek Beyond" it also recommends "Star Trek III: The Search for Spock". The model is now able to recommend films of this saga for at least two of its movies, a clear improvement in the recommendations.

As has been done with the other networks, the recommendations for the film "E.T. the Extra-Terrestrial" are shown in Table 4.19.

Position | Name | Genres
1 | He's Way More Famous Than You | Comedy
2 | Flawless | Comedy, Crime, Drama
3 | Electrick Children | Drama
4 | Grown Ups 2 | Comedy
5 | Robots | Animation, Adventure, Comedy
6 | Kill List | Crime, Horror, Thriller
7 | The Spy Next Door | Action, Comedy, Family
8 | The Thaw | Horror, SciFi, Thriller
9 | Black Robe | Adventure, Drama, History
10 | Ask the Dust | Drama, Romance

Table 4.19: Double autoencoder E.T recommendations

It can be seen that the recommendations have improved greatly compared with those of the NN; the engine now offers films more in keeping with the "E.T." movie.

It can also be considered that the recommendation has improved with respect to the autoencoder: the autoencoder produced good recommendations, but only three of the ten. In the double autoencoder, on the other hand, the bad recommendations are the exceptions, such as the 6th ("Kill List"), the 8th ("The Thaw") and the 10th ("Ask the Dust"). They are considered bad for not being family movies, but perhaps the person receiving the recommendations did not like "E.T." only because it is a family movie, in which case a film like "Ask the Dust" could still be of interest. It is necessary to analyse the surveys to see whether the users share this opinion.

4.8 Subjective comparison between solutions

The comparison of the "E.T." recommendations shows a progressive improvement from using only the distances to using the double autoencoder.

The recommendation based only on the distances is not very accurate: there is no pattern in it, the films are not subjectively recommendable and genre does not seem to be taken into account enough.

The recommendation of the Deep Neural Network is not very adequate either. It slightly improves on just using the distances, since it includes more movies sharing a genre with "E.T.", but the selection can still be improved.

The autoencoder recommendation is already a great improvement over the previous ones: of the ten films, four can be considered good recommendations. This network has learnt some complex patterns in the data that relate the films better and assign a high similarity value to these films.

Finally, the recommendation of the double autoencoder is the best of the four: seven of the ten films are clearly good recommendations and the other three will depend on the user's tastes, because they are not so accurate.

In order to evaluate user behaviour and tastes regarding the recommended films, a subjective test using surveys has been conducted. In this way, very interesting information about the trained models can be collected and analysed.

4.8.1 Surveys

In this section the recommendations for five films have been extracted using four of the techniques presented in this work: the Euclidean distances, the Artificial Neural Network, the autoencoder and the double autoencoder. The cosine distances have not been used, since their results are very similar to those of the Euclidean distances, as previously verified. For each of these techniques, first the predictions and then the results of the surveys are shown.

The five films fed into the four recommender engines were "Contratiempo", "Dolor y Gloria", "La Tribu", "Star Wars" and "Titanic". The top 10 recommended films are presented in Appendix C and the survey template in Appendix D. Twenty participants completed the survey for the four models used.

4.8.1.1 Euclidean distance surveys

The recommendations for the film "Contratiempo" can be seen in Table C.1, for "Dolor y Gloria" in Table C.2, for "La Tribu" in Table C.3, for "Star Wars" in Table C.4 and for "Titanic" in Table C.5.

The average score of the film recommendations for each film is:

• ”Contratiempo”: 2.1

• ”Dolor y Gloria”: 1.5

• ”La Tribu”: 1.8

• ”Star Wars”: 1.5

• ”Titanic”:1.1

For the first question, "Assuming you liked the input movie, would you decide to see any of the 10 recommended ones?", the average score is 2.

For the second question, "Would you use the recommendation system again?", the average score is 1.8.

For the third question, "Would you recommend the recommendation system to a friend or family member?", the average score is 1.5.

And the average of the overall score answers is 1.8.

4.8.1.2 Artificial Neural Network surveys

The recommendations for the film "Contratiempo" can be seen in Table C.6, for "Dolor y Gloria" in Table C.7, for "La Tribu" in Table C.8, for "Star Wars" in Table C.9 and for "Titanic" in Table C.10.

The average score of the film recommendations for each film is:

• ”Contratiempo”: 1.8

• ”Dolor y Gloria”: 1.4

• ”La Tribu”: 1.7

• ”Star Wars”: 1.2

• ”Titanic”: 1.6

For the first question, "Assuming you liked the input movie, would you decide to see any of the 10 recommended ones?", the average score is 1.8.

For the second question, "Would you use the recommendation system again?", the average score is 1.7.

For the third question, "Would you recommend the recommendation system to a friend or family member?", the average score is 1.3.

And the average of the overall score answers is 1.6.

4.8.1.3 Autoencoder surveys

The recommendations for the film "Contratiempo" can be seen in Table C.11, for "Dolor y Gloria" in Table C.12, for "La Tribu" in Table C.13, for "Star Wars" in Table C.14 and for "Titanic" in Table C.15.

The average score of the film recommendations for each film is:

• ”Contratiempo”: 2.9

• ”Dolor y Gloria”: 2.2

• ”La Tribu”: 1.8

• ”Star Wars”: 1.5

• ”Titanic”: 2

For the first question ”Assuming you liked the input movie, would you decide to see any of the 10 recommended ones?” the average score is 2.3.

For the second question ”Would you use the recommendation system again?” the average score is 2.3.

For the third question, "Would you recommend the recommendation system to a friend or family member?", the average score is 1.7.

And the average of the overall score answers is 2.5.

4.8.1.4 Double Autoencoder surveys

The recommendations for the film "Contratiempo" can be seen in Table C.16, for "Dolor y Gloria" in Table C.17, for "La Tribu" in Table C.18, for "Star Wars" in Table C.19 and for "Titanic" in Table C.20.

The average score of the film recommendations for each film is:

• ”Contratiempo”: 3.3

• ”Dolor y Gloria”: 2.5

• ”La Tribu”: 2

• ”Star Wars”: 3

• "Titanic": 1.8

For the first question ”Assuming you liked the input movie, would you decide to see any of the 10 recommended ones?” the average score is 2.7.

For the second question ”Would you use the recommendation system again?” the average score is 2.5.

For the third question, "Would you recommend the recommendation system to a friend or family member?", the average score is 2.3.

And the average of the overall score answers is 2.8.

4.8.1.5 Surveys analysis

Regarding the results of the surveys, the average overall scores are analysed. The recommender using the Euclidean distances directly obtains a score of 1.8, the neural network recommender achieves 1.6, the autoencoder reaches 2.5 and, finally, the double autoencoder obtains 2.8.

It is very interesting to analyse these results. The first surprising point is that the distance-based recommender obtains a higher score than the deep neural network: users prefer the recommendations of the distance-based recommender, although the overall result is not good enough for either of them.

The autoencoder's overall score is 2.5, half of the maximum score. This result shows that this recommender probably works, but not all of the recommended films are completely interesting for the users.

Finally, the double autoencoder achieves the highest value, with an overall score of 2.8. This model predicts better distances, so closer films are recommended. As with the simple autoencoder, probably not all the films are interesting enough for the users, but this recommender breaks away from relying on the rating scores of the films and the number of user votes.

Chapter 5

Conclusions and future lines

5.1 Conclusions

In this work a movie recommender based on visual content analysis has been built, using mainly image processing, computer vision, machine learning and deep learning techniques. Throughout this document, all the processes developed to create the recommender have been described.

In the first phase, four features have been extracted to describe the films in the dataset and the validity of all four has been demonstrated. In particular, the analysis of the results has focused on demonstrating not only that the extraction of the features was correct but also that they were descriptive of the films. For example, the action detector is shown to find the actions of a trailer precisely, together with the percentage of time each appears, and it also shows that similar films tend to share a series of recurring actions, such as kissing in romantic movies. This happens for all four features: they are extracted with high accuracy and they are also able to describe the movies, so their validity is demonstrated by analysing their appearance in the films.

The embedding results are also very interesting. After 500 epochs, the network reaches the maximum accuracy and the minimum loss; with this result, a film genre classifier based on trailers with 100% training accuracy has been obtained. This embedding is able to translate the input data into another subspace that efficiently relates these features to the genres of the films.

The recommender has three versions, each using a different deep learning network. The first is a sequential neural network with three layers, the second is an autoencoder and the third is a double autoencoder. The first proposed network tries to use basic deep learning building blocks to perform a regression over the data proposed as labels. The effectiveness of this architecture when working with proper input data has been demonstrated, but it is also shown that more complex networks learn more efficiently than a simple artificial neural network. Autoencoder architectures are a common solution for problems related to generating data of the same nature or of different modalities. In this work, a first approach takes advantage of an autoencoder that encodes the embedding vectors and decodes the data into a different modality; the results show a great improvement in comparison with the artificial neural network. Finally, a double autoencoder takes advantage of the combination of two separate networks: the first one learns to reproduce the labels used as inputs, and the second one fits the embeddings to the encoded input of the first autoencoder's decoder. The success of this approach is shown in the results, which indicate that combining differentiated learning processes is better for learning across modalities than learning directly from the two types of data.

Both the objective and the subjective results of the recommendations show that the best option is the double autoencoder: the MSE and MAE values are lower, the precision is higher and the recommendations are subjectively better.

It is also important to note that the most widely used technique for recommending multimedia content is collaborative filtering, whereas content-based filtering has been used in this work; the results show that recommendations made without using user feedback can still yield an interesting final result. Collaborative filtering tends to recommend popular movies or very "obvious" items; these recommendations encourage the consumption of such content, which pushes their rankings even further, so popular movies are boosted and unpopular movies are hidden. The content-based film recommender developed here does not use the user experience to make the recommendation, only the content of the films. This method is not widely used and could be a good option to incorporate into the recommendation systems of video-on-demand services, improving their recommendations and making it possible to recommend any film in the catalogue.

5.2 Future lines

As future lines, a series of possible improvements to the recommender are proposed. The first is to train the recommender with another movie dataset; this dataset needs to have genre labels, otherwise they will have to be created manually. The second possible improvement is to incorporate emotion analysis into the feature vector, creating an emotion analyser with higher accuracy that is able to detect emotions without them having to be exaggerated. Another option is to try to improve the results of the simple neural network, although the results have already been improved with the double autoencoder network.

Other approaches could be the combination of several datasets to build a much larger film database, and a direct comparison with a collaborative filtering algorithm to see the potential improvements of combining both. Furthermore, more features could be used to describe the films properly. Finally, convolutional or recurrent neural networks could be applied directly to the content in order to extract features automatically and learn potential relations between the data over time.

Bibliography

[1] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” IEEE Trans. Pattern Anal. Mach. Intell., 2017.

[2] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recogni- tion in videos,” in Advances in Neural Information Processing Systems 27, 2014.

[3] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” CoRR, 2015.

[4] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg, “SSD: single shot multibox detector,” 2015.

[5] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Doll´ar,“Focal loss for dense object detection,” 2017.

[6] K. He, G. Gkioxari, P. Doll´ar, and R. B. Girshick, “Mask R-CNN,” CoRR, vol. abs/1703.06870, 2017.

[7] VVAA, “Think analytics.” [Web; accessed 13-05-2019].

[8] VVAA, “Gravity r&d.” [Web; accessed 13-05-2019].

[9] VVAA, “Recombee.” [Web; accessed 13-05-2019].

[10] “Netflix.” https://help.netflix.com/en/node/100639.

[11] S. Ciocca, “How does spotify know you so well?.” [Web; accessed 13-05-2019].


[12] G. S. Simões, J. Wehrmann, R. C. Barros, and D. D. Ruiz, “Labeled movie trailer dataset.” https://github.com/jwehrmann/lmtd, 2018.

[13] P. Covington, J. Adams, and E. Sargin, “Deep neural networks for youtube recommen- dations,” 2016.

[14] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide & deep learning for recommender systems,” in Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10, ACM, 2016.

[15] S. Okura, Y. Tagami, S. Ono, and A. Tajima, “Embedding-based news recommendation for millions of users,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 7–10, ACM, 2017.

[16] M. M¨uhling, M. Meister, N. Korfhage, J. Wehling, A. H¨orth, R. Ewerth, and B. Freisleben, “Content-based video retrieval in historical collections of the german broad- casting archive,” International Journal on Digital Libraries, 2018.

[17] M. M¨uhling,N. Korfhage, E. M¨uller,C. Otto, M. Springstein, T. Langelage, U. Veith, R. Ewerth, and B. Freisleben, “Deep learning for content-based video retrieval in film and television production,” Multimedia Tools and Applications, 2017.

[18] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large- scale video classification with convolutional neural networks,” in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[19] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015.

[20] L. Yao, A. Torabi, K. Cho, N. Ballas, C. J. Pal, H. Larochelle, and A. C. Courville, “De- scribing videos by exploiting temporal structure,” 2015 IEEE International Conference on Computer Vision (ICCV), 2015.

[21] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[22] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in ECCV, 2016.

[23] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. C. Russell, “Actionvlad: Learning spatio-temporal aggregation for action classification,” CoRR, 2017.

[24] Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann, “Hidden two-stream convolutional networks for action recognition,” 04 2017.

[25] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4733, 2017.

[26] A. Diba, M. Fayyaz, V. Sharma, A. Hossein Karami, M. Mahdi Arzani, L. Van Gool, and R. Yousefzadeh, “Temporal 3d convnets: New architecture and transfer learning for video classification,” 2017.

[27] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accu- rate object detection and semantic segmentation,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2014.

[28] R. Girshick, “Fast r-cnn,” in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, IEEE Computer Society, 2015.

[29] F. Bu, Y. Cai, and Y. Yang, “Multiple object tracking based on faster-rcnn detector and kcf tracker,” 2016.

[30] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems 28, 2015.

[31] A. Shrivastava, A. Gupta, and R. B. Girshick, “Training region-based object detectors with online hard example mining,” CoRR, 2016.

[32] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: object detection via region-based fully convo- lutional networks,” 2016.

[33] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” CoRR, vol. abs/1612.08242, 2016.

[34] S.-H. Tsang, “Review: Yolov2 & yolo9000.” https://towardsdatascience.com/review- yolov2-yolo9000-you-only-look-once-object-detection-7883d2b02a65.

[35] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, “Feature pyramid networks for object detection,” 2016.

[36] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” 2018.

[37] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-shot refinement neural network for object detection,” CoRR, vol. abs/1711.06897, 2017.

[38] Z. Zhao, P. Zheng, S. Xu, and X. Wu, “Object detection with deep learning: A review,” CoRR, vol. abs/1807.05511, 2018.

[39] P. Mehta, “What is the difference between back-propagation and forward-propagation?.” [Web; accessed 09-06-2019].

[40] P. Goyal, “What is the difference between precision and recall?.” [Web; accessed 10-06- 2019].

[41] W. Koehrsen, “Beyond accuracy: Precision and recall.” [Web; accessed 10-06-2019].

[42] A. Kosir, A. Odi, M. Kunaver, M. Tkalˇciˇc,and J. Tasic, “Database for contextual per- sonalization,” 2011.

[43] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and P. Lamere, “The million song dataset.,” 2011.

[44] D. Hauger, M. Schedl, A. Kosir, and M. Tkalcic, “The million musical tweet dataset - what we can learn from microblogs,” 2013.

[45] M. Schedl, “The lfm-1b dataset for music retrieval and recommendation,” in Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016.

[46] F. M. Harper and J. A. Konstan, “The movielens datasets: History and context,” 2015.

[47] Y. Deldjoo, M. G. Constantin, B. Ionescu, M. Schedl, and P. Cremonesi, “Mmtf-14k: A multifaceted movie trailer feature dataset for recommendation and retrieval,” in Proceed- ings of the 9th ACM Multimedia Systems Conference, 2018.

[48] M. Patacchiola, “The simplest classifier: Histogram comparison.”

[49] A. Sharma and A. K. Singh, “Color difference histogram for feature extraction in video retrieval,” 2015.

[50] K. J. K. Sivaraman and G. Somappa, “Moviescope: Movie trailer classification using deep neural networks,” 2017.

[51] Z. Rasheed, Y. Sheikh, and M. Shah, “On the use of computable features for film classi- fication,” IEEE Transactions on Circuits and Systems for Video Technology, 2005.

[52] W.-T. Chu and H.-J. Guo, “Movie genre classification based on poster images with deep neural networks,” in Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes, 2017.

[53] S.-M. Choi, S.-K. Ko, and Y.-S. Han, “A movie recommendation algorithm based on genre correlations,” Expert Systems with Applications, 2012.

[54] C. Basu, H. Hirsh, and W. Cohen, “Recommendation as classification: Using social and content-based information in recommendation,” Proceedings of AAAI-98, 2000.

[55] K. Wakil, R. Bakhtyar, K. Ali, and K. Alaadin, “Improving web movie recommender system based on emotions,” International Journal of Advanced Computer Science and Applications, 2015.

[56] A. Ullah, J. Ahmad, K. Muhammad, I. Mehmood, M. Lee, J. Ryeol Park, and S. Baik, “Action recognition in movie scenes using deep features of keyframes,” Journal of the Korean Institute of Next Generation Computing, 2017.

[57] D. Decarlo and D. Metaxas, “Optical flow constraints on deformable models with appli- cations to face tracking,” International Journal of Computer Vision, 2000.

[58] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[59] A. Dosovitskiy, P. Fischer, E. Ilg, P. H¨ausser,C. Hazırba¸s,V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2015.

[60] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal, T. Yan, L. Brown, Q. Fan, D. Gutfruend, C. Vondrick, et al., “Moments in time dataset: one million videos for event understanding,” IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 2019.

[61] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, 2012.

[62] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari, “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” 2018.

[63] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisser- man, “The pascal visual object classes challenge: A retrospective,” International Journal of Computer Vision, 2015.

[64] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar,and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, 2014.

[65] G. Farneb¨ack, “Two-frame motion estimation based on polynomial expansion,” in Pro- ceedings of the 13th Scandinavian Conference on Image Analysis, Springer-Verlag, 2003.

[66] L. Hardinata, B. Warsito, and Suparti, “Bankruptcy prediction based on financial ratios using jordan recurrent neural networks: a case study in polish companies,” Journal of Physics: Conference Series, vol. 1025, 2018.

[67] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel, “Variational lossy autoencoder,” arXiv preprint arXiv:1611.02731, 2016.

[68] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” 2006.

[69] C. Hong, J. Yu, J. Wan, D. Tao, and M. Wang, “Multimodal deep autoencoder for human pose recovery,” IEEE Transactions on Image Processing, 2015.

[70] “Giphy.” [Web; accessed 09-06-2019].

[71] G. Shani and A. Gunawardana, “Evaluating recommender systems,” tech. rep., 2009.

[72] C. Pinela, “How to evaluate recommender systems.”

[73] E. D. Portal, “Analytical report 3: Open data and privacy.” [Web; accessed 09-06-2019].

Appendices

Appendix A

Ethical, social, economic and environmental aspects

A.1 Introduction

The recommendation of multimedia content is a very popular topic today. For services that offer multimedia content on demand, using a good recommender rather than a bad one makes a big difference. However, these recommendation systems entail a series of risks related to the collection of user information, which are explained in this annex. It is also important to take into consideration the economic impact of a recommendation system, both in its development and in its implementation and use. Finally, the environmental aspects involved in implementing this type of system are analysed.

A.2 Description of relevant impacts related to the project

A.2.1 Ethical impact

The main controversy and debate that this type of system can generate concerns privacy, since personal and private information is gathered from individuals. For the treatment and use of these data, the European Data Portal [73] has published a report that provides some guidelines to follow when working with sensitive data:

• Understand the information to be able to consider potential risks.

• Treat the information anonymously.

• Do not publish the information without consent.

• Collect non-discriminatory information. The information collected from people should not have a bias that depends on sex, race, clothing, disabilities, etc.

The advantage of the proposed recommendation system is that it does not need much information from the user, only the movies for which the user wants recommendations.

Another ethical aspect to take into consideration is that the algorithms absorb and learn what the people who program them teach them. In fact, they can promote the prejudices established in society because they learn the information that humans show them. Therefore, biased training datasets can lead to biased results and thus perpetuate a set of biases.

On the other hand, a recommendation system could be used maliciously to induce users to watch the movies that its operator wants them to see. For example, an advertising company could pay for a service in which the recommender steers people towards movies related to its product: if a soda brand paid for its product to appear in certain movies, it could contract the service to push users towards those movies. Similarly, by inducing users to watch films unrelated to what would genuinely be recommended, the system could be used for political propaganda.

Finally, it should be noted that users may feel discomfort or mistrust upon becoming aware that they are being induced to watch certain movies, possibly for malicious purposes.

A.2.2 Social impact

The social impact of a film recommender is not very large. It should be noted, however, that with a good recommender people spend more of their leisure time actually doing activities they like, instead of wasting time searching for a movie to watch, since they already know which one.

It also makes it easier for users to discover new tastes, since a recommendation may be both new and surprising, that is, something the user did not expect but ends up liking.

A.2.3 Economic impact

The economic cost of the recommendation system is very low, while the economic benefits it brings are high. A video-on-demand service can have a very extensive catalogue, and selecting a film would be very laborious unless the user knew exactly what to watch; users would then be unhappy with the service and unable to exploit its full potential. Multimedia recommenders are a basic instrument for allowing users to find the content they want to watch, and if users find that content, the video-on-demand service becomes a useful service for them.

A.2.4 Environmental impact

Analysing the environmental impact, this project has hardly any environmental consequences, since it does not require a large amount of material or energy. The only requirements are a computer and, if processes are to be sped up, a server, together with the energy needed to run those tools. Therefore, from the point of view of pollution, it does not significantly harm the environment.

A.3 Conclusions

Taking into account the ethical, social, economic and environmental repercussions of using movie recommendations based on visual content analysis with deep learning techniques, it can be concluded that, when used correctly, they have a positive impact on improving people's quality of life. It is also a low-cost technology that does not negatively impact the environment.

Appendix B

Economic budget

Figure B.1: TFM budget


The cost of labour for the completion of the project by a qualified technician has been estimated at €26.75 gross per hour. This cost is evaluated according to the market rates currently being paid to this type of personnel. To obtain the net cost, the industrial benefit applied to the direct cost must be deducted, which represents 15% plus 6% applied twice, also on the direct cost. Deducting a total of 27% from €26.75/hour, a direct labour cost of €19.53/hour is obtained.
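Interpreting these deductions additively (15% plus 6% applied twice), the net rate follows directly:

    26.75 \times \bigl(1 - (0.15 + 0.06 + 0.06)\bigr) = 26.75 \times 0.73 \approx 19.53 \text{ €/hour}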

Survey results

C.1 Euclidean distance recommendations

Position | Name | Genres
1 | Road Trip | [Comedy]
2 | Two Night Stand | [Comedy, Romance]
3 | State of Play | [Crime, Drama, Mystery]
4 | Bad Girls from Valley High | [Comedy]
5 | Desperately Seeking Susan | [Comedy, Drama]
6 | In Secret | [Crime, Drama, Thriller]
7 | Red Tails | [Action, Adventure, Drama]
8 | The Yards | [Crime, Drama, Romance]
9 | Daddy's Home | [Comedy]
10 | Bubba Ho-Tep | [Comedy, Fantasy, Mystery]

Table C.1: Euclidean distances ”Contratiempo” recommendations


Position | Name | Genres
1 | Walking with Dinosaurs 3D | [Animation, Adventure, Family]
2 | Ghoulies | [Comedy, Fantasy, Horror]
3 | Night Shift | [Comedy]
4 | G-Force | [Action, Adventure, Comedy]
5 | Iron Man 2 | [Action, Adventure, SciFi]
6 | Greetings from Tim Buckley | [Drama]
7 | A Royal Night Out | [Comedy, Drama, Romance]
8 | Last Action Hero | [Action, Adventure, Comedy]
9 | Flawless | [Comedy, Crime, Drama]
10 | Back in the Day | [Drama]

Table C.2: Euclidean distances "Dolor y Gloria" recommendations

Position | Name | Genres
1 | Million Dollar Baby | [Drama, Sport]
2 | Signs | [Drama, SciFi, Thriller]
3 | Love & Mercy | [Biography, Drama, Music]
4 | The Hot Chick | [Comedy, Fantasy]
5 | Inglourious Basterds | [Adventure, Drama, War]
6 | The Wedding Ringer | [Comedy]
7 | The Greatest Game Ever Played | [Drama, History, Sport]
8 | Eat Your Heart Out | [Comedy, Drama]
9 | Return to Me | [Comedy, Drama, Romance]
10 | Mental | [Comedy, Drama]

Table C.3: Euclidean distances "La Tribu" recommendations

Position | Name | Genres
1 | Enchanted April | [Drama]
2 | Jack the Giant Slayer | [Adventure, Fantasy]
3 | A Dennis the Menace Christmas | [Comedy, Family, Fantasy]
4 | A Million Ways to Die in the West | [Comedy, Western]
5 | Child’s Play | [Fantasy, Horror]
6 | The Calling | [Thriller]
7 | The Big Green | [Comedy, Family, Sport]
8 | Wanderlust | [Comedy]
9 | Bob Roberts | [Comedy]
10 | Minions | [Animation, Comedy, Family]

Table C.4: Euclidean distances "Star Wars" recommendations

Position | Name | Genres
1 | A Time to Kill | [Crime, Drama, Thriller]
2 | Chinese Zodiac | [Action, Adventure]
3 | Balto | [Animation, Adventure, Drama]
4 | White Noise 2: The Light | [Drama, Horror, Thriller]
5 | 2 | [Animation, Adventure, Family]
6 | Rush Hour | [Action, Comedy, Crime]
7 | The Breakfast Club | [Comedy, Drama]
8 | The Three Stooges | [Comedy]
9 | United 93 | [Action, Crime, Drama]
10 | Terminus | [SciFi]

Table C.5: Euclidean distances "Titanic" recommendations

C.2 Artificial Neural Network recommendations

Position | Name | Genres
1 | Ernest & Celestine | [Animation, Comedy, Crime]
2 | Pathology | [Crime, Horror, Thriller]
3 | Freejack | [Action, Crime, SciFi]
4 | Pride & Prejudice | [Drama, Romance]
5 | Bella | [Drama, Romance]
6 | Bandidas | [Action, Comedy, Crime]
7 | Awakened | [Drama, Mystery, Thriller]
8 | Asunder | [Thriller]
9 | The Pact | [Horror, Mystery, Thriller]
10 | Cherish | [Comedy, Drama, Thriller]

Table C.6: Artificial Neural Network "Contratiempo" recommendations

Position | Name | Genres
1 | Cinderella | [Drama, Family, Fantasy]
2 | The Replacements | [Comedy, Sport]
3 | The Golden Compass | [Adventure, Family, Fantasy]
4 | Freejack | [Action, Crime, SciFi]
5 | Warrior | [Drama, Sport]
6 | The Lovers | [Action, Adventure, Romance]
7 | Pleasantville | [Comedy, Drama, Fantasy]
8 | Valley | [Drama]
9 | Gimme Shelter | [Drama]
10 | Cherish | [Comedy, Drama, Thriller]

Table C.7: Artificial Neural Network "Dolor y Gloria" recommendations

Position | Name | Genres
1 | Ernest & Celestine | [Animation, Comedy, Crime]
2 | The Replacements | [Comedy, Sport]
3 | Pleasantville | [Comedy, Drama, Fantasy]
4 | Bandidas | [Action, Comedy, Crime]
5 | The Golden Compass | [Adventure, Family, Fantasy]
6 | Before Night Falls | [Biography, Drama]
7 | Cherish | [Comedy, Drama, Thriller]
8 | Syrup | [Comedy, Drama, Romance]
9 | Crazy Eyes | [Comedy]
10 | Saving Private Perez | [Adventure, Comedy, Western]

Table C.8: Artificial Neural Network "La Tribu" recommendations

Position | Name | Genres
1 | Pathology | [Crime, Horror, Thriller]
2 | Cinderella | [Drama, Family, Fantasy]
3 | Freejack | [Action, Crime, SciFi]
4 | Bandidas | [Action, Comedy, Crime]
5 | Before Night Falls | [Biography, Drama]
6 | Detention | [Comedy, Horror, SciFi]
7 | Cherish | [Comedy, Drama, Thriller]
8 | Saving Private Perez | [Adventure, Comedy, Western]
9 | The Pact | [Horror, Mystery, Thriller]
10 | The Eye of the Storm | [Drama]

Table C.9: Artificial Neural Network "Star Wars" recommendations

Position | Name | Genres
1 | Cinderella | [Drama, Family, Fantasy]
2 | Bella | [Drama, Romance]
3 | Pride & Prejudice | [Drama, Romance]
4 | Detention | [Comedy, Horror, SciFi]
5 | Awakened | [Drama, Mystery, Thriller]
6 | Cherish | [Comedy, Drama, Thriller]
7 | The Hole | [Short, Drama]
8 | Syrup | [Comedy, Drama, Romance]
9 | Pleasantville | [Comedy, Drama, Fantasy]
10 | The Pact | [Horror, Mystery, Thriller]

Table C.10: Artificial Neural Network "Titanic" recommendations

C.3 Autoencoder recommendations

Position | Name | Genres
1 | Red Corner | [Crime, Drama, Thriller]
2 | The Story of Luke | [Comedy, Drama]
3 | El Gringo | [Action, Drama]
4 | Blitz | [Action, Crime, Thriller]
5 | Hard Target | [Action, Thriller]
6 | Abraham Lincoln: Vampire Hunter | [Action, Fantasy, Horror]
7 | Snowden | [Biography, Drama, Thriller]
8 | Paul Blart: Mall Cop | [Action, Comedy, Crime]
9 | Signs | [Drama, SciFi, Thriller]
10 | Knights of Badassdom | [Adventure, Comedy, Fantasy]

Table C.11: Autoencoder "Contratiempo" recommendations

Position | Name | Genres
1 | A Perfect Day | [Drama]
2 | Anna Karenina | [Drama, Romance]
3 | Charlie Wilson’s War | [Biography, Comedy, Drama]
4 | Sex & Drugs & Rock & Roll | [Biography, Drama, Music]
5 | The Divergent Series: Allegiant | [Action, Adventure, SciFi]
6 | War Story | [Drama]
7 | Gimme Shelter | [Drama]
8 | Love Affair | [Comedy, Drama]
9 | Barb Wire | [Action, SciFi]
10 | American Pie 2 | [Comedy, Romance]

Table C.12: Autoencoder "Dolor y Gloria" recommendations

Position | Name | Genres
1 | The Story of Luke | [Comedy, Drama]
2 | El Gringo | [Action, Drama]
3 | Knights of Badassdom | [Adventure, Comedy, Fantasy]
4 | A Guy Thing | [Comedy, Romance]
5 | College Road Trip | [Adventure, Comedy, Drama]
6 | Hard Target | [Action, Thriller]
7 | Paul | [Adventure, Comedy, SciFi]
8 | The Amazing Panda Adventure | [Family, Adventure, Drama]
9 | Paul Blart: Mall Cop | [Action, Comedy, Crime]
10 | Semi-Pro | [Comedy, Sport]

Table C.13: Autoencoder "La Tribu" recommendations

Position | Name | Genres
1 | El Gringo | [Action, Drama]
2 | Knights of Badassdom | [Adventure, Comedy, Fantasy]
3 | College Road Trip | [Adventure, Comedy, Drama]
4 | Hard Target | [Action, Thriller]
5 | The Hunger | [Fantasy, Horror, Romance]
6 | The Amazing Panda Adventure | [Family, Adventure, Drama]
7 | Paul | [Adventure, Comedy, SciFi]
8 | Paul Blart: Mall Cop | [Action, Comedy, Crime]
9 | Blitz | [Action, Crime, Thriller]
10 | Hamlet | [Drama, Romance, Thriller]

Table C.14: Autoencoder "Star Wars" recommendations

Position | Name | Genres
1 | Magic Valley | [Drama]
2 | The End of Love | [Drama]
3 | College Road Trip | [Adventure, Comedy, Drama]
4 | The Story of Luke | [Comedy, Drama]
5 | For a Good Time, Call... | [Comedy]
6 | Starbuck | [Comedy, Drama]
7 | Frank | [Comedy, Drama, Music]
8 | Cast Away | [Adventure, Drama, Romance]
9 | Generation Iron | [Documentary, Drama, Sport]
10 | Ben-Hur | [Adventure, Drama, History]

Table C.15: Autoencoder "Titanic" recommendations

C.4 Double Autoencoder recommendations

Position | Name | Genres
1 | The Crazies | [Horror, Mystery, Thriller]
2 | By the Gun | [Crime, Drama, Thriller]
3 | The Hateful Eight | [Crime, Drama, Mystery]
4 | Haunt | [Horror, Mystery]
5 | Chicago | [Comedy, Crime, Musical]
6 | Thrashin’ | [Action, Drama]
7 | Saw III | [Horror, Mystery]
8 | Interstellar | [Adventure, Drama, SciFi]
9 | Lincoln | [Biography, Drama, History]
10 | Liberty Heights | [Drama, Music, Romance]

Table C.16: Double Autoencoder "Contratiempo" recommendations

Position | Name | Genres
1 | Broken | [Drama, Romance]
2 | Dust to Glory | [Documentary, Action, Adventure]
3 | Greetings from Tim Buckley | [Drama]
4 | Magic Trip | [Short, Drama]
5 | Notting Hill | [Comedy, Drama, Romance]
6 | Jimi: All Is by My Side | [Biography, Drama, Music]
7 | Black Nativity | [Drama, Family, Music]
8 | Upstream Color | [Drama, SciFi]
9 | The Life Before Her Eyes | [Drama, Mystery, Thriller]
10 | Naomi and Ely’s No Kiss List | [Comedy, Drama, Romance]

Table C.17: Double Autoencoder "Dolor y Gloria" recommendations

Position | Name | Genres
1 | Ira & Abby | [Comedy, Romance]
2 | Chicago | [Comedy, Crime, Musical]
3 | Daddy’s Home | [Comedy]
4 | The Identical | [Drama, Music]
5 | Megamind | [Animation, Action, Comedy]
6 | Wrong Cops | [Comedy, Crime]
7 | The Voices | [Comedy, Crime, Horror]
8 | Flesh+Blood | [Adventure, Drama]
9 | Redemption | [Action, Crime, Drama]
10 | Godzilla | [Action, Adventure, SciFi]

Table C.18: Double Autoencoder "La Tribu" recommendations

Position | Name | Genres
1 | Interstellar | [Adventure, Drama, SciFi]
2 | Driven | [Action, Drama, Sport]
3 | Jinn | [Thriller]
4 | A Scanner Darkly | [Animation, SciFi, Thriller]
5 | The Expendables 2 | [Action, Adventure, Thriller]
6 | Saving Shiloh | [Drama, Family]
7 | Batman Begins | [Action, Adventure]
8 | College Road Trip | [Adventure, Comedy, Drama]
9 | Godzilla | [Action, Adventure, SciFi]
10 | Flesh+Blood | [Adventure, Drama]

Table C.19: Double Autoencoder "Star Wars" recommendations

Position | Name | Genres
1 | Saving Shiloh | [Drama, Family]
2 | Shallow Grave | [Crime, Thriller]
3 | Billy Bates | [Drama]
4 | The Lucky One | [Drama, Romance]
5 | Leatherheads | [Comedy, Drama, Romance]
6 | L.A. Story | [Comedy, Drama, Fantasy]
7 | Spring Breakers | [Action, Crime, Drama]
8 | Bad Milo | [Comedy, Horror]
9 | One True Thing | [Drama]
10 | The Identical | [Drama, Music]

Table C.20: Double Autoencoder "Titanic" recommendations

Appendix D

Survey template


Figure D.1: Survey Template

Appendix E

Detectable classes by object detector

Tortoise, Organ, Parking meter, Tick, Surfboard, Container, Cassette deck, Traffic light, Belt, Boot,
Magpie, Apple, Croissant, Sunglasses, Headphones, Sea turtle, Human eye, Cucumber, Banjo, Hot dog,
Football, Cosmetics, Radish, Cart, Shorts, Ambulance, Paddle, Towel, Ball, Fast food,
Ladder, Snowman, Doll, Backpack, Bus, Toothbrush, Beer, Skull, Bicycle, Boy,
Syringe, Chopsticks, Washing machine, Home appliance, Screwdriver, Sink, Human beard, Centipede, Bicycle wheel, Glove,
Toy, Bird, Boat, Barge


Laptop, Starfish, Traffic sign, Cello, Box, Miniskirt, Popcorn, Chair, Jet ski, Stapler,
Drill, Burrito, Shirt, Camel, Christmas tree, Dress, Chainsaw, Poster, Coat, Cowboy hat,
Bear, Balloon, Cheese, Suit, Hiking equipment, Waffle, Wrench, Sock, Desk, Studio couch,
Pancake, Tent, Fire hydrant, Cat, Drum, Brown bear, Vehicle registration plate, Land vehicle, Bronze sculpture, Dessert,
Woodpecker, Earrings, Juice, Lantern, Wine rack, Blue jay, Tie, Gondola, Toaster, Drink,
Pretzel, Watercraft, Beetle, Flashlight, Zucchini, Bagel, Cabinetry, Cannon, Billboard, Ladle,
Tower, Suitcase, Computer mouse, Tiara, Human mouth, Teapot, Muffin, Cookie, Limousine, Dairy,
Person, Bidet, Office building, Necklace, Dice, Bow and arrow, Snack, Fountain, Carnivore, Oven,
Swimwear, Snowmobile, Coin, Scissors, Dinosaur, Beehive, Clock, Calculator, Stairs, Ratchet,
Brassiere, Medical equipment, Cocktail, Computer keyboard, Couch, Bee, Computer monitor, Cattle, Cricket ball, Bat,
Printer, Winter melon, Human body, Gas stove, Street light, Marine invertebrates, Spatula, Roller skates, Salt and pepper shakers, Guitar,
Kitchen utensil, Whiteboard, Coffee cup, Pillow, Mechanical fan, Light switch, Pencil sharpener, Cutting board, Human leg, Face powder,
House, Door, Blender, Isopod, Fax, Horse, Hat, Plumbing fixture, Grape, Fruit,
Stationary bicycle, Shower, Stop sign, Human ear, French fries, Eraser, Office supplies, Power plugs and sockets, Hammer, Nightstand,
Fedora, Volleyball, Ceiling fan, Barrel, Panda, Guacamole, Vase, Sofa bed, Kite, Giraffe,
Dagger, Slow cooker, Adhesive tape, Tart, Woman, Scarf, Wardrobe, Harp, Treadmill, Door handle,
Dolphin, Coffee, Sandal, Fox, Rhinoceros, Sombrero, Whisk, Bicycle helmet, Flag, Bathtub,
Tin can, Paper towel, Saucer, Horn, Goldfish, Mug, Personal care, Harpsichord, Window blind, Houseplant,
Tap, Food, Human hair, Human foot, Goat, Harbor seal, Sun hat, Heater, Golf cart, Baseball bat,
Stretcher, Tree house, Harmonica, Jacket, Baseball glove, Can opener, Flying disc, Hamster, Egg, Mixing bowl,
Goggles, Skirt, Curtain

Bed, Alarm clock, Mouse, Lifejacket, Pasta, Kettle, Filing cabinet, Motorcycle, Table tennis racket, Penguin,
Fireplace, Artichoke, Musical instrument, Pumpkin, Pencil case, Scale, Table, Pear, Swim cap, Musical keyboard,
Drinking straw, Tableware, Infant bed, Frying pan, Insect, Kangaroo, Scoreboard, Polar bear, Snowplow, Hair dryer,
Koala, Briefcase, Mixer, Bathroom cabinet, Kitchenware, Knife, Kitchen knife, Cupboard

Indoor rower, Bottle, Missile, Nail, Jacuzzi, Invertebrate, Bottle opener, Bust, Tennis ball, Pizza,
Food processor, Lynx, Man, Plastic bag, Digital clock, Bookcase, Lavender, Waffle iron, Oboe, Pig,
Refrigerator, Lighthouse, Milk, Chest of drawers, Reptile, Wood-burning stove, Dumbbell, Ring binder, Ostrich, Rifle,
Human head, Plate, Piano, Lipstick, Punching bag, Bowl, Mobile phone, Girl, Skateboard, Common fig,
Humidifier, Baked goods, Plant, Raven, Cocktail shaker, Porch, Mushroom, Potato, High heels, Jaguar,
Lizard, Crutch, Hair spray, Red panda, Golf ball, Billiard table, Pitcher, Sports equipment, Rose, Fashion accessory,
Mammal, Mirror, Rabbit

Sculpture, Skyscraper, Football helmet, Bread, Wine glass, Saxophone, Sheep, Truck, Platter, Countertop,
Shotgun, Television, Measuring cup, Chicken, Tablet computer, Seafood, Trombone, Coffeemaker, Eagle, Waste container,
Submarine sandwich, Tea, Violin, Helicopter, Swimming pool, Tank, Vehicle, Owl, Dog, Snowboard,
Taco, Handbag, Duck, Book, Sword, Telephone, Paper cutter, Turtle, Elephant, Picture frame,
Torch, Wine, Hippopotamus, Shark, Sushi, Tiger, Weapon, Crocodile, Candle, Loveseat,
Strawberry, Wheel, Toilet, Leopard, Ski, Trumpet, Worm, Toilet paper, Axe, Squirrel,
Tree, Wok, Squid, Hand dryer, Tripod, Tomato, Whale, Clothing, Soap dispenser, Stethoscope,
Train, Zebra, Footwear, Porcupine, Submarine, Tool, Auto part, Lemon, Flower, Scorpion,
Picnic basket, Jug, Spider, Canary, Segway, Cooking spray, Pizza cutter, Deer, Cheetah, Training bench,
Trousers, Cream, Frog, Palm tree, Snake, Bowling equipment, Monkey, Banana, Hamburger, Coffee table,
Lion, Rocket, Maple, Building, Antelope, Human face, Microwave oven, Kitchen & dining room table, Fish, Beaker,
Human arm, Honeycomb, Dog bed, Lobster, Moths and butterflies, Vegetable, Marine mammal, Cake stand, Asparagus, Diaper,
Sea lion, Window, Cat furniture, Furniture, Unicycle, Ladybug, Closet, Bathroom accessory, Hedgehog, Falcon,
Shelf, Castle, Airplane, Chime, Watch, Facial tissue holder, Jellyfish, Spoon, Snail, Candy,
Goose, Pressure cooker, Otter, Shellfish, Salad, Mule, Kitchen appliance, Bull, Cabbage, Parrot,
Swan, Oyster, Carrot, Handgun, Tire, Peach, Horizontal bar, Mango, Sparrow, Ruler,
Coconut, Convenience store, Jeans, Van, Luggage and bags, Seat belt, Flowerpot, Grinder, Bomb, Raccoon,
Microphone, Pineapple, Spice rack, Bench, Chisel, Broccoli, Drawer, Light bulb, Ice cream, Fork,
Umbrella, Stool, Corded phone, Caterpillar, Lamp, Pastry, Envelope, Sports uniform, Butterfly, Camera,
Grapefruit, Cake, Tennis racket, Parachute, Squash, Band-aid, Dragonfly, Wall clock, Orange, Racket,
Animal, Sunflower, Serving tray

Bell pepper, Ant, Dishwasher, Ipod, Taxi, Turkey, Car, Flute, Accordion, Canoe,
Lily, Aircraft, Balance beam, Willow, Remote control, Pomegranate, Human hand, Sandwich, Crab, Wheelchair,
Doughnut, Skunk, Shrimp, Crown, Rugby ball, Glasses, Teddy bear, Sewing machine, Seahorse, Armadillo,
Human nose, Watermelon, Binoculars, Perfume, Maracas, Pen, Cantaloupe, Rays and skates, Alpaca, Helmet

Appendix F

Detectable classes by the action recogniser

clapping, preaching, destroying, pulling, sneezing, praying, raining, competing, squatting, flipping,
dropping, stitching, giggling, aiming, sewing, burying, spraying, shoveling, crouching, clipping,
covering, twisting, chasing, tapping, working

flooding, coaching, flicking, skipping, rocking, leaping, submerging, pouring, washing, asking,
drinking, breaking, buttoning, winking, playing + fun, slapping, tuning, hammering, queuing, camping,
cuddling, boarding, carrying, locking, plugging, sleeping, running, surfing, stopping, pedaling

constructing, sliding, leaning, crushing, sowing, slipping, filming, sailing, rubbing, dripping,
sweeping, driving, singing, punting, writing, screwing, handwriting, playing, watering, clawing,
shrugging, steering, hitting, playing + music, bending, hitchhiking, filling, bubbling, removing, boxing,
cracking, crashing, joining, tearing, mopping, scratching, stealing, bathing, imitating, gripping,
trimming, pressing, raising, teaching, flowing, selling, shouting, sitting, cooking, digging,
marching, hiking, drawing, reaching, tripping, stirring, vacuuming, protesting, studying, cheering,
kissing, pointing, rinsing, serving, buying, jumping, giving, coughing, bulldozing, bicycling,
starting, diving, smashing, shaking, feeding, clinging, hugging, slicing, discussing, emptying,
socializing, building, balancing, dragging, unpacking, picking, swerving, rafting, gardening, sketching,
splashing, dining, kneeling, performing, standing, licking, floating, dunking, officiating, weeding,
kicking, cheerleading, brushing, photographing, stacking, drying, dressing, spitting, packing, telephoning,
crying, inflating, dipping, descending, crafting, spinning, climbing, riding, falling, knocking,
frying, shredding, chopping, entering, playing + videogames, cutting, reading, extinguishing, pushing, storming,
paying, sanding, applauding, sawing, placing, eating, frowning, calling, smelling, turning,
lecturing, closing, talking, overflowing, barking, dancing, hunting, adult + male + speaking, fighting, child + singing,
adult + female + speaking, clearing, waking, snowing, opening, launching, barbecuing, boiling, shaving, waxing,
packaging, skating, peeling, marrying, juggling, fishing, painting, wrapping, rising, mowing,
spilling, drilling, wetting, laughing, shooting, leaking, punching, attacking, crawling, sniffing,
knitting, tying, welding, flying, interviewing, boating, manicuring, putting, assembling, stomping,
sprinkling, plunging, swinging, injecting, chewing, baptizing, grilling, carving, landing, arresting,
playing + sports, pitching, walking, operating, grooming, rolling, towing, rowing, blowing, stroking,
draining, sprinting, bowing, cleaning, snapping, massaging, hanging, gambling, combing, biting,
scrubbing, planting, saluting, spreading, roaring, handcuffing, speaking, fueling, racing, guarding,
celebrating, ascending, autographing, combusting, unloading, jogging, yawning, throwing, adult + female + singing, lifting,
colliding, cramming, drenching, instructing, bowling, burning, fencing, waving, folding, resting,
wrestling, swimming, signing, measuring, blocking, poking, adult + male + singing, repairing, whistling, smiling,
tickling, baking, snuggling, exiting, tattooing, exercising, smoking, shopping, stretching, erupting,
loading, skiing, bouncing, taping, howling, piloting, drumming, dusting, squinting, parading,
typing, child + speaking, catching, grinning