Deep Learning for Video Game Genre Classification
Yuhang Jiang, Lukun Zheng
Western Kentucky University, 1906 College Heights Blvd, Bowling Green, KY 42101

∗Corresponding Author: Lukun Zheng. Email address: [email protected] (Western Kentucky University)
arXiv:2011.12143v1 [cs.CV] 21 Nov 2020
Preprint submitted to Elsevier, November 25, 2020

Abstract

Video game covers and textual descriptions are usually the first impression a game makes on consumers, and they often convey important information about the game. Video game genre classification based on covers and textual descriptions would be highly beneficial to many modern identification, collocation, and retrieval systems. At the same time, it is an extremely challenging task for the following reasons. First, there exists a wide variety of video game genres, many of which are not concretely defined. Second, video game covers vary in many ways, such as colors, styles, and textual information, even for games of the same genre. Third, cover designs and textual descriptions may vary due to many external factors such as country, culture, and target audience. With growing competition in the video game industry, cover designers and typographers push cover designs to their limits in the hope of attracting sales. Computer-based automatic video game genre classification has therefore become a particularly exciting research topic in recent years. In this paper, we propose a multi-modal deep learning framework to solve this problem. The contribution of this paper is four-fold. First, we compile a large dataset of 50,000 video games from 21 genres, comprising cover images, description text, title text, and genre information. Second, state-of-the-art image-based and text-based models are evaluated thoroughly for the task of genre classification for video games. Third, we develop an efficient and scalable multi-modal framework based on both images and texts.
Fourth, a thorough analysis of the experimental results is given and future work to improve the performance is suggested. The results show that the multi-modal framework outperforms the current state-of-the-art image-based or text-based models. Several challenges are outlined for this task. More effort and resources are needed for this classification task to reach a satisfactory level.

Keywords: video games, genre classification, multi-modal learning, transfer learning, neural networks
2010 MSC: 68T45, 68T20

1. Introduction

Genres are stylistic categories of literature, music, or other forms of art or entertainment, characterized by their form, content, subject matter, and the like. Genres are often used to identify, retrieve, and organize works of interest. Genre classification is the process of grouping works together based on certain stylistic similarities. It has been used in a wide range of areas such as music, paintings, film, and books. These important tasks were traditionally done manually, which imposes limits in terms of cost, time, and other resources. With the growing computational power of modern computers, many automatic genre classification techniques have been developed and used in domains such as movies, books, and paintings. In this study, we focus on the task of genre classification of video games based on the cover and textual description. To the best of our knowledge, this is the first attempt at automatic genre classification of video games using a deep learning approach.

Video games have been one of the most popular, profitable, and influential forms of entertainment across the world. There have been many controversies surrounding video games. Many studies show that video games can be very helpful in education across different age groups and comprehension levels [44, 35, 2, 32, 27].
However, others point out that video games can cause many problems in terms of health, wasted time, and even violent crime [3, 17, 20]. Despite the extensive debates in academia and in real life, it is clear that the video game market has been growing rapidly across the world over the years.

Genre and its classification systems play a significant role in the development of video games. Even though genre classification for video games is similar to that for other forms of media, it has its own particularities and challenges. First, unlike other media such as books and movies, video games are generally interactive with their players [10]. Second, new genres are constantly formed and conceptualized over time due to the fast development of new video games. Third, existing genres may shift over time as developers, players, and the media come up with new names. Finally, video games have a short history compared with other forms of media [36].

Recently, deep learning approaches have reached high performance across a wide variety of problems [41, 18, 43, 6, 26, 34, 45, 51, 21]. In particular, some deep convolutional neural networks can achieve a satisfactory level of performance on many visual recognition and categorization tasks, in some cases exceeding human performance. One of the most attractive qualities of these techniques is that they can perform well without any external hand-designed resources or task-specific feature engineering. The theoretical foundations of deep learning are well rooted in the classical neural network (NN) literature. A deep network involves many hidden neurons and layers in addition to the input and output layers, which serves as an architectural advantage [41]. A deep convolutional neural network is universal, meaning that it can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough [52].
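The universality property can be stated a little more formally. The following is a hedged paraphrase rather than the exact theorem of [52]; the symbols $K$ (a compact input domain), $f$ (the target function), $\Phi_J$ (a convolutional network of depth $J$), and $\varepsilon$ (the error tolerance) are notation introduced here for illustration only.

```latex
\[
\forall f \in C(K),\quad \forall \varepsilon > 0,\quad
\exists J \in \mathbb{N} \text{ and a depth-}J \text{ network } \Phi_J
\text{ such that }
\sup_{x \in K} \bigl| f(x) - \Phi_J(x) \bigr| < \varepsilon ,
\qquad K \subset \mathbb{R}^d \text{ compact}.
\]
```

This is an existence statement only: it says that a sufficiently deep network with suitable weights exists, not that training on a finite dataset will find it.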
In this paper, we aim to develop three deep learning approaches for the task of video game genre classification: an image-based approach using the game covers, a text-based approach using the textual descriptions, and a multimodal approach using both the game covers and the textual descriptions.

Covers are usually the first impression a game makes on consumers, and they often convey important information about the game. Figure 1 presents some sample video game covers. For instance, in Figure 1 (a), the cover image features two boxers, which indicates that it is a fighting game. The two cars side by side on the road in Figure 1 (b) suggest that it is a racing game. The soccer ball in Figure 1 (c) indicates that it is a sports game. Lastly, in Figure 1 (d), there is a guitar on the cover, which implies that this is a music game.

Figure 1: Cover images conveying important information about the genres of the corresponding video games. The genres are: (a) Fighting; (b) Racing; (c) Sports; and (d) Music.

In addition to the cover images, we also use the game descriptions for genre classification. Game descriptions are designed to suitably express the key concepts and mechanics inherent in video games. The main purpose of a game description is to express the objects involved in a game and the set of rules that induce transitions, resulting in a state-action space [15]. It usually contains important information about the game's genre. Consider the following example of a description for the game BOUNCING BALLS: "Use your mouse to aim and shoot. Destroy the balls by shooting them into groups of three or more. Clear all of the balls to get to the next level. The faster you clear the board the more points you'll get! The game ends when a ball hits the bottom." In this description, keywords such as "aim", "shoot", and "shooting" indicate that it is a shooter game. Hence we will also explore the use of the game descriptions for the task of genre classification.
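The keyword intuition above can be illustrated with a minimal sketch. This is not the text-based method evaluated in this paper, which relies on deep models; it is a plain keyword count, and the genre names and keyword lists in `GENRE_KEYWORDS` are hypothetical choices made only for this example.

```python
import re
from collections import Counter

# Hypothetical keyword lists, chosen only for illustration; they are not
# taken from the paper or its dataset.
GENRE_KEYWORDS = {
    "Shooter": {"aim", "shoot", "shooting", "gun"},
    "Racing": {"race", "racing", "car", "lap"},
    "Music": {"guitar", "rhythm", "song", "beat"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_scores(description):
    """Count, for each genre, how many tokens match its keyword list."""
    counts = Counter(tokenize(description))
    return {
        genre: sum(counts[word] for word in words)
        for genre, words in GENRE_KEYWORDS.items()
    }


def guess_genre(description):
    """Return the genre with the highest keyword count (ties: first listed)."""
    scores = keyword_scores(description)
    return max(scores, key=scores.get)


# The BOUNCING BALLS description quoted in the text.
DESCRIPTION = (
    "Use your mouse to aim and shoot. Destroy the balls by shooting them "
    "into groups of three or more. Clear all of the balls to get to the "
    "next level. The faster you clear the board the more points you'll get! "
    "The game ends when a ball hits the bottom."
)

if __name__ == "__main__":
    print(keyword_scores(DESCRIPTION))
    print(guess_genre(DESCRIPTION))
```

A fixed keyword list of this kind is brittle: it misses synonyms and cannot weigh context, which is one reason to prefer learned text representations for this task.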
Finally, we consider a multimodal deep learning architecture based on both the game cover and the description for the task of genre classification. This approach involves two steps: (1) a neural network is trained on the classification task for each modality; (2) intermediate representations are extracted from each network and combined in a multimodal learning step. We believe that relating information from both modalities will help to improve the performance of genre classification for video games.

The main contributions of this study include:
(1) We created a large dataset of 50,000 video games including game cover images, description text, title text, and genre information. This dataset can be used for a variety of studies such as text recognition from images, automatic topic mining, and so on. This dataset will be made available to the public in the future.
(2) State-of-the-art models are evaluated thoroughly for the task of video game genre classification. These include five image-based models and two text-based models using deep transfer learning methods that utilize the information from existing pre-trained models.
(3) We developed a multi-modal deep learning architecture based on two modalities: one for images and the other for texts. The information from both modalities is then combined by concatenation. We believe that merging information from both modalities would help increase the classification accuracy for this task.
(4) A comprehensive analysis of the difficulties in this video game genre classification task is provided, and possible suggestions are made for future work to improve the performance.

The rest of the paper is structured as follows. Section 2 presents related work on book cover classification. Section 3 elaborates on the details of the models. In Section 4, we discuss the experimental results. The last section concludes the paper and discusses future work.

2. Related Works

In recent years, automated genre classification has drawn wide attention from different domains by leveraging the strength of deep neural networks. In book genre classification, Chiang et al.