Aalto University School of Science Master’s Programme in Computer, Communication and Information Sciences

Ossi Hirvola

Detection of Mahjong tiles from videos using computer vision

Master’s Thesis Espoo, May 21, 2019

Supervisor: Prof. Juho Kannala
Advisor: D.Sc. (Tech.) Juha Ylioinas

Aalto University School of Science
Master's Programme in Computer, Communication and Information Sciences

ABSTRACT OF MASTER'S THESIS

Author: Ossi Hirvola
Title: Detection of Mahjong tiles from videos using computer vision
Date: May 21, 2019          Pages: 49
Major: Computer Science     Code: SCI3042
Supervisor: Prof. Juho Kannala
Advisor: D.Sc. (Tech.) Juha Ylioinas

Mahjong is a popular 4-player game originating from China. The result of a single game of Mahjong involves a considerable amount of chance, similarly to that of a few hands of Poker. Hence, long-term data analysis is often required to evaluate one's decisions. There are Internet Mahjong platforms that provide replays of past games, but no solution exists that provides digital replays for games played in person. In this thesis, we approached this problem by means of object detection.

Recently in object detection, convolutional neural network (CNN) based methods have been popular. Training these methods requires large amounts of labeled training data. For this, we implemented a synthetic data generator that produces synthetic images containing Mahjong tiles. We used the synthetic images to train a single-shot multibox detector (SSD), our object detector of choice.

The SSD network that was trained solely on synthetic training data performed remarkably on synthetic validation data, but did not reach the desired accuracy on real video data. However, by introducing a scarce amount of real images as part of the training data, we achieved reasonable accuracy in Mahjong tile detection from real video data. Fine-tuning the synthetic data generation to better correspond to real data, as well as implementing error correction to further post-process object proposals, are potential future improvements.

Keywords: computer vision, object detection, convolutional neural networks, SSD
Language: English

Aalto University School of Science
Master's Programme in Computer, Communication and Information Sciences

ABSTRACT OF MASTER'S THESIS (Diplomityön tiivistelmä)

Author: Ossi Hirvola
Title: Mahjong-tiilien tunnistus videoista konenäöllä (Detection of Mahjong tiles from videos using computer vision)
Date: May 21, 2019          Pages: 49
Major: Computer Science     Code: SCI3042
Supervisor: Prof. Juho Kannala
Advisor: D.Sc. (Tech.) Juha Ylioinas

Mahjong is a popular Chinese four-player game. The outcome of a single game of Mahjong depends heavily on chance, much like the outcome of, for example, a few hands of Poker. For this reason, evaluating the quality of one's moves often requires analyzing games over a long period of time. Many digital Mahjong platforms on the Internet offer players records of their past games, but no such solution exists for physically played games. In this thesis, we approach this problem with the help of object detection.

Recently, methods based on convolutional neural networks have been popular in object detection. Training these methods requires large amounts of pre-labeled training data. For this reason, we developed a program that produces synthetic images containing Mahjong tiles. We use these synthetic images to train a Single-Shot MultiBox Detector (SSD).

An SSD network trained entirely on synthetic data produced very accurate results on synthetic test data, but did not reach the desired accuracy on real videos. However, by adding a small number of real images to the training data alongside the synthetic images, the method achieved reasonable accuracy in Mahjong tile detection. In the future, accuracy can be improved by fine-tuning the synthetic data generation to better match real images and by improving the error correction methods.

Keywords: computer vision, object detection, convolutional neural networks, SSD
Language: English

Acknowledgements

I would like to thank Prof. Juho Kannala for his guidance and support. I would also like to thank Dr. Juha Ylioinas for his advice.

Espoo, May 21, 2019

Ossi Hirvola

Abbreviations and Acronyms

CNN   Convolutional Neural Network
SSD   Single-Shot Multibox Detector
mAP   Mean Average Precision
FPS   Frames Per Second

Contents

Abbreviations and Acronyms

1 Introduction
  1.1 Motivation
  1.2 Scope of the thesis
  1.3 Contributions
  1.4 Structure of the thesis

2 Background
  2.1 Object detection
      2.1.1 Datasets and tasks
      2.1.2 Traditional object detection
      2.1.3 Deep neural network based object detection
      2.1.4 Synthetic training data
  2.2 Riichi Mahjong
      2.2.1 Tiles
      2.2.2 Setup and game play
      2.2.3 Winning hand
      2.2.4 Calls
      2.2.5 Objective

3 Single-Shot MultiBox Detector
  3.1 Model structure
  3.2 Training
  3.3 Performance on standard benchmarks

4 Detection of Mahjong tiles
  4.1 Synthetic data generation
      4.1.1 Tiles
      4.1.2 Tile and camera positioning
      4.1.3 Background
      4.1.4 Lighting and shadows
      4.1.5 Post processing
      4.1.6 Bounding boxes
  4.2 SSD implementation

5 Experiments
  5.1 Training
  5.2 Synthetic validation performance
  5.3 Real data performance
  5.4 Combining real and synthetic training data

6 Discussion
  6.1 Result analysis
  6.2 Synthetic training data
  6.3 SSD for Mahjong tile detection
  6.4 Future work

7 Conclusions

Chapter 1

Introduction

The game of Mahjong is one of the most popular table games, with an estimated player base of 700 million people [1]. By Mahjong, we refer to the 4-player game originating from China, as shown in figure 1.1; it should not be confused with the single-player digital tile-matching game. There are many regional variations of Mahjong. In this thesis, when referring to Mahjong, we specifically mean Riichi Mahjong, which originates from Japan.

Figure 1.1: Common view of a game of Mahjong.


1.1 Motivation

Mahjong is a game of chance, meaning that the outcome of a single Mahjong game considerably involves chance, much like in Poker. Therefore, the immediate feedback from a decision does not necessarily indicate whether the decision was good or bad. For players striving to improve, analyzing their games afterwards is crucial. Multiple online Mahjong platforms, such as Tenhou [2], provide complete digital game data of previously played online games for analysis. At the time of writing, however, there are no applications that could provide digital replays of Mahjong games played in real life.

The recent progress in object detection research gives reason to expect that a satisfactory result in detecting Mahjong tiles can be achieved with the current state-of-the-art object detection methods. Especially the latest convolutional neural network (CNN) based approaches achieve relatively accurate real-time object detection. The training of neural networks is an important part of these methods, and it often requires large amounts of labeled training data [3]. Therefore, using synthetic images as training data has been an important area of research. Since there is no suitable Mahjong dataset publicly available at the time of writing, we explore the possibilities of synthetic training data generation for Mahjong tile detection.

The objective of this thesis is to bring digital replays of real-life Mahjong games one step closer by providing a base solution for accurately detecting Mahjong tiles from video data. In addition, we examine the possibilities of the synthetic training data approach for a well-defined object detection problem. The video data used in this thesis was captured using inexpensive consumer web cameras.

1.2 Scope of the thesis

There are multiple approaches that could possibly be used to achieve digital replays, but in this thesis we concentrate on the object detection approach. That is, acquiring game information from video data to produce a digital replay of the game. This approach can be roughly split into two phases: detecting Mahjong tiles from video data, and applying Mahjong rules to create a digital replay from the extracted tile information. The scope of this thesis is the first phase of the pipeline, the detection of Mahjong tiles from video data.

1.3 Contributions

The main contributions of this thesis are:

• Synthesis of training data

• Accurate Mahjong tile detection using SSD

• Analysis of performance

1.4 Structure of the thesis

This thesis is divided into seven chapters. First, in Chapter 2, we present a background review of object detection and explain the essential Riichi Mahjong rules. Next, in Chapter 3, we thoroughly explain the Single-Shot MultiBox Detector (SSD) object detection method, which is used for Mahjong tile detection in this thesis. In Chapter 4, we give a step-by-step description of our synthetic training data generation process, as well as explain our SSD implementation. The experimental setting as well as the results are presented in Chapter 5. Then, the results are further discussed in Chapter 6. Finally, we conclude the thesis in Chapter 7.

Chapter 2

Background

In this chapter we discuss object detection, after which we give a general explanation of the Riichi Mahjong rules that are essential for understanding the following chapters of this thesis.

2.1 Object detection

Object detection is the problem of classifying and localizing objects in images. That is, in addition to recognizing the objects in an image, each object's location in the image is to be estimated. Object locations are usually indicated with rectangular bounding boxes.

2.1.1 Datasets and tasks

Object detection methods can roughly be categorized into two subcategories: generic object detection and salient object detection. Generic object detection focuses on localizing objects by determining the bounding boxes around them [4]. In salient object detection, the objective is to localize objects by segmenting them on pixel level [5].

Evaluating object detection methods requires labeled image data. Many datasets, such as PASCAL VOC2007 [6] and MS COCO [7], have been created and made publicly available to ease the development and evaluation of object detection methods. The public datasets also make it easier to compare object detection methods with each other and to reproduce their results.

A dataset has a predefined list of classes (for example person, cat and dog) that defines the object categories the dataset covers. For every image in the dataset, there are annotations with the class and location

information of every object that fits the predefined object categories. Both the VOC2007 and COCO datasets contain location information in two formats: bounding boxes and segmentations. A bounding box is the minimum rectangle that contains the whole object, and a segmentation is the pixel-level information specifying which pixels of the image belong to a specific object. In figure 2.1, an example image from the COCO dataset with the location information is shown. Here, 2.1 (b) visualizes the bounding boxes for the cat (in orange) and the dog (in blue), and 2.1 (c) shows the pixel-level segmentation.

(a) Original image. (b) Ground truth bounding boxes visualized. (c) Ground truth segmentation visualized.

Figure 2.1: Example image from COCO dataset.

2.1.2 Traditional object detection

Traditional object detection methods are often based on a sliding window approach [8]. This means that instead of examining the image as a whole, the algorithms inspect partial crops of the image, often with varying sizes and aspect ratios. Such methods include the Viola-Jones framework [9], the Histogram of Oriented Gradients (HOG) detector [10], and the Deformable Parts Model (DPM) [11]. While these methods differ in the algorithms used, the three base steps are similar for all of them [4]. First, they scan the image for informative regions, for example by using multi-scale sliding windows. Then, they extract a set of features for each region. These sets of features are also known as feature vectors, each containing the features extracted from a single patch (or region) of an image [11]. A collection of feature vectors is also referred to as a feature map in modern literature [11]. Finally, they train a classifier, such as a Support Vector Machine (SVM), on each location. This step is often accompanied by additional steps to enhance the training result. For example, in the HOG detector, the training process includes resampling some of the negative training images to reduce the amount of false positive detections [10].

The HOG features are hand-engineered [10], meaning every feature was defined manually. Not only HOG, but many other traditional object detection methods are based on engineered features [12]. In the late 2000s and early 2010s, learning features instead of engineering them by hand was a widely researched topic in object detection. Studies have come to the conclusion that learned features generally outperform engineered features, as long as a sufficient amount of training data is available [13][14]. This was arguably the turning point between traditional and modern object detection.

2.1.3 Deep neural network based object detection

Object detection using Deep Neural Networks (DNN) has become the standard approach, thanks to the recent availability of larger datasets as well as increasingly powerful GPU computing [15]. These modern methods can learn more complex features than traditional methods by utilizing their deeper, hierarchical architectures [4]. DNN based object detection can be roughly categorized into two approaches: multi-step methods with a pipeline similar to traditional object detection, and one-step methods. Multi-step approaches, such as Regions with CNN features (R-CNN) [16], were the pioneers of DNN based object detection. R-CNN outperformed the other state-of-the-art methods of the time in terms of accuracy, but it could not match their speed, requiring tens of seconds to process a single image [16]. With the Faster R-CNN [17] that was proposed later, the speed was increased significantly, allowing a processing speed of 5 FPS at the time [17].

Single-shot approaches, such as You Only Look Once (YOLO) [18] and the Single-Shot MultiBox Detector (SSD) [19], have become popular due to their outstanding speed and accuracy competitive with multi-step methods. The main reason for the difference in speed compared to multi-step methods is the time saved by eliminating the overhead of handling separate components [4]. YOLO achieves a 45 FPS processing speed, enabling real-time applications [18]. However, in terms of accuracy, it is outperformed by the earlier multi-step method Faster R-CNN [18]. As opposed to YOLO, SSD achieves real-time processing speed while slightly outperforming even the slower state-of-the-art methods, including Faster R-CNN [19].

Many of the above-mentioned methods base their network structure partially on existing neural network architectures, such as VGG16 [20] and ResNet [21]. The VGG16 architecture pre-trained on ImageNet [22] has been a popular base network in many recent object detectors, including SSD. As changing a CNN based method's base architecture into a more powerful network has been shown to improve its performance [4], base architectures can be changed to better correspond to the detection problem. For example, using ResNet as the base network of R-CNN instead of VGG16 gives a significant increase in mAP, but also greatly sacrifices processing speed

[4].

2.1.4 Synthetic training data

Lately, using synthetic images in the training of deep neural networks has been a widely researched area in computer vision. This means that instead of possibly hard-to-acquire real-life training data, the network is trained on synthetic image data. Synthetic training data is more scalable in terms of variety and dataset size than the manual alternative [23]. Furthermore, the acquisition and manual labeling of real world data is an expensive and time-consuming process [24]. Especially in self-driving car research, generating synthetic data allows creating training data that could otherwise be difficult or even dangerous to replicate in real life [25]. Hence, synthetic datasets have been created and made publicly available for various object detection purposes. Such datasets include the urban scene semantic segmentation dataset SYNTHIA [26] and the SynthText in the Wild [23] dataset containing synthetic text on real-life images.

2.2 Riichi Mahjong

Riichi Mahjong is a variation of the popular Chinese 4-player game called Mahjong, and should not be confused with the single-player tile-matching game. Mahjong, including all its variants, has an estimated player base of 700 million people. Riichi Mahjong is the most popular variant of Mahjong in Japan, played by over 30 million people in Japan and even more worldwide. [1]

While this thesis focuses on Riichi Mahjong, the results can be applied to other variations of Mahjong, and possibly to other table games as well. This section explains the rules as they are defined in the World Riichi Championship (2015) rule set [27]. There are minor differences between different Riichi Mahjong rule sets, but they are not relevant from the computer vision perspective.

The rules of Riichi Mahjong are often considered rather complex. Since the focus of this thesis is on computer vision rather than the game of Mahjong itself, we will not explain the rules in their entirety, but omit parts that are unnecessary for understanding the decisions made in this thesis and the motivation behind it. This includes, for example, the point calculation of a winning hand. For the purposes of this thesis, we use self-explanatory English terms where possible. Please note, however, that most Riichi Mahjong terms originate from Chinese and Japanese and are used as-is even among western players.

2.2.1 Tiles

There are 34 different tiles, four copies of each, totaling 136 tiles. These tiles can be divided into two categories: suits and honors. There are three suits with nine different tiles in each, each tile representing a numerical value from one to nine. The three suits are called characters, circles and bamboos. They can be seen in figure 2.2. The remaining tiles, the honors, can be further divided into two subcategories: winds and dragons. Honor tiles are shown in figure 2.3. As opposed to suit tiles, honor tiles do not represent any numerical value.

2.2.2 Setup and game play

At the start of a Riichi Mahjong game, every player is assigned a wind and the players are seated accordingly. The seating order starts from east and continues counter-clockwise with south, west and north. A game of Riichi Mahjong consists of rounds.

Figure 2.2: Suit tiles: characters, circles and bamboos. Each suit is ordered by the tiles' numerical value from 1 (on the left) to 9 (on the right).

Figure 2.3: Honor tiles: winds (east, south, north and west) and dragons (red, white and green).

In the beginning of a round, each player builds a wall of tiles in front of them. Each wall consists of two layers, with 17 tiles in each layer. Each player starts by taking 13 tiles from the wall of tiles (cf. a deck of cards) into their hand. Players do not see other players' hands. This is the initial state of the board, which can be seen in figure 2.4. The separate part of the wall with 14 tiles is called the dead wall, and it is usually separated from the rest of the wall as shown. The tile in the wall that is face-up is related to the point calculation, and will not be further explained here.

Next, the east player begins their turn. A turn starts with the player taking a tile from the wall and adding it to their hand. The player then ends their turn by removing any one tile from their hand and placing it in their discard pool. After this, the next player's turn begins. The turn order of Mahjong proceeds counter-clockwise. Figure 2.5 depicts a mid-round state of a Mahjong game, with the players' discard pools (1), wall (2) and hands (3) marked for clarification. The dead wall (4) is a part of the wall that is not played. The face-up tile in the dead wall is called a dora indicator, and while it is relevant for players due to its effect on hand value calculations, we will not cover it any further.

Turn by turn, the players continue to draw and discard tiles, aiming to form a winning hand. The structure of a winning hand is defined later in subsection 2.2.3.

Figure 2.4: Initial state of a round of Riichi Mahjong.

The round ends when a player completes a winning hand. If no player is able to complete a winning hand by the time the wall is completely drawn (excluding the dead wall), the current round ends in a draw, and players exchange points and proceed to the next round by shuffling the tiles and building the walls as explained before. The exact rules on how points are exchanged at the end of a round are omitted, as they are unnecessary for this thesis.

The next round starts with the players' assigned winds changed counter-clockwise (i.e. the player that was east will now be north, and so on). However, if the previous round was won by the east player, the next round is played without changing the winds. This is also the case if the round ended in a draw and the east player gained points. A game of Riichi Mahjong continues like this until every player has been assigned every wind two times. This means that at least 8 rounds are played.

2.2.3 Winning hand

A complete Riichi Mahjong hand consists of four sets of three tiles and a pair of tiles, a total of 14 tiles. A set can be a straight, having three consecutive tiles of the same suit, or a three-of-a-kind, having three identical tiles. The pair can be any two identical tiles. An example hand can be seen in figure 2.6.

It should be noted that in addition to having four sets and a pair, a winning hand in Riichi Mahjong has to satisfy at least one of the possible winning conditions. These are, however, considered one of the harder parts of the rules (along with score calculation). Therefore, we will omit further explanation, as it is unnecessary from the computer vision aspect.

Figure 2.5: State of a mid-round board. (1) Discard pools. (2) Wall. (3) Hand. (4) Dead wall.

Figure 2.6: An example of a winning hand with the division into four sets and a pair visualized.

2.2.4 Calls

As previously explained (in section 2.2.3), a hand consists of four sets and a pair. In addition to drawing the tiles needed for completing sets, players can complete sets in their hand by calling the most recent tile discarded by another player. A tile can be called if it completes a set (a straight or a three-of-a-kind) in the hand of the calling player. A tile completing a straight can only be called if it was discarded by the player preceding the caller. For completing a three-of-a-kind, a tile can be called from any position. A called set is commonly referred to as an open set or a melded set.

(a) Melded straight, where the 4 of circles was the called tile (called from the player to the left of the caller). (b) Melded three-of-a-kind, where the 9 of bamboos was called from the player sitting opposite to the caller.

Figure 2.7: Examples of called sets.

When a call is made, the player making the call shows the tiles in the set and places them at the right corner of the table (from their perspective). The tile that was called is placed sideways with the following logic: if the tile was called from the player on the caller's left, it is placed as the leftmost tile of the set. If it was called from the player opposite the caller, it is placed in the middle of the set, and if it was called from the player on the caller's right, it is placed as the rightmost tile of the set. An example of a called straight is shown in figure 2.7 (a), and an example of a called three-of-a-kind is shown in figure 2.7 (b). As can be seen, the placing of the called tile is absolute. That is, even if it contradicts the order of the tiles in the set, as it does here with the 4 of circles being placed as the first tile of the set, it is still placed according to where it was called from.
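Although we do not implement rule-based reasoning in this thesis, the placement rule above reduces to a simple mapping. The following sketch is purely illustrative; the function name and string encoding are hypothetical and not part of any implementation in this thesis.

```python
def called_tile_index(call_source: str, set_size: int = 3) -> int:
    """Index (0-based, left to right) of the sideways called tile in a
    melded set, based on who discarded it relative to the caller."""
    positions = {
        "left": 0,                  # leftmost tile of the set
        "opposite": set_size // 2,  # middle tile of the set
        "right": set_size - 1,      # rightmost tile of the set
    }
    return positions[call_source]

# The melded straight of figure 2.7 (a): called from the left, so the
# 4 of circles is placed first regardless of its numerical order.
assert called_tile_index("left") == 0
```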

2.2.5 Objective

The objective of Riichi Mahjong is to accumulate as many points as possible during the rounds by completing winning hands. Players also try to avoid dealing into other players' hands, as it results in a point loss. While the objective itself is simple, it is not always clear how to achieve it.

Due to the nature of Mahjong, the outcome of a single game involves considerable chance. This results in difficulties for players aiming to improve their skills. For example, if there is a decision between two choices that are close to each other in terms of expected value (in points at the end of the game), it might not be clear which choice is better, as the variance hinders retrospective analysis. While many such problems can be solved with mathematics and a basic understanding of the game, a thorough analysis of decisions requires statistical data. Currently, many Internet Mahjong platforms, such as Tenhou [2], provide replays of players' past games. While this is a good way of analyzing the results of one's decisions and is the primary resource for modern Mahjong literature, it only covers games of Mahjong played digitally. For games played in person, there are currently no means of extracting a replay from video data. This pipeline as a whole is out of the scope of this thesis; hence, we focus only on the problem of accurately detecting Mahjong tiles from video data.

Chapter 3

Single-Shot MultiBox Detector

In this chapter we explain in detail the Single-Shot MultiBox Detector (SSD), the method we use for Mahjong tile detection in this thesis. First, we cover the structure of the detection framework. Then we explain the training process of the network. Finally, we show a comparison of the performance of SSD and other state-of-the-art object detection methods.

3.1 Model structure

The first layers of the SSD network, or the base network as it is called in [19], consist of a standard image classification architecture. The original publication uses the top layers of the VGG-16 network as the base network. Instead of using the VGG-16 classification layers, SSD adds multiple convolutional feature layers on top of the base network for multi-scale detection predictions. Other single-shot detectors at the time, such as YOLO [18], only use a single-scale feature map. [19]

Figure 3.1: Structure of the SSD model, based on the figure published in [19].

The added feature layers have a size of m×n with p channels. The number of channels, as well as the size of each layer, is depicted in figure 3.1,


Figure 3.2: Visualization of a 4x4 feature map with some default boxes of a cell visualized. Based on the figure published in [19].

which also shows the overall structure of the SSD network. Each of the m×n cells of a layer has a set of default bounding boxes. These default boxes are chosen manually, and the original publication uses 6 default boxes per feature layer cell. The box scale is determined by the feature layer according to equation 3.1, where b is the number of feature layers:

$$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{b - 1}(k - 1), \quad k \in [1, b] \qquad (3.1)$$

The default bounding box width and height are then calculated as $w_k = s_k\sqrt{a_r}$ and $h_k = s_k/\sqrt{a_r}$, where $a_r \in \{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$ are the aspect ratios of the default boxes. In addition to these 5 boxes, one more is added with the scale $s'_k = \sqrt{s_k s_{k+1}}$ and an aspect ratio of 1. Each default bounding box is centered in the middle of the corresponding feature layer cell. For a default bounding box visualization, refer to figure 3.2. [19]
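As a concrete illustration of equation 3.1 and the width/height formulas above, the sketch below computes the six default box shapes for one feature layer. The scale bounds s_min = 0.2 and s_max = 0.9 follow the original publication [19]; the code itself is our illustrative sketch, not the authors' implementation, and the handling of the last layer's extra box is an assumption.

```python
from math import sqrt

def default_box_shapes(k, b, s_min=0.2, s_max=0.9):
    """(width, height) pairs of the default boxes for feature layer k of b,
    following equation 3.1 and the aspect ratio formulas of SSD [19]."""
    scale = lambda i: s_min + (s_max - s_min) * (i - 1) / (b - 1)
    s_k = scale(k)
    boxes = [(s_k * sqrt(a_r), s_k / sqrt(a_r))
             for a_r in (1, 2, 3, 1 / 2, 1 / 3)]
    # Extra box with aspect ratio 1 and scale s'_k = sqrt(s_k * s_{k+1});
    # for the last layer we assume a next scale of 1.0.
    s_next = scale(k + 1) if k < b else 1.0
    boxes.append((sqrt(s_k * s_next),) * 2)
    return boxes  # sizes relative to the input image

print(default_box_shapes(k=1, b=6))
```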

3.2 Training

Training starts by matching default boxes to ground truths. The matching strategy of SSD starts by searching for the best-matching default box for each ground truth box. The best match is determined by the jaccard overlap. This is done similarly to MultiBox [28]. However, unlike in MultiBox, default boxes with a jaccard overlap higher than a threshold (0.5 in the original publication) are also matched. This allows multiple overlapping default boxes to be matched, in contrast to choosing only one, as is done in MultiBox. [19]
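For clarity, the sketch below spells out the jaccard overlap computation and the two-part matching rule just described (the best default box per ground truth, plus every default box above the 0.5 threshold). It is a simplified illustration, not the implementation used in this thesis; boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
def jaccard(box_a, box_b):
    """Jaccard overlap (intersection over union) of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def match_default_boxes(default_boxes, ground_truths, threshold=0.5):
    """Map default box index -> ground truth index per the SSD strategy."""
    matches = {}
    for j, gt in enumerate(ground_truths):
        overlaps = [jaccard(d, gt) for d in default_boxes]
        # Best-matching default box for this ground truth.
        matches[max(range(len(overlaps)), key=overlaps.__getitem__)] = j
        # Additionally match every default box above the threshold.
        for i, ov in enumerate(overlaps):
            if ov > threshold:
                matches[i] = j
    return matches
```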

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \qquad (3.2)$$

The SSD training loss function is based on the loss function used in MultiBox [28], with additions made to take multiple object classes into account. The loss is calculated as the weighted sum of the localization loss and the confidence loss, as depicted in equation 3.2, where N is the number of matched (positive) default boxes. If there are no matched default boxes, i.e. N = 0, the loss is defined to be 0. The confidence loss is the softmax loss (over multiple class confidences) and the localization loss is a smooth L1 loss. The calculations of these losses are shown in equations 3.3 and 3.4, respectively. [19]

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right), \text{ where}$$

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}} \qquad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}} \qquad \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}} \qquad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}} \qquad (3.3)$$

Here l is the predicted box, g is the ground truth box, and d is the default bounding box, with (cx, cy) being its center and w and h its width and height.

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad \text{where } \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_p \exp(c_i^{p})} \qquad (3.4)$$
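A minimal PyTorch sketch of equations 3.2-3.4 is given below, assuming the matching step has already produced the predicted offsets, the encoded ground truth offsets and the per-box class targets (label 0 being background); these tensor names are our own, and hard negative mining is deferred to the next sketch.

```python
import torch
import torch.nn.functional as F

def ssd_loss(loc_pred, loc_target, conf_pred, conf_target, alpha=1.0):
    """Equation 3.2: weighted sum of the smooth L1 localization loss over
    positive boxes (eq. 3.3) and the softmax confidence loss (eq. 3.4)."""
    pos = conf_target > 0                 # positive (matched) default boxes
    n = pos.sum()
    if n == 0:
        return torch.zeros(())            # loss is defined as 0 when N = 0
    l_loc = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction="sum")
    l_conf = F.cross_entropy(conf_pred, conf_target, reduction="sum")
    return (l_conf + alpha * l_loc) / n
```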

The matching step is followed by hard negative mining. After the matching step, the default boxes are mainly negative (unmatched). To balance the number of positive and negative default boxes, some of the negative boxes need to be ignored in the learning process. The negative boxes that are to be part of the learning process are chosen by picking the boxes with the highest confidence loss, so that the ratio of negative to positive boxes is at most 3:1. This stabilizes the training and speeds up the optimization process. [19]

To artificially increase the diversity of the training data, training images are augmented by randomly performing one of the following methods: using the input image as-is, cropping the image with a varying minimum jaccard overlap, or cropping the image randomly. In the case of augmentation (i.e. not using the input image unchanged), the crop is stretched to match the input resolution. With a 50% probability the crop is flipped horizontally (i.e. mirrored). In addition to these, the augmentation step includes randomly adding lighting noise and manipulating photometric properties such as the brightness, contrast and hue of the image. [19]
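Returning to the hard negative mining step, the selection can be sketched as follows, continuing the hypothetical tensors of the previous sketch; only the hardest negatives are kept, up to a 3:1 negative-to-positive ratio.

```python
import torch.nn.functional as F

def hard_negative_mask(conf_pred, conf_target, neg_pos_ratio=3):
    """Boolean mask keeping all positive boxes plus the negatives with the
    highest confidence loss, at most neg_pos_ratio negatives per positive."""
    pos = conf_target > 0
    loss = F.cross_entropy(conf_pred, conf_target, reduction="none")
    loss[pos] = 0.0                              # rank only the negatives
    num_neg = int(neg_pos_ratio * pos.sum().item())
    hardest = loss.argsort(descending=True)[:num_neg]
    mask = pos.clone()
    mask[hardest] = True
    return mask
```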

3.3 Performance on standard benchmarks

Object detection methods are usually compared by their mean average precision (mAP) on various datasets. Mean average precision is the mean of the per-class interpolated average precision (AP). To calculate the AP for a single class, a precision-recall (PR) curve is constructed from the detections. This is done by first mapping detections to the ground truths that they overlap most with, as long as the overlap is over a preset threshold. For VOC2007, the overlap threshold is defined as 50% intersection over union (IoU), meaning the area of overlap divided by the area of union [6]. For each ground truth box, however, only the detection with the highest confidence is mapped. These mapped detections are considered true positives (TP), while the rest of the detections are all false positives (FP). Precision and recall are then cumulatively calculated for each detection as shown in equation 3.5, starting from the highest-scored detection. The PR curve is then created by plotting these recall and precision pairs. To make the curve monotonically decreasing, the precision at every point of the curve is set to the maximum of itself and all the lower-confidence precisions. This interpolation of precision/recall points is visualized in figure 3.3. Here, the dashed lines show the maximum precision for each false positive point. Finally, the AP of a class is the area under the PR curve. [29]

$$\text{Precision} = \frac{\text{True positives}}{\text{All detections}} \qquad \text{Recall} = \frac{\text{True positives}}{\text{Ground truths}} \qquad (3.5)$$

It should be noted that in the VOC2007 challenge, the mAP was to be calculated using an approximation method instead of the exact method explained above [6]. However, in the VOC2012 challenge the mAP calculation is done using the exact method [30]. In this thesis, all mAPs are calculated using the exact method unless stated otherwise.
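The exact AP computation described above can be sketched as follows. Here the detections of a single class are assumed to be already mapped to true/false positives and sorted by descending confidence, with num_gt the number of ground truth boxes of that class.

```python
def average_precision(is_true_positive, num_gt):
    """Exact interpolated AP: area under the monotonically decreasing
    precision-recall curve built from equation 3.5."""
    precisions, recalls, tp = [], [], 0
    for i, hit in enumerate(is_true_positive, start=1):
        tp += int(hit)
        precisions.append(tp / i)       # true positives / all detections
        recalls.append(tp / num_gt)     # true positives / ground truths
    # Interpolation: precision at each recall = max precision to the right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

print(average_precision([True, True, False, True], num_gt=5))  # 0.55
```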

Figure 3.3: Visualization of the precision/recall (PR) curve interpolation (adapted from [29]).

In table 3.1, the performances of multiple recent object detection methods are presented. The mAPs presented here are calculated using the VOC2007 approximation method. Here, SSD300 refers to SSD using an input size of 300x300 pixels, and SSD512 similarly refers to SSD with a 512x512 input size. As can be seen, both SSD300 and SSD512 outperform the other methods in terms of mAP. As for FPS, the only method outperforming SSD300 is Fast YOLO with its 155 FPS. However, Fast YOLO trails the mAP of the other methods by a great margin.

Method                      mAP    FPS
Faster R-CNN (VGG16) [17]   73.2   7
Fast YOLO [18]              52.7   155
YOLO (VGG16) [18]           66.4   21
SSD300 [19]                 74.3   46
SSD512 [19]                 76.8   19

Table 3.1: Performance of SSD compared to other object detection methods on the VOC2007 dataset (adapted from [19]).

Chapter 4

Detection of Mahjong tiles

In this chapter we present the implementation of SSD using synthetic training data to detect Mahjong tiles. First, we will explain how the synthetic data was generated. Then, we discuss the SSD implementation.

4.1 Synthetic data generation

In this section, we explain the synthetic data generation process step by step. The motivation behind generating synthetic training data is to eliminate the process of manually creating training data by labeling tiles from hundreds of images. The objective of the generator is to produce images similar to video data of a Mahjong game. A cropped example of the video data that was used as a target for the generation can be seen in figure 4.1. This video data was also used in the experiments presented later. It should be noted that the partial views in the corners of the image can be ignored: they are separate cameras for the hand views in the original video data, but we limit the scope of this thesis to the top view visible in the middle. The synthetic data generator was implemented using the Unity Engine¹.

4.1.1 Tiles

The tile 3D model was created with the free open-source 3D modeling software Blender². The tile model was made according to the tiles in the video data and has the same dimensions. The tile model with and without textures is shown in figure 4.2.

¹ Unity Engine v2018.3.6 (https://unity3d.com/)
² Blender v2.79b (https://blender.org/)


Figure 4.1: Example crop of the video data from which Mahjong tiles are to be detected.

(a) Tile model. (b) Tile model after texturing.

Figure 4.2: Tile model (a) without textures (in Blender), and (b) with textures (in Unity).

Figure 4.3: Tile set used by the generator.

Face textures for the tiles were extracted from an image of the same tile set that was used in the video data. Figure 4.3 shows the extracted tile images. There are tile image sets available for free to some extent, but to prevent possible trouble with the learning process, we decided to use an identical tile set for the training data generation. However, for a more generic approach, using multiple tile sets should be considered.

4.1.2 Tile and camera positioning

Tiles are positioned similarly to an ordinary game of Mahjong to ensure a correct learning process. We place four walls of tiles and four discard pools with varying amounts of tiles. The tile positions in the discard pools and walls are slightly varied by adding random noise to their XZ positions, moving tiles slightly on the board. This is because the discard pools and walls are rarely evenly placed. Furthermore, to ensure a sufficient density of face-up tiles in the wall for the learning process (to correctly detect the dora indicator), we flip 1/3 of the top tiles in the walls.

The rendering is done using a perspective camera, which is placed on top of the board looking downwards with a varying distance from the board. The camera is then randomly rotated to look at the board from a slight angle, giving us varying perspectives. The XZ position is also randomized so that the position of the walls and discard pools in the image varies. An example render using the tile and camera positioning just explained is shown in figure 4.5.

As can be seen from the example render, before rendering the image, partial tiles that collide with the camera view frustum (and are therefore clipped by the image border) are removed. This is to prevent errors in the learning process that could occur because some tiles are similar or even identical to parts of other tiles. An example of this kind of similarity are the one, two and three of characters, shown in figure 4.4. Should the three of characters be clipped by the image border, it would not be that dissimilar from the one and two of characters.

Figure 4.4: Example of tile similarity, where a tile is contained in another tile.

Figure 4.5: Example render with perspective from the camera angle and unevenly placed tiles.

4.1.3 Background

The texture for the board, the background of the render, is randomly chosen from the VOC2007 dataset [6]. This ensures that the structure of the texture is not overfitted during learning, and should therefore allow us to use video data with any type of fabric or table cloth under the tiles. An example render using a randomly chosen background image (with lighting and shadows, which are introduced later) can be seen in figure 4.6.

4.1.4 Lighting and shadows

The scene is mainly lit by a directional light. To soften the rather unrealistic-looking shadows cast by the single directional light, 2-3 point lights are randomly added above the table to create varying lighting. The colors of all lights in the scene vary randomly, using colors often used for indoor lights: white, yellow and orange.

Figure 4.6: Example render with lighting, shadows and a randomly chosen background from VOC2007 dataset.

To further increase the diversity of lighting in the render, we added invisible cylinders that are randomly sized and placed around the table. These cylinders are invisible to the camera, but still cast shadows if placed between a light source and a renderable object (e.g. tiles). An example render with the explained lighting and shadows present is shown in figure 4.6.

4.1.5 Post processing

Post processing is the most important part of our training data generation. The post processing stack used in the generation process consists of anti-aliasing, ambient occlusion, blur, grain noise, post exposure, hue variation and bloom. By combining these post processing effects with randomly chosen parameters (from manually set intervals), a great increase in the diversity of the resulting renders is achieved. For a further increase in diversity, we only choose a subset of these effects for each render. However, since without anti-aliasing the resulting render has significant sawtooth artifacts that could affect the learning process, it is always used. For an example image generated by the training data generator, see figure 4.7.

Figure 4.7: An example render from our training data generator.

4.1.6 Bounding boxes

The bounding box of a tile is determined by the minimum rectangle (in screen space) that contains all of the tile's top 4 vertices. That is, the bounding box is made just large enough to cover the front face of the tile. This is the smallest bounding box that still includes the tile's information. In figure 4.8, the tiles' bounding boxes are visualized (in green).
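The bounding box computation can be sketched as follows. The generator itself is written in C# inside Unity, so this Python sketch only illustrates the logic; project_to_screen is a hypothetical stand-in for Unity's world-to-screen-point projection.

```python
def tile_bounding_box(top_vertices_world, project_to_screen):
    """Minimum screen-space rectangle containing the tile's top 4 vertices.

    top_vertices_world: the four 3D corner points of the tile's top face.
    project_to_screen:  maps a 3D world point to (x, y) pixel coordinates.
    """
    points = [project_to_screen(v) for v in top_vertices_world]
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)  # (x1, y1, x2, y2)
```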

4.2 SSD implementation

The Single-Shot MultiBox Detector (SSD) was implemented on top of the GitHub user amdegroot's SSD implementation [31]. This implementation achieved the results of the original SSD publication and had a considerable amount of users and feedback, making it a trustworthy base for our implementation. The implementation uses Python 3 with the open source deep learning platform PyTorch [32]. Similarly to the original SSD publication, VGG-16 is used as the base network.

The amdegroot implementation only supports SSD300, the version of the algorithm with an input image resolution of 300x300 pixels. However, since the tiles are barely recognizable with the human eye at a 300x300 resolution, we changed the implementation to the version of the algorithm with a 512x512 input image resolution (SSD512). This version of the algorithm is slower than SSD300 (22 FPS compared to 59 FPS using a GeForce GTX Titan X [19]), but this difference is irrelevant for the purposes of this thesis.

Figure 4.8: Example render with tiles’ bounding boxes visualized.

SSD512 also achieves slightly better results: 76.8 mAP on the VOC2007 dataset, compared to the 74.3 mAP of SSD300 [19]. The amdegroot implementation had hardcoded support for selected datasets only (COCO, VOC). Therefore, support for a custom dataset of our synthetic data was added.

As mentioned in earlier chapters, the SSD algorithm includes augmentation of input images to increase the diversity of the training data [19]. This is useful and important when training data is scarce, as it often is in real-world applications. However, for our rather specific use case with a virtually infinite amount of training data, these augmentations (such as cropping and scaling) turned out to significantly reduce the accuracy of detections in our early empirical tests. Therefore, all the augmentations were removed from the algorithm. Instead, the diversity of the training data is achieved with the extensive functionality of our training data generator, with its varying post processing, lighting and camera positioning.

Chapter 5

Experiments

In this chapter, we explain our experimental setup and present the results of our experiments. First, we explain how the SSD network was trained on the synthetic training data. Then, we present the results on the synthetic validation data and the results of our empirical real data experiments. Finally, we present the results of an additional experiment with training data consisting of both synthetic and real data.

5.1 Training

The training of the SSD network was done using an NVIDIA Quadro P5000 GPU. The initial learning rate was set to 2.5 × 10⁻⁴, the batch size to 8 and the weight decay to 5 × 10⁻⁴. Stochastic gradient descent (SGD) with momentum was used for training, with the momentum parameter set to 0.9.

The training data consists of 3000 images generated synthetically by our training data generator. An example of our training data can be seen in figure 5.1. Here, common abbreviations for the tiles are used: characters are abbreviated as m, circles as p and bamboos as s. Honor tiles are abbreviated as z, with the numerical value in [1, 4] for winds starting from the east wind, and 5, 6 and 7 for the white, green and red dragon, respectively. Face-down tiles are abbreviated as FD.
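In PyTorch, the stated hyperparameters correspond to the following optimizer configuration (a sketch; the placeholder module stands in for the SSD512 network):

```python
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(1, 1)   # placeholder module; in reality the SSD512 network
optimizer = optim.SGD(
    net.parameters(),
    lr=2.5e-4,          # initial learning rate
    momentum=0.9,       # momentum parameter
    weight_decay=5e-4,  # weight decay
)
# The training loop then feeds batches of 8 synthetic images to the network.
```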

5.2 Synthetic validation performance

The synthetic validation data was generated in the same manner as the training data, and was not used for the training in any way. The network was tested on the validation data every 10k iterations.


Figure 5.1: An example training data image with bounding boxes and classes visualized.

This was done until 50k iterations, after which the average loss over the last 10k iterations no longer changed remarkably compared to the previous 10k iterations. As can be seen in table 5.1, on 1500 images of validation data, the trained SSD512 network achieves 0.995 mAP after 30k training iterations. For each class, the bolded entries represent the best AP result(s) for the class in question. While the network recognizes most of the tiles extremely well, it sometimes has trouble recognizing character tiles. Of the character tiles, especially the similar 1, 2 and 3 of characters are occasionally mixed up.

An example image (also included in the validation data) where a mix-up of character tiles happens is shown in figure 5.2. The detections to visualize were chosen in the following manner: detections with less than 0.3 confidence are scrapped. If a detection does not overlap with any other detection (with a 0.5 overlap threshold), it is chosen. If it does overlap with other detections, it is only chosen if it has the highest confidence among the detections it overlaps with. Correct detections are shown in green, with classes and confidences omitted for clarity. There are two detections with their respective classes and confidences shown in red; both incorrectly detected the 3 of characters (m3) instead of the 2 of characters (m2).
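The selection logic above amounts to a greedy, class-agnostic non-maximum-suppression-style filter. A sketch, with detections given as (box, class, confidence) tuples and reusing the jaccard function from the sketch in section 3.2:

```python
def select_detections(detections, conf_threshold=0.3, overlap_threshold=0.5):
    """Drop detections below the confidence threshold; among mutually
    overlapping detections keep only the most confident one."""
    kept = []
    for box, cls, conf in sorted(detections, key=lambda d: d[2], reverse=True):
        if conf < conf_threshold:
            break  # detections are sorted, so the rest are below threshold
        overlaps = (jaccard(box, kept_box) for kept_box, _, _ in kept)
        if all(ov < overlap_threshold for ov in overlaps):
            kept.append((box, cls, conf))
    return kept
```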

Iter  10k    20k    30k    40k    50k
mAP   0.987  0.959  0.995  0.991  0.991
m1    0.987  0.941  0.991  0.983  0.983
m2    0.947  0.907  0.977  0.970  0.968
m3    0.929  0.848  0.963  0.951  0.947
m4    0.975  0.941  0.990  0.983  0.982
m5    0.941  0.928  0.983  0.973  0.972
m6    0.944  0.915  0.986  0.972  0.970
m7    0.992  0.951  0.993  0.991  0.990
m8    0.980  0.929  0.995  0.988  0.986
m9    0.970  0.940  0.984  0.983  0.983
p1    1.000  0.980  1.000  0.999  0.997
p2    0.997  0.982  1.000  0.997  0.997
p3    0.999  0.979  1.000  0.999  0.999
p4    0.999  0.979  1.000  0.998  0.998
p5    0.994  0.979  0.995  0.994  0.994
p6    0.999  0.990  1.000  0.999  0.998
p7    0.995  0.975  0.998  0.996  0.996
p8    0.999  0.987  1.000  0.999  0.999
p9    1.000  0.985  1.000  0.999  0.999
s1    0.997  0.970  0.999  0.997  0.995
s2    0.997  0.976  0.999  0.995  0.994
s3    0.996  0.942  0.999  0.996  0.996
s4    0.998  0.958  1.000  0.997  0.996
s5    0.991  0.975  0.998  0.997  0.997
s6    0.994  0.963  0.996  0.995  0.993
s7    0.996  0.976  0.998  0.997  0.997
s8    0.993  0.971  0.998  0.992  0.992
s9    0.996  0.961  1.000  0.995  0.994
z1    0.990  0.960  0.993  0.993  0.994
z2    0.989  0.960  0.998  0.994  0.993
z3    0.998  0.966  0.999  0.994  0.994
z4    0.990  0.967  0.996  0.991  0.991
z5    0.993  0.955  0.998  0.990  0.990
z6    0.993  0.970  0.999  0.997  0.996
z7    0.999  0.969  0.999  0.998  0.998
FD    0.999  0.971  1.000  0.999  0.998

Table 5.1: Results of the SSD network on the validation data every 10k training iterations. CHAPTER 5. EXPERIMENTS 36

Figure 5.2: Example result of a validation data image, with correct detections visualized in green and wrong detections in red.

5.3 Real data performance

In this section, we evaluate the performance of our trained SSD512 network empirically. In figure 5.3, one example frame of a video (including detections) is shown. The same detection choosing/scrapping logic was used here as in the validation data visualization. Correct detections are in green, while incorrect detections are in red. For clarity, classes and confidences are omitted for correct detections. As can be seen, in addition to the character tiles that were occasionally detected incorrectly in the validation data, some tiles are mixed up in an unexpected way. For example, the green dragon (z6) is rather often incorrectly detected as the south wind (z2) or the 1 of bamboos (s1). Furthermore, as visible in the figure, some detections appear in the middle of the table. These kinds of detection errors were frequent during our empirical experiments.

It should be noted that the detections are based only on the single frame currently being processed. That is, there is no algorithm that would, for example, determine the detection from the average of detections at a location during a time window of the last n frames. Neither is there an algorithm that takes the Mahjong rules into account, for example the fact that there are only four copies of each tile. While these kinds of additions would undeniably increase the accuracy of detections from video data, they have been left out due to the scope of this thesis.

Figure 5.3: An example frame of video data, with correct detections visual- ized in green and wrong detections in red.

Since the video data was shot with cheap web cameras (Logitech C270), the video quality is relatively poor. Therefore, we also evaluated the performance on pictures taken with a modern phone camera. An example picture is shown in figure 5.4. However, the detection errors persisted even with these better-quality images.

5.4 Combining real and synthetic training data

Since the network did not perform as well as expected on the real data, an additional experiment was carried out. In this experiment, the training data consists of 1000 synthetic images and 35 manually labeled real images.

Figure 5.4: Example result for an image of better quality.

The SSD network was trained similarly to the network in the previous experiments, but due to the scarce amount of real data images in comparison to synthetic images (only ~3.5% of the whole training data), some data augmentation was applied to the real data during the training. Each time one of the real training images was used, it was randomly rotated by 0, 90, 180 or 270 degrees. In addition, with a 0.5 probability a crop of 90% of the image was used instead of the full image. In the case of a crop, bounding boxes were disregarded if their jaccard overlap with the crop area was less than 0.9. The crop sizing of 90% is used to obtain images with tiles in different positions of the image without excessively altering the tile size.

We evaluated the network with 11 manually labeled real data images that were not part of the training data. The results for various versions of the network during the training are shown in table 5.2. As can be seen, the performance of the network on real data improved significantly with the introduction of hybrid training data, even when the ratio of real to synthetic data was only 35:1000. Even though the results cover only 11 test images, it is clear that the performance is much better than when the network was trained with only synthetic images in the previously presented experiment. The best mAP (0.972) was achieved after 20k iterations. After this the mAP did not increase and the network began overfitting.

In figure 5.5 an example from our validation set is shown. Incorrect detections are shown in red, and correct detections in green. The classes and confidences of the correct detections are omitted for clarity, but they were mainly near 1.00, with the exception of the lower character tiles having slightly lower confidences.
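The augmentation of the real images described above can be sketched as follows. We interpret the overlap test as the fraction of each box's area lying inside the crop (the thesis text calls it jaccard overlap); the box coordinate bookkeeping after rotation and cropping is omitted for brevity.

```python
import random
import numpy as np

def box_fraction_inside(box, crop):
    """Fraction of the box's area that lies inside the crop rectangle."""
    ix1, iy1 = max(box[0], crop[0]), max(box[1], crop[1])
    ix2, iy2 = min(box[2], crop[2]), min(box[3], crop[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / ((box[2] - box[0]) * (box[3] - box[1]))

def augment_real_image(image, boxes):
    """Random 90-degree rotation; with 0.5 probability a 90% crop that
    drops boxes overlapping the crop area less than 0.9."""
    image = np.rot90(image, random.choice([0, 1, 2, 3]))
    if random.random() < 0.5:
        h, w = image.shape[:2]
        ch, cw = int(0.9 * h), int(0.9 * w)
        y0, x0 = random.randint(0, h - ch), random.randint(0, w - cw)
        image = image[y0:y0 + ch, x0:x0 + cw]
        crop = (x0, y0, x0 + cw, y0 + ch)
        boxes = [b for b in boxes if box_fraction_inside(b, crop) >= 0.9]
    return image, boxes
```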

Iter  5k     10k    15k    20k
mAP   0.936  0.964  0.970  0.972
m1    0.978  1.000  1.000  1.000
m2    0.846  0.913  0.946  0.950
m3    0.529  0.753  0.811  0.781
m4    0.842  0.925  0.933  0.992
m5    0.729  0.824  0.873  0.875
m6    0.842  0.933  0.996  0.996
m7    0.935  0.966  0.966  0.967
m8    1.000  1.000  1.000  1.000
m9    0.850  0.950  0.950  0.992
p1    0.962  0.962  0.962  0.962
p2    1.000  1.000  0.995  0.979
p3    1.000  1.000  1.000  1.000
p4    1.000  1.000  1.000  1.000
p5    1.000  1.000  1.000  1.000
p6    0.860  0.923  0.938  0.946
p7    0.974  0.994  0.994  0.994
p8    0.997  1.000  1.000  1.000
p9    1.000  1.000  1.000  1.000
s1    0.973  0.967  0.963  0.965
s2    0.905  0.905  0.905  0.905
s3    0.947  0.956  0.962  0.962
s4    0.977  0.968  0.962  0.965
s5    0.902  0.921  0.923  0.914
s6    1.000  1.000  1.000  1.000
s7    0.998  1.000  0.994  1.000
s8    0.994  1.000  1.000  1.000
s9    0.994  1.000  0.998  0.998
z1    0.854  0.957  0.961  0.963
z2    0.938  0.952  0.943  0.954
z3    1.000  1.000  1.000  1.000
z4    0.953  0.968  0.969  0.969
z5    1.000  1.000  1.000  1.000
z6    0.996  0.997  0.999  0.999
z7    1.000  1.000  1.000  1.000
FD    0.993  0.993  0.993  0.998

Table 5.2: Results of the SSD network on the real validation data every 5k training iterations. CHAPTER 5. EXPERIMENTS 41

Figure 5.5: Example image from real validation data, with correct detections shown in green and incorrect detections in red.

Chapter 6

Discussion

In this chapter, we will further evaluate the results in detail, and discuss possible future improvements to Mahjong tile detection.

6.1 Result analysis

In the experiments on synthetic validation data, a mAP of 0.995 was achieved. This shows that accurate Mahjong tile detection is definitely achievable using current state-of-the-art object detection methods. Our empirical experiments on real video data, however, did not show nearly as good performance as the synthetic validation data experiments did. There are many possible reasons for the difference in performance between these two experiments, but the main reason is likely that our synthetic training data did not resemble the real data well enough. However, since the network version that achieved this result also had the best performance on real data, it is reasonable to assume that moderate changes to the synthetic data generation could lead to a significant improvement of performance on real data.

In our additional experiment, introducing 35 real images as part of the otherwise synthetic training data, a mAP of ~0.97 was achieved over 11 real evaluation images. While especially the lower character tiles were still occasionally mixed up, the overall change in performance with the addition of real images was significant. There were also false positive detections in the middle of the table, similarly to when the network was trained solely on synthetic data. These FPs are most probably due to the scarce amount of real images compared to the amount of synthetic images, and should disappear by increasing the ratio of real images to synthetic images.


6.2 Synthetic training data

It has become apparent that generating synthetic training data that resembles the real data accurately enough is a challenging problem in its own right. While it is undeniably a worthwhile approach when training networks that require amounts of training data that would otherwise be difficult and expensive to acquire, the effort required made it inconvenient for the detection of Mahjong tiles. Presumably, the time required to manually create a set of training and validation data would have been less than the time used in developing the synthetic data generator for this thesis. However, once the synthesis system is built, it may be easier to add further features to this system than to manually label new images with the desired features into an existing dataset, for example when generalizing Mahjong tile detection to arbitrary tile and board styles.

6.3 SSD for Mahjong tile detection

The performance of SSD on datasets with a large number of complex classes (such as VOC and COCO) is remarkable. Furthermore, for our specific and rather restricted use case of detecting Mahjong tiles, SSD produces detections with desirable accuracy as long as it is given sufficient training data. Even though the performance on real data was not as good as on synthetic data, the impressive accuracy on the synthetic validation data itself makes it plausible to assume that the desired accuracy could also be achieved on real data using SSD.

6.4 Future work

Since the results on synthetic validation data are remarkably accurate, it is reasonable to assume that by improving the synthetic data generation to correspond to the real data better, the results on real data would similarly improve. Another improvement would be to introduce more labeled real data as part of the training data, as even a small amount led to a significant improvement in detection accuracy.

In this thesis we use a specific set of Mahjong tiles. In the future, several tile designs of each tile could be included in the training data, similarly to how different breeds of dogs or cats are included under the same class in traditional datasets. This could result in general detection of Mahjong tiles regardless of the tile set used, provided it would not have a negative effect on the average precision.

As for the process of digitalizing Mahjong games from video data as a whole, significant improvements could be made by applying the rules of Mahjong as an error correction mechanism to verify and eliminate detections. For example, when detecting tiles that are in called sets, many incorrect detections can be eliminated due to the rules regarding calling. Currently, wrong detections are picked based on the confidence alone, even if there is a less confident correct detection. If the more confident incorrect detections can be eliminated, the less confident correct detection will be picked. In addition to the call rules, one could use the fact that only the latest tile in a discard pool can be called, and therefore the rest of the tiles will stay in the discard pool until the end of the round.

In this thesis, the video data was evaluated frame by frame, only taking the current frame into account. Further error correction could be achieved by also using information from frames near the current frame. For example, averaging over a window of previous frames could improve the accuracy of detections.

Chapter 7

Chapter 7

Conclusions

In this thesis we aimed for accurate detection of Mahjong tiles from video data. We designed and implemented an image generation system for synthetic Mahjong tile images as an alternative to manually labeling images for training data. We trained the SSD object detector using these synthetic images and evaluated its performance on both synthetic and real data. The network performed remarkably well on synthetic validation data, but did not achieve desirable accuracy on real video data. Therefore, we carried out an additional experiment in which the SSD detector was trained with data consisting of both synthetic and real images. Training with 1000 synthetic images and only 35 real images resulted in a network that achieved reasonable accuracy in Mahjong tile detection.

From the results of the experiments, it can be concluded that by introducing error correction logic and/or by adding more real images to the training data, sufficient accuracy for extracting the game state and producing a digital replay could be achieved. In the future, possibly by improving the synthetic data generation further, generalized Mahjong tile detection for various tile and board styles could be achieved.
