Multi-Label Classification for Automatic Tag Prediction in the Context of Programming Challenges
Bianca Iancu, Delft University of Technology, Delft, South Holland ([email protected])
Gabriele Mazzola, Delft University of Technology, Delft, South Holland ([email protected])
Kyriakos Psarakis, Delft University of Technology, Delft, South Holland ([email protected])
Panagiotis Soilis, Delft University of Technology, Delft, South Holland ([email protected])

arXiv:1911.12224v1 [cs.LG] 27 Nov 2019

ABSTRACT
One of the best ways for developers to test and improve their skills in a fun and challenging way is through programming challenges, offered by a plethora of websites. For the inexperienced, some of the problems might appear too challenging, requiring some suggestions to implement a solution. On the other hand, tagging problems can be a tedious task for problem creators. In this paper, we focus on automating the task of tagging a programming challenge description using machine and deep learning methods. We observe that the deep learning methods implemented outperform well-known IR approaches such as tf-idf, thus providing a starting point for further research on the task.

1 INTRODUCTION
In recent years, more and more people have started to show interest in competitive programming, either for purely educational purposes or for preparing for job interviews. Following this trend, we can also see an increase in the number of online platforms providing services such as programming challenges or programming tutorials. These problems are based on a limited number of strategies to be employed. Understanding which strategy to apply to which problem is the key to designing and implementing a solution. However, it is often difficult to directly infer from the textual problem statement the strategy to be used in the solution, especially with little or no programming experience. Thus, assisting the programmer by employing a system that can recommend possible strategies could prove very useful, efficient and educational.

To this end, we propose a system to automatically tag a programming problem given its textual description. These tags can be either generic, such as 'Math', or more specific, such as 'Dynamic Programming' or 'Brute Force'. Each problem can have multiple tags associated with it, thus this research focuses on a multi-class multi-label classification problem. This is a more challenging problem than the usual multi-class classification, where each data point can only have one label attached. An example of a data sample for our problem is shown in Figure 1. A similar problem would be to predict these tags based on the source code of a solution, but this would restrict the application domain. To be more specific, we are interested in applying the system in the context of online programming challenge platforms where no solution or source code is available. Thus, we only consider the textual description of the problem statements as input to our system.

Figure 1: Data sample example

By gathering data from two of the main online programming challenge platforms, Topcoder¹ and CodeForces², we approach the problem through both machine learning and deep learning solutions, experimenting with different architectures and data representations. Considering the complexity of this problem, which is difficult even for humans, our hypothesis is that deep learning methods should be an effective way of approaching this problem, given enough data. Based on the aforementioned, the research question that we are trying to answer is the following: "Could deep learning models learn and understand what are the strategies to be employed in a programming challenge, given only the textual problem statement?"

The rest of the paper is organized as follows: in Section 2 we describe the related work carried out in the literature, regarding both multi-label classification and text representation methods. Subsequently, in Section 3 we describe the process of gathering, understanding and pre-processing the data, while in Section 4 we present the data representation methods and the models that we employ. Following that, in Section 5 we discuss the experimental setup and present the results, followed by a discussion and reflection regarding those in Section 6. Finally, we conclude our research in Section 7.

¹ https://www.topcoder.com/
² https://codeforces.com

2 RELATED WORK

Multi-label classification
As far as we are aware, no previous work has been carried out regarding multi-label classification in the context of tagging programming problem statements. Additionally, most of the papers tackling this task employ traditional machine learning methods rather than deep learning approaches. In [21] the authors propose a multi-label lazy learning approach called ML-kNN, based on the traditional k-nearest neighbor algorithm. Moreover, in [9] the text classification problem is approached, which consists of assigning a text document to one or more topic categories. The paper employed a Bayesian approach to multi-class, multi-label document classification in which the multiple classes of a document were represented by a mixture model. Additionally, in [6] the authors modeled the problem of automated tag suggestion as a multi-label text classification task in the context of "Tag Recommendation in Social Bookmark Systems". The proposed method was built using Binary Relevance (BR) and a naive Bayes classifier as the base learner, which were then evaluated using the F-measure.

Furthermore, a new method based on the nearest-neighbor technique was proposed in [5]. More specifically, the multi-label categorical K-nearest neighbor approach was proposed for classifying risk factors reported in SEC form 10-K. This is an annually filed report published by US companies 90 days after the end of the fiscal year.

A different strategy for approaching the multi-label classification problem proposed in the literature is active learning, examined in [2]. Furthermore, in [18] the authors proposed a multi-label active learning approach for text classification based on applying a Support Vector Machine, followed by a Logistic Regression.

Regarding deep learning approaches, in [11] the authors analyzed the limitations of BP-MLL, a neural network (NN) architecture aiming at minimizing the pairwise ranking error. Additionally, they proposed replacing the ranking loss minimization with the cross-entropy error function and demonstrated that neural networks can efficiently be applied to the multi-label text classification setting. By using simple neural network models and employing techniques such as rectified linear units, dropout and AdaGrad, their work outperformed state-of-the-art approaches for this task. Furthermore, the research carried out in [8] analyzed the task of extreme multi-label text classification (XMTC). This refers to assigning the most relevant labels to documents, where the labels can be chosen from an extremely large label collection that could even reach a size of millions. The authors in [8] presented the first deep learning approach to XMTC, using a Convolutional Neural Network (CNN). The authors showed that the proposed CNN successfully scaled to the largest datasets used in the experiments, while consistently producing the best or second-best results on all the datasets.

When it comes to evaluation metrics in the multi-label classification scenario, a widely employed metric in the literature is the Hamming loss [6][13][16][22]. Furthermore, more traditional metrics are also used, such as the F-measure [6][13][16] and the average precision [16][21].

Text representation
Apart from the aforementioned, a lot of work in the literature has gone into experimenting with different ways of representing text. According to [17], the word representation that has traditionally been used in the majority of supervised Natural Language Processing (NLP) applications is one-hot encoding. This term refers to the process of encoding each word into a binary vector representation where the dimensions of the vector represent all the unique words included in the corpus vocabulary. While this representation is simple and robust [10], it does not encode any notion of similarity between words. For instance, the word "airplane" is equally different from the word "university" as from the word "student". The proposal of Word2Vec [10] solves this issue by encoding words into a continuous lower-dimensional vector space where words with similar semantics and syntax are grouped close to each other. This led to the proposal of more types of text embeddings, such as Doc2Vec [7], which encodes larger sequences of text, such as paragraphs, sentences, and documents, into a single vector rather than multiple vectors, i.e. one per word. One of the more recent language models which produced state-of-the-art results in several NLP tasks is the Bidirectional Encoder Representations from Transformers (BERT) [1]. Its main innovation comes from applying bidirectional training of an attention model, the Transformer, to language modeling.

3 DATA
In this section, we describe the steps carried out to obtain the dataset that we have worked with. More specifically, we focus on presenting the data source and the data collection process, followed by an overview of the preprocessing that we perform. Additionally, we show several descriptive statistics regarding the data and explain the steps we take for

to increase classification performance on textual data, since they do not provide any additional information and only increase the dimensionality of the problem [15]. Following the collection process, we merge the two crawled datasets and preprocess them.

At first, HTML tags (<.*?>) are removed, since they do not provide any descriptive information about the problem. Afterward, since mathematical definitions are in LaTeX format on CodeForces and as plain text on TopCoder, we decide to remove them from both to avoid introducing differences between the two sets of data.
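The preprocessing described in Section 3 (stripping HTML tags with a non-greedy pattern such as <.*?> and removing mathematical notation) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the LaTeX-removal and whitespace-collapsing patterns are assumptions, since the paper does not list its exact expressions.

```python
import re

def preprocess_statement(text: str) -> str:
    """Clean a crawled problem statement before feeding it to a model.

    Sketch of the steps described in Section 3; the LaTeX pattern is an
    assumption, as the paper does not give its exact expression.
    """
    # Remove HTML tags with the non-greedy pattern <.*?>
    text = re.sub(r"<.*?>", " ", text)
    # Remove inline LaTeX math such as $n \le 10^5$ (assumed pattern)
    text = re.sub(r"\$[^$]*\$", " ", text)
    # Collapse the repeated whitespace left behind by the removals
    text = re.sub(r"\s+", " ", text).strip()
    return text

sample = "<p>Given $n \\le 10^5$ integers, find the <b>maximum</b> sum.</p>"
print(preprocess_statement(sample))  # Given integers, find the maximum sum.
```

Removing the math on both platforms, as the paper notes, keeps the CodeForces and TopCoder texts comparable rather than letting one platform's notation leak in as extra features.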
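Since the paper frames tagging as a multi-class multi-label task and its related work highlights the Hamming loss and the F-measure as the usual evaluation metrics, a minimal sketch of how multi-label tag targets can be encoded and scored is shown below. The use of scikit-learn and the toy tag sets are illustrative assumptions; the paper does not specify its tooling.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import hamming_loss, f1_score

# Each problem may carry several tags at once (multi-label).
true_tags = [["math", "dp"], ["greedy"], ["math", "brute force"]]
pred_tags = [["math"], ["greedy"], ["math", "brute force"]]

mlb = MultiLabelBinarizer()
y_true = mlb.fit_transform(true_tags)   # binary indicator matrix
y_pred = mlb.transform(pred_tags)

# Hamming loss: fraction of individual tag decisions that are wrong
# (here one missed "dp" out of 3 problems x 4 tags = 1/12).
print(hamming_loss(y_true, y_pred))
# Micro-averaged F-measure over all tag decisions (here 8/9).
print(f1_score(y_true, y_pred, average="micro"))
```

The indicator-matrix encoding is what distinguishes this setting from plain multi-class classification, where each row would contain exactly one nonzero entry.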
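The abstract compares the deep learning models against well-known IR approaches such as tf-idf. A multi-label tf-idf baseline of the kind being compared against might be sketched as follows; the pipeline, the classifier choice (one-vs-rest logistic regression) and the toy data are illustrative assumptions, not the paper's actual baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for crawled problem statements and their tags.
statements = [
    "count subsets summing to k",
    "shortest path between two nodes",
    "count paths in a grid with obstacles",
    "minimum spanning tree of a weighted graph",
]
tags = [["dp"], ["graphs"], ["dp"], ["graphs"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)  # one binary column per tag

# tf-idf features, with one independent binary classifier per tag.
model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LogisticRegression()))
model.fit(statements, y)

pred = model.predict(["longest path in a graph"])
print(mlb.inverse_transform(pred))
```

The one-vs-rest decomposition mirrors the Binary Relevance idea cited from [6]: each tag is predicted by its own binary classifier over the shared tf-idf features.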