Multi-label Classification for Automatic Tag Prediction in the Context of Programming Challenges

Bianca Iancu, Gabriele Mazzola, Kyriakos Psarakis, Panagiotis Soilis
Delft University of Technology
Delft, South Holland

ABSTRACT

One of the best ways for developers to test and improve their skills in a fun and challenging way is through programming challenges, offered by a plethora of websites. For inexperienced developers, some of the problems might appear too challenging, requiring some suggestions to implement a solution. On the other hand, tagging problems can be a tedious task for problem creators. In this paper, we focus on automating the task of tagging a programming challenge description using machine and deep learning methods. We observe that the deep learning methods implemented outperform well-known IR approaches such as tf-idf, thus providing a starting point for further research on the task.

1 INTRODUCTION

In recent years, more and more people have started to show interest in programming challenges, either for purely educational purposes or for preparing for job interviews. Following this trend, we can also see an increase in the number of online platforms providing services such as programming challenges or programming tutorials. These problems are based on a limited number of strategies to be employed. Understanding which strategy to apply to which problem is the key to designing and implementing a solution. However, it is often difficult to directly infer from the textual problem statement the strategy to be used in the solution, especially with little or no programming experience. Thus, assisting the programmer by employing a system that can recommend possible strategies could prove very useful, efficient and educational.

To this end, we propose a system to automatically tag a programming problem given its textual description. These tags can be either generic, such as 'Math', or more specific, such as 'Dynamic Programming' or 'Brute Force'. Each problem can have multiple tags associated with it, thus this research focuses on a multi-class multi-label classification problem. This is a more challenging problem than the usual multi-class classification, where each data point can only have one label attached. An example of a data sample for our problem is shown in Figure 1. A similar problem would be to predict these tags based on the source code of a solution, but this would restrict the application domain. To be more specific, we are interested in applying the system in the context of online programming challenge platforms where no solution or source code is available. Thus, we only consider the textual description of the problem statements as input to our system.

Figure 1: Data sample example

By gathering data from two of the main online programming challenge platforms, Topcoder (https://www.topcoder.com/) and CodeForces (https://codeforces.com), we approach the problem through both machine learning and deep learning solutions, experimenting with different architectures and data representations. Considering the complexity of this problem, which is difficult even for humans, our hypothesis is that deep learning methods should be an effective way of approaching this problem, given enough data. Based on the aforementioned, the research question that we are trying to answer is the following: "Could deep learning models learn and understand what are the strategies to be employed in a programming challenge, given only the textual problem statement?"

The rest of the paper is organized as follows: in Section 2 we describe the related work carried out in the literature, regarding both multi-label classification and text representation methods. Subsequently, in Section 3 we describe the process of gathering, understanding and pre-processing the data, while in Section 4 we present the data representation methods and the models that we employ. Following that, in Section 5 we discuss the experimental setup and present the results, followed by a discussion and reflection regarding those in Section 6.
Finally, we conclude our research in Section 7.

2 RELATED WORK

Multi-label classification

As far as we are aware, no previous work has been carried out regarding multi-label classification in the context of tagging programming problem statements. Additionally, most of the papers tackling this task employ traditional machine learning methods rather than deep learning approaches. In [21] the authors propose a multi-label lazy learning approach called ML-kNN, based on the traditional k-nearest neighbor algorithm. Moreover, in [9] the text classification problem is approached, which consists of assigning a text document to one or more topic categories. The paper employed a Bayesian approach to multi-class, multi-label document classification in which the multiple classes of a document were represented by a mixture model. Additionally, in [6] the authors modeled the problem of automated tag suggestion as a multi-label text classification task in the context of "Tag Recommendation in Social Bookmark Systems". The proposed method was built using Binary Relevance (BR) and a naive Bayes classifier as the base learner, and was evaluated using the F-measure. Furthermore, a new method based on the nearest-neighbor technique was proposed in [5]. More specifically, a multi-label categorical K-nearest neighbor approach was proposed for classifying risk factors reported in SEC form 10-K, an annually filed report published by US companies 90 days after the end of the fiscal year.

A different strategy for approaching the multi-label classification problem proposed in the literature is active learning, examined in [2]. Furthermore, in [18] the authors proposed a multi-label active learning approach for text classification based on applying a Support Vector Machine, followed by a Logistic Regression.

Regarding deep learning approaches, in [11] the authors analyzed the limitations of BP-MLL, a neural network (NN) architecture aiming at minimizing the pairwise ranking error. Additionally, they proposed replacing the ranking loss minimization with the cross-entropy error function and demonstrated that neural networks can efficiently be applied to the multi-label text classification setting. By using simple neural network models and employing techniques such as rectified linear units, dropout and AdaGrad, their work outperformed state-of-the-art approaches for this task. Furthermore, the research carried out in [8] analyzed the task of extreme multi-label text classification (XMTC). This refers to assigning the most relevant labels to documents, where the labels can be chosen from an extremely large label collection that could even reach a size of millions. The authors in [8] presented the first deep learning approach to XMTC, using a Convolutional Neural Network. They showed that the proposed CNN successfully scaled to the largest datasets used in the experiments, while consistently producing the best or the second-best results on all the datasets.

When it comes to evaluation metrics in the multi-label classification scenario, a widely employed metric in the literature is the Hamming loss [6][13][16][22]. Furthermore, more traditional metrics are also used, such as the F-measure [6][13][16] and the average precision [16][21].

Text representation

Apart from the aforementioned, a lot of work in the literature has gone into experimenting with different ways of representing text. According to [17], the word representation that has been traditionally used in the majority of supervised Natural Language Processing (NLP) applications is one-hot encoding. This term refers to the process of encoding each word into a binary vector representation where the dimensions of the vector represent all the unique words included in the corpus vocabulary. While this representation is simple and robust [10], it does not encode any notion of similarity between words. For instance, the word "airplane" is equally different from the word "university" as from the word "student". The proposal of Word2Vec [10] solves this issue by encoding words into a continuous lower-dimensional vector space where words with similar semantics and syntax are grouped close to each other. This led to the proposal of more types of text embeddings, such as Doc2Vec [7], which encodes larger sequences of text, such as paragraphs, sentences, and documents, into a single vector rather than having multiple vectors, i.e. one per word.
One of the more recent language models which has produced state-of-the-art results in several NLP tasks is the Bidirectional Encoder Representations from Transformers (BERT) model [1]. Its main innovation comes from applying bidirectional training of an attention model, the Transformer, to language modeling.

3 DATA

In this section, we describe the steps carried out to obtain the dataset that we have worked with. More specifically, we focus on presenting the data sources and the data collection process, followed by an overview of the preprocessing that we perform. Additionally, we show several descriptive statistics regarding the data and explain the steps we take for defining the tag taxonomy for the dataset.

Data sources & collection

We investigate different competitive programming platforms to build a dataset of programming statements and tags. Specifically, the platforms of interest to us are: TopCoder (https://www.topcoder.com/), Hackerrank (https://www.hackerrank.com/), CS Academy (https://csacademy.com/), Codewars (https://www.codewars.com/), and Codeforces (http://codeforces.com/problemset). Due to legal reasons as well as the complexity of the platform interfaces, we manage to successfully scrape only two of these platforms:

• Codeforces, on 13/09/2019. Codeforces is a website that offers programming contests to people interested in participating. Users are free to upload a challenge statement, together with several tags specifying the strategies that should be used to devise a solution. We scrape a total of 5,341 problems, together with their tags.

• Topcoder, on 17/09/2019. Topcoder is a crowdsourcing platform in the sense that each problem is made available to all the developers and they provide solutions. It also provides an archive with free access, containing a set of problems with tags, similarly to CodeForces. We scrape a total of 4,508 problems, together with their tags.

We implement the scraping of these two platforms in Python, creating two ad-hoc scripts, one per platform. We use BeautifulSoup4 (https://pypi.org/project/beautifulsoup4/) to parse the HTML of the pages and extract only the meaningful information.
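The scraping scripts themselves are part of the authors' repository rather than the paper. The sketch below only illustrates the kind of per-problem extraction described above for a single Codeforces page; the URL pattern and the CSS selectors ("problem-statement", "tag-box") are assumptions, not details taken from the paper.

```python
# Minimal sketch of the scraping step described above (not the authors' actual
# scripts; the problem URL pattern and CSS selectors are assumptions).
import requests
from bs4 import BeautifulSoup

def scrape_codeforces_problem(contest_id, index):
    """Fetch one Codeforces problem page and extract its statement text and tags."""
    url = f"https://codeforces.com/problemset/problem/{contest_id}/{index}"
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # The statement text is assumed to live in a div with class "problem-statement".
    statement_div = soup.find("div", class_="problem-statement")
    text = statement_div.get_text(separator=" ", strip=True) if statement_div else ""

    # Tags are assumed to be rendered as spans with class "tag-box".
    tags = [span.get_text(strip=True) for span in soup.find_all("span", class_="tag-box")]

    return {"text": text, "tags": tags}

if __name__ == "__main__":
    problem = scrape_codeforces_problem(1, "A")
    print(problem["tags"])
```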
Data preprocessing

The crawled data contain a lot of redundant information, or even data that can impact the training process in a negative way (e.g. HTML tags and LaTeX symbols). What is more, some NLP techniques, such as stop-word removal, have been shown to increase classification performance on textual data, since stop words do not provide any additional information and only increase the dimensionality of the problem [15]. Following the collection process, we merge the two crawled datasets and preprocess them.

At first, HTML tags (<.*?>) are removed, since they do not provide any descriptive information about the problem. Afterward, since mathematical definitions are in LaTeX format on CodeForces and in plain text on TopCoder, we decide to remove them from both to avoid introducing differences between the two sets of data. In addition, all non-ASCII characters, digits, stop-words, punctuation and one-character words (i.e. variable names) are removed for the same reason. The next step of the preprocessing pipeline is to concatenate all textual fields into one (i.e. title and description are concatenated into the field "text"). Furthermore, we convert every character in this new field to lowercase. Finally, we run an exact duplicate removal, since we observed that some challenges on both crawled websites appear more than once. Then we remove words with fewer than 10 occurrences in the entire corpus, assuming that they are not descriptive enough. Thus, the final dataset is a JSON file with an array of coding challenges, where each entry has two fields, the "text" and the "tags" associated with it.
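The following is an illustrative sketch of the preprocessing pipeline described above, not the authors' exact script; in particular, the NLTK stop-word list and the dollar-sign LaTeX delimiter are assumptions.

```python
# Illustrative sketch of the preprocessing pipeline described above (not the
# authors' exact script). The NLTK stop-word list and the $...$ LaTeX delimiter
# are assumptions.
import re
from collections import Counter

from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def clean(text):
    text = re.sub(r"<.*?>", " ", text)                # strip HTML tags
    text = re.sub(r"\$[^$]*\$", " ", text)            # strip LaTeX math (assumed delimiters)
    text = text.encode("ascii", "ignore").decode()    # drop non-ASCII characters
    text = re.sub(r"\d+", " ", text)                  # drop digits
    text = re.sub(r"[^\w\s]", " ", text)              # drop punctuation
    tokens = [t for t in text.lower().split()
              if len(t) > 1 and t not in STOPWORDS]   # drop one-char words and stop-words
    return " ".join(tokens)

def preprocess(problems, min_count=10):
    # Concatenate title and description into a single "text" field and clean it.
    cleaned = [{"text": clean(p["title"] + " " + p["description"]), "tags": p["tags"]}
               for p in problems]
    # Exact-duplicate removal on the cleaned text.
    seen, unique = set(), []
    for p in cleaned:
        if p["text"] not in seen:
            seen.add(p["text"])
            unique.append(p)
    # Drop words occurring fewer than `min_count` times in the whole corpus.
    counts = Counter(w for p in unique for w in p["text"].split())
    for p in unique:
        p["text"] = " ".join(w for w in p["text"].split() if counts[w] >= min_count)
    return unique
```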
Descriptive statistics

After performing the preprocessing steps, we compute several descriptive statistics. More specifically, we compute the total number of problems, the average word count per problem and the average number of tags per problem, separately for the CodeForces and Topcoder data, as well as for the combined dataset. These statistics are provided in Table 1.

                          CodeForces   Topcoder   Combined dataset
Problem count             4,592        4,115      8,707
Avg word count/problem    120.93       94.81      108.58
Avg tags/problem          1.76         1.46       1.62

Table 1: Descriptive statistics

As we can observe, the number of problems gathered from the two data sources, as well as the additional statistics, are quite balanced. There are no significant differences in the average word count per problem or the average tags per problem between the CodeForces data and the Topcoder data. For a breakdown of the average number of words per tag, the reader can refer to Appendix A.

Tag taxonomy

Since each of the websites uses a different set of tags, it is necessary to create a taxonomy that maps the original tags to new common target labels. To decide on the class labels to be included in the taxonomy, we start by analyzing the data. To this end, we generate a correlation matrix for all the original tags present in the dataset to infer which of them can be grouped into a more general label. The original number of labels was 16 for the Topcoder data and 35 for the CodeForces data. After performing the label aggregation we obtain 17 common tags for the two data sources. For a complete list of these, the reader can refer to Appendix B. Additionally, we plot a bar chart of the frequencies of the class labels in Figure 3 to understand whether some classes are present in too few instances to be properly learned by the models.

Figure 3: Distribution of the original taxonomy

Following preliminary experimentation, we observe that the performance is unsatisfactory, which leads us to suspect that we have too few data instances given the associated number of labels. Thus, we proceed by further aggregating and reducing the number of tags. As before, we perform a visual analysis of the correlation matrix between the different tags, shown in Figure 2, to decide which of them can be aggregated.

Figure 2: Problem tag correlation

After performing this additional aggregation, we decrease the number of tags from 17 to 9. For a complete list of these, the reader can refer to Appendix C. The main semantic decisions taken while creating the final taxonomy for the data are the following:

(1) We remove several tags that we consider too general and vague, and that thus do not give any concrete information regarding the methodology associated with the problem statement. Such tags are, for example, 'implementation', 'programming', 'arrays' or 'interactive'.

(2) We group problem statements originally labeled with tags related to strings, string manipulations, and regular expressions into one category under the 'String Operations' name.

(3) The term 'ternary search' was found to correspond to similar operations as 'binary search' (https://www.geeksforgeeks.org/ternary-search/). Thus, we group their associated problem statements into one label, 'Binary Search'. During the second aggregation stage, we further combine 'Binary Search' with the 'Search' label under the common name 'Search and Binary Search', as although they use different methods, they refer to a similar class of problems.

(4) Since we observed that tags like 'shortest path' and 'dfs' were associated with both 'graphs' and 'trees', we merge the latter two into the 'Graphs' label. Additionally, since graphs are a data structure, we further combine them with 'Data Structures' in the second aggregation stage into the 'Data Structures and Graphs' label.

(5) Regarding the 'Math' tag, we initially intended to remove it, since we consider it to be quite general. However, after further analysis we observe that it is present in a high number of data points, hence we decide to keep it in the taxonomy. In the second aggregation stage, we combine it with the 'Probabilities and Combinatorics' tag, since it is a mathematics field as well, under the common name 'Math and Probabilities'.

The distribution of the final taxonomy associated with our dataset is shown in Figure 4. We can observe that there is an imbalance present in the dataset, with approximately 2500 data points of difference between 'Math and Probabilities' (the most common tag) and 'Geometry' (the least common tag). The imbalance is further taken into account when implementing the machine learning and deep learning models.

Figure 4: Distribution of the final taxonomy

Additionally, it is important to note that while we define 9 classes for classifying the data points, according to the descriptive statistics presented before, our data contain fewer than 2 tags per data point on average. Thus, the models will have to learn to predict significantly fewer 1s than 0s for the associated labels. Therefore, to account for this matter, in the methodology presented in the next section we weight the evaluation metric, as well as the loss function for the deep learning models, to give more importance to correctly predicting 1s during training.
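As an illustration of the tag analysis described above, the correlation matrix behind Figure 2 can be computed from binary tag indicator vectors. The use of pandas and scikit-learn here is an assumption; the paper does not name the tooling it used.

```python
# Sketch of the tag co-occurrence analysis behind Figure 2; pandas and
# scikit-learn are assumptions, the paper does not name the tooling used.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def tag_correlation(tag_lists):
    """Pearson correlation between binary tag indicators, one row per problem."""
    mlb = MultiLabelBinarizer()
    indicator = mlb.fit_transform(tag_lists)          # (n_problems, n_tags) 0/1 matrix
    return pd.DataFrame(indicator, columns=mlb.classes_).corr()

# Toy example: 'graphs' and 'trees' tend to co-occur, hinting they can be merged.
corr = tag_correlation([["graphs", "dfs"], ["trees", "dfs"],
                        ["graphs", "trees"], ["math"]])
print(corr.round(2))
```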

4 METHODOLOGY

In this section, we first describe the data representation techniques that we used to encode our data. Following that, the machine and deep learning techniques used to classify the programming problem statements into the target set of tags are outlined. It has to be underlined that we only provide the intuition and main concepts of these methods, since they do not constitute the main focus of our work.

Data representation

tf-idf. To obtain features for our baseline classifiers we employ the tf-idf approach. The Python library that we use (https://scikit-learn.org/stable/modules/feature_extraction.html) employs the following formula:

tf-idf(t, d) = tf(t, d) * idf(t)

where tf(t, d) is the number of times term t appears in document d, and

idf(t) = log((1 + n) / (1 + df(t))) + 1

where n is the total number of documents and df(t) is the number of documents containing term t. By doing this, we obtain a number of features for each document equal to the total number of unique terms. For a more detailed overview of tf-idf and its usage we refer the reader to [12].
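As a concrete illustration (not necessarily the authors' exact configuration), the same features can be obtained with scikit-learn's TfidfVectorizer, whose default smooth_idf=True setting implements the idf formula above; note that the vectorizer additionally L2-normalizes each document vector by default.

```python
# Sketch of the tf-idf baseline features with scikit-learn; smooth_idf=True
# corresponds to idf(t) = log((1 + n) / (1 + df(t))) + 1 as given above.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "count the number of shortest paths in a weighted graph",
    "find the optimal strategy using dynamic programming",
]

vectorizer = TfidfVectorizer(smooth_idf=True)
X = vectorizer.fit_transform(corpus)   # sparse matrix: n_documents x n_unique_terms
print(X.shape)                         # one feature per unique term in the corpus
```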
One-hot encoding. Another representation technique we experiment with is one-hot encoding. This method represents textual data as discrete vectors with binary values. As mentioned previously, one-hot encoding treats all words in an equal fashion and does not preserve any information about word meaning. Another thing to note is that it also leads to a feature space with very high dimensionality. In particular, the number of dimensions of the one-hot vector is equal to the number of unique words in the programming problem statements dataset.

Word2Vec. One of the most popular representation techniques in the field of NLP is the Word2Vec [10] model. In particular, in their work, the authors propose two different feedforward networks, Continuous Bag-of-Words (CBOW) and Skip-gram, that project words into a continuous vector space which preserves their semantics and syntax. The main difference between the two models is that CBOW attempts to predict the target word based on its context, whereas Skip-gram predicts the context of the current word. In our work, we use the Skip-gram method, since it was found to outperform CBOW in [10]. More details can be found in the original Word2Vec paper [10].

Doc2Vec. The final representation that we try out is the Doc2Vec embedding proposed by [7]. In simple terms, it extends the continuous vector space idea from Word2Vec [10] to larger text sequences, such as sentences, paragraphs, and documents. To be more exact, we map each programming problem statement to a pre-defined number of dimensions. Similarly to Word2Vec, two models are proposed, called Distributed Memory (PV-DM) and Distributed Bag of Words (PV-DBOW), which correspond to the CBOW and Skip-gram models from Word2Vec [10] respectively. In this paper, we use the PV-DBOW model to obtain the document embeddings. More information about Doc2Vec is available in [7].
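The paper does not state which implementation it uses for these embeddings; the sketch below uses gensim, with the Skip-gram (sg=1) and PV-DBOW (dm=0) variants named above. The dimensionalities (300 for Word2Vec, 30 for Doc2Vec) follow the hyperparameters reported in Section 5; all other parameters are assumptions.

```python
# Sketch of training the two embedding models described above with gensim
# (library choice and training parameters are assumptions; the paper only states
# Skip-gram for Word2Vec, PV-DBOW for Doc2Vec, and the dimensionalities).
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tokenized = [doc.split() for doc in ["count shortest paths graph",
                                     "optimal strategy dynamic programming"]]

# Word2Vec, Skip-gram (sg=1); one 300-dimensional vector per word.
w2v = Word2Vec(sentences=tokenized, vector_size=300, sg=1, min_count=1, epochs=20)
word_sequences = [[w2v.wv[w] for w in doc] for doc in tokenized]   # LSTM input sequences

# Doc2Vec, PV-DBOW (dm=0); one 30-dimensional vector per problem statement.
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(tokenized)]
d2v = Doc2Vec(documents=tagged, vector_size=30, dm=0, min_count=1, epochs=20)
doc_vectors = [d2v.dv[i] for i in range(len(tagged))]              # feed-forward network input
```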

Models

For carrying out the multi-label classification, we employ both machine learning and deep learning approaches, thus comparing their performance and analyzing the influence of deep learning on this task.

The machine learning models we use for classification are the following:

(1) Decision Tree: a non-parametric supervised learning technique that performs classification by learning decision rules based on the data features. The model is implemented using the scikit-learn Python library (https://scikit-learn.org/stable/modules/tree.html).

(2) Random Forest: a model that fits several decision tree classifiers on different sub-samples of the data and averages their outputs. It is also implemented using the scikit-learn Python library (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

(3) Random Classifier: we implement our multi-class multi-label random classifier by predicting vectors where each entry has an equal probability of being either zero or one. We therefore expect to have, on average, 4.5 ones predicted per sample.

The deep learning model that we employ is a Long Short-Term Memory (LSTM) network, which is an improvement over traditional Recurrent Neural Networks (RNNs). By using the gating mechanism, it addresses the short-term memory challenge that RNNs have. Thus, LSTMs can preserve long-term memory, which is an important aspect to consider since we are dealing with a text classification task: when generating the labels, it is essential to preserve the long-term dependencies in the problem statements. In order to obtain the final predicted labels we use a logistic activation function. The reason that the logistic activation function was used over softmax is that the latter is more commonly used for multi-class classification problems where there is only one target label. We implement the LSTM architectures using PyTorch (https://pytorch.org/), a Python-based open-source deep learning library.
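A minimal PyTorch sketch of the LSTM classifier described above, with a sigmoid (logistic) output over the 9 labels instead of a softmax. The 16 hidden units and the 9-way output follow the hyperparameters reported in Section 5; batching, padding and training code are omitted, so this is an illustration rather than the authors' implementation.

```python
# Minimal sketch of the LSTM classifier described above: a sigmoid (logistic)
# output over the 9 labels rather than a softmax. Layer sizes follow Section 5.
import torch
import torch.nn as nn

class TagLSTM(nn.Module):
    def __init__(self, input_dim=300, hidden_dim=16, n_labels=9):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_labels)

    def forward(self, x):
        # x: (batch, seq_len, input_dim) sequence of word embeddings
        _, (h_n, _) = self.lstm(x)          # h_n: (1, batch, hidden_dim), last hidden state
        logits = self.out(h_n.squeeze(0))   # (batch, n_labels)
        return torch.sigmoid(logits)        # independent per-label probabilities

model = TagLSTM()
probs = model(torch.randn(4, 50, 300))      # e.g. 4 problems, 50 words, Word2Vec features
predicted_tags = (probs > 0.5).int()
```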

Training & Test sets Average Number of Ones per Sample. As an additional met- ric, we keep track of the average number of ones predicted Being aware of the fact that we have liŠle data available, per sample by the model. We do this mainly to make sure 8,707 data points a‰er pre-processing and duplicate removal, our models are not predicting either all ones or all zeros, and we decide to perform a single split between training and test, to account for the label unbalance of the dataset. without creating a validation set. We perform a 90-10 split, creating a training set consisting of 7826 training samples, Training and 881 test samples. Œis decision is backed up by the fact that (i) it is not straightforward to balance the sets since we As explained earlier in this section, we perform a 90-10 split are in a multilabel seŠing (one problem can have zero, one, or of our data, obtaining training and test set. Œe test set is multiple tags) and (ii) our goal is not devising a state-of-the- only used to perform early stopping (to halt the training if art architecture for the given problem, but rather verifying the model does not improve on the test set for more than 10 the applicability of neural networks to tackle it. epochs) and to tune a very limited number of hyperparame- We mitigate the e‚ect of this decision by avoiding ex- ters. To obtain a test set with distribution as similar as possi- tensive hyperparameter tuning using the test set, which ble to the training set, we employ the iterative strati€cation would otherwise cause our model to over€t on the test set14. 15hŠps://scikit-learn.org/stable/modules/generated/sklearn.metrics. 12hŠps://scikit-learn.org/stable/modules/generated/sklearn.ensemble. average precision score.html RandomForestClassi€er.html 16hŠps://scikit-learn.org/stable/modules/generated/sklearn.metrics. 13hŠps://pytorch.org/ recall score.html 14hŠps://www.kdnuggets.com/2019/05/careful-looking-model-results- 17hŠps://scikit-learn.org/stable/modules/generated/sklearn.metrics. cause-information-leakage.html f1 score.html 6 Model Weighted Hamming Score Avg Precision Avg Recall Avg F1 Loss Avg # of Ones per sample N. of trainable params tf-idf + Random Forest 0.27 0.719 0.111 0.171 - 0.22 - tf-idf + Dec. tree 0.423 0.326 0.331 0.323 - 1.68 - Random classi€er 0.518 0.228 0.525 0.309 - 4.62 - one-hot + LSTM 0.708 0.407 0.7 0.502 0.163 2.61 402,009 Doc2Vec + FFNN 0.754 0.387 0.79 0.483 0.173 3.75 649 Word2Vec + LSTM 0.762 0.375 0.793 0.489 0.158 4 20,505 Table 2: Model results on test set

approach18, which aims to solve the problem of balancing Results the split in the case of a multiclass multilabel seŠing. Table 2 shows the performance of our models on the test We implement our models using the deep learning frame- set. We want to highlight again that although this data has 13 work Pytorch and using some of the already implemented been used for early stopping and limited tuning, the models 19 20 models of scikit . All of our code can be found on GitHub . have never seen it during training. It can be seen that W2V + For the training of our architectures we employ binary LSTM achieves the best Weighted Hamming Score which is, 21 cross-entropy loss . However, similarly to the customiza- in general, higher for the neural network models compared tion performed on the Hamming Loss, we weight the binary to the two baselines found in the €rst three rows of the table. cross-entropy loss to account for the label unbalance.

Model hyperparameters 6 DISCUSSION & REFLECTION With regards to the tf-idf representation, the input dimen- In this section we discuss our study, the €ndings arising from sionality is 6,259, which is equal to the number of unique our experiments and the challenges faced while conducting terms in the entire corpus. For both Decision Tree and them. Random Forest we employ the default values of Sklearn, To our knowledge, this is the €rst research work that and set the number of estimators for the Random Forest focuses on predicting tags based on the textual description classi€er to 500. of programming problem statements. As a result, we cannot Similarly, with one-hot encoding, the input dimensional- compare the performance we obtain with any state-of-the-art ity equals the number of uniques terms in the corpus (6,259). result. Œe best performing LSTM network for this data represen- When it comes to interpreting the results shown in Table tation has 16 hidden units, followed by a fully connected 2, we consider as baselines the tf-idf with Decision Tree, layer for the output classi€cation (9 classes). We use zero- since it achieves beŠer performance than Random Forest padding on the input sequences to create batches and speed in terms of the Weighted Hamming Score and the Random up training. Œe learning rate is 0.01. Classi€er. We can observe that the deep learning models Œe Doc2Vec approach results in a smaller input repre- achieve signi€cantly higher performance compared to our sentation, with a dimensionality of 30. Œe architectures has baselines, performing beŠer on all the considered metrics. two fully connected layers, with 16 and 9 neurons respec- Regarding the data representation methods employed for the tively. Œe non-linear activation function used is ReLu. Œe deep learning models, Word2Vec outperforms both Doc2Vec learning rate is again 0.01. and one-hot encoding. Œis is a sensible result, as Word2Vec Œe Word2Vec data representation results in an input preserves the semantics and syntax of the words, as opposed representation of 300. Œe LSTM has 16 hidden units and it to one-hot encoding. Additionally, it allows for the sequential is followed by a fully connected layer that maps the hidden modeling of the words in the problem statements by employ- state to an output of dimensionality 9, for the €nal classi€ca- ing an LSTM, as opposed to Doc2Vec which encodes the data tion. Again, we employ zero-padding on the input sequences. at a paragraph level. However, as we can observe in Table 2, Œe learning rate is 0.005. the number of trainable parameters is considerably higher for the Word2Vec model than for the Doc2Vec one, while only achieving a slight improvement in performance. Œus, depending on the use case, Doc2Vec might be preferred over 18hŠp://scikit.ml/api/skmultilearn.model selection.iterative strati€cation. Word2Vec. For instance, if a system needs to be retrained reg- html ularly to account for new data, using Doc2Vec instead might 19hŠps://scikit-learn.org/stable/supervised learning.html 20hŠps://github.com/serg-ml4se-2019/group11-tagging-algorithm- be a reasonable choice. Nevertheless, both the Word2Vec problems and Doc2Vec models have a substantially lower number of 21hŠps://pytorch.org/docs/stable/nn.html#bceloss trainable parameters compared to one-hot encoding. 
Model hyperparameters

With regards to the tf-idf representation, the input dimensionality is 6,259, which is equal to the number of unique terms in the entire corpus. For both the Decision Tree and the Random Forest we employ the default values of scikit-learn, and set the number of estimators for the Random Forest classifier to 500.

Similarly, with one-hot encoding, the input dimensionality equals the number of unique terms in the corpus (6,259). The best performing LSTM network for this data representation has 16 hidden units, followed by a fully connected layer for the output classification (9 classes). We use zero-padding on the input sequences to create batches and speed up training. The learning rate is 0.01.

The Doc2Vec approach results in a smaller input representation, with a dimensionality of 30. The architecture has two fully connected layers, with 16 and 9 neurons respectively. The non-linear activation function used is ReLU. The learning rate is again 0.01.

The Word2Vec data representation results in an input representation of dimensionality 300. The LSTM has 16 hidden units and is followed by a fully connected layer that maps the hidden state to an output of dimensionality 9, for the final classification. Again, we employ zero-padding on the input sequences. The learning rate is 0.005.

Results

Table 2 shows the performance of our models on the test set. We want to highlight again that although this data has been used for early stopping and limited tuning, the models have never seen it during training. It can be seen that W2V + LSTM achieves the best Weighted Hamming Score, which is, in general, higher for the neural network models compared to the two baselines found in the first three rows of the table.

6 DISCUSSION & REFLECTION

In this section we discuss our study, the findings arising from our experiments and the challenges faced while conducting them.

To our knowledge, this is the first research work that focuses on predicting tags based on the textual description of programming problem statements. As a result, we cannot compare the performance we obtain with any state-of-the-art result.

When it comes to interpreting the results shown in Table 2, we consider as baselines the tf-idf with Decision Tree (since it achieves better performance than Random Forest in terms of the Weighted Hamming Score) and the Random Classifier. We can observe that the deep learning models achieve significantly higher performance compared to our baselines, performing better on all the considered metrics. Regarding the data representation methods employed for the deep learning models, Word2Vec outperforms both Doc2Vec and one-hot encoding. This is a sensible result, as Word2Vec preserves the semantics and syntax of the words, as opposed to one-hot encoding. Additionally, it allows for the sequential modeling of the words in the problem statements by employing an LSTM, as opposed to Doc2Vec, which encodes the data at the paragraph level. However, as we can observe in Table 2, the number of trainable parameters is considerably higher for the Word2Vec model than for the Doc2Vec one, while only achieving a slight improvement in performance. Thus, depending on the use case, Doc2Vec might be preferred over Word2Vec. For instance, if a system needs to be retrained regularly to account for new data, using Doc2Vec instead might be a reasonable choice. Nevertheless, both the Word2Vec and Doc2Vec models have a substantially lower number of trainable parameters compared to one-hot encoding.

Moreover, we notice that in the case of the W2V+LSTM model, our best performing one in terms of Weighted Hamming Score, the average predicted number of ones per sample is quite high in comparison to the true labels (4 compared to 1.62). This could potentially be addressed by adding a custom regularization term to this model to force the number of predicted ones to remain lower.

Additionally, during training we observed that our deep learning models were overfitting on the training set after a few epochs, even after experimenting with regularization techniques such as dropout and weight decay. Our reasoning behind this issue is that we do not have enough data, since the problem that we aimed at solving serves a very specific purpose. This observation is also apparent in our baseline creation, where more complex classifiers like Random Forest perform worse than a simpler Decision Tree. For an example of the overfitting problem, we refer the reader to Appendix D, where we include a training curve showing overfitting behaviour (one-hot encoding + LSTM). That said, we notice that there is an emerging trend in the deep learning literature that aims at solving problems that cannot be represented by large amounts of data, called small sample learning [14]. In particular, models try to learn concepts rather than patterns, since a concept can generalize better, thus avoiding overfitting. Such an approach could be investigated as future work for this task.

7 CONCLUSION

In this work we investigated how deep learning techniques perform on a novel multi-class multi-label text classification problem. Our findings show that deep learning approaches significantly outperform traditional, widely accepted IR techniques like tf-idf. Moreover, we experimented with different combinations of text representations and neural network architectures, finding Word2Vec + LSTM to be the option that yields the best performance. Finally, we were able to experience first hand the challenges and issues arising when having a limited amount of data, an issue that is common in the deep learning literature.

Regarding possible future research directions, the need to collect more data for this task is of crucial importance. This could be done by gathering data either from more programming challenge websites or from slightly different domains that can provide similar data samples. Another way to get more training data is by artificially augmenting the dataset, either through the use of synonyms or through adversarial networks [4][20]. The availability of more training data will then allow researchers to use more advanced text embeddings such as BERT [1] and XLNet [19], which are the current state of the art in the NLP field. Last but not least, different lines of work can also be tried, such as avoiding preprocessing the dataset and learning character embeddings, similar to what was performed in [3].
REFERENCES

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[2] Andrea Esuli and Fabrizio Sebastiani. Active learning strategies for multi-label text classification. In European Conference on Information Retrieval, pages 102–113. Springer, 2009.
[3] Ben Gelman, Bryan Hoyle, Jessica Moore, Joshua Saxe, and David Slater. A language-agnostic model for semantic source code labeling. In Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis, pages 36–44. ACM, 2018.
[4] Rahul Gupta. Data augmentation for low resource sentiment analysis using generative adversarial networks. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7380–7384. IEEE, 2019.
[5] Ke-Wei Huang and Zhuolun Li. A multilabel text classification algorithm for labeling risk factors in SEC form 10-K. ACM Transactions on Management Information Systems (TMIS), 2(3):18, 2011.
[6] Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD, volume 18, page 5, 2008.
[7] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, 2014.
[8] Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124. ACM, 2017.
[9] Andrew McCallum. Multi-label text classification with a mixture model trained by EM. In AAAI Workshop on Text Learning, pages 1–7, 1999.
[10] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, January 2013.
[11] Jinseok Nam, Jungi Kim, Eneldo Loza Mencía, Iryna Gurevych, and Johannes Fürnkranz. Large-scale multi-label text classification: revisiting neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 437–452. Springer, 2014.
[12] Juan Ramos et al. Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, volume 242, pages 133–142. Piscataway, NJ, 2003.
[13] Jesse Read, Bernhard Pfahringer, and Geoffrey Holmes. Multi-label classification using ensembles of pruned sets. In 8th IEEE International Conference on Data Mining, pages 995–1000. IEEE, 2008.
[14] Jun Shu, Zongben Xu, and Deyu Meng. Small sample learning in big data era. arXiv preprint arXiv:1808.04572, 2018.
[15] Catarina Silva and Bernardete Ribeiro. The importance of stop word removal on recall values in text categorization. In Proceedings of the International Joint Conference on Neural Networks, 2003, volume 3, pages 1661–1666. IEEE, 2003.
[16] Konstantinos Trohidis, Grigorios Tsoumakas, George Kalliris, and Ioannis P. Vlahavas. Multi-label classification of music into emotions. In ISMIR, volume 8, pages 325–330, 2008.
[17] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 384–394, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. Event place: Uppsala, Sweden.
[18] Bishan Yang, Jian-Tao Sun, Tengjiao Wang, and Zheng Chen. Effective multi-label active learning for text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 917–926. ACM, 2009.
[19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
[20] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[21] Min-Ling Zhang and Zhi-Hua Zhou. A k-nearest neighbor based algorithm for multi-label classification. GrC, 5:718–721, 2005.
[22] Min-Ling Zhang and Zhi-Hua Zhou. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.

APPENDIX

A. DATASET DESCRIPTIVE STATISTICS

Figure 5: Average number of words per tag for the CodeForces data

Figure 6: Average number of words per tag for the Topcoder data

Figure 7: Average number of words per tag for the full combined data

B. LIST OF INITIAL 17 TAGS

After the initial aggregation of the tags present in the two datasets, scraped from Topcoder and CodeForces, the 17 resulting problem statement tags are the following:

(1) Dynamic Programming
(2) Greedy
(3) Sorting
(4) Recursion
(5) Graphs
(6) String Operations
(7) Data Structures
(8) Divide and Conquer
(9) Geometry
(10) Bit Manipulation
(11) Brute Force
(12) Binary Search
(13) Search
(14) Game Theory
(15) Constructive Algorithms
(16) Math
(17) Probabilities and Combinatorics

C. LIST OF FINAL 9 TAGS

After the final aggregation of the initial 17 tags, the 9 final problem statement tags are the following:

(1) Dynamic Programming
(2) Greedy and Sorting
(3) Data Structures and Graphs
(4) String Operations
(5) Geometry
(6) Brute Force
(7) Search and Binary Search
(8) Constructive Algorithms
(9) Math and Probabilities

Figure 8 shows the aggregation that was performed on the initial 17 common tags in order to reduce them to the final 9 tags.

Figure 8: Final taxonomy

D. OVERFITTING PROBLEM

Figure 9: Overfitting example: One-hot encoding + LSTM.