Word Embeddings for Multi-Label Document Classification
Ladislav Lenc†‡  Pavel Král†‡
† Department of Computer Science and Engineering, University of West Bohemia, Plzeň, Czech Republic
‡ NTIS – New Technologies for the Information Society, University of West Bohemia, Plzeň, Czech Republic
[email protected]  [email protected]

Abstract

In this paper, we analyze and evaluate word embeddings for the representation of longer texts in the multi-label document classification scenario. The embeddings are used in three convolutional neural network topologies. The experiments are realized on the Czech ČTK and English Reuters-21578 standard corpora. We compare the results of word2vec static and trainable embeddings with randomly initialized word vectors. We conclude that the initialization does not play an important role for classification. However, learning of the word vectors is crucial to obtain good results.

1 Introduction

Text classification (or categorization) is one of the core tasks in the natural language processing (NLP) field. The applications of text classification are numerous, for instance sentiment analysis, automatic categorization of e-mails, or spam filtering. The main goal of document classification is to assign one or more labels to a given document. The categorization then helps users to find appropriate documents. Nowadays, computers are heavily utilized for this task and can save a great amount of human labor.

In this paper, we concentrate on multi-label document classification, which means that one document can belong to several classes simultaneously. More formally, given a set of documents D and a set of all possible labels C, we create a model that assigns a label set C_d ⊂ C to each document d ∈ D.

Multi-label classification is often solved using an ensemble of binary classifiers (Tsoumakas and Katakis, 2006). However, nowadays, neural nets outperform the majority of artificial intelligence approaches in fields including computer vision and natural language processing. Therefore, in this work we use approaches based on convolutional neural nets (CNNs) which were already presented in Kim (2014) and Lenc and Král (2017).

Usually, pre-trained word vectors obtained by some semantic model (e.g. word2vec (w2v) (Mikolov et al., 2013a) or GloVe (Pennington et al., 2014)) are used to initialize the embedding layer of the particular neural net. These vectors can then be progressively adapted during neural network training. Many experiments have shown that it is possible to obtain better results using these vectors than with randomly initialized ones. Moreover, it has been shown that even "static" vectors (initialized by pre-trained embeddings and fixed during network training) usually bring better performance than randomly initialized and trained ones.

However, these experiments were often realized on rather short texts and in the single-label classification task. In this paper, we would like to analyze and evaluate the use of word embeddings for the representation of longer texts in the multi-label classification scenario. The embeddings are used in three different convolutional neural network topologies. The experiments are realized on the Czech ČTK and English Reuters-21578 standard corpora. The Czech language has been chosen as a representative of a highly inflectional Slavic language with free word order. English is used to compare the results of our method with the state of the art.

We compare the results of word2vec static and trainable embeddings with randomly initialized word vectors. We conclude that the initialization does not play an important role for the classification of these documents. We further analyze and compare the word2vec embeddings with the learned ones from the semantic point of view and discuss the results.

The rest of the paper is organized as follows. The following section contains a short review of the usage of neural networks for document classification, including word embeddings. Section 3 describes the topologies of the convolutional networks. Section 4 deals with the experiments realized on the ČTK and Reuters corpora and then analyzes and discusses the obtained results. In the last section, we summarize the experimental results and propose some future research directions.

2 Related Work

Nowadays, neural nets belong to the state-of-the-art approaches for many natural language processing tasks, for instance POS tagging, chunking, named entity recognition, semantic role labeling, or document classification (Collobert et al., 2011).

First, we mention traditional feed-forward neural nets, as shown for instance in (Manevitz and Yousef, 2007). The authors obtain an F-measure of about 78% on the standard Reuters dataset with a simple three-layer multi-layer perceptron. The standard backpropagation algorithm for multi-label learning of an MLP was improved in (Zhang and Zhou, 2006). The authors use a novel error function which gives better results on functional genomics text categorization.

Nam et al. (2014) propose a novel learning strategy of feed-forward nets for the multi-label text classification task. The authors use the cross-entropy algorithm for training with rectified linear unit activations (Srivastava et al., 2014). The documents are represented by tf-idf, and multi-label classification is realized by a simple thresholding of the output layer. The networks are evaluated on several multi-label datasets and obtain results comparable with the state of the art.

Both standard convolutional networks and recurrent convolutional neural nets are also successfully used for text categorization. Lai et al. (2015) demonstrated that recurrent CNNs outperform CNNs on four corpora in the single-label document classification task.

Another CNN-based method with word embeddings as inputs (Kurata et al., 2016) leverages the co-occurrence of labels in multi-label classification. Some neurons in the output layer capture the patterns of label co-occurrences, which improves the classification accuracy. This method is evaluated on natural language query classification in a document retrieval system.

An alternative multi-label classification approach is proposed by Yang and Gopal (2012). The conventional representations of texts and categories are transformed into meta-level features. These features are then utilized in a learning-to-rank algorithm. Experiments on six benchmark datasets show good abilities of this approach in comparison with other methods.

Three different types of word embeddings combined with CNNs are compared in Kim (2014) on 7 NLP tasks including sentiment analysis and question classification. The author proposes a novel CNN topology and shows that word2vec initialization and subsequent learning play a crucial role in all of the sentence-level single-label classification tasks.

3 Network Topologies

In this section we describe the three CNN topologies used in our experiments. The network inputs are sequences of word indices into a vocabulary V of size |V|. In order to ensure a fixed length, all documents are padded or shortened to a specified length M. The words are then represented as real-valued word vectors of dimension E in the embedding layer. The embedding layer is either initialized randomly or by pre-trained word vectors from word2vec.¹ In the case of word2vec initialization, the layer is either further trained during the training process or kept static. The input and the embedding layers are similar in all three of the following network topologies. The concrete values of the main hyper-parameters of these networks are specified in Section 4.

¹ It is possible to use another semantic model; however, based on our previous experiments, we keep word2vec.
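To make the shared input and embedding setup concrete, the sketch below shows the padding to length M and the three embedding-layer settings compared in this paper (random, word2vec static, word2vec trainable). It is a minimal illustration written against the Keras API, not the authors' code (their implementation uses the Theano library); all concrete values and the w2v_matrix are placeholder assumptions.

```python
# Minimal sketch of the shared input/embedding setup: documents padded or
# shortened to length M, then embedded as E-dimensional vectors. Keras-style
# API for illustration only; all values and w2v_matrix are placeholders.
import numpy as np
from keras.layers import Embedding
from keras.preprocessing.sequence import pad_sequences

vocab_size = 20000    # |V|: vocabulary size (placeholder)
embedding_dim = 300   # E: word-vector dimension (placeholder)
max_len = 400         # M: fixed document length (placeholder)

# Toy documents as sequences of word indices into the vocabulary V.
docs = [[5, 12, 7], [3, 9, 2, 41, 8]]
x = pad_sequences(docs, maxlen=max_len)  # pad/shorten every document to M

# Pre-trained word2vec matrix, one row per vocabulary word (placeholder
# values here; normally loaded from a trained word2vec model).
w2v_matrix = np.random.rand(vocab_size, embedding_dim)

# 1) Random initialization, vectors learned during training.
emb_random = Embedding(vocab_size, embedding_dim, input_length=max_len)

# 2) word2vec initialization, kept static (frozen) during training.
emb_w2v_static = Embedding(vocab_size, embedding_dim, weights=[w2v_matrix],
                           input_length=max_len, trainable=False)

# 3) word2vec initialization, further adapted during training.
emb_w2v_trainable = Embedding(vocab_size, embedding_dim, weights=[w2v_matrix],
                              input_length=max_len, trainable=True)
```

In this framing, the only difference between the "static" and "trainable" regimes is the trainable flag, which is exactly the contrast evaluated in the experiments.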
3.1 Convolutional Network 1

This architecture was proposed in Lenc and Král (2017) for multi-label document classification. The embedding layer is followed by a convolutional layer with N_C convolution kernels of size k × 1 and the rectified linear unit (ReLU) activation function. The following layer performs max pooling over the length M − k + 1, resulting in N_C vectors of size 1 × E. The output of this layer is then flattened and connected to a fully connected layer with d_1 neurons. The output layer uses the sigmoid activation function and its size corresponds to the number of categories |C|. The final result is obtained by thresholding the output layer. This architecture will hereafter be referenced as CNN1.
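The CNN1 pipeline maps directly onto a few standard layers. The following is a minimal Keras-style sketch of the topology described above, not the authors' implementation; all hyper-parameter values (N_C, k, d_1, |C|) are placeholders, and the activation of the fully connected layer is an assumption — the concrete values appear in Section 4.

```python
# Illustrative Keras sketch of the CNN1 topology (Section 3.1). Placeholder
# hyper-parameters; the paper's implementation uses Theano and the real
# values are given in Section 4.
from keras.models import Sequential
from keras.layers import (Embedding, Reshape, Conv2D, MaxPooling2D,
                          Flatten, Dense)

vocab_size, embedding_dim, max_len = 20000, 300, 400  # |V|, E, M
num_kernels = 40      # N_C convolution kernels (placeholder)
kernel_len = 16       # k: each kernel has size k x 1 (placeholder)
d1 = 256              # size of the fully connected layer (placeholder)
num_classes = 37      # |C|: number of categories (placeholder)

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_len))
# Treat the M x E embedding matrix as a one-channel "image".
model.add(Reshape((max_len, embedding_dim, 1)))
# N_C kernels of size k x 1 with ReLU: each slides over word positions
# within a single embedding dimension.
model.add(Conv2D(num_kernels, (kernel_len, 1), activation='relu'))
# Max pooling over the whole length M - k + 1 -> N_C maps of size 1 x E.
model.add(MaxPooling2D(pool_size=(max_len - kernel_len + 1, 1)))
model.add(Flatten())
model.add(Dense(d1, activation='relu'))  # FC activation is an assumption
# Sigmoid output: one independent score per category, thresholded later.
model.add(Dense(num_classes, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
```

Prediction then reduces to comparing model.predict(x) against a threshold; a sketch of how that threshold might be set on development data follows Section 3.3.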
3.2 Convolutional Network 2

The second architecture is a modified version of the successful net proposed by Kim (2014). Contrary to the first topology, this network uses two-dimensional convolutional kernels of various widths. The sizes of the kernels are k × E, which means that a kernel spans the whole width of the embedding. The original version was used for single-label classification and therefore uses the softmax activation function in the output layer. In our case, sigmoid is more appropriate due to the multi-label classification task. Similarly to the previous topology, we also added one fully connected layer before the output. The output layer of size |C| is again thresholded to determine the set of assigned categories. This network will be further called CNN2. The architecture is described in detail in Kim (2014). The threshold values for both CNN1 and CNN2 are set on the development corpus.

3.3 Two-level CNN

[Figure 1: Two-level CNN architecture (2L-CNN).]

The first level of this network is the CNN1 topology described above. However, the output is not thresholded as in the former case.

The networks are implemented using the Theano deep learning library (Bergstra et al., 2010). It has been chosen mainly because of its good performance and our previous experience with this tool. For evaluation of the multi-label document classification results, we use the standard recall, precision and F-measure metrics.
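To illustrate the CNN2 topology of Section 3.2 and the setting of the decision threshold on the development corpus, here is a hedged Keras-style sketch. The parallel kernel widths and filter counts follow Kim (2014) (windows of 3, 4 and 5 words, 100 feature maps each); whether micro-F1 is the criterion optimized when choosing the threshold is an assumption, as the paper only states that the thresholds are set on development data.

```python
# Illustrative Keras sketch of the CNN2 topology (Section 3.2) plus tuning
# of the sigmoid output threshold on a development set. Kernel sizes follow
# Kim (2014); all other values are placeholders, not the authors' code.
import numpy as np
from keras.models import Model
from keras.layers import (Input, Embedding, Reshape, Conv2D, MaxPooling2D,
                          Flatten, Dense, concatenate)

vocab_size, embedding_dim, max_len = 20000, 300, 400  # |V|, E, M
kernel_lens = [3, 4, 5]    # k for each parallel branch (Kim, 2014)
n_filters = 100            # kernels per branch (Kim, 2014)
d1, num_classes = 256, 37  # FC layer size and |C| (placeholders)

inputs = Input(shape=(max_len,))
emb = Embedding(vocab_size, embedding_dim)(inputs)
emb = Reshape((max_len, embedding_dim, 1))(emb)

branches = []
for k in kernel_lens:
    # Kernels of size k x E span the full embedding width.
    conv = Conv2D(n_filters, (k, embedding_dim), activation='relu')(emb)
    # Max-over-time pooling: one value per kernel.
    pool = MaxPooling2D(pool_size=(max_len - k + 1, 1))(conv)
    branches.append(Flatten()(pool))

hidden = Dense(d1, activation='relu')(concatenate(branches))
outputs = Dense(num_classes, activation='sigmoid')(hidden)  # not softmax
model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')

def tune_threshold(model, x_dev, y_dev):
    """Pick the sigmoid threshold maximizing micro-F1 on dev data.

    y_dev: binary (n_docs, |C|) int array of gold label assignments.
    The micro-F1 criterion is an assumption for illustration.
    """
    scores = model.predict(x_dev)
    best_t, best_f1 = 0.5, 0.0
    for t in np.arange(0.05, 1.0, 0.05):
        pred = (scores > t).astype(int)
        tp = float((pred * y_dev).sum())
        prec = tp / max(pred.sum(), 1)
        rec = tp / max(y_dev.sum(), 1)
        f1 = 2 * prec * rec / max(prec + rec, 1e-9)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

The same thresholding step applies to CNN1; only the network in front of the sigmoid output layer differs between the two topologies.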