Utilizing Pre-Trained Word Embeddings to Learn Classification Lexicons with Little Supervision
Frederick Blumenthal and Ferdinand Graf
d-fine GmbH
[email protected], [email protected]

Abstract

Much of the decision making in financial institutions, particularly regarding investments and risk management, is data-driven. An important task for effectively gaining insights from unstructured text documents is text classification, and in particular sentiment analysis. Sentiment lexicons, i.e. lists of words with corresponding sentiment orientations, are a very valuable resource for building strong baseline models for sentiment analysis that are easy to interpret and computationally efficient. We present a novel method to learn classification lexicons from a labeled text corpus that incorporates word similarities in the form of pre-trained word embeddings. We show on two sentiment analysis tasks that utilizing pre-trained word embeddings improves the accuracy over the baseline method. The accuracy improvement is particularly large when labeled data is scarce, which is often the case in the financial domain. Moreover, the new method can be used to generate sensible sentiment scores for words outside the labeled training corpus.

1 Introduction

A vast amount of information in business, and especially in finance, is only available in the form of unstructured text documents. Automatic text analysis algorithms are increasingly being used to gain insights from this type of data effectively and efficiently. A particularly important text analytics task is document classification, i.e. the task of assigning a document to one of a set of pre-defined categories. For example, annual reports, news articles and social media services like Twitter provide textual information that can be used in conjunction with structured data to quantify the creditworthiness of a debtor. As another example, intelligent process automation may require the categorization of documents to determine the process flow. In both cases, sound text classification algorithms help save costs and effort.

To tackle the problem of document classification, classical methods combine hand-engineered features, e.g. word-count based features, n-grams, part-of-speech tags or negation features, with a non-linear classification algorithm such as a Support Vector Machine (Joachims, 1998). Detailed surveys of classical sentiment analysis models, a special case of text classification, have been compiled by Pang et al. (2008) and Liu (2012).

Since the advent of deep learning, various neural network architectures such as convolutional neural networks (CNNs) (Kim, 2014; dos Santos and Gatti, 2014), character-level CNNs (Zhang et al., 2015), recursive neural networks (Socher et al., 2013), recurrent neural networks (RNNs) (Wang et al., 2015; Liu et al., 2016) and transformers (Vaswani et al., 2017) have been applied to text classification, yielding state-of-the-art results.

Recently, a steep performance increase has been achieved by very large pre-trained neural language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2018), XLNet (Yang et al., 2019) and others (Howard and Ruder, 2018; Radford et al., 2018; Akbik et al., 2018). These models generate powerful text representations that can either be used as context-aware word embeddings, or the models can be fine-tuned directly for specific tasks.

One disadvantage of these pre-trained language models, however, is their high demand for memory and computing power, e.g. a sufficiently large GPU just to load the models. In finance, many documents that can be the subject of text classification applications (e.g. annual reports or legislative documents) are very long, so the computational cost becomes very relevant. Another disadvantage is that, because of their complexity, many state-of-the-art deep learning models are hard to interpret, and it is very difficult to retrace their predictions. Model interpretability, however, appears to be particularly important for many financial institutions: interpretable models with transparent features are often favored over more complex models, even if the complex models are more accurate.

A powerful resource for building interpretable text classification models are classification lexicons, and in particular sentiment lexicons. A sentiment lexicon is a list of words (or n-grams) where each word is assigned a sentiment orientation. The sentiment orientation can be binary, i.e. each word in the lexicon is labeled as positive or negative, or continuous, where each word is assigned a continuous sentiment score (e.g. in the interval [-1, 1]). More generally, a classification lexicon is a list of words where each word is assigned a vector with one score for each class.

Sentiment lexicons have been an integral part of many classical sentiment analysis classifiers (Mohammad et al., 2013; Vo and Zhang, 2015). Approaches based on sentiment lexicons appear to be particularly popular in the finance domain (Kearney and Liu, 2014). In addition, it has been shown that even modern neural network models can profit from incorporating sentiment lexicon features (Teng et al., 2016; Qian et al., 2016; Shin et al., 2016). Using classification lexicon features can be thought of as a way of injecting external information that has been learned from different data sets or compiled by experts.
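To make this concrete, the following sketch shows how a continuous sentiment lexicon can be represented and turned into simple sentence-level features for a downstream classifier. It is illustrative only: the toy lexicon entries and the three summary features are our own assumptions, not taken from the cited works.

```python
from typing import Dict, List

# Hypothetical toy sentiment lexicon: word -> score in [-1, 1].
# A general classification lexicon would instead map each word
# to a vector with one score per class.
TOY_LEXICON: Dict[str, float] = {
    "profit": 0.75, "growth": 0.5, "loss": -0.75, "litigation": -0.5,
}

def lexicon_features(tokens: List[str], lexicon: Dict[str, float]) -> List[float]:
    """Classic lexicon features: positive count, negative count, total score."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    n_pos = sum(1 for s in scores if s > 0)
    n_neg = sum(1 for s in scores if s < 0)
    return [float(n_pos), float(n_neg), sum(scores)]

print(lexicon_features("strong growth despite a small loss".split(), TOY_LEXICON))
# -> [1.0, 1.0, -0.25]
```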
Three approaches to sentiment lexicon generation are usually distinguished in the literature, namely the manual approach, the dictionary-based approach and the corpus-based approach; see for example (Liu, 2012, Chapter 6). A popular finance-specific lexicon has been compiled by Loughran and McDonald (2011) from 10-K filings, but see also the General Inquirer (Stone et al., 1962) and the Subjectivity Lexicon (Wilson et al., 2005).

Fairly recently, models have been designed to generate sentiment lexicons from a labeled text corpus. In many cases, distant supervision approaches are employed to generate large amounts of labeled data. For example, Mohammad and Turney (2013) compiled a large Twitter corpus where noisy labels are inferred from emoticons and hashtags. Count-based methods such as pointwise mutual information (PMI) generate sentiment scores for words based on their frequency in positive and negative training sentences (Mohammad and Turney, 2013; Kiritchenko et al., 2014).
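As an illustration of the count-based approach, the sketch below is our own minimal reading of such PMI scores: each word receives the score PMI(w, positive) - PMI(w, negative), which algebraically reduces to the log-ratio of the word's relative frequency in positive versus negative sentences. The function name and the additive smoothing constant are assumptions, not details from the cited papers.

```python
import math
from collections import Counter
from typing import Dict, List

def pmi_sentiment_scores(
    sentences: List[List[str]], labels: List[int], alpha: float = 1.0
) -> Dict[str, float]:
    """score(w) = PMI(w, positive) - PMI(w, negative).

    The p(w) and p(class) terms cancel in the difference, leaving the
    log-ratio of the word's relative class frequencies; alpha is additive
    smoothing for words unseen in one of the two classes.
    """
    pos, neg = Counter(), Counter()
    for tokens, label in zip(sentences, labels):
        (pos if label == 1 else neg).update(tokens)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    vocab = set(pos) | set(neg)
    return {
        w: math.log(((pos[w] + alpha) / (n_pos + alpha * len(vocab)))
                    / ((neg[w] + alpha) / (n_neg + alpha * len(vocab))))
        for w in vocab
    }

scores = pmi_sentiment_scores(
    [["great", "profit"], ["huge", "loss"], ["great", "growth"]], [1, 0, 1]
)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```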
information that has been learned from different • The generation of robust classification lex- data sets or compiled by experts. icons requires large amounts of supervised Three approaches to sentiment lexicon generation training data. Manual labeling of data is very are usually distinguished in the literature, namely expensive and a distant (or weak) labeling ap- the manual approach, the dictionary-based ap- proach may not be possible for all applica- proach and the corpus-based approach, see for tions. example (Liu, 2012, Chapter 6). A popular • Using small or medium size supervised train- finance specific lexicon has been compiled by ing data, one may encounter many words at Loughran and McDonald(2011) from 10-K fill- prediction time that are not part of the train- ings , but see also the General Inquirer (Stone ing corpus. et al., 1962) and the Subjectivity Lexicon (Wilson et al., 2005). 1The objective of sentiment-specific word embeddings, first proposed by Maas et al.(2011), is to map words (or Fairly recently, models have been designed to gen- phrases) close to each other if they are both semantically sim- ilar and have similar sentiment. A sentiment lexicon could be erate sentiment lexicons from a labeled text cor- considered as one-dimensional or two-dimensional word em- pus. In many cases distant supervision approaches beddings. To tackle these problems, we propose a novel su- For supervised learning of the classification lexi- pervised method to generate classification lexi- con, a data set with labeled text sentences is used, N cons by utilizing unsupervised data in the form i.e. a data set D = f(tn; yn)gn=1 that consists of of pre-trained word embeddings. This approach sentences (or other pieces of text) tn with corre- allows to build classification lexicons with very sponding class label yn 2 f1;:::;Cg. In this set- small amounts of supervised data. In particular, ting, the overall idea is to design a classification it allows extending the classification lexicon to model that consists of an elementwise mapping s words outside the training corpus, namely to all from word token to word-level class scores and a words in the vocabulary of the pre-trained word function f that aggregates the word class scores to embedding. sentence-level class probabilities, The remainder of this paper is structured as fol- p(t) = f s(x1); s(x2);:::; s(xjtj) ; (2) lows. Section2 gives a short introduction to supervised learning of classification lexicons in with p 2 [0; 1]C and jtj denotes the number of general and then introduces the novel model ex- words in sentence t. The objective is to learn the tension to utilize pre-trained word embeddings. functions s and f such that the model as accu- We show empirically in Section3 that the use rately as possible predicts the sentence class labels of pre-trained word embeddings improves predic- of the training data.