Is Language Modeling Enough? Evaluating Effective Embedding Combinations

Rudolf Schneider∗, Tom Oberhauser∗, Paul Grundmann∗, Felix A. Gers∗, Alexander Löser∗, Steffen Staab†

∗Beuth University of Applied Sciences Berlin, Luxemburger Str. 10, 13353 Berlin
{ruschneider, toberhauser, pgrundmann, gers, [email protected]
†University of Stuttgart, Universitaetsstrasse 32, 70569 Stuttgart, Germany
†University of Southampton, University Road, Southampton, SO17 1BJ, United Kingdom
[email protected]

Abstract

Universal embeddings, such as BERT or ELMo, are useful for a broad set of natural language processing tasks like text classification or sentiment analysis. Moreover, specialized embeddings also exist for tasks like topic modeling or named entity disambiguation. We study if we can complement these universal embeddings with specialized embeddings. We conduct an in-depth evaluation of nine well-known natural language understanding tasks with SentEval. Also, we extend SentEval with two additional tasks from the medical domain. We present PubMedSection, a novel topic classification dataset focussed on the biomedical domain. Our comprehensive analysis covers 11 tasks and combinations of six embeddings. We report that combined embeddings outperform state-of-the-art universal embeddings without any embedding fine-tuning. We observe that adding topic-model-based embeddings helps for most tasks and that differing pre-training tasks encode complementary features. Moreover, we present new state-of-the-art results on the MPQA and SUBJ tasks in SentEval.

Keywords: representation learning, meta embedding, evaluation, neural language representation models

1. Introduction

Universal embeddings, such as BERT (Devlin et al., 2019) or ELMo (Peters et al., 2018), are an effective text representation (Nguyen et al., 2016; Conneau and Kiela, 2018). Often, they are trained on hundreds of millions of documents with a language modeling objective and contain millions to even billions of parameters. These pre-trained vectors lead to significant increases in performance on various downstream natural language processing tasks (Mikolov et al., 2013a; Joulin et al., 2017; Akbik et al., 2018; Peters et al., 2018; Radford et al., 2018). Contrary to universal embeddings, specialized embeddings exist for tasks like entity linking (Pappu et al., 2017; Gillick et al., 2019) or paragraph classification (Arnold et al., 2019). Often, specialized embeddings are trained with objectives and training datasets that differ from those of universal embeddings. This circumstance raises the question whether universal embeddings capture all useful features for downstream tasks or whether specialized embeddings may provide complementary features.

Example: Clinical Decision Support Systems. Medical literature databases, such as PubMed1 or UpToDate2, help doctors answer their questions. These systems benefit from methods that enrich texts with semantic concepts, like entity recognition, sentence classification, topic classification, or relation extraction (Demner-Fushman et al., 2009; Berner, 2007). Medical language is highly specialized and often ambiguous in clinical documents (Leaman et al., 2015). Documents, such as medical research papers, doctors' letters or clinical notes, are heterogeneous in terms of structure, vocabulary, or grammatical correctness (Starlinger et al., 2017). We propose to complement universal embeddings with specialized embeddings to execute common downstream tasks for clinical decision support systems (CDSS). Examples are paragraph classification, subjectivity classification, question type classification, sentiment analysis and textual similarity.

Problem definition. We hypothesize that specialized neural text representations may complement universal embeddings. We are given a set of both universal and specialized embeddings with different pre-training tasks for the English language (see Table 1). These embeddings encode words, entities, or topics. We study which combinations of embeddings are complementary using the SentEval3 (Conneau and Kiela, 2018) benchmark. Thus, we investigate whether universal embeddings capture the same features as specialized embeddings.

Probing embeddings with SentEval. We probe single embeddings and combinations with SentEval in a transfer-learning setting on nine different tasks. SentEval focuses on news and customer reviews. The language in these domains differs vastly from the medical domain. Moreover, SentEval concentrates on single-sentence evaluation, which does not fully utilize the capabilities of contextualized embedding models (Peters et al., 2018; Devlin et al., 2019; Arnold et al., 2019).
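To make this transfer setup concrete, the following is a minimal sketch of probing a combination of embeddings: sentence vectors from two frozen encoders are concatenated and a simple logistic-regression classifier is trained on top, in the spirit of SentEval's transfer setting. The embed_universal and embed_specialized functions are placeholders introduced only for illustration; this is not the exact pipeline used for the experiments reported below.

# Minimal sketch: probing a concatenation of two frozen sentence embeddings
# with a linear classifier, similar in spirit to SentEval's transfer setup.
# The embed_* functions are placeholders; in practice they would call, e.g.,
# a BERT/ELMo encoder and a SECTOR-style topic encoder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def embed_universal(sentences):
    # Placeholder for a universal embedding (e.g. averaged BERT/ELMo vectors).
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 768))

def embed_specialized(sentences):
    # Placeholder for a specialized embedding (e.g. SECTOR topic vectors).
    rng = np.random.default_rng(1)
    return rng.normal(size=(len(sentences), 128))

sentences = ["The patient reports chest pain.", "Aspirin was administered."] * 50
labels = np.array([0, 1] * 50)  # toy binary classification task

# Combine embeddings by simple concatenation along the feature axis.
features = np.concatenate(
    [embed_universal(sentences), embed_specialized(sentences)], axis=1
)

# SentEval-style probe: a logistic-regression classifier on top of the
# frozen, combined features, evaluated with cross-validation.
probe = LogisticRegression(max_iter=1000)
print(cross_val_score(probe, features, labels, cv=5).mean())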
Novel datasets. We tackle the shortcomings of SentEval by integrating the WikiSection-Diseases4 (Arnold et al., 2019) dataset into the SentEval framework. WikiSection also enables an in-depth evaluation of contextualized embeddings since its paragraph classification task is multi-sentence based. As the language in CDSS resources (e.g., PubMed) differs from the Wikipedia-based WikiSection dataset, we propose the PubMedSection5 dataset. PubMedSection is a novel medical topic classification dataset created with a method inspired by distant supervision.

In-depth experimental evaluations on 11 tasks. We study properties of single and combined text embeddings and their performance on the nine tasks from SentEval and on the two medical datasets, WikiSection and PubMedSection. Our focus is on examining the differences between universal and specialized embeddings and effective embedding combinations.

The remainder of this paper is structured as follows: Section 2 reviews embeddings and work on integrating embeddings. Section 3 introduces our novel datasets, while Section 4 describes our setup. In Section 5 we show and discuss quantitative results from our comprehensive analysis. We conclude in Section 6 and propose future research directions.

1 https://www.ncbi.nlm.nih.gov/pubmed/
2 https://www.uptodate.com/
3 https://github.com/facebookresearch/SentEval
4 https://github.com/sebastianarnold/WikiSection

Name                      Pre-Training Task      Domain     Publication             Class
ELMo (EL)                 Language Modeling      Web        (Peters et al., 2018)   Universal
BERT (BE)                 Language Modeling      Web        (Devlin et al., 2019)   Universal
FastText (FT)             Language Modeling      Web        (Mikolov et al., 2018)  Universal
Pappu (PA)                Entity Linking         Wikipedia  (Pappu et al., 2017)    Specialized
SECTOR (Wikipedia) (SW)   Neural Topic Modeling  Wikipedia  (Arnold et al., 2019)   Specialized
SECTOR (PubMed) (SP)      Neural Topic Modeling  Medical    -                       Specialized

Table 1: Comparison of neural text embeddings.

2. Related Work

In the following, we investigate the universal and specialized embeddings shown in Table 1 and discuss methods for combining embeddings.

2.1. Universal Text Embeddings

Recently, researchers have explored universal text embeddings trained on extensive Web corpora, such as the Common Crawl6 (Mikolov et al., 2018; Radford et al., 2019), the billion word benchmark (Chelba, 2010; Peters et al., 2018) and Wikipedia (Bojanowski et al., 2017). Universal text embeddings are often trained on language modeling tasks in which the model is asked to predict a missing word given a small window of neighboring words (Mikolov et al., 2013b; Joulin et al., 2017; Mikolov et al., 2018; Pennington et al., 2014). Another common task is to predict the next, or a masked, word of a sentence given previously predicted words as context (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019). For the encoder-decoder architecture, Kiros et al. (2015) propose an encoder network that is trained to predict the surrounding sentences of a given sentence. FastText (Mikolov et al., 2018) additionally exploits subword information and thus the fact that similarly spelled words often also have a similar meaning.
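As a concrete illustration of the masked-word pre-training objective mentioned above, the minimal example below queries a pretrained masked language model for likely fillers of a blanked-out token. It assumes the Hugging Face transformers library and access to the bert-base-uncased checkpoint; the snippet only illustrates the pre-training task and is independent of the experiments reported in this paper.

# Illustration of the masked-word objective: a pretrained masked language
# model predicts a word from its sentence context.
# Assumes the Hugging Face `transformers` library and access to the
# `bert-base-uncased` checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The doctor prescribed [MASK] for the infection."):
    print(prediction["token_str"], round(prediction["score"], 3))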
Universal text embeddings encode the meaning of frequent words (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019). However, they perform worse in comparison with domain-adapted representations in specialized domains (Sheikhshabbafghi et al., 2018; Lee et al., 2019). Furthermore, universal text embeddings might miss essential aspects about named entities. The reason is that most training methods are based on the co-occurrence of words in relatively short local contexts. This hinders the models from capturing more global features of texts, such as genre, topic, receiver, or the authors' intention, and they may fail to learn the precise meaning of a word in special domains such as medicine (Sheikhshabbafghi et al., 2018; Lee et al., 2019). Also, computing embedding models for highly regulated domains is often hard or not feasible due to the lack of training data (Berner, 2007; Starlinger et al., 2017) or high computational costs.

2.2. Specialized Text Embeddings

Neural topic modeling. Arnold et al. (2019) introduce a specialized embedding using a coherent topic modeling task for pre-training. This model encodes both structural and topical facets of documents (see the work of MacAvaney et al. (2018)) and assigns each sentence in a document a dense distributed representation of occurring latent topics (Blei, 2012). For this purpose, the model consolidates the topical structure and context over the entire document. It leverages sequence information on the granularity of paragraphs and sentences using a bidirectional LSTM architecture (Graves, 2012) with forget gates (Gers et al., 2000). In addition, this model captures long-range topical information. However, it does not focus on disambiguating single words. Therefore, we suggest complementing universal text embeddings (disambiguation task) with neural topic models (paragraph classification task).
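The following schematic PyTorch sketch shows the kind of architecture described above: a bidirectional LSTM reads a document as a sequence of sentence vectors and emits a latent topic distribution for every sentence. All dimensions and the random input are illustrative assumptions; this is not the original SECTOR implementation.

# Schematic sketch of a SECTOR-like topic encoder: a bidirectional LSTM
# contextualizes sentence vectors over the whole document and emits a
# latent topic distribution per sentence. Dimensions are illustrative only.
import torch
import torch.nn as nn

class SentenceTopicEncoder(nn.Module):
    def __init__(self, sent_dim=300, hidden_dim=256, num_topics=30):
        super().__init__()
        self.lstm = nn.LSTM(sent_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.topic_head = nn.Linear(2 * hidden_dim, num_topics)

    def forward(self, sentence_vectors):
        # sentence_vectors: (batch, num_sentences, sent_dim)
        states, _ = self.lstm(sentence_vectors)     # contextualize over the document
        return self.topic_head(states).softmax(-1)  # (batch, num_sentences, num_topics)

document = torch.randn(1, 12, 300)                  # one document with 12 sentence vectors
topics = SentenceTopicEncoder()(document)
print(topics.shape)                                  # torch.Size([1, 12, 30])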

Neural entity embeddings.
