A study of semantic augmentation of embeddings for extractive summarization

Nikiforos Pittaras
DIT, NKUA / IIT, NCSR-D
[email protected]

Vangelis Karkaletsis
IIT, NCSR-D
[email protected]

Abstract

In this study we examine the effect of semantic augmentation approaches on extractive text summarization. WordNet hypernym relations are used to extract term-frequency concept information, subsequently concatenated to sentence-level representations produced by aggregated deep neural word embeddings. Multiple dimensionality reduction techniques and combination strategies are examined via feature transformation and clustering methods. An experimental evaluation on the MultiLing 2015 MSS dataset illustrates that semantic information can introduce benefits to the extractive summarization process in terms of F1, ROUGE-1 and ROUGE-2 scores, with LSA-based post-processing introducing the largest improvements.

1 Introduction

In recent years, the abundance of textual information resulting from the proliferation of the Internet, online journalism and personal blogging platforms has led to the need for automatic summarization tools. These solutions can aid users to navigate the saturated information marketplace efficiently via the production of digestible summaries that retain the core content of the original text (Yogan et al., 2016). At the same time, advancements introduced by deep learning techniques have provided efficient representation methods for text, mainly via the development of dense, low-dimensional vector representations for words and sentences (LeCun et al., 2015). Additionally, semantic information sources have been compiled by humans in a structured manner and are available for use towards aiding a variety of natural language processing applications. As a result, semantic augmentation approaches can introduce existing knowledge to the neural pipeline, circumventing the need for the neural model to learn all useful information from scratch.

In this study, we examine the effect of semantic augmentation and post-processing techniques on extractive summarization performance. Specifically, we modify the input features of a deep neural classification model by injecting semantic features, while employing feature transformation post-processing methods towards dimensionality reduction and discrimination optimization. We aim to address the following research questions.

• Can the introduction of semantic information in the network input improve extractive summarization performance?

• Does the semantic augmentation process benefit from dimensionality reduction post-processing methods?

The rest of the paper is structured as follows. In section 2 we cover existing work relevant to this study. This is followed by a description of our approach (section 3). In section 4 we outline our experimental methodology and discuss results and findings. Finally, we present our conclusions in section 5.

2 Related work

2.1 Text representations

Extensive research has investigated methods of representing text for Natural Language Processing and Machine Learning tasks.

Vector Space Model (VSM) approaches project the input to an n-dimensional vector representation, exploiting properties of vector spaces and linear algebra techniques for cross-document operations. Approaches like the Bag-of-Words (Salton et al., 1975) have become popular baselines, mapping input terms (e.g. words) to their occurrence frequencies in the text. Modifications to the model include refinements in the term weighting strategy such as DF and TF-IDF normalizations (Yang, 1997; Salton and Buckley, 1988), term preprocessing such as stemming and lemmatization (Jivani et al., 2011), and others. Further, sentence and phrase-level terms are examined (Scott and Matwin, 1999), along with n-gram approaches, which consider n-tuple occurrences of terms instead (Brown et al., 1992; Katz, 2003; Post and Bergsma, 2013).

Other approaches encode term co-occurrence information via representation learning, relying on the distributional hypothesis (Harris, 1954) to capture semantic content. At the same time, the need to circumvent the curse of dimensionality (Hastie et al., 2005) of term-weight feature vectors has led to the production of dense, rather than sparse, representations. Early such examples used analytic matrix decompositions on co-occurrence statistics (Jolliffe, 2011; Deerwester et al., 1990; Horn and Johnson, 2012), while more recently, vector embeddings are iteratively learned by analyzing large text corpora using local word context in a sliding window fashion (Mikolov et al., 2013a,b), or using pre-computed pairwise word co-occurrences (Pennington et al., 2014). More refined methods break down words to subword units (Bojanowski et al., 2017), where learning representations for the latter enables some success in handling out-of-vocabulary words.

2.2 Extractive summarization

In summarization, contrary to the abstractive approach where output summaries are generated from scratch (Yogan et al., 2016), the extractive method relies on sentence salience detection to retain a minimal subset of the most informative sentences in the original text (Gupta and Lehal, 2010). VSM approaches have been widely utilized in sentence modelling for this task, with a variety of methods for determining term weights based on word frequency, probability, mutual information or tf-idf features, sentence similarity, as well as a variety of feature combination methods (Mori, 2002; McCargar, 2004; Nenkova and Vanderwende, 2005; Galley et al., 2006; Lloret and Palomar, 2009). Other popular handcrafted features used are syntactic / grammar information such as part-of-speech tags, as well as sentence-wise features such as sentence position and length. Finally, similarity scores to the title, centroid clusters and predefined keywords can be used to score / rank sentences towards salience identification and extraction (Neto et al., 2002; Yogan et al., 2016).

Other works adopt a topic-based approach, using topic modelling techniques towards sentence salience detection. For example, the work in (Aries et al., 2015) builds topics via a clustering process, using a word and sentence-level vector space model and a similarity measure. Clustering techniques have been applied to this end, for sentence grouping and subsequent salience identification (Radev et al., 2000).

Graph methods have also been exploited; in (Lawrie et al., 2001), the authors adopt a graph-based probabilistic language model towards building a topic hierarchy for predicting representative vocabulary terms. The MUSE system (Litvak and Last, 2013) combines graph modelling with genetic algorithms towards sentence modelling and subsequent ranking, while the work in (Mihalcea and Tarau, 2004) builds sentence graphs using a variety of feature bags and similarity measures and proceeds to extract central sentences via multiple iterations of the TextRank algorithm.

2.3 Semantic enrichment

Semantic information has been broadly exploited towards aiding NLP tasks, using resources such as WordNet (Miller, 1995), Freebase (Bollacker et al., 2008), FrameNet (Baker et al., 1998) and others. Such external knowledge bases have seen widespread use, ranging from early works on expansion of rule-based discrimination techniques (Scott and Matwin, 1998), to synonym-based feature extraction (Rodriguez et al., 2000) and large-scale feature generation from WordNet synset relationship edges for SVM classification (Mansuy and Hilderman, 2006).

In extractive summarization, semantic information has been used as a refinement step in the sentence salience detection pipeline. For example, in (Dang and Luo, 2008), the authors utilize WordNet synsets as a keyphrase ranking mechanism, based on candidate synset relevance to the text. Other approaches (Vicente et al., 2015) use semantic features from WordNet and named entity extraction, followed by a PCA-based post-processing step for dimensionality reduction.
WordNet is also utilized in (Li et al., 2017), where the authors use the resource for sentence similarity extraction, using synset similarity at the word level and treating the resulting scores as additional features for summarization and citation linkage.

Our approach bears some similarities with the work of (Vicente et al., 2015), extending the investigation to post-processing techniques additional to PCA, examining post-processing application strategies, and adopting deep neural word embeddings as the lexical representation, while grounding on a number of baselines. In the following section, we describe our approach in detail, including text representation, semantic feature extraction, training and evaluation.

3 Proposed Method

3.1 Problem definition

We formulate the task of automatic summarization as a classification problem. Given a document consisting of N sentences D = {s1, s2, . . . , sN} and a ground truth (extractive) summary of size k, G = {g1, g2, . . . , gk}, gi ∈ D, a classification-based extractive summarization system F selects salient sentences P = {p1, p2, . . . , pk} via F(D) = P, such that P is as close to G as possible. In this work, F(·) is a data-driven machine learning model, built by exploiting statistical features in the input text.

3.2 Text representation

We use a variety of approaches for representing the textual component of a sentence. First, we employ the Continuous Bag-of-Words (CBOW) variant of the popular word2vec model (Mikolov et al., 2013b), which builds vector representations of a word using a statistical language model that predicts the word based on its surrounding context. More formally, given a center word wc in a sentence and a set of 2k context words around it, wcontext = [wc−k, . . . , wc−1, wc+1, . . . , wc+k], CBOW tries to optimize the conditional probabilistic neural language model P(wc | wcontext). We train embeddings from scratch with this method, optimizing with the cross-entropy loss, ending up with a vector representation for each word in the dataset. We subsequently produce the final, sentence-level representation by averaging the vectors corresponding to the words in a sentence.

In addition to embedding training, we examine the performance of pre-trained fastText (Joulin et al., 2016) embeddings, produced by a model that captures subword information via character embeddings, enabling the handling of out-of-vocabulary words. Additionally, we employ direct sentence-level modelling alternatives via the doc2vec (Le and Mikolov, 2014) extension of word2vec, as well as a sentence-level TF-IDF baseline.
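As an illustration of the lexical channel above, a minimal sketch of CBOW training and sentence-level averaging could look as follows; the gensim 4.x parameter names and the toy corpus are assumptions, with the actual training settings listed in section 4.2.

    # A minimal sketch, assuming the gensim 4.x API: train CBOW word2vec embeddings
    # and average the word vectors of each sentence into a sentence representation.
    import numpy as np
    from gensim.models import Word2Vec

    tokenized_sentences = [
        ["an", "example", "source", "sentence"],
        ["another", "sentence", "from", "the", "corpus"],
    ]  # placeholder corpus; in practice, all tokenized sentences of the dataset

    # CBOW (sg=0), 50-dimensional vectors, 10-word window, as reported in section 4.2
    # (min_count is lowered so the toy corpus keeps its vocabulary; the paper uses 2).
    model = Word2Vec(sentences=tokenized_sentences, vector_size=50, window=10,
                     min_count=1, sg=0, epochs=50)

    def sentence_embedding(tokens, model):
        # average the embeddings of in-vocabulary words; zero vector if none are known
        vectors = [model.wv[w] for w in tokens if w in model.wv]
        if not vectors:
            return np.zeros(model.wv.vector_size)
        return np.mean(vectors, axis=0)

    sentence_vectors = np.stack([sentence_embedding(s, model) for s in tokenized_sentences])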
3.3 Semantic representation

In order to capture and utilize semantic information in the text, we use the WordNet semantic graph (Miller, 1995), a lexical database for English, often used as an external information source for machine learning research in classification, summarization, clustering and other tasks (Hung and Wermter, 2004; Elberrichi et al., 2008; Liu et al., 2007; Morin and Bengio, 2005; Bellare et al., 2004; Dang and Luo, 2008; Pal and Saha, 2014). In WordNet, semantic relations between concepts are captured in a semantic graph of synonymous sets (synsets), as well as multiple types of relations of lexical / semantic nature, such as hypernymy and hyponymy (is-a graph edges), meronymy (part-of relations), and others. We employ WordNet as an enrichment mechanism, extracting frequency-based features from corpus words. Specifically, we mine semantic concepts from each word in the text, arriving at a sparse high-dimensional bag-of-concepts for each document. This vector is concatenated to the lexical representation. To deal with the curse of dimensionality (Hastie et al., 2005) of this approach, we apply dimensionality reduction via PCA (Jolliffe, 2011), LSA (Deerwester et al., 1990) or K-means (Lloyd, 1982). We apply each transformation in two settings; first, the semantic information channel is reduced, then concatenated with the sentence embedding vector. Alternatively, we apply the reduction on the concatenated, enriched vector itself.
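A rough sketch of the bag-of-concepts extraction could use NLTK's WordNet interface as below; the choice of the first synset, the direct-hypernym depth and the helper names are assumptions, as the exact mining procedure is not fully specified.

    # A rough sketch of frequency-based concept extraction from WordNet, assuming
    # NLTK's wordnet corpus is available (nltk.download("wordnet")).
    from collections import Counter
    from nltk.corpus import wordnet as wn

    def concept_frequencies(tokens):
        # map each word to its first synset plus that synset's direct hypernyms,
        # and count concept occurrences over the token sequence
        counts = Counter()
        for token in tokens:
            synsets = wn.synsets(token)
            if not synsets:
                continue
            concepts = [synsets[0]] + synsets[0].hypernyms()
            counts.update(c.name() for c in concepts)
        return counts

    def bag_of_concepts(tokens, concept_index):
        # project the counts onto a fixed concept vocabulary collected over the corpus,
        # yielding the sparse, high-dimensional semantic vector described in section 3.3
        counts = concept_frequencies(tokens)
        vector = [0.0] * len(concept_index)
        for concept, freq in counts.items():
            if concept in concept_index:
                vector[concept_index[concept]] = float(freq)
        return vector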

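The two application strategies of the reduction step could be sketched with scikit-learn as follows, with TruncatedSVD standing in for LSA and K-means centroid distances used as the reduced representation; these library choices, and the way the reduced ct vector relates to the dimensions reported in Tables 2 and 3, are assumptions.

    # A sketch of the tc / ct fusion strategies with scikit-learn reducers.
    import numpy as np
    from sklearn.decomposition import PCA, TruncatedSVD
    from sklearn.cluster import KMeans

    def reduce_dim(features, method="lsa", dim=50):
        if method == "pca":
            return PCA(n_components=dim).fit_transform(features)
        if method == "lsa":
            return TruncatedSVD(n_components=dim).fit_transform(features)
        if method == "kmeans":
            # distances to the learned centroids serve as the dim-dimensional output
            return KMeans(n_clusters=dim, n_init=10).fit_transform(features)
        raise ValueError("unknown reduction method: %s" % method)

    def fuse(lexical, semantic, strategy="ct", method="lsa", dim=50):
        # tc: transform the semantic channel first, then concatenate to the embedding;
        # ct: concatenate the two channels first, then transform the enriched vector
        if strategy == "tc":
            return np.hstack([lexical, reduce_dim(semantic, method, dim)])
        return reduce_dim(np.hstack([lexical, semantic]), method, dim)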
4 Experiments

4.1 Datasets

We use the English version of the MultiLing 2015 single-document summarization dataset (Giannakopoulos et al., 2015; Conroy et al., 2015)1 for our experimental evaluation. The dataset is built from Wikipedia content, consisting of articles paired with a number of human-authored summaries. For each of 40 languages, 30 documents and summary sets are provided.

1 http://multiling.iit.demokritos.gr/pages/view/1516/multiling-2015

In this work, we focus on the English version of the dataset, due to our reliance on word embedding features, which are predominantly available for the English language. In addition, we apply two preprocessing steps. First, we reformat the ground truth towards an extractive summarization setting, since the provided summaries are written from scratch. Specifically, we annotate source sentences with an extractive summarization binary label l ∈ {0, 1} (i.e. 1 if the sentence is a member of the extractive summary and 0 otherwise). This is accomplished as follows: for each provided summary sentence pi, we rank source sentences s ∈ S with respect to their n-gram overlap with pi, after stopword filtering and excluding already positively-labelled sentences sj ∈ S : lj = 1, i ≠ j. The top-ranked source sentence is matched to the ground truth summary sentence, and is considered to be a member of the extractive summary. Secondly, in an effort to address the severe class imbalance that results from the previous step (i.e. class 0 being 13 to 14 times more populous than class 1), we oversample positively labelled sentences for each document, arriving at a 2:1 negative to positive ratio, at most.

After these steps, we end up with the final version of the dataset, which is described in detail in Table 1. Having a sentence-level label for summary membership, we can thus produce the final summary by concatenating the positively classified sentences. It should be noted that with this setting, evaluating candidate summaries against the ground truth summaries provided with the dataset implies a minimum performance penalty. This is reported in the results in the succeeding section 4.3.

feature                        train    test
document sentences             233      184.9
document summary sentences     77.9     13.5
document words                 25.5     22.8
samples                        6990     5546

Table 1: Details of the MultiLing 2015 single-document summarization dataset. All values are averages across documents, except for the number of samples.
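The labelling and oversampling steps above could be sketched as follows; the n-gram order, the stopword list and the helper names are illustrative assumptions.

    # A simplified sketch of extractive ground-truth labelling and oversampling.
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # placeholder list

    def ngram_overlap(a_tokens, b_tokens, n=1):
        # count shared n-grams between two token lists after stopword filtering
        def grams(tokens):
            filtered = [t.lower() for t in tokens if t.lower() not in STOPWORDS]
            return Counter(tuple(filtered[i:i + n]) for i in range(len(filtered) - n + 1))
        return sum((grams(a_tokens) & grams(b_tokens)).values())

    def label_sentences(source_sentences, summary_sentences):
        # for each summary sentence, mark the best-overlapping, not-yet-selected
        # source sentence as a member of the extractive summary (label 1)
        labels = [0] * len(source_sentences)
        for summary in summary_sentences:
            candidates = [i for i, lab in enumerate(labels) if lab == 0]
            if not candidates:
                break
            best = max(candidates, key=lambda i: ngram_overlap(source_sentences[i], summary))
            labels[best] = 1
        return labels

    def oversample(samples, labels, max_ratio=2):
        # duplicate positive samples until negatives are at most max_ratio times positives
        positives = [s for s, l in zip(samples, labels) if l == 1]
        negatives = [s for s, l in zip(samples, labels) if l == 0]
        if not positives:
            return samples, labels
        augmented, i = list(positives), 0
        while len(negatives) > max_ratio * len(augmented):
            augmented.append(positives[i % len(positives)])
            i += 1
        return negatives + augmented, [0] * len(negatives) + [1] * len(augmented)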
4.2 Setup

We train embeddings with the word2vec CBOW variant using gensim (Rehurek and Sojka, 2011). We run the algorithm for 50 epochs, on a 10-word window, maintaining a minimum word frequency threshold of 2 occurrences in the training text. We produce 50-dimensional embeddings via this process. In addition, we use the publicly available2, 300-dimensional pre-trained fastText embeddings for the corresponding configuration.

2 https://fasttext.cc/

To set up the deep neural classifier, we run a grid search on the number of layers (ranging from 1 to 5) and layer size (ranging from 64 to 2048) for a multilayer perceptron architecture, using a 5-fold validation scheme. This process illustrated a 5-layer architecture of 512-neuron layers as the best performing, which is the one we adopt for all subsequent experiments. This architecture is trained for 80 epochs, reducing the learning rate via an adaptive learning rate reduction policy and maintaining an early-stopping protocol of 25 epochs.

Using this learning framework, we test each candidate configuration using a 5-fold validation scheme, reporting mean measure values as the overall result. For all measures, the cross-fold variance stayed below 10e−4 and is omitted. The Keras3 machine learning library is used for building and training the neural models.

3 https://keras.io/
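A minimal sketch of this classifier setup in Keras is given below; the activation functions, optimizer and the exact ReduceLROnPlateau settings are assumptions, since they are not reported here.

    # A minimal sketch of the 5-layer, 512-unit MLP sentence classifier.
    from tensorflow import keras
    from tensorflow.keras import layers

    def build_mlp(input_dim, hidden_layers=5, hidden_size=512):
        model = keras.Sequential()
        model.add(keras.Input(shape=(input_dim,)))
        for _ in range(hidden_layers):
            model.add(layers.Dense(hidden_size, activation="relu"))
        model.add(layers.Dense(1, activation="sigmoid"))  # binary summary-membership output
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

    callbacks = [
        keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=25),
    ]

    # example usage, e.g. for the 550-dimensional +lsa-500 word2vec configuration:
    # model = build_mlp(input_dim=550)
    # model.fit(x_train, y_train, validation_data=(x_val, y_val),
    #           epochs=80, callbacks=callbacks)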

4.3 Results and discussion

Tables 2 and 3 present experimental results for the evaluation of semantic augmentation on word2vec and fastText embeddings, respectively. Each configuration is evaluated in terms of micro and macro F1 score (mi-F and ma-F columns, respectively), with respect to classification performance on the oversampled dataset (as detailed in 4.1). In addition, we measure Rouge-1 and Rouge-2 scores of the final composed summary (stitched together from positively classified input sentences) with respect to the hand-written ground truth summary provided in the dataset. Since the difference between the latter two guarantees a minimum error (see 4.1), we report the best possible performance in the gt configuration, depicting performance for each evaluation measure when sentence classification is perfect. In addition, via the prob configuration we report a probabilistic baseline classifier, which decides based on the label distribution in the training data. Moreover, token frequency-based baselines – namely bag-of-words (BOW) and TF-IDF (Salton and Buckley, 1988) – are reported in the BOW and TF-IDF rows. Lexical-only and semantically-augmented baseline runs are reported as x and x-sem respectively, where x ∈ [w2v, fastText]. Finally, the effect of each post-processing method on the semantically augmented baseline is illustrated, where a configuration of +conf-N denotes a vector post-processing method conf that produces N-dimensional vectors. The resulting vector dimension that is fed to the classifier is reported in the column dim, and the different semantic augmentation post-processing methods are denoted by tc (i.e. first transform the semantic channel, then concatenate to the embedding) and ct (i.e. concatenate the semantic vector to the lexical embedding, then apply the transformation).
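The summary composition and Rouge scoring could be implemented as in the sketch below; using the rouge-score package and reporting the F-measure are assumptions, as the ROUGE implementation and variant are not specified.

    # A sketch of scoring a composed summary against the human-written reference.
    from rouge_score import rouge_scorer

    def evaluate_summary(source_sentences, predicted_labels, reference_summary):
        # stitch positively classified sentences together, then compute Rouge-1 / Rouge-2
        candidate = " ".join(s for s, label in zip(source_sentences, predicted_labels)
                             if label == 1)
        scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
        scores = scorer.score(reference_summary, candidate)
        return scores["rouge1"].fmeasure, scores["rouge2"].fmeasure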
Regarding word2vec trained embeddings (Table 2), we can see that introducing semantic information improves macro F1, Rouge-1 and Rouge-2 performance. Compared with the bag-based baselines, we observe the word2vec CBOW embeddings yielding worse micro F1 performance than both bag approaches, but considerably better Rouge scores. In addition, the semantically enriched w2v configuration outperforms the bag approaches in macro-F1 score and the examined Rouge measures.

In general, we observe that micro-F1 scores appear to be less reliable measures in this setting, given the considerably large class imbalance of the dataset. This is apparent in the baseline w2v and w2v-sem runs, however the effect is most pronounced in K-means configurations for dimensions greater than 50, where the best micro-F1 score is encountered, but the performance of all other metrics is degenerate. This is understandable, since in cases where the classifier completely relies on the majority class (0, or "non-summary sentences" in our case), it can converge to a state characterized by a total lack of positively classified sentences. This in turn produces zero Rouge scores and sub-chance macro-averaged F1 scores, which is the case observed for these configurations. The best-performing configuration turns out to be LSA with 500-dimensional vectors, with regard to Rouge-1 and Rouge-2 scores, with the 100-dimensional PCA configuration performing best in terms of macro F1.

Regarding the comparison between the two post-processing strategies, we can observe that tc appears to work slightly better when measuring micro-F1 scores, but in terms of macro-F1 and Rouge scores, concatenating prior to post-processing works considerably better. This is not surprising, as the transformation of the bimodal vector into a common, shared space can be expected to be a far more efficient fusion of the lexical and semantic channels, compared to simple concatenation.

Regarding fastText-based runs, a similar baseline performance is observed. Bag-based baselines achieve the best micro-F1 score, but inferior results in all other measures. Similarly to word2vec, the lexical-only fastText run achieves better F1 scores, however the semantically enriched embedding fares far better in terms of Rouge-1 and Rouge-2 performance. Likewise, similar behavior is observed with regard to post-processing and concatenation order and the usefulness of the micro-F1 score compared to the other measures. Notably, the 50-dimensional LSA performs well with the tc strategy, while an analogous degenerate behaviour is evident with the K-means configurations. As in the word2vec run, the 500-dimensional LSA produces the best macro-F1 and Rouge scores.

Comparing the word2vec and fastText-based runs, we can observe that the word2vec configurations (trained on the target dataset from scratch) achieve better Rouge-1 and Rouge-2 scores than the pre-trained fastText embeddings, on the best configurations of both the baselines and the best-performing post-processed configurations (500-dimensional LSA).

In light of these results, we return to the research questions stated in the beginning of this document.

config        dim     mi-F      ma-F      Rouge-1   Rouge-2
gt            N/A     1.000     1.000     0.414     0.132
prob          N/A     0.871     0.501     0.051     0.009
BOW           15852   0.9254    0.5131    0.094     0.017
TF-IDF        15852   0.9260    0.5122    0.085     0.015
w2v           50      0.923     0.508     0.151     0.027
w2v-sem       10292   0.906     0.519     0.260     0.048

config        dim     mi-F(tc)  mi-F(ct)  ma-F(tc)  ma-F(ct)  Rouge-1(tc)  Rouge-1(ct)  Rouge-2(tc)  Rouge-2(ct)
+lsa-50       100     0.9225    0.9214    0.5223    0.5222    0.166        0.195        0.030        0.036
+lsa-100      150     0.9202    0.9207    0.5164    0.5217    0.188        0.202        0.038        0.038
+lsa-250      300     0.9197    0.9165    0.5198    0.5289    0.181        0.246        0.037        0.040
+lsa-500      550     0.9218    0.9053    0.5190    0.5337    0.159        0.305        0.030        0.059
+pca-50       100     0.9208    0.9101    0.5195    0.5329    0.193        0.242        0.039        0.049
+pca-100      150     0.9207    0.9141    0.5206    0.5349    0.178        0.234        0.036        0.047
+pca-250      300     0.9217    0.9146    0.5217    0.5250    0.171        0.237        0.035        0.044
+pca-500      550     0.9223    0.9107    0.5202    0.5254    0.161        0.255        0.032        0.049
+kmeans-50    100     0.9089    0.9257    0.5267    0.4821    0.252        0.018        0.056        0.005
+kmeans-100   150     0.9028    0.9272    0.5107    0.4811    0.133        0.000        0.028        0.000
+kmeans-250   300     0.9272    0.9272    0.4811    0.4811    0.000        0.000        0.000        0.000
+kmeans-500   550     0.9272    0.9272    0.4811    0.4811    0.000        0.000        0.000        0.000

Table 2: Experimental results on the MultiLing 2015 MSS dataset using 50-dimensional word2vec embeddings. Bold values indicate maxima across rows for that column. Underlined values correspond to an improvement over the counterpart configuration (tc versus ct, or x versus x-sem).

config        dim     mi-F      ma-F      Rouge-1   Rouge-2
gt            N/A     1.000     1.000     0.414     0.132
prob          N/A     0.871     0.501     0.051     0.009
BOW           15852   0.9254    0.5131    0.094     0.017
TF-IDF        15852   0.9260    0.5122    0.085     0.015
fastText      300     0.923     0.517     0.156     0.029
fastText-sem  10542   0.919     0.516     0.204     0.043

config        dim     mi-F(tc)  mi-F(ct)  ma-F(tc)  ma-F(ct)  Rouge-1(tc)  Rouge-1(ct)  Rouge-2(tc)  Rouge-2(ct)
+lsa-50       350     0.9167    0.9214    0.5231    0.5195    0.206        0.182        0.038        0.032
+lsa-100      400     0.9200    0.9212    0.5196    0.5224    0.171        0.189        0.032        0.036
+lsa-250      550     0.9237    0.9134    0.5221    0.5370    0.145        0.278        0.031        0.053
+lsa-500      800     0.9243    0.9083    0.5201    0.5373    0.128        0.296        0.025        0.056
+pca-50       350     0.9186    0.9145    0.5205    0.5319    0.182        0.234        0.036        0.045
+pca-100      400     0.9208    0.9160    0.5187    0.5369    0.160        0.230        0.037        0.044
+pca-250      550     0.9233    0.9146    0.5210    0.5286    0.189        0.229        0.038        0.045
+pca-500      800     0.9239    0.9096    0.5223    0.5261    0.152        0.255        0.032        0.047
+kmeans-50    350     0.8995    0.9238    0.4928    0.4833    0.071        0.022        0.018        0.006
+kmeans-100   400     0.8903    0.9272    0.4897    0.4811    0.071        0.000        0.018        0.000
+kmeans-250   550     0.9272    0.9272    0.4811    0.4811    0.000        0.000        0.000        0.000
+kmeans-500   800     0.9272    0.9272    0.4811    0.4811    0.000        0.000        0.000        0.000

Table 3: Experimental results on the MultiLing 2015 MSS dataset using 300-dimensional fastText embeddings. Bold values indicate maxima across rows for that column. Underlined values correspond to an improvement over the counterpart configuration (tc versus ct, or x versus x-sem).

• Can the introduction of semantic information in the network input improve extractive summarization performance?

It appears that the introduction of semantic information can introduce benefits to the extractive summarization pipeline. This is illustrated by the Rouge scores, which are considerably improved in the augmented configurations, for both embeddings examined. On the contrary, micro / macro-F1 results are either not significantly affected or can even deteriorate. However, as discussed above, we argue that the severe class imbalance of the dataset makes these measures less indicative of system performance, compared to Rouge.

• Does the semantic augmentation process benefit from dimensionality reduction post-processing methods?

The augmentation process can improve with post-processing methods. This is expected, since the sparse bag-based semantic vectors are bound to contain noise and/or redundant and overlapping information that will affect the learning model further down the summarization pipeline. For both embeddings examined, such configurations improve upon the baseline and achieve the best scores for all evaluation measures included.

LSA-based transformations achieve top Rouge performance for both embeddings covered, as well as top F1 scores for the fastText embedding, with its frequency-based decomposition appearing to work better than PCA analysis. On the contrary, K-means clustering mostly failed to capture the underlying semantic content into meaningful groups, especially for the higher dimensions examined. Additionally, the post-processing transformation methods work best mostly when applied to the concatenated lexical and semantic vectors, rather than transforming the semantic information alone and then concatenating.

Apart from the specific research questions, it is notable that the large class imbalance has to be carefully handled, as – even with the dataset oversampling measures taken – the sentence classifier can converge to degenerate cases, as was the case with the higher-dimensional configurations of K-means.

At this point, we note that since our system does not account for selected sentence order, we limit our comparison of each approach to only the gt configuration, rather than the human-authored summaries; even for cases with perfect classification performance, the results are far from optimal (e.g. Rouge-1 / Rouge-2 scores of 1.0), since there is no guarantee that sentence order is preserved in the extractive ground truth generation detailed in 4.1. This introduces an upper bound to performance and prevents meaningful comparison to related work. Instead, this study illustrates the contribution of semantic information to the pipeline, as discussed above.

As a last note, we compare our results with respect to the unaltered, human-written summaries – i.e. summaries which are not composed of input sentences as per the extractive setting – after reiterating that our preliminary approach does not take into account sentence order or target length. First, the gt extractive ground truth we generated achieves Rouge-1 and Rouge-2 scores of 0.245 and 0.57 respectively, effectively serving as an upper bound for our performance. The best-performing 500-dimensional LSA configuration for word2vec-trained embeddings performs at 0.196 and 0.015 for Rouge-1 and Rouge-2, respectively, and 0.191, 0.014 for fastText. These results fall short of the system performance levels on previous MultiLing community tasks (Conroy et al., 2015), however the goal of this investigation was solely to illustrate the utility of the semantic component; future work (outlined below) plans on addressing this issue and aligning our results toward comparability with related work.

5 Conclusions

In this work, we investigated the contribution of semantically enriching embedding-based approaches to extractive summarization. Pre-trained embeddings as well as embeddings trained from scratch on the target dataset were utilized. For the semantic channel, frequency-based concept information from WordNet is extracted, post-processed with a range of feature transformation and clustering methods prior to or after concatenation with the lexical embeddings. A wide evaluation was performed on multiple configuration combinations and transformation dimensions, using micro/macro F1 and Rouge-1/Rouge-2 scores. Initial results show that such semantic augmentation approaches can introduce considerable benefits to baseline approaches in terms of macro F1, Rouge-1 and Rouge-2 scores, with micro-F1 deemed inadequate for highly imbalanced problems such as the extractive summarization setting examined here. LSA-based decomposition works best out of the variants examined, outperforming PCA and K-means post-processing in terms of Rouge. In the future, more sophisticated transformation methods could be explored, such as encoder-decoder schemes via recurrent neural networks (Hochreiter and Schmidhuber, 1997), dynamically fusing word embeddings into a sentence encoding and eliminating the need for word averaging in sentence-level vector generation.
Alternatively, sequence-based classification could be explored in a similar fashion. Moreover, higher transformation dimensions could be covered, given that the best configuration examined lay on the highest end of the examined range (500), and additional semantic resources can be utilized, via the bag-based approach used in this study, or by alternative methods of semantic vector generation (Faruqui et al., 2014). Finally, the natural next step in our work would be the application of our semantic augmentation approach with sentence ranking and target length constraint mechanisms, in order to make the results of the pipeline fairly comparable to related summarization systems.

References

Abdelkrime Aries, Djamel Eddine Zegour, and Khaled Walid Hidouci. 2015. AllSummarizer system at MultiLing 2015: Multilingual single and multi-document summarization. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 237–244.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, Association for Computational Linguistics, pages 86–90.

Kedar Bellare, Anish Das Sarma, Atish Das Sarma, Navneet Loiwal, Vaibhav Mehta, Ganesh Ramakrishnan, and Pushpak Bhattacharyya. 2004. Generic text summarization using WordNet. In LREC.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, pages 1247–1250.

Peter F. Brown, Peter V. DeSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Comput. Linguistics 18(4):467–479. http://www.aclweb.org/anthology/J92-4003.

John M. Conroy, Jeff Kubina, Peter A. Rankel, and Julia S. Yang. 2015. Multilingual Summarization and Evaluation Using Wikipedia Featured Articles, chapter 9, pages 281–336.

Chenghua Dang and Xinjun Luo. 2008. WordNet-based document summarization. In WSEAS International Conference Proceedings, Mathematics and Computers in Science and Engineering 7. World Scientific and Engineering Academy and Society.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6):391. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.

Zakaria Elberrichi, Abdelattif Rahmoun, and Mohamed Amine Bentaalah. 2008. Using WordNet for text categorization. International Arab Journal of Information Technology (IAJIT) 5(1).

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166. https://doi.org/10.3115/v1/N15-1184.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pages 961–968.

George Giannakopoulos, Jeff Kubina, John Conroy, Josef Steinberger, Benoit Favre, Mijail Kabadjov, Udo Kruschwitz, and Massimo Poesio. 2015. MultiLing 2015: multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 270–274.

Vishal Gupta and Gurpreet Singh Lehal. 2010. A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence 2(3):258–268.

Zellig S. Harris. 1954. Distributional structure. Word 10(2-3):146–162.

Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. 2005. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27(2):83–85.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput. 9(8):1–32. https://doi.org/10.1162/neco.1997.9.8.1735.

Roger A. Horn and Charles R. Johnson. 2012. Matrix analysis. Cambridge University Press.

Chihli Hung and Stefan Wermter. 2004. Neural network based document clustering using WordNet ontologies. International Journal of Hybrid Intelligent Systems 1(3-4):127–142.

Anjali Ganesh Jivani et al. 2011. A comparative study of stemming algorithms. Int. J. Comp. Tech. Appl. 2(6):1930–1938. https://www.researchgate.net/profile/Anjali_Jivani/publication/284038938_A_Comparative_Study_of_Stemming_Algorithms/links/56ef8c3008aed17d09f87abd.pdf.

Ian Jolliffe. 2011. Principal component analysis. In International Encyclopedia of Statistical Science, Springer, pages 1094–1096.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. https://arxiv.org/pdf/1607.01759.pdf.

S. Katz. 2003. Estimation of probabilities from sparse data for the language model component of a speech recognizer. (February):2–4.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. 2001. Finding topic words for hierarchical summarization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pages 349–357.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521(7553):436–444. https://doi.org/10.1038/nature14539.

Lei Li, Yazhao Zhang, Liyuan Mao, Junqi Chi, Moye Chen, and Zuying Huang. 2017. CIST@CLSciSumm-17: Multiple features based citation linkage, classification and summarization. In BIRNDL@SIGIR (2), pages 43–54.

Marina Litvak and Mark Last. 2013. Multilingual single-document summarization with MUSE. In Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization, pages 77–81.

Ying Liu, Peter Scheuermann, Xingsen Li, and Xingquan Zhu. 2007. Using WordNet to disambiguate word senses for text classification. In International Conference on Computational Science, Springer, pages 781–789.

Elena Lloret and Manuel Palomar. 2009. A gradual combination of features for building automatic summarisation systems. In International Conference on Text, Speech and Dialogue, Springer, pages 16–23.

Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2):129–137.

Trevor N. Mansuy and Robert J. Hilderman. 2006. Evaluating WordNet features in text classification models. In FLAIRS Conference, pages 568–573.

Victoria McCargar. 2004. Statistical approaches to automatic text summarization. Bulletin of the American Society for Information Science and Technology 30(4):21–25.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

George A. Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38(11):39–41.

Tatsunori Mori. 2002. Information gain ratio as term weight: the case of summarization of IR results. In COLING 2002: The 19th International Conference on Computational Linguistics.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. Proc. Tenth Int. Work. Artif. Intell. Stat., pages 246–252. https://doi.org/10.1109/JCDL.2003.1204852.

Ani Nenkova and Lucy Vanderwende. 2005. The impact of frequency on summarization. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005-101.

Joel Larocca Neto, Alex A. Freitas, and Celso A. A. Kaestner. 2002. Automatic text summarization using a machine learning approach. In Brazilian Symposium on Artificial Intelligence, Springer, pages 205–215.

Alok Ranjan Pal and Diganta Saha. 2014. An approach to automatic text summarization using WordNet. In 2014 IEEE International Advance Computing Conference (IACC), IEEE, pages 1169–1173.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matt Post and Shane Bergsma. 2013. Explicit and implicit syntactic features for text classification. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 866–872.

D. Radev, H. Jing, and M. Budzikowska. 2000. Centroid-based summarization of multiple documents: Clustering, , and evaluation. In Proceedings of the ANLP/NAACL-2000 Workshop on Summarization.

Radim Rehurek and Petr Sojka. 2011. Gensim: statistical semantics in Python.

M. Rodriguez, J. Hidalgo, and B. Agudo. 2000. Using WordNet to complement training information in text categorization. In Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing II: Selected Papers from RANLP, volume 97, pages 353–364.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5):513–523.

Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Commun. ACM 18(11):613–620.

Sam Scott and Stan Matwin. 1998. Text classification using WordNet hypernyms. In Usage of WordNet in Natural Language Processing Systems.

Sam Scott and Stan Matwin. 1999. Feature engineering for text classification. ICML 99:379–388. https://pdfs.semanticscholar.org/6e51/8946c59c8c5d005054af319783b3eba128a9.pdf.

Marta Vicente, Oscar Alcón, and Elena Lloret. 2015. The University of Alicante at MultiLing 2015: approach, results and further insights. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 250–259.

Yiming Yang. 1997. A comparative study on feature selection in text categorization. http://www.surdeanu.info/mihai/teaching/ista555-spring15/readings/yang97comparative.pdf.

Jaya Kumar Yogan, Ong Sing Goh, Basiron Halizah, Hea Choon Ngo, and C. Puspalata. 2016. A review on automatic text summarization approaches. Journal of Computer Science 12(4):178–190.
