LTRCsum: Telugu Human-annotated Abstractive Summarization Corpus Collection and Evaluation

Priyanka Ravva, Ashok Urlana, Pavan Baswani Department of Computer Science KCIS, LTRC, IIIT-H {priyanka.ravva, ashok.urlana, pavan.baswani}@research.iiit.ac.in

Lokesh Madasu, Gopichand Kanumolu, Manish Shrivastava Department of Computer Science KCIS, LTRC, IIIT-H {maadasulokesh, chandukanumolu007}@gmail.com, [email protected]

Abstract

Automatic text summarization is a way of obtaining a shorter version of an original document. Commonly used abstractive summarization datasets typically treat summaries as headlines, concatenations of bullet points, single-sentence summaries, or a combination of extractive and abstractive formations. Such summaries often have disjoint sentences and may not cover all the relevant aspects of the original article. Moreover, the sentence(s)/phrase(s) in the summaries are very often copied. We present LTRCsum, a novel human-annotated Telugu abstractive summarization corpus consisting of 29,309 text-summary pairs. We discuss the various challenges involved in data creation for a low-resource language along with the novel summarization guidelines. This work also addresses the evaluation of the created corpus based on Relevance, Readability, and Creativity parameters. Various quality baselines are implemented by incorporating five existing models. The ML with intra attention + Word2Vec model outperformed all the baselines, with R-1 and R-2 scores of 45.27 and 29.22, respectively.

1 Introduction

The availability of high-quality datasets constrains the progress of deep learning approaches for automatic summarization. Contemporary works focused on harvesting data (Hermann et al., 2015; Narayan et al., 2018; Grusky et al., 2018; Rush et al., 2017; Napoles et al., 2012) from the web to build suitable summarization datasets. Most often, web-scraped datasets consist of summaries that distill the source material down to its most important points to inform the reader (Hermann et al., 2015). However, a few datasets provide summaries that explicitly concentrate on 'what is the article about?' by giving a brief overview (Narayan et al., 2018). Such summaries often omit one or more relevant aspects present in the original article. The study of Hasan et al. (2021) shows that the presence of extra information in the CNN/Daily Mail and XSUM datasets makes the abstractive summarization task more challenging. This extra information enters the summaries either through the attempt to create a shorter summary within a constrained word limit or through the personal opinions/interpretations of the professional writers in the specified domain.

Widely used abstractive summarization datasets typically consider summaries as one of the following forms: headlines (Napoles et al., 2012; Rush et al., 2017), concatenations of bullet points (Hermann et al., 2015), single-sentence summaries (Narayan et al., 2018), or a combination of extractive

& abstractive formations (Grusky et al., 2018). These summaries are often created in a shorter format to attract readers or for marketing purposes. Such summaries often have disjoint sentences and may not cover all the relevant aspects of the original article. Moreover, the sentence(s)/phrase(s) in summaries are often copied. By avoiding such shortcomings in LTRCsum, we attempt to create a high-quality human-annotated corpus for the Telugu language. The majority of the summarization datasets are available for English only. Most of the Indian languages do not have high-quality summarization datasets (see Table 1). Despite being widely spoken in the southern parts of India, with more than 80 million native speakers, Telugu is considered an under-resourced language by the natural language processing community. Specifically, for the Indian-language summarization task, apart from the XL-SUM (Hasan et al., 2021) corpus (directly scraped from BBC¹), no other benchmark datasets are available, for two reasons. The former is that creating an abstractive summarization dataset is an expensive and time-consuming task, and the latter is the unavailability of standard manual summarization guidelines. To the best of our knowledge, LTRCsum is the first and the largest human-annotated multi-sentence abstractive summarization corpus for the Telugu language.

In this paper, we present the LTRCsum corpus consisting of 29,309 human-annotated text-summary pairs. In contrast to traditional datasets, we attempted to create a high-quality abstractive summarization corpus by considering relevance & coverage, readability, and creativity parameters. We propose novel annotation guidelines and filtered the summaries by following rigorous quality assessment criteria to preserve the dataset's quality. As detailed in Table 5, LTRCsum maintains high coverage, where the average length of the summary is approximately 50% of the length of the original article. Along with that, each summary is coherent with novel sentence formations. To speed up the manual summarization process, we introduced a summarization tool and integrated it with intrinsic evaluation metrics (token compression ratio and novel n-gram ratio) to further reduce the copying percentage and increase the novelty of the summaries.

Moreover, we adapted several baselines to evaluate the effectiveness of the LTRCsum corpus. The benchmark models are trained with various word embeddings (Word2Vec, fastText skip-gram and CBOW). We performed experiments ranging from simple seq2seq architectures to pointer-generator mechanisms and reinforcement learning approaches. Along with that, we used transformer architectures and also fine-tuned multilingual pretrained models (mT5). The ML with intra attention approaches show promising results.

Our major contributions can be summarized as follows:

• We discuss in detail the pipeline for constructing the human-annotated abstractive summarization corpus for the Telugu language and release 29,309 text-summary pairs.

• We also release the manual summarization and evaluation guidelines for this task.

• To assess the quality of the human-annotated dataset, we compare LTRCsum with existing Telugu datasets.

• We implement five quality baselines for the automatic abstractive summarization task and perform extensive analysis.

2 Related Work

Performing text summarization for low-resource languages has been a long-standing problem in Natural Language Processing. The mechanisms involved in abstractive and extractive summarization differ significantly but share some common threads such as salience and coverage. Indian-language summarization has been attempted using different approaches, varying from statistical to linguistic-based and from pure machine learning to hybrid methods.

Extractive Summarization Approaches: The main challenge in extractive summarization is to effectively choose the important sentences and arrange them in proper order. To achieve this, initial attempts were made towards Telugu summarization using heuristic-based approaches (Damodar et al., 2021) and frequency-based and clustering approaches (Khanam and Sravani, 2016). Due to the out-of-order extraction in the k-means method, only the summaries produced by the frequency-based approach made sense. The challenge of ranking the relevant sentences in a document was addressed with the help of the TextRank (Manjari, 2020) and PageRank (Damodar et al., 2021) algorithms.

¹ https://www.bbc.com/

In addition, the multi-document summarization task (Pingali et al., 2008; Y Madhavee Latha, 2020) has also been attempted for Telugu. In contrast to our work, the majority of Indian-language summarizers were built to perform extractive summarization (Kallimani et al., 2010; Renjith and Sony, 2015; Pattnaik and Nayak, 2020; Sarkar, 2012; Thaokar and Malik, 2013; Hanumanthappa et al., 2014; Bhosale et al., 2018; Burney et al., 2012) due to its ease of implementation.

Abstractive Summarization Approaches: Some abstractive summarizers were designed using information extraction methods (Kallimani et al., 2011) and automatic keyword extraction (Naidu et al., 2018), which uses POS tags to generate headlines. Such a summarizer includes word cues, keyword extraction, sentence selection, sentence extraction, and summary generation modules to analyze the textual data and determine the key features of the summary. Recent works on building neural abstractive summarization systems were attempted by directly scraping text-summary pairs from BBC Telugu². Unlike our implementation with the baseline models (See et al., 2017; Sutskever et al., 2014; Paulus et al., 2017), the authors of XL-SUM (Hasan et al., 2021) only implemented a multilingual abstractive summarizer with the mT5 transformer model.

Table 1: Comparative Study of Telugu Summarization Datasets

| Author | Type of summarization | Dataset size | Source | Categories | Techniques used |
| Khanam and Sravani (2016) | Extractive | 1 | News website | Politics | Frequency-based approach |
| Kallimani et al. (2010) | Abstractive | 1 | News website | News document | Keyword extraction approach |
| Shashikanth and Sanghavi (2019) | Abstractive | 1 | Unknown | Unknown | K-means clustering and frequency-based approach |
| Kallimani et al. (2016) | Abstractive | 30 | Unknown | biographies, natural disasters, reviews of products, cultural events, cricket | Template-based approach |
| Naidu et al. (2018) | Extractive | 450 | Sakshi, Andhrajyothy, Andhrabhoomi | Unknown | Keyword extraction approach |
| Bharath et al. (2022) | Abstractive | 2000 | Telugu news websites | politics, entertainment, sports, business, national | Seq2seq + attention |
| Mamidala et al. (2021) | Extractive | 360 | Eenadu, Sakshi, Namaste Telangana | Unknown | Heuristic-based approach |
| Y Madhavee Latha (2020) | Abstractive | 8 | Unknown | Sports | RNN + attention mechanism |
| Hasan et al. (2021) | Abstractive | 13,205 | BBC | National, International | Multilingual pretrained model (mT5) |

Comparative Study of Existing Datasets:

We performed a comparative study of existing Telugu summarization datasets (see Table 1) and observed that the majority of the datasets contain fewer than 1,000 text-summary pairs, which is a constraint for implementing neural methods. Most of the datasets are related to the news domain, and apart from XLSUM (Hasan et al., 2021), no other data/code is publicly accessible. All the datasets were crawled from the web, and none of them followed any quality assessment (human evaluation) criteria to assess dataset quality. We also performed a comparative analysis of existing low-resource summarization datasets for non-Telugu languages (see the supplementary material). In addition, we performed a comparative analysis (see Section 4.2) with the XLSUM corpus to measure the quality of the LTRCsum corpus.

3 Corpus Creation

3.1 Source

For raw data, we crawled news articles from popular Telugu news websites such as Prajasakthi³, Surya⁴, and Vartha⁵. This includes news articles from various domains such as sports, movies, politics, business, crime, health, and technology. For ease of handling and automatic filtering, we discarded articles containing any non-Telugu content. After performing the necessary preprocessing steps, articles with more than three sentences were considered for the summarization task.

² https://www.bbc.com/telugu
³ https://prajasakti.com/
⁴ https://telugu.suryaa.com/index.html
⁵ https://www.vaartha.com/
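As a rough illustration of this filtering step, the sketch below keeps only articles written entirely in Telugu script (Unicode block U+0C00-U+0C7F, tolerating digits, punctuation, and whitespace) and having more than three sentences. The exact rules used in our pipeline may differ; this is a minimal sketch, not the production code.

```python
import re

# Telugu Unicode block; digits, whitespace, and basic punctuation are tolerated.
_ALLOWED = re.compile(r"^[\u0C00-\u0C7F0-9\s.,:;!?\"'()%/-]+$")

def sentence_count(text):
    # Telugu news text generally uses the full stop as the sentence terminator.
    return len([s for s in text.split(".") if s.strip()])

def keep_article(text):
    """True if the article is Telugu-only and has more than three sentences."""
    return bool(_ALLOWED.match(text)) and sentence_count(text) > 3
```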

3.2 Manual Summarization

To perform the manual summarization task, we selected 347 highly proficient Telugu native speakers. All of them are pursuing graduation at reputed universities in Andhra Pradesh and Telangana in India. To the best of our knowledge, there are no publicly available guidelines for manual summarization and evaluation of a summarization corpus. Thus the first task at hand was to analyze the properties of summaries in detail and identify the important parameters that an abstractive summary must have, namely Relevance and Coverage⁶, Readability, and Creativity. These parameters resulted in the summarization guidelines for the task. Also, while creating the summarization guidelines, we identified the most frequent errors with respect to each parameter.

Guidelines and corresponding common mistakes in summary creation:

1. Relevance and Coverage: All the pertinent information conveyed in the source article should be captured in the summary while discarding the irrelevant information. Redundant information or information unrelated to the major topic of the article may be considered irrelevant.

• Missing important information: A summary has to cover all the important aspects of the original article.
• Including irrelevant information: A summary should not include any irrelevant information such as personal opinion(s), out-of-context details, or inappropriate factual details.
• Redundant information: A summary should not contain any repetitive phrases/sentences.

2. Readability: If the summary is understandable by a native speaker without looking at the source article, it is considered "Readable". Bad grammar, pronouns that cannot be resolved within the summary, and unnatural sentential/phrasal structures would make the summary difficult to understand. Also, the summary should stand as an independent article, and the reader should not need the original article to understand it fully.

• Disjoint sentences: While paraphrasing, sentences should be joined in such a way that the composite sentence is meaningful.
• Anaphora issue: In the summary, all pronouns should be used only after mentioning the original noun.
• Disordering of sentences: The summary should be coherent to convey the proper context of the original article.
• Not readable: The summary should be free from any syntactic and semantic errors.

3. Creativity: Since this is an abstractive summarization task, we require the summaries to have novelty in terms of sentential structures, such as lexical choices (vocabulary other than that used in the given article), phrasal constructions, and sentence formations.

• Missing novel sentence structure: The summary should contain novel sentence structures (with or without using novel words) compared to the original article.
• Lengthy summary: The summary should be a new, shorter text that conveys the most crucial information of the original article.
• Half abstractive and half extractive: The summary should not be a combination of extractive and abstractive summary formations.
• Sentence-level summary: The summary should not be created by just altering words/phrases in individual sentences.

The above-mentioned common mistakes are explained further, with examples, in the supplementary material. As a pilot study, we assigned five samples to each annotator for sample summarization and evaluation before the actual task. Here, a sample is an article. In the actual task, each annotator was assigned 50 samples. Annotators were given instructions for preprocessing the original article and creating an abstractive summary.

⁶ Relevance and Coverage are merged as one metric.

4 Corpus Evaluation

The summaries collected from the annotators were then evaluated to measure their quality. As detailed in Table 2, we propose summary evaluation guidelines to assess the quality of the annotated data. To deal with exceptions, where a single mistake can lead to reduced scores in more than one parameter, we identified a few agreements between the evaluators, discussed in the supplementary material.

Table 2: Abstractive Summarization Evaluation Metrics

| Score | Relevance | Readability | Creativity |
| 0 | No relevant information is covered or the entire summary is irrelevant | Summary is not at all understandable | Summary consists entirely of sentences copied verbatim from the original article |
| 1 | Only one relevant piece of information is covered or most of the summary is irrelevant | Most of the summary contains unnatural sentence structures and frequent grammatical errors | Most of the sentences are copied from the original article and novel words are rare |
| 2 | Half of the relevant information is present or half of the irrelevant information is covered | Approximately half of the summary contains unnatural sentential/phrasal structures, which make the summary difficult to understand | Approximately half of the summary contains novel sentence structures and the remaining half is copied from the original article, or a few sentences are generated but have inaccurate meaning |
| 3 | One piece of relevant information is missing, or one piece of irrelevant information is added | Summary is understandable but contains some errors in grammar, punctuation or spelling | Most of the summary is novel, but some of the non-factual content is copied verbatim |
| 4 | Everything is relevant and all the relevant information is covered | Summary is understandable and free from grammatical, punctuation and spelling mistakes | The entire summary, except the factual information (names, dates etc.), is novel |

4.1 Evaluation Process

The evaluation process consists of two phases: the former is assigning scores (ranging from 0 to 4) for the Relevance, Readability, and Creativity metrics; the latter is providing the necessary feedback to the annotators for improving summarization quality.

Assigning Scores: Our evaluation team consists of 4 highly proficient native Telugu speakers. Since manually evaluating all the samples in each submission (i.e., the 50 samples allotted to each annotator) is tedious, we used a cluster-based sampling method (Helen Barratt, 2009) for evaluation. Each submission is divided into three sub-portions (of 17, 17 & 16 samples), and evaluators were instructed to evaluate a minimum of 30% of the samples in each portion.
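A small sketch of this sampling scheme, assuming each submission is simply a list of 50 article-summary pairs; the split sizes and the 30% floor follow the description above, while the function and variable names are illustrative.

```python
import random

def sample_for_evaluation(submission, seed=0):
    """Split a 50-sample submission into portions of 17/17/16 and draw at
    least 30% of each portion for manual evaluation."""
    rng = random.Random(seed)
    portions = [submission[:17], submission[17:34], submission[34:]]
    selected = []
    for portion in portions:
        k = max(1, round(0.3 * len(portion)))
        selected.extend(rng.sample(portion, k))
    return selected
```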

Followup & Feedback: Along with the scores, the evaluators were instructed to provide feedback for each submission. The objective of providing feedback is to make the annotators understand the significance of a human-annotated corpus by explaining the minimum criteria of a valid abstractive summary and to train them to meet those criteria. During the feedback and follow-up process, our evaluators encountered the following most frequent summarization errors:

• Unnatural sentence formations (though meaningful, the sentence structures were too complex and rare to find in Telugu literature).

• Sentences with syntactic and semantic errors.

• Misunderstanding the original meaning of the article or adding personal opinions or biases.

• The length of the summary, which was always a concern.

• Misidentifying the relevant and irrelevant parts of the information in the original article.

Automation of the Data Creation and Evaluation Process: In order to resolve the frequent errors in the data summarization task, we introduced annotation and evaluation tools (refer to the supplementary material) to automate the summarization and evaluation process. These tools help reduce time consumption and the overall task complexity. The data creation (annotation) tool can organize all text files and load the contents of any file for modification. It provides an easy way to navigate to a specific file and modify its contents, with an auto-saving option. Similarly, the evaluation tool can load an article-summary pair and store the scores and feedback given to the respective pair. Later, the feedback can be generalized to provide feedback/suggestions for the entire submission.

Integrating the Intrinsic Evaluation Metrics into the Web Interfaces: Even though the annotation tool significantly reduces the time required to finish the overall task, it alone fails to yield quality summaries because of sentence(s)/phrase(s) copied from the respective original articles. Hence, both tools were integrated with the intrinsic evaluation metrics (token compression ratio and novel n-gram (trigram) ratio) to restrict the copying percentage. Offline Google input tools⁷ were included in the annotation tool to avoid the usage of third-party tools.

• Token/Sentence Compression Ratio: The number of tokens in the summary should be less than or equal to 60% of the number of tokens/sentences in the article.

• Novel n-gram Ratio: The summary should contain at least 25% novel n-grams compared to the article. Here, we consider the n-grams to be trigrams.

While creating the summary, the annotator can see the maximum number of words/sentences allowed in the summary and the novel n-gram percentage in the web interface; a sketch of these two metrics is given below.
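To make the two checks concrete, the following is a minimal sketch of how the intrinsic metrics can be computed; it assumes simple whitespace tokenization and is not the exact implementation used in our tools.

```python
def ngrams(tokens, n=3):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def token_compression_ratio(article, summary):
    """Summary length as a percentage of article length (whitespace tokens)."""
    art_tokens, sum_tokens = article.split(), summary.split()
    return 100.0 * len(sum_tokens) / max(len(art_tokens), 1)

def novel_trigram_ratio(article, summary):
    """Percentage of summary trigrams that never occur in the article."""
    art_tri = ngrams(article.split(), 3)
    sum_tri = ngrams(summary.split(), 3)
    if not sum_tri:
        return 0.0
    novel = [t for t in sum_tri if t not in art_tri]
    return 100.0 * len(novel) / len(sum_tri)

# Acceptance check mirroring the thresholds shown in the web interface
def passes_interface_checks(article, summary):
    return (token_compression_ratio(article, summary) <= 60.0
            and novel_trigram_ratio(article, summary) >= 25.0)
```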

Quality Assessment: With the above-mentioned procedure we collected a total of 83,370 text-summary pairs across all stages (refer to Table 3). Among these, 54,061 pairs did not meet the following criteria and were therefore not accepted as high-quality abstractive summaries.
Criteria-1: All the evaluated article-summary pair scores were averaged parameter-wise (relevance, readability, and creativity) for each portion (17, 17, 16) of the submission. The portions with an average score ≥ 3 for all three parameters were then filtered by applying Criteria-2 (a sketch of this filter appears after the list):

• The number of sentences in the article should be ≥ 4.

• The token compression ratio w.r.t. the article should be ≤ 60%, the sentence compression ratio ≤ 50%, and 25 ≤ trigram novelty ≤ 95.

• The number of tokens in the article should be ≥ 40 and in the summary ≥ 10.
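The following is a minimal sketch of Criteria-2 as a single filtering function, assuming whitespace tokenization and a naive full-stop-based sentence splitter; it is an illustration of the thresholds above rather than our actual filtering script.

```python
def meets_criteria_2(article, summary):
    """Return True if an article-summary pair satisfies Criteria-2."""
    def sents(text):
        # Naive sentence splitter on the full stop; illustrative only.
        return [s for s in text.split(".") if s.strip()]

    def trigrams(tokens):
        return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

    art_tok, sum_tok = article.split(), summary.split()
    if len(sents(article)) < 4 or len(art_tok) < 40 or len(sum_tok) < 10:
        return False
    token_cr = 100.0 * len(sum_tok) / len(art_tok)
    sent_cr = 100.0 * len(sents(summary)) / len(sents(article))
    sum_tri = trigrams(sum_tok)
    novelty = 100.0 * len(sum_tri - trigrams(art_tok)) / max(len(sum_tri), 1)
    return token_cr <= 60.0 and sent_cr <= 50.0 and 25.0 <= novelty <= 95.0
```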

With the above-mentioned criteria, we obtained 29,309 samples as the final gold samples (see Table 4), which were used to build the automatic abstractive summarization models. We also measured Fleiss' Kappa⁸ Inter-Annotator Agreement (IAA) scores to see how uniformly the annotators understood the guidelines and how reproducible the summarization task is. The IAA score is calculated using Equation 1:

    κ = (P̄ − P̄_e) / (1 − P̄_e)        (1)

We took a sample of 500 text-summary pairs to compute the IAA and obtained a score of κ = 0.885, which corresponds to a very high level of agreement.
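For reference, here is a small sketch of how Fleiss' κ in Equation 1 can be computed directly from a ratings matrix (rows are items, columns count how many annotators assigned each score from 0 to 4); it is a textbook implementation of the formula, not our exact evaluation script, and the example ratings are made up.

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_items, n_categories) array; counts[i, j] is the number of
    annotators who assigned category j to item i. Each row sums to the
    (constant) number of annotators per item."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item agreement P_i and its mean P-bar
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement P-bar_e from the category marginals
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_bar_e = np.square(p_j).sum()
    return (p_bar - p_bar_e) / (1 - p_bar_e)

# Example: 3 annotators scoring 4 items on a 0-4 scale (category counts per item)
ratings = [[0, 0, 0, 1, 2],
           [0, 0, 0, 3, 0],
           [0, 0, 1, 2, 0],
           [0, 0, 0, 0, 3]]
print(round(fleiss_kappa(ratings), 3))
```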

Essence of the Annotation Tool and Metrics: As detailed in Table 3, without using the annotation tool, we obtained 39.67% quality data, but the majority of the annotators spent 10 to 13 hours to finish the task. In order to reduce the overall time taken to complete the task, we introduced the web interfaces.

⁷ https://typingkeyboards.com/google-input-tool-telugu/
⁸ https://en.wikipedia.org/wiki/Fleiss'_kappa

Table 3: Data Collection and Evaluation Statistics

| Setting | # Annotators | # Sentences per article | Evaluation time (m) | Data collected | Quality data (%) |
| Without Tool | 110 | 3 - 6 | 53.8 - 75 | 30000 | 39.67 |
| With Tool | 120 | 6 - 9 | 52.5 - 67.5 | 40000 | 22.91 |
| Tool + Metrics | 117 | 10+ | 60 - 102.5 | 13370 | 61.66 |

Table 4: LTRCsum Statistics

| Statistic | Train (Text / Summary) | Validation (Text / Summary) | Test (Text / Summary) |
| Pairs | 24403 | 2453 | 2453 |
| Unique Words (UW) | 230232 / 151180 | 53086 / 32933 | 53880 / 33236 |
| Unique Lemmas (UL) | 191864 / 123230 | 39446 / 24051 | 40091 / 24353 |
| (Min, Max) words | (40, 502) / (10, 227) | (40, 332) / (12, 146) | (40, 240) / (13, 105) |
| Avg words | 98.1 / 43.8 | 99.5 / 45.6 | 102.4 / 46.7 |
| Avg sentences | 8.7 / 2.9 | 8.4 / 2.9 | 8.4 / 2.9 |
| Avg (UW, UL) | (9.4, 7.9) / (6.1, 5) | (21.6, 16.1) / (13.4, 9.8) | (22, 16.3) / (46.7, 9.9) |

Figure 1: Average time consumption for creating 50 summaries.

From Figure 1 it is evident that even when the article length increases (sentences ranging from 6 to 9), most annotators managed to finish the task in approximately the same duration (10 to 13 hours). However, on the other hand, we obtained only 22.9% quality data. To increase the percentage of quality data, we integrated the intrinsic evaluation metrics into both interfaces. As a result, we obtained 61.66% quality data, and the majority of the annotators expressed that the complexity of the task was moderate. However, most of the annotators had to spend 15+ hours to finish the task, due to an increase in the number of sentences in the articles (ranging from 10 to 17). We also observed that more than 70% of annotators preferred to use offline Google input tools to type the Telugu text while creating the summary. Table 3 also presents the average minimum and maximum time taken for the random evaluation of 12-16 samples in a set of 50 samples along with the corresponding feedback.

Table 5: Intrinsic evaluation of our LTRCsum dataset compared to CNN/Daily Mail and XLSum. All values are reported as percentages for easier comparison.

| Metric | CNN-DM | XLSUM [Telugu] | LTRCsum |
| Novel n-gram ratio: Unigram | 13.62 | 45.2 | 21.74 |
| Novel n-gram ratio: Bigram | 53.74 | 83.12 | 47.81 |
| Novel n-gram ratio: Trigram | 73.16 | 94.06 | 63.51 |
| Novel n-gram ratio: Four-gram | 81.85 | 97.57 | 73.68 |
| Compression | 93.06 | 94.44 | 54.65 |
| Coverage | 86.38 | 54.8 | 78.26 |
| Density | 5.08 | 1.04 | 7.22 |
| Abstractivity | 26.94 | 54.82 | 30.96 |
| Redundancy: Unigram | 24.64 | 12.01 | 20.78 |
| Redundancy: Bigram | 2.70 | 0.83 | 2.49 |

4.2 Comparison of Automatically vs. Manually Created Datasets

To compare the quality of the LTRCsum corpus with XLSUM-Telugu (Hasan et al., 2021), we performed intrinsic and human evaluations.

Intrinsic Evaluation: The intrinsic evaluation was done on random samples of 250 taken from the CNN/Daily Mail, XLSUM, and LTRCsum test sets. We used the intrinsic evaluation metrics of Grusky et al. (2018), namely abstractivity, coverage, density, coherence, compression, and novel n-gram ratio, to compare the quality of the LTRCsum corpus with respect to benchmark datasets. Table 5 shows that LTRCsum has more novel unigrams, higher abstractivity, and lower redundancy compared to the CNN/Daily Mail data (Hermann et al., 2015). Most of the summaries in LTRCsum cover all the relevant aspects of the original article; as a result, LTRCsum has better coverage than XLSUM. Moreover, the XLSUM corpus contains summaries of at most two sentences (23 tokens), so XLSUM has a better compression percentage than LTRCsum.
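For readers unfamiliar with the Newsroom-style statistics in Table 5, the following is a rough sketch of extractive fragment coverage and density in the spirit of Grusky et al. (2018); it works on whitespace tokens with a greedy longest-match search and is meant as an illustration, not a re-implementation of the original tooling.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect maximal token spans shared by summary and article."""
    fragments, i = [], 0
    while i < len(summary_tokens):
        best = []
        for j in range(len(article_tokens)):
            if article_tokens[j] == summary_tokens[i]:
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                if k > len(best):
                    best = summary_tokens[i:i + k]
        if best:
            fragments.append(best)
            i += len(best)
        else:
            i += 1
    return fragments

def coverage_and_density(article, summary):
    """Coverage: fraction of summary tokens inside shared fragments.
    Density: average squared fragment length per summary token."""
    art, summ = article.split(), summary.split()
    frags = extractive_fragments(art, summ)
    coverage = sum(len(f) for f in frags) / max(len(summ), 1)
    density = sum(len(f) ** 2 for f in frags) / max(len(summ), 1)
    return coverage, density
```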

Table 6: Human evaluation of XLSUM[Te] and LTRCsum. Except for the average scores [0 - 4], all values are reported as percentages.

| Parameter | Avg score (XLSUM / LTRCsum) | % Samples ≥ 3 (XLSUM / LTRCsum) | % Samples < 3 (XLSUM / LTRCsum) | Common mistakes | % of samples (XLSUM / LTRCsum) |
| All three parameters | - | 12 / 91 | 88 / 9 | - | - |
| Relevance | 1.34 / 3.51 | 12 / 91 | 88 / 9 | Missing important information | 63.6 / 6.4 |
| | | | | Including irrelevant information | 24.4 / 2.6 |
| Readability | 3.19 / 3.57 | 77 / 97 | 23 / 3 | Understandability issue / not readable | 16.86 / 2.65 |
| | | | | Coherence issue | 6.14 / 0.13 |
| | | | | Unnatural sentence formations | 0 / 0.22 |
| Creativity | 1.79 / 3.61 | 23 / 94 | 77 / 6 | Copied sentences found | 0 / 0.2 |
| | | | | Lengthy | 0 / 1.6 |
| | | | | Novel sentence structure is missing | 3.67 / 4.2 |
| | | | | Out-of-context information present | 73.33 / 0 |

Human Evaluation: Human evaluation was done by three Telugu native speakers on 100 random samples from the XLSUM-Telugu (Hasan et al., 2021) and LTRCsum test sets. We used the evaluation metrics in Table 2 for the human evaluation. As detailed in Table 6, we calculated the percentage of samples with scores less than 3 and greater than or equal to 3 for each parameter. Additionally, we calculated the percentage contribution of each common mistake. For instance, 88% of the samples in the XLSUM dataset obtained scores of less than 3 for relevance; of these, 63.6% of the samples were missing relevant information and 24.4% added irrelevant information, whereas in the LTRCsum corpus, only 9% of the samples obtained scores of less than 3 for relevance. The XLSUM corpus is harvested from the BBC news website and treats the first one or two sentences of each article as the summary and the remaining part as the article content. As a result, most of the articles seem incoherent. We observed that XLSUM articles often contain noisy data (image/video or ad-referring content). The summaries lack coverage of the majority of the details present in the original article, include out-of-context information, and often focus on irrelevant aspects of the original article. As detailed in Table 6, LTRCsum outperformed XLSUM on all parameters (average scores of relevance, readability, and creativity) in the human evaluation.

5 Experiments and Results

We trained and evaluated several existing summarization models to understand the challenges of LTRCsum and its effectiveness for training systems. We trained the models with Word2Vec (Mikolov et al., 2013) embeddings (pre-trained on Telugu Wikipedia) and fastText (CBOW and skip-gram variants) (Bojanowski et al., 2017). One set of experiments was performed without pre-trained embeddings. Apart from these, we evaluated reinforcement learning methods and transformer-based architectures, and we also tried multilingual pretrained baselines.
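As an illustration of how pre-trained embeddings can be plugged into these models, the sketch below builds an embedding matrix for a fixed vocabulary from Word2Vec or fastText vectors loaded with gensim; the file path and vocabulary object are hypothetical placeholders, and this is not our exact training code.

```python
import numpy as np
from gensim.models import KeyedVectors

def build_embedding_matrix(vocab, kv, dim):
    """vocab: dict mapping token -> index (e.g., the 50k most frequent tokens).
    kv: gensim KeyedVectors; dim: embedding size (60 for our Word2Vec setting,
    300 for fastText). Out-of-vocabulary tokens get small random vectors."""
    matrix = np.random.uniform(-0.05, 0.05, (len(vocab), dim)).astype("float32")
    for token, idx in vocab.items():
        if token in kv:
            matrix[idx] = kv[token]
    return matrix

# Hypothetical usage: load Telugu Word2Vec vectors and build the matrix
# kv = KeyedVectors.load_word2vec_format("telugu_word2vec_60d.bin", binary=True)
# emb = build_embedding_matrix(vocab, kv, dim=60)
```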

5.1 Benchmark Models

To perform the automatic summarization task, we implemented the sequence-to-sequence (Sutskever et al., 2014) Recurrent Neural Network (RNN) model with an attention mechanism (Bahdanau et al., 2014) and the pointer-generator (See et al., 2017) with a coverage mechanism. Further, we used the novel intra-attention mechanism (Paulus et al., 2017) with reinforcement learning. The architecture proposed by Chen and Bansal (2018) is a combination of extractive and abstractive approaches. A novel document-level encoder (Liu and Lapata, 2019) using Bidirectional Encoder Representations from Transformers (BERT) can be used for both extractive and abstractive summarization. We also fine-tuned the multilingual text-to-text transfer transformer (mT5) (Xue et al., 2020) on the LTRCsum corpus. A detailed explanation of the benchmark models and the experimental details is provided in the supplementary material.

5.2 Experimental Setup

We divided the corpus into 80% for training and 10% each for development and testing. The split is designed so that the portions are comparable in terms of article length, token/sentence compression ratio, and novel n-gram ratio. Our experiments used an output vocabulary of size 50,000 built from the most frequent tokens in the training set, and the models were trained for 100,000 iterations (equivalent to 32.78 epochs). We used a batch size of 8 and a learning rate of 0.001 for ML training and 0.0001 for RL and ML+RL training. At test time, we used a beam size of 4. We limited the maximum article length to 400 tokens and the summary length to 100 tokens. The input word embedding dimensions were 60, 300, and 300 for Word2Vec, fastText CBOW, and fastText skip-gram, respectively. In the case of non-pretrained embeddings, we used 256 dimensions.

Table 7: ROUGE-1, ROUGE-2, ROUGE-L scores for the various baselines

| Model | Embeddings | R-1 | R-2 | R-L |
| Sequence2Sequence | Without embeddings | 3.92 | 0.397 | 3.91 |
| | Word2Vec | 7.48 | 1.08 | 7.46 |
| | fastText CBOW | 3.64 | 0.38 | 3.62 |
| | fastText SG | 3.38 | 0.33 | 3.36 |
| Pointer-Generator | Without embeddings | 32.59 | 10.79 | 32.47 |
| | Word2Vec | 33.13 | 11.04 | 32.98 |
| | fastText CBOW | 31.14 | 10.94 | 31.01 |
| | fastText SG | 30.78 | 9.73 | 30.58 |
| Pointer-Generator + Coverage | Without embeddings | 32.52 | 10.93 | 32.34 |
| | Word2Vec | 33.27 | 11.28 | 33.08 |
| | fastText CBOW | 31.76 | 10.69 | 31.55 |
| | fastText SG | 31.38 | 10.68 | 31.11 |
| ML with intra attention | Without embeddings | 43.38 | 27.90 | 42.59 |
| | Word2Vec | 45.27 | 29.22 | 44.09 |
| | fastText CBOW | 44.09 | 28.17 | 43.16 |
| | fastText SG | 44.49 | 28.85 | 43.87 |
| ML without intra attention | Without embeddings | 43.18 | 27.63 | 41.97 |
| | Word2Vec | 44.54 | 28.77 | 43.38 |
| | fastText CBOW | 44.04 | 28.23 | 42.97 |
| | fastText SG | 43.35 | 28.05 | 42.53 |
| ML + RL | Without embeddings | 43.74 | 28.02 | 42.44 |
| | Word2Vec | 45.08 | 29.18 | 43.73 |
| | fastText CBOW | 43.76 | 27.87 | 42.28 |
| | fastText SG | 44.34 | 28.63 | 43.08 |
| RL | Without embeddings | 36.09 | 21.59 | 47.19 |
| | Word2Vec | 38.71 | 24.46 | 49.45 |
| | fastText CBOW | 39.14 | 24.55 | 48.59 |
| | fastText SG | 40.03 | 24.84 | 49.49 |
| rnn-ext + abs + RL | - | 34.26 | 12.16 | 34.09 |
| rnn-ext + abs + RL + rerank | - | 34.77 | 12.18 | 34.59 |
| TransformerAbs | - | 30.77 | 18.20 | 26.71 |
| BertSumAbs | - | 32.24 | 20.04 | 28.07 |
| BertSumExtAbs | - | 33.52 | 20.98 | 29.10 |
| Lead-3 | - | 44.21 | 26.58 | 38.02 |
| rnn-ext + RL (extractive) | - | 34.91 | 13.11 | 34.67 |
| TransformerExt (extractive) | - | 41.57 | 26.54 | 39.51 |
| BertSumExt (extractive) | - | 44.10 | 28.93 | 41.97 |
| mT5-base | - | 41.85 | 24.83 | 34.04 |

6 Discussions

We studied the performance of various models on LTRCsum, as detailed in Table 7. We used the ROUGE (Lin, 2004) metric and report the F1 scores of unigram overlap, bigram overlap, and longest common subsequence to measure the performance of each model. ROUGE measures the n-gram overlap between the reference summary and the candidate summary. In Table 7, each block details the results of a separate model with various pre-trained embeddings. Except for the reinforcement learning approaches, most models show promising results with Word2Vec-based embeddings compared to the other embedding types or no pretrained embeddings. The RL model performed slightly better with fastText skip-gram embeddings than with Word2Vec.
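For illustration, the following is a simplified, language-agnostic sketch of the ROUGE-1/2 and ROUGE-L F1 computation over whitespace tokens; it conveys what the metric measures but is not the evaluation script used to produce Table 7.

```python
from collections import Counter

def _ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(reference, candidate, n):
    """ROUGE-N F1 over whitespace tokens (simplified sketch)."""
    ref = _ngram_counts(reference.split(), n)
    cand = _ngram_counts(candidate.split(), n)
    overlap = sum((ref & cand).values())
    if not ref or not cand or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    r, c = reference.split(), candidate.split()
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if r[i - 1] == c[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    lcs = dp[len(r)][len(c)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```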

A few experiments were performed with extractive-abstractive approaches combined with reinforcement learning (rnn-ext + abs + RL variants). Among these, rnn-ext + abs + RL + rerank gave better results than rnn-ext + abs + RL. In the case of BERT-based models, BertSumExtAbs gave better results than TransformerAbs and BertSumAbs. Since the ML intra-attention mechanism works well on longer sequences (Paulus et al., 2017), the ML with intra attention model with Word2Vec embeddings gave the best results of all models in terms of ROUGE-1 and ROUGE-2. Mixed-objective learning with ML+RL generated better and more readable summaries, whereas RL with intra attention gave the best results for ROUGE-L. BERT-based and RL models may improve in performance if the training dataset size is increased.

Table 8: Human evaluation of the best-performing baselines

| Model | Embeddings | Relevance | Readability | Creativity |
| Human summary scores | - | 3.51 | 3.57 | 3.61 |
| ML with intra attention | Word2Vec | 2.91 | 3.57 | 0.3 |
| ML without intra attention | Word2Vec | 2.78 | 3.53 | 0.26 |
| ML + RL with intra attention | Word2Vec | 3.03 | 3.48 | 0.4 |
| RL | fastText SG | 1.79 | 1.79 | 0.25 |
| Pointer generator + Coverage | Word2Vec | 1.79 | 1.79 | 0.25 |
| rnn-ext + abs + RL | - | 2.79 | 2.89 | 0.48 |

6.1 Human Evaluation

Prior to the human evaluation of model-generated summaries, we randomly took 100 samples from the test set, and three native speakers examined them to measure the human summary scores. We then evaluated the performance of six baseline models by randomly selecting 100 samples from the test set and checking the quality of the model-generated summaries. Three evaluators performed this qualitative analysis. The evaluators were asked to rate each sample (from 0 to 4) on each of the Relevance, Readability, and Creativity parameters (Table 2). Table 8 reports the average scores of all three evaluators for the respective parameters. It is clear that all the models were very poor at generating novel words, though novel sentence formations were found occasionally. The ML and RL model variants were effective in identifying relevant information from the original article. The human evaluation also confirmed that the ML+RL model generates more readable summaries than RL alone. Due to the limited corpus size, the models were incapable of generating novel sentences. RL with fastText often generated incomplete sentences; on the other hand, the same model obtained the highest ROUGE-L score. ML with intra attention + Word2Vec also performed fairly well in the human evaluation.

7 Conclusions

We address the noticeable lack of good-quality summarization datasets for Telugu by developing a manually created and curated news summarization dataset containing 29,309 article-summary pairs. Quality is ensured by taking Relevance, Coverage, Novelty/Creativity, and Readability into account. We also present an annotation/manual summarization tool to expedite the process of manual summary creation. Further, we evaluated the performance of standard summarization methods on the dataset to provide benchmarks for future work. We hope this dataset and effort will spark a move towards manually generated summarization datasets for more human-like summarization models.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

B Mohan Bharath, B Aravindh Gowtham, and M Akhil. 2022. Neural abstractive text summarizer for telugu language. In Soft Computing and Signal Processing, pages 61–70. Springer.

Shubham Bhosale, Diksha Joshi, V Bhise, and RA Deshmukh. 2018. Marathi e-newspaper text summarization using automatic keyword extraction technique. International Journal of Advance Engineering and Research Development, 5(3):789–792.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Aqil Burney, Badar Sami, Nadeem Mahmood, Zain Abbas, and Kashif Rizwan. 2012. Urdu text summarizer using sentence weight algorithm for word processors. International Journal of Computer Applications, 46(19):38–43.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.

Ruchita Damodar, Anuraag Ramineni, and Rishita Konda. 2021. Telugu text summarization. International Research Journal of Modernization in Engineering Technology and Science, India.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283.

M Hanumanthappa, M Narayana Swamy, and NM Jyothi. 2014. Automatic keyword extraction from dravidian language. International Journal of Innovative Science Engineering and Technology, 1(8):87–92.

Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics.

Saran Shantikumar and Helen Barratt. 2009. Methods of sampling from a population. [Online; accessed 01-August-2021].

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28:1693–1701.

Jagadish S Kallimani, KG Srinivasa, and B Eswara Reddy. 2016. Statistical and analytical study of guided abstractive text summarization. Current Science, pages 69–72.

Jagadish S Kallimani, KG Srinivasa, et al. 2010. Information retrieval by text summarization for an indian regional language. In Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010), pages 1–4. IEEE.

Jagadish S Kallimani, KG Srinivasa, et al. 2011. Information extraction by an abstractive text summarization for an indian regional language. In 2011 7th International Conference on Natural Language Processing and Knowledge Engineering, pages 319–322. IEEE.

M Humera Khanam and S Sravani. 2016. Text summarization for telugu document. IOSR Journal of Computer Engineering (IOSR-JCE), 18(6):25–28.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345.

Kishore Kumar Mamidala et al. 2021. A heuristic approach for telugu text summarization with improved sentence ranking. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 12(3):4238–4243.

K Usha Manjari. 2020. Extractive summarization of telugu documents using textrank algorithm. In 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), pages 678–683. IEEE.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Reddy Naidu, Santosh Kumar Bharti, Korra Sathya Babu, and Ramesh Kumar Mohapatra. 2018. Text summarization with automatic keyword extraction in telugu e-newspapers. In Smart Computing and Informatics, pages 555–564. Springer.

Courtney Napoles, Matthew R Gormley, and Benjamin Van Durme. 2012. Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), pages 95–100.

Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.

Sagarika Pattnaik and Kumar Nayak. 2020. A simple and efficient text summarization model for odia text documents. Indian Journal of Computer Science and Engineering, 11.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Prasad Pingali, Jagadeesh Jagarlamudi, and Vasudeva Varma. 2008. A dictionary based approach with query expansion to cross language query based multi-document summarization: Experiments in telugu-english. Mumbai, India.

SR Renjith and P Sony. 2015. An automatic text summarization for Malayalam using sentence extraction. In Proceedings of the 27th IRF International Conference.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2017. A neural attention model for sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Kamal Sarkar. 2012. Bengali text summarization by sentence extraction. arXiv preprint arXiv:1201.2240.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Sana Shashikanth and Sriram Sanghavi. 2019. Text summarization techniques survey on telugu and foreign languages. International Journal of Research in Engineering, Science and Management, 2(1).

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Chetana Thaokar and Latesh Malik. 2013. Test model for summarizing text using extraction method. In 2013 IEEE Conference on Information & Communication Technologies, pages 1138–1143. IEEE.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.

D Naga Sudha and Y Madhavee Latha. 2020. Multi-document abstractive text summarization through semantic similarity matrix for telugu language. International Journal of Advanced Science and Technology, 29(1):513–521.
