LTRCsum: Telugu Human-annotated Abstractive Summarization Corpus Collection and Evaluation

Priyanka Ravva, Ashok Urlana, Pavan Baswani Department of Computer Science KCIS, LTRC, IIIT-H {priyanka.ravva, ashok.urlana, pavan.baswani}@research.iiit.ac.in

Lokesh Madasu, Gopichand Kanumolu, Manish Shrivastava Department of Computer Science KCIS, LTRC, IIIT-H {maadasulokesh, chandukanumolu007}@gmail.com, [email protected]

Abstract

Automatic text summarization is a way of obtaining a shorter version of an original document. Commonly used abstractive summarization datasets typically treat summaries as headlines, concatenations of bullet points, single-sentence summaries, or a combination of extractive and abstractive formations. Such summaries often have disjoint sentences and may not cover all the relevant aspects of the original article. Moreover, the sentence(s)/phrase(s) in the summaries are very often copied. We present LTRCsum, a novel human-annotated Telugu abstractive summarization corpus consisting of 29,309 text-summary pairs. We discuss the various challenges involved in data creation for a low-resource language along with the novel summarization guidelines. This work also addresses the evaluation of the created corpus based on Relevance, Readability, and Creativity parameters. Various quality baselines are implemented by incorporating five existing models. The ML with intra attention + Word2Vec model outperformed all the baselines, with R-1 and R-2 scores of 45.27 and 29.22, respectively.

1 Introduction

The availability of high-quality datasets constrains the progress of deep learning approaches for automatic summarization. Contemporary works focused on harvesting data (Hermann et al., 2015; Narayan et al., 2018; Grusky et al., 2018; Rush et al., 2017; Napoles et al., 2012) from the web to build suitable summarization datasets. Most often, web-scraped datasets consist of summaries that distill the source material down to its most important points to inform the reader (Hermann et al., 2015). However, a few datasets provide summaries that explicitly concentrate on 'what is the article about?' by giving a brief overview (Narayan et al., 2018). Such summaries often omit one or more relevant aspects present in the original article. The study of Hasan et al. (2021) shows that the presence of extra information in the CNN/Daily Mail and XSUM datasets makes the abstractive summarization task more challenging. This extra information enters the summaries either through the attempt to create a shorter summary within a constrained word limit or through the personal opinions/interpretations of the professional writers in the specified domain.

Widely used abstractive summarization datasets typically consider summaries as one of the following forms: headlines (Napoles et al., 2012; Rush et al., 2017), concatenations of bullet points (Hermann et al., 2015), single-sentence summaries (Narayan et al., 2018), or a combination of extractive

& abstractive formations (Grusky et al., 2018). These summaries are often created in a shorter format to attract readers or for marketing purposes. Such summaries often have disjoint sentences and may not cover all the relevant aspects of the original article. Moreover, the sentence(s)/phrase(s) in summaries are often copied. By avoiding such shortcomings in LTRCsum, we attempt to create a high-quality human-annotated corpus for the Telugu language. The majority of the summarization datasets are available for English only. Most of the Indian languages do not have high-quality summarization datasets (see Table 1). Despite being widely spoken in the southern parts of India, with more than 80 million native speakers, Telugu is considered an under-resourced language by the natural language processing community. Specifically, for the Indian-language summarization task, apart from the XL-SUM (Hasan et al., 2021) corpus (directly scraped from BBC¹), no other benchmark datasets are available, for two reasons. The former is that creating an abstractive summarization dataset is an expensive and time-consuming task, and the latter is the unavailability of standard manual summarization guidelines. To the best of our knowledge, LTRCsum is the first and the largest human-annotated multi-sentence abstractive summarization corpus for the Telugu language.

In this paper, we present the LTRCsum corpus consisting of 29,309 human-annotated text-summary pairs. In contrast to traditional datasets, we attempted to create a high-quality abstractive summarization corpus by considering relevance & coverage, readability, and creativity parameters. We propose novel annotation guidelines and filtered the summaries by following rigorous quality assessment criteria to preserve the dataset's quality. As detailed in Table 5, LTRCsum maintains high coverage, where the average length of the summary is approximately 50% of the length of the original article. Along with that, each summary is coherent with novel sentence formations. To speed up the manual summarization process, we introduced a summarization tool and integrated it with intrinsic evaluation metrics (token compression ratio and novel n-gram ratio) to further reduce the copying percentage and increase the novelty of the summaries.

Moreover, we adapted several baselines to evaluate the effectiveness of the LTRCsum corpus. The benchmark models are trained with various word embeddings (Word2Vec, fastText skip-gram and CBOW). We performed experiments ranging from simple seq2seq architectures to pointer-generator mechanisms and reinforcement learning approaches. Along with that, we used transformer architectures and also fine-tuned multilingual pretrained models (mT5). The ML with intra attention approaches show promising results.

Our major contributions can be summarized as follows:

• We discuss in detail the pipeline for constructing the human-annotated abstractive summarization corpus for the Telugu language and release 29,309 text-summary pairs.

• We also release the manual summarization and evaluation guidelines for this task.

• To assess the quality of the human-annotated dataset, we compare LTRCsum with existing Telugu datasets.

• We implement five quality baselines for the automatic abstractive summarization task and perform extensive analysis.

2 Related Work

Performing text summarization for low-resource languages has been a long-standing problem in Natural Language Processing. The mechanisms involved in abstractive and extractive summarization differ significantly but share some common threads such as salience and coverage. Indian-language summarization has been attempted using different approaches, varying from statistical to linguistic-based and from pure machine learning to hybrid methods.

Extractive Summarization Approaches: The main challenge in extractive summarization is to effectively choose the important sentences and arrange them in proper order. To achieve this, initial attempts were made towards Telugu summarization using heuristic-based approaches (Damodar et al., 2021) and frequency-based and clustering approaches (Khanam and Sravani, 2016). Due to the out-of-order extraction in the k-means method, only the summaries produced by the frequency-based approach made sense. The challenge of ranking the relevant sentences in a document was addressed with the help of the TextRank (Manjari, 2020) and PageRank (Damodar et al., 2021) algorithms.

¹ https://www.bbc.com/

In addition, the multi-document summarization task (Pingali et al., 2008; Y Madhavee Latha, 2020) has also been attempted for Telugu. In contrast to our work, the majority of Indian-language summarizers were built to perform extractive summarization (Kallimani et al., 2010; Renjith and Sony, 2015; Pattnaik and Nayak, 2020; Sarkar, 2012; Thaokar and Malik, 2013; Hanumanthappa et al., 2014; Bhosale et al., 2018; Burney et al., 2012) due to its ease of implementation.

Abstractive Summarization Approaches: Some abstractive summarizers were designed using information extraction methods (Kallimani et al., 2011) and automatic keyword extraction (Naidu et al., 2018), which uses POS tags to generate headlines. Such a summarizer includes word cues, keyword extraction, sentence selection, sentence extraction, and summary generation modules to analyze the textual data and determine the key features of the summary. Recent works on building neural abstractive summarization systems were attempted by directly scraping text-summary pairs from BBC Telugu². Unlike our implementation with the baseline models (See et al., 2017; Sutskever et al., 2014; Paulus et al., 2017), the authors of XL-SUM (Hasan et al., 2021) only implemented a multilingual abstractive summarizer with the mT5 transformer model.

Table 1: Comparative Study of Telugu Summarization Datasets

| Author | Type of summarization | Dataset size | Source | Categories | Techniques used |
| Khanam and Sravani (2016) | Extractive | 1 | News website | Politics | Frequency-based approach |
| Kallimani et al. (2010) | Abstractive | 1 | News website | News document | Keyword extraction approach |
| Shashikanth and Sanghavi (2019) | Abstractive | 1 | Unknown | Unknown | K-means clustering and frequency-based approach |
| Kallimani et al. (2016) | Abstractive | 30 | Unknown | biographies, natural disasters, reviews of products, cultural events, cricket | Template-based approach |
| Naidu et al. (2018) | Extractive | 450 | Sakshi, Andhrajyothy, Andhrabhoomi | Unknown | Keyword extraction approach |
| Bharath et al. (2022) | Abstractive | 2000 | Telugu news websites | politics, entertainment, sports, business, national | Seq2seq + attention |
| Mamidala et al. (2021) | Extractive | 360 | Eenadu, Sakshi, Namaste Telangana | Unknown | Heuristic-based approach |
| Y Madhavee Latha (2020) | Abstractive | 8 | Unknown | Sports | RNN + attention mechanism |
| Hasan et al. (2021) | Abstractive | 13,205 | BBC | National, International | Multilingual pretrained model (mT5) |

Comparative Study of Existing Datasets:

We performed a comparative study of existing Telugu summarization datasets (see Table 1) and observed that the majority of the datasets contain fewer than 1,000 text-summary pairs, which is a constraint for implementing neural methods. Most of the datasets are related to the news domain, and apart from XLSUM (Hasan et al., 2021), no other data/code is publicly accessible. All the datasets were crawled from the web, and none of them followed any quality assessment (human evaluation) criteria to assess dataset quality. We also performed a comparative analysis of existing low-resource summarization datasets for non-Telugu languages (see the supplementary material). In addition, we performed a comparative analysis (see Section 4.2) with the XLSUM corpus to measure the quality of the LTRCsum corpus.

3 Corpus Creation

3.1 Source

For raw data, we crawled news articles from popular Telugu news websites such as Prajasakthi³, Surya⁴, and Vartha⁵. This includes news articles from various domains such as sports, movies, politics, business, crime, health, and technology. For ease of handling and automatic filtering, we discarded articles containing any non-Telugu content. After performing the necessary preprocessing steps, articles with more than three sentences were considered for the summarization task.

² https://www.bbc.com/telugu
³ https://prajasakti.com/
⁴ https://telugu.suryaa.com/index.html
⁵ https://www.vaartha.com/
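As a rough illustration of this filtering step, the sketch below keeps only articles written entirely in Telugu script (Unicode block U+0C00-U+0C7F, tolerating digits, punctuation, and whitespace) and having more than three sentences. The exact rules used in our pipeline may differ; this is a minimal sketch, not the production code.

```python
import re

# Telugu Unicode block; digits, whitespace, and basic punctuation are tolerated.
_ALLOWED = re.compile(r"^[\u0C00-\u0C7F0-9\s.,:;!?\"'()%/-]+$")

def sentence_count(text):
    # Telugu news text generally uses the full stop as the sentence terminator.
    return len([s for s in text.split(".") if s.strip()])

def keep_article(text):
    """True if the article is Telugu-only and has more than three sentences."""
    return bool(_ALLOWED.match(text)) and sentence_count(text) > 3
```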

3.2 Manual Summarization

To perform the manual summarization task, we selected 347 highly proficient Telugu native speakers. All of them are pursuing graduation at reputed universities in Andhra Pradesh and Telangana in India. To the best of our knowledge, there are no publicly available guidelines for manual summarization and evaluation of a summarization corpus. Thus the first task at hand was to analyze the properties of summaries in detail and identify the important parameters that an abstractive summary must have, namely Relevance and Coverage⁶, Readability, and Creativity. These parameters resulted in the summarization guidelines for the task. Also, while creating the summarization guidelines, we identified the most frequent errors with respect to each parameter.

Guidelines and corresponding common mistakes in summary creation:

1. Relevance and Coverage: All the pertinent information conveyed in the source article should be captured in the summary while discarding the irrelevant information. Redundant information or information unrelated to the major topic of the article may be considered irrelevant.

• Missing important information: A summary has to cover all the important aspects of the original article.
• Including irrelevant information: A summary should not include any irrelevant information such as personal opinion(s), out-of-context details, or inappropriate factual details.
• Redundant information: A summary should not contain any repetitive phrases/sentences.

2. Readability: If the summary is understandable by a native speaker without looking at the source article, it is considered "Readable". Bad grammar, pronouns that cannot be resolved within the summary, and unnatural sentential/phrasal structures would make the summary difficult to understand. Also, the summary should stand as an independent article, and the reader should not need the original article to understand it fully.

• Disjoint sentences: While paraphrasing, sentences should be joined in such a way that the composite sentence is meaningful.
• Anaphora issue: In the summary, all pronouns should be used only after mentioning the original noun.
• Disordering of sentences: The summary should be coherent to convey the proper context of the original article.
• Not readable: The summary should be free from any syntactic and semantic errors.

3. Creativity: Since this is an abstractive summarization task, we require the summaries to have novelty in terms of sentential structures, such as lexical choices (vocabulary other than that used in the given article), phrasal constructions, and sentence formations.

• Missing novel sentence structure: The summary should contain novel sentence structures (with or without using novel words) compared to the original article.
• Lengthy summary: The summary should be a new, shorter text that conveys the most crucial information of the original article.
• Half abstractive and half extractive: The summary should not be a combination of extractive and abstractive summary formations.
• Sentence-level summary: The summary should not be created by just altering words/phrases in individual sentences.

The above-mentioned common mistakes are explained further, with examples, in the supplementary material. As a pilot study, we assigned five samples to each annotator for sample summarization and evaluation before the actual task. Here, a sample is an article. In the actual task, each annotator was assigned 50 samples. Annotators were given instructions for preprocessing the original article and creating an abstractive summary.

⁶ Relevance and Coverage are merged as one metric.

4 Corpus Evaluation

The summaries collected from the annotators were then evaluated to measure their quality. As detailed in Table 2, we propose summary evaluation guidelines to assess the quality of the annotated data. To deal with exceptions, where a single mistake can lead to reduced scores in more than one parameter, we identified a few agreements between the evaluators, discussed in the supplementary material.

Table 2: Abstractive Summarization Evaluation Metrics

| Score | Relevance | Readability | Creativity |
| 0 | No relevant information is covered or the entire summary is irrelevant | Summary is not at all understandable | Summary consists entirely of sentences copied verbatim from the original article |
| 1 | Only one relevant piece of information is covered or most of the summary is irrelevant | Most of the summary contains unnatural sentence structures and frequent grammatical errors | Most of the sentences are copied from the original article and novel words are rare |
| 2 | Half of the relevant information is present or half of the irrelevant information is covered | Approximately half of the summary contains unnatural sentential/phrasal structures, which make the summary difficult to understand | Approximately half of the summary contains novel sentence structures and the remaining half is copied from the original article, or a few sentences are generated but have inaccurate meaning |
| 3 | One piece of relevant information is missing, or one piece of irrelevant information is added | Summary is understandable but contains some errors in grammar, punctuation or spelling | Most of the summary is novel, but some of the non-factual content is copied verbatim |
| 4 | Everything is relevant and all the relevant information is covered | Summary is understandable and free from grammatical, punctuation and spelling mistakes | The entire summary, except the factual information (names, dates etc.), is novel |

4.1 Evaluation Process

The evaluation process consists of two phases: the former is assigning scores (ranging from 0 to 4) for the Relevance, Readability, and Creativity metrics; the latter is providing the necessary feedback to the annotators for improving summarization quality.

Assigning Scores: Our evaluation team consists of 4 highly proficient native Telugu speakers. Since manually evaluating all the samples in each submission (i.e., the 50 samples allotted to each annotator) is tedious, we used a cluster-based sampling method (Helen Barratt, 2009) for evaluation. Each submission is divided into three sub-portions (of 17, 17 & 16 samples), and evaluators were instructed to evaluate a minimum of 30% of the samples in each portion.
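A small sketch of this sampling scheme, assuming each submission is simply a list of 50 article-summary pairs; the split sizes and the 30% floor follow the description above, while the function and variable names are illustrative.

```python
import random

def sample_for_evaluation(submission, seed=0):
    """Split a 50-sample submission into portions of 17/17/16 and draw at
    least 30% of each portion for manual evaluation."""
    rng = random.Random(seed)
    portions = [submission[:17], submission[17:34], submission[34:]]
    selected = []
    for portion in portions:
        k = max(1, round(0.3 * len(portion)))
        selected.extend(rng.sample(portion, k))
    return selected
```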

Followup & Feedback: Along with the scores, the evaluators were instructed to provide feedback for each submission. The objective of providing feedback is to make the annotators understand the significance of a human-annotated corpus by explaining the minimum criteria of a valid abstractive summary and to train them to meet those criteria. During the feedback and follow-up process, our evaluators encountered the following most frequent summarization errors:

• Unnatural sentence formations (though meaningful, the sentence structures were too complex and rare to find in Telugu literature).

• Sentences with syntactic and semantic errors.

• Misunderstanding the original meaning of the article or adding personal opinions or biases.

• The length of the summary, which was always a concern.

• Misidentifying the relevant and irrelevant parts of the information in the original article.

Automation of the Data Creation and Evaluation Process: In order to resolve the frequent errors in the data summarization task, we introduced annotation and evaluation tools (refer to the supplementary material) to automate the summarization and evaluation process. These tools help reduce time consumption and the overall task complexity. The data creation (annotation) tool can organize all text files and load the contents of any file for modification. It provides an easy way to navigate to a specific file and modify its contents, with an auto-saving option. Similarly, the evaluation tool can load an article-summary pair and store the scores and feedback given to the respective pair. Later, the feedback can be generalized to provide feedback/suggestions for the entire submission.

Integrating the Intrinsic Evaluation Metrics into the Web Interfaces: Even though the annotation tool significantly reduces the time required to finish the overall task, it alone fails to yield quality summaries because of sentence(s)/phrase(s) copied from the respective original articles. Hence, both tools were integrated with the intrinsic evaluation metrics (token compression ratio and novel n-gram (trigram) ratio) to restrict the copying percentage. Offline Google input tools⁷ were included in the annotation tool to avoid the usage of third-party tools.

• Token/Sentence Compression Ratio: The number of tokens in the summary should be less than or equal to 60% of the number of tokens/sentences in the article.

• Novel n-gram Ratio: The summary should contain at least 25% novel n-grams compared to the article. Here, we consider the n-grams to be trigrams.

While creating the summary, the annotator can see the maximum number of words/sentences allowed in the summary and the novel n-gram percentage in the web interface; a sketch of these two metrics is given below.
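To make the two checks concrete, the following is a minimal sketch of how the intrinsic metrics can be computed; it assumes simple whitespace tokenization and is not the exact implementation used in our tools.

```python
def ngrams(tokens, n=3):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def token_compression_ratio(article, summary):
    """Summary length as a percentage of article length (whitespace tokens)."""
    art_tokens, sum_tokens = article.split(), summary.split()
    return 100.0 * len(sum_tokens) / max(len(art_tokens), 1)

def novel_trigram_ratio(article, summary):
    """Percentage of summary trigrams that never occur in the article."""
    art_tri = ngrams(article.split(), 3)
    sum_tri = ngrams(summary.split(), 3)
    if not sum_tri:
        return 0.0
    novel = [t for t in sum_tri if t not in art_tri]
    return 100.0 * len(novel) / len(sum_tri)

# Acceptance check mirroring the thresholds shown in the web interface
def passes_interface_checks(article, summary):
    return (token_compression_ratio(article, summary) <= 60.0
            and novel_trigram_ratio(article, summary) >= 25.0)
```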

Quality Assessment: With the above-mentioned procedure we collected a total of 83,370 text-summary pairs across all stages (refer to Table 3). Among these, 54,061 pairs did not meet the following criteria and were therefore not accepted as high-quality abstractive summaries.
Criteria-1: All the evaluated article-summary pair scores were averaged parameter-wise (relevance, readability, and creativity) for each portion (17, 17, 16) of the submission. The portions with an average score ≥ 3 for all three parameters were then filtered by applying Criteria-2 (a sketch of this filter appears after the list):

• The number of sentences in the article should be ≥ 4.

• The token compression ratio w.r.t. the article should be ≤ 60%, the sentence compression ratio ≤ 50%, and 25 ≤ trigram novelty ≤ 95.

• The number of tokens in the article should be ≥ 40 and in the summary ≥ 10.
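The following is a minimal sketch of Criteria-2 as a single filtering function, assuming whitespace tokenization and a naive full-stop-based sentence splitter; it is an illustration of the thresholds above rather than our actual filtering script.

```python
def meets_criteria_2(article, summary):
    """Return True if an article-summary pair satisfies Criteria-2."""
    def sents(text):
        # Naive sentence splitter on the full stop; illustrative only.
        return [s for s in text.split(".") if s.strip()]

    def trigrams(tokens):
        return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

    art_tok, sum_tok = article.split(), summary.split()
    if len(sents(article)) < 4 or len(art_tok) < 40 or len(sum_tok) < 10:
        return False
    token_cr = 100.0 * len(sum_tok) / len(art_tok)
    sent_cr = 100.0 * len(sents(summary)) / len(sents(article))
    sum_tri = trigrams(sum_tok)
    novelty = 100.0 * len(sum_tri - trigrams(art_tok)) / max(len(sum_tri), 1)
    return token_cr <= 60.0 and sent_cr <= 50.0 and 25.0 <= novelty <= 95.0
```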

With the above-mentioned criteria, we obtained 29,309 samples as the final gold samples (see Table 4), which were used to build the automatic abstractive summarization models. We also measured Fleiss' Kappa⁸ Inter-Annotator Agreement (IAA) scores to see how uniformly the annotators understood the guidelines and how reproducible the summarization task is. The IAA score is calculated using Equation 1:

    κ = (P̄ − P̄_e) / (1 − P̄_e)        (1)

We took a sample of 500 text-summary pairs to compute the IAA and obtained a score of κ = 0.885, which corresponds to a very high level of agreement.
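For reference, here is a small sketch of how Fleiss' κ in Equation 1 can be computed directly from a ratings matrix (rows are items, columns count how many annotators assigned each score from 0 to 4); it is a textbook implementation of the formula, not our exact evaluation script, and the example ratings are made up.

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_items, n_categories) array; counts[i, j] is the number of
    annotators who assigned category j to item i. Each row sums to the
    (constant) number of annotators per item."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item agreement P_i and its mean P-bar
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement P-bar_e from the category marginals
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_bar_e = np.square(p_j).sum()
    return (p_bar - p_bar_e) / (1 - p_bar_e)

# Example: 3 annotators scoring 4 items on a 0-4 scale (category counts per item)
ratings = [[0, 0, 0, 1, 2],
           [0, 0, 0, 3, 0],
           [0, 0, 1, 2, 0],
           [0, 0, 0, 0, 3]]
print(round(fleiss_kappa(ratings), 3))
```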

Essence of the Annotation Tool and Metrics: As detailed in Table 3, without using the annotation tool, we obtained 39.67% quality data, but the majority of the annotators spent 10 to 13 hours to finish the task. In order to reduce the overall time taken to complete the task, we introduced the web interfaces.

⁷ https://typingkeyboards.com/google-input-tool-telugu/
⁸ https://en.wikipedia.org/wiki/Fleiss'_kappa

Table 3: Data Collection and Evaluation Statistics

| Setting | # Annotators | # Sentences per article | Evaluation time (m) | Data collected | Quality data (%) |
| Without Tool | 110 | 3 - 6 | 53.8 - 75 | 30000 | 39.67 |
| With Tool | 120 | 6 - 9 | 52.5 - 67.5 | 40000 | 22.91 |
| Tool + Metrics | 117 | 10+ | 60 - 102.5 | 13370 | 61.66 |

Table 4: LTRCsum Statistics

| Statistic | Train (Text / Summary) | Validation (Text / Summary) | Test (Text / Summary) |
| Pairs | 24403 | 2453 | 2453 |
| Unique Words (UW) | 230232 / 151180 | 53086 / 32933 | 53880 / 33236 |
| Unique Lemmas (UL) | 191864 / 123230 | 39446 / 24051 | 40091 / 24353 |
| (Min, Max) words | (40, 502) / (10, 227) | (40, 332) / (12, 146) | (40, 240) / (13, 105) |
| Avg words | 98.1 / 43.8 | 99.5 / 45.6 | 102.4 / 46.7 |
| Avg sentences | 8.7 / 2.9 | 8.4 / 2.9 | 8.4 / 2.9 |
| Avg (UW, UL) | (9.4, 7.9) / (6.1, 5) | (21.6, 16.1) / (13.4, 9.8) | (22, 16.3) / (46.7, 9.9) |

Figure 1: Average time consumption for creating 50 summaries.

From Figure 1 it is evident that even when the article length increases (sentences ranging from 6 to 9), most annotators managed to finish the task in approximately the same duration (10 to 13 hours). However, on the other hand, we obtained only 22.9% quality data. To increase the percentage of quality data, we integrated the intrinsic evaluation metrics into both interfaces. As a result, we obtained 61.66% quality data, and the majority of the annotators expressed that the complexity of the task was moderate. However, most of the annotators had to spend 15+ hours to finish the task, due to an increase in the number of sentences in the articles (ranging from 10 to 17). We also observed that more than 70% of annotators preferred to use offline Google input tools to type the Telugu text while creating the summary. Table 3 also presents the average minimum and maximum time taken for the random evaluation of 12-16 samples in a set of 50 samples along with the corresponding feedback.

Table 5: Intrinsic evaluation of our LTRCsum dataset compared to CNN/Daily Mail and XLSum. All values are reported as percentages for easier comparison.

| Metric | CNN-DM | XLSUM [Telugu] | LTRCsum |
| Novel n-gram ratio: Unigram | 13.62 | 45.2 | 21.74 |
| Novel n-gram ratio: Bigram | 53.74 | 83.12 | 47.81 |
| Novel n-gram ratio: Trigram | 73.16 | 94.06 | 63.51 |
| Novel n-gram ratio: Four-gram | 81.85 | 97.57 | 73.68 |
| Compression | 93.06 | 94.44 | 54.65 |
| Coverage | 86.38 | 54.8 | 78.26 |
| Density | 5.08 | 1.04 | 7.22 |
| Abstractivity | 26.94 | 54.82 | 30.96 |
| Redundancy: Unigram | 24.64 | 12.01 | 20.78 |
| Redundancy: Bigram | 2.70 | 0.83 | 2.49 |

4.2 Comparison of Automatically vs. Manually Created Datasets

To compare the quality of the LTRCsum corpus with XLSUM-Telugu (Hasan et al., 2021), we performed intrinsic and human evaluations.

Intrinsic Evaluation: The intrinsic evaluation was done on random samples of 250 taken from the CNN/Daily Mail, XLSUM, and LTRCsum test sets. We used the intrinsic evaluation metrics of Grusky et al. (2018), namely abstractivity, coverage, density, coherence, compression, and novel n-gram ratio, to compare the quality of the LTRCsum corpus with respect to benchmark datasets. Table 5 shows that LTRCsum has more novel unigrams, higher abstractivity, and lower redundancy compared to the CNN/Daily Mail data (Hermann et al., 2015). Most of the summaries in LTRCsum cover all the relevant aspects of the original article; as a result, LTRCsum has better coverage than XLSUM. Moreover, the XLSUM corpus contains summaries of at most two sentences (23 tokens), so XLSUM has a better compression percentage than LTRCsum.
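For readers unfamiliar with the Newsroom-style statistics in Table 5, the following is a rough sketch of extractive fragment coverage and density in the spirit of Grusky et al. (2018); it works on whitespace tokens with a greedy longest-match search and is meant as an illustration, not a re-implementation of the original tooling.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect maximal token spans shared by summary and article."""
    fragments, i = [], 0
    while i < len(summary_tokens):
        best = []
        for j in range(len(article_tokens)):
            if article_tokens[j] == summary_tokens[i]:
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                if k > len(best):
                    best = summary_tokens[i:i + k]
        if best:
            fragments.append(best)
            i += len(best)
        else:
            i += 1
    return fragments

def coverage_and_density(article, summary):
    """Coverage: fraction of summary tokens inside shared fragments.
    Density: average squared fragment length per summary token."""
    art, summ = article.split(), summary.split()
    frags = extractive_fragments(art, summ)
    coverage = sum(len(f) for f in frags) / max(len(summ), 1)
    density = sum(len(f) ** 2 for f in frags) / max(len(summ), 1)
    return coverage, density
```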

Table 6: Human evaluation of XLSUM[Te] and LTRCsum. Except for the average scores [0 - 4], all values are reported as percentages.

| Parameter | Avg score (XLSUM / LTRCsum) | % Samples ≥ 3 (XLSUM / LTRCsum) | % Samples < 3 (XLSUM / LTRCsum) | Common mistakes | % of samples (XLSUM / LTRCsum) |
| All three parameters | - | 12 / 91 | 88 / 9 | - | - |
| Relevance | 1.34 / 3.51 | 12 / 91 | 88 / 9 | Missing important information | 63.6 / 6.4 |
| | | | | Including irrelevant information | 24.4 / 2.6 |
| Readability | 3.19 / 3.57 | 77 / 97 | 23 / 3 | Understandability issue / not readable | 16.86 / 2.65 |
| | | | | Coherence issue | 6.14 / 0.13 |
| | | | | Unnatural sentence formations | 0 / 0.22 |
| Creativity | 1.79 / 3.61 | 23 / 94 | 77 / 6 | Copied sentences found | 0 / 0.2 |
| | | | | Lengthy | 0 / 1.6 |
| | | | | Novel sentence structure is missing | 3.67 / 4.2 |
| | | | | Out-of-context information present | 73.33 / 0 |

Human Evaluation: Human evaluation was done by three Telugu native speakers on 100 random samples from the XLSUM-Telugu (Hasan et al., 2021) and LTRCsum test sets. We used the evaluation metrics in Table 2 for the human evaluation. As detailed in Table 6, we calculated the percentage of samples with scores less than 3 and greater than or equal to 3 for each parameter. Additionally, we calculated the percentage contribution of each common mistake. For instance, 88% of the samples in the XLSUM dataset obtained scores of less than 3 for relevance; of these, 63.6% of the samples were missing relevant information and 24.4% added irrelevant information, whereas in the LTRCsum corpus, only 9% of the samples obtained scores of less than 3 for relevance. The XLSUM corpus is harvested from the BBC news website and treats the first one or two sentences of each article as the summary and the remaining part as the article content. As a result, most of the articles seem incoherent. We observed that XLSUM articles often contain noisy data (image/video or ad-referring content). The summaries lack coverage of the majority of the details present in the original article, include out-of-context information, and often focus on irrelevant aspects of the original article. As detailed in Table 6, LTRCsum outperformed XLSUM on all parameters (average scores of relevance, readability, and creativity) in the human evaluation.

5 Experiments and Results

We trained and evaluated several existing summarization models to understand the challenges of LTRCsum and its effectiveness for training systems. We trained the models with Word2Vec (Mikolov et al., 2013) embeddings (pre-trained on Telugu Wikipedia) and fastText (CBOW and skip-gram variants) (Bojanowski et al., 2017). One set of experiments was performed without pre-trained embeddings. Apart from these, we evaluated reinforcement learning methods and transformer-based architectures, and we also tried multilingual pretrained baselines.
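As an illustration of how pre-trained embeddings can be plugged into these models, the sketch below builds an embedding matrix for a fixed vocabulary from Word2Vec or fastText vectors loaded with gensim; the file path and vocabulary object are hypothetical placeholders, and this is not our exact training code.

```python
import numpy as np
from gensim.models import KeyedVectors

def build_embedding_matrix(vocab, kv, dim):
    """vocab: dict mapping token -> index (e.g., the 50k most frequent tokens).
    kv: gensim KeyedVectors; dim: embedding size (60 for our Word2Vec setting,
    300 for fastText). Out-of-vocabulary tokens get small random vectors."""
    matrix = np.random.uniform(-0.05, 0.05, (len(vocab), dim)).astype("float32")
    for token, idx in vocab.items():
        if token in kv:
            matrix[idx] = kv[token]
    return matrix

# Hypothetical usage: load Telugu Word2Vec vectors and build the matrix
# kv = KeyedVectors.load_word2vec_format("telugu_word2vec_60d.bin", binary=True)
# emb = build_embedding_matrix(vocab, kv, dim=60)
```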

5.1 Benchmark Models

To perform the automatic summarization task, we implemented the sequence-to-sequence (Sutskever et al., 2014) Recurrent Neural Network (RNN) model with an attention mechanism (Bahdanau et al., 2014) and the pointer-generator (See et al., 2017) with a coverage mechanism. Further, we used the novel intra-attention mechanism (Paulus et al., 2017) with reinforcement learning. The architecture proposed by Chen and Bansal (2018) is a combination of extractive and abstractive approaches. A novel document-level encoder (Liu and Lapata, 2019) using Bidirectional Encoder Representations from Transformers (BERT) can be used for both extractive and abstractive summarization. We also fine-tuned the multilingual text-to-text transfer transformer (mT5) (Xue et al., 2020) on the LTRCsum corpus. A detailed explanation of the benchmark models and the experimental details is provided in the supplementary material.

5.2 Experimental Setup

We divided the corpus into 80% for training and 10% each for development and testing. The split is designed so that the portions are comparable in terms of article length, token/sentence compression ratio, and novel n-gram ratio. Our experiments used an output vocabulary of size 50,000 built from the most frequent tokens in the training set, and the models were trained for 100,000 iterations (equivalent to 32.78 epochs). We used a batch size of 8 and a learning rate of 0.001 for ML training and 0.0001 for RL and ML+RL training. At test time, we used a beam size of 4. We limited the maximum article length to 400 tokens and the summary length to 100 tokens. The input word embedding dimensions were 60, 300, and 300 for Word2Vec, fastText CBOW, and fastText skip-gram, respectively. In the case of non-pretrained embeddings, we used 256 dimensions.

Table 7: ROUGE-1, ROUGE-2, ROUGE-L scores for the various baselines

| Model | Embeddings | R-1 | R-2 | R-L |
| Sequence2Sequence | Without embeddings | 3.92 | 0.397 | 3.91 |
| | Word2Vec | 7.48 | 1.08 | 7.46 |
| | fastText CBOW | 3.64 | 0.38 | 3.62 |
| | fastText SG | 3.38 | 0.33 | 3.36 |
| Pointer-Generator | Without embeddings | 32.59 | 10.79 | 32.47 |
| | Word2Vec | 33.13 | 11.04 | 32.98 |
| | fastText CBOW | 31.14 | 10.94 | 31.01 |
| | fastText SG | 30.78 | 9.73 | 30.58 |
| Pointer-Generator + Coverage | Without embeddings | 32.52 | 10.93 | 32.34 |
| | Word2Vec | 33.27 | 11.28 | 33.08 |
| | fastText CBOW | 31.76 | 10.69 | 31.55 |
| | fastText SG | 31.38 | 10.68 | 31.11 |
| ML with intra attention | Without embeddings | 43.38 | 27.90 | 42.59 |
| | Word2Vec | 45.27 | 29.22 | 44.09 |
| | fastText CBOW | 44.09 | 28.17 | 43.16 |
| | fastText SG | 44.49 | 28.85 | 43.87 |
| ML without intra attention | Without embeddings | 43.18 | 27.63 | 41.97 |
| | Word2Vec | 44.54 | 28.77 | 43.38 |
| | fastText CBOW | 44.04 | 28.23 | 42.97 |
| | fastText SG | 43.35 | 28.05 | 42.53 |
| ML + RL | Without embeddings | 43.74 | 28.02 | 42.44 |
| | Word2Vec | 45.08 | 29.18 | 43.73 |
| | fastText CBOW | 43.76 | 27.87 | 42.28 |
| | fastText SG | 44.34 | 28.63 | 43.08 |
| RL | Without embeddings | 36.09 | 21.59 | 47.19 |
| | Word2Vec | 38.71 | 24.46 | 49.45 |
| | fastText CBOW | 39.14 | 24.55 | 48.59 |
| | fastText SG | 40.03 | 24.84 | 49.49 |
| rnn-ext + abs + RL | - | 34.26 | 12.16 | 34.09 |
| rnn-ext + abs + RL + rerank | - | 34.77 | 12.18 | 34.59 |
| TransformerAbs | - | 30.77 | 18.20 | 26.71 |
| BertSumAbs | - | 32.24 | 20.04 | 28.07 |
| BertSumExtAbs | - | 33.52 | 20.98 | 29.10 |
| Lead-3 | - | 44.21 | 26.58 | 38.02 |
| rnn-ext + RL (extractive) | - | 34.91 | 13.11 | 34.67 |
| TransformerExt (extractive) | - | 41.57 | 26.54 | 39.51 |
| BertSumExt (extractive) | - | 44.10 | 28.93 | 41.97 |
| mT5-base | - | 41.85 | 24.83 | 34.04 |

6 Discussions

We studied the performance of various models on LTRCsum, as detailed in Table 7. We used the ROUGE (Lin, 2004) metric and report the F1 scores of unigram overlap, bigram overlap, and longest common subsequence to measure the performance of each model. ROUGE measures the n-gram overlap between the reference summary and the candidate summary. In Table 7, each block details the results of a separate model with various pre-trained embeddings. Except for the reinforcement learning approaches, most models show promising results with Word2Vec-based embeddings compared to the other embedding types or no pretrained embeddings. The RL model performed slightly better with fastText skip-gram embeddings than with Word2Vec.
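For illustration, the following is a simplified, language-agnostic sketch of the ROUGE-1/2 and ROUGE-L F1 computation over whitespace tokens; it conveys what the metric measures but is not the evaluation script used to produce Table 7.

```python
from collections import Counter

def _ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(reference, candidate, n):
    """ROUGE-N F1 over whitespace tokens (simplified sketch)."""
    ref = _ngram_counts(reference.split(), n)
    cand = _ngram_counts(candidate.split(), n)
    overlap = sum((ref & cand).values())
    if not ref or not cand or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    r, c = reference.split(), candidate.split()
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if r[i - 1] == c[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    lcs = dp[len(r)][len(c)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```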

A few experiments were performed with extractive-abstractive approaches combined with reinforcement learning (rnn-ext + abs + RL variants). Among these, rnn-ext + abs + RL + rerank gave better results than rnn-ext + abs + RL. In the case of BERT-based models, BertSumExtAbs gave better results than TransformerAbs and BertSumAbs. Since the ML intra-attention mechanism works well on longer sequences (Paulus et al., 2017), the ML with intra attention model with Word2Vec embeddings gave the best results of all models in terms of ROUGE-1 and ROUGE-2. Mixed-objective learning with ML+RL generated better and more readable summaries, whereas RL with intra attention gave the best results for ROUGE-L. BERT-based and RL models may improve in performance if the training dataset size is increased.

Table 8: Human evaluation of the best-performing baselines

| Model | Embeddings | Relevance | Readability | Creativity |
| Human summary scores | - | 3.51 | 3.57 | 3.61 |
| ML with intra attention | Word2Vec | 2.91 | 3.57 | 0.3 |
| ML without intra attention | Word2Vec | 2.78 | 3.53 | 0.26 |
| ML + RL with intra attention | Word2Vec | 3.03 | 3.48 | 0.4 |
| RL | fastText SG | 1.79 | 1.79 | 0.25 |
| Pointer generator + Coverage | Word2Vec | 1.79 | 1.79 | 0.25 |
| rnn-ext + abs + RL | - | 2.79 | 2.89 | 0.48 |

6.1 Human Evaluation

Prior to the human evaluation of model-generated summaries, we randomly took 100 samples from the test set, and three native speakers examined them to measure the human summary scores. We then evaluated the performance of six baseline models by randomly selecting 100 samples from the test set and checking the quality of the model-generated summaries. Three evaluators performed this qualitative analysis. The evaluators were asked to rate each sample (from 0 to 4) on each of the Relevance, Readability, and Creativity parameters (Table 2). Table 8 reports the average scores of all three evaluators for the respective parameters. It is clear that all the models were very poor at generating novel words, though novel sentence formations were found occasionally. The ML and RL model variants were effective in identifying relevant information from the original article. The human evaluation also confirmed that the ML+RL model generates more readable summaries than RL alone. Due to the limited corpus size, the models were incapable of generating novel sentences. RL with fastText often generated incomplete sentences; on the other hand, the same model obtained the highest ROUGE-L score. ML with intra attention + Word2Vec also performed fairly well in the human evaluation.

7 Conclusions

We address the noticeable lack of good-quality summarization datasets for Telugu by developing a manually created and curated news summarization dataset containing 29,309 article-summary pairs. Quality is ensured by taking Relevance, Coverage, Novelty/Creativity, and Readability into account. We also present an annotation/manual summarization tool to expedite the process of manual summary creation. Further, we evaluated the performance of standard summarization methods on the dataset to provide benchmarks for future work. We hope this dataset and effort will spark a move towards manually generated summarization datasets for more human-like summarization models.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

B Mohan Bharath, B Aravindh Gowtham, and M Akhil. 2022. Neural abstractive text summarizer for telugu language. In Soft Computing and Signal Processing, pages 61–70. Springer.

Shubham Bhosale, Diksha Joshi, V Bhise, and RA Deshmukh. 2018. Marathi e-newspaper text summarization using automatic keyword extraction technique. International Journal of Advance Engineering and Research Development, 5(3):789–792.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Aqil Burney, Badar Sami, Nadeem Mahmood, Zain Abbas, and Kashif Rizwan. 2012. Urdu text summarizer using sentence weight algorithm for word processors. International Journal of Computer Applications, 46(19):38–43.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.

Ruchita Damodar, Anuraag Ramineni, and Rishita Konda. 2021. Telugu text summarization. International Research Journal of Modernization in Engineering Technology and Science, India.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283.

M Hanumanthappa, M Narayana Swamy, and NM Jyothi. 2014. Automatic keyword extraction from dravidian language. International Journal of Innovative Science Engineering and Technology, 1(8):87–92.

Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics.

Saran Shantikumar and Helen Barratt. 2009. Methods of sampling from a population. [Online; accessed 01-August-2021].

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28:1693–1701.

Jagadish S Kallimani, KG Srinivasa, and B Eswara Reddy. 2016. Statistical and analytical study of guided abstractive text summarization. Current Science, pages 69–72.

Jagadish S Kallimani, KG Srinivasa, et al. 2010. Information retrieval by text summarization for an indian regional language. In Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010), pages 1–4. IEEE.

Jagadish S Kallimani, KG Srinivasa, et al. 2011. Information extraction by an abstractive text summarization for an indian regional language. In 2011 7th International Conference on Natural Language Processing and Knowledge Engineering, pages 319–322. IEEE.

M Humera Khanam and S Sravani. 2016. Text summarization for telugu document. IOSR Journal of Computer Engineering (IOSR-JCE), 18(6):25–28.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345.

Kishore Kumar Mamidala et al. 2021. A heuristic approach for telugu text summarization with improved sentence ranking. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 12(3):4238–4243.

K Usha Manjari. 2020. Extractive summarization of telugu documents using textrank algorithm. In 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), pages 678–683. IEEE.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Reddy Naidu, Santosh Kumar Bharti, Korra Sathya Babu, and Ramesh Kumar Mohapatra. 2018. Text summarization with automatic keyword extraction in telugu e-newspapers. In Smart Computing and Informatics, pages 555–564. Springer.

Courtney Napoles, Matthew R Gormley, and Benjamin Van Durme. 2012. Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), pages 95–100.

Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.

Sagarika Pattnaik and Kumar Nayak. 2020. A simple and efficient text summarization model for odia text documents. Indian Journal of Computer Science and Engineering, 11.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Prasad Pingali, Jagadeesh Jagarlamudi, and Vasudeva Varma. 2008. A dictionary based approach with query expansion to cross language query based multi-document summarization: Experiments in telugu-english. Mumbai, India.

SR Renjith and P Sony. 2015. An automatic text summarization for Malayalam using sentence extraction. In Proceedings of the 27th IRF International Conference.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2017. A neural attention model for sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Kamal Sarkar. 2012. Bengali text summarization by sentence extraction. arXiv preprint arXiv:1201.2240.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Sana Shashikanth and Sriram Sanghavi. 2019. Text summarization techniques survey on telugu and foreign languages. International Journal of Research in Engineering, Science and Management, 2(1).

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Chetana Thaokar and Latesh Malik. 2013. Test model for summarizing text using extraction method. In 2013 IEEE Conference on Information & Communication Technologies, pages 1138–1143. IEEE.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.

D Naga Sudha and Y Madhavee Latha. 2020. Multi-document abstractive text summarization through semantic similarity matrix for telugu language. International Journal of Advanced Science and Technology, 29(1):513–521.
