LongSumm 2021: Session Based Automatic Summarization Model for Scientific Document

Senci Ying, Yanzhao Zheng, Wuhe Zou
NetEase, China
{yingsenci, zhengyanzhao, zouwuhe}@corp.netease.com

Abstract

Most summarization tasks focus on generating relatively short summaries. Such a length constraint might not be appropriate when summarizing scientific work. The LongSumm task requires participants to generate a long summary for each scientific document. Tasks of this kind can usually be solved with language models, but an important problem is that models like BERT are limited by memory and cannot handle an input as long as a full document; generating a long output is hard as well. In this paper, we propose a session-based automatic summarization model (SBAS), which uses a session and ensemble mechanism to generate long summaries. Our model achieves the best performance in the LongSumm task.

1 Introduction

Most document summarization tasks focus on generating a short summary that keeps the core idea of the original document. For long scientific papers, a short abstract is not long enough to cover all the salient information. Researchers often summarize scientific articles by writing a blog, which requires specialized knowledge and a deep understanding of the scientific domain. LongSumm, a shared task of SDP 2021 (https://sdproc.org/2021/sharedtasks.html), opts to leverage blog posts created by researchers that summarize scientific articles, as well as extractive summaries based on video talks from associated conferences (Lev et al., 2019), to address the problem mentioned above.

Most previous methods divide the document by section, use an extraction or abstraction model to predict a summary for each part, and combine the results as the final summary of the document. Section-based methods, however, may drop important information shared among sections, and using only one type of model for prediction cannot make good use of the advantages of different models. Combining the models and solutions described later, we propose an ensemble method based on sessions, as shown in Figure 1.

Figure 1: SBAS: a session based automatic summarization model

We split the task into four steps: session generation, extraction, abstraction, and merging the results at the end. First, we split a document into sessions of a fixed size and use a ROUGE metric to match the ground truth (sentences from the given document's summary). Then we train two different types of model. One is abstraction-based: specifically, we use BIGBIRD (Zaheer et al., 2020), whose sparse attention mechanism reduces the quadratic dependency of full attention to linear, and PEGASUS (Zhang et al., 2020), a pre-trained model specially designed for summarization. The other is extraction-based: we test the performance of TextRank (Mihalcea and Tarau, 2004; Xu et al., 2019), DGCNN (Dilated Gated Convolutional Neural Network) (Su, 2018), and BERTSUMM (Liu, 2019). In the end, for each type of model, we generate the summary from the one with the best performance, and use an ensemble method to merge the summaries together. The results show that our method is effective and beats the state-of-the-art models in this task.
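Before going into details, the skeleton below shows how the four steps could compose. It is a minimal, hypothetical sketch: the four stage functions are placeholders, passed in as callables, for the components described in Section 3, not the authors' released code.

    # Hypothetical skeleton of the SBAS pipeline; each stage is a callable
    # standing in for a component described in Section 3.
    from typing import Callable, List

    def sbas_summarize(document: str,
                       split: Callable[[str], List[str]],    # step 1: sessions (3.1)
                       extract: Callable[[str], str],        # step 2: extraction (3.3)
                       abstract: Callable[[str], str],       # step 3: abstraction (3.2)
                       merge: Callable[[List[str], List[str]], str]) -> str:
        sessions = split(document)
        extractive = [extract(s) for s in sessions]
        abstractive = [abstract(s) for s in sessions]
        # Step 4: ensemble-merge the per-session outputs into one long summary.
        return merge(extractive, abstractive)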
2 Related Work

Common automatic summarization methods are mainly divided into extraction-based and abstraction-based summarization. An extraction-based model extracts sentences and words from the original article through semantic analysis and sentence importance analysis to form the abstract. Typical models include the TextRank algorithm (Mihalcea and Tarau, 2004; Xu et al., 2019), which is based on sentence importance, and extraction methods built on pre-trained models (Liu, 2019). The abstracts obtained by extraction models can better reflect the focus of the article, but because the extracted sentences are scattered across different parts of the article, the coherence of the abstracts remains open to challenge. Abstraction-based models are built on the seq2seq structure, with pre-trained models such as BART (Lewis et al., 2019) and T5 (Raffel et al., 2019) used to achieve better generation quality. Recently, PEGASUS (Zhang et al., 2020), a pre-trained model released by Google, designed its pre-training objective specifically for the summarization task and achieved state-of-the-art performance on all 12 downstream datasets.

This task focuses on the solution for long summaries. The input and output length of traditional models is limited by memory and time consumption, yet this task requires the model to summarize scientific papers and generate very long summaries. To solve this problem, most previous solutions are section based (Li et al., 2020; Roy et al., 2020): they divide scientific papers into sections, generate abstracts for each section, and finally combine them to get the final result. Recently, Google's new model BIGBIRD (Zaheer et al., 2020), which uses a sparse attention mechanism to let the model fit long texts, has become well suited to this task scenario.

3 Method

Pre-trained models play a significant role in the field of automatic summarization, but due to their huge number of parameters, most of them can only be used for short-text tasks. For long articles, there are two common approaches: one is to directly truncate the long article, and the other is to predict section by section. This paper proposes a text segmentation method based on sessions, and uses an ensemble method over the extraction model and the abstraction model to generate the final summary.

3.1 Session Generation

Limited by computational power, many methods choose to truncate long articles directly, which makes the model unable to perceive the later content of the article, so the generated summary can only reflect part of the input text. Others divide the article into sections, but this also raises problems: the length and content of sections differ between articles, so a section-based division may not reflect the relationship between text and abstract well. This paper proposes a segmentation method based on sessions, which divides the article into sessions of a selected size, predicts a summary for each session, and selects the most appropriate window size for this task by tuning the session size. The specific data processing steps are as follows (a code sketch appears after the list):

(1) First, select an appropriate session size (2048 words) and a buffer (128 words); the buffer keeps the last text of the previous session as the context of the current session.

(2) For the generation models, the reference summary is split into sentences, and the corresponding summary sentences are assigned to each session according to the ROUGE metric. To make the model predict summaries that are as long as possible, a greedy matching rule is used to allocate the summary sentences to each session: we first filter the summary sentences with a threshold of 0.7 on the ROUGE score between the session and each sentence, then pick sentences by score until the length we set (256 words by default) is met. Although this may cause different sessions to predict the same summary sentences, duplicates can be detected in later post-processing, and it is more important for the training model to learn to generate long outputs.

(3) For the extraction model, we only need to match each session with its corresponding summary sentences.
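The sketch below illustrates steps (1) and (2). It assumes a simple whitespace word split and uses the rouge-score package as a stand-in for whichever ROUGE implementation the authors used; it also reads the 0.7 threshold as a lower bound for keeping a summary sentence. Only the 2048/128 window sizes, the 0.7 threshold, and the 256-word cap come from the text above.

    # Sketch of session generation and greedy ROUGE matching (Section 3.1).
    from rouge_score import rouge_scorer  # assumed stand-in ROUGE implementation

    SESSION_SIZE, BUFFER, THRESHOLD, MAX_TARGET_WORDS = 2048, 128, 0.7, 256
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

    def generate_sessions(words):
        """Step (1): fixed-size windows, each prefixed with a 128-word buffer."""
        sessions = []
        for start in range(0, len(words), SESSION_SIZE):
            context = words[max(0, start - BUFFER):start]  # tail of previous session
            sessions.append(" ".join(context + words[start:start + SESSION_SIZE]))
        return sessions

    def assign_target(session, summary_sentences):
        """Step (2): greedily assign high-scoring summary sentences to a session."""
        scored = [(scorer.score(session, s)["rouge1"].recall, s)
                  for s in summary_sentences]
        # Keep sentences passing the 0.7 threshold, highest scores first.
        scored = sorted((p for p in scored if p[0] >= THRESHOLD), reverse=True)
        target, length = [], 0
        for _, sentence in scored:
            if length >= MAX_TARGET_WORDS:
                break
            target.append(sentence)
            length += len(sentence.split())
        return " ".join(target)

Calling generate_sessions(document.split()) yields the session inputs; pairing each session with assign_target(session, sentences) produces the training target for the generation models.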
3.2 Abstraction-based Model

The training data contain around 700 abstractive summaries that come from different domains of CS, including ML, NLP, AI, vision, storage, etc. These abstractive summaries are blog posts created by NLP and ML researchers. Traditional generation models are mainly based on the classical transformer structure. To solve the problem of long text input, we use the sparse attention structure BIGBIRD (Zaheer et al., 2020), recently proposed by Google, and fine-tune its two open-source pre-trained models:

(1) RoBERTa (Liu et al., 2019): a BERT model with dynamic masking that drops the next sentence prediction loss.

(2) PEGASUS (Zhang et al., 2020): a transformer model pre-trained with gap sentence generation.

The models used in this paper are both pre-trained on arXiv datasets, so they have a strong ability to generate abstracts.
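To make Section 3.2 concrete, the sketch below runs a public BigBird-PEGASUS checkpoint pre-trained on arXiv through the Hugging Face transformers library. This is a minimal illustration: the google/bigbird-pegasus-large-arxiv checkpoint is assumed as a stand-in for the authors' own fine-tuned models, and the generation settings are illustrative.

    # Minimal sketch: abstractive summarization of one session with a public
    # BigBird-PEGASUS checkpoint (assumed stand-in for the paper's fine-tuned models).
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    name = "google/bigbird-pegasus-large-arxiv"  # arXiv-pretrained checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    def summarize_session(session_text):
        # Sparse attention lets the encoder accept far longer inputs than BERT's 512 tokens.
        inputs = tokenizer(session_text, return_tensors="pt",
                           truncation=True, max_length=2048)
        ids = model.generate(**inputs, max_length=256, num_beams=4)
        return tokenizer.decode(ids[0], skip_special_tokens=True)

The 2048-token input and 256-token output caps mirror the session size and target length chosen in Section 3.1.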
3.3 Extraction-based Model

The extractive data contain 1705 extractive summaries based on video talks from associated conferences (Lev et al., 2019). We tried three different extraction models to select important sentences from the documents.

(1) TextRank (Mihalcea and Tarau, 2004): We simply use the TextRank algorithm to pick out the most important sentences from the documents, limiting the number of sentences extracted.

(2) DGCNN (Su, 2018): We train a DGCNN-Extraction model with the following steps (a sketch of steps 2-4 appears at the end of this section):

1. Label each sentence of the document according to the golden extractive summary.

2. Transform each sentence with the RoBERTa-Large pre-trained model (Liu et al., 2019), take the output of the last hidden layer as the feature representation, and convert the feature matrix to a fixed-size vector by average pooling.

3. TRAINING: Feed the obtained sentence vectors into the DGCNN-Extraction model (Figure 2) and binary-classify each sentence.

4. INFERENCE: Take the sigmoid output of the model as the importance score for each sentence, according to which we extract the corresponding sentences from the paper as the extractive summary; the total length of the summary is limited.

(3) BERTSUMM (Liu, 2019): BERTSUMM is a BERT-based model designed for the extractive summarization task. Different from the DGCNN-Extraction model, because of the limit on BERT's input length, we have to divide each paper into
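The following is a minimal sketch of steps 2-4, assuming the Hugging Face roberta-large checkpoint for sentence features and a small stack of dilated gated convolutions standing in for the full DGCNN-Extraction architecture (Figure 2), whose exact depth and widths are not specified in the excerpt above.

    # Sketch of DGCNN-Extraction steps 2-4: RoBERTa sentence features ->
    # dilated gated convolutions -> per-sentence importance scores.
    # The convolution stack is an assumption; see Su (2018) for the original design.
    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("roberta-large")
    encoder = AutoModel.from_pretrained("roberta-large")

    def sentence_vectors(sentences):
        """Step 2: last hidden layer of RoBERTa, average-pooled per sentence."""
        batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**batch).last_hidden_state   # (n_sent, seq_len, 1024)
        mask = batch["attention_mask"].unsqueeze(-1)      # zero out padding tokens
        return (hidden * mask).sum(1) / mask.sum(1)       # (n_sent, 1024)

    class DGCNNScorer(nn.Module):
        """Steps 3-4: gated dilated convolutions over the sentence sequence,
        then a sigmoid importance score per sentence (binary classification)."""
        def __init__(self, dim=1024, dilations=(1, 2, 4)):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv1d(dim, 2 * dim, kernel_size=3, dilation=d, padding=d)
                for d in dilations)
            self.out = nn.Linear(dim, 1)

        def forward(self, x):                  # x: (batch, n_sent, dim)
            h = x.transpose(1, 2)              # convolve along the sentence axis
            for conv in self.convs:
                value, gate = conv(h).chunk(2, dim=1)
                h = h + value * torch.sigmoid(gate)   # gated residual update
            return torch.sigmoid(self.out(h.transpose(1, 2))).squeeze(-1)

At inference, the highest-scoring sentences are extracted in document order until the summary length limit is reached, mirroring step 4.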