Abstractive Multi-document Summarization with Semantic Information Extraction

Wei Li
Key Lab of Intelligent Info. Processing, Institute of Computing Technology, CAS
Beijing, 100190, China
[email protected]

Abstract

This paper proposes a novel approach to generate abstractive summaries for multiple documents by extracting semantic information from texts. The concept of Basic Semantic Unit (BSU) is defined to describe the semantics of an event or action. A semantic link network over BSUs is constructed to capture the semantic information of texts. The summary structure is planned, and sentences are generated based on the semantic link network. Experiments demonstrate that the approach is effective in generating informative, coherent and compact summaries.

1 Introduction

Most automatic summarization approaches are extractive and leverage only literal or syntactic information in documents. Sentences are extracted from the original documents directly by ranking or scoring, and only a little post-editing is performed (Yih et al., 2007; Wan et al., 2007; Wang et al., 2008; Wan and Xiao, 2009). Pure extraction has intrinsic limits compared to abstraction (Carenini and Cheung, 2008).

Abstractive summarization requires semantic analysis and abstract representation of texts, which need knowledge on and beyond the texts (Zhuge, 2015a). Several abstractive approaches have appeared in recent years: sentence compression (Knight and Marcu, 2000; Knight and Marcu, 2002; Cohn and Lapata, 2009), sentence fusion (Barzilay and McKeown, 2005; Filippova and Strube, 2008), and sentence revision (Tanaka et al., 2009). However, these approaches are sentence-rewriting techniques based on syntactic analysis, without semantic analysis and representation. A fully abstractive summarization approach requires a separate process for the analysis of texts that serves as an intermediate step before the generation of sentences (Genest and Lapalme, 2011). The statistics of words or phrases and the syntactic analysis widely used in existing summarization approaches are all shallow processing of text. It is necessary to explore summarization methods based on deeper semantic analysis.

We define the concept of Basic Semantic Unit (BSU) to express the semantics of texts. A BSU is an action indicator with its obligatory arguments, which contain the actor and receiver of the action. The BSU is the most basic element of coherent information in texts, and it can describe the semantics of an event or action. The semantic information of texts is represented by extracting BSUs and constructing a BSU semantic link network (Zhuge, 2009). A Semantic Link Network consists of semantic nodes, semantic links and reasoning rules (Zhuge, 2010; 2011; 2012; 2015b). The semantic nodes can be any resources. In this work, the semantic nodes are BSUs extracted from texts, and the semantic relatedness between BSUs serves as the semantic links. The summary can then be generated from the semantic link network through summary structure planning.

The characteristics of our approach are as follows:
• Each BSU describes the semantics of an event or action. The semantic relatedness between BSUs can capture the contextual semantic relations of texts.
• The BSU semantic link network is an abstract representation of texts. Reduction on the network obtains the important information of texts with no redundancy.
• The summary is built from sentence to sentence into a coherent body of information, based on the BSU semantic link network, through summary structure planning.
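To make these definitions concrete, here is a minimal Python sketch of a BSU and of a semantic link network over BSUs. The field names and the link weight are illustrative assumptions, not the authors' code; the relatedness scores that label the links are computed as described in Section 3.1.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class BSU:
    """Basic Semantic Unit: an action indicator with its obligatory arguments."""
    actor: str                # e.g. "Flight MH370"
    action: str               # e.g. "leave"
    receiver: Optional[str]   # e.g. "Kuala Lumpur"; None when the action has no receiver

# Nodes of the semantic link network are BSUs; links carry pairwise
# semantic relatedness scores (the 0.8 below is purely illustrative).
bsu_a = BSU("Flight MH370", "disappear", None)
bsu_b = BSU("Flight MH370", "leave", "Kuala Lumpur")
links: Dict[Tuple[BSU, BSU], float] = {(bsu_a, bsu_b): 0.8}
```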


2 Related Work

Several abstractive summarization approaches have been proposed in recent years. TTG attempts to generate an abstractive summary by using text-to-text generation to produce a sentence for each subject-verb-object triple (Genest and Lapalme, 2011). A system that generates abstractive summaries for spoken meetings was proposed by Wang and Cardie (2013). It identifies relation instances, each represented by a lexical indicator with an argument constituent, from texts; the relation instances are then filled into templates extracted by multiple sequence alignment. Both of these systems need to select a subset from a large volume of generated sentences. In contrast, our system generates the summary directly through summary structure planning, which yields well-organized and coherent summaries more effectively.

A recent work generates abstractive summaries based on Abstract Meaning Representation (AMR) (Liu et al., 2015). It first parses the source text into AMR graphs, then transforms them into a summary graph, and plans to generate text from it. That work focuses only on the graph-to-graph transformation; the module for generating text from AMR has not been developed. The nodes and edges of an AMR graph are entities and the relations between them, which makes it quite different from the BSU semantic link network; moreover, texts can be generated efficiently from the BSU network. Another recent abstractive summarization method generates new sentences by selecting and merging phrases from the input documents (Bing et al., 2015). It first extracts noun phrases and verb-object phrases from the input documents and then calculates saliency scores for them. An ILP optimization framework simultaneously selects and merges informative phrases so as to maximize their salience while satisfying sentence-construction constraints. The results show that the method has difficulty generating new informative sentences that genuinely differ from the original ones, and it may generate non-factual sentences because phrases from different sentences are merged.

Open information extraction has been proposed by Banko et al. (2007) and Etzioni et al. (2011). It extracts binary relations from the web, which differs from our approach of extracting the events and actions expressed in texts.

3 The Summarization Framework

Our system produces an abstractive summary for a set of topic-related documents. It consists of two major components: information extraction and summary generation.

3.1 Information Extraction

The semantic information of texts is obtained by extracting BSUs and constructing a BSU semantic link network. A BSU is represented as an actor-action-receiver triple, which both captures the crucial content and incorporates enough syntactic information to facilitate downstream sentence generation. Some actions may not have a receiver argument. For example, "Flight MH370 - disappear" and "Flight MH370 - leave - Kuala Lumpur" are two BSUs.

BSU Extraction. BSUs are extracted from the sentences of the documents. The texts are preprocessed with named entity recognition (Finkel et al., 2005) and co-reference resolution (Lee et al., 2011). Constituent and dependency parses are obtained with the Stanford parser (Klein and Manning, 2003). The eligible action indicator is restricted to be a predicate verb; the eligible actor and receiver arguments are noun phrases. Both the actor and receiver arguments take the form of constituents in the parse tree. A valid BSU must have one action indicator and at least one actor argument, and satisfy the following constraints:
• The actor argument is the nominal subject, the external subject, or the complement of a passive verb introduced by the preposition "by", and it performs the action.
• The receiver argument is the direct object, the passive nominal subject, or the object of a preposition following the action verb.

We create manual rules and syntactic constraints to identify all BSUs based on the syntactic structure of the sentences in the input texts; a simplified sketch follows.
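The paper does not spell out the rules beyond the constraints above; the following Python sketch is a reconstruction, not the authors' code. It assumes dependency edges of the form (head, relation, dependent) with Stanford-style relation labels (nsubj, agent, dobj, nsubjpass, pobj), and ignores the constituent-parse and co-reference steps.

```python
def extract_bsus(edges):
    """Apply the actor/receiver constraints to dependency edges.
    edges: iterable of (head_verb, relation, dependent) triples."""
    by_head = {}
    for head, rel, dep in edges:
        by_head.setdefault(head, {}).setdefault(rel, []).append(dep)

    bsus = []
    for verb, deps in by_head.items():
        # Actor: nominal subject, or the "by"-agent of a passive verb.
        actors = deps.get("nsubj", []) + deps.get("agent", [])
        # Receiver: direct object, passive nominal subject, or the
        # object of a preposition following the action verb.
        receivers = (deps.get("dobj", []) + deps.get("nsubjpass", [])
                     + deps.get("pobj", []))
        for actor in actors:                      # a valid BSU needs an actor
            if receivers:
                bsus.extend((actor, verb, r) for r in receivers)
            else:
                bsus.append((actor, verb, None))  # the receiver is optional
    return bsus

edges = [("disappear", "nsubj", "Flight MH370"),
         ("leave", "nsubj", "Flight MH370"),
         ("leave", "dobj", "Kuala Lumpur")]
print(extract_bsus(edges))
# [('Flight MH370', 'disappear', None), ('Flight MH370', 'leave', 'Kuala Lumpur')]
```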


Constructing the BSU Semantic Link Network. The semantic relatedness between BSUs combines three components: Arguments Semantic Relatedness (ASR), Action-Verb Semantic Relatedness (VSR), and Co-occurrence in the Same Sentence (CSS). The arguments of BSUs, actors and receivers, are noun phrases that denote concepts or entities in the text. When computing ASR, the semantic relatedness between concepts must be measured; we use explicit semantic analysis based on Wikipedia to compute it (Gabrilovich and Markovitch, 2007). When computing VSR, a WordNet-based measure is used to calculate the semantic relatedness between action verbs (Mihalcea et al., 2006). CSS records whether two different BSUs co-occur in the same sentence. The semantic relation between two BSUs is computed by linearly combining these three components, and the BSUs extracted from the texts then form a semantic link network.

Semantic Link Network Reduction. A discriminative ranker based on Support Vector Regression (SVR) (Smola and Schölkopf, 2004) assigns each BSU a summary-worthy score. Training data was constructed from the DUC 2005 datasets, which contain both source documents and human-generated reference summaries; BSUs were extracted from these datasets. A BSU from the source documents is a positive sample with summary-worthy score 1 if it occurs in the corresponding human-generated summaries, or if its semantic relatedness to some BSU in those summaries exceeds a threshold δ; otherwise it is a negative sample with summary-worthy score 0. Table 1 lists the BSU features used in the SVR model.

Basic Features:
  Number of words in actor/receiver
  Number of nouns in actor/receiver
  Number of new nouns in actor/receiver
  Actor/receiver has a capitalized word?
  Actor/receiver has a stopword?
  Action is a phrasal verb?
Content Features:
  Actor/receiver has a named entity?
  TF/IDF/TF-IDF of action
  TF/IDF/TF-IDF (min/max/average) of actor/receiver
Syntax Features:
  Constituent tag of actor/action/receiver
  Dependency relation of action with actor
  Dependency relation of action with receiver

Table 1. Features for BSU summary-worthy scoring. We use SVM-light with an RBF kernel and default parameters (Joachims, 1999).

The saliency score of each BSU in the semantic link network is then calculated as

    \mathrm{Sal}(BSU_i) = SW_i \sum_{j} R_{ij}    (1)

where SW_i is the summary-worthy score of BSU_i and R_{ij} is the semantic relatedness between BSU_i and BSU_j.
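As an illustration of how the three relatedness components might be combined and fed into Equation (1), here is a hedged Python sketch. The combination weights alpha, beta and gamma are assumptions, since the paper states only that the parts are linearly combined; asr, vsr and css stand in for the ESA, WordNet and co-occurrence measures.

```python
def bsu_relatedness(b1, b2, asr, vsr, css, alpha=0.4, beta=0.4, gamma=0.2):
    """Linear combination of the three components (weights are assumptions).
    asr(b1, b2): argument relatedness via Wikipedia-based ESA;
    vsr(b1, b2): action-verb relatedness via a WordNet measure;
    css(b1, b2): 1.0 if the two BSUs co-occur in a sentence, else 0.0."""
    return alpha * asr(b1, b2) + beta * vsr(b1, b2) + gamma * css(b1, b2)

def saliency(i, summary_worthy, R):
    """Equation (1): Sal(BSU_i) = SW_i * sum_j R_ij."""
    return summary_worthy[i] * sum(R[i][j] for j in range(len(R)) if j != i)
```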

BSUs in the semantic link network are then clustered by hierarchical complete-link clustering, so that the BSUs in each cluster are semantically similar (for example, "Malaysia Airlines plane - vanish" and "Flight MH370 - disappear"). Only the most important BSU in each cluster, the one with the largest saliency score, is retained in the network; the less important BSUs are eliminated. The remaining BSU semantic link network represents the important information of the texts with no redundancy. A minimal sketch of this reduction step follows.
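The sketch below assumes SciPy's complete-link hierarchical clustering over a precomputed relatedness matrix; the 1 − relatedness distance conversion and the cut threshold are illustrative assumptions, as the paper does not state these specifics.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def reduce_network(R, sal, dist_threshold=0.3):
    """Complete-link clustering of BSUs; keep the most salient BSU per cluster.
    R: symmetric relatedness matrix in [0, 1] (n >= 2); sal: saliency per BSU."""
    dist = 1.0 - np.asarray(R, dtype=float)
    np.fill_diagonal(dist, 0.0)                    # squareform needs a zero diagonal
    labels = fcluster(linkage(squareform(dist), method="complete"),
                      t=dist_threshold, criterion="distance")
    keep = {}
    for idx, lab in enumerate(labels):             # retain the max-saliency member
        if lab not in keep or sal[idx] > sal[keep[lab]]:
            keep[lab] = idx
    return sorted(keep.values())                   # indices of the retained BSUs
```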


3.2 Summary Generation

The summary for the documents is generated directly from the BSU semantic link network. The summary should be well-structured and well-organized: not just a heap of related information, but a text that builds from sentence to sentence into a coherent body of information about a topic.

The summary structure is planned on the BSU semantic link network by finding an optimal path that covers all the nodes in the network. Two factors are considered when finding the optimal path. (1) Context semantic coherence: to make the summary semantically coherent, all adjacent sentences should be semantically related, so we seek a path P = [p_{r_1}, p_{r_2}, \ldots, p_{r_n}] in which every two adjacent nodes are strongly related, maximizing \frac{1}{n}\sum_{i=1}^{n-1} R_{r_i r_{i+1}}. (2) Clear-cut theme: to make the theme of the generated summary clear-cut, the important content should be put in a prior position. Denoting the position of the ith node in the path by u_i and its weight by w_i = 1/\mathrm{Sal}(BSU_i), we maximize \sum_{i=1}^{n} u_i w_i.

To combine the two factors, we need to find a path that covers each node exactly once and maximizes the biased sum of the weights of all nodes in the path. The problem can be proved NP-hard by reduction from the travelling salesman problem, and it can be formalized as an integer linear program (ILP) as follows. Let x_{ij} indicate whether the optimal path goes from node i to node j:

    x_{ij} = \begin{cases} 1 & \text{if the path goes from node } i \text{ to node } j \\ 0 & \text{otherwise} \end{cases}    (2)

Since each node can be traversed only once, the following constraints must be satisfied:

    \sum_{i=1, i \neq j}^{n} x_{ij} = 1, \quad j = 1, \ldots, n
    \sum_{j=1, j \neq i}^{n} x_{ij} = 1, \quad i = 1, \ldots, n    (3)

The nodes in the path are sequentially ordered: if the path goes from node i to node j, then u_j must be larger than u_i, which is formulated as

    u_i - u_j + n\,x_{ij} \leq n - 1, \quad 1 \leq i \neq j \leq n
    1 \leq u_i \leq n, \quad u_i \in \mathbb{Z}, \quad i = 1, \ldots, n    (4)

Finally, the objective function is

    \max \; \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1, j \neq i}^{n} R_{ij}\,x_{ij} + \lambda \sum_{i=1}^{n} w_i u_i    (5)

where the parameter λ balances the two parts and n is the number of BSUs in the final BSU semantic link network (after reduction). A brute-force sketch of this path search appears below.
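For small reduced networks, the combined objective of Equation (5) can be explored by brute force, as in the sketch below; real instances would need an ILP solver with the assignment and ordering constraints of Equations (3)-(4). The default λ follows the paper's tuned value (Section 4.1); everything else is illustrative.

```python
from itertools import permutations

def plan_path(R, sal, lam=10.0):
    """Brute-force maximization of Equation (5) for small n (O(n!) time).
    R[i][j]: pairwise relatedness; sal[i]: saliency; w_i = 1 / Sal(BSU_i)."""
    n = len(sal)
    w = [1.0 / s for s in sal]               # assumes strictly positive saliency
    best, best_path = float("-inf"), None
    for perm in permutations(range(n)):
        # Coherence term: relatedness along consecutive path edges, scaled by 1/n.
        coherence = sum(R[perm[k]][perm[k + 1]] for k in range(n - 1)) / n
        # Theme term: u_i is the 1-based position of node i in the path.
        theme = sum(w[node] * (pos + 1) for pos, node in enumerate(perm))
        score = coherence + lam * theme
        if score > best:
            best, best_path = score, perm
    return list(best_path)                   # BSU order for the summary
```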

Sentence Generation. After the summary structure has been planned, a sentence is generated for each node in the BSU semantic link network. Because a BSU contains enough semantic and syntactic information, sentences can be generated efficiently according to the following rules:
• Generate a noun phrase (NP) from the actor argument to serve as the subject, and an NP from the receiver argument to serve as the object if present.
• Generate a verb phrase (VP) from the action verb to link the components above. The tense of the verb is set to the same as in the original sentence, and most modifiers, such as auxiliaries and negation, are conserved.
• Generate complements for the VP when the BSU has no receiver. The verb modifiers that follow the action verb, such as prepositional phrases and infinitive phrases, can be used as the complement, in case the verb would have no interesting meaning without one.

The process of sentence generation for each node is based on the syntactic structure of the source sentence from which the BSU was extracted. The time and location prepositional phrases, which carry important information about news events, are kept. The generated sentences are organized according to the summary structure. If some adjacent sentences in the summary have the same subject, the subject of the latter can be substituted by a pronoun (such as "it" or "they") to avoid repetition of noun phrases. A sample summary generated by our system for the "Malaysia MH370 Disappear" news is shown in Figure 1; a toy rendering of the generation rules is sketched below.
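The sketch below is a toy rendering of these rules in Python. It omits tense setting, modifier conservation and NP realization, which require the original parse, and simply concatenates the realized constituents.

```python
def generate_sentence(bsu, complement=None):
    """Realize one BSU: actor NP + action VP (+ receiver NP, or a
    complement such as a trailing prepositional phrase when the BSU
    has no receiver). Tense/modifier handling is omitted here."""
    actor, action, receiver = bsu
    parts = [actor, action]
    if receiver is not None:
        parts.append(receiver)
    elif complement:
        parts.append(complement)
    sent = " ".join(parts)
    return sent[0].upper() + sent[1:] + "."

print(generate_sentence(("Flight MH370", "disappeared", None),
                        complement="after leaving Kuala Lumpur"))
# Flight MH370 disappeared after leaving Kuala Lumpur.
```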

4 Evaluation Results

4.1 Dataset and Experimental Settings

To evaluate the performance of our system, we use two datasets from recent multi-document summarization shared tasks: DUC 2005 and DUC 2007. Each task provides a gold-standard dataset consisting of document clusters and reference summaries. In our experiments, DUC 2005 was used for training and parameter tuning, and DUC 2007 was used for testing. Based on the tuning set, the parameter λ is set to 10 and δ is set to 0.7.

Our system is compared with one state-of-the-art graph-based extractive approach, MultiMR (Wan and Xiao, 2009), and one abstractive approach, TTG (Genest and Lapalme, 2011). In addition, we implemented another baseline, RankBSU, which applies graph-based ranking to the BSU network and selects the top-ranked BSUs to generate sentences.

4.2 Results

The ROUGE-1.5.5 toolkit was used to evaluate summary quality on the DUC 2007 dataset (Lin and Hovy, 2003). The ROUGE scores of the NIST Baseline system and the average ROUGE scores of all participating systems in the DUC 2007 main task (AveDUC) are also listed.

System         ROUGE-1   ROUGE-2   ROUGE-SU4
OurSystem      0.42145   0.11016   0.15632
MultiMR        0.41967   0.10302   0.15385
RankBSU        0.39123   0.08742   0.14381
TTG            0.39268   0.09645   0.14553
AveDUC         0.39684   0.09495   0.14671
NIST Baseline  0.33126   0.06425   0.11114

Table 2. Comparison results (F-measure) on DUC 2007 under ROUGE evaluation.

According to the results in Table 2, our system substantially outperforms the NIST Baseline and AveDUC, and achieves higher ROUGE scores than the abstractive approach TTG, which indicates that the abstract representation of texts and the information extraction process in our system are effective for multi-document summarization. Our system also outperforms the baseline RankBSU, which demonstrates that the network reduction method is more effective than popular graph-based ranking methods. Compared with the state-of-the-art graph-based extractive method MultiMR, our system again achieves better performance; furthermore, our system is abstractive, with abstract representation and sentence generation.


Incorrect parses and co-reference resolution errors lead to wrong BSU extraction; with more accurate parsing and co-reference resolution, our system would be expected to achieve even better performance.

Since the ROUGE metric evaluates summaries only from the word-overlap perspective, we also use the pyramid evaluation metric (Nenkova and Passonneau, 2004), which measures summary quality beyond simple string matching. The pyramid metric involves semantic matching of summary content units (SCUs) so as to recognize alternate realizations of the same meaning, which makes it better suited to evaluating abstractive summaries. Since manual pyramid evaluation is time-consuming and its results are not reproducible across different groups of assessors, we use the automated version of the pyramid method proposed by Passonneau et al. (2013) and adopt the same setting as Bing et al. (2015). Table 3 shows the evaluation results of our system and the three baseline systems on DUC 2007. The performance of our system is significantly better than that of the three baselines, which demonstrates that the summaries of our system contain more SCUs than the summaries of the other systems; our system therefore generates more informative summaries.

System          OurSystem  MultiMR  RankBSU  TTG
Pyr (Th: 0.6)   0.858      0.845    0.832    0.834
Pyr (Th: 0.65)  0.743      0.731    0.718    0.721

Table 3. Comparison results on DUC 2007 under the automated pyramid evaluation with threshold values 0.6 and 0.65.

In addition, large volumes of news texts for popular news events were crawled from news websites. Figures 1 and 2 show the summaries for the "Malaysia MH370 Disappear" news event generated by our system and by MultiMR, respectively. The summary produced by MultiMR contains obvious repetition of facts, and it is just a heap of information about MH370. The summary produced by our system contains little repetition of facts, so it can hold more useful information, and it builds from sentence to sentence into a coherent body. The summary of our system is clearly more coherent and compact.

Flight MH370 disappeared after leaving Kuala Lumpur. It had been expected to land in Beijing at 06:30. It took off at 00:41 MYT from runway 32R. It ended in the southern Indian Ocean. The aircraft, a Boeing 777-200ER made a sharp turn westwards. It passed into Vietnamese airspace. The captain of another aircraft attempted to reach the crew of Malaysia Airlines flight MH370. Malaysia Airlines flight 386 was requested to attempt to contact Malaysia Airlines flight MH370 on the Lumpur Radar frequency. Malaysia Airlines flight MH17, another Boeing 777-200ER, was surpassed Malaysia Airlines flight MH370. Malaysia Airlines assumes beyond reasonable doubt there are no survivors. They reported the Malaysia Airlines flight MH370 missing. They releases passenger manifest of flight MH370. They will give US$ 5000 to the relatives of each passenger. Malaysia released preliminary report. It set up a Joint Investigation Team. Southeast Asian states have joined forces to search waters between Malaysia and Vietnam. Chinese government criticizes Malaysia for inadequate answers regarding Malaysia Airlines flight MH370. Malaysia will be deploying more ships and equipment to assist in the search. It ends hunt in South China Sea. Continued refinement of analysis of flight MH370's satellite communications identified a wide area search. Australia and Malaysia are working on a Memorandum of Understanding to cover financial and cooperation arrangements for search and recovery activities.

Figure 1. The summary of the "Malaysia MH370 Disappear" news event generated by our system.
Malaysia Airlines said in a statement that flight MH370 had disappeared at 02:40 local time on Saturday after leaving Kuala Lumpur. Southeast Asian states have joined forces to search waters between Malaysia and Vietnam after a Malaysia Airlines plane vanished on a flight to Beijing, with 239 people on board. Flight MH370 had been expected to land in Beijing at 06:30. If Malaysia Airlines flight MH370 had impacted the ocean hard, resulting underwater sounds could have been detected by hydrophones, given favorable circumstances. Scientists from the CTBTO analyzed their recordings soon after flight MH370 disappeared, finding nothing of interest. The CMST researchers believe that the most likely explanation of the hydroacoustic data is that they come from the same event, but unrelated to Malaysia Airlines flight MH370. The lead researcher of the CMST team, Dr. Alec Duncan, believes there's a slim chance that the acoustic event is related to Malaysia Airlines flight MH370. Several IMOS recorders deployed in the Indian Ocean off northwestern Australia by CMST may have recorded data related to Malaysia Airlines flight MH370. Malaysia Airlines released the names and nationalities of the 227 passengers and 12 crew members, based on the flight manifest, later modified to include two Iranian passengers travelling on stolen passports. If the data relates to the same event, related to flight MH370, but the arc derived from analysis of the aircraft's satellite transmission is incorrect, then the most likely place to look for the aircraft would be along a line from HA01.

Figure 2. The summary of the "Malaysia MH370 Disappear" news event generated by MultiMR.

5 Conclusions and Future Work

The proposed summarization approach is effective in information extraction and achieves good performance on the DUC datasets. The sample summary shows that the approach is very effective for summarizing texts that mainly describe the facts and actions of news events. Summaries generated by our system are informative, coherent and compact.

For texts expressing opinions, however, the approach is not yet adequate. For example, when the verbs of BSUs are not meaningful actions, such as "be", the semantic relations between the BSUs cannot be appropriately computed by the methods described in this paper. More effective methods to compute semantic relations between BSUs should be developed in future work. The sentence generation process described in this paper is also only a preliminary scheme; it should be developed to rely less on the original sentence structure and to aggregate information from several different BSUs.

References

Banko, M., Cafarella, M. J., Soderland, S., et al. 2007. Open information extraction for the web. In IJCAI 2007, 2670-2676.

Barzilay, R., and McKeown, K. R. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3): 297-328.

Bing, L., Li, P., Liao, Y., Lam, W., et al. 2015. Abstractive Multi-Document Summarization via Phrase Selection and Merging. In ACL 2015, 1587-1597.


Carenini, G., and Cheung, J. C. K. 2008. Extractive vs. NLG-based abstractive summarization of evaluative text: The effect of corpus controversiality. In Proceedings of the Fifth International Natural Language Generation Conference, 33-41.

Cohn, T., and Lapata, M. 2009. Sentence compression as tree transduction. Journal of Artificial Intelligence Research, 637-674.

Etzioni, O., Fader, A., Christensen, J., et al. 2011. Open Information Extraction: The Second Generation. In IJCAI 2011, 3-10.

Filippova, K., and Strube, M. 2008. Sentence fusion via dependency graph compression. In EMNLP 2008, 177-185.

Finkel, J. R., Grenager, T., and Manning, C. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In ACL 2005, 363-370.

Gabrilovich, E., and Markovitch, S. 2007. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In IJCAI 2007, 1606-1611.

Genest, P. E., and Lapalme, G. 2011. Framework for abstractive summarization using text-to-text generation. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, 64-73.

Joachims, T. 1999. SVMlight: Support Vector Machine. http://svmlight.joachims.org/, University of Dortmund.

Klein, D., and Manning, C. D. 2003. Accurate unlexicalized parsing. In ACL 2003, 423-430.

Knight, K., and Marcu, D. 2000. Statistics-based summarization - step one: Sentence compression. In AAAI/IAAI 2000, 703-710.

Knight, K., and Marcu, D. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1): 91-107.

Lee, H., Peirsman, Y., Chang, A., et al. 2011. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In ACL 2011, 28-34.

Lin, C. Y., and Hovy, E. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In HLT-NAACL 2003, 71-78.

Liu, F., Flanigan, J., et al. 2015. Toward Abstractive Summarization Using Semantic Representations. In HLT-NAACL 2015.

Mihalcea, R., Corley, C., and Strapparava, C. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI 2006, 775-780.

Nenkova, A., and Passonneau, R. 2004. Evaluating content selection in summarization: The pyramid method. In HLT-NAACL 2004, 145-152.

Passonneau, R. J., Chen, E., Guo, W., and Perin, D. 2013. Automated Pyramid Scoring of Summaries using Distributional Semantics. In ACL 2013 (Volume 2), 143-147.

Smola, A. J., and Schölkopf, B. 2004. A tutorial on support vector regression. Statistics and Computing, 14(3): 199-222.

Tanaka, H., Kinoshita, A., Kobayakawa, T., Kumano, T., and Kato, N. 2009. Syntax-driven sentence revision for broadcast news summarization. In Proceedings of the 2009 Workshop on Language Generation and Summarisation, 39-47.

Wan, X., and Xiao, J. 2009. Graph-Based Multi-Modality Learning for Topic-Focused Multi-Document Summarization. In IJCAI 2009, 1586-1591.

Wan, X., Yang, J., and Xiao, J. 2007. Manifold-Ranking Based Topic-Focused Multi-Document Summarization. In IJCAI 2007, 2903-2908.

Wang, D., Li, T., Zhu, S., and Ding, C. 2008. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In SIGIR 2008, 307-314.

Wang, L., and Cardie, C. 2013. Domain-Independent Abstract Generation for Focused Meeting Summarization. In ACL 2013, 1395-1405.

Zhuge, H. 2009. Communities and Emerging Semantics in Semantic Link Network: Discovery and Learning. IEEE Transactions on Knowledge and Data Engineering, 21(6): 785-799.

Zhuge, H. 2010. Interactive Semantics. Artificial Intelligence, 174: 190-204.

Zhuge, H. 2011. Semantic linking through spaces for cyber-physical-socio intelligence: A methodology. Artificial Intelligence, 175: 988-1019.

Zhuge, H. 2012. The Knowledge Grid: Toward Cyber-Physical Society (Chapter 2). World Scientific.

Zhuge, H. 2015a. Dimensionality on Summarization. arXiv:1507.00209 [cs.CL], 2 July 2015.

Zhuge, H. 2015b. Mapping Big Data into Knowledge Space with Cognitive Cyber-Infrastructure. arXiv:1507.06500, 24 July 2015.
