An Evolutionary Algorithm for Automatic Summarization

Aurélien Bossard* and Christophe Rodrigues**

*LIASD - EA 4383, Université Paris 8, Saint-Denis, France [email protected] **Léonard de Vinci Pôle Universitaire, Research Center, Paris La Défense, France [email protected]

Abstract

This paper proposes a novel method to select sentences for automatic summarization based on an evolutionary algorithm. The algorithm explores the candidate summaries space following an objective function computed over the ngram probability distributions of the candidate summary and the source documents. This method does not consider a summary as a stack of independent sentences but as a whole text, and makes use of advances in unsupervised summarization evaluation. We compare this method to one of the best existing methods, which is based on integer linear programming, and show its efficiency on three different acknowledged corpora.

1 Introduction

Automatic summarization systems are essential components of information systems. Indeed, the increase of digital information sources can have a negative effect on online content reading and assimilation. Summarizing such content can allow users to better apprehend it. Automatic summarization has therefore become one of the first research topics in the natural language processing field (Luhn, 1958) and still remains a widely studied topic.

In order to validate the benefits obtained from new methods or parametrizations, the automatic summarization field needs robust evaluation methods. Evaluating automatic summaries, just like evaluating automatic translation, is a complex task. Until the early 2000s, only two types of approach existed: entirely manual evaluation with a reading grid, and semi-automatic evaluations that compare automatic summaries with human-written references. Since then, entirely automatic approaches that allow for evaluating a summary without a human reference (writing the reference is the most time-consuming task in evaluation) have emerged and have recently achieved good performances, using probabilistic models (Louis and Nenkova, 2009; Saggion et al., 2010).

Probabilistic models for automatic summarization evaluation are natural: a summary and its source have to share the same distribution of concepts. As they have proven performant for evaluation, using them to guide the automatic summarization process seems obvious, although to our knowledge it has not been tested yet. As opposed to most automatic summarization methods, which use encoded metrics, we here propose to consider automatic summarization as the optimization of a natural score: the divergence between the concept distribution of the source and the concept distribution of a candidate summary.

So we view automatic summarization as choosing the best summary among a very large set of candidate summaries upon a metric that is computed on the whole candidate summary. This leads us to the use of an evolutionary algorithm in order to naturally select the best summaries.

While other recent papers (Li et al., 2013; Nishikawa et al., 2014; Peyrard and Eckle-Kohler, 2016) integrate sophisticated and task-specific preprocessings and postprocessings, and handle semantics using complex representations, this paper proposes a new generic and directly usable sentence extraction method for automatic summarization. This method explores the candidate summaries space using an evolutionary algorithm. The algorithm aims at finding an approximate solution to the maximization of an objective function computed over a candidate summary. We first present iterative methods and exploratory analysis methods for automatic summarization. We then expose our sentence extraction method and

Proceedings of Recent Advances in Natural Language Processing, pages 111–120, Varna, Bulgaria, Sep 4–6 2017. https://doi.org/10.26615/978-954-452-049-6_017

the evaluation protocol: it is compared to one of the best methods in the automatic summarization field. Finally, we discuss our results.

2 Related Work

Iterative Selection
Automatic summarization systems generally combine a centrality score for text portions and an extraction method for these portions. The first automatic summarization systems (Luhn, 1958; Edmundson, 1969) simply extracted the most central portions. The MMR method (Carbonell and Goldstein, 1998) allows for iterative extraction of text portions given a centrality score and a redundancy score. The CSIS-based method (Radev, 2000) removes, from a list of text portions sorted by centrality, every portion that shares too much information with a higher-ranked one. These methods share a major drawback: generated summaries depend mostly on the first selected text portion. Therefore, they are exposed to omitting summaries made of average-ranked sentences that, combined together, correctly reflect the overall content of the source documents.

Optimization
Other methods emerged recently to overcome this problem. They consist in exploring the space of all candidate summaries in order to find the one that maximizes an objective function. This problem is exponential in the number of input sentences, as there are C(m, n) candidate summaries composed of n sentences for a corpus of m sentences. For example, choosing 10 sentences out of 200 leads to more than 10^16 possible solutions. Adding constraints on text portion selection and using Integer Linear Programming (ILP) helps delimit the problem and find (though not always) an exact solution (McDonald, 2007; Gillick and Favre, 2009b). The search space is limited by constraints on text portion length and by constraints that avoid including text portions that do not provide additional information. Whereas Gillick and Favre (2009b) select summaries based on the maximization of concept occurrences, Li et al. (2013) try to maximize the similarity between summary and source bigram frequencies. These methods have proven to be very efficient. However, the use of an ILP solver forces one to model the problem as a linear function, and so does not allow for complex functions that could better take into account the structure of the automatic summarization problem.

Liu et al. (2006), Nandhini and Balasundaram (2013) and Shigematsu and Kobayashi (2014) propose to look for an approximate solution using a genetic algorithm. As opposed to ILP extraction methods, these methods are free of any constraints on the objective function. However, they keep considering a summary as a set of independent portions of text, and do not take advantage of the new structure of the problem that we propose, in which a summary is considered as a whole. Considering a summary as a whole allows for a better space exploration and more complex objective functions. Alfonseca and Rodríguez (2003) use a "standard GA" without describing it, together with several fitness scores whose main metric is the cosine-tfidf similarity between sources and candidate summaries. However, this similarity has obtained poor results when used as an automatic evaluation metric (Nenkova et al., 2007), so it should not be used as a fitness score for automatic summarization. As opposed to Alfonseca and Rodríguez (2003), we define a new, non-standard, and fully replicable evolutionary algorithm combined with an extension of an agreed-upon automatic evaluation metric.

Litvak et al. (2010) and Bossard and Rodrigues (2011) use genetic algorithms for supervised learning of parameters in order to tune automatic summarization systems. Nishikawa et al. (2014), Takamura and Okumura (2010) and Sipos et al. (2012) perform structured output learning to maximize ROUGE scores. These approaches suffer from the complexity of the machine learning model; moreover, they require learning data and are very task-specific. Recently, Peyrard and Eckle-Kohler (2016) proposed to use an approximation of the ROUGE-N score combined with an ILP solver. The approximation of the ROUGE-N score is learned with supervised learning. The results obtained outperform state-of-the-art methods on the DUC 2002 and 2003 corpora.

In contrast to supervised learning, we propose a fully unsupervised method that allows for summarizing even when no manual reference is available. We propose scoring functions based on probabilistic models of source documents and candidate summaries. The smoothing used for building the probability distributions considers a candidate summary as a whole, not as a set of independent sentences. This complex objective function requires a constraint-free optimization algorithm, so evolutionary algorithms seem appropriate. In the meantime, the scoring functions we use make it possible to benefit more from evolutionary algorithms.

3 Our Method

Louis and Nenkova (2009) have proposed an entirely unsupervised function for automatic summary evaluation. The evaluation method has proved efficient, as it strongly correlates with the Pyramid score, a semi-automatic score used in the TAC evaluation campaigns to assess the informational quality of automatic summaries (Nenkova et al., 2007). Saggion et al. (2010) confirmed the relevance of the approach. The method is entirely unsupervised, so it does not need any human reference. This makes it a perfect fit to be used as an objective function in a guided space-exploring algorithm.

Our objective function compares the probability distribution of tokens in the source documents with the probability distribution of tokens in the summaries. Tokens can be words, ngrams or even semantic concepts. The objective function is based on (Louis and Nenkova, 2009), which handles automatic evaluation using the Jensen-Shannon (JS) divergence.

The JS divergence can be considered as a symmetric version of the Kullback-Leibler (KL) divergence. The KL divergence is defined as the average number of bits wasted by coding samples belonging to P using another distribution Q, an approximation of P (Louis and Nenkova, 2009). In our case, the two distributions are those of the words (or concepts) in the input and in the summary.

Given two distributions P and Q, the JS divergence is defined as:

    JS(P || Q) = 1/2 [ KL(P || A) + KL(Q || A) ]

with A = (P + Q) / 2 the mean distribution of P and Q, and KL the Kullback-Leibler divergence:

    KL(P || Q) = Σ_w p_P(w) log2( p_P(w) / p_Q(w) )

Louis and Nenkova (2009) use a simple Laplace smoothing over probabilities:

    p(w) = ( C(w) + δ ) / ( N + δ × 1.5 × |V| )

with C(w) the number of occurrences of w, N the overall number of tokens, V the vocabulary, and δ = 0.0005.

3.1 Unigram Distribution Objective Function

This objective function (Uniprob) fits the exact automatic evaluation function described above and in (Louis and Nenkova, 2009).

3.2 Bigram Simple Sum Objective Function

The bigram simple sum objective function (Bisimple) consists in summing all bigram weights in a candidate summary. Candidate summaries get a high score if they are composed of the bigrams that are most frequent in the source documents.

3.3 Bigram Cosine Objective Function

The bigram cosine objective function (Bicos) consists in a cosine similarity between the bigram vectors of the source documents and of the candidate summaries. This objective function is used in (Alfonseca and Rodríguez, 2003).

3.4 Bigram Distribution Objective Function

Lin (2004) has shown that the ROUGE semi-automatic evaluation metrics are more correlated to manual evaluation when using bigrams rather than unigrams when it comes to comparing summaries and references of a standard length: 50 words and more. Therefore, we make the assumption that a probabilistic model based on bigrams outperforms one based on unigrams for our objective function. Moreover, we smooth probabilities using Dirichlet smoothing (MacKay and Peto, 1994), which adds to every count of a token in a summary its probability in the source documents. Dirichlet smoothing is more faithful to the data (Zhai and Lafferty, 2004) than Laplacian smoothing while remaining simple to compute. With Dirichlet smoothing, the probability of a token t in a summary S is computed this way:

    p_dir(t | S) = ( C_S(t) + μ p_ML(t | D) ) / ( N_S + μ )

with D the source documents; C_S(t) the number of occurrences of t in S; p_ML(t | D) the maximum likelihood estimate of t in D; N_S the number of tokens in S; and μ a constant parameter called pseudo-frequency. Candidate summaries are subsets of the source documents, so smoothing is only applied to summaries.

However, this objective function (Biprob) may have a major drawback: it favors summaries whose probability distributions are close to the source documents'. If these source documents are highly

redundant, the summaries that are rated high on this objective function would also be redundant.

3.5 Evolutionary Algorithm

We here describe the evolutionary algorithm we developed and use for the maximization of our objective function. This evolutionary algorithm differs from traditional genetic algorithms as described below, in order to better handle the automatic summarization problem. In fact, automatic summarization tasks often bound the summary length in words, not in number of sentences. So the number of sentences extracted in a summary depends on both the maximum summary length and each sentence's length. That is why we cannot use a standard problem model in which every candidate summary would be an individual with n genes, n being the number of sentences to extract. Having a specific representation of the individuals leads to defining new operators on these individuals. So the mutation and hybridization operators differ from the usual ones because of the structure of our problem.

Individuals Definition
Each individual (= a candidate summary) is defined by a set of chromosomes (= sentences). A chromosome codes for a sentence. The number of chromosomes in the set of an individual is variable, while there is an upper threshold on the summed length of all chromosomes in words. This requires some adaptations to the classic mutation and hybridization operators, which we define below.

Algorithm Sequence
At generation 0, a starting population is created randomly. It contains N individuals, where N = Np + Nm + Nh, with Np, Nm and Nh the parameters of the evolutionary algorithm for the next generations. For each of the next generations:

- Np is the number of parents;
- Nm is the number of mutated individuals;
- Nh is the number of hybridized individuals.

Starting Population Selection
N individuals are randomly generated. A new sentence is randomly added to an individual until the length constraint is reached, i.e. while:

    ∃ s ∈ S \ I  such that  Σ_{s_i ∈ I} length(s_i) + length(s) < maxLength

with I an individual and S the sentences in the source documents.

Parents Selection
Parents selection can be performed through different methods. These methods favor space exploration, or exploitation by selecting the best individuals. We chose a selection method that is a compromise between exploration and exploitation: tournament selection. Np tournaments composed of N/Np individuals (with N the total population size) are randomly created. The best individual in each tournament is selected as a parent for the next generation.

Mutation Operator
We cannot use the classic mutation operator of changing one gene for another. Selecting a too long sentence would violate the summary length constraint, while selecting a too short sentence would let the new candidate summary be evaluated against other summaries that are longer. Our mutation operator is therefore defined as follows: a chromosome (sentence) is randomly deleted from an individual. The individual is then randomly filled with new chromosomes until the length constraint is no longer satisfiable.

Hybridization Operator
The chromosomes of two individuals (here, parents) are put together in a single set. A new individual is then created with chromosomes randomly selected from this set. Once no more chromosomes match the length constraint, the individual is randomly filled with chromosomes from the source documents. This operation does not guarantee that chromosomes from both individuals will be selected, so it is completely different from traditional hybridization operators. Proceeding this way is however mandatory in order to create individuals that satisfy the summary length constraint.
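As a concrete illustration of the smoothed-divergence objective of Sections 3 and 3.4, here is a minimal sketch. The function names, the word-level bigram granularity, and the pseudo-frequency value `mu` are our assumptions for the example, not values given in the paper:

```python
import math
from collections import Counter

def bigrams(tokens):
    """Adjacent token pairs of a token sequence."""
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def kl_divergence(p, q):
    """KL(P || Q) in bits, summed over the support of P."""
    return sum(pw * math.log2(pw / q[t]) for t, pw in p.items() if pw > 0)

def js_divergence(p, q):
    """JS(P || Q) = 1/2 [KL(P || A) + KL(Q || A)], with A = (P + Q) / 2."""
    support = set(p) | set(q)
    a = {t: (p.get(t, 0.0) + q.get(t, 0.0)) / 2 for t in support}
    return 0.5 * (kl_divergence(p, a) + kl_divergence(q, a))

def dirichlet_smoothed(summary_counts, source_probs, mu=100.0):
    """p_dir(t|S) = (C_S(t) + mu * p_ML(t|D)) / (N_S + mu); smoothing is
    applied to the summary only, since candidates are subsets of the sources."""
    n_s = sum(summary_counts.values())
    return {t: (summary_counts.get(t, 0) + mu * p) / (n_s + mu)
            for t, p in source_probs.items()}

def biprob(summary_tokens, source_tokens, mu=100.0):
    """Biprob-style score: negated JS divergence between the source bigram
    distribution and the smoothed summary bigram distribution (higher is
    better, 0 is the maximum)."""
    source_counts = Counter(bigrams(source_tokens))
    n_d = sum(source_counts.values())
    source_probs = {t: c / n_d for t, c in source_counts.items()}
    summary_counts = Counter(bigrams(summary_tokens))
    return -js_divergence(source_probs,
                          dirichlet_smoothed(summary_counts, source_probs, mu))
```

A candidate whose bigram distribution exactly matches the sources' scores 0; any mismatch makes the score negative, so maximizing this fitness minimizes the divergence.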
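The operators of Section 3.5 can be sketched end to end as follows. This is our illustrative reconstruction, not the authors' code: a Bisimple-style fitness (sum, over the candidate's distinct bigrams, of their counts in the sources) stands in for the full objective, and the parameter values are arbitrary (the paper uses Np = Nm = Nh = Ng = 160):

```python
import random
from collections import Counter

def summary_len(ind):
    return sum(len(s.split()) for s in ind)

def fill(ind, sentences, max_len, rng):
    """Randomly add sentences while at least one still fits the length bound."""
    pool = [s for s in sentences if s not in ind]
    rng.shuffle(pool)
    for s in pool:
        if summary_len(ind) + len(s.split()) <= max_len:
            ind.append(s)
    return ind

def mutate(ind, sentences, max_len, rng):
    """Delete one random chromosome (sentence), then refill."""
    child = list(ind)
    child.remove(rng.choice(child))
    return fill(child, sentences, max_len, rng)

def hybridize(p1, p2, sentences, max_len, rng):
    """Draw chromosomes from the union of both parents, then refill."""
    union = sorted(set(p1) | set(p2))
    return fill(fill([], union, max_len, rng), sentences, max_len, rng)

def tournament(pop, fitness, n_parents, rng):
    """Split the population into n_parents random groups; keep each winner."""
    shuffled = list(pop)
    rng.shuffle(shuffled)
    size = max(1, len(shuffled) // n_parents)
    groups = [shuffled[i:i + size] for i in range(0, size * n_parents, size)]
    return [max(g, key=fitness) for g in groups]

def bisimple_fitness(sentences):
    """Bisimple-style objective: total source count of the candidate's bigrams."""
    src = Counter(b for s in sentences for b in zip(s.split(), s.split()[1:]))
    def score(ind):
        cand = {b for s in ind for b in zip(s.split(), s.split()[1:])}
        return sum(src[b] for b in cand)
    return score

def evolve(sentences, fitness, max_len=100, n_p=20, n_m=20, n_h=20,
           n_gen=50, seed=0):
    rng = random.Random(seed)
    pop = [fill([], sentences, max_len, rng) for _ in range(n_p + n_m + n_h)]
    for _ in range(n_gen):
        parents = tournament(pop, fitness, n_p, rng)
        mutants = [mutate(rng.choice(parents), sentences, max_len, rng)
                   for _ in range(n_m)]
        hybrids = [hybridize(rng.choice(parents), rng.choice(parents),
                             sentences, max_len, rng) for _ in range(n_h)]
        pop = parents + mutants + hybrids  # parents + offspring = new generation
    return max(pop, key=fitness)
```

Note that, as in the paper, individuals are variable-length sentence sets bounded by a word budget rather than fixed-length gene vectors, which is why mutation and hybridization both end with a refill step.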

Then Np parents are selected to generate, by mutation and by hybridization, Nm and Nh additional individuals. The parents and the individuals they generated constitute a new generation. Ng generations are iteratively created this way. At the end, the individual that maximizes the objective function is selected.

4 Experiment

We compare our summarization method to three baselines on the TAC 2008 and TAC 2009 evaluation campaign corpora (available on request at http://www.nist.gov/tac/data/index.html). We also use the French corpus RPM2 (available on request at http://lia.univ-avignon.fr/fileadmin/documents/rpm2/), which shares a similar structure with the TAC 2008 and 2009 corpora.

4.1 Corpora

TAC 2008, TAC 2009 and RPM2 are composed of two distinct parts. The first one is dedicated to standard multi-document summarization. The second is dedicated to update summarization: the goal is to summarize the information in the update set, assuming that the user has already read the information in the standard set. The following TAC campaigns concern guided summarization: writing summaries for a given topic where the topic falls into a predefined category, including "Accidents and Natural Disasters", "Attacks", "Health and Safety", "Endangered Resources", and "Trials and Investigations". The goal is to encourage a deeper linguistic analysis, and summarization systems must take into account each task's specificity in order to produce coherent summaries. So, post-2009 TAC campaigns are out of this article's scope.

The TAC 2009 standard summarization corpus is composed of 44 sets of 10 documents each: English-written news articles. The task consists in generating a 100-word summary for every document set. The average document length is 6330 words.

4.2 Our System

Document Preprocessing
The documents are first cleaned: all tags and meta-information are removed.

Tokenization and Sentence Splitting
We first use the default tokenizer of the POS-tagging tool tree-tagger (Schmid, 1994), available at http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/. Tree-tagger is also used for POS-tagging (not used in the English system) and sentence splitting. Words are stemmed via the Porter stemmer for the TAC 2008 and 2009 corpora. For RPM2, we implemented the French Snowball stemmer (http://snowball.tartarus.org/algorithms/french/stemmer.html).

Bigrams Filtering
Bigrams composed of two stopwords are pruned. A stoplist composed of the 200 most frequent English words is used for this purpose. For the French corpus, a word is considered a stopword if it is tagged as either a determiner or a preposition.

Sentence Selection
Sentence selection is performed by the evolutionary algorithm described in Section 3.5, using the three objective functions described in Sections 3.4, 3.1 and 3.2.

Evolutionary Algorithm Parameters
We empirically tuned the number of generations using the TAC 2009 test dataset in order to increase the probability that the algorithm converges. The TAC 2009 test dataset and the evaluation datasets are strictly disjoint. The parameters of the evolutionary algorithm are: Np = Nh = Nm = 160 and Ng = 160.

First Sentences Salience
The TAC 2009 corpus is a news corpus composed for the most part of news wires and articles. First sentences are often considered more salient than the other ones. Gillick et al. (2009a) upweight by a factor 2 the concepts that appear in the first sentence of a document, and thus reach a 16% gain in automatic evaluation metrics on the TAC 2009 corpus. Our systems also upweight by a factor 2 the bigrams in the first sentences. We test its effect in Section 5.

4.3 Baselines

We implemented two baselines:

- lexmmr: the common scoring method LexRank (Erkan and Radev, 2004), followed by MMR (Carbonell and Goldstein, 1998) to perform redundancy removal;
- ILP: the sentence selection method of (Gillick et al., 2009a) described in Section 2.

These two systems have the same preprocessing, tokenizing, stemming, filtering, and first-sentence bigram weighting as ours, so they can be compared efficiently.

The ILP1 baseline is based on the ICSI summarization system (Gillick et al., 2009a), which is considered as state-of-the-art in a recent study (Hong et al., 2014). ICSI employs different heuristics and preprocessings, e.g. pronominal reference removal, and removal of relative dates and "said" clauses. In order to efficiently compare sentence extraction methods, we here use a generic version of the ICSI system: sentence extraction is processed using the

                     Baselines                          Evolutionary algorithm based systems
             lexmmr  hextac  ILP1   ILP2     uniprob  biprob  bisimple  bicos   Oracle
TAC 2008
  ROUGE-1    .3134   -       .3713  .3716    .3546    .3804   .3752     .3497   .4262
  ROUGE-2    .0647   -       .1075  .1027    .0816    .1103   .1065     .0930   .1718
TAC 2009
  ROUGE-1    .3361   .3794   .3729  .3837    .3509    .3855   .3846     .3586   .4348
  ROUGE-2    .0787   .1065   .1053  .1096    .0821    .1173   .1128     .0961   .1782
RPM2
  ROUGE-1    .3203   -       .4126  .4126    .3843    .4250   .3801     .4071   .4407
  ROUGE-2    .0889   -       .1568  .1568    .1472    .1683   .1473     .1491   .2039
Overall
  ROUGE-1    .3236   -       .3793  .3837    .3655    .3904   .3798     .3634   .4321
  ROUGE-2    .0745   -       .1154  .1151    .0961    .1234   .1163     .1042   .1800

Table 1: Average results of all systems on TAC 2008, TAC 2009 and RPM2.

same solver as the ICSI system (glpk), but preprocessings are limited to the strict minimum: the same as our system's, described in Section 4.2. This way it can be efficiently compared to our evolutionary-based sentence extraction method. As in (Gillick et al., 2009a), sentences of fewer than 10 words and bigrams that do not appear more than twice are not taken into account. In ILP2, we add the last rule of the ICSI summarization system: sentences that do not share at least one word in common with the query are pruned. This way, we can compare the ILP baselines more efficiently to our system, which does not take the query into account. The only difference between the ICSI summarization system and ILP2 lies in processings that are not related to the sentence selection module.

The last baseline (hextac) is the TAC 2009 third baseline: human-generated extractive summaries (Genest et al., 2009). This baseline stands for determining the best summary we can achieve using purely extractive methods. Therefore, comparing its results to a pure extractive summarizer is informative. However, as the goal of this system is not to produce good summaries for a human judge, its automatic scores could be lower than manual ones. In fact, human extractors can emphasize linguistic quality and global coherence rather than pure information extraction.

4.4 Oracle Experiments

We want to estimate the upper bound that a system based on our evolutionary algorithm and a probabilistic model can achieve. For this purpose, we define a new objective function. The objective functions described above compare distributions between candidate summaries and source documents. If we suppose that manual summaries are available (which is the case in ROUGE evaluation campaigns), then we can directly compute distribution divergences between candidate summaries and manual ones. So doing, we obtain an evolutionary approach guided by an Oracle. Given the small size of the manual summaries used as source documents, we don't use any smoothing at all.

This Oracle is a good way to fit to the manual summaries. However, it is not optimal in all cases. The manual summaries are indeed small and are not necessarily composed of bigrams also present in the source documents. This can cause the distribution divergences to be less precise.

5 Results

We here use the same ROUGE parameters as for the TAC evaluation campaigns: ROUGE-1.5.5.pl -c 95 -r 1000 -n 2 -m -a -x -d file.

5.1 General Results

Table 1 presents the results obtained by all the systems and baselines described in Sections 4.2 and 4.3 on three corpora: TAC 2008, TAC 2009 and RPM2. The best system, whatever the evaluation metric, is biprob, which uses our evolutionary algorithm with, as objective function, the JS divergence between the bigram distributions of the source documents and of the candidate summaries. It outperforms the ILP1 and ILP2 baselines. The ILP1 and ILP2 baselines have the same ROUGE scores on the RPM2 corpus, as RPM2 is not query-oriented. One can notice that biprob and the human baseline hextac are really close in terms of ROUGE scores.

The biprob system as well as the ILP baselines outperform the human-generated hextac baseline. It means that both systems succeed in extracting as much salient information as a human. However, hextac summaries outperform automatic summaries in terms of linguistic quality: coherence, ordering, and so on.

5.2 Impact of First Sentence Bigrams Upweight

All three corpora are composed solely of newswire articles and share the same information structure: the central information is located in the first sentences. Figure 1 shows biprob and ILP baseline ROUGE-2 scores when the upweight

factor of first-sentence bigrams varies on our three corpora. Increasing the weight of first-sentence concepts improves the ROUGE-2 scores of both the ILP baseline and biprob on the overall score over the three corpora. The maximum is reached around a factor of 2, as assumed by Gillick et al. (2009a). Our experiment confirms that upweighting first-sentence concepts by a factor 2 is efficient on newswire article corpora.

Figure 1: Biprob system and ILP1 baseline results depending on the first-sentence concept upweight value.

5.3 Impact of Input Size

Figure 2 presents the ROUGE-2 scores of the ILP1 baseline and Biprob depending on the number of bigrams in the source documents, for all corpora. The x error bar shows the number of topics concerned by a point. One can see that the two systems are really close for small topics. However, for topics with a larger number of bigrams, Biprob tends to outperform ILP1 (except for bigram-count intervals with a small number of examples).

Figure 2: ROUGE-2 scores on all topics depending on topic size in bigrams.

5.4 Impact of Bigram Filtering

Taking into account only the bigrams that appear more than twice increases the results for the ILP baselines (as also noticed in (Gillick et al., 2009a)), so we tested a probability distribution that keeps only these bigrams. The results are not as good as when computing the probability distribution over all bigrams. This can be linked to the fact that our method performs better when the number of input bigrams is high.

5.5 Convergence on TAC 2009 Data

Figure 3 shows the average fitness score (= bigram distribution objective function) of each generation for every topic of the TAC 2008, TAC 2009 and RPM2 corpora, over 12 different runs of the evolutionary algorithm. We can see that the scores seem to converge quite fast on average, before the 50th generation.

Figure 3: Average fitness scores of all topics in the TAC 2008, 2009 and RPM2 corpora depending on the generation.

Figure 3 presents the average fitness scores on all evaluation topics. Convergence speed depends on several factors: data dispersal, exploration space size, and so on. Figure 4 shows the convergence speed for the most populated topic of our three corpora: the D0918 topic from TAC 2009. It plots the average score and the score standard deviation for the D0918 topic over 12 different runs. This can give us an idea of the algorithm's convergence in the worst case in terms of exploration space size.

Figure 4: Average scores and standard deviation of scores for the D0918 topic depending on the generation.

The algorithm converges near the 220th generation, and the topic is composed of 575 sentences. There are more than 5 × 10^13 candidate summaries

that match the 100-word length limit, and the evolutionary algorithm has always reached convergence after exploring 70,400 different candidate summaries.

5.6 Oracle Results

One can see in Table 1 that, as expected, our Oracle performs more than 50% better than any of our systems. This means that there are summaries composed of sentences from the source documents that are closer to the reference summaries than those chosen by our objective functions. The 50% difference in ROUGE-2 scores shows that there is room for improvement.

6 Discussion

In this article, we proposed a sentence extraction method for automatic summarization that allows for good summaries on our three evaluation corpora. Our method outperforms a state-of-the-art sentence extraction method based on an ILP model by more than 6.4% on the combined score of our three evaluation corpora.

We show that the Jensen-Shannon divergence outperforms cosine similarity as an objective function in our evolutionary algorithm. This confirms the conclusions of (Louis and Nenkova, 2009) on the evaluation of metrics for automatic summary evaluation.

Thanks to the use of an evolutionary algorithm rather than an ILP-based method, the score computation and constraints are totally free and not limited by ILP modelization constraints. However, the objective functions described here do not handle redundancy. A candidate summary will be well scored if its probability distribution sticks to the one of the source documents. So, if the source documents display high redundancy on a given piece of information, our method could generate redundant summaries. However, other objective functions that handle redundancy can be implemented, as well as constraints on candidate summary generation.

Our system does not try to improve linguistic quality: sentence articulation is not managed. Methods such as lexical chains (Barzilay and Elhadad, 1999) could be added to the objective function or used as a post-treatment.

The method proposed in this article uses an evolutionary algorithm and objective functions that imply heavy computation. Several solutions can be brought to this problem: using GPU architectures for faster computation of distribution divergences, as the evolutionary algorithm can be easily parallelized, or finding a linear approximation of the distribution divergence function so that it can be used inside an ILP solver.

Finally, an evolutionary algorithm needs fine parameter tuning: population size, mutant and crossover percentages. Although we parameterized the algorithm so that it finds a good solution for every topic of our three corpora, the parameters' influence on the algorithm's convergence should be further studied using different data sets.

Extractive automatic summaries can outperform (in terms of ROUGE scores) human extractive summaries. This brings us to the upper limit that one can achieve with purely extractive methods, and to question the methods to consider from now on: sentence compression, generative paradigms for specialized summaries, or sentence rephrasing to achieve a better textual cohesion.

7 Conclusion

For automatic summarization, finding a summary (as good as possible) without an Oracle or any reference summary is an unsupervised task. In this paper, we show that automatic summarization can be achieved using recent advances in automatic summarization evaluation. In particular, we explore automatic summarization under the hypothesis that source documents and summaries have to share the same bigram distribution.

We present a sentence extraction evolutionary algorithm for automatic summarization capable of improving summary generation after generation based only on distribution computation. We show that this extraction algorithm achieves good results compared to a state-of-the-art method on three corpora. However, the system can be improved in several ways: the objective function can be improved to better evaluate candidate summaries, or be approximated to be used in a faster algorithm. Even if the ROUGE-2 scores used in this paper are heavily correlated to human metrics, the evaluation needs to be pushed further using fully manual metrics such as Pyramid.

References

Enrique Alfonseca and Pilar Rodríguez. 2003. Generating Extracts with Genetic Algorithms. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 511–519.

Regina Barzilay and Michael Elhadad. 1999. Using lexical chains for text summarization. Advances in automatic text summarization, pages 111–121.

Aurélien Bossard and Christophe Rodrigues. 2011. Combining a multi-document update summarization system CBSEAS with a genetic algorithm. In Ioannis Hatzilygeroudis and Jim Prentzas, editors, Combinations of Intelligent Methods and Applications, Springer Berlin Heidelberg, volume 8 of Smart Innovation, Systems and Technologies, pages 71–87.

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR'98: Proceedings of the 21st ACM SIGIR Conference, pages 335–336.

H. P. Edmundson. 1969. New methods in automatic extracting. Journal of the ACM 16(2):264–285.

Güneş Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research (JAIR).

Pierre-Etienne Genest, Guy Lapalme, and Mehdi Yousfi-Monod. 2009. Hextac: the creation of a manual extractive run. In Proceedings of the Second Text Analysis Conference, Gaithersburg, Maryland, USA.

Dan Gillick and Benoit Favre. 2009b. A scalable global model for summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, Association for Computational Linguistics, pages 10–18.

Dan Gillick, Benoit Favre, Dilek Hakkani-Tür, Berndt Bohnet, Yang Liu, and Shasha Xie. 2009a. The ICSI/UTD summarization system at TAC 2009. In Proceedings of the Workshop on Summarization task at the TAC 2009 conference.

Kai Hong, John Conroy, Benoit Favre, Alex Kulesza, Hui Lin, and Ani Nenkova. 2014. A repository of state of the art and competitive baseline summaries for generic news summarization. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), European Language Resources Association (ELRA), Reykjavik, Iceland.

Chen Li, Xian Qian, and Yang Liu. 2013. Using supervised bigram-based ILP for extractive summarization. In ACL 2013, The Association for Computer Linguistics, pages 1004–1013.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proc. ACL Workshop on Text Summarization Branches Out, page 10.

Marina Litvak, Mark Last, and Menahem Friedman. 2010. A new approach to improving multilingual summarization using a genetic algorithm. In Proceedings of the 48th Annual Meeting of ACL, ACL '10, pages 927–936.

Dexi Liu, Yanxiang He, Donghong Ji, and Hua Yang. 2006. Genetic algorithm based multi-document summarization. In Qiang Yang and Geoff Webb, editors, PRICAI 2006: Trends in Artificial Intelligence, Springer Berlin Heidelberg, volume 4099 of Lecture Notes in Computer Science, pages 1140–1144.

Annie Louis and Ani Nenkova. 2009. Automatically evaluating content selection in summarization without human models. In Proc. of the 2009 EMNLP Conference: Volume 1, ACL, pages 306–314.

H.P. Luhn. 1958. The automatic creation of literature abstracts. IBM Journal 2(2):159–165.

David J.C. MacKay and Linda C. Bauman Peto. 1994. A hierarchical Dirichlet language model. Natural Language Engineering 1:1–19.

Ryan McDonald. 2007. A study of global inference algorithms in multi-document summarization. Springer.

Kumaresh Nandhini and Sadhu Ramakrishnan Balasundaram. 2013. Use of genetic algorithm for cohesive summary extraction to assist reading difficulties. Applied Computational Intelligence and Soft Computing 2013:8.

Ani Nenkova, Rebecca Passonneau, and Kathleen McKeown. 2007. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Trans. Speech Lang. Process. 4(2).

Hitoshi Nishikawa, Kazuho Arita, Katsumi Tanaka, Tsutomu Hirao, Toshiro Makino, and Yoshihiro Matsuo. 2014. Learning to generate coherent summary with discriminative hidden semi-Markov model. In COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29, 2014, Dublin, Ireland, pages 1648–1659. http://aclweb.org/anthology/C/C14/C14-1156.pdf.

Maxime Peyrard and Judith Eckle-Kohler. 2016. Optimizing an approximation of ROUGE - a problem-reduction approach to extractive multi-document summarization. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Association for Computational Linguistics, volume 1: Long Papers, pages 1825–1836.

Dragomir R. Radev. 2000. A common theory of information fusion from multiple text sources step one: cross-document structure. In Proceedings of the 1st SIGdial Workshop, Association for Computational Linguistics, pages 74–83.

Horacio Saggion, Juan-Manuel Torres-Moreno, Iria da Cunha, and Eric SanJuan. 2010. Multilingual summarization evaluation without human models. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, Association for Computational Linguistics, Stroudsburg, PA, USA, pages 1059–1067.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Haruka Shigematsu and Ichiro Kobayashi. 2014. Topic-based multi-document summarization using differential evolution for combinatorial optimization of sentences.

Ruben Sipos, Pannaga Shivaswamy, and Thorsten Joachims. 2012. Large-margin learning of submodular summarization models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, Association for Computational Linguistics, Stroudsburg, PA, USA, pages 224–233. http://dl.acm.org/citation.cfm?id=2380816.2380846.

Hiroya Takamura and Manabu Okumura. 2010. Learning to generate summary as structured output. In Jimmy Huang, Nick Koudas, Gareth J. F. Jones, Xindong Wu, Kevyn Collins-Thompson, and Aijun An, editors, CIKM, ACM, pages 1437–1440.

Chengxiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22(2):179–214.
