An Entity-Focused Approach to Generating Company Descriptions
Gavin Saldanha*  Or Biran**  Kathleen McKeown**  Alfio Gliozzo†
Columbia University  Columbia University  Columbia University  IBM Watson
* [email protected]  ** {orb, kathy}@cs.columbia.edu  † [email protected]

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 243–248, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics

Abstract

Finding quality descriptions on the web, such as those found in Wikipedia articles, of newer companies can be difficult: search engines show many pages with varying relevance, while multi-document summarization algorithms find it difficult to distinguish between core facts and other information such as news stories. In this paper, we propose an entity-focused, hybrid generation approach to automatically produce descriptions of previously unseen companies, and show that it outperforms a strong summarization baseline.

1 Introduction

As new companies form and grow, it is important for potential investors, procurement departments, and business partners to have access to a 360-degree view describing them. The number of companies worldwide is very large and, for the vast majority, not much information is available in sources like Wikipedia. Often, only firmographics data (e.g. industry classification, location, size, and so on) is available. This creates a need for cognitive systems able to aggregate and filter the information available on the web and in news, databases, and other sources. Providing good quality natural language descriptions of companies allows for easier access to the data, for example in the context of virtual agents or with text-to-speech applications.

In this paper, we propose an entity-focused system using a combination of targeted (knowledge base driven) and data-driven generation to create company descriptions in the style of Wikipedia descriptions. The system generates sentences from RDF triples, such as those found in DBPedia and Freebase, about a given company and combines these with sentences on the web that match learned expressions of relationships. We evaluate our hybrid approach and compare it with a targeted-only approach and a data-driven-only approach, as well as a strong multi-document summarization baseline. Our results show that the hybrid approach performs significantly better than either approach alone as well as the baseline.

The targeted (TD) approach to company description uses Wikipedia descriptions as a model for generation. It learns how to realize RDF relations that have the company as their subject: each relation contains a company/entity pair, and it is these pairs that drive both content and expression of the company description. For each company/entity pair, the system finds all the ways in which similar company/entity pairs are expressed in other Wikipedia company descriptions, clustering together sentences that express the same company/entity relation pairs. It generates templates for the sentences in each cluster, replacing the mentions of companies and entities with typed slots, and generates a new description by inserting expressions for the given company and entity in the slots. All possible sentences are generated from the templates in the cluster; the resulting sentences are ranked and the best sentence for each relation is selected to produce the final description. Thus, the TD approach is a top-down approach, driven to generate sentences expressing the relations found in the company's RDF data using realizations that are typically used on Wikipedia.

In contrast, the data-driven (DD) approach uses a semi-supervised method to select sentences from descriptions about the given company on the web. Like the TD approach, it also begins with a seed set of relations present in a few companies' DBPedia entries, represented as company/entity pairs, but instead of looking at the corresponding Wikipedia articles, it learns patterns that are typically used to express the relations on the web. In the process, it uses bootstrapping (Agichtein and Gravano, 2000) to learn new ways of expressing the relations corresponding to each company/entity pair, alternating with learning new pairs that match the learned expression patterns. Since the bootstrapping process is driven only by company/entity pairs and lexical patterns, it has the potential to learn a wider variety of expressions for each pair and to learn new relations that may exist for each pair. Thus, this approach lets data for company descriptions on the web determine the possible relations and patterns for expressing those relations in a bottom-up fashion. It then uses the learned patterns to select matching sentences from the web about a target company.

2 Related Work

The TD approach falls into the generation pipeline paradigm (Reiter and Dale, 1997), with content selection determined by the relations in the company's DBPedia entry, while microplanning and realization are carried out through template generation. While some generation systems, particularly in early years, used sophisticated grammars for realization (Matthiessen and Bateman, 1991; Elhadad, 1991; White, 2014), in recent years template-based generation has shown a resurgence. In some cases, authors focus on document planning and sentences in the domain are stylized enough that templates suffice (Elhadad and McKeown, 2001; Bouayad-Agha et al., 2011; Gkatzia et al., 2014; Biran and McKeown, 2015). In other cases, learned models that align database records with text snippets and then abstract out specific fields to form templates have proven successful for the generation of various domains (Angeli et al., 2010; Kondadadi et al., 2013). Others, like us, target atomic events (e.g., date of birth, occupation) for inclusion in biographies (Filatova and Prager, 2005), but the templates used in other work are manually encoded.

Sentence selection has also been used for question answering and query-focused summarization. Some approaches focus on selection of relevant sentences using probabilistic methods (Daumé III and Marcu, 2005; Conroy et al., 2006), semi-supervised learning (Wang et al., 2011) and graph-based methods (Erkan and Radev, 2004; Otterbacher et al., 2005). Yet others use a mixture of targeted and data-driven methods for a pure sentence selection system (Blair-Goldensohn et al., 2003; Weischedel et al., 2004; Schiffman et al., 2001). In our approach, we target both relevance and variety of expression, driving content by selecting sentences that match company/entity pairs and inducing multiple patterns of expression. Sentence selection has also been used in prior work on generating Wikipedia overview articles (Sauper and Barzilay, 2009). Their focus is more on learning domain-specific templates that control the topic structure of an overview, a much longer text than we generate.

3 Targeted Generation

The TD system uses a development set of 100 S&P 500 companies along with their Wikipedia articles and DBPedia entries to form templates. For each RDF relation with the company as the subject, it identifies all sentences in the corresponding article containing the entities in the relation. The specific entities are then replaced with their relation to create a template. For example, "Microsoft was founded by Bill Gates and Paul Allen" is converted to "⟨company⟩ was founded by ⟨founder⟩," with conjoined entities collapsed into one slot. Many possible templates are created, some of which contain multiple relations (e.g., "⟨company⟩, located in ⟨location⟩, was founded by ⟨founder⟩"). In this way the system learns how Wikipedia articles express relations between the company and its key entities (founders, headquarters, products, etc.).

At generation time, we fill the template slots with the corresponding information from the RDF entries of the target company. Conjunctions are inserted when slots are filled by multiple entities. Continuing with our example, we might now produce the sentence "Palantir was founded by Peter Thiel, Alex Karp, Joe Lonsdale, Stephen Cohen, and Nathan Gettings" for target company Palantir. Preliminary results showed that this method was not adequate: the data for the target company often lacked some of the entities needed to fill the templates. Without those entities the sentence could not be generated. As Wikipedia sentences tend to have multiple relations each (high information density), many sentences containing important, relevant facts were discarded due to phrases that mentioned lesser facts we did not have the data to replace. We therefore added a post-processing step to remove, if possible, any phrases from the sentence that could not be filled; otherwise, the sentence is discarded.

This process yields many potential sentences for each relation, of which we only want to choose the best. We cluster the newly generated sentences by relation and score each cluster. Sentences are scored according to how much information about the target company they contain (number of replaced relations). Shorter sentences are also weighted more, as they are less likely to contain extraneous information, and sentences with more post-processing are scored lower. The highest scored sentence for each relation type is added to the description, as those sentences are the most informative, relevant, and most likely to be grammatically correct.

pattern.1 The entities are therefore assumed to be related since they are expressed in the same way as the seed pair. Unlike the TD approach, the actual relationship between the entities is unknown (since the only data we use is the web text, not the structured RDF data); all we need to know here is that a relationship exists.

We alternate learning the patterns and generating entity pairs over our development set of 100 companies. We then take all the learned patterns and find matching sentences in the Bing search results for each company in the set of target companies.2 Sentences that match any of the patterns are selected and ranked by number of matches (more matches means greater probability of strong rela-
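The template-extraction step of the TD approach (Section 3) can be sketched roughly as follows. This is our own minimal illustration, not the paper's implementation: `make_template`, the ASCII `<slot>` notation, and the conjunction-collapsing regex are all assumptions standing in for the described behavior.

```python
import re

def make_template(sentence, entity_relations):
    """Turn a Wikipedia sentence into a template by replacing entity
    mentions with typed slots, as in Section 3's Microsoft example.
    entity_relations maps a mention string to its relation name."""
    template = sentence
    # Replace longer mentions first so "Bill Gates" wins over "Bill".
    for entity in sorted(entity_relations, key=len, reverse=True):
        template = template.replace(entity, f"<{entity_relations[entity]}>")
    # Collapse conjoined entities of the same type into a single slot,
    # e.g. "<founder> and <founder>" -> "<founder>".
    template = re.sub(r"<(\w+)>(?:(?:,|\s|and)+<\1>)+", r"<\1>", template)
    return template

rels = {"Microsoft": "company", "Bill Gates": "founder", "Paul Allen": "founder"}
print(make_template("Microsoft was founded by Bill Gates and Paul Allen", rels))
# -> <company> was founded by <founder>
```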
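Template filling at generation time might then look like the sketch below. One simplification to note: the paper first tries to remove an unfillable phrase in post-processing and only then discards the sentence, whereas this toy version simply rejects any template whose slots cannot all be filled.

```python
import re

def fill_template(template, rdf):
    """Fill each <slot> with the target company's RDF values, inserting
    conjunctions when a slot is filled by multiple entities. Returns
    None when some slot has no data, so the sentence is discarded."""
    def expand(match):
        values = rdf.get(match.group(1))
        if not values:
            raise KeyError(match.group(1))
        if len(values) == 1:
            return values[0]
        sep = ", and " if len(values) > 2 else " and "
        return ", ".join(values[:-1]) + sep + values[-1]
    try:
        return re.sub(r"<(\w+)>", expand, template)
    except KeyError:
        return None

rdf = {
    "company": ["Palantir"],
    "founder": ["Peter Thiel", "Alex Karp", "Joe Lonsdale",
                "Stephen Cohen", "Nathan Gettings"],
}
print(fill_template("<company> was founded by <founder>", rdf))
# -> Palantir was founded by Peter Thiel, Alex Karp, Joe Lonsdale,
#    Stephen Cohen, and Nathan Gettings
print(fill_template("<company>, located in <location>, was founded by <founder>", rdf))
# -> None (no <location> data for this company)
```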
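The scoring heuristics at the end of Section 3 (more filled relations is better, shorter is better, heavier post-processing is worse) could be combined as in this sketch; the 0.1 and 0.5 weights are invented for illustration and are not from the paper.

```python
def score(sentence, n_relations_filled, n_phrases_removed):
    """Toy sentence-scoring heuristic: reward filled relations, prefer
    shorter sentences, and penalize post-processing edits.
    The weights here are illustrative assumptions."""
    return n_relations_filled - 0.1 * len(sentence.split()) - 0.5 * n_phrases_removed

# (sentence, relations filled, phrases removed in post-processing)
candidates = [
    ("Palantir was founded by Peter Thiel.", 2, 0),
    ("Palantir, a company known for many things, was founded by Peter Thiel.", 2, 1),
]
best = max(candidates, key=lambda c: score(*c))
print(best[0])  # the shorter, unedited sentence wins
```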
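The DD approach's alternation between learning expression patterns and harvesting new company/entity pairs, in the spirit of the bootstrapping of Agichtein and Gravano (2000), might be sketched as follows. The between-mentions pattern representation and the naive capitalized-phrase matcher are illustrative assumptions, far simpler than a real implementation.

```python
import re

def learn_patterns(sentences, pairs):
    """Step 1: for each sentence containing a known (company, entity)
    pair, record the text between the two mentions as a lexical pattern."""
    patterns = set()
    for s in sentences:
        for company, entity in pairs:
            m = re.search(re.escape(company) + r"(.+?)" + re.escape(entity), s)
            if m:
                patterns.add(m.group(1))
    return patterns

def harvest_pairs(sentences, patterns):
    """Step 2: any two capitalized phrases joined by a learned pattern
    are assumed to be a related company/entity pair; as in the DD
    approach, the exact relation stays unknown."""
    name = r"([A-Z][\w&]*(?: [A-Z][\w&]*)*)"
    found = set()
    for s in sentences:
        for p in patterns:
            for m in re.finditer(name + re.escape(p) + name, s):
                found.add((m.group(1), m.group(2)))
    return found

sentences = [
    "Microsoft is headquartered in Redmond.",
    "Palantir is headquartered in Denver.",
]
patterns = learn_patterns(sentences, {("Microsoft", "Redmond")})
new_pairs = harvest_pairs(sentences, patterns)
print(patterns)   # {' is headquartered in '}
print(new_pairs)  # includes the new pair ('Palantir', 'Denver')
```

In a full system the two steps would be iterated, with newly harvested pairs feeding the next round of pattern learning.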