ENTYFI: a System for Fine-Grained Entity Typing in Fictional Texts

ENTYFI: A System for Fine-grained Entity Typing in Fictional Texts Cuong Xuan Chu Simon Razniewski Gerhard Weikum Max Planck Institute for Informatics Saarbrucken,¨ Germany fcxchu, srazniew, [email protected] Abstract Fiction and fantasy are core parts of human cul- Fiction and fantasy are archetypes of long-tail ture, spanning from traditional folks and myths domains that lack suitable NLP methodologies into books, movies, TV series and games. People and tools. We present ENTYFI, a web-based have created sophisticated fictional universes such system for fine-grained typing of entity men- as the Marvel Universe, DC Comics, Middle Earth tions in fictional texts. It builds on 205 au- or Harry Potter. These universes include entities, tomatically induced high-quality type systems social structures, and events that are completely for popular fictional domains, and provides different from the real world. Appropriate entity recommendations towards reference type systems for given input texts. Users can ex- typing for these universes is a prerequisite for sev- ploit the richness and diversity of these ref- eral end-user applications. For example, a Game erence type systems for fine-grained super- of Thrones fan may want to query for House Stark vised typing, in addition, they can choose members who are Faceless Men or which charac- among and combine four other typing mod- ter is both a Warg and a Greenseer. On the other ules: pre-trained real-world models, unsuper- hand, an analyst may want to compare social struc- vised dependency-based typing, knowledge tures between different mythologies or formations base lookups, and constraint-based candidate consolidation. The demonstrator is avail- of different civilizations. able at https://d5demos.mpi-inf.mpg. State-of-the-art methods for entity typing mostly de/entyfi. use supervised models trained on Wikipedia con- 1 Introduction tent, and only focus on news and similar real-world texts. Due to low coverage of Wikipedia on fic- Motivation and Problem. Entity types are a tional domains, these methods are thus not suffi- core building block of current knowledge bases cient for literary texts. For example, for the follow- (KBs) and valuable for many natural language ing sentence from Lord of the Rings: processing tasks, such as coreference resolution, “After Melkor’s defeat in the First Age, Sauron relation extraction and question answering (Lee became the second Dark Lord and strove to con- et al., 2006; Carlson et al., 2010; Recasens et al., quer Arda by creating the Rings” 2013). Context-based entity typing, the task of assigning semantic types for mentions of entities state-of-the-art entity typing methods only return person in textual contexts (e.g., musician, politician, few coarse types for entities, such as for location location or battle) therefore has become an im- SAURON and MELKOR or for FIRST portant NLP task. While traditional methods of- AGE and ARDA. Moreover, existing methods typi- ten use coarse-grained classes, such as person, cally produce predictions for each individual men- location, organization and misc, as targets, tion, so that different mentions of the same entity recent methods try to classify entities into finer- may be assigned incompatible types, e.g., ARDA grained types, from hundreds to thousands of them, may be predicted as person and location in dif- yet all limited to variants of the real world, like ferent contexts. from Wikipedia or news (Lee et al., 2006; Ling and Contribution. The prototype system presented Weld, 2012; Corro et al., 2015; Choi et al., 2018). in this demo paper, ENTYFI (fine-grained ENtity Entity type information plays an even more im- TYping on FIctional texts, see Chu et al.(2020) portant role in literary texts from fictional domains. for full details) overcomes the outlined limitations. 100 Proceedings of the 2020 EMNLP (Systems Demonstrations), pages 100–106 November 16-20, 2020. c 2020 Association for Computational Linguistics [1] Type System Construction ENTYFI supports long input texts from any kind of Taxonomy Induction u1, u2, ..., un literature, as well as texts from standard domains [2] Reference [4] Mention Typing Universe Ranking (e.g., news). With the sample text above, ENTYFI [4.1] Supervised [4.2] Unsupervised U1: r1 {T1, KB1} e, cl, cr e, cl, cr U2: r2 {T2, KB2} Input [4.1.1] .. .. Patterns is able to predict more specific and meaningful Fictional Typing Dependency [3] Mention Detection types for entity mentions: LSTM [4.3] KB Lookup [4.1.2] Real-world Typing e KB1 Decoding ELKOR Ainur, Villain IRST GE Eras, Time KB2 M : F A : e: t1, t2,... .. Output Mention: e, Context: cl, cr SAURON: Maiar, Villain DARK LORD: Ainur, Title e1: t1, t2, ... RINGS: Jewelry, Magic Things ARDA: Location e2: t3, t5, ... [5] Type Consolidation e: t1, t2, t3,..., tn ILP Model To address the lack of reference types, ENTYFI Figure 1: Overview of the architecture of ENTYFI leverages the content of fan communities on (Chu et al., 2020). Wikia.com, from which 205 reference type systems are induced. Given an input text, ENTYFI then re- trieves the most relevant reference type systems Breaking Bad) and video games (e.g. League of and uses them as typing targets. By combining su- Legends, Pokemon). pervised typing method with unsupervised pattern Each universe in Wikia is organized similarly extraction and knowledge base lookups, suitable to Wikipedia, such that they contain entities and type candidates are identified. To resolve incon- categories that can be used to distill reference type sistencies among candidates, ENTYFI utilizes an systems. We adopt techniques from the TiFi sys- integer linear programming (ILP) based consolida- tem (Chu et al., 2019) to clean and structure Wikia tion stage. categories. We remove noisy categories (e.g. meta- categories) by using a dictionary-based method. To Outline. The following section describes the ensure connectedness of taxonomies, we integrate architecture of ENTYFI with the approach un- the category networks with WordNet (WN) by link- derlying its main components. The demonstra- ing the categories to the most similar WN synsets. tion is illustrated afterwards through its graph- The similarity is computed between the context ical user interface. Our demonstration system of the category (e.g., description, super/sub cate- is available at: https://d5demos.mpi-inf.mpg. gories) and the gloss of the WN synset (Chu et al., de/entyfi. We also provide a screencast video 2019). Resulting type systems typically contain demonstrating our system, at: https://youtu. between 700 to 10,000 types per universe. be/g_ESaONagFQ. 2.2 Reference Universe Ranking 2 System Overview Given an input text, the goal of this step is to find ENTYFI comprises five steps: type system con- the most relevant universes among the reference struction, reference universe ranking, mention de- universes. Each reference universe is represented tection, mention typing and type consolidation. Fig- by its entities and entity type system. We compute ure1 shows an overview of the ENTYFI architec- the cosine similarity between the TF-IDF vectors ture. of the input and each universe. The top-ranked reference universes and their type systems are then 2.1 Type System Construction used for mention typing (section 2.4). To counter the low coverage of entities and rele- 2.3 Mention Detection vant types in Wikipedia for fictional domains, we make use of an alternative semi-structured resource, To detect entity mentions in the input text, we rely Wikia1. on a BIOES tagging scheme. Inspired by He et al. (2017) from the field of semantic role labeling, we Wikia. Wikia is a large fiction community portal, design a BiLSTM network with embeddings and includes over 385,000 individual universes. It cov- POS tags as input, highway connections between ers a wide range of universes in fiction and fantasy, layers to avoid vanishing gradients (Zhang et al., from old folks and myths like Greek Mythology, 2016), and recurrent dropout to avoid over-fitting Egyptian Mythology to recent stories like Harry (Gal and Ghahramani, 2016). The output is then put Potter, The Lord of the Rings. It also contains pop- into a decoding step by using dynamic program- ular movies, TV series (e.g. Game of Thrones, ming to select the tag sequence with maximum 1https://wikia.com score that satisfies the BIOES constraints. The de- 101 coding step does not add more complexity to the a noun phrase can be considered as a type can- training. didate if there exists a noun compound modifier (nn) between the noun phrase and the given men- 2.4 Mention Typing tion. In the case of candidate types appearing in We produce type candidates for mentions by us- the mention itself, we extract the head word of the ing a combination of supervised, unsupervised and mention and consider it as a candidate if it appears lookup approaches. as a noun in WordNet. For example, given the text Queen Cersei was the twentieth ruler of the Seven Supervised Fiction Types. Given an entity men- Kingdoms, queen and kingdom are type candidates tion and its textual context, we approach typing for the mentions CERSEI and SEVEN KINGDOMS, as multiclass classification problem. The mention respectively. representation is the average of all embeddings of tokens in the mention. The context representation KB Lookup. Using top-ranked universes from is a combination of left and right context around the section 2.2 as basis for the lookup, we map en- mention. The contexts are encoded by using BiL- tity mentions to entities in reference universes by STM models (Graves, 2012) and then put into at- using lexical matching. The types of entities in tention layer to learn the weight factors (Shimaoka corresponding type systems then become type can- et al., 2017). Mention and context representations didates for the given mentions.

Load more