Sangrahaka: A Tool for Annotating and Querying Knowledge Graphs
Hrishikesh Terdalkar
[email protected]
Indian Institute of Technology Kanpur
India

Arnab Bhattacharya
[email protected]
Indian Institute of Technology Kanpur
India

arXiv:2107.02782v2 [cs.SE] 23 Aug 2021

ABSTRACT

We present a web-based tool Sangrahaka for annotating entities and relationships from text corpora towards construction of a knowledge graph and subsequent querying using templatized natural language questions. The application is language and corpus agnostic, but can be tuned for specific needs of a language or a corpus. The application is freely available for download and installation. Besides having a user-friendly interface, it is fast, supports customization, and is fault tolerant on both client and server side. It outperforms other annotation tools in an objective evaluation metric. The framework has been successfully used in two annotation tasks. The code is available from https://github.com/hrishikeshrt/sangrahaka.

CCS CONCEPTS

• Software and its engineering → Software libraries and repositories; • Information systems → Information retrieval.

KEYWORDS

Annotation Tool, Querying Tool, Knowledge Graph

ACM Reference Format:
Hrishikesh Terdalkar and Arnab Bhattacharya. 2021. Sangrahaka: A Tool for Annotating and Querying Knowledge Graphs. In Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’21), August 23–28, 2021, Athens, Greece. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3468264.3473113

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ESEC/FSE ’21, August 23–28, 2021, Athens, Greece
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8562-6/21/08...$15.00
https://doi.org/10.1145/3468264.3473113

1 INTRODUCTION AND MOTIVATION

Annotation is a process of marking, highlighting or extracting relevant information from a corpus. It is important in various fields of computer science including natural language processing (NLP) and text mining. A generic application of an annotation procedure is in creating a dataset that can be used as a training or testing set for various machine learning tasks. The exact nature of the annotation process, though, can vary widely based on the targeted task.

In the context of NLP, annotation often refers to identifying and highlighting various parts of the sentence (e.g., characters, words or phrases) along with syntactic or semantic information.

Semantic tasks are high-level tasks dealing with the meaning of linguistic units and are considered among the most difficult tasks in NLP for any language. Question Answering (QA) is an example of a semantic task that deals with answering a question asked in a natural language. The task requires a machine to ‘understand’ the language, i.e., identify the intent of the question, and then search for relevant information in the available text. This often encompasses other NLP tasks such as part-of-speech tagging, named entity recognition, co-reference resolution, and dependency parsing [21].

Making use of knowledge bases is a common approach for the QA task [19, 22, 30, 32]. Construction of knowledge graphs (KGs) from free-form text, however, can be very challenging, even for English. The situation is worse for other languages, whose state-of-the-art in NLP is not as advanced as in English. As an example, consider the epic Mahabharata¹ in Sanskrit. The state-of-the-art in Sanskrit NLP is, unfortunately, not advanced enough to identify the entities in the text and their inter-relationships. Thus, human annotation is currently the only way of constructing a KG from it.

Even a literal sentence-to-sentence translation of Mahabharata into English, which probably boasts the best state-of-the-art in NLP, is not good enough. Consider, for example, the following sentence from (an English translation of) “The Mahabharata” [15]:

    Ugrasrava, the son of Lomaharshana, surnamed Sauti, well-versed in the Puranas, bending with humility, one day approached the great sages of rigid vows, sitting at their ease, who had attended the twelve years’ sacrifice of Saunaka, surnamed Kulapati, in the forest of Naimisha.

The above sentence contains numerous entities, e.g., Ugrasrava, Lomaharshana, as well as multiple relationships, e.g., Ugrasrava is-son-of Lomaharshana. One of the required tasks in building a KG for Mahabharata is to extract these entities and relationships.

Even a state-of-the-art tool such as spaCy [20] makes numerous mistakes in identifying the entities; it misses Ugrasrava and assigns the wrong type to several entities, e.g., Lomaharshana is identified as an Organization instead of a Person, and Saunaka as a Location instead of a Person.² Consequently, the relationships identified are also erroneous. This highlights the difficulty of the task for machines and substantiates the need for human annotation.

¹ Mahabharata is one of the two epics in India (the other being Ramayana) and is probably the largest book in any literature, containing nearly 100,000 sentences. It was originally composed in Sanskrit.
² This is not a criticism of spaCy; rather, this highlights the hardness of semantic tasks such as entity recognition.
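The entities and the relationship named in the Introduction (Ugrasrava is-son-of Lomaharshana) can be written down as typed nodes and labelled edges, which is the kind of structure a human annotator produces for a KG. The following is only a minimal sketch under stated assumptions: the class names `Entity` and `Relation`, the helper `targets`, and the type label given to Ugrasrava are hypothetical and are not taken from the Sangrahaka code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A node in the knowledge graph (hypothetical structure)."""
    name: str
    type: str


@dataclass(frozen=True)
class Relation:
    """A directed, labelled edge between two entities."""
    source: Entity
    label: str
    target: Entity


# Entities from the example sentence. The Person type for Lomaharshana and
# Saunaka follows the text; the type of Ugrasrava is an assumption.
ugrasrava = Entity("Ugrasrava", "Person")
lomaharshana = Entity("Lomaharshana", "Person")
saunaka = Entity("Saunaka", "Person")

# The one relationship stated explicitly in the text.
relations = [Relation(ugrasrava, "is-son-of", lomaharshana)]


def targets(relations, source_name, label):
    """Return names of entities reached from source_name via edges with label."""
    return [
        r.target.name
        for r in relations
        if r.source.name == source_name and r.label == label
    ]


print(targets(relations, "Ugrasrava", "is-son-of"))
```

Once annotations are stored in this form, a question such as "Who is the father of Ugrasrava?" reduces to a lookup over labelled edges, which is what makes the annotated graph usable for question answering.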
2 BACKGROUND

There are numerous text annotation tools available, including WebAnno [33], FLAT [28], BRAT [26], GATE Teamware [14], and doccano [23], for handling a variety of text annotation tasks including classification, labeling and sequence-to-sequence annotations.

Ideally, an annotation tool should support multiple features and facilities such as a friendly user interface, simple setup and annotation, fairly fast response time, distributed framework, web-based deployment, customizability, access management, crash tolerance, etc. In addition, for the purpose of knowledge-graph focused annotation, it is important to have capabilities for multi-label annotations and support for annotating relationships.

While WebAnno is extremely feature rich, it compromises on simplicity. Further, its performance deteriorates severely as the number of lines displayed on the screen increases. GATE also has the issue of a complex installation procedure and dependencies. FLAT has a non-intuitive interface and a non-standard data format. Development of BRAT has been stagnant, with the latest version being published as far back as 2012. The tool doccano, while simple to set up and use, does not support relationship annotation. Thus, unfortunately, none of these tools supports all the desired features of an annotation framework for the purpose of knowledge-graph annotation. Further, none of the above frameworks provides an integration with a graph database, a querying interface, or server and client side crash tolerance.

Thus, to satisfy the need for an annotation tool devoid of these pitfalls, we present Sangrahaka. It allows users to annotate and query through a single platform. The application is language and corpus agnostic, but can be customized for specific needs of a language or a corpus. Table 1 provides a high-level feature comparison of these annotation tools including Sangrahaka.

Table 1: Comparison of features of various annotation tools

Feature                  WebAnno  GATE  BRAT  FLAT  doccano  Sangrahaka
Distributed Annotation   X        X     X     X     X        X
Simple Installation      -        -     X     X     X        X
Intuitive                X        X     X     -     X        X
Entity and Relationship  X        X     X     X     -        X
Query Support            -        -     -     -     -        X
Crash Tolerance          -        -     -     -     -        X

A recently conducted extensive survey [24] evaluates 78 annotation tools and provides an in-depth comparison of 15 tools.

[...] needs. The tool can be deployed on the Web for distributed annotation by multiple annotators. No programming knowledge is expected from an annotator.

The primary tools and technologies used are Python 3.8 [29], Flask 1.1.2 [16, 25], Neo4j Community Server 4.2.1 [31], SQLite 3.35.4 [18] for the backend and HTML5, JavaScript, Bootstrap 4.6 [4], and vis.js [13] for the frontend.

3.1 Workflow

Figure 1 shows the architecture and workflow of the system.

The tool is presented as a web-based full-stack application. To deploy it, one first configures the application and starts the server. A user can then register and log in to access the interface. The tool uses a role-based access system. Roles are Admin, Curator, Annotator, and Querier. Permissions are tied to roles. Table 2 lists the roles and the permissions associated with them. A user can have more than one role. Every registered member has permission to access the user control panel and view the corpus.

Table 2: Roles and Permissions

                 Roles
Permissions      Querier  Annotator  Curator  Admin
Query            X        X          X        X
Annotate         -        X          X        X
Curate           -        -          X        X
Create Ontology  -        -          -        X
Upload Corpus    -        -          -        X
Manage Access    -        -          -        X

An administrator creates a corpus by uploading the text. She also creates a relevant ontology for the corpus and grants annotator access to relevant users. The ontology specifies the types of entities and relationships allowed. An annotator signs in and opens the corpus viewer interface to navigate through lines in the corpus. For every line, an annotator then marks the relevant entities and relationships. A curator can access annotations by all annotators, and can decide whether to keep or discard a specific annotation. This is useful for resolving conflicting annotations. An administrator may customize the graph generation mechanism based on the semantic task and semantics of the ontology. She then imports the generated graph into an independently running graph database server. A querier can then access the querying interface and use templatized natural language questions to generate graph