CADET: Computer Assisted Discovery Extraction and Translation
Total Page:16
File Type:pdf, Size:1020Kb
CADET: Computer Assisted Discovery Extraction and Translation Benjamin Van Durme, Tom Lippincott, Kevin Duh, Deana Burchfield, Adam Poliak, Cash Costello, Tim Finin, Scott Miller, James Mayfield Philipp Koehn, Craig Harman, Dawn Lawrie, Chandler May, Max Thomas Annabelle Carrell, Julianne Chaloux, Tongfei Chen, Alex Comerford Mark Dredze, Benjamin Glass, Shudong Hao, Patrick Martin, Pushpendre Rastogi Rashmi Sankepally, Travis Wolfe, Ying-Ying Tran, Ted Zhang Human Language Technology Center of Excellence, Johns Hopkins University Abstract Computer Assisted Discovery Extraction and Translation (CADET) is a workbench for helping knowledge workers find, la- bel, and translate documents of interest. It combines a multitude of analytics together with a flexible environment for customiz- Figure 1: CADET concept ing the workflow for different users. This open-source framework allows for easy development of new research prototypes using a micro-service architecture based atop Docker and Apache Thrift.1 1 Introduction CADET is an integrated workbench for helping knowledge workers discover, extract, and translate Figure 2: Discovery user interface useful information. The user interface (Figure1) is based on a domain expert starting with a large information in structured form. To do so, she ex- collection of data, wishing to discover the subset ports the search results to our Extraction interface, that is most salient to their goals, and exporting where she can provide annotations to help train an the results to tools for either extraction of specific information extraction system. The Extraction in- information of interest or interactive translation. terface allows the user to label any text span using For example, imagine a humanitarian aid any schema, and also incorporates active learning worker with a large collection of social media to complement the discovery process in selecting messages obtained in the aftermath of a natural data to annotate. disaster. Her goal is to find messages that contain Now consider a different user of the same data: specific needs, such as hospitals requiring food or suppose a citizen reporter wishes to provide up- medical supplies. She may begin by performing to-date news to foreign audiences. He uses our a textual search in our Discovery interface (Fig- Discovery interface to find the messages needed ure2). The Discovery interface can be customized for his story, then exports the text to our Trans- with different types of search providers, and the lation interface. The Translation interface is a aid worker can provide relevance feedback to per- computer-assisted translation toolkit that enables sonalize her search results. After several search him to quickly translate the messages to the target sessions, the aid worker may wish to automatically foreign language. construct a spreadsheet that contains the relevant The challenge with building these personalized workflows is that it involves integration of dis- 1 Please see http://hltcoe.github.io/cadet parate technologies and software components. In- for more information. The demo consists of a running sys- tem (both online and locally on a laptop), as well as pointers dividual components like search, information ex- for building the software. traction, active learning, and machine translation 5 The Companion Volume of the IJCNLP 2017 Proceedings: System Demonstrations, pages 5–8, Taipei, Taiwan, November 27 – December 1, 2017. c 2017 AFNLP may already be open-sourced and extensible, but Amazon Web Services. Data storage is abstracted, to make all these components talk to each other with implementations supporting a simple file- requires a non-trivial amount of effort. CADET backed directory structure, up to a network dis- is a prototyping framework that demonstrates how tributed Apache Accumulo instance. Each com- this integration can be made easier using a micro- ponent is a Docker container which can be down- service architecture built on Docker and Thrift.2 loaded, run, and mixed-and-matched based on the need. This framework was used as the basis of 2 Architecture Design a popular course at JHU, with undergraduate stu- dents cloning entire frameworks, developing both Data serialization In order for analytics to be on laptops and on AWS for projects on knowledge easily integrated, we developed CONCRETE3, a discovery in text.5 data serialization format for Human Language Technology (HLT). It replaces ad-hoc XML, CSV, 3 Discovery or programming language-specific serialization as a way of storing document- and sentence- level Discovery in CADET is presented to the user as annotations. CONCRETE is based on Apache a basic IR interface (Figure2): a query is en- Thrift and thus works cross-platform and in al- tered, results are returned in snippet format in a most all popular programming languages, includ- ranked list. These may be additionally labeled ing Javascript, C++, Java, and Python. All ana- for relevance-feedback, interacted with for visual- lytics in CADET, such as search providers or in- izing pre-computed HLT annotations stored with formation extraction systems, are required to be CONCRETE, or exported to either translation or “concrete compliant”. extraction services. A major goal in the design Microservices Analytics within CADET are im- of the discovery micro-services was to allow for plemented using a microservice architecture and abstracting a large number of non-traditional IR served up as Docker containers4. This allows for mechanisms, through a common and simple inter- rapid prototyping without worrying about the var- face. The CADET admin interface allows for on- ious underlying programming languages and li- the-fly changing out of different discovery service brary dependencies. The containers talk via a providers, in order to allow for compare and con- common CONCRETE microservice API. CADET trast studies in how well one perspective on search consists of a set of these microservices that pro- may be more beneficial than another to a given vide functionality for fetching and storing docu- user. Current discovery providers include: ments, searching, annotating, and training. These Keyword search Our baseline discovery ap- service definitions then support code generation proach is keyword search supported by a mod- of clients and servers in a wide range of lan- ule implementing our microservice APIs and us- guages including Python, Java, C++, Perl, and ing the Lucene information retrieval library.6 This JavaScript. The decoupled microservice design analytic is aware of the Concrete COMMUNICA- combined with Thrift’s code generation allows re- TION data structure, taking a collection of pro- searchers to rapidly integrate their own HLT com- cessed communications and performing standard ponents or compose workflows from existing com- bag of word indexing using the existing document ponents. tokenizations. Platforms CADET workflows have been run in Cross-lingual For demonstrating the ease of ex- a variety of environments: from standalone lap- tending the existing pipelines we have imple- tops with no connection to the internet, to the grid mented an English-Chinese transliteration engine, environment, to demonstration systems hosted on which is beneficial in particular to named entity search. Many names are transliterated, i.e. char- 2CADET is the result of a 9-week summer workshop at the Johns Hopkins University Human Language Technology acters of Chinese entities may be spelled out in Center of Excellence (JHU HLTCOE). A motivation for the terms of the Latin alphabet, where for an English- workshop was the observation that: there are many more po- tential users of HLT, each with their own needs, than there are speaking user it may be easier to issue queries researchers to customize technology to those needs. Our goal using the English transliteration, rather than the with CADET, besides the workbench itself, is to demonstrate an approach for rapid prototyping and integration. 5This course has recently won an internal educational 3http://hltcoe.github.io/concrete award at JHU (Lippincott & Van Durme). 4 https://www.docker.com 6https://lucene.apache.org 6 original Chinese. We currently support a query transliteration system based on an approach sim- ilar to Finch and Sumita(2008). Question Answering From keywords to natural language sentences, one of the CADET discovery service providers is a fully integrated version of recent JHU work in discriminative IR for question Figure 3: Extraction annotation interface answer passage retrieval (Chen and Van Durme, 2017). This allows users to type in a query and 4 Extraction get back individual sentences ranked by their like- After selecting content via the Discovery services, lihood of answering the question. the user can switch to the Extraction web interface Mention search We define entity mention search to build an information extraction system. CADET as the task of selecting a mention (name, nom- provides an interface (Figure3) for the user to inal or pronominal) in a currently viewed doc- efficiently label this data according to their own ument, and returning documents most likely to sequence tagging schemas, e.g., for named entity contain mentions of the same entity. We im- recognition (NER). plemented mention search wrapped on top of The extraction annotation framework is also KELVIN (Finin et al., 2016) which is a multi-year built on CONCRETE, supporting the application investment in knowledge base population (KBP) of multiple competing systems on content, stor- research. This framework processes a provided