Arxiv:1809.00221V1 [Cs.CL] 1 Sep 2018

A Multilingual Information Extraction Pipeline for Investigative Journalism Gregor Wiedemann Seid Muhie Yimam Chris Biemann Language Technology Group Department of Informatics Universitat¨ Hamburg, Germany fgwiedemann, yimam, [email protected] Abstract 2) court-ordered revelation of internal communi- cation, 3) answers to requests based on Freedom We introduce an advanced information extraction pipeline to automatically process very of Information (FoI) acts, and 4) unofficial leaks large collections of unstructured textual data of confidential information. Well-known exam- for the purpose of investigative journalism. ples of such disclosed or leaked datasets are the The pipeline serves as a new input proces- Enron email dataset (Keila and Skillicorn, 2005) sor for the upcoming major release of our or the Panama Papers (O’Donovan et al., 2016). New/s/leak 2.0 software, which we develop in To support investigative journalism in their cooperation with a large German news organi- zation. The use case is that journalists receive work, we have developed New/s/leak (Yimam a large collection of files up to several Giga- et al., 2016), a software implemented by experts bytes containing unknown contents. Collec- from natural language processing and visualiza- tions may originate either from official disclo- tion in computer science in cooperation with jour- sures of documents, e.g. Freedom of Informa- nalists from Der Spiegel, a large German news or- tion Act requests, or unofficial data leaks. Our ganization. Due to its successful application in the software prepares a visually-aided exploration investigative research as well as continued feed- of the collection to quickly learn about poten- tial stories contained in the data. It is based on back from academia, we further extend the func- the automatic extraction of entities and their tionality of New/s/leak, which now incorporates co-occurrence in documents. In contrast to better pre-processing, information extraction and comparable projects, we focus on the follow- deployment features. The new version New/s/leak ing three major requirements particularly serv- 2.0 serves four central requirements that have not ing the use case of investigative journalism in been addressed by the first version or other ex- cross-border collaborations: 1) composition of isting solutions for investigative and forensic text multiple state-of-the-art NLP tools for entity analysis: extraction, 2) support of multi-lingual document sets up to 40 languages, 3) fast and easy- Improved NLP processing: We use stable and to-use extraction of full-text, metadata and en- robust state-of-the-art natural language processing tities from various file formats. (NLP) to automatically extract valuable information for journalistic research. Our pipeline com- 1 Support Investigative Journalism bines extraction of temporal entities, named en- Journalists usually build up their stories around tities, key-terms, regular expression patterns (e.g. arXiv:1809.00221v1 [cs.CL] 1 Sep 2018 entities of interest such as persons, organizations, URLs, emails, phone numbers) and user-defined companies, events, and locations in combination dictionaries. with the complex relations they have. This is es- Multilingualism: Many tools only work for En- pecially true for investigative journalism which, in glish documents or a few other ‘big languages’. the digital age, more and more is confronted to In the new version, our tool allows for automatic find such relations between entities in large, un- language detection and information extraction in structured and heterogeneous data sources. 40 different languages. Support of multilingual Usually, this data is buried in unstructured texts, collections and documents is specifically useful to for instance from scanned and OCR-ed docu- foster cross-country collaboration in journalism. ments, letter correspondences, emails or protocols. Multiple file formats: Extracting text and meta- Sources typically range from 1) official disclo- data from various file formats can be a daunting sures of administrative and business documents, task, especially in journalism where time is a very scarce resource. In our architecture, we include search, and document tagging. Since this tool is a powerful data wrangling software to automatize already mature and has successfully been used in this process as much as possible. We further put a number of published news stories, we adapted emphasis on scalability in our pipeline to be able some of its most useful features such as document to process very large datasets. For easy deploy- tagging, full-text search and a keyword-in-context ment, New/s/leak 2.0 is distributed as a Docker (KWIC) view for search hits. setup. The Jigsaw visual analytics (Gorg¨ et al., 2014) Keyword graphs: We have implemented key- system is a third tool that supports analyzing and word network graphs, which is build based on the understanding of textual documents. The Jigsaw set of keywords representing the current document system focuses on the extraction of entities us- selection. The keyword network enables to fur- ing Gate tool suite for NLP (Cunningham et al., ther improve the investigation process by display- 2013). Hence, support for multiple languages is ing entity networks related to the keywords. somewhat limited. It also lacks sophisticated data import mechanisms. 2 Related Work The new version of New/s/leak was built tar- geting these drawbacks and challenges. With There are already a handful of commercial and New/s/leak 2.0 we aim to support the journalist open-source software products to support inves- throughout the entire process of collaboratively tigative journalism. Many of the existing tools analyzing large, complex and heterogeneous doc- such as OpenRefine1, Datawrapper2, Tabula3, or ument collections: data cleaning and formatting, Sisense4 focus solemnly on structured data and metadata extraction, information extraction, inter- most of them are not freely available. For un- active filtering, visualization, close reading and structured text data, there are costly products for tagging, and providing provenance information. forensic text analysis such as Intella5. Targeted user groups are national intelligence agencies. For 3 Architecture smaller publishing houses, acquiring a license for those products is simply not possible. Since we Figure1 shows the overall architecture of also follow the main idea of openness and freedom New/s/leak. In order to allow users to analyze of information, we concentrate on other open- a wide range of document types, our system in- source products to compare our software to. cludes a document processing pipeline, which ex- DocumentCloud6 is an open-source tool specif- tracts text and metadata from a variety of doc- ically designed for journalists to analyze, annotate ument types into a unified representation. On and publish findings from textual data. In addition this unified text representation, a number of NLP to full-text search, it offers named entity recogni- pre-processing tasks are performed as a UIMA tion (NER) based on OpenCalais7 for person and pipeline (Ferrucci and Lally, 2004), e.g. automatic location names. In addition to automatic NER for identification of the document language, segmen- multiple languages, our pipeline supports the iden- tation into paragraph, sentence and token units, tification of keyterms as well as temporal and user- and extraction of named entities, keywords and defined entities. metadata. ElasticSearch is used to store the pro- Overview (Brehmer et al., 2014) is another cessed data and create aggregation queries for dif- open-source application developed by computer ferent entity types to generate network graphs. scientists in collaboration with journalists to sup- The user interface is implemented with a RESTful port investigative journalism. The application sup- web service based on the Scala Play framework ports import of PDF, MS Office, and HTML doc- in combination with an AngularJS browser app to uments, document clustering based on topic sim- present information to the journalists. Visualiza- ilarity, a simple location entity detection, full-text tions are realized with D3 (Bostock et al., 2011). In order to enable a seamless deployment of the 1 http://openrefine.org tool by journalists with limited technical skills, we 2https://www.datawrapper.de 3http://tabula.technology have integrated all of the required components of 4https://www.sisense.com the architecture into a Docker8 setup. Via docker- 5https://www.vound-software.com compose, a software to orchestrate Docker con- 6https://www.documentcloud.org 7http://www.opencalais.com 8https://www.docker.com Hoover • processing of documents, archives, email inbox formats Hoover • fulltext extraction index • metadata extraction • duplicate detection IE pipeline User Interface • Language detection • Segmentation Newsleak • Temporal Entity extraction index • Named Entity Extraction • Key-term extraction • Dictionary extraction Figure 1: Architecture of New/s/leak 2.0 tainers for complex architectures, end-users can New/s/leak connects directly to Hoover’s index download and run locally a preconfigured version to read full-texts and metadata for its own infor- of New/s/leak with one single command. Being mation extraction pipeline. Through this close in- able to process data locally and even without any tegration with Hoover, New/s/leak can offer infor- connection to the internet is a vital prerequisite for mation extraction to a wide variety of data formats.

Load more