A Multilingual Information Extraction Pipeline for Investigative Journalism

Gregor Wiedemann Seid Muhie Yimam Chris Biemann Language Technology Group Department of Informatics Universitat¨ Hamburg, Germany {gwiedemann, yimam, biemann}@informatik.uni-hamburg.de

Abstract 2) court-ordered revelation of internal communi- cation, 3) answers to requests based on Freedom We introduce an advanced information extrac- tion pipeline to automatically process very of Information (FoI) acts, and 4) unofficial leaks large collections of unstructured textual data of confidential information. Well-known exam- for the purpose of investigative journalism. ples of such disclosed or leaked datasets are the The pipeline serves as a new input proces- Enron email dataset (Keila and Skillicorn, 2005) sor for the upcoming major release of our or the Panama Papers (O’Donovan et al., 2016). New/s/leak 2.0 software, which we develop in To support investigative journalism in their cooperation with a large German news organi- zation. The use case is that journalists receive work, we have developed New/s/leak (Yimam a large collection of files up to several Giga- et al., 2016), a software implemented by experts bytes containing unknown contents. Collec- from natural language processing and visualiza- tions may originate either from official disclo- tion in computer science in cooperation with jour- sures of documents, e.g. Freedom of Informa- nalists from Der Spiegel, a large German news or- tion Act requests, or unofficial data leaks. Our ganization. Due to its successful application in the software prepares a visually-aided exploration investigative research as well as continued feed- of the collection to quickly learn about poten- tial stories contained in the data. It is based on back from academia, we further extend the func- the automatic extraction of entities and their tionality of New/s/leak, which now incorporates co-occurrence in documents. In contrast to better pre-processing, information extraction and comparable projects, we focus on the follow- deployment features. The new version New/s/leak ing three major requirements particularly serv- 2.0 serves four central requirements that have not ing the use case of investigative journalism in been addressed by the first version or other ex- cross-border collaborations: 1) composition of isting solutions for investigative and forensic text multiple state-of-the-art NLP tools for entity analysis: extraction, 2) support of multi-lingual docu- ment sets up to 40 languages, 3) fast and easy- Improved NLP processing: We use stable and to-use extraction of full-text, metadata and en- robust state-of-the-art natural language processing tities from various file formats. (NLP) to automatically extract valuable informa- tion for journalistic research. Our pipeline com- 1 Support Investigative Journalism bines extraction of temporal entities, named en- Journalists usually build up their stories around tities, key-terms, regular expression patterns (e.g. arXiv:1809.00221v1 [cs.CL] 1 Sep 2018 entities of interest such as persons, organizations, URLs, emails, phone numbers) and user-defined companies, events, and locations in combination dictionaries. with the complex relations they have. This is es- Multilingualism: Many tools only work for En- pecially true for investigative journalism which, in glish documents or a few other ‘big languages’. the digital age, more and more is confronted to In the new version, our tool allows for automatic find such relations between entities in large, un- language detection and information extraction in structured and heterogeneous data sources. 40 different languages. Support of multilingual Usually, this data is buried in unstructured texts, collections and documents is specifically useful to for instance from scanned and OCR-ed docu- foster cross-country collaboration in journalism. ments, letter correspondences, emails or protocols. Multiple file formats: Extracting text and meta- Sources typically range from 1) official disclo- data from various file formats can be a daunting sures of administrative and business documents, task, especially in journalism where time is a very scarce resource. In our architecture, we include search, and document tagging. Since this tool is a powerful data wrangling software to automatize already mature and has successfully been used in this process as much as possible. We further put a number of published news stories, we adapted emphasis on scalability in our pipeline to be able some of its most useful features such as document to process very large datasets. For easy deploy- tagging, full-text search and a keyword-in-context ment, New/s/leak 2.0 is distributed as a Docker (KWIC) view for search hits. setup. The Jigsaw visual analytics (Gorg¨ et al., 2014) Keyword graphs: We have implemented key- system is a third tool that supports analyzing and word network graphs, which is build based on the understanding of textual documents. The Jigsaw set of keywords representing the current document system focuses on the extraction of entities us- selection. The keyword network enables to fur- ing Gate tool suite for NLP (Cunningham et al., ther improve the investigation process by display- 2013). Hence, support for multiple languages is ing entity networks related to the keywords. somewhat limited. It also lacks sophisticated data import mechanisms. 2 Related Work The new version of New/s/leak was built - geting these drawbacks and challenges. With There are already a handful of commercial and New/s/leak 2.0 we aim to support the journalist open-source software products to support inves- throughout the entire process of collaboratively tigative journalism. Many of the existing tools analyzing large, complex and heterogeneous doc- such as OpenRefine1, Datawrapper2, Tabula3, or ument collections: data cleaning and formatting, Sisense4 focus solemnly on structured data and metadata extraction, information extraction, inter- most of them are not freely available. For un- active filtering, visualization, close reading and structured text data, there are costly products for tagging, and providing provenance information. forensic text analysis such as Intella5. Targeted user groups are national intelligence agencies. For 3 Architecture smaller publishing houses, acquiring a license for those products is simply not possible. Since we Figure1 shows the overall architecture of also follow the main idea of openness and freedom New/s/leak. In order to allow users to analyze of information, we concentrate on other open- a wide range of document types, our system in- source products to compare our software to. cludes a document processing pipeline, which ex- DocumentCloud6 is an open-source tool specif- tracts text and metadata from a variety of doc- ically designed for journalists to analyze, annotate ument types into a unified representation. On and publish findings from textual data. In addition this unified text representation, a number of NLP to full-text search, it offers named entity recogni- pre-processing tasks are performed as a UIMA tion (NER) based on OpenCalais7 for person and pipeline (Ferrucci and Lally, 2004), e.g. automatic location names. In addition to automatic NER for identification of the document language, segmen- multiple languages, our pipeline supports the iden- tation into paragraph, sentence and token units, tification of keyterms as well as temporal and user- and extraction of named entities, keywords and defined entities. metadata. ElasticSearch is used to store the pro- Overview (Brehmer et al., 2014) is another cessed data and create aggregation queries for dif- open-source application developed by computer ferent entity types to generate network graphs. scientists in collaboration with journalists to sup- The user interface is implemented with a RESTful port investigative journalism. The application sup- web service based on the Scala Play framework ports import of PDF, MS Office, and HTML doc- in combination with an AngularJS browser app to uments, document clustering based on topic sim- present information to the journalists. Visualiza- ilarity, a simple location entity detection, full-text tions are realized with D3 (Bostock et al., 2011). In order to enable a seamless deployment of the 1 http://openrefine.org tool by journalists with limited technical skills, we 2https://www.datawrapper.de 3http://tabula.technology have integrated all of the required components of 4https://www.sisense.com the architecture into a Docker8 setup. Via docker- 5https://www.vound-software.com compose, a software to orchestrate Docker con- 6https://www.documentcloud.org 7http://www.opencalais.com 8https://www.docker.com Hoover • processing of documents, archives, email inbox formats Hoover • fulltext extraction index

• metadata extraction • duplicate detection

IE pipeline User Interface • Language detection • Segmentation Newsleak • Temporal Entity extraction index • Named Entity Extraction

• Key-term extraction • Dictionary extraction

Figure 1: Architecture of New/s/leak 2.0 tainers for complex architectures, end-users can New/s/leak connects directly to Hoover’s index download and run locally a preconfigured version to read full-texts and metadata for its own infor- of New/s/leak with one single command. Being mation extraction pipeline. Through this close in- able to process data locally and even without any tegration with Hoover, New/s/leak can offer infor- connection to the internet is a vital prerequisite for mation extraction to a wide variety of data formats. journalists when they work with sensitive data. All In many cases, this drastically limits or even com- necessary source code and installation instructions pletely eliminates the amount of work needed to can be found on our Github project page.9 clean and preprocess large datasets beforehand.

4 Data Wrangling 5 Multilingual Information Extraction Extracting text and metadata from various formats The core functionality of New/s/leak is the auto- into a format readable by a specific analysis tool matic extraction of various kinds of entities from can be a tedious task. In an investigative journal- text to facilitate the exploration and sense-making ism scenario, it can even be a deal breaker since process from large collections. Since a lot of time is an especially scarce resource and file for- steps in this process involve language-dependent mat conversion might not be a task journalists are resources, we put an emphasis on the work to sup- well trained in. To offer access to as many file port as many languages as possible. formats as possible in New/s/leak, we opted for a close integration with Hoover,10 a set of open- 5.1 Preprocessing source tools for text extraction and search in large Information extraction in New/s/leak is imple- text collections. Hoover is developed by the Eu- mented as a configurable UIMA pipeline (Ferrucci ropean Investigative Collaborations (EIC) network and Lally, 2004). Text documents and metadata with a special focus on large data leaks and hetero- from a Hoover collection (see Section4) are read geneous datasets. It can extract data from various in parallelized manner and put through a chain of text file formats such as txt, html, docx, pdf, zip, annotators. In a final step of the chain, results tar, pst, mbox, eml, etc. The text is extracted along from annotation processes are indexed in an Elas- with metadata from files (e.g. file name, creation ticSearch index for later retrieval and visualiza- date, file hash) and header information (e.g. sub- tion. ject, sender, and receiver). In the case of emails, First, we identify the language of each doc- attachments are processed automatically, too. ument. Alternatively, language can also be de- termined on a paragraph level to support multi- 9https://uhh-lt.github.io/ newsleak-frontend language documents, which can occur quite of- 10https://hoover.github.io ten, for instance in email leaks or bilingual con- tracts. Second, we separate sentences and tokens 5.4 Named Entity Recognition in each text. To guarantee compatibility with var- We automatically extract person, organization and ious Unicode scripts in different languages, we location names from all documents to allow for rely on the ICU4J library11 for this task. ICU4J an entity-centric exploration of the data collec- provides locale-specific sentence and word bound- tion. Named entity recognition is done using ary detection relying on a simple rule-based ap- the polyglot-NER library (Al-Rfou et al., 2015). proach. While the quality of the segmentation and Polyglot-NER contains sequence classification for tokenization results might be better when using named entities based on weakly annotated training specifically trained segmentation models, the ad- data automatically composed from Wikipedia12 vantage of the rule-based approach in ICU4J is and Freebase13. Relying on the automatic com- that it works robustly not only for many languages position of training data allows polyglot-NER to but also for noisy data, which we expect to be provide pre-trained models for 40 languages.14 abundant in real-life datasets.

5.2 Dictionaries and RE-patterns 5.5 Keyterm Extraction In many cases, journalists follow some hypothesis To further summarize document contents in addi- to test for their investigative work. Such a pro- tion to named entities, we automatically extract ceeding can involve looking for mentions of al- keyterms and phrases from documents. For this, ready known terms or specific entities in the data. we implemented a keyterm extraction library for This can be realized by lists of dictionaries pro- the 40 languages also supported in the previous vided to the initial information extraction process. step.15 Our approach is based on a statistical com- New/s/leak annotates every mention of a dictio- parison of document contents with generic ref- nary term with the respective list type. Dictionar- erence data. Reference data for each language ies can be defined in a language-specific fashion, is retrieved from the Leipzig Corpora Collection but also applied across documents of all languages (Goldhahn et al., 2012), which provides large rep- in the corpus. Extracted dictionary entities are dis- resentative corpora for language statistics. We em- played along with extracted named entities in the ploy log-likelihood significance as described in visualization. (Rayson et al., 2004) to measure the overuse of In addition to self-defined dictionaries, we an- terms (i.e. keyterms) in our target documents com- notate email addresses, telephone numbers, and pared to the generic reference data. Ongoing se- URLs with regular expression patterns. This is quences of keyterms in target documents are con- useful, especially for email leaks to reveal commu- catenated to key phrases if they occur regularly in nication networks of persons and filter for specific that exact same order. Regularity is determined email account related content. with the Dice coefficient. This simple method al- lows to reliably extract multiword units such as 5.3 Temporal Expressions “stock market” or “machine learning” in the docu- Tracking documents across the time of their cre- ments. Since this method also extracts named en- ation or by temporal events they mention can pro- tities if they occur significantly often in a docu- vide valuable information during investigative re- ment, there can be a substantial overlap between search. Unfortunately, many document sets (e.g. both types. To allow for a separate display of collections of scanned pages) do not come with named entities and keywords, we filter keyterms if a specific document creation date as structured they already have been annotated as a named en- metadata. To offer a temporal selection of contents tity. The remaining top keyterms are used to cre- to the user, we extract mentions of temporal ex- ate a brief summary of each document for the user pressions. This is done by integrating the Heidel- and to generate keyterm networks for document Time temporal tagger (Strotgen¨ and Gertz, 2015) browsing. in our UIMA workflow. HeidelTime provides au- 12https://wikipedia.org tomatically learned rules for temporal tagging in 13https://developers.google.com/ more than 200 languages. Extracted timestamps can be used to select and filter documents. 14A list of the 40 languages covered by Polyglot-NER can be found at https://tinyurl.com/yaju7bf7 11http://icu-project.org/apiref/icu4j 15https://github.com/uhh-lt/lt-keyterms 7 Case Study

To illustrate analysis capabilities of the new ver- sion of New/s/leak, we present an exemplary case study at https://ltdemos.informatik. uni-hamburg.de/newsleak/ (login with ”user” and ”password”). Since we refrain from publishing any confidential leak data, we created an artificial dataset from publicly available doc- Figure 2: The entity and keyword graphs of New/s/leak uments that share certain characteristics with the based on the WW2 collection (see Section7). Net- data from intended use cases in investigative jour- works are visualized based on the current document nalism. It contains documents written in multiple selection, which can be filtered by full-text search, en- languages, roughly centered on one topic and is tities or metadata. Visualization parameters such as the number of nodes per type or minimum edge strength full of references to entities. can be set by the user. Hovering over nodes and edges Ca. 27.000 documents in our sample set are in one graph highlights information present in the re- Wikipedia articles related to the topic of World spective another graph to show which entities and key- War II. Articles were crawled from the encyclo- words frequently co-occur with each other in docu- pedia in four languages (English:en, Spanish:es, ments. Hungarian:hu, German:de) as a link network start- ing from the article ”Second World War” in each 6 User Interface respective language. Preprocessing and data ex- Browsing entity networks: Access to unstruc- traction took around 75 minutes on a moderately tured text collections via named entities is essen- fast server with 12 parallel CPU threads. tial for journalistic investigations. To support this, Analysis: From a perspective of national his- we included two types of graph visualization, as it tory discourses and education, a certain common is shown in Figure2. The first graph, called entity knowledge about WW2 can be expected. But, the network, displays entities in a current document topic becomes quickly a novel unexplored terrain selection as nodes and their joint occurrence as for most people when it comes to aspects outside edges between nodes. Different node colors rep- of the own region, e.g. the involvement of Asian resent different types such as person, organization powers. In our test case, we strive to fill gaps or location names. Furthermore, mentions of enti- in our knowledge by identifying interesting de- ties that are annotated based on dictionary lists are tails regarding this question. First, we start with included in the entity network graph. The second a visualization of entities from the entire collec- graph, called keyword network, is build based on tion which highlights central actors of WW2 in the set of keywords representing the current doc- general. In the list of extracted location entities, ument selection. The keyword network also in- we can filter for ca. 2,000 articles referencing to cludes tags that can be attached to documents by Asia (en, es), Azsia´ (hu) or Asien (de). In this journalists during work with the collection. subselection, we find most references to China as Journalist in the loop: In addition to the auto- a political power of the region followed by India matic annotation of entities and keyterms, we fur- and Japan. Further refinement of the collection ther enable journalists to: 1) annotate new entity by references to China highlights a central per- types that are not in the system at all, 2) correct au- son name in the network, Chiang Kai-shek, who tomatic annotations provided by the pipeline, e.g. raises our interest. To find out more, we start to remove false positives or false entity type labels the filter process all over again, subselecting all annotated by the NER process, 3) merge identi- articles referencing this name. The resulting en- cal entities which have different forms (e.g. last tity network reveals a close connection to the or- names to full names, or spelling variants in differ- ganization Kuomintang (KMT). Filtering for this ent languages), and 4) label documents with user- organization, too, we can quickly identify arti- defined terms called tags. The tags are mainly cles centrally referencing to both entities by look- used to annotate the document either for later read- ing at their titles and extracted keywords. From ing or to share with collaborators. the corresponding keyterm network and a KWIC view into the article full-texts, we learn that KMT References is the national party of China and Kai-Shek as Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, and their leader ruled the country during the period of Steven Skiena. 2015. Polyglot-NER: Massive Mul- WW2. A second central actor, Mao Zedong, is tilingual Named Entity Recognition. In Proc. strongly connected with both, KMT and Chiang SIAM/ICDM-2015, pages 586–594. Kai-shek in our entity network. From articles also Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. prominently referencing Zedong, we learn from 2011. D3 Data-Driven Documents. IEEE Trans. Vi- sections highlighting both person names that Kai- sualization & Comp. Graphics, 17(12):2301–2309. shek and Zedong, also a member of KMT and later M. Brehmer, S. Ingram, J. Stray, and T. Munzner. 2014. leader of the Chinese Communists, shared a com- Overview: The Design, Adoption, and Analysis of plicated relationship. By filtering for both names, a Visual Document Mining Tool for Investigative we can now explore the nature of this relationship Journalists. IEEE Trans. Visualization & Comp. Graphics, 20(12):2271–2280. in more detail and compare its display across the four languages in our dataset.16 Hamish Cunningham, Valentin Tablan, Angus Roberts, and Kalina Bontcheva. 2013. Getting more out of 8 Discussion and Future Work biomedical documents with GATE’s full lifecycle open source text analytics. PLoS computational bi- In this paper, we introduced the completely ology, 9(2):e1002854. renewed information extraction pipeline of David Ferrucci and Adam Lally. 2004. UIMA: An New/s/leak 2.0, an open-source software to architectural approach to unstructured information support investigative journalism. As major processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348. requirements based on prior experiences, we identified the automatic annotation of various Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. entity types in very large, multi-lingual document 2012. Building large monolingual dictionaries at the sets contained in heterogeneous file formats. Our Leipzig corpora collection: From 100 to 200 lan- guages. In Proc. LREC 2012, pages 759–547. solution involves a combination of powerful NLP libraries for temporal and named entities, own Carsten Gorg,¨ Zhicheng Liu, and John Stasko. 2014. developments for keyterm and pattern extraction, Reflections on the evolution of the Jigsaw vi- sual analytics system. Information Visualization, and a powerful data wrangling suite for text and 13(4):336–345. metadata extraction. The pipeline is capable to process information extraction in 40 languages. P. S. Keila and D. B. Skillicorn. 2005. Detecting Proc. CASCON New/s/leak unusual email communication. In has been in use successfully at the 2005, pages 117–125. German news organization Der Spiegel. It recently has also been introduced as an open-source tool to James O’Donovan, Hannes F. Wagner, and Stefan the community of investigative journalists at re- Zeume. 2016. The value of offshore secrets evi- dence from the panama papers. SSRN Electronic spective conferences. We expect to collect more Journal. user feedback and experiences from case studies in the near future to further improve the software. Paul Rayson, Damon Berridge, and Brian Francis. 2004. Extending the Cochran rule for the compar- As a new main feature, we plan to extend the in- ison of word frequencies between corpora. In Proc. formation extraction pipeline for user-defined cat- JADT ’04, pages 926–765. egories into the direction of adaptive and active machine learning approaches. Currently, while Jannik Strotgen¨ and Michael Gertz. 2015. A Base- line Temporal Tagger for all Languages. In Proc. reading the full-texts, users can manually annotate EMNLP 2015, pages 541–547. new entity types in the text or tag the entire docu- ments. In combination with an adaptive and active Seid Muhie Yimam, Heiner Ulrich, Tatiana von Lan- desberger, Marcel Rosenbach, Michaela Regneri, learning approach, users will be able to train auto- Alexander Panchenko, Franziska Lehmann, Uli matic tagging of documents and extraction of in- Fahrer, Chris Biemann, and Kathrin Ballweg. 2016. formation while working with the data in the user new/s/leak – Information Extraction and Visualiza- interface. tion for Investigative Data Journalists. In Proc. ACL 2016 System Demonstrations, pages 163–168.

16A video of the proceeding can be found at: http:// youtu.be/96f_4Wm5BoU