INCEPTION CORPUS-BASED DATA SCIENCE FROM SCRATCH

Richard Eckart de Castilho, Jan-Christoph Klie, Naveen Kumar, Beto Boullosa, Iryna Gurevych

Ubiquitous Knowledge Processing Lab (UKP-TUDA) Dept. of Computer Science Technische Universität Darmstadt

Motivation Project Workflow Assisted

• Corpus-based data science is seeing $ % & rapid adoption in science and industry ### Expert Trained “Crowd” ### • Domain-specific annotated corpora are Sources Subcorporation Export / Freeze required in many domains Identify relevant annotation units Project Setup Curation • Current corpus processes for manual Annotation Define data sources Annotate extracted units Manual comparison Define annotation scheme With categories from KB Automatic aggregation corpus annotation do not scale Define knowledge scheme Linking to instances in KB Assign permissions $ KB Population Human in the loop Define schema (classes/properties) Recommenders Active INCEpTION is an infrastructure-ready Edit instances Add qualifiers to statements ● Continually learn from the users actions ● Aims at reducing the time to learn by human-in-the-loop annotation platform ● Asynchronous training does not slow asking specific feedback from the user down user interface ● Using uncertainty-sampling strategy Monitoring ● Automatic evaluation to avoid inaccurate ● Compatible with any recommender that combining machine learning and human ! Assign workload Monitor progress " predictions (configurable quality threshold) provides a confidence score Manager Evaluate quality Curator ● Built-in recommenders: Dictionary-based, ● User can freely switch between active expertise for rapid domain adaptation OpenNLP-based sequence classifier (part- learning and normal annotation of-speech, named entities, …)

INCEpTION Platform Knowledge Bases Extensibility

INCEpTION aims to support three functionalities which are INCEpTION is a modular architecture providing many commonly required for text annotation projects but typically extension points where new functionality can be added and not available in a single tool: corpus creation; text existing functionality can be changed - a selection of these annotation; knowledge management. is shown below:

The platform additionally provides assistive features such as Annotation Annotation Project machine learning recommenders to help users working on Editors Layers Importers/Exporters these tasks more efficiently. Annotation Editor Annotation Data Formats Sidebars Feature types Integrating these functionalities into a single comprehensive Data Model Entity and Fact Linking platform permits addressing tasks typically not found in Annotation Editor ● RDF-based data model: ● Linking of to KB classes and Recommenders Event Logging generic annotation platforms, such as entity linking, classes, properties, instances, literals instances via an auto-complete field Extensions knowledge base population, cross-document coreference ● Support for class hierarchies ● Optional contextual ranking of candidates ● Classes can have instances ● Linked annotations shown on the KB page annotation, etc. ● Support for typed properties including when a class/instance is selected The architecture modular architecture is realized using the domain and range restrictions ● Special support for linking factual ● Editors for different value types (string, statements in text documents to Spring framework. Dependency injection and events are numeric, boolean, KB resources) subject/predicate/object triples in the KB used to achieve a loose coupling between the modules. Diversity Relevance Feedback Corpus Creation

Search Semantic Search Interoperability and Integration Active Learning Statistics Distant Supervision INCEpTION Within the NLP and text mining landscape, Integration goes beyond interoperability. E.g. when an external text mining an annotation platform like INCEpTION only platform wants to delegate annotation to the INCEpTION platform, it needs to be Knowledge Cross-Document covers a part of the overall text analysis able to automatically set up annotation projects, import data, monitor the ongoing Annotation Relations Management needs. Therefore, it is important that the annotations and finally retrieve the annotated data for further use. Mentions Concepts platform is open and interoperable with Relations Entity Linking Entities Automatic Event Linking Completion external services and resources. Suggestions Syntax Events Deduplication Infrastructures & Platforms Legacy WebAnno TSV CLARIN TCF CoNLL-U CoNLL ● Supports the OpenMinTeD AERO protocol formats ● Remote configuration of annotation projects ● Import of documents for annotation Knowledge Resources ● Export of annotated documents Data Formats UIMA CAS Wikipedia Binary+XMI ● RDF-based knowledge resources are ● Notifications on key events (e.g. document or ● Supports a range of different formats for supported project finished) via web hooks importing and exporting annotated text corpora ● Knowledge bases are accessed via SPARQL ● Support of the TCF format enables Open Development DBPedia ● Configurable schema mapping allows interoperability with the CLARIN-D WebLicht Text supporting many different knowledge annotation services resources (RDFS, OWL, SKOS, …) ● Full compatibility with CLARIN’s WebAnno including the import of entire annotation projects ● Support for entity linking if knowledge base While most annotation tools are built in annotation projects, ● UIMA CAS data format enables interoperability Custom KB has full-text search capabilities ... INCEpTION is an infrastructure project and is not with NLP pipelines like DKPro Core associated with any single annotation project. Acquiring INCEpTION early adopters and aligning with their use-cases is a key part Annotation Services Annotation Schemata DKPro TC of our mission. ● Connect external NLP services and machine ● Flexible type system configuration. This motivates our open development philosophy: learning tools to support the user during the ● Use UIMA type system descriptor to auto- Python- annotation process Authentication based configure an annotation project. ● Programming language independent HTTP- machine ● Support for external authentication ● DKPro Core types pre-configured (token, learning based protocol sentence, POS, lemma, morphological ● All code is open and publicly available on GitHub under mechanisms via HTTP headers ● Protocol supports re-training the external tools features, dependencies, coreference ● UIMA CAS XMI data format; supported in Python ● Allows delegating authentication to a reverse the liberal Apache License 2.0 chains…) ... via DKPro PyCAS; in Java via Apache UIMA proxy (e.g. Apache HTTPD) and using any of ● All development-related tasks and issues are publicly its authentication mechanisms, e.g. LDAP, managed and discussed via GitHub SAML2/Shibboleth ● Enables single-sign-on scenarios ● Internal and private is kept at a minimum System-level integration Data-level interoperability

This work has received funding from the German Research Foundation under the grants No. EC 503/1-1 and GU 798/21-1 Download INCEpTION at (INCEpTION) and as part of the RTG Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant https://inception-project.github.io No. GRK 1994/1. We thank Wei Ding, Peter Jiang, Marcel de Boer and Michael Bugert for their valuable contributions.