Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4547–4556
Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC

NorNE: Annotating Named Entities for Norwegian

Fredrik Jørgensen,† Tobias Aasmoe,‡ Anne-Stine Ruud Husevåg,♢ Lilja Øvrelid,‡ Erik Velldal‡
Schibsted Media Group,† Oslo Metropolitan University,♢ University of Oslo‡
[email protected],† [email protected],♢ {tobiaaa,liljao,erikve}@ifi.uio.no‡

Abstract

This paper presents NorNE, a manually annotated corpus of named entities which extends the annotation of the existing Norwegian Dependency Treebank. Covering both official standards of written Norwegian (Bokmål and Nynorsk), the corpus contains around 600,000 tokens and annotates a rich set of entity types including persons, organizations, locations, geo-political entities, products, and events, in addition to a class corresponding to nominals derived from names. We here present details on the annotation effort, guidelines, inter-annotator agreement and an experimental analysis of the corpus using a neural sequence labeling architecture.

Keywords: Named Entity Recognition, corpus, annotation, neural sequence labeling

1. Introduction

In addition to discussing the annotation process and guidelines, we also provide an exploratory analysis of the resulting dataset through a series of experiments using state-of-the-art neural architectures for named entity recognition (combining a character-level CNN and a word-level BiLSTM, feeding into a CRF inference layer).

This paper documents the effort to create the first publicly available dataset for named entity recognition (NER) for Norwegian, dubbed NorNE.¹ The dataset adds named entity annotations on top of the Norwegian Dependency Treebank (NDT) (Solberg et al., 2014), containing manu-