DKIE: Open Source Information Extraction for Danish
Leon Derczynski, University of Sheffield
Camilla Vilhelmsen Field, University of Southern Denmark
Kenneth S. Bøgh, Aarhus University

Abstract

Danish is a major Scandinavian language spoken daily by around six million people. However, it lacks a unified, open set of NLP tools. This demonstration will introduce DKIE, an extensible open-source toolkit for processing Danish text. We implement an information extraction architecture for Danish within GATE, including integrated third-party tools. This implementation includes the creation of a substantial set of corpus annotations for data-intensive named entity recognition. The final application and dataset are made openly available, and the part-of-speech tagger and NER model also operate independently or with the Stanford NLP toolkit.

1 Introduction

Danish is primarily spoken in the northern hemisphere: in Denmark, on the Faroe Islands, and in Greenland. Having roots in Old Norse, Danish bears similarities to other Scandinavian languages, and shares features with English and German.

Previous tools and language resources for Danish have suffered from license restrictions, or from using small or non-reusable datasets. As a result, it is often difficult to use Danish language technologies, if anything is available at all. In cases where quality tools are available, they often have disparate APIs and input/output formats, making integration time-consuming and prone to error.

To remedy this, this paper presents an open-source information extraction toolkit for Danish, using the established and flexible GATE text processing platform (Cunningham et al., 2013). To this end, there are four main goals:

Adaptation: The application adapts to colloquial and formal Danish.

Interoperability: DKIE is internally consistent and adopts unified, well-grounded solutions to the problems of processing Danish. Where possible, DKIE re-uses existing components, and strives for compatibility with major text processing architectures.

Portability: Developed components should be readily movable within the chosen architecture, GATE, and usable independently outside it.

Openness: The resultant application, and the corpora and annotations developed in its creation, are as freely available as possible.

The remainder of this paper first discusses considerations specific to the language and prior work, then introduces the information extraction pipeline, followed by an evaluation of the tools provided.

2 Processing Danish

There are a few representational issues for Danish that are not solved in a unified fashion across existing technologies. DKIE builds upon major standards in general linguistic annotation and in Danish to unify these solutions.

Danish is written using the Latin alphabet, with the addition of three vowels: æ, ø and å, which may be transliterated as ae, oe and aa respectively. It is similar to English in terms of capitalisation rules and character set.

Over time, the orthography of Danish has shifted. Among other things, a spelling reform in 1948 removed the capitalisation of nouns, and introduced the three vowel characters to represent existing vowel digraphs. There were also spelling shifts in this reform (e.g. kjærlighed to kærlighed). In addition, some towns and municipalities have changed the spelling of their name. For example, Denmark's second-largest city Aarhus changed its name to Århus with the 1948 reform, although Aalborg and Aabenraa did not. Later, in 2011, the city reverted from Århus to Aarhus. The city's university retained the Aarhus spelling throughout this period.

The effect of these relatively recent changes is that there exist digitised texts using a variety of orthographies to represent not only the same sound, as is also the case in English, but also the same actual word. A language processing toolkit for Danish must exhibit sensitivity to these variances.

In addition, Danish has some word boundary considerations. Compound nouns are common (e.g. kvindehåndboldlandsholdet for "the women's national handball team"), as are hyphenated constructions (fugle-fotografering for "bird photography"), which are often treated as single tokens.

Finally, abbreviations are common in Danish, and its acronyms can be difficult to disambiguate without the right context and language resource (e.g. OB for Odense Boldklub, a football club).

3 Background

The state of the art in Danish information extraction is not very interoperable or open compared to that for e.g. English. Previous work, while high-performance, is either not freely available (Bick, 2004) or domain-restricted.[1] This makes results difficult to reproduce (Fokkens et al., 2013), and leads to sub-optimal interoperability (Lee et al., 2010). Even recent books focusing on the topic are heavily licensed and difficult for the average academic to access. Further, prior tools are often in the form of discrete components, hard to extend or to integrate with other systems.

[1] E.g. CST's non-commercial-only anonymisation tool, at http://cst.dk/online/navnegenkender/

Some good corpus resources are available, most recently the Copenhagen Dependency Treebank (CDT) (Buch-Kromann and Korzen, 2010), which built on and included previously-released corpora for Danish. This 200K-token corpus is taken from news articles and editorials, and includes document structure, tokenisation, lemma, part-of-speech and dependency relation information.

The application demonstrated, DKIE, draws only on open corpus resources for annotation, and the annotations over these corpora are released openly. Further, the application is also made open-source, with each component having similar or better performance when compared with the state of the art.

4 Information Extraction Pipeline

This section details each step in the DKIE pipeline. A screenshot of the tool is shown in Figure 1.

Figure 1: The ANNIE-based information extraction pipeline for Danish

4.1 Tokeniser

We adopt the PAROLE tokenisation scheme (Keson and Norling-Christensen, 1998). This makes different decisions from Penn Treebank in some cases, concatenating particular expressions as single tokens. For example, the two-word phrase i alt (meaning "in total") is converted to the single token i_alt. A set list of these group formations is given in the Danish PAROLE guidelines.

Another key difference is in the treatment of quoted phrases and hyphenation. Phrases connected in this way are often treated as single tokens. For example, the phrase "Se og hør"-læserne (readers of "See and Hear", a magazine) is treated as a single token under this scheme.

4.2 Part-of-Speech tagger

We use a machine-learning-based tagger (Toutanova et al., 2003) for Danish part-of-speech labelling. The original PAROLE scheme introduces a set of around 120 tags, many of which are used only rarely. The scheme comprises tags built up of up to nine features. These features are used to describe information such as case, degree, gender, number, possessivity, reflexivity, mood, tense and so on (Keson and Norling-Christensen, 1998).

The PAROLE data includes morphological encoding in tags. We separate this data out in our corpus, adding morphological features distinct from part-of-speech data. This data may then be used by later work to train a morphological analyser, or by other tools that rely on morphological information.

We combine PAROLE annotations with the reduced tagset employed by the Danish Dependency Treebank (DDT) (Kromann, 2003). This has 25 tags. We adapted the tagger to Danish by including internal automatic mapping of æ, ø and å to two-letter digraphs when both training and labelling, by adding extra sets of features for handling words and adjusting our unknown word threshold to compensate for the small corpus (as in Derczynski et al. (2013)), and by specifying the closed-class tags for this set and language. We also prefer a CRF-based classifier in order to get better whole-sequence accuracy, providing greater opportunities for later-stage tools such as dependency parsers to accurately process more of the corpus.

Results are given in Table 1, comparing token-

Tagger    Token accuracy %    Sentence acc. %
DKIE      95.3                49.1
TnT       96.2                39.1

Table 1: Part-of-speech labelling accuracy in DKIE

id  expression  interpretation
--  ----------  --------------
3   igaa        ADD(DCT,day,-1)
13  Num._jul    ADD(DATE_MONTH_DAY(DCT, 12, 24), day, TOKEN(0))

Figure 2: Example normalisation rules in TIMEN. "DCT" refers to the document creation time.

Secondly, entities outside of Denmark sometimes have different names specific to the Danish language (e.g. Lissabon for Lisboa / Lisbon).

As well as a standard strict-matching gazetteer, we include a "fuzzy" gazetteer specific to Danish that tolerates vowel orthography variation and the other changes introduced in the 1948 spelling reform. For locations, we extracted data for names of Danish towns from DBpedia and a local gazetteer, and from Wikipedia the Danish-language versions of the world's 1,000 most populous cities. For organisations, we used Wikipedia cross-language links to map the international organisations deemed notable in Wikipedia to their Danish translation and acronym (e.g. the United Nations is referred to as FN). The major Danish political parties were also added to this gazetteer. For person names, we built lists of notable people,[2] and also populated GATE's first and last name lists with common choices in Denmark.

4.4 Temporal Expression Annotation

We include temporal annotation for Danish in this pipeline, making DKIE the first temporal annotation tool for Danish. We follow the TimeML temporal annotation standard (Pustejovsky et al., 2004), completing just the TIMEX3 part.

Danish is interesting in that it permits flexible temporal anchors outside of reference time (Reichenbach, 1947) and the default structure of a calendar.