ESTNLTK - NLP Toolkit for Estonian

Siim Orasmaa, Timo Petmanson, Alexander Tkachenko, Sven Laur, Heiki-Jaan Kaalep
Institute of Computer Science, University of Tartu
Liivi 2, 50409 Tartu, Estonia
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract
Although there are many tools for natural language processing tasks in Estonian, these tools are very loosely interoperable, and it is not easy to build practical applications on top of them. In this paper, we introduce a new Python library for natural language processing in Estonian, which provides a unified programming interface for various NLP components. The ESTNLTK toolkit provides utilities for basic NLP tasks including tokenization, morphological analysis, lemmatisation and named entity recognition, and also offers more advanced features such as clause segmentation, temporal expression extraction and normalization, verb chain detection, Estonian Wordnet integration and rule-based information extraction. Accompanied by detailed API documentation and comprehensive tutorials, ESTNLTK is suitable for a wide audience. We believe ESTNLTK is mature enough to be used for developing NLP-backed systems both in industry and research. ESTNLTK is freely available under the GNU GPL version 2+ license, which is standard for academic software.

Keywords: natural language processing, Python, Estonian language

1. Introduction

The Estonian scientific community has recently enjoyed an active period, during which a number of major NLP components have been developed under free open source licenses. Unfortunately, these tools are very loosely interoperable. They use different programming languages and data formats and have specific hardware and software requirements. Such a situation complicates their usage in both software development and educational settings. In this paper, we introduce ESTNLTK, a Python library for natural language processing in Estonian, which addresses this issue.

The ESTNLTK toolkit glues together existing software components and makes them easily accessible via a unified programming interface. It provides utilities for basic NLP tasks including tokenization, morphological analysis, lemmatisation and named entity recognition, and also offers more advanced features such as clause segmentation, temporal expression extraction, verb chain detection, Estonian Wordnet integration and grammar-based information extraction.

The ESTNLTK toolkit is written in the Python programming language and borrows design ideas from the popular NLP toolkits TextBlob and NLTK. Users familiar with these tools can easily get started with ESTNLTK. Accompanied by detailed API documentation and comprehensive tutorials, ESTNLTK is suitable for a wide audience. We believe ESTNLTK is mature enough to be used for developing NLP-backed systems both in industry and research. Additionally, ESTNLTK is a good environment for teaching NLP to students.

Although ESTNLTK is explicitly targeted at the Estonian language, the architecture is quite generic and the toolkit can be used as a template for building analogous toolkits for other languages. The only limiting factor is the availability of external NLP components that must be replaced.

The ESTNLTK toolkit is available under the GNU GPL version 2+ license from https://github.com/estnltk/estnltk. The library works on Linux, Windows and Mac OS X and supports Python versions 2.7 and 3.4.

2. Related Work

There is a great variety of available tools for natural language processing. Typically, they come in the form of reusable software libraries, which can be embedded into user applications. Toolkits like Stanford CoreNLP (Manning et al., 2014) and NLTK (Bird and Klein, 2009) provide an all-in-one solution for the most common NLP tasks, such as tokenization, lemmatisation, part-of-speech tagging, named entity extraction, chunking, parsing and sentiment analysis. Others, like gensim (Řehůřek and Sojka, 2010), a topic modelling framework in Python, and Apache Lucene (Cutting et al., 2004), a Java library for document indexing and search, are designed for specific tasks. In contrast, projects such as GATE (Cunningham et al., 2011) and Apache UIMA (Apache, 2010) represent a comprehensive family of tools for text analytics. In addition to software modules, they provide tools to manage complex text-processing workflows, annotate corpora and support large-scale distributed computing.

The design of ESTNLTK is strongly influenced by TextBlob (Loria, 2014), a Python library that is built on top of the NLTK toolkit. It provides a simplified API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification and translation. The central class throughout the framework is Text, which encodes information about a natural language text. Being passed through a text-processing pipeline, a Text instance accumulates information provided by each processing task. This approach differs from the typical pipeline architecture, where each processing layer modifies specially formatted input text.

Our choice to implement ESTNLTK as a Python library similar to TextBlob was motivated by the following reasons. First, Python is widely adopted in the NLP community, due to its powerful text processing capabilities and good support for NLP and machine learning libraries. Second, Python has good support for external libraries written in compiled languages such as C or C++. This greatly simplifies the integration of existing tools. Furthermore, the most performance-critical code sections can be moved to C/C++ extension modules. Thirdly, Python is a popular programming language for teaching entry-level computer science courses in many universities, including Estonian ones. Hence, no extra skills are needed to use ESTNLTK.

Initially, we also considered Java as a programming language, since many tools provide libraries written in Java. As a compiled, statically-typed, object-oriented language, Java is well suited for large software projects. However, it also has a verbose, inflexible syntax, which makes it unproductive for prototyping and experimentation, and it cannot be used for scripting in interactive environments.

As an alternative to building ESTNLTK from scratch, we also considered customising an existing toolkit such as NLTK or TextBlob for the Estonian language. This would provide the benefit of a common interface for text analysis in both Estonian and English. However, we found it difficult to achieve, since neither NLTK nor TextBlob is designed to internally handle the morphological ambiguity and attributes such as cases, forms and clitics found in Estonian. Ignoring this information would mean an incomplete representation of Estonian morphology. On the other hand, modifying the internals of an existing framework would require significant rewrites and create compatibility issues. Given the benefits and tradeoffs, we decided to implement ESTNLTK as a specialised package for Estonian. That said, we do not rule out integrating ESTNLTK with other toolkits in the future.

3. Design Principles

The ESTNLTK toolkit exposes its NLP utilities through a single wrapper class Text. To perform a text-processing operation, the user needs to access the corresponding property of an initialised Text object. After that, ESTNLTK will carry out all necessary pre-processing behind the scenes. Hence, an analysis pipeline can be specified dynamically through an interactive scripting session without thinking of implementation details.

For example, named entities can be accessed via the property named_entities. To extract named entities, the ESTNLTK toolkit will complete five separate operations in succession: (1) segment paragraphs; (2) segment sentences; (3) tokenise words; (4) perform morphological analysis; (5) identify named entities. For each operation in the pipeline, ESTNLTK comes with a sane default implementation. However, a user can provide an alternative implementation through the constructor of the class Text.

Text class. The Text is a subclass of a standard Python dictionary with additional methods and properties designed for NLP operations. A new instance can be created simply by passing the plaintext string as an argument to the constructor, which initiates a dictionary with a single attribute text. Using a dictionary as a base data format has several advantages: it is simple to inspect and debug, it is extendible, it can store metadata and it can be serialised to the JSON format.

Annotation layers. The outcome of most NLP operations can be seen as different annotation layers on the original text. ESTNLTK stores each layer as a list of non-overlapping regions, defined by start and end positions. The lists are stored as dictionary elements identified by unique layer names.

There are two types of annotations. A simple annotation has only a single start-end position pair and is used to denote sentences, words, named entities and other annotations made up of a single continuous area. Multi-region annotations can have several start-end position pairs. They are used to denote clauses and verb chains, which in Estonian can occupy several non-adjacent regions in a sentence.

Each annotation is a dictionary, which in addition to the compulsory start and end attributes can store any kind of relevant information. For example, named entity layer annotations have a label attribute denoting whether the annotation marks a location, an organisation or a person. Words layer annotations store a list of morphological analysis variants, where each one is a dictionary containing a lemma, form, part-of-speech tag and other morphological attributes of Estonian words.

Users can create custom annotation layers simply by defining new Text dictionary elements. Of course, using reserved layer names used by ESTNLTK is not allowed. Users can also extend existing layers and add new attributes as long as the attribute names are unique.

Dependency management. The ESTNLTK NLP tools require specific preconditions to be satisfied before the desired operations can be performed. These dependencies are depicted in Figure 1. Whether a dependency is satisfied or not can be answered by looking directly at the dictionary contents of the Text class. This is exactly what the code does when the user requests a certain resource. In case the resource is not available, the code will execute the operation that can provide the resource. This operation in turn checks if all of its dependencies are available and recursively executes the necessary operations to provide the missing resources.

Figure 1: Text-processing utilities in ESTNLTK. Numbers and lines denote the order and dependencies between the operations: (1) paragraph tokenization, (2) sentence tokenization, (3) word tokenization, (4) morphological analysis, (5) clause detection, named entity recognition, temporal expressions and grammars, (6) verb chain detection.

In the case of a newly initiated Text instance, requesting any non-trivial resource executes large portions of the pipeline. However, subsequent calls to the same resource only require a few dictionary lookups.
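To make the lazy, property-driven pipeline concrete, the following short Python sketch shows how accessing a property drives the dependency resolution described above. The Text class and the named_entities property are part of ESTNLTK as presented in this paper; the example sentence and the direct dictionary inspection at the end are illustrative assumptions that should be checked against the installed version.

from estnltk import Text

# Initialising a Text only stores the raw string under the 'text' key.
text = Text('Eesti Vabariigi president külastas Tartu Ülikooli.')

# Accessing the property triggers all prerequisite operations
# (paragraph, sentence and word tokenization, morphological analysis)
# before named entity recognition itself is executed.
print(text.named_entities)

# Text is a dict subclass, so the produced annotation layers can be
# inspected directly; every region carries at least the compulsory
# start and end positions.
for layer in ('words', 'named_entities'):
    if layer in text:
        print(layer, text[layer][:2])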

Overriding standard operations. ESTNLTK comes with reasonable default behaviour for all built-in operations, but it may be desirable to override this functionality. For example, the user might require a custom sentence tokenizer or an optimised named entity recogniser for some task. How a specific component can be replaced is discussed more thoroughly in Section 4. In simple cases, it is sufficient to provide replacement components as keyword arguments to the Text constructor. However, more involved use cases may benefit from subclassing the Text class and developing the custom behaviour directly into it.

Text segmentation. It is often better to process texts in smaller chunks, if the smaller pieces better represent the problem we are solving or if the text is simply too big to process as a whole. This can be achieved with the split_by method. The method takes the name of an annotation layer and creates a list of Text instances based on the regions defined in the annotations. The number of resulting texts is equal to the total number of regions in the original annotation layer. As multi-region annotations can define more than a single region, this number can be greater than the number of annotations.

All annotation layers are preserved in this process, but the annotations themselves are divided between the resulting Text instances. As annotations may define one or more regions, they can end up in more than a single piece. In such cases, only the regions belonging to the piece will appear in the annotation, although the other attributes are preserved.

Comparison with alternatives. Note that layers can be embedded directly into the textual output format. This has been the traditional way for Estonian NLP tools. Although it makes it easy to write shell programs for analysis workflows, it is brittle and restricting. A small change in the output format may cause subtle errors, and parsing routines may be forced to perform unexpected analysis steps. The second alternative is to store the layer information in a separate index object that is shared by many documents. This can significantly speed up certain search operations, as a linear scan over all documents can be replaced with a simple index lookup. However, the right structure of the index object depends on the particular task and is wasteful for online processing of documents. Hence, ESTNLTK uses this alternative only for handling large document collections, where the creation of index objects is justified and the potential list of search terms is known upfront.

4. Standard NLP Tasks

In this section, we discuss the rationale and the design tradeoffs of each standard NLP task. The ESTNLTK standard tasks are paragraph, sentence and word tokenization, morphological analysis, clause detection, named entity recognition and temporal expression detection. These tools depend on each other and form a dependency graph, which can be seen in Figure 1.

Tokenization. Text tokenization tasks are the most basic steps of any NLP pipeline. Paragraph tokenization is useful when texts are longer and contain more than a single paragraph, such as news articles. By default, ESTNLTK assumes the paragraphs are separated by a single empty line (two newline characters). Sentence tokenization uses a pre-trained NLTK punkt tokenizer for Estonian, which is trained on a corpus of news articles. For word tokenization, we use a modified NLTK WordPunctTokenizer that makes a tradeoff between traditional word tokenization practices and compatibility with other NLP tools.

Note that word tokenization depends on sentence tokenization, which in turn depends on paragraph tokenization. Although any of the tokenizers can actually be executed on raw plain text, the results might not be consistent. Thus, the Text class enforces consistency by performing sentence tokenization on each individual paragraph and word tokenization on each individual sentence.

To customise tokenisation, one needs to implement the interface defined by NLTK's StringTokenizer class and pass the tokeniser as the paragraph_tokenizer, sentence_tokenizer or word_tokenizer argument to the constructor of the Text class.

Morphological analysis. Morphological analysis is a core part of any text-processing pipeline, as it serves the needs of many higher-level tasks. ESTNLTK provides a wrapper API built on top of VABAMORF, a C++ library for morphological analysis of Estonian (Kaalep and Vaino, 2001). Full text analysis can be performed using the function tag_analysis of the class Text. It identifies word lemmas, suffixes, endings, parts of speech and forms (e.g. the case of a noun, the tense and the voice of a verb). These bits of information can be accessed via the corresponding properties of the class Text. When a word has multiple morphological interpretations, VABAMORF will try to perform disambiguation. If disambiguation fails, multiple analysis variants will be reported.

Morphological synthesis and spell checking. To synthesise a particular form of a word, ESTNLTK provides the function synthesize. Given a word and some criteria, it generates all possible inflections that satisfy these criteria. The spell-corrector identifies misspelled words and provides suggestions for correction through the properties spelling and spelling_suggestions of the class Text. As the synthesis and spell check functions are built on top of VABAMORF, they inherit the strengths and shortcomings of their VABAMORF counterparts.
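The sketch below illustrates these two customisation points under stated assumptions: the word_tokenizer keyword argument is the one described above, while the lemmas property name, the synthesize import location and its form-specification string are assumptions about the API that may differ between ESTNLTK versions.

from estnltk import Text
from nltk.tokenize import RegexpTokenizer

# Replace the default word tokenizer with a simple whitespace-based
# one; any tokenizer implementing NLTK's StringTokenizer interface
# (tokenize/span_tokenize) should be usable here.
text = Text('Lend hilineb mootoririkke tõttu.',
            word_tokenizer=RegexpTokenizer(r'\S+'))

# Morphological analysis runs lazily when an analysis-based property
# is first accessed (the property name `lemmas` is an assumption).
print(text.lemmas)

# Morphological synthesis via VABAMORF; the form string 'pl p'
# (plural partitive) is an assumed format.
from estnltk import synthesize
print(synthesize('raamat', 'pl p'))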

Named entity recognition. Named entity recognition (NER) is often used as an important subtask in information retrieval, opinion mining and semantic indexing. The class Text provides default algorithms for recognising persons, organisations and locations. These algorithms can be invoked by calling tag_named_entities. The call will trigger the whole text-processing pipeline, including tokenization and morphological analysis, if these analyses have not been performed before. The first invocation of tag_named_entities will additionally load the statistical models and store them in the global scope for future use.

The NER algorithms in ESTNLTK are reimplementations of the methods described by Tkachenko et al. (2013). The main difference lies in the implementation details. The original NER tool uses a conditional random fields training algorithm (Lafferty et al., 2001) implemented in the Java package MALLET (McCallum, 2002), whereas the algorithms in ESTNLTK use the C++ CRFSUITE library (Okazaki, 2007). Both achieve reasonable accuracy on Estonian text; the C++ version simply provides better integration with Python.

Extracted named entities, their categories and their locations in the text can be accessed using the properties named_entities, named_entity_labels and named_entity_spans. The same information is also accessible through the layers named_entities and words. The layer named_entities stores information on entity categories and their positions in the text. The layer words carries the individual token labels in BIO format (Tjong Kim Sang and De Meulder, 2003).

The default models have been pre-trained to recognise a fixed set of entity types in generic news articles. It is also possible to customise the NER algorithms by providing a new training corpus consisting of annotated sentences. For this purpose, the class estnltk.ner.NerTrainer implements the necessary feature extraction logic and provides a training interface. To train a new CRFSUITE model, a user must provide a training corpus and a custom configuration module, which lists the feature extractors and feature templates used for detecting named entities. The resulting model can be used with the class estnltk.ner.NerTagger to extract entities from text.

Temporal expression tagging. In many information extraction applications, recognition and normalisation of time-referring expressions (timexes) can significantly improve accuracy. We can essentially discard sentences that describe events from irrelevant time periods. Sometimes, the exact time of an event is the desired information.

The ESTNLTK package integrates AJAVT, a rule-based temporal expression tagger for Estonian (Orasmaa, 2012). The tagger recognises timexes and normalises the semantics of these expressions into a standard format based on TimeML's TIMEX3 (Pustejovsky et al., 2003). Results of the tagging can be accessed via the Text object's property timexes.

By default, expressions with relative semantics (such as eile 'yesterday' or reedel 'on Friday') are resolved with respect to the execution time of the program. This default can be overridden by providing the creation_date argument on initialisation of the Text object, after which relative expressions are resolved with respect to the argument date.

Although the current version of the tagger is tuned for processing news texts, our evaluation has shown that the current configuration of rules obtains high accuracy levels also on other types of formal written language, such as law texts and parliament transcripts. Details on the implementation of the tagger and its evaluation are provided in (Orasmaa, 2012).

Clause boundary detection. Clause boundary detection can be used to split long and complex sentences into smaller segments (clauses). As such, it can be useful in machine learning experiments, offering a linguistically motivated "window" for feature extraction. In linguistic analysis, clause boundary detection offers a context from which phrase boundaries or word collocations can be detected.

The clause tagging in ESTNLTK is a re-implementation of the clause detector module introduced by Kaalep and Muischnek (2012). It recognises consecutively ordered clauses (e.g. [Ta istus nurgalauda ja] [tellis kohvi.] '[She sat at the corner table and] [ordered a coffee.]') and clauses embedded within other clauses (e.g. [Mees[, keda seal nägime,] lahkus kiirustades.] '[The man [who we saw there] left in a hurry.]'); clause boundaries in these examples are indicated by the square brackets surrounding the clauses. Calling the property clause_texts performs clause tagging along with all the dependent computations, such as paragraph, sentence and word tokenization and morphological analysis. Clauses are stored as multi-region annotations in the clauses layer.

Because commas are important clause delimiters in Estonian, the quality of the clause segmentation may suffer due to missing commas. We have improved the original clause boundary detection algorithm by adding a special mode in which the program is more robust to missing commas. In this mode, we apply the regular segmentation rules augmented with the following general heuristic: if a conjunction word (such as et 'that, for', sest 'because', kuid 'although', millal 'when', kus 'where', kuni 'as long as') is in a context where it is preceded and followed by a verbal centre of a clause, but it is not preceded by a comma (although it should be, according to Estonian orthographic conventions), we mark a clause boundary at the location of the missing comma. The heuristic has a number of exceptions, which account for ambiguous usages of the specific conjunction words.

For example, kuni also means 'to', and so we have to exclude its usage in range-denoting phrases (such as in 5 kuni 7 kilomeetrit '5 to 7 kilometres') from being marked as a clause boundary. We have also written exceptions to exclude frequently used verbal idiomatic expressions, such as vaat et 'almost, nearly', lit. 'look that (until)', from being detected as clause boundaries.

We believe that this new mode can be useful for improving clause segmentation quality on non-standard language (such as Internet language), where comma usage often does not follow the orthographic rules.

NLP task                     Affected layers
tokenize_paragraphs()        paragraphs
tokenize_sentences()         sentences
tokenize_words()             words
tag_analysis()               words
tag_clauses()                words, clauses
tag_named_entities()         words, named_entities
tag_timexes()                timexes
tag_verb_chains()            verb_chains

Table 1: Annotation layers affected by ESTNLTK's standard NLP tasks.
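As a brief illustration of the temporal and clause layers described above, the sketch below resolves a relative timex against an explicit document date. The creation_date constructor argument and the timex_values and clause_texts properties are the ones named in this paper and in Figure 2; passing a datetime object and the exact output format are assumptions to verify against the installed version.

from datetime import datetime
from estnltk import Text

# Resolve relative expressions such as 'eile' ("yesterday") against
# an explicit document creation date instead of the current time.
text = Text('Leping allkirjastati eile, kuid see jõustub homme.',
            creation_date=datetime(2016, 2, 28))

# Normalised TIMEX3 values (cf. Figure 2) and clause segmentation.
print(text.timex_values)   # expected to include 2016-02-27 and 2016-02-29
print(text.clause_texts)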

>>> text = Text('Londoni lend, mis pidi täna hommikul kell 4:30 Tallinna saabuma, hilineb mootori starteri rikke tõttu ning peaks Tallinna jõudma ööl vastu homset kell 02:20.')
>>> text.named_entities
[u'London', u'Tallinn', u'Tallinn']
>>> text.clause_texts
[u'Londoni lend hilineb mootori starteri rikke tõttu ning',
 u', mis pidi täna hommikul kell 4:30 Tallinna saabuma,',
 u'peaks Tallinna jõudma ööl vastu homset kell 02:20.']
>>> text.timex_values
[u'2016-02-28T04:30', u'2016-02-29T02:20']
>>> text.verb_chain_texts
[u'hilineb', u'pidi saabuma', u'peaks jõudma']

Figure 2: Example of standard NLP tasks in ESTNLTK applied to the sentence "Londoni lend, mis pidi täna hommikul kell 4:30 Tallinna saabuma, hilineb mootori starteri rikke tõttu ning peaks Tallinna jõudma ööl vastu homset kell 02:20." (eng. "The flight from London, which was scheduled to arrive in Tallinn this morning at 4:30, is delayed due to an engine starter malfunction and should arrive in Tallinn tomorrow night at 02:20."). The example has been run on the date 2016-02-28, so the timex values have been calculated with respect to that date.

5. Additional Analysis Tools

Besides the standard NLP components, ESTNLTK contains additional tools which are less commonly used or less mature than the standard components.

WordNet. The ESTNLTK package comes together with the Estonian Wordnet developed under the EuroWordNet project (Ellman, 2003). The provided interface is mostly NLTK-compatible and enables querying for synsets and relationships and computing similarities between synsets. To access the EuroWordNet database files, we use the Python module Eurown (Kahusk, 2010).

Verb chain detection. The Estonian language is rich in verb chain constructions, where the meaning of the content verb is significantly altered by other parts of the chain. Hence, proper detection and semantic normalisation of such constructions is essential in many practical applications. It allows us to detect negated clauses in sentiment analysis, to distinguish actual events from events marked with uncertainty in event factuality analysis, and to handle verb chains as a single translation unit in machine translation experiments.

The corresponding module in ESTNLTK addresses single-word main verbs and the following verb chain constructions: regular grammatical constructions (negation and compound tenses), catenative verb constructions (such as modal verb + infinite verb, e.g. Ta võib meid homme külastada 'He might visit us tomorrow', and finite verb + infinite verb constructions in general, e.g. Ta unustas meid külastada 'He forgot to visit us'), and verb+nominal constructions which subcategorise for infinite verbs (e.g. Ta otsis võimalust meid külastada 'He sought an opportunity to visit us.'). Detected verb chains are available via the property verb_chains of the Text object. Grammatical features of the finite verb (polarity, tense, mood and voice) are provided as attributes of a verb chain object.

The module has been implemented in a rule-based manner. It first uses morphological information (information about finite verbs) and clause boundary information to detect grammatical main verb constructions, and then relies on subcategorisation information listed in lexicons to further extend the grammatical main verbs into catenative and/or verb+nominal+infinite verb constructions. The subcategorisation lexicons were derived by semi-automatic analysis of the Balanced Corpus of Estonian (http://www.cl.ut.ee/korpused/grammatikakorpus/index.php).

Word2vec models. Recent advances in machine learning provide high-dimensional word embeddings which capture the distributed semantics of words and phrases. These have been successfully applied in machine translation and synonym detection. Thus, we have included experimental support for word2vec models (Mikolov et al., 2013), trained on the Estonian Reference Corpus (http://www.cl.ut.ee/korpused/segakorpus/). The corpus consists of 16M sentences, 55M words and 3M types. We provide separate models trained on the original or on the lemmatised version of the corpus. For training, we used the word2vec software (https://code.google.com/p/word2vec/). The resulting models can be used with the word2vec command line tools or with the Python library gensim (https://radimrehurek.com/gensim/).
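A minimal sketch of consuming such a model from Python with gensim is shown below; the model file name is a hypothetical placeholder for one of the distributed word2vec binaries, and a reasonably recent gensim release is assumed.

from gensim.models import KeyedVectors

# Load a pre-trained Estonian embedding in the binary word2vec
# format; replace the file name with the downloaded model.
model = KeyedVectors.load_word2vec_format('lemmas.cbow.s100.w2v.bin',
                                          binary=True)

# Query distributional neighbours of a lemma.
for word, score in model.most_similar('raamat', topn=5):  # 'raamat' = 'book'
    print(word, round(score, 3))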

Efficient document storage. Many modern text analysis methods require a large collection of documents as an input to be bootstrapped. This process is usually very resource intensive. In most cases, the computational complexity can be significantly reduced by providing an efficient search interface. Most storage solutions provide only efficient keyword search over text, whereas operations in ESTNLTK require efficient search over layers. Therefore, we have included a Database class that uses the NoSQL database Elastic Search (https://www.elastic.co/products/elasticsearch) to store and retrieve Text objects.

Visualisation tools. Many NLP applications produce text annotations. Visualisation of these annotations is often the best way to convey the analysis results to the end user. The PrettyPrinter class can be used to highlight text fragments according to annotations. The class generates valid HTML markup together with CSS files that can be used for developing Web front ends.

There are eight aesthetics that can be used for visualisation: text colour, text background, font, font weight, normal/italics, normal/underline, text size and character tracking. The simplest use case involves mapping a layer to an aesthetic and then rendering it as HTML and CSS. As a result, all regions in the layer will obtain the same markup, for instance a green background. It is also possible to decorate regions according to annotations, e.g. colour verbs green and nouns red. For that, one must specify a mapping between annotation values and aesthetic values.

Grammar extractor. In many information retrieval and text classification tasks one must first recognise text fragments with a certain structure, such as measurements in medical texts or trigger phrases in sentiment mining. The grammar extractor module allows such phrases to be specified in terms of layers using a finite grammar.

6. Performance

For practical uses, it is important to have an idea of the performance characteristics of the individual components of the framework. Thus, we benchmarked ESTNLTK using a sample of 500 news articles picked randomly from the Estonian Reference Corpus (http://www.cl.ut.ee/korpused/segakorpus/index.php). The sample contains 167,180 tokens. We ran the benchmarks using Python 3.4 on a Linux Mint desktop with an Intel Core i5 processor and 4 GB of RAM. Table 2 illustrates the results. Note that the absolute values should be taken with a grain of salt, since they largely depend on the particular benchmark environment.

NLP task                       tokens/second
Tokenization                   97,095
Morphological analysis         3,913
NER                            2,550
Temporal expression tagging    1,905
Clause boundary detection      5,271
Verb chain detection           11,658

Table 2: Performance of ESTNLTK core components. The reported performance relates to the individual component alone and does not account for the execution time of its dependencies.

7. Source Code

ESTNLTK is implemented in Python and supports language versions 2.7 and 3.4. Under the hood, ESTNLTK uses a number of external components:

vabamorf, a C++ library for morphological analysis, disambiguation and synthesis, licensed under LGPL (https://github.com/Filosoft/vabamorf);

python-crfsuite, a Python binding to the CRFSUITE C++ library; python-crfsuite is licensed under the MIT license, the CRFSUITE library under the BSD license;

osalausestaja, a Java-based clause segmenter for Estonian, licensed under GPL version 2 (https://github.com/soras/osalausestaja);

Ajavt, a Java-based temporal expression tagger for Estonian, licensed under GPL version 2 (https://github.com/soras/Ajavt).

The source code is hosted online on github.com. ESTNLTK can be installed directly from GitHub or, alternatively, from the Python Package Index (PyPI) using the command pip install estnltk. The ESTNLTK source code is licensed under the GNU GPL version 2+. It does not permit using ESTNLTK to build proprietary software, as it requires the entire source code of a derived work to be made available to end users.

8. Conclusions

We presented ESTNLTK, a Python library which aims to organise Estonian NLP resources into a single framework. Although ESTNLTK is still in its infancy, it has already seen a number of uses, including online media monitoring, digital book indexing and client complaint categorisation. Recently, ESTNLTK was used for the first time in teaching an NLP course in Estonian at the University of Tartu. We believe that in the future ESTNLTK will find even more applications.

9. Acknowledgements

ESTNLTK has been funded by Eesti Keeletehnoloogia Riiklik Programm under the project EKT57, and supported by the Estonian Ministry of Education and Research (grant IUT 20-56 "Computational models for Estonian"). We are grateful to our code contributors: Heiki-Jaan Kaalep, Neeme Kahusk, Karl-Oskar Masing, Andres Matsin, Siim Orasmaa, Timo Petmanson, Annett Saarik, Alexander Tkachenko, Tarmo Vaino and Karl Valliste.

10. References

Apache UIMA. (2010). Unstructured Information Management Applications. http://uima.apache.org. Online; accessed 20 October 2015.

Bird, S., Loper, E., and Klein, E. (2009). Natural Language Processing with Python. O'Reilly Media Inc.

Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gorrell, G., Funk, A., Roberts, A., Damljanovic, D., Heitz, T., Greenwood, M. A., Saggion, H., Petrak, J., Li, Y., and Peters, W. (2011). Text Processing with GATE (Version 6).

Cutting, D., Busch, M., Cohen, D., Gospodnetic, O., Hatcher, E., Hostetter, C., Ingersoll, G., McCandless, M., Messer, B., Naber, D., et al. (2004). Apache Lucene. https://lucene.apache.org/. Online; accessed 20 October 2015.

Řehůřek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora.

Ellman, J. (2003). EuroWordNet: A multilingual database with lexical semantic networks, edited by Piek Vossen. Kluwer Academic Publishers, 1998. ISBN 0792352955.

Kaalep, H.-J. and Muischnek, K. (2012). Robust clause boundary identification for corpus annotation. In LREC, pages 1632–1636.

Kaalep, H.-J. and Vaino, T. (2001). Complete morphological analysis in the linguist's toolbox. Congressus Nonus Internationalis Fenno-Ugristarum Pars V, pages 9–16. The corresponding C++ code is available from https://github.com/Filosoft/vabamorf.

Kahusk, N. (2010). Eurown: An EuroWordNet module for Python. In Principles, Construction and Application of Multilingual Wordnets. Proceedings of the 5th Global Wordnet Conference (GWC 2010), pages 360–364.

Lafferty, J., McCallum, A., and Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Loria, S. (2014). TextBlob: Simplified text processing.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

McCallum, A. K. (2002). MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Okazaki, N. (2007). CRFsuite: A fast implementation of conditional random fields (CRFs). http://www.chokkan.org/software/crfsuite/. Online; accessed 20 October 2015.

Orasmaa, S. (2012). Automaatne ajaväljendite tuvastamine eestikeelsetes tekstides (Automatic Recognition and Normalization of Temporal Expressions in Estonian Language Texts). Eesti Rakenduslingvistika Ühingu aastaraamat, (8):153–169.

Pustejovsky, J., Castano, J. M., Ingria, R., Sauri, R., Gaizauskas, R. J., Setzer, A., Katz, G., and Radev, D. R. (2003). TimeML: Robust specification of event and temporal expressions in text. New Directions in Question Answering, 3:28–34.

Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147. Association for Computational Linguistics.

Tkachenko, A., Petmanson, T., and Laur, S. (2013). Named entity recognition in Estonian. In Proceedings of the Workshop on Balto-Slavic NLP, page 78. Association for Computational Linguistics.
