ESTNLTK - NLP Toolkit for Estonian

Siim Orasmaa, Timo Petmanson, Alexander Tkachenko, Sven Laur, Heiki-Jaan Kaalep
Institute of Computer Science, University of Tartu
Liivi 2, 50409 Tartu, Estonia
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract
Although there are many tools for natural language processing tasks in Estonian, these tools are very loosely interoperable, and it is not easy to build practical applications on top of them. In this paper, we introduce a new Python library for natural language processing in Estonian, which provides a unified programming interface for various NLP components. The ESTNLTK toolkit provides utilities for basic NLP tasks including tokenization, morphological analysis, lemmatisation and named entity recognition, and also offers more advanced features such as clause segmentation, temporal expression extraction and normalization, verb chain detection, Estonian Wordnet integration and rule-based information extraction. Accompanied by detailed API documentation and comprehensive tutorials, ESTNLTK is suitable for a wide audience. We believe ESTNLTK is mature enough to be used for developing NLP-backed systems both in industry and research. ESTNLTK is freely available under the GNU GPL version 2+ license, which is standard for academic software.

Keywords: natural language processing, Python, Estonian language

1. Introduction

The Estonian scientific community has recently enjoyed an active period, during which a number of major NLP components have been developed under free open source licenses. Unfortunately, these tools are very loosely interoperable. They use different programming languages and data formats and have specific hardware and software requirements. Such a situation complicates their usage in both software development and educational settings. In this paper, we introduce ESTNLTK, a Python library for natural language processing in Estonian, which addresses this issue.

The ESTNLTK toolkit glues together existing software components and makes them easily accessible via a unified programming interface. It provides utilities for basic NLP tasks including tokenization, morphological analysis, lemmatisation and named entity recognition, and also offers more advanced features such as clause segmentation, temporal expression extraction, verb chain detection, Estonian Wordnet integration and grammar-based information extraction.

The ESTNLTK toolkit is written in the Python programming language and borrows design ideas from the popular NLP toolkits TextBlob and NLTK. Users familiar with these tools can easily get started with ESTNLTK. Accompanied by detailed API documentation and comprehensive tutorials, ESTNLTK is suitable for a wide audience. We believe ESTNLTK is mature enough to be used for developing NLP-backed systems both in industry and research. Additionally, ESTNLTK is a good environment for teaching NLP to students.

Although ESTNLTK is explicitly targeted at the Estonian language, the architecture is quite generic and the toolkit can be used as a template for building analogous toolkits for other languages. The only limiting factor is the availability of external NLP components that must be replaced.

The ESTNLTK toolkit is available under the GNU GPL version 2+ license from https://github.com/estnltk/estnltk. The library works on Linux, Windows and Mac OS X and supports Python versions 2.7 and 3.4.

2. Related Work

There is a great variety of available tools for natural language processing. Typically, they come in the form of reusable software libraries, which can be embedded into user applications. Toolkits like Stanford CoreNLP (Manning et al., 2014) and NLTK (Bird and Klein, 2009) provide an all-in-one solution for the most common NLP tasks, such as tokenization, lemmatisation, part-of-speech tagging, named entity extraction, chunking, parsing and sentiment analysis. Others, like gensim (Řehůřek and Sojka, 2010), a topic modelling framework in Python, and Apache Lucene (Cutting et al., 2004), a Java library for document indexing and search, are designed for specific tasks. In contrast, projects such as GATE (Cunningham et al., 2011) and Apache UIMA (Apache, 2010) represent a comprehensive family of tools for text analytics. In addition to software modules, they provide tools to manage complex text-processing workflows, annotate corpora and support large-scale distributed computing.

The design of ESTNLTK is strongly influenced by TextBlob (Loria, 2014), a Python library that is built on top of the NLTK toolkit. It provides a simplified API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification and translation. The central class throughout the framework is Text, which encodes information about a natural language text. Being passed through a text-processing pipeline, a Text instance accumulates information provided by each processing task. This approach differs from the typical pipeline architecture, where each processing layer modifies specially formatted input text.

Our choice to implement ESTNLTK as a Python library similar to TextBlob was motivated by the following reasons. First, Python is widely adopted in the NLP community, due to its powerful text processing capabilities and good support for NLP and machine learning libraries. Second, Python has good support for external libraries written in compiled languages such as C or C++. This greatly simplifies the integration of existing tools. Furthermore, the most performance-critical code sections can be moved to C/C++ extension modules. Thirdly, Python is a popular programming language for teaching entry-level computer science courses in many universities, including Estonian ones. Hence, no extra skills are needed to use ESTNLTK.

Initially, we also considered Java as a programming language, since many tools provide libraries written in Java. As a compiled, statically-typed, object-oriented language, Java is well suited for large software projects. However, it also has a verbose, inflexible syntax, which makes it unproductive for prototyping and experimentation, and it cannot be used for scripting in interactive environments.

As an alternative to building ESTNLTK from scratch, we also considered customising an existing toolkit such as NLTK or TextBlob for the Estonian language. This would provide the benefit of a common interface for text analysis in both Estonian and English. However, we found it difficult to achieve, since neither NLTK nor TextBlob is designed to internally handle the morphological ambiguity and attributes such as cases, forms and clitics found in Estonian. Ignoring this information would mean an incomplete representation of Estonian morphology. On the other hand, modifying the internals of an existing framework would require significant rewrites and create compatibility issues. Given the benefits and tradeoffs, we decided to implement ESTNLTK as a specialised package for Estonian. That said, we do not rule out integrating ESTNLTK with other toolkits in the future.

3. Design Principles

The ESTNLTK toolkit exposes its NLP utilities through a single wrapper class Text. To perform a text-processing operation, the user needs to access the corresponding property of an initialised Text object. After that, ESTNLTK will carry out all necessary pre-processing behind the scenes. Hence, an analysis pipeline can be specified dynamically through an interactive scripting session without thinking of implementation details.

For example, named entities can be accessed via the property named_entities. To extract named entities, the ESTNLTK toolkit will complete five separate operations in succession: (1) segment paragraphs; (2) segment sentences; (3) tokenise words; (4) perform morphological analysis; (5) identify named entities. For each operation in the pipeline, ESTNLTK comes with a sane default implementation. However, a user can provide an alternative implementation through the constructor of the class Text.

Text class. The Text is a subclass of a standard Python dictionary with additional methods and properties designed for NLP operations. A new instance can be created simply by passing the plaintext string as an argument to the constructor, which initiates a dictionary with a single attribute text. Using a dictionary as a base data format has several advantages: it is simple to inspect and debug, it is extendible, it can store metadata and it can be serialised to the JSON format.

Annotation layers. The outcome of most NLP operations can be seen as different annotation layers on the original text. ESTNLTK stores each layer as a list of non-overlapping regions, defined by start and end positions. The lists are stored as dictionary elements identified by unique layer names.

There are two types of annotations. A simple annotation has only a single start-end position pair and is used to denote sentences, words, named entities and other annotations made up of a single continuous area. Multi-region annotations can have several start-end position pairs. They are used to denote clauses and verb chains, which in Estonian can occupy several non-adjacent regions in a sentence.

Each annotation is a dictionary, which in addition to the compulsory start and end attributes can store any kind of relevant information. For example, named entity layer annotations have a label attribute denoting whether the annotation marks a location, an organisation or a person. Words layer annotations store a list of morphological analysis variants, where each one is a dictionary containing a lemma, form, part-of-speech tag and other morphological attributes of Estonian words.

Users can create custom annotation layers simply by defining new Text dictionary elements. Of course, using reserved layer names used by ESTNLTK is not allowed. Users can also extend existing layers and add new attributes as long as the attribute names are unique.

Dependency management. The ESTNLTK NLP tools require specific preconditions to be satisfied before the desired operations can be performed. These dependencies are depicted in Figure 1. Whether a dependency is satisfied or not can be answered by looking directly at the dictionary contents of the Text class. This is exactly what the code does when the user requests a certain resource. In case the resource is not available, the code will execute the operation that can provide the resource. This operation in turn checks if all of its dependencies are available and recursively executes the necessary operations to provide the missing resources.

Figure 1: Text-processing utilities in ESTNLTK. Numbers and lines denote the order and dependencies between the operations: (1) paragraph tokenization, (2) sentence tokenization, (3) word tokenization, (4) morphological analysis, (5) clause detection, named entity recognition, temporal expressions and grammars, (6) verb chain detection.

In the case of a newly initiated Text instance, requesting any non-trivial resource executes large portions of the pipeline. However, subsequent calls to the same resource only require a few dictionary lookups.
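To make the lazy, property-driven pipeline concrete, the following short Python sketch shows how accessing a property drives the dependency resolution described above. The Text class and the named_entities property are part of ESTNLTK as presented in this paper; the example sentence and the direct dictionary inspection at the end are illustrative assumptions that should be checked against the installed version.

from estnltk import Text

# Initialising a Text only stores the raw string under the 'text' key.
text = Text('Eesti Vabariigi president külastas Tartu Ülikooli.')

# Accessing the property triggers all prerequisite operations
# (paragraph, sentence and word tokenization, morphological analysis)
# before named entity recognition itself is executed.
print(text.named_entities)

# Text is a dict subclass, so the produced annotation layers can be
# inspected directly; every region carries at least the compulsory
# start and end positions.
for layer in ('words', 'named_entities'):
    if layer in text:
        print(layer, text[layer][:2])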

Overriding standard operations. ESTNLTK comes with reasonable default behaviour for all built-in operations, but it may be desirable to override this functionality. For example, the user might require a custom sentence tokenizer or an optimised named entity recogniser for some task. How a specific component can be replaced is discussed more thoroughly in Section 4. In simple cases, it is sufficient to provide replacement components as keyword arguments to the Text constructor. However, more involved use cases may benefit from subclassing the Text class and developing the custom behaviour directly into it.

Text segmentation. It is often better to process texts in smaller chunks, if the smaller pieces better represent the problem we are solving or if the text is simply too big to process as a whole. This can be achieved with the split_by method. The method takes the name of an annotation layer and creates a list of Text instances based on the regions defined in the annotations. The number of resulting texts is equal to the total number of regions in the original annotation layer. As multi-region annotations can define more than a single region, this number can be greater than the number of annotations.

All annotation layers are preserved in this process, but the annotations themselves are divided between the resulting Text instances. As annotations may define one or more regions, they can end up in more than a single piece. In such cases, only the regions belonging to the piece will appear in the annotation, although the other attributes are preserved.

Comparison with alternatives. Note that layers can be embedded directly into the textual output format. This has been the traditional way for Estonian NLP tools. Although it makes it easy to write shell programs for analysis workflows, it is brittle and restricting. A small change in the output format may cause subtle errors, and parsing routines may be forced to perform unexpected analysis steps. The second alternative is to store the layer information in a separate index object that is shared by many documents. This can significantly speed up certain search operations, as a linear scan over all documents can be replaced with a simple index lookup. However, the right structure of the index object depends on the particular task and is wasteful for online processing of documents. Hence, ESTNLTK uses this alternative only for handling large document collections, where the creation of index objects is justified and the potential list of search terms is known upfront.

4. Standard NLP Tasks

In this section, we discuss the rationale and the design tradeoffs of each standard NLP task. The ESTNLTK standard tasks are paragraph, sentence and word tokenization, morphological analysis, clause detection, named entity recognition and temporal expression detection. These tools depend on each other and form a dependency graph, which can be seen in Figure 1.

Tokenization. Text tokenization tasks are the most basic steps of any NLP pipeline. Paragraph tokenization is useful when texts are longer and contain more than a single paragraph, such as news articles. By default, ESTNLTK assumes the paragraphs are separated by a single empty line (two newline characters). Sentence tokenization uses a pre-trained NLTK punkt tokenizer for Estonian, which is trained on a corpus of news articles. For word tokenization, we use a modified NLTK WordPunctTokenizer that makes a tradeoff between traditional word tokenization practices and compatibility with other NLP tools.

Note that word tokenization depends on sentence tokenization, which in turn depends on paragraph tokenization. Although any of the tokenizers can actually be executed on raw plain text, the results might not be consistent. Thus, the Text class enforces consistency by performing sentence tokenization on each individual paragraph and word tokenization on each individual sentence.

To customise tokenisation, one needs to implement the interface defined by NLTK's StringTokenizer class and pass the tokeniser as the paragraph_tokenizer, sentence_tokenizer or word_tokenizer argument to the constructor of the Text class.

Morphological analysis. Morphological analysis is a core part of any text-processing pipeline, as it serves the needs of many higher-level tasks. ESTNLTK provides a wrapper API built on top of VABAMORF, a C++ library for morphological analysis of Estonian (Kaalep and Vaino, 2001). Full text analysis can be performed using the function tag_analysis of the class Text. It identifies word lemmas, suffixes, endings, parts of speech and forms (e.g. the case of a noun, the tense and the voice of a verb). These bits of information can be accessed via the corresponding properties of the class Text. When a word has multiple morphological interpretations, VABAMORF will try to perform disambiguation. If disambiguation fails, multiple analysis variants will be reported.

Morphological synthesis and spell checking. To synthesise a particular form of a word, ESTNLTK provides the function synthesize. Given a word and some criteria, it generates all possible inflections that satisfy these criteria. The spell-corrector identifies misspelled words and provides suggestions for correction through the properties spelling and spelling_suggestions of the class Text. As the synthesis and spell check functions are built on top of VABAMORF, they inherit the strengths and shortcomings of their VABAMORF counterparts.
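The sketch below illustrates these two customisation points under stated assumptions: the word_tokenizer keyword argument is the one described above, while the lemmas property name, the synthesize import location and its form-specification string are assumptions about the API that may differ between ESTNLTK versions.

from estnltk import Text
from nltk.tokenize import RegexpTokenizer

# Replace the default word tokenizer with a simple whitespace-based
# one; any tokenizer implementing NLTK's StringTokenizer interface
# (tokenize/span_tokenize) should be usable here.
text = Text('Lend hilineb mootoririkke tõttu.',
            word_tokenizer=RegexpTokenizer(r'\S+'))

# Morphological analysis runs lazily when an analysis-based property
# is first accessed (the property name `lemmas` is an assumption).
print(text.lemmas)

# Morphological synthesis via VABAMORF; the form string 'pl p'
# (plural partitive) is an assumed format.
from estnltk import synthesize
print(synthesize('raamat', 'pl p'))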

Named entity recognition. Named entity recognition (NER) is often used as an important subtask in information retrieval, opinion mining and semantic indexing. The class Text provides default algorithms for recognising persons, organisations and locations. These algorithms can be invoked by calling tag_named_entities. The call will trigger the whole text-processing pipeline, including tokenization and morphological analysis, if these analyses have not been performed before. The first invocation of tag_named_entities will additionally load the statistical models and store them in the global scope for future use.

The NER algorithms in ESTNLTK are reimplementations of the methods described by Tkachenko et al. (2013). The main difference lies in the implementation details. The original NER tool uses a conditional random fields training algorithm (Lafferty et al., 2001) implemented in the Java package MALLET (McCallum, 2002), whereas the algorithms in ESTNLTK use the C++ CRFSUITE library (Okazaki, 2007). Both achieve reasonable accuracy on Estonian text; the C++ version simply provides better integration with Python.

Extracted named entities, their categories and their locations in the text can be accessed using the properties named_entities, named_entity_labels and named_entity_spans. The same information is also accessible through the layers named_entities and words. The layer named_entities stores information on entity categories and their positions in the text. The layer words carries the individual token labels in BIO format (Tjong Kim Sang and De Meulder, 2003).

The default models have been pre-trained to recognise a fixed set of entity types in generic news articles. It is also possible to customise the NER algorithms by providing a new training corpus consisting of annotated sentences. For this purpose, the class estnltk.ner.NerTrainer implements the necessary feature extraction logic and provides a training interface. To train a new CRFSUITE model, a user must provide a training corpus and a custom configuration module, which lists the feature extractors and feature templates used for detecting named entities. The resulting model can be used with the class estnltk.ner.NerTagger to extract entities from text.

Temporal expression tagging. In many information extraction applications, recognition and normalisation of time-referring expressions (timexes) can significantly improve accuracy. We can essentially discard sentences that describe events from irrelevant time periods. Sometimes, the exact time of an event is the desired information.

The ESTNLTK package integrates AJAVT, a rule-based temporal expression tagger for Estonian (Orasmaa, 2012). The tagger recognises timexes and normalises the semantics of these expressions into a standard format based on TimeML's TIMEX3 (Pustejovsky et al., 2003). Results of the tagging can be accessed via the Text object's property timexes.

By default, expressions with relative semantics (such as eile 'yesterday' or reedel 'on Friday') are resolved with respect to the execution time of the program. This default can be overridden by providing the creation_date argument on initialisation of the Text object, after which relative expressions are resolved with respect to the argument date.

Although the current version of the tagger is tuned for processing news texts, our evaluation has shown that the current configuration of rules obtains high accuracy levels also on other types of formal written language, such as law texts and parliament transcripts. Details on the implementation of the tagger and its evaluation are provided in (Orasmaa, 2012).

Clause boundary detection. Clause boundary detection can be used to split long and complex sentences into smaller segments (clauses). As such, it can be useful in machine learning experiments, offering a linguistically motivated "window" for feature extraction. In linguistic analysis, clause boundary detection offers a context from which phrase boundaries or word collocations can be detected.

The clause tagging in ESTNLTK is a re-implementation of the clause detector module introduced by Kaalep and Muischnek (2012). It recognises consecutively ordered clauses (e.g. [Ta istus nurgalauda ja] [tellis kohvi.] '[She sat at the corner table and] [ordered a coffee.]') and clauses embedded within other clauses (e.g. [Mees[, keda seal nägime,] lahkus kiirustades.] '[The man [who we saw there] left in a hurry.]'); clause boundaries in these examples are indicated by the square brackets surrounding the clauses. Calling the property clause_texts performs clause tagging along with all the dependent computations, such as paragraph, sentence and word tokenization and morphological analysis. Clauses are stored as multi-region annotations in the clauses layer.

Because commas are important clause delimiters in Estonian, the quality of the clause segmentation may suffer due to missing commas. We have improved the original clause boundary detection algorithm by adding a special mode in which the program is more robust to missing commas. In this mode, we apply the regular segmentation rules augmented with the following general heuristic: if a conjunction word (such as et 'that, for', sest 'because', kuid 'although', millal 'when', kus 'where', kuni 'as long as') is in a context where it is preceded and followed by a verbal centre of a clause, but it is not preceded by a comma (although it should be, according to Estonian orthographic conventions), we mark a clause boundary at the location of the missing comma. The heuristic has a number of exceptions, which account for ambiguous usages of the specific conjunction words.

For example, kuni also means 'to', and so we have to exclude its usage in range-denoting phrases (such as in 5 kuni 7 kilomeetrit '5 to 7 kilometres') from being marked as a clause boundary. We have also written exceptions to exclude frequently used verbal idiomatic expressions, such as vaat et 'almost, nearly', lit. 'look that (until)', from being detected as clause boundaries.

We believe that this new mode can be useful for improving clause segmentation quality on non-standard language (such as Internet language), where comma usage often does not follow the orthographic rules.

NLP task                     Affected layers
tokenize_paragraphs()        paragraphs
tokenize_sentences()         sentences
tokenize_words()             words
tag_analysis()               words
tag_clauses()                words, clauses
tag_named_entities()         words, named_entities
tag_timexes()                timexes
tag_verb_chains()            verb_chains

Table 1: Annotation layers affected by ESTNLTK's standard NLP tasks.
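As a brief illustration of the temporal and clause layers described above, the sketch below resolves a relative timex against an explicit document date. The creation_date constructor argument and the timex_values and clause_texts properties are the ones named in this paper and in Figure 2; passing a datetime object and the exact output format are assumptions to verify against the installed version.

from datetime import datetime
from estnltk import Text

# Resolve relative expressions such as 'eile' ("yesterday") against
# an explicit document creation date instead of the current time.
text = Text('Leping allkirjastati eile, kuid see jõustub homme.',
            creation_date=datetime(2016, 2, 28))

# Normalised TIMEX3 values (cf. Figure 2) and clause segmentation.
print(text.timex_values)   # expected to include 2016-02-27 and 2016-02-29
print(text.clause_texts)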

>>> text = Text('Londoni lend, mis pidi täna hommikul kell 4:30 Tallinna saabuma, hilineb mootori starteri rikke tõttu ning peaks Tallinna jõudma ööl vastu homset kell 02:20.')
>>> text.named_entities
[u'London', u'Tallinn', u'Tallinn']
>>> text.clause_texts
[u'Londoni lend hilineb mootori starteri rikke tõttu ning',
 u', mis pidi täna hommikul kell 4:30 Tallinna saabuma,',
 u'peaks Tallinna jõudma ööl vastu homset kell 02:20.']
>>> text.timex_values
[u'2016-02-28T04:30', u'2016-02-29T02:20']
>>> text.verb_chain_texts
[u'hilineb', u'pidi saabuma', u'peaks jõudma']

Figure 2: Example of standard NLP tasks in ESTNLTK applied to the sentence "Londoni lend, mis pidi täna hommikul kell 4:30 Tallinna saabuma, hilineb mootori starteri rikke tõttu ning peaks Tallinna jõudma ööl vastu homset kell 02:20." (eng. "The flight from London, which was scheduled to arrive in Tallinn this morning at 4:30, is delayed due to an engine starter malfunction and should arrive in Tallinn tomorrow night at 02:20."). The example has been run on the date 2016-02-28, so the timex values have been calculated with respect to that date.

5. Additional Analysis Tools

Besides the standard NLP components, ESTNLTK contains additional tools which are less commonly used or less mature than the standard components.

WordNet. The ESTNLTK package comes together with the Estonian Wordnet developed under the EuroWordNet project (Ellman, 2003). The provided interface is mostly NLTK-compatible and enables querying for synsets and relationships and computing similarities between synsets. To access the EuroWordNet database files, we use the Python module Eurown (Kahusk, 2010).

Verb chain detection. The Estonian language is rich in verb chain constructions, where the meaning of the content verb is significantly altered by other parts of the chain. Hence, proper detection and semantic normalisation of such constructions is essential in many practical applications. It allows us to detect negated clauses in sentiment analysis, to distinguish actual events from events marked with uncertainty in event factuality analysis, and to handle verb chains as a single translation unit in machine translation experiments.

The corresponding module in ESTNLTK addresses single-word main verbs and the following verb chain constructions: regular grammatical constructions (negation and compound tenses), catenative verb constructions (such as modal verb + infinite verb, e.g. Ta võib meid homme külastada 'He might visit us tomorrow', and finite verb + infinite verb constructions in general, e.g. Ta unustas meid külastada 'He forgot to visit us'), and verb+nominal constructions which subcategorise for infinite verbs (e.g. Ta otsis võimalust meid külastada 'He sought an opportunity to visit us.'). Detected verb chains are available via the property verb_chains of the Text object. Grammatical features of the finite verb (polarity, tense, mood and voice) are provided as attributes of a verb chain object.

The module has been implemented in a rule-based manner. It first uses morphological information (information about finite verbs) and clause boundary information to detect grammatical main verb constructions, and then relies on subcategorisation information listed in lexicons to further extend the grammatical main verbs into catenative and/or verb+nominal+infinite verb constructions. The subcategorisation lexicons were derived by semi-automatic analysis of the Balanced Corpus of Estonian (http://www.cl.ut.ee/korpused/grammatikakorpus/index.php).

Word2vec models. Recent advances in machine learning provide high-dimensional word embeddings which capture the distributed semantics of words and phrases. These have been successfully applied in machine translation and synonym detection. Thus, we have included experimental support for word2vec models (Mikolov et al., 2013), trained on the Estonian Reference Corpus (http://www.cl.ut.ee/korpused/segakorpus/). The corpus consists of 16M sentences, 55M words and 3M types. We provide separate models trained on the original or on the lemmatised version of the corpus. For training, we used the word2vec software (https://code.google.com/p/word2vec/). The resulting models can be used with the word2vec command line tools or with the Python library gensim (https://radimrehurek.com/gensim/).
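A minimal sketch of consuming such a model from Python with gensim is shown below; the model file name is a hypothetical placeholder for one of the distributed word2vec binaries, and a reasonably recent gensim release is assumed.

from gensim.models import KeyedVectors

# Load a pre-trained Estonian embedding in the binary word2vec
# format; replace the file name with the downloaded model.
model = KeyedVectors.load_word2vec_format('lemmas.cbow.s100.w2v.bin',
                                          binary=True)

# Query distributional neighbours of a lemma.
for word, score in model.most_similar('raamat', topn=5):  # 'raamat' = 'book'
    print(word, round(score, 3))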

Efficient document storage. Many modern text analysis methods require a large collection of documents as an input to be bootstrapped. This process is usually very resource intensive. In most cases, the computational complexity can be significantly reduced by providing an efficient search interface. Most storage solutions provide only efficient keyword search over text, whereas operations in ESTNLTK require efficient search over layers. Therefore, we have included a Database class that uses the NoSQL database Elastic Search (https://www.elastic.co/products/elasticsearch) to store and retrieve Text objects.

Visualisation tools. Many NLP applications produce text annotations. Visualisation of these annotations is often the best way to convey the analysis results to the end user. The PrettyPrinter class can be used to highlight text fragments according to annotations. The class generates valid HTML markup together with CSS files that can be used for developing Web front ends.

There are eight aesthetics that can be used for visualisation: text colour, text background, font, font weight, normal/italics, normal/underline, text size and character tracking. The simplest use case involves mapping a layer to an aesthetic and then rendering it as HTML and CSS. As a result, all regions in the layer will obtain the same markup, for instance a green background. It is also possible to decorate regions according to annotations, e.g. colour verbs green and nouns red. For that, one must specify a mapping between annotation values and aesthetic values.

Grammar extractor. In many information retrieval and text classification tasks one must first recognise text fragments with a certain structure, such as measurements in medical texts or trigger phrases in sentiment mining. The grammar extractor module allows such phrases to be specified in terms of layers using a finite grammar.

6. Performance

For practical uses, it is important to have an idea of the performance characteristics of the individual components of the framework. Thus, we benchmarked ESTNLTK using a sample of 500 news articles picked randomly from the Estonian Reference Corpus (http://www.cl.ut.ee/korpused/segakorpus/index.php). The sample contains 167,180 tokens. We ran the benchmarks using Python 3.4 on a Linux Mint desktop with an Intel Core i5 processor and 4 GB of RAM. Table 2 illustrates the results. Note that the absolute values should be taken with a grain of salt, since they largely depend on the particular benchmark environment.

NLP task                       tokens/second
Tokenization                   97,095
Morphological analysis         3,913
NER                            2,550
Temporal expression tagging    1,905
Clause boundary detection      5,271
Verb chain detection           11,658

Table 2: Performance of ESTNLTK core components. The reported performance relates to the individual component alone and does not account for the execution time of its dependencies.

7. Source Code

ESTNLTK is implemented in Python and supports language versions 2.7 and 3.4. Under the hood, ESTNLTK uses a number of external components:

vabamorf, a C++ library for morphological analysis, disambiguation and synthesis, licensed under LGPL (https://github.com/Filosoft/vabamorf);

python-crfsuite, a Python binding to the CRFSUITE C++ library; python-crfsuite is licensed under the MIT license, the CRFSUITE library under the BSD license;

osalausestaja, a Java-based clause segmenter for Estonian, licensed under GPL version 2 (https://github.com/soras/osalausestaja);

Ajavt, a Java-based temporal expression tagger for Estonian, licensed under GPL version 2 (https://github.com/soras/Ajavt).

The source code is hosted online on github.com. ESTNLTK can be installed directly from GitHub or, alternatively, from the Python Package Index (PyPI) using the command pip install estnltk. The ESTNLTK source code is licensed under the GNU GPL version 2+. It does not permit using ESTNLTK to build proprietary software, as it requires the entire source code of a derived work to be made available to end users.

8. Conclusions

We presented ESTNLTK, a Python library which aims to organise Estonian NLP resources into a single framework. Although ESTNLTK is still in its infancy, it has already seen a number of uses, including online media monitoring, digital book indexing and client complaint categorisation. Recently, ESTNLTK was used for the first time in teaching an NLP course in Estonian at the University of Tartu. We believe that in the future ESTNLTK will find even more applications.

9. Acknowledgements

ESTNLTK has been funded by Eesti Keeletehnoloogia Riiklik Programm under the project EKT57, and supported by the Estonian Ministry of Education and Research (grant IUT 20-56 "Computational models for Estonian"). We are grateful to our code contributors: Heiki-Jaan Kaalep, Neeme Kahusk, Karl-Oskar Masing, Andres Matsin, Siim Orasmaa, Timo Petmanson, Annett Saarik, Alexander Tkachenko, Tarmo Vaino and Karl Valliste.

10. References

Apache UIMA. (2010). Unstructured Information Management Applications. http://uima.apache.org. Online; accessed 20 October 2015.

Bird, S., Loper, E., and Klein, E. (2009). Natural Language Processing with Python. O'Reilly Media Inc.

Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gorrell, G., Funk, A., Roberts, A., Damljanovic, D., Heitz, T., Greenwood, M. A., Saggion, H., Petrak, J., Li, Y., and Peters, W. (2011). Text Processing with GATE (Version 6).

Cutting, D., Busch, M., Cohen, D., Gospodnetic, O., Hatcher, E., Hostetter, C., Ingersoll, G., McCandless, M., Messer, B., Naber, D., et al. (2004). Apache Lucene. https://lucene.apache.org/. Online; accessed 20 October 2015.

Řehůřek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora.

Ellman, J. (2003). EuroWordNet: A multilingual database with lexical semantic networks, edited by Piek Vossen. Kluwer Academic Publishers, 1998. ISBN 0792352955.

Kaalep, H.-J. and Muischnek, K. (2012). Robust clause boundary identification for corpus annotation. In LREC, pages 1632–1636.

Kaalep, H.-J. and Vaino, T. (2001). Complete morphological analysis in the linguist's toolbox. Congressus Nonus Internationalis Fenno-Ugristarum Pars V, pages 9–16. The corresponding C++ code is available from https://github.com/Filosoft/vabamorf.

Kahusk, N. (2010). Eurown: An EuroWordNet module for Python. In Principles, Construction and Application of Multilingual Wordnets. Proceedings of the 5th Global Wordnet Conference (GWC 2010), pages 360–364.

Lafferty, J., McCallum, A., and Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Loria, S. (2014). TextBlob: Simplified text processing.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

McCallum, A. K. (2002). MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Okazaki, N. (2007). CRFsuite: A fast implementation of conditional random fields (CRFs). http://www.chokkan.org/software/crfsuite/. Online; accessed 20 October 2015.

Orasmaa, S. (2012). Automaatne ajaväljendite tuvastamine eestikeelsetes tekstides (Automatic Recognition and Normalization of Temporal Expressions in Estonian Language Texts). Eesti Rakenduslingvistika Ühingu aastaraamat, (8):153–169.

Pustejovsky, J., Castano, J. M., Ingria, R., Sauri, R., Gaizauskas, R. J., Setzer, A., Katz, G., and Radev, D. R. (2003). TimeML: Robust specification of event and temporal expressions in text. New Directions in Question Answering, 3:28–34.

Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147. Association for Computational Linguistics.

Tkachenko, A., Petmanson, T., and Laur, S. (2013). Named entity recognition in Estonian. In Proceedings of the Workshop on Balto-Slavic NLP, page 78. Association for Computational Linguistics.
