Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Peng Qi*  Yuhao Zhang*  Yuhui Zhang  Jason Bolton  Christopher D. Manning
Stanford University
Stanford, CA 94305
{pengqi, yuhaozhang, yuhuiz}@stanford.edu
{jebolton, manning}@stanford.edu

* Equal contribution. Order decided by a tossed coin.

Abstract

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.

[Figure 1: Overview of Stanza's neural NLP pipeline. Stanza takes multilingual text as input, and produces annotations accessible as native Python objects. Besides this neural pipeline, Stanza also features a Python client interface to the Java CoreNLP software.]

1 Introduction

The growing availability of open-source natural language processing (NLP) toolkits has made it easier for users to build tools with sophisticated linguistic processing. While existing NLP toolkits such as CoreNLP (Manning et al., 2014), FLAIR (Akbik et al., 2019), spaCy¹, and UDPipe (Straka, 2018) have had wide usage, they also suffer from several limitations. First, existing toolkits often support only a few major languages. This has significantly limited the community's ability to process multilingual text. Second, widely used tools are sometimes under-optimized for accuracy, either due to a focus on efficiency (e.g., spaCy) or use of less powerful models (e.g., CoreNLP), potentially misleading downstream applications and insights obtained from them. Third, some tools assume input text has been tokenized or annotated with other tools, lacking the ability to process raw text within a unified framework. This has limited their wide applicability to text from diverse sources.

We introduce Stanza², a Python natural language processing toolkit supporting many human languages. As shown in Table 1, compared to existing widely-used NLP toolkits, Stanza has the following advantages:

¹ https://spacy.io/
² The toolkit was called StanfordNLP prior to v1.0.0.

• From raw text to annotations. Stanza features a fully neural pipeline which takes raw text as input, and produces annotations including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.

• Multilinguality. Stanza's architectural design is language-agnostic and data-driven, which allows us to release models supporting 66 languages, by training the pipeline on the Universal Dependencies (UD) treebanks and other multilingual corpora.

System    # Human Languages   Programming Language   Raw Text Processing   Fully Neural   Pretrained Models   State-of-the-art Performance
CoreNLP   6                   Java                   ✓                                    ✓
FLAIR     12                  Python                                       ✓              ✓                   ✓
spaCy     10                  Python                 ✓                                    ✓
UDPipe    61                  C++                    ✓                     ✓              ✓
Stanza    66                  Python                 ✓                     ✓              ✓                   ✓

Table 1: Feature comparisons of Stanza against other popular natural language processing toolkits.

• State-of-the-art performance. We evaluate Stanza on a total of 112 datasets, and find its neural pipeline adapts well to text of different genres, achieving state-of-the-art or competitive performance at each step of the pipeline.

Additionally, Stanza features a Python interface to the widely used Java CoreNLP package, allowing access to additional tools such as coreference resolution and relation extraction.

Stanza is fully open source and we make pretrained models for all supported languages and datasets available for public download. We hope Stanza can facilitate multilingual NLP research and applications, and drive future research that produces insights from human languages.

2 System Design and Architecture

At the top level, Stanza consists of two individual components: (1) a fully neural multilingual NLP pipeline; (2) a Python client interface to the Java Stanford CoreNLP software. In this section we introduce their designs.

2.1 Neural Multilingual NLP Pipeline

Stanza's neural pipeline consists of models that range from tokenizing raw text to performing syntactic analysis on entire sentences (see Figure 1). All components are designed with processing many human languages in mind, with high-level design choices capturing common phenomena in many languages and data-driven models that learn the difference between these languages from data. Moreover, the implementation of Stanza components is highly modular, and reuses basic model architectures when possible for compactness. We highlight the important design choices here, and refer the reader to Qi et al. (2018) for modeling details.

[Figure 2: An example of multi-word tokens in French.
(fr) L'Association des Hôtels            (en) The Association of Hotels
(fr) Il y a des hôtels en bas de la rue  (en) There are hotels down the street
The des in the first sentence corresponds to two syntactic words, de and les; the second des is a single word.]

Tokenization and Sentence Splitting. When presented raw text, Stanza tokenizes it and groups tokens into sentences as the first step of processing. Unlike most existing toolkits, Stanza combines tokenization and sentence segmentation from raw text into a single module. This is modeled as a tagging problem over character sequences, where the model predicts whether a given character is the end of a token, end of a sentence, or end of a multi-word token (MWT, see Figure 2).³ We choose to predict MWTs jointly with tokenization because this task is context-sensitive in some languages.

³ Following Universal Dependencies (Nivre et al., 2020), we make a distinction between tokens (contiguous spans of characters in the input text) and syntactic words. These are interchangeable aside from the cases of MWTs, where one token can correspond to multiple words.
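As a concrete illustration, the short sketch below (ours, not from the paper) runs only the tokenize and mwt processors on the French sentences from Figure 2 and prints each token alongside its underlying syntactic words; the attribute names (doc.sentences, sentence.tokens, token.words) follow Stanza's documented API:

    import stanza

    # download the French models (run once)
    stanza.download('fr')
    # a tokenize-only pipeline; mwt expands multi-word tokens such as "des"
    nlp = stanza.Pipeline('fr', processors='tokenize,mwt')
    doc = nlp("L'Association des Hôtels. Il y a des hôtels en bas de la rue.")
    for i, sentence in enumerate(doc.sentences):
        print(f'Sentence {i + 1}:')
        for token in sentence.tokens:
            # for an MWT, token.words holds the expanded syntactic words,
            # e.g. the first "des" should yield "de" + "les"
            print(' ', token.text, '->', [word.text for word in token.words])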

Multi-word Token Expansion. Once MWTs are identified by the tokenizer, they are expanded into the underlying syntactic words as the basis of downstream processing. This is achieved with an ensemble of a frequency lexicon and a neural sequence-to-sequence (seq2seq) model, to ensure that frequently observed expansions in the training set are always robustly expanded while maintaining flexibility to model unseen words statistically.

POS and Morphological Feature Tagging. For each word in a sentence, Stanza assigns it a part-of-speech (POS), and analyzes its universal morphological features (UFeats, e.g., singular/plural, 1st/2nd/3rd person, etc.). To predict POS and UFeats, we adopt a bidirectional long short-term memory network (Bi-LSTM) as the basic architecture. For consistency among universal POS (UPOS), language-specific POS (XPOS), and UFeats, we adopt the biaffine scoring mechanism from Dozat and Manning (2017) to condition XPOS and UFeats prediction on that of UPOS.
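To make the tagger's outputs concrete, here is a small usage sketch of ours (not from the paper): each Word object carries upos, xpos, and feats attributes per Stanza's documented API.

    import stanza

    # run the tagger and inspect its three outputs for each word
    nlp = stanza.Pipeline('en', processors='tokenize,pos')
    doc = nlp('The cats sat quietly.')
    for word in doc.sentences[0].words:
        # upos: universal POS; xpos: treebank-specific POS; feats: UFeats string
        print(word.text, word.upos, word.xpos, word.feats)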

Lemmatization. Stanza also lemmatizes each word in a sentence to recover its canonical form (e.g., did→do). Similar to the multi-word token expander, Stanza's lemmatizer is implemented as an ensemble of a dictionary-based lemmatizer and a neural seq2seq lemmatizer. An additional classifier is built on the encoder output of the seq2seq model, to predict shortcuts such as lowercasing and identity copy for robustness on long input sequences such as URLs.

Dependency Parsing. Stanza parses each sentence for its syntactic structure, where each word in the sentence is assigned a syntactic head that is either another word in the sentence, or in the case of the root word, an artificial root symbol. We implement a Bi-LSTM-based deep biaffine neural dependency parser (Dozat and Manning, 2017). We further augment this model with two linguistically motivated features: one that predicts the linearization order of two words in a given language, and the other that predicts the typical distance in linear order between them. We have previously shown that these features significantly improve parsing accuracy (Qi et al., 2018).

Named Entity Recognition. For each input sentence, Stanza also recognizes named entities in it (e.g., person names, organizations, etc.). For NER we adopt the contextualized string representation-based sequence tagger from Akbik et al. (2018). We first train a forward and a backward character-level LSTM language model, and at tagging time we concatenate the representations at the end of each word position from both language models with word embeddings, and feed the result into a standard one-layer Bi-LSTM sequence tagger with a conditional random field (CRF)-based decoder.

2.2 CoreNLP Client

Stanford's Java CoreNLP software provides a comprehensive set of NLP tools especially for the English language. However, these tools are not easily accessible with Python, the programming language of choice for many NLP practitioners, due to the lack of official support. To facilitate the use of CoreNLP from Python, we take advantage of the existing server interface in CoreNLP, and implement a robust client as its Python interface.

When the CoreNLP client is instantiated, Stanza will automatically start the CoreNLP server as a local process. The client then communicates with the server through its RESTful APIs, after which annotations are transmitted in Protocol Buffers, and converted back to native Python objects. Users can also specify JSON or XML as annotation format. To ensure robustness, while the client is being used, Stanza periodically checks the health of the server, and restarts it if necessary.

3 System Usage

Stanza's user interface is designed to allow quick out-of-the-box processing of multilingual text. To achieve this, Stanza supports automated model download via Python code and pipeline customization with processors of choice. Annotation results can be accessed as native Python objects to allow for flexible post-processing.

3.1 Neural Pipeline Interface

Stanza's neural NLP pipeline can be initialized with the Pipeline class, taking language name as an argument. By default, all processors will be loaded and run over the input text; however, users can also specify the processors to load and run with a list of processor names as an argument. Users can additionally specify other processor-level properties, such as batch sizes used by processors, at initialization time.

The following code snippet shows a minimal usage of Stanza for downloading the Chinese model, annotating a sentence with customized processors, and printing out all annotations:

    import stanza

    # download Chinese model
    stanza.download('zh')
    # initialize Chinese neural pipeline
    nlp = stanza.Pipeline('zh', processors='tokenize,pos,ner')
    # run annotation over a sentence
    doc = nlp('斯坦福是一所私立研究型大学。')
    print(doc)

After all processors are run, a Document instance will be returned, which stores all annotation results. Within a Document, annotations are further stored in Sentences, Tokens and Words in a top-down fashion (Figure 1). The following code snippet demonstrates how to access the text and POS tag of each word in a document and all named entities in the document:

    # print the text and POS of all words
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.pos)

    # print all entities in the document
    print(doc.entities)
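The lemma and dependency annotations described in Section 2.1 are exposed on the same Word objects. A sketch of ours, assuming a pipeline loaded with the lemma and depparse processors (the Chinese pipeline above loaded only tokenize, pos, and ner):

    import stanza

    nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
    doc = nlp('Emily said that she liked the movie.')
    for sentence in doc.sentences:
        for word in sentence.words:
            # word.head is a 1-based index into sentence.words; 0 marks the root
            head = sentence.words[word.head - 1].text if word.head > 0 else 'root'
            print(word.text, word.lemma, word.deprel, head)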

Stanza is designed to be run on different hardware devices. By default, CUDA devices will be used whenever they are visible to the pipeline; otherwise, CPUs will be used. However, users can force all computation to be run on CPUs by setting use_gpu=False at initialization time.
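For instance (our sketch; use_gpu is the flag named above):

    import stanza

    # force all computation onto the CPU, e.g., on a machine without CUDA
    nlp = stanza.Pipeline('en', processors='tokenize,pos', use_gpu=False)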

3.2 CoreNLP Client Interface

The CoreNLP client interface is designed in a way that the actual communication with the backend CoreNLP server is transparent to the user. To annotate an input text with the CoreNLP client, a CoreNLPClient instance needs to be initialized, with an optional list of CoreNLP annotators. After the annotation is complete, results will be accessible as native Python objects.

This code snippet shows how to establish a CoreNLP client and obtain the NER and coreference annotations of an English sentence:

    from stanza.server import CoreNLPClient

    # start a CoreNLP client
    with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma',
                                   'ner', 'parse', 'coref']) as client:
        # run annotation over input
        ann = client.annotate('Emily said that she liked the movie.')
        # access all entities
        for sent in ann.sentence:
            print(sent.mentions)
        # access coreference annotations
        print(ann.corefChain)

With the client interface, users can annotate text in 6 languages as supported by CoreNLP.
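Beyond the annotator list, the client constructor also accepts server-level options. The sketch below is ours; memory, endpoint, and output_format are keyword arguments we believe the client exposes, and the specific values are illustrative:

    from stanza.server import CoreNLPClient

    # a sketch of server-level client options (values are illustrative)
    with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos'],
                       memory='4G',                       # Java heap size for the server
                       endpoint='http://localhost:9001',  # address the local server binds to
                       output_format='json') as client:   # JSON instead of Protocol Buffers
        ann = client.annotate('Stanford is a private research university.')
        print(ann)  # with output_format='json', the result is a plain Python dict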

3.3 Interactive Web-based Demo

To help visualize documents and their annotations generated by Stanza, we build an interactive web demo that runs the pipeline interactively. For all languages and all annotations Stanza provides in those languages, we generate predictions from the models trained on the largest treebank/NER dataset, and visualize the result with the Brat rapid annotation tool.⁴ This demo runs in a client/server architecture, and annotation is performed on the server side. We make one instance of this demo publicly available at http://stanza.run/. It can also be run locally with proper Python libraries installed. An example of running Stanza on a German sentence can be found in Figure 3.

[Figure 3: Stanza annotates a German sentence, as visualized by our interactive demo. Note am is expanded into syntactic words an and dem before downstream analyses are performed.]

⁴ https://brat.nlplab.org/

3.4 Training Pipeline Models

For all neural processors, Stanza provides command-line interfaces for users to train their own customized models. To do this, users need to prepare the training and development data in compatible formats (i.e., CoNLL-U format for the Universal Dependencies pipeline and BIO format column files for the NER model). The following command trains a neural dependency parser with user-specified training and development data:

    $ python -m stanza.models.parser \
        --train_file train.conllu \
        --eval_file dev.conllu \
        --gold_file dev.conllu \
        --output_file output.conllu
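Once training finishes, the resulting model file can be plugged back into the Python pipeline. In this sketch of ours, the saved path is hypothetical, and the processor-specific depparse_model_path keyword argument is how we understand custom weights are supplied to the Pipeline class:

    import stanza

    # load a custom parser; the path below is hypothetical and depends on
    # where the training command above saved its model file
    nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse',
                          depparse_model_path='saved_models/depparse/en_parser.pt')
    doc = nlp('We can now parse text with the retrained model.')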

4 Performance Evaluation

To establish benchmark results and compare with other popular toolkits, we trained and evaluated Stanza on a total of 112 datasets. All pretrained models are publicly downloadable.

Datasets. We train and evaluate Stanza's tokenizer/sentence splitter, MWT expander, POS/UFeats tagger, lemmatizer, and dependency parser with the Universal Dependencies v2.5 treebanks (Zeman et al., 2019). For training we use 100 treebanks from this release that have non-copyrighted training data, and for treebanks that do not include development data, we randomly split out 20% of the training data as development data. These treebanks represent 66 languages, mostly European languages, but spanning a diversity of language families, including Indo-European, Afro-Asiatic, Uralic, Turkic, Sino-Tibetan, etc. For NER, we train and evaluate Stanza with 12 publicly available datasets covering 8 major languages as shown in Table 3 (Nothman et al., 2013; Tjong Kim Sang and De Meulder, 2003; Tjong Kim Sang, 2002; Benikova et al., 2014; Mohit et al., 2012; Taulé et al., 2008; Weischedel et al., 2013). For the WikiNER corpora, as canonical splits are not available, we randomly split them into 70% training, 15% dev and 15% test splits. For all other corpora we used their canonical splits.

Training. On the Universal Dependencies treebanks, we tuned all hyper-parameters on several large treebanks and applied them to all other treebanks. We used the word2vec embeddings released as part of the 2018 UD Shared Task (Zeman et al., 2018), or the fastText embeddings (Bojanowski et al., 2017) whenever word2vec is not available. For the character-level language models in the NER component, we pretrained them on a mix of the Common Crawl and Wikipedia dumps, and the news corpora released by the WMT19 Shared Task (Barrault et al., 2019), except for English and Chinese, for which we pretrained on the Google One Billion Word (Chelba et al., 2013) and the Chinese Gigaword corpora⁵, respectively. We again applied the same hyper-parameters to models for all languages.

⁵ https://catalog.ldc.upenn.edu/LDC2011T13

Treebank                  System   Tokens   Sents.   Words   UPOS    XPOS    UFeats   Lemmas   UAS     LAS
Overall (100 treebanks)   Stanza   99.09    86.05    98.63   92.49   91.80   89.93    92.78    80.45   75.68
Arabic-PADT               Stanza   99.98    80.43    97.88   94.89   91.75   91.86    93.27    83.27   79.33
                          UDPipe   99.98    82.09    94.58   90.36   84.00   84.16    88.46    72.67   68.14
Chinese-GSD               Stanza   92.83    98.80    92.83   89.12   88.93   92.11    92.83    72.88   69.82
                          UDPipe   90.27    99.10    90.27   84.13   84.04   89.05    90.26    61.60   57.81
English-EWT               Stanza   99.01    81.13    99.01   95.40   95.12   96.11    97.21    86.22   83.59
                          UDPipe   98.90    77.40    98.90   93.26   92.75   94.23    95.45    80.22   77.03
                          spaCy    97.30    61.19    97.30   86.72   90.83   –        87.05    –       –
French-GSD                Stanza   99.68    94.92    99.48   97.30   –       96.72    97.64    91.38   89.05
                          UDPipe   99.68    93.59    98.81   95.85   –       95.55    96.61    87.14   84.26
                          spaCy    98.34    77.30    94.15   86.82   –       –        87.29    67.46   60.60
Spanish-AnCora            Stanza   99.98    99.07    99.98   98.78   98.67   98.59    99.19    92.21   90.01
                          UDPipe   99.97    98.32    99.95   98.32   98.13   98.13    98.48    88.22   85.10
                          spaCy    99.47    97.59    98.95   94.04   –       –        79.63    86.63   84.13

Table 2: Neural pipeline performance comparisons on the Universal Dependencies (v2.5) test treebanks. For our system we show macro-averaged results over all 100 treebanks. We also compare our system against UDPipe and spaCy on treebanks of five major languages where the corresponding pretrained models are publicly available. All results are F1 scores produced by the 2018 UD Shared Task official evaluation script.

Universal Dependencies Results. For performance on UD treebanks, we compared Stanza (v1.0) against UDPipe (v1.2) and spaCy (v2.2) on treebanks of 5 major languages whenever a pretrained model is available. As shown in Table 2, Stanza achieved the best performance on most scores reported. Notably, we find that Stanza's language-agnostic architecture is able to adapt to datasets of different languages and genres. This is also shown by Stanza's high macro-averaged scores over 100 treebanks covering 66 languages.

NER Results. For performance of the NER component, we compared Stanza (v1.0) against FLAIR (v0.4.5) and spaCy (v2.2). For spaCy we reported results from its publicly available pretrained model whenever one trained on the same dataset can be found, otherwise we retrained its model on our datasets with default hyper-parameters, following the publicly available tutorial.⁶ For FLAIR, since their downloadable models were pretrained on dataset versions different from canonical ones, we retrained all models on our own dataset splits with their best reported hyper-parameters. All test results are shown in Table 3. We find that on all datasets Stanza achieved either higher or close F1 scores when compared against FLAIR. When compared to spaCy, Stanza's NER performance is much better. It is worth noting that Stanza's high performance is achieved with much smaller models compared with FLAIR (up to 75% smaller), as we intentionally compressed the models for memory efficiency and ease of distribution.

⁶ https://spacy.io/usage/training#ner. Note that, following this public tutorial, we did not use pretrained word embeddings when training spaCy NER models, although using pretrained word embeddings may potentially improve the NER results.

Language   Corpus       # Types   Stanza   FLAIR   spaCy
Arabic     AQMAR        4         74.3     74.0    –
Chinese    OntoNotes    18        79.2     –       –
Dutch      CoNLL02      4         89.2     90.3    73.8
           WikiNER      4         94.8     94.8    90.9
English    CoNLL03      4         92.1     92.7    81.0
           OntoNotes    18        88.8     89.0    85.4*
French     WikiNER      4         92.9     92.5    88.8*
German     CoNLL03      4         81.9     82.5    63.9
           GermEval14   4         85.2     85.4    68.4
Russian    WikiNER      4         92.9     –       –
Spanish    CoNLL02      4         88.1     87.3    77.5
           AnCora       4         88.6     88.4    76.1

Table 3: NER performance across different languages and corpora. All scores reported are entity micro-averaged test F1. For each corpus we also list the number of entity types. * marks results from publicly available pretrained models on the same dataset, while others are from models retrained on our datasets.

Speed comparison. We compare Stanza against existing toolkits to evaluate the time it takes to annotate text (see Table 4). For GPU tests we use a single NVIDIA Titan RTX card. Unsurprisingly, Stanza's extensive use of accurate neural models makes it take significantly longer than spaCy to annotate text, but it is still competitive when compared against toolkits of similar accuracy, especially with the help of GPU acceleration.

Task   Stanza (CPU)   Stanza (GPU)   UDPipe (CPU)   FLAIR (CPU)   FLAIR (GPU)
UD     10.3×          3.22×          4.30×          –             –
NER    17.7×          1.08×          –              51.8×         1.17×

Table 4: Annotation runtime of various toolkits relative to spaCy (CPU) on the English EWT treebank and OntoNotes NER test sets. For reference, on the compared UD and NER tasks, spaCy is able to process 8140 and 5912 tokens per second, respectively.

5 Conclusion and Future Work

We introduced Stanza, a Python natural language processing toolkit supporting many human languages. We have shown that Stanza's neural pipeline not only has wide coverage of human languages, but also is accurate on all tasks, thanks to its language-agnostic, fully neural architectural design. Simultaneously, Stanza's CoreNLP client extends its functionality with additional NLP tools.

For future work, we consider the following areas of improvement in the near term:

• Models downloadable in Stanza are largely trained on a single dataset. To make models robust to many different genres of text, we would like to investigate the possibility of pooling various sources of compatible data to train "default" models for each language;

• The amount of computation and resources available to us is limited. We would therefore like to build an open "model zoo" for Stanza, so that researchers from outside our group can also contribute their models and benefit from models released by others;

• Stanza was designed to optimize for accuracy of its predictions, but this sometimes comes at the cost of computational efficiency and limits the toolkit's use. We would like to further investigate reducing model sizes and speeding up computation in the toolkit, while still maintaining the same level of accuracy;

• We would also like to expand Stanza's functionality by adding other processors such as neural coreference resolution or relation extraction for richer text analytics.

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments, Arun Chaganty for his early contribution to this toolkit, Tim Dozat for his design of the original architectures of the tagger and parser models, Matthew Honnibal and Ines Montani for their help with spaCy integration and helpful comments on the draft, Ranting Guo for the logo design, and John Bauer and the community contributors for their help with maintaining and improving this toolkit. This research is funded in part by Samsung Electronics Co., Ltd. and in part by the SAIL-JD Research Initiative.

References

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Association for Computational Linguistics.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). Association for Computational Linguistics.

Darina Benikova, Chris Biemann, and Marc Reznicek. 2014. NoSta-D named entity annotation for German: Guidelines and dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14).

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations (ICLR).

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations.

Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith. 2012. Recall-oriented learning of named entities in Arabic Wikipedia. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC'20).

Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R. Curran. 2013. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175.

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. 2018. Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.

Mariona Taulé, M. Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). European Language Resources Association (ELRA).

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes release 5.0. Linguistic Data Consortium.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, et al. 2019. Universal Dependencies 2.5. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
108