The Stanford CoreNLP Natural Language Processing Toolkit

Christopher D. Manning, Linguistics & Computer Science, Stanford University, [email protected]
Mihai Surdeanu, SISTA, University of Arizona, [email protected]
John Bauer, Dept of Computer Science, Stanford University, [email protected]
Jenny Finkel, Prismatic Inc., [email protected]
Steven J. Bethard, Computer and Information Sciences, U. of Alabama at Birmingham, [email protected]
David McClosky, IBM Research, [email protected]

Abstract

We describe the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. This toolkit is quite widely used, both in the research NLP community and also among commercial and government users of open source NLP technology. We suggest that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.

[Figure 1: Overall system architecture: Raw text is put into an Annotation object and then a sequence of Annotators add information in an analysis pipeline. The resulting Annotation, containing all the analysis information added by the Annotators, can be output in XML or plain text forms. The pipeline stages shown are Tokenization (tokenize), Sentence Splitting (ssplit), Part-of-speech Tagging (pos), Morphological Analysis (lemma), Named Entity Recognition (ner), Syntactic Parsing (parse), Coreference Resolution (dcoref), and Other Annotators (gender, sentiment).]

1 Introduction

This paper describes the design and development of Stanford CoreNLP, a Java (or at least JVM-based) annotation pipeline framework, which provides most of the common core natural language processing (NLP) steps, from tokenization through to coreference resolution. We describe the original design of the system and its strengths (section 2), simple usage patterns (section 3), the set of provided annotators and how properties control them (section 4), and how to add additional annotators (section 5), before concluding with some higher-level remarks and additional appendices. While there are several good natural language analysis toolkits, Stanford CoreNLP is one of the most used, and a central theme is trying to identify the attributes that contributed to its success.

2 Original Design and Development

Our pipeline system was initially designed for internal use. Previously, when combining multiple natural language analysis components, each with their own ad hoc APIs, we had tied them together with custom glue code. The initial version of the annotation pipeline was developed in 2006 in order to replace this jumble with something better. A uniform interface was provided for an Annotator that adds some kind of analysis information to some text. An Annotator does this by taking in an Annotation object to which it can add extra information. An Annotation is stored as a typesafe heterogeneous map, following the ideas for this data type presented by Bloch (2008). This basic architecture has proven quite successful, and is still the basis of the system described here. It is illustrated in figure 1. The motivations were:

• To be able to quickly and painlessly get linguistic annotations for a text.
• To hide variations across components behind a common API.
• To have a minimal conceptual footprint, so the system is easy to learn.
• To provide a lightweight framework, using plain Java objects (rather than something of heavier weight, such as XML or UIMA's Common Analysis System (CAS) objects).

In 2009, initially as part of a multi-site grant project, the system was extended to be more easily usable by a broader range of users. We provided a command-line interface and the ability to write out an Annotation in various formats, including XML. Further work led to the system being released as free open source software in 2010.

On the one hand, from an architectural perspective, Stanford CoreNLP does not attempt to do everything. It is nothing more than a straightforward pipeline architecture. It provides only a Java API.[1] It does not attempt to provide multiple machine scale-out (though it does provide multi-threaded processing on a single machine). It provides a simple concrete API. But these requirements satisfy a large percentage of potential users, and the resulting simplicity makes it easier for users to get started with the framework. That is, the primary advantage of Stanford CoreNLP over larger frameworks like UIMA (Ferrucci and Lally, 2004) or GATE (Cunningham et al., 2002) is that users do not have to learn UIMA or GATE before they can get started; they only need to know a little Java. In practice, this is a large and important differentiator. If more complex scenarios are required, such as multiple machine scale-out, they can normally be achieved by running the analysis pipeline within a system that focuses on distributed workflows (such as Hadoop or Spark). Other systems attempt to provide more, such as the UIUC Curator (Clarke et al., 2012), which includes inter-machine client-server communication for processing and the caching of natural language analyses. But this functionality comes at a cost. The system is complex to install and complex to understand. Moreover, in practice, an organization may well be committed to a scale-out solution which is different from that provided by the natural language analysis toolkit. For example, they may be using Kryo or Google's protobuf for binary serialization rather than Apache Thrift, which underlies Curator. In this case, the user is better served by a fairly small and self-contained natural language analysis system, rather than something which comes with a lot of baggage for all sorts of purposes, most of which they are not using.

[1] Nevertheless, it can call an analysis component written in other languages via an appropriate wrapper Annotator, and in turn, it has been wrapped by many people to provide Stanford CoreNLP bindings for other languages, including C# and F#.

On the other hand, most users benefit greatly from the provision of a set of stable, robust, high quality linguistic analysis components, which can be easily invoked for common scenarios. While the builder of a larger system may have made overall design choices, such as how to handle scale-out, they are unlikely to be an NLP expert, and are hence looking for NLP components that just work. This is a huge advantage that Stanford CoreNLP and GATE have over the empty toolbox of an Apache UIMA download, something addressed in part by the development of well-integrated component packages for UIMA, such as ClearTK (Bethard et al., 2014), DKPro Core (Gurevych et al., 2007), and JCoRe (Hahn et al., 2008). However, the solution provided by these packages remains harder to learn, more complex, and heavier weight for users than the pipeline described here.

These attributes echo what Patricio (2009) argued made Hibernate successful, including: (i) do one thing well, (ii) avoid over-design, and (iii) up and running in ten minutes or less! Indeed, the design and success of Stanford CoreNLP also reflect several other of the factors that Patricio highlights, including (iv) avoid standardism, (v) documentation, and (vi) developer responsiveness. While there are many factors that contribute to the uptake of a project, and it is hard to show causality, we believe that some of these attributes account for the fact that Stanford CoreNLP is one of the more used NLP toolkits. While we certainly have not done a perfect job, compared to much academic software, Stanford CoreNLP has gained from attributes such as clear open source licensing, a modicum of attention to documentation, and attempting to answer user questions.

3 Elementary Usage

A key design goal was to make it very simple to set up and run processing pipelines, from either the API or the command-line. Using the API, running a pipeline can be as easy as figure 2. Or, at the command-line, doing linguistic processing for a file can be as easy as figure 3. Real life is rarely this simple, but the ability to get started using the product with minimal configuration code gives new users a very good initial experience.

    Annotator pipeline = new StanfordCoreNLP();
    Annotation annotation = new Annotation(
        "Can you parse my sentence?");
    pipeline.annotate(annotation);

Figure 2: Minimal code for an analysis pipeline.

    export StanfordCoreNLP_HOME=/where/installed
    java -Xmx2g -cp "$StanfordCoreNLP_HOME/*" \
        edu.stanford.nlp.pipeline.StanfordCoreNLP -file input.txt

Figure 3: Minimal command-line invocation.

Figure 4 gives a more realistic (and complete) example of use, showing several key properties of the system. An annotation pipeline can be applied to any text, such as a paragraph or whole story rather than just a single sentence. The behavior of

    import java.io.*;
    import java.util.*;
    import edu.stanford.nlp.io.*;
    import edu.stanford.nlp.ling.*;
    import edu.stanford.nlp.pipeline.*;
    import edu.stanford.nlp.trees.*;
    import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
    import edu.stanford.nlp.util.*;

    public class StanfordCoreNlpExample {
      public static void main(String[] args) throws IOException {
        PrintWriter xmlOut = new PrintWriter("xmlOutput.xml");
        Properties props = new Properties();

Figure 4: A more realistic example of use (the listing is truncated in this excerpt).

4 Provided annotators

The annotators provided with StanfordCoreNLP can work with any character encoding, making use of Java's good Unicode support, but the system defaults to UTF-8 encoding. The annotators also support processing in various human languages, provided that suitable underlying models or resources are available for the different languages. The system comes packaged with models for English. Separate model packages provide support for Chinese and for case-insensitive processing of English. Support for other languages is less complete, but many of the Annotators also support models for French, German, and Arabic (see appendix B), and building models for further languages is possible using the underlying tools.
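The typesafe heterogeneous map that section 2 describes as the storage for an Annotation can be sketched as follows. This is a minimal illustration of the Bloch (2008) pattern, not CoreNLP's actual Annotation class: each key is a class literal, and get() returns a value of exactly that key's type, so no unchecked casts are needed at call sites.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a typesafe heterogeneous map, after Bloch (2008).
// The class name TypesafeMap is illustrative, not CoreNLP's implementation.
public class TypesafeMap {
    private final Map<Class<?>, Object> map = new HashMap<>();

    public <T> void set(Class<T> key, T value) {
        map.put(key, key.cast(value));  // cast guards against raw-type abuse
    }

    public <T> T get(Class<T> key) {
        return key.cast(map.get(key));  // safe: set() only stores matching types
    }

    public static void main(String[] args) {
        TypesafeMap ann = new TypesafeMap();
        ann.set(String.class, "Can you parse my sentence?");
        ann.set(Integer.class, 6);
        String text = ann.get(String.class);  // statically typed retrieval
        System.out.println(text + " / " + ann.get(Integer.class));
    }
}
```

Keying by class literal is what lets many independently written Annotators add their own analyses to one Annotation object without any shared schema.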
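Since the figure 4 listing is truncated in this excerpt, the following complementary sketch (not the figure itself) shows how analysis results can be read back out of an Annotation after the pipeline has run. It assumes the CoreNLP jar and English models are on the classpath; the edu.stanford.nlp classes and the CoreAnnotations key classes are CoreNLP's real API, while the class name ReadAnnotations and the helper method posTags are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class ReadAnnotations {

  // Run a tokenize/ssplit/pos pipeline (annotator names as in figure 1)
  // and return one part-of-speech tag per token.
  static List<String> posTags(String text) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);

    // Analyses are read back out of the typesafe heterogeneous map by key class.
    List<String> tags = new ArrayList<>();
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        tags.add(token.get(CoreAnnotations.PartOfSpeechAnnotation.class));
      }
    }
    return tags;
  }

  public static void main(String[] args) {
    System.out.println(posTags("Can you parse my sentence?"));
  }
}
```

The key classes (SentencesAnnotation, TokensAnnotation, PartOfSpeechAnnotation) play exactly the role of the class-literal keys in the heterogeneous-map design described in section 2.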
