NLTK: The Natural Language Toolkit
Edward Loper and Steven Bird
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104-6389, USA
Abstract

NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials and problem sets, providing ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical natural language processing, and is interfaced to annotated corpora. Students augment and replace existing components, learn structured programming by example, and manipulate sophisticated models from the outset.

1 Introduction

Teachers of introductory courses on computational linguistics are often faced with the challenge of setting up a practical programming component for student assignments and projects. This is a difficult task because different computational linguistics domains require a variety of different data structures and functions, and because a diverse range of topics may need to be included in the syllabus.

A widespread practice is to employ multiple programming languages, where each language provides native data structures and functions that are a good fit for the task at hand. For example, a course might use Prolog for parsing, Perl for corpus processing, and a finite-state toolkit for morphological analysis. By relying on the built-in features of various languages, the teacher avoids having to develop a lot of software infrastructure.

An unfortunate consequence is that a significant part of such courses must be devoted to teaching programming languages. Further, many interesting projects span a variety of domains, and would require that multiple languages be bridged. For example, a student project that involved syntactic parsing of corpus data from a morphologically rich language might involve all three of the languages mentioned above: Perl for string processing; a finite-state toolkit for morphological analysis; and Prolog for parsing. It is clear that these considerable overheads and shortcomings warrant a fresh approach.

Apart from the practical component, computational linguistics courses may also depend on software for in-class demonstrations. This context calls for highly interactive graphical user interfaces, making it possible to view program state (e.g. the chart of a chart parser), observe program execution step-by-step (e.g. execution of a finite-state machine), and even make minor modifications to programs in response to "what if" questions from the class. Because of these difficulties it is common to avoid live demonstrations, and to keep classes for theoretical presentations only. Apart from being dull, this approach leaves students to solve important practical problems on their own, or to deal with them less efficiently in office hours.

In this paper we introduce a new approach to the above challenges, a streamlined and flexible way of organizing the practical component of an introductory computational linguistics course. We describe NLTK, the Natural Language Toolkit, which we have developed in conjunction with a course we have taught at the University of Pennsylvania.

The Natural Language Toolkit is available under an open source license from http://nltk.sf.net/. NLTK runs on all platforms supported by Python, including Windows, OS X, Linux, and Unix.

2 Choice of Programming Language

The most basic step in setting up a practical component is choosing a suitable programming language. A number of considerations influenced our choice. First, the language must have a shallow learning curve, so that novice programmers get immediate rewards for their efforts. Second, the language must support rapid prototyping and a short develop/test cycle; an obligatory compilation step is a serious detraction. Third, the code should be self-documenting, with a transparent syntax and semantics. Fourth, it should be easy to write structured programs, ideally object-oriented but without the burden associated with languages like C++. Finally, the language must have an easy-to-use graphics library to support the development of graphical user interfaces.

In surveying the available languages, we believe that Python offers an especially good fit to the above requirements. Python is an object-oriented scripting language developed by Guido van Rossum and available on all platforms (www.python.org). Python offers a shallow learning curve; it was designed to be easily learnt by children (van Rossum, 1999). As an interpreted language, Python is suitable for rapid prototyping. Python code is exceptionally readable, and it has been praised as "executable pseudocode." Python is an object-oriented language, but not punitively so, and it is easy to encapsulate data and methods inside Python classes. Finally, Python has an interface to the Tk graphics toolkit (Lundh, 1999), and writing graphical interfaces is straightforward.

3 Design Criteria

Several criteria were considered in the design and implementation of the toolkit. These design criteria are listed in the order of their importance. It was also important to decide what goals the toolkit would not attempt to accomplish; we therefore include an explicit set of non-requirements, which the toolkit is not expected to satisfy.

3.1 Requirements

Ease of Use. The primary purpose of the toolkit is to allow students to concentrate on building natural language processing (NLP) systems. The more time students must spend learning to use the toolkit, the less useful it is.

Consistency. The toolkit should use consistent data structures and interfaces.

Extensibility. The toolkit should easily accommodate new components, whether those components replicate or extend the toolkit's existing functionality. The toolkit should be structured in such a way that it is obvious where new extensions would fit into the toolkit's infrastructure.

Documentation. The toolkit, its data structures, and its implementation all need to be carefully and thoroughly documented. All nomenclature must be carefully chosen and consistently used.

Simplicity. The toolkit should structure the complexities of building NLP systems, not hide them. Therefore, each class defined by the toolkit should be simple enough that a student could implement it by the time they finish an introductory course in computational linguistics.

Modularity. The interaction between different components of the toolkit should be kept to a minimum, using simple, well-defined interfaces. In particular, it should be possible to complete individual projects using small parts of the toolkit, without worrying about how they interact with the rest of the toolkit. This allows students to learn how to use the toolkit incrementally throughout a course. Modularity also makes it easier to change and extend the toolkit.

3.2 Non-Requirements

Comprehensiveness. The toolkit is not intended to provide a comprehensive set of tools. Indeed, there should be a wide variety of ways in which students can extend the toolkit.

Efficiency. The toolkit does not need to be highly optimized for runtime performance. However, it should be efficient enough that students can use their NLP systems to perform real tasks.

Cleverness. Clear designs and implementations are far preferable to ingenious yet indecipherable ones.

4 Modules

The toolkit is implemented as a collection of independent modules, each of which defines a specific data structure or task.

A set of core modules defines basic data types and processing systems that are used throughout the toolkit. The token module provides basic classes for processing individual elements of text, such as words or sentences. The tree module defines data structures for representing tree structures over text, such as syntax trees and morphological trees. The probability module implements classes that encode frequency distributions and probability distributions, including a variety of statistical smoothing techniques.

The remaining modules define data structures and interfaces for performing specific NLP tasks. This list of modules will grow over time, as we add new tasks and algorithms to the toolkit.

Parsing Modules

The parser module defines a high-level interface for producing trees that represent the structures of texts. The chunkparser module defines a sub-interface for parsers that identify non-overlapping linguistic groups (such as base noun phrases) in unrestricted text.

Four modules provide implementations for these abstract interfaces. The srparser module implements a simple shift-reduce parser. The chartparser module defines a flexible parser that uses a chart to record hypotheses about syntactic constituents. The pcfgparser module provides a variety of different parsers for probabilistic grammars. And the rechunkparser module defines a transformational regular-expression based implementation of the chunk parser interface.

Tagging Modules

The tagger module defines a standard interface for augmenting each token of a text with supplementary information, such as its part of speech or its WordNet synset tag; and provides several different implementations for this interface.

Finite State Automata

The fsa module defines a data type for encoding finite state automata; and an interface for creating automata from regular expressions.

Type Checking

Debugging time is an important factor in the toolkit's ease of use. To reduce the amount of time students must spend debugging their code, we provide a type checking module, which can be used to ensure that functions are given valid arguments. The type checking module is used by all of the basic data types and processing classes.

Since type checking is done explicitly, it can slow the toolkit down. However, when efficiency is an issue, type checking can be easily turned off; and with type checking disabled, there is no performance penalty.

Visualization

Visualization modules define graphical interfaces for viewing and manipulating data structures, and graphical tools for experimenting with NLP tasks. The draw.tree module provides a simple graphical interface for displaying tree structures. The draw.tree_edit module provides an interface for building and modifying tree structures. The draw.plot_graph module can be used to graph mathematical functions. The draw.fsa module provides a graphical tool for displaying and simulating finite state automata. The draw.chart module provides an interactive graphical tool for experimenting with chart parsers.

The visualization modules provide interfaces for interaction and experimentation; they do not directly implement NLP data structures or tasks. Simplicity of implementation is therefore less of an issue for the visualization modules than it is for the rest of the toolkit.

Text Classification

The classifier module defines a standard interface for classifying texts into categories. This interface is currently implemented by two modules. The classifier.naivebayes module defines a text classifier based on the Naive Bayes assumption. The classifier.maxent module defines the maximum entropy model for text classification, and implements two algorithms for training the model: Generalized Iterative Scaling and Improved Iterative Scaling.

The classifier.feature module provides a standard encoding for the information that is used to make decisions for a particular classification task. This standard encoding allows students to experiment with the differences between different text classification algorithms, using identical feature sets.

The classifier.featureselection module defines a standard interface for choosing which features are relevant for a particular classification task. Good feature selection can significantly improve classification performance.

5 Documentation

The toolkit is accompanied by extensive documentation that explains the toolkit, and describes how to use and extend it. This documentation is divided into three primary categories:

Tutorials teach students how to use the toolkit, in the context of performing specific tasks. Each tutorial focuses on a single domain, such as tagging, probabilistic systems, or text classification. The tutorials include a high-level discussion that explains and motivates the domain, followed by a detailed walk-through that uses examples to show how NLTK can be used to perform specific tasks.

Reference Documentation provides precise definitions for every module, interface, class, method, function, and variable in the toolkit. It is automatically extracted from docstring comments in the Python source code, using Epydoc (Loper, 2002).

Technical Reports explain and justify the toolkit's design and implementation. They are used by the developers of the toolkit to guide and document the toolkit's construction. Students can also consult these reports if they would like further information about how the toolkit is designed, and why it is designed that way.

6 Uses of NLTK

6.1 Assignments

NLTK can be used to create student assignments of varying difficulty and scope. In the simplest assignments, students experiment with an existing module. The wide variety of existing modules provides many opportunities for creating these simple assignments. Once students become more familiar with the toolkit, they can be asked to make minor changes or extensions to an existing module. A more challenging task is to develop a new module. Here, NLTK provides some useful starting points: predefined interfaces and data structures, and existing modules that implement the same interface.

Example: Chunk Parsing

As an example of a moderately difficult assignment, we asked students to construct a chunk parser that correctly identifies base noun phrase chunks in a given text, by defining a cascade of transformational chunking rules. The NLTK rechunkparser module provides a variety of regular-expression based rule types, which the students can instantiate to construct complete rules. For example, ChunkRule('
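The regular-expression chunking technique behind rechunkparser can be illustrated with a short from-scratch sketch: encode the tag sequence as a string and match a single chunk rule against it. This is not the NLTK API; the chunk_base_nps function and its "optional determiner, any adjectives, one or more nouns" rule are hypothetical examples of the technique, not toolkit code.

```python
import re

def chunk_base_nps(tagged):
    """Find base NP chunks in a tagged sentence using one regex rule.

    `tagged` is a list of (word, tag) pairs. The tag sequence is encoded
    as a string like "<DT><JJ><NN>...", and a single chunk rule (an
    optional determiner, any adjectives, then one or more nouns) is
    matched against it. Returns a list of chunks, each a list of
    (word, tag) pairs.
    """
    tagstr = "".join("<%s>" % tag for _, tag in tagged)
    rule = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NNS?>)+")  # DT? JJ* NN+
    # The k-th token's tag begins at the k-th '<' in tagstr, so we can
    # map a character-level match back to token indices.
    starts = [i for i, c in enumerate(tagstr) if c == "<"]
    chunks = []
    for m in rule.finditer(tagstr):
        first = starts.index(m.start())
        width = m.group().count("<")   # number of tags the rule matched
        chunks.append(tagged[first:first + width])
    return chunks

sent = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
        ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
for chunk in chunk_base_nps(sent):
    print([w for w, _ in chunk])
# prints ['the', 'little', 'cat'] then ['the', 'mat']
```

A cascade, as in the assignment, would apply several such rules in sequence, each one refining the chunks proposed by the previous rule.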
Acknowledgments

We are indebted to our students for feedback on the toolkit, and to anonymous reviewers, Jee Bang, and the workshop organizers for comments on an earlier version of this paper. We are grateful to Mitch Marcus and the Department of Computer and Information Science at the University of Pennsylvania for sponsoring the work reported here.

References

Jason Baldridge, John Dowding, and Susana Early. 2002a. Leo: an architecture for sharing resources for unification-based grammars. In Proceedings of the Third Language Resources and Evaluation Conference. Paris: European Language Resources Association. http://www.iccs.informatics.ed.ac.uk/~jmb/leo-lrec.ps.gz.

Jason Baldridge, Thomas Morton, and Gann Bierner. 2002b. The MaxEnt project. http://maxent.sourceforge.net/.

Kenneth R. Beesley and Lauri Karttunen. 2002. Finite-State Morphology: Xerox Tools and Techniques. Studies in Natural Language Processing. Cambridge University Press.

John M. Lawler and Helen Aristar Dry, editors. 1998. Using Computers in Linguistics. London: Routledge.

Edward Loper. 2002. Epydoc. http://epydoc.sourceforge.net/.

Fredrik Lundh. 1999. An introduction to tkinter. http://www.pythonware.com/library/tkinter/introduction/index.htm.

Kazuaki Maeda, Steven Bird, Xiaoyi Ma, and Haejoong Lee. 2002. Creating annotation tools with the annotation graph toolkit. In Proceedings of the Third International Conference on Language Resources and Evaluation. http://arXiv.org/abs/cs/0204005.

Fernando C. N. Pereira and David H. D. Warren. 1980. Definite clause grammars for language analysis – a survey of the formalism and a comparison with augmented transition grammars. Artificial Intelligence, 13:231–78.

Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. Chicago University Press.

Guido van Rossum. 1999. Computer programming for everybody. Technical report, Corporation for National Research Initiatives. http://www.python.org/doc/essays/cp4e.html.