
NLTK: The Natural Language Toolkit

Edward Loper and Steven Bird
Department of Computer and Information Science
University of Pennsylvania, Philadelphia, PA 19104-6389, USA

Abstract

NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials and problem sets, providing ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical natural language processing, and is interfaced to annotated corpora. Students augment and replace existing components, learn structured programming by example, and manipulate sophisticated models from the outset.

1 Introduction

Teachers of introductory courses on computational linguistics are often faced with the challenge of setting up a practical programming component for student assignments and projects. This is a difficult task because different computational linguistics domains require a variety of different data structures and functions, and because a diverse range of topics may need to be included in the syllabus. A widespread practice is to employ multiple programming languages, where each language provides native data structures and functions that are a good fit for the task at hand. For example, a course might use Prolog for parsing, Perl for corpus processing, and a finite-state toolkit for morphological analysis. By relying on the built-in features of various languages, the teacher avoids having to develop a lot of software infrastructure.

An unfortunate consequence is that a significant part of such courses must be devoted to teaching programming languages. Further, many interesting projects span a variety of domains, and would require that multiple languages be bridged. For example, a student project that involved syntactic parsing of corpus data from a morphologically rich language might involve all three of the languages mentioned above: Perl for string processing, a finite-state toolkit for morphological analysis, and Prolog for parsing. It is clear that these considerable overheads and shortcomings warrant a fresh approach.

Apart from the practical component, computational linguistics courses may also depend on software for in-class demonstrations. This context calls for highly interactive graphical user interfaces, making it possible to view program state (e.g. the chart of a chart parser), observe program execution step-by-step (e.g. execution of a finite-state machine), and even make minor modifications to programs in response to "what if" questions from the class. Because of these difficulties it is common to avoid live demonstrations, and to keep classes for theoretical presentations only. Apart from being dull, this approach leaves students to solve important practical problems on their own, or to deal with them less efficiently in office hours.

In this paper we introduce a new approach to the above challenges: a streamlined and flexible way of organizing the practical component of an introductory computational linguistics course. We describe NLTK, the Natural Language Toolkit, which we have developed in conjunction with a course we have taught at the University of Pennsylvania.

The Natural Language Toolkit is available under an open source license from http://nltk.sf.net/. NLTK runs on all platforms supported by Python, including Windows, OS X, Linux, and Unix.

2 Choice of Programming Language

The most basic step in setting up a practical component is choosing a suitable programming language. A number of considerations influenced our choice. First, the language must have a shallow learning curve, so that novice programmers get immediate rewards for their efforts. Second, the language must support rapid prototyping and a short develop/test cycle; an obligatory compilation step is a serious detraction. Third, the code should be self-documenting, with a transparent syntax and semantics. Fourth, it should be easy to write structured programs, ideally object-oriented but without the burden associated with languages like C++. Finally, the language must have an easy-to-use graphics library to support the development of graphical user interfaces.

In surveying the available languages, we believe that Python offers an especially good fit to the above requirements. Python is an object-oriented scripting language developed by Guido van Rossum and available on all platforms (www.python.org). Python offers a shallow learning curve; it was designed to be easily learnt by children (van Rossum, 1999). As an interpreted language, Python is suitable for rapid prototyping. Python code is exceptionally readable, and it has been praised as "executable pseudocode." Python is an object-oriented language, but not punitively so, and it is easy to encapsulate data and methods inside Python classes. Finally, Python has an interface to the Tk graphics toolkit (Lundh, 1999), and writing graphical interfaces is straightforward.
3 Design Criteria

Several criteria were considered in the design and implementation of the toolkit. These design criteria are listed in the order of their importance. It was also important to decide what goals the toolkit would not attempt to accomplish; we therefore include an explicit set of non-requirements, which the toolkit is not expected to satisfy.

3.1 Requirements

Ease of Use. The primary purpose of the toolkit is to allow students to concentrate on building natural language processing (NLP) systems. The more time students must spend learning to use the toolkit, the less useful it is.

Consistency. The toolkit should use consistent data structures and interfaces.

Extensibility. The toolkit should easily accommodate new components, whether those components replicate or extend the toolkit's existing functionality. The toolkit should be structured in such a way that it is obvious where new extensions would fit into the toolkit's infrastructure.

Documentation. The toolkit, its data structures, and its implementation all need to be carefully and thoroughly documented. All nomenclature must be carefully chosen and consistently used.

Simplicity. The toolkit should structure the complexities of building NLP systems, not hide them. Therefore, each class defined by the toolkit should be simple enough that a student could implement it by the time they finish an introductory course in computational linguistics.

Modularity. The interaction between different components of the toolkit should be kept to a minimum, using simple, well-defined interfaces. In particular, it should be possible to complete individual projects using small parts of the toolkit, without worrying about how they interact with the rest of the toolkit. This allows students to learn how to use the toolkit incrementally throughout a course. Modularity also makes it easier to change and extend the toolkit.

3.2 Non-Requirements

Comprehensiveness. The toolkit is not intended to provide a comprehensive set of tools. Indeed, there should be a wide variety of ways in which students can extend the toolkit.

Efficiency. The toolkit does not need to be highly optimized for runtime performance. However, it should be efficient enough that students can use their NLP systems to perform real tasks.

Cleverness. Clear designs and implementations are far preferable to ingenious yet indecipherable ones.

4 Modules

The toolkit is implemented as a collection of independent modules, each of which defines a specific data structure or task.

A set of core modules defines basic data types and processing systems that are used throughout the toolkit. The token module provides basic classes for processing individual elements of text, such as words or sentences. The tree module defines data structures for representing tree structures over text, such as syntax trees and morphological trees. The probability module implements classes that encode frequency distributions and probability distributions, including a variety of statistical smoothing techniques.
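To give a concrete feel for the kind of class the probability module provides, the following is a minimal, self-contained sketch of a frequency distribution with add-one (Laplace) smoothing. It is written in plain Python for illustration; the class and method names are chosen for this example and are not necessarily those used by the toolkit.

    # Conceptual sketch only: the toolkit's own probability classes may have
    # different names and signatures; this illustrates the idea, not the API.
    class FreqDist:
        """Count how often each outcome (e.g. a word) has been observed."""

        def __init__(self):
            self._counts = {}
            self._total = 0

        def inc(self, sample):
            self._counts[sample] = self._counts.get(sample, 0) + 1
            self._total += 1

        def count(self, sample):
            return self._counts.get(sample, 0)

        def total(self):
            return self._total

        def freq(self, sample):
            """Maximum-likelihood estimate of the sample's probability."""
            return self.count(sample) / self._total if self._total else 0.0


    class LaplaceProbDist:
        """Smoothed distribution: adds one to every count, so unseen samples
        still receive a small, non-zero probability."""

        def __init__(self, freqdist, num_outcomes):
            self._fd = freqdist
            self._bins = num_outcomes

        def prob(self, sample):
            return (self._fd.count(sample) + 1) / (self._fd.total() + self._bins)


    if __name__ == "__main__":
        fd = FreqDist()
        for word in "the cat sat on the mat".split():
            fd.inc(word)
        print(fd.freq("the"))                        # 2/6
        print(LaplaceProbDist(fd, 1000).prob("dog"))  # small but non-zero

The same pattern (counting outcomes, then deriving smoothed probability estimates from those counts) underlies the smoothing techniques mentioned above.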
The remaining modules define data structures and interfaces for performing specific NLP tasks. This list of modules will grow over time, as we add new tasks and algorithms to the toolkit.

Parsing Modules

The parser module defines a high-level interface for producing trees that represent the structures of texts. The chunkparser module defines a sub-interface for parsers that identify non-overlapping linguistic groups (such as base noun phrases) in unrestricted text.

Four modules provide implementations for these abstract interfaces. The srparser module implements a simple shift-reduce parser. The chartparser module defines a flexible parser that uses a chart to record hypotheses about syntactic constituents. The pcfgparser module provides a variety of different parsers for probabilistic grammars. And the rechunkparser module defines a transformational regular-expression based implementation of the chunk parser interface.

Tagging Modules

The tagger module defines a standard interface for augmenting each token of a text with supplementary information, such as its part of speech or its WordNet synset tag; and provides several different implementations for this interface.

Finite State Automata

The fsa module defines a data type for encoding finite state automata, and an interface for creating automata from regular expressions.

Type Checking

Debugging time is an important factor in the toolkit's ease of use. To reduce the amount of time students must spend debugging their code, we provide a type checking module, which can be used to ensure that functions are given valid arguments. The type checking module is used by all of the basic data types and processing classes.

Since type checking is done explicitly, it can slow the toolkit down. However, when efficiency is an issue, type checking can be easily turned off; and when type checking is disabled, there is no performance penalty.
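The following sketch illustrates what explicit, switchable argument checking can look like in Python. The decorator name and the module-level switch are invented for this example and are not the toolkit's actual type checking interface; the point is that when the switch is off, the original function is returned unwrapped, so no checking cost remains.

    # Illustrative sketch only; the names below (chktype, TYPE_CHECKING) are
    # hypothetical and are not the toolkit's actual type-checking interface.
    import functools

    TYPE_CHECKING = True   # flip to False to remove the overhead entirely

    def chktype(**expected):
        """Check keyword arguments against expected types before calling."""
        def decorator(func):
            if not TYPE_CHECKING:
                return func            # disabled: return the function unwrapped
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # For simplicity, only keyword arguments are checked here.
                for name, value in kwargs.items():
                    if name in expected and not isinstance(value, expected[name]):
                        raise TypeError(
                            f"{func.__name__}: {name} must be "
                            f"{expected[name].__name__}, got {type(value).__name__}")
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @chktype(text=str, n=int)
    def first_n_words(*, text, n):
        return text.split()[:n]

    print(first_n_words(text="colorless green ideas sleep furiously", n=3))
    # first_n_words(text=42, n=3) would raise a TypeError while checking is on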
Visualization

Visualization modules define graphical interfaces for viewing and manipulating data structures, and graphical tools for experimenting with NLP tasks. The draw.tree module provides a simple graphical interface for displaying tree structures. The draw.tree_edit module provides an interface for building and modifying tree structures. The draw.plot_graph module can be used to graph mathematical functions. The draw.fsa module provides a graphical tool for displaying and simulating finite state automata. The draw.chart module provides an interactive graphical tool for experimenting with chart parsers.

The visualization modules provide interfaces for interaction and experimentation; they do not directly implement NLP data structures or tasks. Simplicity of implementation is therefore less of an issue for the visualization modules than it is for the rest of the toolkit.

Text Classification

The classifier module defines a standard interface for classifying texts into categories. This interface is currently implemented by two modules. The classifier.naivebayes module defines a text classifier based on the Naive Bayes assumption. The classifier.maxent module defines the maximum entropy model for text classification, and implements two algorithms for training the model: Generalized Iterative Scaling and Improved Iterative Scaling.

The classifier.feature module provides a standard encoding for the information that is used to make decisions for a particular classification task. This standard encoding allows students to experiment with the differences between different text classification algorithms, using identical feature sets.

The classifier.featureselection module defines a standard interface for choosing which features are relevant for a particular classification task. Good feature selection can significantly improve classification performance.
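To illustrate how a shared feature encoding lets different classification algorithms be compared on identical features, here is a small self-contained sketch of a bag-of-words encoder feeding a Naive Bayes classifier. It is written in plain Python; the function and class names are invented for this illustration and do not reflect the toolkit's classifier interface.

    # Conceptual sketch; not the toolkit's classifier/feature API.
    from collections import defaultdict
    import math

    def bag_of_words(text):
        """A shared feature encoding: every classifier below sees the same features."""
        return {word: True for word in text.lower().split()}

    class NaiveBayesClassifier:
        def __init__(self):
            self.label_counts = defaultdict(int)
            self.feature_counts = defaultdict(lambda: defaultdict(int))

        def train(self, labeled_texts):
            for text, label in labeled_texts:
                self.label_counts[label] += 1
                for feature in bag_of_words(text):
                    self.feature_counts[label][feature] += 1

        def classify(self, text):
            best_label, best_score = None, float("-inf")
            total = sum(self.label_counts.values())
            for label, count in self.label_counts.items():
                # log P(label) + sum of log P(feature | label), with add-one smoothing
                score = math.log(count / total)
                for feature in bag_of_words(text):
                    score += math.log(
                        (self.feature_counts[label][feature] + 1) / (count + 2))
                if score > best_score:
                    best_label, best_score = label, score
            return best_label

    nb = NaiveBayesClassifier()
    nb.train([("the film was wonderful", "pos"),
              ("a dreadful boring film", "neg")])
    print(nb.classify("the film was wonderful and fun"))   # -> 'pos'

A second classifier (for example, a maximum entropy model) could consume the output of the same bag_of_words function, which is what makes controlled comparisons between algorithms possible.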

5 Documentation

The toolkit is accompanied by extensive documentation that explains the toolkit, and describes how to use and extend it. This documentation is divided into three primary categories:

Tutorials teach students how to use the toolkit, in the context of performing specific tasks. Each tutorial focuses on a single domain, such as tagging, probabilistic systems, or text classification. The tutorials include a high-level discussion that explains and motivates the domain, followed by a detailed walk-through that uses examples to show how NLTK can be used to perform specific tasks.

Reference Documentation provides precise definitions for every module, interface, class, method, function, and variable in the toolkit. It is automatically extracted from docstring comments in the Python source code, using Epydoc (Loper, 2002).

Technical Reports explain and justify the toolkit's design and implementation. They are used by the developers of the toolkit to guide and document the toolkit's construction. Students can also consult these reports if they would like further information about how the toolkit is designed, and why it is designed that way.

6 Uses of NLTK

6.1 Assignments

NLTK can be used to create student assignments of varying difficulty and scope. In the simplest assignments, students experiment with an existing module. The wide variety of existing modules provides many opportunities for creating these simple assignments. Once students become more familiar with the toolkit, they can be asked to make minor changes or extensions to an existing module. A more challenging task is to develop a new module. Here, NLTK provides some useful starting points: predefined interfaces and data structures, and existing modules that implement the same interface.

Example: Chunk Parsing

As an example of a moderately difficult assignment, we asked students to construct a chunk parser that correctly identifies base noun phrase chunks in a given text, by defining a cascade of transformational chunking rules. The NLTK rechunkparser module provides a variety of regular-expression based rule types, which the students can instantiate to construct complete rules. For example, ChunkRule('<NN.*>+') builds chunks from sequences of consecutive nouns; ChinkRule('<VB.*>') excises verbs from existing chunks; SplitRule('<NN>', '<DT>') splits any existing chunk that contains a singular noun followed by a determiner into two pieces; and MergeRule('<JJ>', '<JJ>') combines two adjacent chunks where the first chunk ends and the second chunk starts with adjectives.

The chunking tutorial motivates chunk parsing, describes each rule type, and provides all the necessary code for the assignment. The provided code is responsible for loading the chunked, part-of-speech tagged text using an existing tokenizer, creating an unchunked version of the text, applying the chunk rules to the unchunked text, and scoring the result. Students focus on the NLP task only – providing a rule set with the best coverage.

In the remainder of this section we reproduce some of the cascades created by the students. The first example illustrates a combination of several rule types (tag patterns that could not be recovered are elided as '...'):

    cascade = [
        ChunkRule('...'),
        ChunkRule('...'),
        ChunkRule('<.*>'),
        UnChunkRule('...'),
        UnChunkRule("<,|\.|``|''>"),
        MergeRule('...', '...'),
        SplitRule('...', '...')
    ]

The next example illustrates a brute-force statistical approach. The student calculated how often each part-of-speech tag was included in a noun phrase. They then constructed chunks from any sequence of tags that occurred in a noun phrase more than 50% of the time.

    cascade = [
        ChunkRule('<$|CD|DT|EX|PDT|PRP.*|WP.*|#|FW|JJ.*|NN.*|POS|RBS|WDT>*')
    ]

In the third example, the student constructed a single chunk containing the entire text, and then excised all elements that did not belong.

    cascade = [
        ChunkRule('<.*>+'),
        ChinkRule('...')
    ]
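To make the regular-expression chunking idea concrete, the following self-contained sketch applies a single base noun phrase tag pattern to a part-of-speech tagged sentence. It operates directly on a string of tags rather than using the rechunkparser rule classes described above, so it should be read as an illustration of the underlying technique, not of the toolkit's API.

    import re

    # Tag pattern for a simple base NP: optional determiner, any adjectives, then nouns.
    # Each tag is written as <TAG> so a regex over the tag string can find chunk spans.
    NP_PATTERN = re.compile(r'(<DT>)?(<JJ[^>]*>)*(<NN[^>]*>)+')

    def chunk(tagged):
        """tagged: list of (word, tag) pairs. Returns one word-list per NP chunk."""
        tag_string = ''.join(f'<{tag}>' for _, tag in tagged)
        # Map character offsets in the tag string back to token indices.
        boundaries = [0]
        for _, tag in tagged:
            boundaries.append(boundaries[-1] + len(tag) + 2)
        chunks = []
        for m in NP_PATTERN.finditer(tag_string):
            start = boundaries.index(m.start())
            end = boundaries.index(m.end())
            chunks.append([word for word, _ in tagged[start:end]])
        return chunks

    sentence = [('the', 'DT'), ('little', 'JJ'), ('cat', 'NN'),
                ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
    print(chunk(sentence))   # [['the', 'little', 'cat'], ['the', 'mat']]

A cascade such as the student examples above refines this basic idea by applying several chunking, chinking, merging, and splitting rules in sequence.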
6.2 Class demonstrations

NLTK provides graphical tools that can be used in class demonstrations to help explain basic NLP concepts and algorithms. These interactive tools can be used to display relevant data structures and to show the step-by-step execution of algorithms. Both data structures and control flow can be easily modified during the demonstration, in response to questions from the class.

Since these graphical tools are included with the toolkit, they can also be used by students. This allows students to experiment at home with the algorithms that they have seen presented in class.

Example: The Chart Parsing Tool

The chart parsing tool is an example of a graphical tool provided by NLTK. This tool can be used to explain the basic concepts behind chart parsing, and to show how the algorithm works. Chart parsing is a flexible parsing algorithm that uses a data structure called a chart to record hypotheses about syntactic constituents. Each hypothesis is represented by a single edge on the chart. A set of rules determines when new edges can be added to the chart. This set of rules controls the overall behavior of the parser (e.g., whether it parses top-down or bottom-up).

The chart parsing tool demonstrates the process of parsing a single sentence, with a given grammar and lexicon. Its display is divided into three sections: the bottom section displays the chart; the middle section displays the sentence; and the top section displays the partial syntax tree corresponding to the selected edge. Buttons along the bottom of the window are used to control the execution of the algorithm. The main display window for the chart parsing tool is shown in Figure 1.

[Figure 1: Chart Parsing Tool]

This tool can be used to explain several different aspects of chart parsing. First, it can be used to explain the basic chart data structure, and to show how edges can represent hypotheses about syntactic constituents. It can then be used to demonstrate and explain the individual rules that the chart parser uses to create new edges. Finally, it can be used to show how these individual rules combine to find a complete parse for a given sentence.

To reduce the overhead of setting up demonstrations during lecture, the user can define a list of preset charts. The tool can then be reset to any one of these charts at any time.

The chart parsing tool allows for flexible control of the parsing algorithm. At each step of the algorithm, the user can select which rule or strategy they wish to apply. This allows the user to experiment with mixing different strategies (e.g., top-down and bottom-up). The user can exercise fine-grained control over the algorithm by selecting which edge they wish to apply a rule to. This flexibility allows lecturers to use the tool to respond to a wide variety of questions; and allows students to experiment with different variations on the chart parsing algorithm.

6.3 Advanced Projects

NLTK provides students with a flexible framework for advanced projects. Typical projects involve the development of entirely new functionality for a previously unsupported NLP task, or the development of a complete system out of existing and new modules.

The toolkit's broad coverage allows students to explore a wide variety of topics. In our introductory computational linguistics course, topics for student projects included text generation, word sense disambiguation, collocation analysis, and morphological analysis.

NLTK eliminates the tedious infrastructure-building that is typically associated with advanced student projects by providing students with the basic data structures, tools, and interfaces that they need. This allows the students to concentrate on the problems that interest them.

The collaborative, open-source nature of the toolkit can provide students with a sense that their projects are meaningful contributions, and not just exercises. Several of the students in our course have expressed interest in incorporating their projects into the toolkit.

Finally, many of the modules included in the toolkit provide students with good examples of what projects should look like, with well thought-out interfaces, clean code structure, and thorough documentation.

Example: Probabilistic Parsing

The probabilistic parsing module was created as a class project for a statistical NLP course. The toolkit provided the basic data types and interfaces for parsing. The project extended these, adding a new probabilistic parsing interface, and using subclasses to create a probabilistic version of the context free grammar data structure. These new components were used in conjunction with several existing components, such as the chart data structure, to define two implementations of the probabilistic parsing interface. Finally, a tutorial was written that explained the basic motivations and concepts behind probabilistic parsing, and described the new interfaces, data structures, and parsers.
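As a rough sketch of the kind of extension involved, the fragment below shows a context-free production subclassed to carry a probability. The class names are invented for this illustration; they are not the data structures defined by the project or by the toolkit.

    # Illustrative only: invented names, not the project's or the toolkit's classes.
    class Production:
        """A context-free production, e.g. NP -> DT N."""
        def __init__(self, lhs, rhs):
            self.lhs, self.rhs = lhs, tuple(rhs)

        def __repr__(self):
            return f"{self.lhs} -> {' '.join(self.rhs)}"

    class PCFGProduction(Production):
        """The same production, extended with a probability P(rhs | lhs)."""
        def __init__(self, lhs, rhs, prob):
            super().__init__(lhs, rhs)
            self.prob = prob

        def __repr__(self):
            return f"{super().__repr__()} [{self.prob}]"

    grammar = [
        PCFGProduction('S', ['NP', 'VP'], 1.0),
        PCFGProduction('NP', ['DT', 'N'], 0.6),
        PCFGProduction('NP', ['N'], 0.4),
    ]
    # Any code written against Production (e.g. a chart parser) also accepts
    # PCFGProduction objects; the probabilities of the productions used in a
    # parse can then be multiplied to score the parse tree.
    for p in grammar:
        print(p)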
7 Evaluation

We used NLTK as a basis for the assignments and student projects in CIS-530, an introductory computational linguistics class taught at the University of Pennsylvania. CIS-530 is a graduate level class, although some advanced undergraduates were also enrolled. Most students had a background in either computer science or linguistics (and occasionally both). Students were required to complete five assignments, two exams, and a final project. All class materials are available from the course website http://www.cis.upenn.edu/~cis530/.

The experience of using NLTK was very positive, both for us and for the students. The students liked the fact that they could do interesting projects from the outset. They also liked being able to run everything on their computer at home. The students found the extensive documentation very helpful for learning to use the toolkit. They found the interfaces defined by NLTK intuitive, and appreciated the ease with which they could combine different components to create complete NLP systems.

We did encounter a few difficulties during the semester. One problem was finding large clean corpora that the students could use for their assignments. Several of the students needed assistance finding suitable corpora for their final projects. Another issue was the fact that we were actively developing NLTK during the semester; some modules were only completed one or two weeks before the students used them. As a result, students who worked at home needed to download new versions of the toolkit several times throughout the semester. Luckily, Python has extensive support for installation scripts, which made these upgrades simple. The students encountered a couple of bugs in the toolkit, but none were serious, and all were quickly corrected.

8 Other Approaches

The computational component of computational linguistics courses takes many forms. In this section we briefly review a selection of approaches, classified according to the (original) target audience.

Linguistics Students. Various books introduce programming or computing to linguists. These are elementary on the computational side, providing a gentle introduction to students having no prior experience in computer science. Examples of such books are Using Computers in Linguistics (Lawler and Dry, 1998) and Programming for Linguistics: Java Technology for Language Researchers (Hammond, 2002).

Grammar Developers. Infrastructure for grammar development has a long history in unification-based (or constraint-based) grammar frameworks, from DCG (Pereira and Warren, 1980) to HPSG (Pollard and Sag, 1994). Recent work includes (Copestake, 2000; Baldridge et al., 2002a). A concurrent development has been the finite state toolkits, such as the Xerox toolkit (Beesley and Karttunen, 2002). This work has found widespread pedagogical application.

Other Researchers and Developers. A variety of toolkits have been created for research or R&D purposes. Examples include the CMU-Cambridge Statistical Language Modeling Toolkit (Clarkson and Rosenfeld, 1997), the EMU Speech Database System (Harrington and Cassidy, 1999), the General Architecture for Text Engineering (Bontcheva et al., 2002), the Maxent Package for Maximum Entropy Models (Baldridge et al., 2002b), and the Annotation Graph Toolkit (Maeda et al., 2002). Although not originally motivated by pedagogical needs, all of these toolkits have pedagogical applications and many have already been used in teaching.

9 Conclusions and Future Work

NLTK provides a simple, extensible, uniform framework for assignments, projects, and class demonstrations. It is well documented, easy to learn, and simple to use. We hope that NLTK will allow computational linguistics classes to include more hands-on experience with using and building NLP components and systems.

NLTK is unique in its combination of three factors. First, it was deliberately designed as courseware and gives pedagogical goals primary status. Second, its target audience consists of both linguists and computer scientists, and it is accessible and challenging at many levels of prior computational skill. Finally, it is based on an object-oriented scripting language supporting rapid prototyping and literate programming.

We plan to continue extending the breadth of materials covered by the toolkit. We are currently working on NLTK modules for Hidden Markov Models, language modeling, and tree adjoining grammars. We also plan to increase the number of algorithms implemented by some existing modules, such as the text classification module.

Finding suitable corpora is a prerequisite for many student assignments and projects. We are therefore putting together a collection of corpora containing data appropriate for every module defined by the toolkit.

NLTK is an open source project, and we welcome any contributions. Readers who are interested in contributing to NLTK, or who have suggestions for improvements, are encouraged to contact the authors.

10 Acknowledgments

We are indebted to our students for feedback on the toolkit, and to anonymous reviewers, Jee Bang, and the workshop organizers for comments on an earlier version of this paper. We are grateful to Mitch Marcus and the Department of Computer and Information Science at the University of Pennsylvania for sponsoring the work reported here.

References

Jason Baldridge, John Dowding, and Susana Early. 2002a. Leo: an architecture for sharing resources for unification-based grammars. In Proceedings of the Third Language Resources and Evaluation Conference. Paris: European Language Resources Association. http://www.iccs.informatics.ed.ac.uk/~jmb/leo-lrec.ps.gz

Jason Baldridge, Thomas Morton, and Gann Bierner. 2002b. The MaxEnt project. http://maxent.sourceforge.net/

Kenneth R. Beesley and Lauri Karttunen. 2002. Finite-State Morphology: Xerox Tools and Techniques. Studies in Natural Language Processing. Cambridge University Press.

Kalina Bontcheva, Hamish Cunningham, Valentin Tablan, Diana Maynard, and Oana Hamza. 2002. Using GATE as an environment for teaching NLP. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL. Somerset, NJ: Association for Computational Linguistics.

Philip R. Clarkson and Ronald Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge Toolkit. In Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH '97). http://svr-www.eng.cam.ac.uk/~prc14/eurospeech97.ps

Ann Copestake. 2000. The (new) LKB system. http://www-csli.stanford.edu/~aac/doc5-2.pdf

Michael Hammond. 2002. Programming for Linguistics: Java Technology for Language Researchers. Oxford: Blackwell. In press.

Jonathan Harrington and Steve Cassidy. 1999. Techniques in Speech Acoustics. Kluwer.

John M. Lawler and Helen Aristar Dry, editors. 1998. Using Computers in Linguistics. London: Routledge.

Edward Loper. 2002. Epydoc. http://epydoc.sourceforge.net/

Fredrik Lundh. 1999. An introduction to tkinter. http://www.pythonware.com/library/tkinter/introduction/index.htm

Kazuaki Maeda, Steven Bird, Xiaoyi Ma, and Haejoong Lee. 2002. Creating annotation tools with the annotation graph toolkit. In Proceedings of the Third International Conference on Language Resources and Evaluation. http://arXiv.org/abs/cs/0204005

Fernando C. N. Pereira and David H. D. Warren. 1980. Definite clause grammars for language analysis – a survey of the formalism and a comparison with augmented transition grammars. Artificial Intelligence, 13:231–278.

Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. Chicago University Press.

Guido van Rossum. 1999. Computer programming for everybody. Technical report, Corporation for National Research Initiatives. http://www.python.org/doc/essays/cp4e.html