
NLTK: The Natural Language Toolkit

Edward Loper and Steven Bird
Department of Computer and Information Science
University of Pennsylvania, Philadelphia, PA 19104-6389, USA

Abstract

NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials and problem sets, providing ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical natural language processing, and is interfaced to annotated corpora. Students augment and replace existing components, learn structured programming by example, and manipulate sophisticated models from the outset.

1 Introduction

Teachers of introductory courses on computational linguistics are often faced with the challenge of setting up a practical programming component for student assignments and projects. This is a difficult task because different computational linguistics domains require a variety of different data structures and functions, and because a diverse range of topics may need to be included in the syllabus. A widespread practice is to employ multiple programming languages, where each language provides native data structures and functions that are a good fit for the task at hand. For example, a course might use Prolog for parsing, Perl for corpus processing, and a finite-state toolkit for morphological analysis. By relying on the built-in features of various languages, the teacher avoids having to develop a lot of software infrastructure.

An unfortunate consequence is that a significant part of such courses must be devoted to teaching programming languages. Further, many interesting projects span a variety of domains, and would require that multiple languages be bridged. For example, a student project that involved syntactic parsing of corpus data from a morphologically rich language might involve all three of the languages mentioned above: Perl for string processing, a finite-state toolkit for morphological analysis, and Prolog for parsing. It is clear that these considerable overheads and shortcomings warrant a fresh approach.

Apart from the practical component, computational linguistics courses may also depend on software for in-class demonstrations. This context calls for highly interactive graphical user interfaces, making it possible to view program state (e.g. the chart of a chart parser), observe program execution step-by-step (e.g. execution of a finite-state machine), and even make minor modifications to programs in response to "what if" questions from the class. Because of these difficulties it is common to avoid live demonstrations, and to keep classes for theoretical presentations only. Apart from being dull, this approach leaves students to solve important practical problems on their own, or to deal with them less efficiently in office hours.

In this paper we introduce a new approach to the above challenges: a streamlined and flexible way of organizing the practical component of an introductory computational linguistics course. We describe NLTK, the Natural Language Toolkit, which we have developed in conjunction with a course we have taught at the University of Pennsylvania.

The Natural Language Toolkit is available under an open source license from http://nltk.sf.net/. NLTK runs on all platforms supported by Python, including Windows, OS X, Linux, and Unix.

2 Choice of Programming Language

The most basic step in setting up a practical component is choosing a suitable programming language. A number of considerations influenced our choice. First, the language must have a shallow learning curve, so that novice programmers get immediate rewards for their efforts. Second, the language must support rapid prototyping and a short develop/test cycle; an obligatory compilation step is a serious detraction. Third, the code should be self-documenting, with a transparent syntax and semantics. Fourth, it should be easy to write structured programs, ideally object-oriented but without the burden associated with languages like C++. Finally, the language must have an easy-to-use graphics library to support the development of graphical user interfaces.

In surveying the available languages, we believe that Python offers an especially good fit to the above requirements. Python is an object-oriented scripting language developed by Guido van Rossum and available on all platforms (www.python.org). Python offers a shallow learning curve; it was designed to be easily learnt by children (van Rossum, 1999). As an interpreted language, Python is suitable for rapid prototyping. Python code is exceptionally readable, and it has been praised as "executable pseudocode." Python is an object-oriented language, but not punitively so, and it is easy to encapsulate data and methods inside Python classes. Finally, Python has an interface to the Tk graphics toolkit (Lundh, 1999), and writing graphical interfaces is straightforward.
3 Design Criteria

Several criteria were considered in the design and implementation of the toolkit. These design criteria are listed in the order of their importance. It was also important to decide what goals the toolkit would not attempt to accomplish; we therefore include an explicit set of non-requirements, which the toolkit is not expected to satisfy.

3.1 Requirements

Ease of Use. The primary purpose of the toolkit is to allow students to concentrate on building natural language processing (NLP) systems. The more time students must spend learning to use the toolkit, the less useful it is.

Consistency. The toolkit should use consistent data structures and interfaces.

Extensibility. The toolkit should easily accommodate new components, whether those components replicate or extend the toolkit's existing functionality. The toolkit should be structured in such a way that it is obvious where new extensions would fit into the toolkit's infrastructure.

Documentation. The toolkit, its data structures, and its implementation all need to be carefully and thoroughly documented. All nomenclature must be carefully chosen and consistently used.

Simplicity. The toolkit should structure the complexities of building NLP systems, not hide them. Therefore, each class defined by the toolkit should be simple enough that a student could implement it by the time they finish an introductory course in computational linguistics.

Modularity. The interaction between different components of the toolkit should be kept to a minimum, using simple, well-defined interfaces. In particular, it should be possible to complete individual projects using small parts of the toolkit, without worrying about how they interact with the rest of the toolkit. This allows students to learn how to use the toolkit incrementally throughout a course. Modularity also makes it easier to change and extend the toolkit.

3.2 Non-Requirements

Comprehensiveness. The toolkit is not intended to provide a comprehensive set of tools. Indeed, there should be a wide variety of ways in which students can extend the toolkit.

Efficiency. The toolkit does not need to be highly optimized for runtime performance. However, it should be efficient enough that students can use their NLP systems to perform real tasks.

Cleverness. Clear designs and implementations are far preferable to ingenious yet indecipherable ones.

4 Modules

The toolkit is implemented as a collection of independent modules, each of which defines a specific data structure or task.

A set of core modules defines basic data types and processing systems that are used throughout the toolkit. The token module provides basic classes for processing individual elements of text, such as words or sentences. The tree module defines data structures for representing tree structures over text, such as syntax trees and morphological trees. The probability module implements classes that encode frequency distributions and probability distributions, including a variety of statistical smoothing techniques.
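To give a concrete feel for the kind of class the probability module provides, the following is a minimal, self-contained sketch of a frequency distribution with add-one (Laplace) smoothing. It is written in plain Python for illustration; the class and method names are chosen for this example and are not necessarily those used by the toolkit.

    # Conceptual sketch only: the toolkit's own probability classes may have
    # different names and signatures; this illustrates the idea, not the API.
    class FreqDist:
        """Count how often each outcome (e.g. a word) has been observed."""

        def __init__(self):
            self._counts = {}
            self._total = 0

        def inc(self, sample):
            self._counts[sample] = self._counts.get(sample, 0) + 1
            self._total += 1

        def count(self, sample):
            return self._counts.get(sample, 0)

        def total(self):
            return self._total

        def freq(self, sample):
            """Maximum-likelihood estimate of the sample's probability."""
            return self.count(sample) / self._total if self._total else 0.0


    class LaplaceProbDist:
        """Smoothed distribution: adds one to every count, so unseen samples
        still receive a small, non-zero probability."""

        def __init__(self, freqdist, num_outcomes):
            self._fd = freqdist
            self._bins = num_outcomes

        def prob(self, sample):
            return (self._fd.count(sample) + 1) / (self._fd.total() + self._bins)


    if __name__ == "__main__":
        fd = FreqDist()
        for word in "the cat sat on the mat".split():
            fd.inc(word)
        print(fd.freq("the"))                        # 2/6
        print(LaplaceProbDist(fd, 1000).prob("dog"))  # small but non-zero

The same pattern (counting outcomes, then deriving smoothed probability estimates from those counts) underlies the smoothing techniques mentioned above.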
The remaining modules define data structures and interfaces for performing specific NLP tasks. This list of modules will grow over time, as we add new tasks and algorithms to the toolkit.

Parsing Modules

The parser module defines a high-level interface for producing trees that represent the structures of texts. The chunkparser module defines a sub-interface for parsers that identify non-overlapping linguistic groups (such as base noun phrases) in unrestricted text.

Four modules provide implementations for these abstract interfaces. The srparser module implements a simple shift-reduce parser. The chartparser module defines a flexible parser that uses a chart to record hypotheses about syntactic constituents. The pcfgparser module provides a variety of different parsers for probabilistic grammars. And the rechunkparser module defines a transformational regular-expression based implementation of the chunk parser interface.

Tagging Modules

The tagger module defines a standard interface for augmenting each token of a text with supplementary information, such as its part of speech or its WordNet synset tag; and provides several different implementations for this interface.

Finite State Automata

The fsa module defines a data type for encoding finite state automata, and an interface for creating automata from regular expressions.

Type Checking

Debugging time is an important factor in the toolkit's ease of use. To reduce the amount of time students must spend debugging their code, we provide a type checking module, which can be used to ensure that functions are given valid arguments. The type checking module is used by all of the basic data types and processing classes.

Since type checking is done explicitly, it can slow the toolkit down. However, when efficiency is an issue, type checking can be easily turned off; and when type checking is disabled, there is no performance penalty.
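The following sketch illustrates what explicit, switchable argument checking can look like in Python. The decorator name and the module-level switch are invented for this example and are not the toolkit's actual type checking interface; the point is that when the switch is off, the original function is returned unwrapped, so no checking cost remains.

    # Illustrative sketch only; the names below (chktype, TYPE_CHECKING) are
    # hypothetical and are not the toolkit's actual type-checking interface.
    import functools

    TYPE_CHECKING = True   # flip to False to remove the overhead entirely

    def chktype(**expected):
        """Check keyword arguments against expected types before calling."""
        def decorator(func):
            if not TYPE_CHECKING:
                return func            # disabled: return the function unwrapped
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # For simplicity, only keyword arguments are checked here.
                for name, value in kwargs.items():
                    if name in expected and not isinstance(value, expected[name]):
                        raise TypeError(
                            f"{func.__name__}: {name} must be "
                            f"{expected[name].__name__}, got {type(value).__name__}")
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @chktype(text=str, n=int)
    def first_n_words(*, text, n):
        return text.split()[:n]

    print(first_n_words(text="colorless green ideas sleep furiously", n=3))
    # first_n_words(text=42, n=3) would raise a TypeError while checking is on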
Visualization

Visualization modules define graphical interfaces for viewing and manipulating data structures, and graphical tools for experimenting with NLP tasks. The draw.tree module provides a simple graphical interface for displaying tree structures. The draw.tree_edit module provides an interface for building and modifying tree structures. The draw.plot_graph module can be used to graph mathematical functions. The draw.fsa module provides a graphical tool for displaying and simulating finite state automata. The draw.chart module provides an interactive graphical tool for experimenting with chart parsers.

The visualization modules provide interfaces for interaction and experimentation; they do not directly implement NLP data structures or tasks. Simplicity of implementation is therefore less of an issue for the visualization modules than it is for the rest of the toolkit.

Text Classification

The classifier module defines a standard interface for classifying texts into categories. This interface is currently implemented by two modules. The classifier.naivebayes module defines a text classifier based on the Naive Bayes assumption. The classifier.maxent module defines the maximum entropy model for text classification, and implements two algorithms for training the model: Generalized Iterative Scaling and Improved Iterative Scaling.

The classifier.feature module provides a standard encoding for the information that is used to make decisions for a particular classification task. This standard encoding allows students to experiment with the differences between different text classification algorithms, using identical feature sets.

The classifier.featureselection module defines a standard interface for choosing which features are relevant for a particular classification task. Good feature selection can significantly improve classification performance.
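To illustrate how a shared feature encoding lets different classification algorithms be compared on identical features, here is a small self-contained sketch of a bag-of-words encoder feeding a Naive Bayes classifier. It is written in plain Python; the function and class names are invented for this illustration and do not reflect the toolkit's classifier interface.

    # Conceptual sketch; not the toolkit's classifier/feature API.
    from collections import defaultdict
    import math

    def bag_of_words(text):
        """A shared feature encoding: every classifier below sees the same features."""
        return {word: True for word in text.lower().split()}

    class NaiveBayesClassifier:
        def __init__(self):
            self.label_counts = defaultdict(int)
            self.feature_counts = defaultdict(lambda: defaultdict(int))

        def train(self, labeled_texts):
            for text, label in labeled_texts:
                self.label_counts[label] += 1
                for feature in bag_of_words(text):
                    self.feature_counts[label][feature] += 1

        def classify(self, text):
            best_label, best_score = None, float("-inf")
            total = sum(self.label_counts.values())
            for label, count in self.label_counts.items():
                # log P(label) + sum of log P(feature | label), with add-one smoothing
                score = math.log(count / total)
                for feature in bag_of_words(text):
                    score += math.log(
                        (self.feature_counts[label][feature] + 1) / (count + 2))
                if score > best_score:
                    best_label, best_score = label, score
            return best_label

    nb = NaiveBayesClassifier()
    nb.train([("the film was wonderful", "pos"),
              ("a dreadful boring film", "neg")])
    print(nb.classify("the film was wonderful and fun"))   # -> 'pos'

A second classifier (for example, a maximum entropy model) could consume the output of the same bag_of_words function, which is what makes controlled comparisons between algorithms possible.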

5 Documentation

The toolkit is accompanied by extensive documentation that explains the toolkit, and describes how to use and extend it. This documentation is divided into three primary categories:

Tutorials teach students how to use the toolkit, in the context of performing specific tasks. Each tutorial focuses on a single domain, such as tagging, probabilistic systems, or text classification. The tutorials include a high-level discussion that explains and motivates the domain, followed by a detailed walk-through that uses examples to show how NLTK can be used to perform specific tasks.

Reference Documentation provides precise definitions for every module, interface, class, method, function, and variable in the toolkit. It is automatically extracted from docstring comments in the Python source code, using Epydoc (Loper, 2002).

Technical Reports explain and justify the toolkit's design and implementation. They are used by the developers of the toolkit to guide and document the toolkit's construction. Students can also consult these reports if they would like further information about how the toolkit is designed, and why it is designed that way.

6 Uses of NLTK

6.1 Assignments

NLTK can be used to create student assignments of varying difficulty and scope. In the simplest assignments, students experiment with an existing module. The wide variety of existing modules provides many opportunities for creating these simple assignments. Once students become more familiar with the toolkit, they can be asked to make minor changes or extensions to an existing module. A more challenging task is to develop a new module. Here, NLTK provides some useful starting points: predefined interfaces and data structures, and existing modules that implement the same interface.

Example: Chunk Parsing

As an example of a moderately difficult assignment, we asked students to construct a chunk parser that correctly identifies base noun phrase chunks in a given text, by defining a cascade of transformational chunking rules. The NLTK rechunkparser module provides a variety of regular-expression based rule types, which the students can instantiate to construct complete rules. For example, ChunkRule('<NN.*>+') builds chunks from sequences of consecutive nouns; ChinkRule('<VB.*>') excises verbs from existing chunks; SplitRule('<NN>', '<DT>') splits any existing chunk that contains a singular noun followed by a determiner into two pieces; and MergeRule('<JJ>', '<JJ>') combines two adjacent chunks where the first chunk ends and the second chunk starts with adjectives.

The chunking tutorial motivates chunk parsing, describes each rule type, and provides all the necessary code for the assignment. The provided code is responsible for loading the chunked, part-of-speech tagged text using an existing tokenizer, creating an unchunked version of the text, applying the chunk rules to the unchunked text, and scoring the result. Students focus on the NLP task only – providing a rule set with the best coverage.

In the remainder of this section we reproduce some of the cascades created by the students. The first example illustrates a combination of several rule types (tag patterns that could not be recovered are elided as '...'):

    cascade = [
        ChunkRule('...'),
        ChunkRule('...'),
        ChunkRule('<.*>'),
        UnChunkRule('...'),
        UnChunkRule("<,|\.|``|''>"),
        MergeRule('...', '...'),
        SplitRule('...', '...')
    ]

The next example illustrates a brute-force statistical approach. The student calculated how often each part-of-speech tag was included in a noun phrase. They then constructed chunks from any sequence of tags that occurred in a noun phrase more than 50% of the time.

    cascade = [
        ChunkRule('<$|CD|DT|EX|PDT|PRP.*|WP.*|#|FW|JJ.*|NN.*|POS|RBS|WDT>*')
    ]

In the third example, the student constructed a single chunk containing the entire text, and then excised all elements that did not belong.

    cascade = [
        ChunkRule('<.*>+'),
        ChinkRule('...')
    ]
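To make the regular-expression chunking idea concrete, the following self-contained sketch applies a single base noun phrase tag pattern to a part-of-speech tagged sentence. It operates directly on a string of tags rather than using the rechunkparser rule classes described above, so it should be read as an illustration of the underlying technique, not of the toolkit's API.

    import re

    # Tag pattern for a simple base NP: optional determiner, any adjectives, then nouns.
    # Each tag is written as <TAG> so a regex over the tag string can find chunk spans.
    NP_PATTERN = re.compile(r'(<DT>)?(<JJ[^>]*>)*(<NN[^>]*>)+')

    def chunk(tagged):
        """tagged: list of (word, tag) pairs. Returns one word-list per NP chunk."""
        tag_string = ''.join(f'<{tag}>' for _, tag in tagged)
        # Map character offsets in the tag string back to token indices.
        boundaries = [0]
        for _, tag in tagged:
            boundaries.append(boundaries[-1] + len(tag) + 2)
        chunks = []
        for m in NP_PATTERN.finditer(tag_string):
            start = boundaries.index(m.start())
            end = boundaries.index(m.end())
            chunks.append([word for word, _ in tagged[start:end]])
        return chunks

    sentence = [('the', 'DT'), ('little', 'JJ'), ('cat', 'NN'),
                ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
    print(chunk(sentence))   # [['the', 'little', 'cat'], ['the', 'mat']]

A cascade such as the student examples above refines this basic idea by applying several chunking, chinking, merging, and splitting rules in sequence.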
6.2 Class demonstrations

NLTK provides graphical tools that can be used in class demonstrations to help explain basic NLP concepts and algorithms. These interactive tools can be used to display relevant data structures and to show the step-by-step execution of algorithms. Both data structures and control flow can be easily modified during the demonstration, in response to questions from the class.

Since these graphical tools are included with the toolkit, they can also be used by students. This allows students to experiment at home with the algorithms that they have seen presented in class.

Example: The Chart Parsing Tool

The chart parsing tool is an example of a graphical tool provided by NLTK. This tool can be used to explain the basic concepts behind chart parsing, and to show how the algorithm works. Chart parsing is a flexible parsing algorithm that uses a data structure called a chart to record hypotheses about syntactic constituents. Each hypothesis is represented by a single edge on the chart. A set of rules determines when new edges can be added to the chart. This set of rules controls the overall behavior of the parser (e.g., whether it parses top-down or bottom-up).

The chart parsing tool demonstrates the process of parsing a single sentence, with a given grammar and lexicon. Its display is divided into three sections: the bottom section displays the chart; the middle section displays the sentence; and the top section displays the partial syntax tree corresponding to the selected edge. Buttons along the bottom of the window are used to control the execution of the algorithm. The main display window for the chart parsing tool is shown in Figure 1.

[Figure 1: Chart Parsing Tool]

This tool can be used to explain several different aspects of chart parsing. First, it can be used to explain the basic chart data structure, and to show how edges can represent hypotheses about syntactic constituents. It can then be used to demonstrate and explain the individual rules that the chart parser uses to create new edges. Finally, it can be used to show how these individual rules combine to find a complete parse for a given sentence.

To reduce the overhead of setting up demonstrations during lecture, the user can define a list of preset charts. The tool can then be reset to any one of these charts at any time.

The chart parsing tool allows for flexible control of the parsing algorithm. At each step of the algorithm, the user can select which rule or strategy they wish to apply. This allows the user to experiment with mixing different strategies (e.g., top-down and bottom-up). The user can exercise fine-grained control over the algorithm by selecting which edge they wish to apply a rule to. This flexibility allows lecturers to use the tool to respond to a wide variety of questions; and allows students to experiment with different variations on the chart parsing algorithm.

6.3 Advanced Projects

NLTK provides students with a flexible framework for advanced projects. Typical projects involve the development of entirely new functionality for a previously unsupported NLP task, or the development of a complete system out of existing and new modules.

The toolkit's broad coverage allows students to explore a wide variety of topics. In our introductory computational linguistics course, topics for student projects included text generation, word sense disambiguation, collocation analysis, and morphological analysis.

NLTK eliminates the tedious infrastructure-building that is typically associated with advanced student projects by providing students with the basic data structures, tools, and interfaces that they need. This allows the students to concentrate on the problems that interest them.

The collaborative, open-source nature of the toolkit can provide students with a sense that their projects are meaningful contributions, and not just exercises. Several of the students in our course have expressed interest in incorporating their projects into the toolkit.

Finally, many of the modules included in the toolkit provide students with good examples of what projects should look like, with well thought-out interfaces, clean code structure, and thorough documentation.

Example: Probabilistic Parsing

The probabilistic parsing module was created as a class project for a statistical NLP course. The toolkit provided the basic data types and interfaces for parsing. The project extended these, adding a new probabilistic parsing interface, and using subclasses to create a probabilistic version of the context free grammar data structure. These new components were used in conjunction with several existing components, such as the chart data structure, to define two implementations of the probabilistic parsing interface. Finally, a tutorial was written that explained the basic motivations and concepts behind probabilistic parsing, and described the new interfaces, data structures, and parsers.
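As a rough sketch of the kind of extension involved, the fragment below shows a context-free production subclassed to carry a probability. The class names are invented for this illustration; they are not the data structures defined by the project or by the toolkit.

    # Illustrative only: invented names, not the project's or the toolkit's classes.
    class Production:
        """A context-free production, e.g. NP -> DT N."""
        def __init__(self, lhs, rhs):
            self.lhs, self.rhs = lhs, tuple(rhs)

        def __repr__(self):
            return f"{self.lhs} -> {' '.join(self.rhs)}"

    class PCFGProduction(Production):
        """The same production, extended with a probability P(rhs | lhs)."""
        def __init__(self, lhs, rhs, prob):
            super().__init__(lhs, rhs)
            self.prob = prob

        def __repr__(self):
            return f"{super().__repr__()} [{self.prob}]"

    grammar = [
        PCFGProduction('S', ['NP', 'VP'], 1.0),
        PCFGProduction('NP', ['DT', 'N'], 0.6),
        PCFGProduction('NP', ['N'], 0.4),
    ]
    # Any code written against Production (e.g. a chart parser) also accepts
    # PCFGProduction objects; the probabilities of the productions used in a
    # parse can then be multiplied to score the parse tree.
    for p in grammar:
        print(p)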
7 Evaluation

We used NLTK as a basis for the assignments and student projects in CIS-530, an introductory computational linguistics class taught at the University of Pennsylvania. CIS-530 is a graduate level class, although some advanced undergraduates were also enrolled. Most students had a background in either computer science or linguistics (and occasionally both). Students were required to complete five assignments, two exams, and a final project. All class materials are available from the course website http://www.cis.upenn.edu/~cis530/.

The experience of using NLTK was very positive, both for us and for the students. The students liked the fact that they could do interesting projects from the outset. They also liked being able to run everything on their computer at home. The students found the extensive documentation very helpful for learning to use the toolkit. They found the interfaces defined by NLTK intuitive, and appreciated the ease with which they could combine different components to create complete NLP systems.

We did encounter a few difficulties during the semester. One problem was finding large clean corpora that the students could use for their assignments. Several of the students needed assistance finding suitable corpora for their final projects. Another issue was the fact that we were actively developing NLTK during the semester; some modules were only completed one or two weeks before the students used them. As a result, students who worked at home needed to download new versions of the toolkit several times throughout the semester. Luckily, Python has extensive support for installation scripts, which made these upgrades simple. The students encountered a couple of bugs in the toolkit, but none were serious, and all were quickly corrected.

8 Other Approaches

The computational component of computational linguistics courses takes many forms. In this section we briefly review a selection of approaches, classified according to the (original) target audience.

Linguistics Students. Various books introduce programming or computing to linguists. These are elementary on the computational side, providing a gentle introduction to students having no prior experience in computer science. Examples of such books are Using Computers in Linguistics (Lawler and Dry, 1998) and Programming for Linguistics: Java Technology for Language Researchers (Hammond, 2002).

Grammar Developers. Infrastructure for grammar development has a long history in unification-based (or constraint-based) grammar frameworks, from DCG (Pereira and Warren, 1980) to HPSG (Pollard and Sag, 1994). Recent work includes (Copestake, 2000; Baldridge et al., 2002a). A concurrent development has been the finite state toolkits, such as the Xerox toolkit (Beesley and Karttunen, 2002). This work has found widespread pedagogical application.

Other Researchers and Developers. A variety of toolkits have been created for research or R&D purposes. Examples include the CMU-Cambridge Statistical Language Modeling Toolkit (Clarkson and Rosenfeld, 1997), the EMU Speech Database System (Harrington and Cassidy, 1999), the General Architecture for Text Engineering (Bontcheva et al., 2002), the Maxent Package for Maximum Entropy Models (Baldridge et al., 2002b), and the Annotation Graph Toolkit (Maeda et al., 2002). Although not originally motivated by pedagogical needs, all of these toolkits have pedagogical applications and many have already been used in teaching.

9 Conclusions and Future Work

NLTK provides a simple, extensible, uniform framework for assignments, projects, and class demonstrations. It is well documented, easy to learn, and simple to use. We hope that NLTK will allow computational linguistics classes to include more hands-on experience with using and building NLP components and systems.

NLTK is unique in its combination of three factors. First, it was deliberately designed as courseware and gives pedagogical goals primary status. Second, its target audience consists of both linguists and computer scientists, and it is accessible and challenging at many levels of prior computational skill. Finally, it is based on an object-oriented scripting language supporting rapid prototyping and literate programming.

We plan to continue extending the breadth of materials covered by the toolkit. We are currently working on NLTK modules for Hidden Markov Models, language modeling, and tree adjoining grammars. We also plan to increase the number of algorithms implemented by some existing modules, such as the text classification module.

Finding suitable corpora is a prerequisite for many student assignments and projects. We are therefore putting together a collection of corpora containing data appropriate for every module defined by the toolkit.

NLTK is an open source project, and we welcome any contributions. Readers who are interested in contributing to NLTK, or who have suggestions for improvements, are encouraged to contact the authors.

10 Acknowledgments

We are indebted to our students for feedback on the toolkit, and to anonymous reviewers, Jee Bang, and the workshop organizers for comments on an earlier version of this paper. We are grateful to Mitch Marcus and the Department of Computer and Information Science at the University of Pennsylvania for sponsoring the work reported here.

References

Jason Baldridge, John Dowding, and Susana Early. 2002a. Leo: an architecture for sharing resources for unification-based grammars. In Proceedings of the Third Language Resources and Evaluation Conference. Paris: European Language Resources Association. http://www.iccs.informatics.ed.ac.uk/~jmb/leo-lrec.ps.gz

Jason Baldridge, Thomas Morton, and Gann Bierner. 2002b. The MaxEnt project. http://maxent.sourceforge.net/

Kenneth R. Beesley and Lauri Karttunen. 2002. Finite-State Morphology: Xerox Tools and Techniques. Studies in Natural Language Processing. Cambridge University Press.

Kalina Bontcheva, Hamish Cunningham, Valentin Tablan, Diana Maynard, and Oana Hamza. 2002. Using GATE as an environment for teaching NLP. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL. Somerset, NJ: Association for Computational Linguistics.

Philip R. Clarkson and Ronald Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge Toolkit. In Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH '97). http://svr-www.eng.cam.ac.uk/~prc14/eurospeech97.ps

Ann Copestake. 2000. The (new) LKB system. http://www-csli.stanford.edu/~aac/doc5-2.pdf

Michael Hammond. 2002. Programming for Linguistics: Java Technology for Language Researchers. Oxford: Blackwell. In press.

Jonathan Harrington and Steve Cassidy. 1999. Techniques in Speech Acoustics. Kluwer.

John M. Lawler and Helen Aristar Dry, editors. 1998. Using Computers in Linguistics. London: Routledge.

Edward Loper. 2002. Epydoc. http://epydoc.sourceforge.net/

Fredrik Lundh. 1999. An introduction to tkinter. http://www.pythonware.com/library/tkinter/introduction/index.htm

Kazuaki Maeda, Steven Bird, Xiaoyi Ma, and Haejoong Lee. 2002. Creating annotation tools with the annotation graph toolkit. In Proceedings of the Third International Conference on Language Resources and Evaluation. http://arXiv.org/abs/cs/0204005

Fernando C. N. Pereira and David H. D. Warren. 1980. Definite clause grammars for language analysis – a survey of the formalism and a comparison with augmented transition grammars. Artificial Intelligence, 13:231–278.

Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. Chicago University Press.

Guido van Rossum. 1999. Computer programming for everybody. Technical report, Corporation for National Research Initiatives. http://www.python.org/doc/essays/cp4e.html