Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis

Amjad Abu-Jbara
EECS Department
University of Michigan
Ann Arbor, MI, USA
[email protected]

Dragomir Radev
EECS Department and School of Information
University of Michigan
Ann Arbor, MI, USA
[email protected]

Abstract

In this paper we present Clairlib, an open-source toolkit for Natural Language Processing, Information Retrieval, and Network Analysis. Clairlib provides an integrated framework intended to simplify a number of generic tasks within and across those three areas. It has a command-line interface, a graphical interface, and a documented API. Clairlib is compatible with all the common platforms and operating systems. In addition to its own functionality, it provides interfaces to external software and corpora. Clairlib comes with comprehensive documentation and a rich set of tutorials and visual demos.

1 Introduction

The development of software packages and code libraries that implement algorithms and perform tasks in scientific areas is of great advantage to both researchers and educators. The availability of these tools saves researchers much of the time and effort needed to implement the new approaches they propose and to conduct the experiments that verify their hypotheses. Educators also find these tools useful for class demonstrations and for setting up practical programming assignments and projects for their students.

A large number of systems have been developed over the years to solve problems and perform tasks in Natural Language Processing, Information Retrieval, or Network Analysis. Many of these systems perform specific tasks such as graph partitioning, co-reference resolution, web crawling, etc. Other systems are frameworks for performing generic tasks in one area of focus, such as NLTK (Bird and Loper, 2004) and GATE (Cunningham et al., 2002) for Natural Language Processing; Pajek (Batagelj and Mrvar, 2003) and GUESS (Adar, 2006) for Network Analysis and Visualization; and Lemur (http://www.lemurproject.org/) for Language Modeling and Information Retrieval.

This paper presents Clairlib, an open-source toolkit that contains a suite of modules for generic tasks in Natural Language Processing (NLP), Information Retrieval (IR), and Network Analysis (NA). While many systems have been developed to address tasks or subtasks in one of these areas, as we have just mentioned, Clairlib provides one integrated environment that addresses tasks in all three. This makes it useful for a wide range of applications within and across the three domains.

Clairlib is designed to meet the needs of researchers and educators with varying purposes and backgrounds. For this purpose, Clairlib provides three different interfaces to its functionality: a graphical interface, a command-line interface, and an application programming interface (API).

Clairlib is developed and maintained by the Computational Linguistics and Information Retrieval (CLAIR) group at the University of Michigan. The first version of Clairlib was released in 2007. It has been under heavy development since then, culminating in a qualitative leap with the addition of the graphical interface and many new features in the latest version, which we present here.

The Clairlib core modules are written in Perl, and the GUI is written in Java. The Perl back-end and the Java front-end are tied together through a communication module. Clairlib is compatible with all the common platforms and operating systems.


The only requirements are a Perl interpreter and the Java Runtime Environment (JRE).

Clairlib has been used in several research projects to implement systems and conduct experiments. It has also been used in several academic courses.

The rest of this paper is organized as follows. In Section 2, we describe the structure of Clairlib. In Section 3, we present its functionality. Section 4 presents some usage examples. We conclude in Section 5.

2 System Overview

Clairlib consists of three main components: the core library, the command-line interface, and the graphical user interface. The three components were designed and connected together in a manner that aims to achieve simplicity, integration, and ease of use. In the following subsections, we briefly describe each of the three components.

2.1 Modules

The core of Clairlib is a collection of more than 100 modules organized in a shallow hierarchy, each of which performs a specific task or implements a certain algorithm. A set of core modules define the data structures and perform the basic processing tasks. For example, Clair::Document defines a data structure for holding textual data in various formats and performs basic text processing tasks such as tokenization, stemming, and tag stripping.

Another set of modules perform more specific tasks in the three areas of focus (NLP, IR, and NA). For example, Clair::Bio::GIN::Interaction is devoted to protein-protein interaction extraction from biomedical text.

A third set contains modules that interface Clairlib to external tools. For example, Clair::Utils::Parse provides an interface to the Charniak parser (Charniak, 2000), the Stanford parser (Klein and Manning, 2003), and Chunklink (http://ilk.uvt.nl/team/sabine/chunklink/README.html).

Each module has a well-defined API. The API is oriented to developers, to help them write applications and build systems on top of Clairlib modules, and to researchers, to help them write applications and set up custom experiments for their research.
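To give a concrete flavor of the module API, the short Perl sketch below drives Clair::Document, the core module named above. The constructor arguments and the method names strip_html and stem are illustrative assumptions inferred from the tasks listed in this subsection, not verified signatures; the API Reference described in Section 2.4 documents the actual interface.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Clair::Document;    # core Clairlib module named in Section 2.1

    # Hypothetical usage sketch: the constructor arguments and method names
    # below are assumptions for illustration, not verified signatures.
    my $doc = Clair::Document->new(
        string => '<p>Clairlib simplifies <b>generic</b> NLP tasks.</p>',
        type   => 'html',
    );

    my $text    = $doc->strip_html();   # tag stripping (assumed method name)
    my $stemmed = $doc->stem();         # stemming      (assumed method name)

    print "$text\n$stemmed\n";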

2.2 Command-line Interface

The command-line interface provides easy access to many of the tasks that Clairlib modules implement. It provides more than 50 different commands. Each command is documented and demonstrated in one or more tutorials. The function of each command can be customized by passing arguments with the command. For example, the command

    partition.pl -graph graph.net -method GirvanNewman -n 4

uses the Girvan-Newman algorithm to divide a given graph into 4 partitions.

2.3 Graphical User Interface

The graphical user interface (GUI) is an important feature that has recently been added to Clairlib and constituted a quantum leap in its development. The main purpose of the GUI is to make the rich set of Clairlib functionalities easier to access for a larger number of users from various levels and backgrounds, especially students and users with limited or no programming experience.

It is also intended to help students do their assignments, projects, and research experiments in an interactive environment. We believe that visual tools facilitate understanding and make learning a more enjoyable experience for many students. With this purpose in focus, the GUI is tuned for simplicity and ease of use rather than for high computational efficiency. Therefore, while it is suitable for small and medium scale projects, it is not guaranteed to work efficiently for large projects that involve large datasets and require heavy processing. The command-line interface is a better choice for large projects.

The GUI consists of three components: the Network Editor/Visualizer/Analyzer, the Text Processor, and the Corpus Processor. The Network component allows the user to 1) build a new network using a set of drawing and editing tools, 2) open existing networks stored in files in several different formats, 3) visualize a network and interact with it, 4) compute different statistics for a network such as diameter, clustering coefficient, degree distribution, etc., and 5) perform several operations on a network such as random walk, label propagation, partitioning, etc. This component uses the open-source library JUNG (http://jung.sourceforge.net/) to visualize networks. Figure 1 shows a screenshot of the Network Visualizer.

Figure 1: A screenshot of the network visualization component of Clairlib

The Text Processing component allows users to process textual data published on the internet or imported from a file stored on disk. It can process data in plain, HTML, or PDF format. Most of the text processing capabilities implemented in the Clairlib core library are available through this component. Figure 2 shows a screenshot of the text processing component.

The Corpus Processing component allows users to build a corpus of textual data out of a collection of files in plain, HTML, or PDF format, or by crawling a website. Several tasks can be performed on a corpus, such as indexing, querying, summarization, hyperlink network construction, etc.

Although these components can be run independently, they are well integrated and designed to easily interact with each other. For example, a user can crawl a website using the Corpus component, then switch to the Text Processing component to extract the text from the web documents and stem all the words, then switch back to the Corpus component to build a document similarity graph. The graph can then be taken to the Network component to be visualized and analyzed. (A small self-contained sketch of the similarity computation behind such a graph is given at the end of this section.)

2.4 Documentation

Clairlib comes with extensive documentation. The documentation contains the installation information for different platforms, a description of all Clairlib components and modules, and many usage examples. In addition to this documentation, Clairlib provides three other resources:

API Reference
The API Reference provides a complete description of each module in the library. It describes each subroutine, the task it performs, the arguments it takes, the value it returns, etc. This reference is useful for developers who want to use Clairlib modules in their own applications and systems. The API Reference is published on the internet.

Tutorials
Tutorials teach users how to use Clairlib by example. Each tutorial addresses a specific task and provides a set of instructions to complete the task using the Clairlib command-line tools or the API.

Visual Demos
Visual demos target the users of the graphical interface. The demos visually show how to start the GUI and how to use its components to perform several tasks.
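As mentioned at the end of Section 2.3, a common workflow ends by building a document similarity graph. The following self-contained Perl sketch shows one standard way such a graph can be computed: cosine similarity between term-frequency vectors, with an edge added for every pair of documents above a threshold. It does not call Clairlib itself, and the toy documents and the 0.2 threshold are invented for illustration.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy documents standing in for text extracted from crawled pages.
    my @docs = (
        'clairlib provides nlp ir and network analysis tools',
        'the toolkit provides network analysis and visualization',
        'students use the api for course projects',
    );

    # Term-frequency vector for one document.
    sub tf {
        my %v;
        $v{$_}++ for split ' ', lc $_[0];
        return \%v;
    }

    # Cosine similarity between two term-frequency vectors.
    sub cosine {
        my ($x, $y) = @_;
        my ($dot, $nx, $ny) = (0, 0, 0);
        $dot += $x->{$_} * ($y->{$_} // 0) for keys %$x;
        $nx  += $_ ** 2 for values %$x;
        $ny  += $_ ** 2 for values %$y;
        return $nx && $ny ? $dot / sqrt($nx * $ny) : 0;
    }

    # Build the similarity graph: one edge per pair above the threshold.
    my @vec = map { tf($_) } @docs;
    my $threshold = 0.2;    # arbitrary cutoff, for illustration only
    for my $i (0 .. $#docs) {
        for my $j ($i + 1 .. $#docs) {
            my $sim = cosine($vec[$i], $vec[$j]);
            printf "doc%d -- doc%d (%.2f)\n", $i, $j, $sim if $sim > $threshold;
        }
    }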

Figure 2: A screenshot of the text processing component of Clairlib

3 Functionality

Clairlib provides modules and tools for a broad spectrum of tasks. Most of the functionalities are native to Clairlib. Some functionalities, however, are imported from other open-source packages or external software. This section lists the main functionalities categorized by area.

3.1 Natural Language Processing

NLP functionalities include Tokenization, Sentence Segmentation, Stemming, HTML Tag Stripping, Syntactic Parsing, Dependency Parsing, Part-of-Speech Tagging, Document Classification, LexRank, Summarization, Synthetic Corpus Generation, N-gram Extraction, XML Parsing, XML Tree Building, Text Similarity, Political Text Analysis, and Protein Name Tagging.

3.2 Information Retrieval

IR functionalities include Web Crawling, Indexing, TF-IDF, PageRank, Phrase-Based Retrieval, Fuzzy OR Queries, Latent Semantic Indexing, Web Search, Automatic Link Extraction, and Protein-Protein Interaction Extraction.

3.3 Network Analysis

Network Analysis functionalities include Network Statistics, Random Network Generation, Network Visualization, Network Partitioning, Community Finding, Random Walks, Flow Networks, Signed Networks, and Semi-supervised Graph-based Classification. Network Statistics include Centralities, Clustering Coefficient, Shortest Paths, Diameter, Triangles, Triplets, etc.

Some of these functionalities are implemented using several approaches. For example, Clairlib has implementations of 5 graph partitioning algorithms. This makes Clairlib a useful tool for conducting experiments in comparative studies.

4 Uses of Clairlib

The diverse set of domains that Clairlib covers and the different types of interfaces it provides make it suitable for use in many contexts. In this section, we highlight some of its uses.

Education
Clairlib contains visual tools that instructors can use in class demonstrations to help their students understand the basic concepts and algorithms they face during their study. For example, the random walk simulator can be used to teach students how a random walk works by showing a sample network, walking randomly through it step by step, and showing the students how the probabilities change after each step.
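The step-by-step behavior just described can be mimicked in a few lines of Perl. The sketch below starts with all probability mass on one node of a made-up three-node graph and, at each step, lets every node pass its mass uniformly to its out-neighbors, printing the distribution after each step; it is a minimal illustration, not the Clairlib simulator itself.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Made-up directed graph: node => list of out-neighbors.
    my %adj = (
        A => [ 'B', 'C' ],
        B => [ 'C' ],
        C => [ 'A' ],
    );

    # Start the walk with all probability mass on node A.
    my %p = ( A => 1, B => 0, C => 0 );

    for my $step (1 .. 5) {
        # Each node passes its mass uniformly to its out-neighbors.
        my %next = map { $_ => 0 } keys %adj;
        for my $node (keys %adj) {
            my @out = @{ $adj{$node} };
            $next{$_} += $p{$node} / @out for @out;
        }
        %p = %next;
        print "step $step: ",
            join(', ', map { sprintf '%s=%.3f', $_, $p{$_} } sort keys %p), "\n";
    }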

Clairlib can also be used to create assignments of varying levels of difficulty and different scopes. Instructors may ask their students to do experiments with a dataset using Clairlib, write applications that use the API, extend an existing module, or contribute new modules to Clairlib. One example could be to ask the students to build a simple information retrieval system that indexes a collection of documents and executes search queries on it.

Clairlib has been used to create assignments and projects in NLP and IR classes at the University of Michigan and Columbia University. The experience was positive for both the instructors and the students. The instructors were able to design assignments that cover several aspects of the course and can be done in a reasonable amount of time. The students used the API to accomplish their assignments and projects. This helped them focus on the important concepts rather than diving into fine programming details.

Research
Clairlib contains implementations of many algorithms and approaches that solve common problems. It also comes with a number of corpora and annotated datasets. This makes it a good resource for researchers to build systems and conduct experiments. Clairlib has been successfully used in several research projects. Examples include Political Text Analysis (Hassan et al., 2008), Scientific Paper Summarization (Qazvinian and Radev, 2009), Blog Network Analysis (Hassan et al., 2009), Protein Interaction Extraction (Ozgur and Radev, 2009), and Citation-Based Summarization (Abu-Jbara and Radev, 2011).

4.1 Examples

In this subsection, we present some examples where Clairlib has been used.

Example: Protein-Protein Interaction Extraction
This is an example of a project that builds an information extraction system and uses Clairlib as its main processing component (Ozgur and Radev, 2009). This system is now part of a larger bioinformatics project, NCIBI.

The system uses Clairlib to process a biomedical article: 1) it splits the article into sentences using the segmentation module, 2) parses each sentence using the interface to the Stanford Dependency Parser, 3) tags the protein names, 4) extracts protein-protein interactions using a specific Clairlib module devoted to this task, and then 5) builds a protein interaction network in which nodes are proteins and edges represent interaction relations. Figure 3 shows an example protein interaction network extracted from the abstracts of a collection of biomedical articles from PubMed. This network is then analyzed to compute node centralities and basic network statistics.

Example: Scientific Paper Summarization Using Citation Networks
This is an example of a research work that used Clairlib to implement an approach and conduct experiments to support the research hypothesis. Qazvinian and Radev (2009) used Clairlib to implement their method for citation-based summarization. Given a set of sentences that cite a paper, they use Clairlib to 1) construct a cosine similarity network out of these sentences, 2) find communities of similar sentences using the Clairlib community finding module, 3) run the Clairlib LexRank module to rank the sentences, 4) extract the sentence with the highest rank from each community, and finally 5) return the set of extracted sentences as a summary paragraph.
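To make the ranking step concrete, the following self-contained Perl sketch runs a LexRank-style power iteration over a small, hard-coded sentence similarity matrix. It does not use the Clairlib LexRank module; the similarity values are invented for illustration, and the damping factor of 0.85 is a conventional choice rather than a Clairlib default.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Invented cosine-similarity matrix for four citing sentences (symmetric).
    my @sim = (
        [ 1.0, 0.6, 0.1, 0.0 ],
        [ 0.6, 1.0, 0.2, 0.1 ],
        [ 0.1, 0.2, 1.0, 0.7 ],
        [ 0.0, 0.1, 0.7, 1.0 ],
    );
    my $n = @sim;
    my $d = 0.85;    # damping factor (conventional value)

    # Row-normalize the similarity matrix into a transition matrix.
    my @P;
    for my $i (0 .. $n - 1) {
        my $row_sum = 0;
        $row_sum += $_ for @{ $sim[$i] };
        $P[$i][$_] = $sim[$i][$_] / $row_sum for 0 .. $n - 1;
    }

    # Power iteration: the scores converge to the stationary distribution.
    my @score = (1 / $n) x $n;
    for (1 .. 50) {
        my @next;
        for my $j (0 .. $n - 1) {
            my $sum = 0;
            $sum += $score[$_] * $P[$_][$j] for 0 .. $n - 1;
            $next[$j] = (1 - $d) / $n + $d * $sum;
        }
        @score = @next;
    }
    printf "sentence %d: %.3f\n", $_, $score[$_] for 0 .. $n - 1;

The sentence with the highest score in each community would then be extracted to form the summary.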
Example: Text Classification
This is an example of a teaching assignment that was used in an introductory course on information retrieval at the University of Michigan. Students were given the 20-newsgroups corpus (a large set of news articles labeled by their topic and split into training and testing sets) and were asked to use the Clairlib API to: 1) stem the text of the documents, 2) convert each document into a feature vector based on word frequencies, 3) train a multi-class Perceptron or Naive Bayes classifier on the documents in the training set, and finally 4) classify the documents in the testing set using the trained classifier.
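A compressed sketch of such an assignment is given below: a self-contained Perl program that builds word-frequency feature vectors and trains a tiny multinomial Naive Bayes classifier with add-one smoothing. It uses no Clairlib calls, and the four toy documents are invented stand-ins for the 20-newsgroups data.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Invented toy training set standing in for the 20-newsgroups data.
    my @train = (
        [ 'sports', 'the team won the final game'          ],
        [ 'sports', 'a late goal decided the match'        ],
        [ 'tech',   'the new laptop has a fast processor'  ],
        [ 'tech',   'the software update improves the gui' ],
    );

    # Word-frequency feature vector for one document.
    sub vectorize {
        my %f;
        $f{$_}++ for grep { length } split /\W+/, lc $_[0];
        return \%f;
    }

    # Train multinomial Naive Bayes: class priors and per-class word counts.
    my (%count, %total, %prior, %vocab);
    for my $ex (@train) {
        my ($class, $text) = @$ex;
        $prior{$class}++;
        my $f = vectorize($text);
        for my $w (keys %$f) {
            $count{$class}{$w} += $f->{$w};
            $total{$class}     += $f->{$w};
            $vocab{$w} = 1;
        }
    }
    my $V = keys %vocab;

    # Classify by the highest log-probability class (add-one smoothing).
    sub classify {
        my $f = vectorize($_[0]);
        my ($best, $best_lp);
        for my $class (keys %prior) {
            my $lp = log( $prior{$class} / @train );
            for my $w (keys %$f) {
                my $p = ( ($count{$class}{$w} // 0) + 1 ) / ( $total{$class} + $V );
                $lp += $f->{$w} * log($p);
            }
            ($best, $best_lp) = ($class, $lp)
                if !defined $best_lp || $lp > $best_lp;
        }
        return $best;
    }

    print classify('a fast new processor'), "\n";   # prints: tech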

Figure 3: Clairlib used to construct and analyze a protein network extracted from biomedical articles

5 Conclusions

Clairlib is a broad-coverage toolkit for Natural Language Processing, Information Retrieval, and Network Analysis. It provides a simple, integrated, interactive, and extensible framework for education and research uses. It provides an API, a command-line interface, and a graphical user interface for the convenience of users with varying purposes and backgrounds. Clairlib is well-documented, easy to learn, and simple to use. It has been tested on various types of tasks in various environments.

Clairlib is an open-source project and we welcome all contributions. Readers who are interested in contributing to Clairlib are encouraged to contact the authors.

Acknowledgements

We would like to thank Mark Hodges, Anthony Fader, Mark Joseph, Joshua Gerrish, Mark Schaller, Jonathan dePeri, Bryan Gibson, Chen Huang, Arzucan Ozgur, and Prem Ganeshkumar, who contributed to the development of Clairlib.

This work was supported in part by grants R01-LM008106 and U54-DA021519 from the US National Institutes of Health, U54 DA021519, IDM 0329043, DHB 0527513, 0534323, and 0527513 from the National Science Foundation, and W911NF-09-C-0141 from IARPA.

References

Amjad Abu-Jbara and Dragomir Radev. 2011. Coherent Citation-based Summarization of Scientific Papers. Proceedings of ACL-2011.

Eytan Adar. 2006. GUESS: A Language and Interface for Graph Exploration. CHI 2006.

V. Batagelj and A. Mrvar. 2003. Pajek - Analysis and Visualization of Large Networks. Springer, Berlin.

Steven Bird and Edward Loper. 2004. NLTK: The Natural Language Toolkit. Proceedings of ACL-2004.

Eugene Charniak. 2000. A Maximum-Entropy-Inspired Parser. Proceedings of NAACL-2000.

H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of ACL-2002, Philadelphia.

R. Gaizauskas, P. J. Rodgers, and K. Humphreys. 2001. Visual Tools for Natural Language Processing. Journal of Visual Languages and Computing, Volume 12, Issue 4, Pages 375-412.

Ahmed Hassan, Anthony Fader, Michael Crespin, Kevin Quinn, Burt Monroe, Michael Colaresi, and Dragomir Radev. 2008. Tracking the Dynamic Evolution of Participants Salience in a Discussion. COLING 2008.

Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, and Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. ICWSM-2009.

Dan Klein and Christopher Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of ACL-2003.

Arzucan Ozgur and Dragomir Radev. 2009. Supervised classification for extracting biomedical events. Proceedings of the BioNLP'09 Workshop Shared Task on Event Extraction at NAACL-HLT, Boulder, Colorado, USA, pages 111-114.

Vahed Qazvinian and Dragomir Radev. 2008. Scientific Paper Summarization Using Citation Summary Networks. COLING 2008.
