Edlin: an easy to read linear learning framework

Kuzman Ganchev∗ (University of Pennsylvania, 3330 Walnut St, Philadelphia, PA) [email protected]
Georgi Georgiev (Ontotext AD, 135 Tsarigradsko Ch., Sofia, Bulgaria) [email protected]

∗ Supported by ARO MURI SUBTLE W911NF-07-1-0216 and by the European projects AsIsKnown (FP6-028044) and LTfLL (FP7-212578).

Abstract

The Edlin toolkit provides a machine learning framework for linear models, designed to be easy to read and understand. The main goal is to provide easy to edit working examples of implementations for popular learning algorithms. The toolkit consists of 27 Java classes with a total of about 1400 lines of code, of which about 25% are I/O and driver classes for the examples. A version of Edlin has been integrated as a processing resource for the GATE architecture, and has been used for gene tagging, gene name normalization, named entity recognition in Bulgarian and biomedical relation extraction.

Keywords

Information Extraction, Classification, Software Tools

1 Introduction

The Edlin toolkit provides a machine learning framework for linear models, designed to be easy to read and understand. The main goal is to provide easy to edit working examples of implementations for popular learning algorithms. To minimize programmer overhead, Edlin depends only on GNU Trove (http://trove4j.sourceforge.net/) for fast data structures and JUnit (http://www.junit.org) for unit tests. A version of Edlin has been integrated as a processing resource for the GATE [7] architecture, and has been used in-house for gene tagging, gene name normalization, named entity recognition in Bulgarian and biomedical relation extraction. For researchers, we expect the main advantage of Edlin is that its code is easy to read, understand and modify, meaning that variations are easy to experiment with. For industrial users, the simplicity of the code as well as relatively few dependencies means that it is easier to integrate into existing codebases.

Edlin implements learning algorithms for linear models. Currently implemented are: Naive Bayes, maximum entropy models, Perceptron and one-best MIRA (optionally with averaging), AdaBoost, structured Perceptron and structured one-best MIRA (optionally with averaging), and conditional random fields. Because of the focus on clarity and conciseness, some optimizations that would make the code harder to read have not been made. This makes the framework slightly slower than it could be, but the implementations are asymptotically fast and suitable for use on medium to large datasets.

The rest of this paper is organized as follows: §2 describes the code organization; §3-§4 describe an integration with the GATE framework and an example application; §5 describes related software; and §6 discusses future work and concludes the paper.

2 Overview of the code

The goal of machine learning is to choose from a (possibly infinite) set of functions mapping from some input space to some output space. Let x ∈ X be a variable denoting an input example and let y ∈ Y range over possible labels for x. A linear model will choose a label according to

    h(x) = argmax_y f(x, y) · w        (1)

where f(x, y) is a feature function and w is a parameter vector. We take the inner product of the feature vector with the model parameters w and select the output y that has the highest such score. The feature function f(x, y) is specified by the user, while the parameter vector w is learned using training data.
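To make Equation 1 concrete, the following is a minimal sketch of linear prediction over a finite label set. The sparse vector and its inner product play the role of Edlin's types package, but the class layout, the FeatureFunction interface and all method names here are illustrative assumptions rather than Edlin's actual API.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A tiny sparse vector mapping feature index -> value. A stand-in for
    // the SparseVector in Edlin's types package; this API is an assumption.
    class SparseVector {
        final Map<Integer, Double> entries = new HashMap<Integer, Double>();

        void add(int index, double value) {
            Double old = entries.get(index);
            entries.put(index, (old == null ? 0.0 : old) + value);
        }

        // Inner product with a dense parameter vector w.
        double dot(double[] w) {
            double sum = 0.0;
            for (Map.Entry<Integer, Double> e : entries.entrySet())
                sum += w[e.getKey()] * e.getValue();
            return sum;
        }
    }

    // The user-specified feature function f(x, y) of Equation 1.
    interface FeatureFunction<X, Y> {
        SparseVector apply(X x, Y y);
    }

    // h(x) = argmax_y f(x, y) . w over a finite label set Y.
    class LinearModel<X, Y> {
        final double[] w;                // learned parameters
        final FeatureFunction<X, Y> f;   // feature function
        final List<Y> labels;            // the label set Y

        LinearModel(double[] w, FeatureFunction<X, Y> f, List<Y> labels) {
            this.w = w; this.f = f; this.labels = labels;
        }

        Y predict(X x) {
            Y best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Y y : labels) {
                double score = f.apply(x, y).dot(w);   // f(x, y) . w
                if (score > bestScore) { bestScore = score; best = y; }
            }
            return best;
        }
    }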


Even though the learning and inference algorithms are generic and can be used for different applications, Edlin is implemented with natural language tasks in mind. The classes related to classification are implemented in the classification package, while those related to sequence tasks are implemented in the sequence package. The code to perform gradient ascent and conjugate gradient is in an algo package. There are three helper packages. Two (experiments and io) contain code for reading input and driver classes for the examples. The final package, called types, contains infrastructure code such as an implementation of sparse vectors, elementary arithmetic operations such as the inner product, and other widely used operations whose implementation is not interesting from the point of view of understanding the learning algorithms. This code organization, as well as the data structures we employ, is similar to other learning packages such as StructLearn [12] and MALLET [11].

One attribute that distinguishes Edlin from both of these packages is the decomposition of the feature function into

    f(x, y) = f2(f1(x), y)        (2)

where f1 maps the input into a sparse vector and f2 combines it with a possible output in order to generate the final sparse vector used to assess the compatibility of the label for this input. By contrast, many other learning frameworks only allow the user to specify f1 and hard-code an implementation of f2 as conjoining the input predicates (f1 in the notation above) with the label. By allowing the user to specify f2, we allow them to tie parameters and add domain information about how different outputs are related. See the illustration below for an example.
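As an illustration of Equation 2, the sketch below contrasts the hard-coded choice of f2 that many frameworks bake in (conjoining every input predicate with the label) with a parameter-tying variant. It reuses the SparseVector stand-in from the earlier sketch; the interface and class names are again our own assumptions, not Edlin's signatures.

    import java.util.Map;

    // f2 in Equation 2: combines the precomputed input features f1(x)
    // with a candidate label to produce f(x, y).
    interface OutputFeatureFunction {
        SparseVector apply(SparseVector f1x, int label);
    }

    // The usual hard-coded f2: one copy of every input predicate per
    // label, so label y owns the dimensions [y*d, (y+1)*d).
    class ConjoinWithLabel implements OutputFeatureFunction {
        final int d;   // dimensionality of f1(x)
        ConjoinWithLabel(int d) { this.d = d; }

        public SparseVector apply(SparseVector f1x, int label) {
            SparseVector fxy = new SparseVector();
            for (Map.Entry<Integer, Double> e : f1x.entries.entrySet())
                fxy.add(label * d + e.getKey(), e.getValue());
            return fxy;
        }
    }

    // A user-defined f2 can instead tie parameters: labels mapped to the
    // same block share weights, encoding that those outputs are related.
    class TiedLabels implements OutputFeatureFunction {
        final int d;
        final int[] block;   // block[y] = parameter block used by label y
        TiedLabels(int d, int[] block) { this.d = d; this.block = block; }

        public SparseVector apply(SparseVector f1x, int label) {
            SparseVector fxy = new SparseVector();
            for (Map.Entry<Integer, Double> e : f1x.entries.entrySet())
                fxy.add(block[label] * d + e.getKey(), e.getValue());
            return fxy;
        }
    }

With block = {0, 1, 1}, for instance, the second and third labels share one weight block, which is exactly the kind of domain information a hard-coded f2 cannot express.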
2.1 Example Application

Perhaps the best way to describe the minimal amount of background needed to make reading the code easy is to trace how information is propagated and transformed in an example application. Take a POS tagging task as an example. Suppose we are given a collection of sentences that have been manually annotated, and that these have been split for us into a training set and a testing set. The sentences are read from disk and converted to a sparse vector representing f1(x) by a class in the io package. For example, we might extract suffixes of length 2 to 5 from each word in a sentence. We look these up in an alphabet that maps them to a unique dimension, and store the counts in a sparse vector for each word. The alphabet and sparse vector are implemented in Alphabet and SparseVector respectively. The array of sparse vectors for a sentence (recall there is one for each word) and the alphabet are bundled together in a SequenceInstance object along with the true label.

Next we want to train a linear sequence model using the perceptron algorithm on the training portion of our data. We construct a sequence.Perceptron object and call its batchTrain method. Figure 1 reproduces the implementation. The method takes the training data as a Java ArrayList of SequenceInstance objects, and the Perceptron class has parameters for whether averaging is turned on and the number of passes to make over the data. It also contains a SequenceFeatureFunction object (fxy in Figure 1) that implements f2 from above. For part of speech tagging, it is typical to let ft(x, y^(t-1), y^(t)) conjoin f1(x) with y^(t) and also conjoin y^(t) with y^(t-1), but not to have any features that look at x, y^(t) and y^(t-1) all at the same time. By contrast, for named entities it is typical to have features that look at all three. The linear sequence model is created in the first line of the batchTrain method as a LinearTagger object, which has access to the alphabet used in the initial construction of the sparse vectors, the label alphabet (yAlphabet in Figure 1) and f2 (fxy in Figure 1). It computes the prediction, which is represented as an int array, with the interpretation that yhat[t] = j means word t has the label at position j in the label alphabet (accessible via yAlphabet.lookupIndex(j)). The batchTrain method returns the linear sequence model.
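Putting §2.1 together, a driver for the POS example might look like the sketch below. Alphabet, SparseVector, SequenceInstance, LinearTagger, sequence.Perceptron, batchTrain, label and yAlphabet.lookupIndex are the names used in the text and in Figure 1, but every constructor signature, as well as the PosReader.readData and FirstOrderFeatures helpers, is a hypothetical stand-in rather than Edlin's documented API.

    import java.util.ArrayList;

    // Hypothetical driver tracing the data flow of Section 2.1.
    public class PosTaggingExample {
        public static void main(String[] args) throws Exception {
            Alphabet xAlphabet = new Alphabet();   // suffix predicates -> dimensions
            Alphabet yAlphabet = new Alphabet();   // POS tags -> label indices

            // io-style loading (hypothetical helper): each word becomes a
            // sparse vector of suffix counts (lengths 2 to 5); each sentence
            // becomes a SequenceInstance holding those vectors and the tags.
            ArrayList<SequenceInstance> train =
                PosReader.readData("train.txt", xAlphabet, yAlphabet);
            ArrayList<SequenceInstance> test =
                PosReader.readData("test.txt", xAlphabet, yAlphabet);

            // f2 conjoins f1(x) with y(t) and y(t) with y(t-1), as is
            // typical for POS tagging (FirstOrderFeatures is hypothetical).
            SequenceFeatureFunction fxy =
                new FirstOrderFeatures(xAlphabet, yAlphabet);

            // Assumed constructor: averaging flag and number of passes.
            Perceptron trainer = new Perceptron(xAlphabet, yAlphabet, fxy, true, 10);
            LinearTagger model = trainer.batchTrain(train);

            // yhat[t] = j means word t has the label at position j in yAlphabet.
            for (SequenceInstance inst : test) {
                int[] yhat = model.label(inst.x);
                for (int t = 0; t < yhat.length; t++)
                    System.out.print(yAlphabet.lookupIndex(yhat[t]) + " ");
                System.out.println();
            }
        }
    }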

    public LinearTagger batchTrain(
            ArrayList<SequenceInstance> trainingData) {
        LinearTagger w = new LinearTagger(xAlphabet, yAlphabet, fxy);
        LinearTagger theta = null;
        if (performAveraging)
            theta = new LinearTagger(xAlphabet, yAlphabet, fxy);
        for (int iter = 0; iter < numIterations; iter++) {
            for (SequenceInstance inst : trainingData) {
                int[] yhat = w.label(inst.x);
                // if y = yhat then this update won't change w.
                StaticUtils.plusEquals(w.w, fxy.apply(inst.x, inst.y));
                StaticUtils.plusEquals(w.w, fxy.apply(inst.x, yhat), -1);
                if (performAveraging)
                    StaticUtils.plusEquals(theta.w, w.w, 1);
            }
        }
        if (performAveraging) return theta;
        return w;
    }

Fig. 1: Edlin's perceptron implementation, reproduced verbatim to show code organization.

3 GATE integration

GATE [8, 7] is a framework for engineering NLP applications along with a graphical development environment for developing components. GATE divides language processing resources into language resources, processing resources, and graphical interfaces. We have integrated a version of Edlin into the GATE framework as a set of processing resources, by defining interfaces in Edlin for training, classification, and sequence tagging. These interfaces are used to communicate between Edlin's machine learning implementations and the concrete implementations of tagger and classifier processors in GATE. The integration allows Edlin to be used for robust, complex text processing applications, relying on GATE processors such as tokenizers, sentence splitters and parsers to preprocess the data. The integration also makes it easy to pipeline Edlin-trained linear models using the GATE infrastructure for processing pipelines. Since Edlin has very readable code, this makes it easy for a researcher or engineer to try a modified learning algorithm if they already use the GATE framework.

4 Biomedical Relation Extraction

In this section we show an example text processing application within the Edlin and GATE architectures, focusing on the organization of the text processing components. Our problem domain is the BioNLP 2009 shared task [17], a biomedical relation extraction task. The goal is to identify relations between genes/gene products. We chose this task as an example because it is relatively complex and uses both Edlin and several GATE processing components. The results are described in [9].

Following BioNLP terminology, we use the term proteins to refer to both genes and gene products. Both trigger chunks and proteins are called participants. For example, the text "... phosphorylation of TRAF2 ..." would be a relation of Phosphorylation with a theme of TRAF2. The relation is called an event, while the string "phosphorylation" is called a trigger. Gene boundary annotations are provided by the task organizers. In general, there are events with multiple participants in addition to the trigger. The event instances are organized into the structure of the Gene Ontology [5].

We separated the task into two main sub-tasks: (i) recognition of trigger chunks using an Edlin sequence tagger and (ii) classification of triggers and proteins as either forming an event of one of 9 predefined types or not participating in an event together. At the end of the section we discuss the final pipeline of processors used in this relation extraction task.

4.1 Gene and Trigger Tagging

The trigger chunks are simply words and phrases that describe the events linking proteins. For instance, binds is such a trigger word that would link two proteins or genes in a Binding event. We used the Edlin GATE integration described in Section 3 to create one GATE processing resource that trains an Edlin linear sequence model and another that uses that Edlin sequence model to tag trigger chunks. Both processors work in a pipeline with GATE preprocessors including a tokenizer, sentence splitter, POS tagger and chunker. Because Edlin represents linear models trained using different algorithms in the same way, it was easy for us to compare different learning algorithms for the task. For this application, tagger recall is an upper bound on system performance, and we used MIRA with a loss function designed to achieve high recall, since that performed best.

4.2 Relation Extraction

We separate the process of relation extraction into two stages: in the first stage, we generate events corresponding to relations between a trigger word and one or more proteins (simple events), while in the second stage, we generate events that correspond to relations between trigger words, proteins and simple events (we call the new events complex events).
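The two-stage scheme can be summarized in code. The sketch below is purely illustrative: all of the types and the classify stub are hypothetical stand-ins for the GATE processing resources, backed by Edlin's one-best MIRA classifier, that we actually used.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative two-stage event extraction (Section 4.2); every name
    // here is hypothetical.
    class TwoStageEventExtraction {
        enum EventType { NONE, PHOSPHORYLATION, BINDING /* ...9 types */ }

        static class Participant {}                 // a protein or a simple event
        static class Protein extends Participant {}
        static class Trigger { String text; }
        static class Event extends Participant {    // type + trigger + theme
            final EventType type; final Trigger trigger; final Participant theme;
            Event(EventType type, Trigger trigger, Participant theme) {
                this.type = type; this.trigger = trigger; this.theme = theme;
            }
        }

        // Stand-in for a linear classifier over the event types (plus NONE).
        EventType classify(Trigger t, Participant p) { return EventType.NONE; }

        List<Event> extract(List<Trigger> triggers, List<Protein> proteins) {
            // Stage 1: simple events relate a trigger to a protein.
            List<Event> simple = new ArrayList<Event>();
            for (Trigger t : triggers)
                for (Protein p : proteins) {
                    EventType type = classify(t, p);
                    if (type != EventType.NONE) simple.add(new Event(type, t, p));
                }
            // Stage 2: complex events may take simple events as participants.
            List<Event> all = new ArrayList<Event>(simple);
            for (Trigger t : triggers)
                for (Event e : simple) {
                    EventType type = classify(t, e);
                    if (type != EventType.NONE) all.add(new Event(type, t, e));
                }
            return all;
        }
    }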

For the purpose of this task we designed and implemented four GATE processing resources: two for training and two for classification of genes and trigger chunks into the 9 predefined types of events. The training of an Edlin linear model and classification using that model are again done using the Edlin-GATE integration, and are integrated in a GATE pipeline that now also includes dependency and phrase-structure parsers.

As with finding trigger words in the previous section, the uniform representation of linear models allowed us to compare different learning methods. We compared max entropy, perceptron and one-best MIRA, and again chose MIRA with a loss function designed to increase recall, since getting high recall was the most challenging part of the task. Finally, this tunable loss function was appealing for us because it allows application-specific tuning. For example, a search application might require high recall, but high precision might be more important for adding relations to a knowledge base.

    GATE application pipeline:
        Sentence splitter
        Tokenizer
        POS tagger, chunker
        Edlin trigger tagger
        Parsers
        Edlin simple event extractor
        Edlin complex event extractor

Fig. 2: Graphical view of our relation extraction system pipeline.

Figure 2 shows our event extraction pipeline, stringing together different GATE text processors. The first stages of the pipeline as well as the parsers are included in order to create features useful for later stages. As described above, the trigger tagging stage uses an Edlin GATE processor trained using one-best MIRA. Furthermore, we employ a maximum entropy constituency parser [1] and a dependency parser [13]. These components are also represented as GATE processors. In the last stage of the pipeline we use two components, one for simple and one for complex events, based on the classification version of the one-best MIRA algorithm implemented in Edlin and used as GATE processors.

5 Related Work

There are a number of machine learning tools available either as open source packages or with source code for research purposes. To our knowledge Edlin is the only framework that represents linear models in a uniform fashion, and is also the only learning framework that prioritizes code readability. The NLTK [3, 4] (natural language toolkit) emphasizes code readability but focuses on natural language processing rather than learning.

MALLET [11] is a Java toolkit for machine learning. MALLET implements most of the learning algorithms available in Edlin in addition to many others. The exceptions are perceptron and MIRA, which are available as a separate MALLET-compatible package called StructLearn [12, 6]. For sequence data, one of MALLET's main strengths is a way to easily create predicate functions (f1 in the notation of Section 2). Edlin does not have sophisticated code for feature engineering, and in our experiments we used GATE to generate features. MALLET also contains a very general implementation of CRFs that allows linear-chain models with varying order n Markov properties. These enhancements lead to a larger and hence harder to read code-base. For example, the CRF model implementation in MALLET comprises 1513 lines of code, compared to 56 for Edlin's simplistic implementation (counted with cloc, http://cloc.sourceforge.net/). Note that the authors actively recommend MALLET in particular for CRFs; however, it serves a different need than Edlin. While MALLET is very general and easy to use, Edlin is very simple and easy to understand.

LingPipe [2] is a Java toolkit for linguistic analysis of free text. The framework provides tools for classification, sequence tagging, clustering and a variety of problem-specific tasks such as spelling correction, word segmentation, named entity normalization and parsing for biomedical text, among others. Some trained models are provided, but it is possible to train new models for new tasks and data. The software is available along with source code. We did not investigate source code complexity due to time constraints, but the full-featured nature of the software and its marketing to enterprise customers suggest that its focus is on stability and scalability rather than code simplicity and readability.

Weka [18] is a widely used framework developed at the University of Waikato in New Zealand and comprises a collection of learning algorithms for classification, clustering, feature selection, and visualization. Weka includes a very friendly graphical user interface and is targeted largely at researchers in the sciences or social sciences who would like to experiment with different algorithms to analyze their data. Weka does not contain code for structured learning and is more suitable for use as a versatile black box than for reading and modifying source code. For example, Weka's perceptron algorithm is implemented in 600 lines of code, compared to 38 for Edlin's. By contrast, Weka has a very good graphical user interface and allows visualizations not implemented in Edlin. GATE integration allows some visualization and evaluation for Edlin, but specialized only for text.

ABNER [16] is a tool for processing natural language text aimed at the biomedical domain. ABNER is widely used for annotation of biomedical named entities such as genes and gene products. It contains a CRF implementation and a graphical user interface for visualization and modification of annotations, in addition to domain-specific tokenizers and sentence segmenters. BioTagger [14] is a different tool for named entity recognition in biomedical text, also using linear-chain CRFs. It has been applied to genes/gene products [14], malignancy mentions [10] and genomic variations in the oncology domain [15].

6 Conclusions and Future Work

We have presented a linear modeling toolkit of implementations written specifically for readability. We described the toolkit's layout, learning algorithms and an application we have found it useful for. The main goal of Edlin is to be easy to read and modify, and we have used the toolkit in teaching a Master's level class. While we have not performed a scientific evaluation, initial feedback from students has been positive, and at least one fellow researcher commented that he liked the organization and simplicity of the code.

Future work includes the implementation of maximal margin learning (i.e. support vector machines) and further improvements to the integration between Edlin and GATE. Finally, we intend to improve the implementation of the optimization algorithms to improve training run-time for maximum entropy models and CRFs.

References

[1] OpenNLP. http://opennlp.sourceforge.net, 2009.
[2] Alias-i. LingPipe. http://alias-i.com/lingpipe, 2008. (accessed 2008-04-20).
[3] S. Bird and E. Loper. NLTK: The natural language toolkit. In Proceedings of ACL. ACL, 2004.
[4] S. Bird and E. Loper. Natural language toolkit. http://www.nltk.org/, 2008.
[5] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25-29, 2000.
[6] K. Crammer, R. McDonald, and F. Pereira. Scalable large-margin online learning for structured classification. Department of Computer and Information Science, University of Pennsylvania, 2005.
[7] H. Cunningham. GATE: general architecture for text engineering. http://gate.ac.uk/.
[8] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.
[9] G. Georgiev, K. Ganchev, V. Momchev, D. Peychev, P. Nakov, and A. Roberts. Tunable domain-independent event extraction in the MIRA framework. In Proceedings of BioNLP. ACL, June 2009.
[10] Y. Jin, R. McDonald, K. Lerman, M. Mandel, M. Liberman, F. Pereira, R. Winters, and P. White. Identifying and extracting malignancy types in cancer literature. In BioLink, 2005.
[11] A. McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
[12] R. McDonald, K. Crammer, K. Ganchev, S. P. Bachoti, and M. Dredze. Penn StructLearn. http://www.seas.upenn.edu/~strctlrn/StructLearn/StructLearn.html.
[13] R. McDonald, K. Crammer, and F. Pereira. Online large-margin training of dependency parsers. In Proceedings of ACL. ACL, 2005.
[14] R. McDonald and F. Pereira. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics, 6(Suppl 1):S6, 2005.
[15] R. McDonald, R. Winters, M. Mandel, Y. Jin, P. White, and F. Pereira. An entity tagger for recognizing acquired genomic variations in cancer literature. Bioinformatics, 2004.
[16] B. Settles. ABNER: An open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191-3192, 2005.
[17] J. Tsujii. BioNLP'09 Shared Task on Event Extraction. http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/index.html, 2009.
[18] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
