To appear in Proceedings of the 2004 Text Retrieval Conference (TREC). Gaithersburg, MD USA. 2005.

Exploiting Zone Information, Syntactic Features, and Informative Terms in Ontology Annotation from Biomedical Documents

Burr Settles∗† [email protected] Mark Craven†∗ [email protected]

∗Department of Computer Sciences, University of Wisconsin, Madison, WI 53706 USA †Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53706 USA

Abstract

We describe a system developed for the Annotation Hierarchy subtask of the Text Retrieval Conference (TREC) 2004 Genomics Track. The goal of this track is to automatically predict Gene Ontology (GO) domain annotations given full-text biomedical journal articles and associated genes. Our system uses a two-tier statistical machine learning architecture that makes predictions first on "zone"-level text (i.e. abstract, introduction, etc.) and then combines evidence to make final document-level predictions. We also describe the effects of more advanced syntactic features (equivalence classes of syntactic patterns) and "informative terms" (phrases semantically linked to specific GO codes). These features are induced automatically from the training data and external sources, such as "weakly labeled" MEDLINE abstracts. Our baseline system, using only traditional word features, significantly exceeds the median F1 of all task submissions on unseen test data. We show further improvement by including syntactic features and informative term features. Our complete system exceeds median performance for all three evaluation metrics (precision, recall, and F1), and achieves the second highest reported F1 of all systems in the track.

1. Introduction

A major activity of most model organism database projects is to assign codes from the Gene Ontology (The Gene Ontology Consortium, 2000) to annotate the known functions of genes and gene products. The Gene Ontology (GO) consists of three structured, controlled domains ("hierarchies," or "subontologies") that describe functions of genes and their products. These domains are Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). Each assigned GO code can also be given a level of evidence indicating specific experimental support for its assignment. The TREC 2004 Genomics Track Annotation Task consisted of two subtasks: Annotation Hierarchy, and Annotation Hierarchy Plus Evidence.

We focused on developing a system to solve the Annotation Hierarchy subtask. In this task, we are given a document-gene tuple ⟨d, g⟩ and are to determine which, if any, of the three GO domains (BP, CC, MF) are applicable to gene g according to the text in document d. Note that the goal is not to link the gene with a precise GO code (e.g. "Ahr" and "signal transduction"), but rather the domain from which these GO codes come (e.g. "Ahr" and "Biological Process"). This is a significantly simplified version of the annotation task that organism database curators must face, since the exact GO code need not be identified, and we are given the relevant genes a priori.

The training data for this task comes from 178 full-text journal articles curated in 2002 by the Mouse Genome Informatics (MGI) database (Blake et al., 2003). These documents and their associated genes serve as "positive" examples (i.e. they have at least one GO annotation). We are also provided with 326 full-text articles that were not selected for GO curation, but for which genes are associated. These document-gene tuples serve as "negative" examples (i.e. they have no GO annotation). In total, there are 504 documents and 1,418 document-gene tuples in the training data. Test data containing 378 documents and 877 document-gene tuples are similarly prepared from articles processed by MGI in 2003.

This paper is organized as follows. First, we provide a fairly detailed description of our system's architecture and explain how the more advanced feature sets are generated. We then present experimental results on cross-validated training data, as well as a report on the official task evaluation. Finally, we present our conclusions and future directions for this work.

2. System Description

This section describes the architecture of our system.
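The Annotation Hierarchy subtask described above can be summarized as a simple type signature. The sketch below is purely illustrative; the function name and stub behavior are ours, not part of the official task definition.

```python
# Illustrative sketch of the Annotation Hierarchy subtask: map a
# document-gene tuple <d, g> to the subset of GO domains supported by
# the text. This stub is for exposition only; it is not the system
# described in this paper.

DOMAINS = {"BP", "CC", "MF"}  # Biological Process, Cellular Component, Molecular Function

def annotate(document_text: str, gene_symbol: str) -> set[str]:
    """Return the (possibly empty) set of GO domains applicable to the gene.
    A "negative" tuple is simply one whose correct answer is the empty set."""
    predicted: set[str] = set()  # a real system would score each domain here
    return predicted & DOMAINS
```

Note that the answer is a set, not a single label: a document may support several domains for one gene, or none at all.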
There are several key steps involved in taking a document-gene tuple and predicting GO domain annotations. Figure 1 illustrates the overall system. The following sections describe each step in detail.

2.1 Zoning and Text Preprocessing

First we process the documents, provided by task organizers in SGML format, into six distinct information-content zones: title, abstract, introduction, methods, results, and discussion. Partitioning is done using keyword heuristics during SGML parsing (e.g. "discussion," "conclusion," and "summary" are keywords for the discussion zone). We then segment the text in each zone into paragraphs and sentences using SGML tags and regular expressions. All other SGML tags are then stripped away.

Once a document has been zoned and segmented, we apply a gene name recognition algorithm to each sentence. This step involves taking the gene symbol that is provided in the tuple, looking it up in a list of known aliases provided by MGI,¹ and matching these against the text using regular expressions that are somewhat robust to orthographic variants (e.g. allowing "Aox-1" to match against "Aox I" or "AOX1"). We find that this simple matching approach is able to retrieve 97% of all document-gene tuples among positive training documents, and only 52% of tuples among negative training documents. Therefore, we treat this step as a filter on our predictions: documents for which the associated gene cannot be found are not considered for GO annotation. Assuming the matching algorithm behaves comparably on test data, this filter sets an upper bound of 97% on recall, but effectively prunes away 48% of potential false positives.

2.2 Feature Vector Encoding

Once the text preprocessing is complete, our system selects only those sentences where gene name mentions were identified. The reasoning for this is two-fold. First, it reduces the feature space for the machine learning algorithms, as we only consider the text that is close to the gene of interest. Second, and perhaps more importantly, it introduces a unique ⟨d, g⟩ representation for a document that may contain multiple genes. Consider a single article that is disjointly curated for Molecular Function (MF) of "Maged1," but Biological Process (BP) of "Ndn." By using only sentences that reference each gene, we generate unique feature vectors and can better distinguish between them.

From these sentences, we generate six feature vectors per document (one for each zone). Our baseline system employs traditional bag-of-words text classification features. In this case, each word that occurs in a sentence with a mention of the gene is considered a unique feature (case-insensitive, with stop-words removed). Consider the following sentence, taken from a caption in the ⟨11694502, Des⟩ training tuple:

    "An overlay of desmin and syncoilin immunoreactivity confirmed co-localization."

Our baseline system represents this using features like word=overlay, word=immunoreactivity, etc. But perhaps the syntactic pattern "overlay of X" refers to a procedure with some special significance to labeling? Perhaps certain keywords, such as "syncoilin," have semantic correlation with a particular process, component, or function? We wish to induce features that capture such information.

In this paper, we also introduce and investigate two advanced types of features that were used to encode the problem: syntactic features (equivalence classes of shallow syntactic patterns) and "informative terms" (n-grams with semantic association with specific GO codes). These features were automatically induced from the training corpus plus several external text sources. Section 3.2 describes the experimental impact of including these features. The following subsections describe how they are generated and incorporated into the system.

2.2.1 External Data Sources

The training set of 178 MGI-labeled documents provided for the task is decidedly small. To remedy this, we supplement the data used for feature induction with training data released for Task 2 of the 2003 BioCreative text mining challenge.² This provides an additional 803 documents labeled with GO-related information as curated by GOA (Camon et al., 2004). While this corpus may be larger, it still represents relatively few specific GO codes (which we use for informative term features). Thus, we also use databases from the GO Consortium website to gather more data about other organisms. The databases we use include FlyBase (The FlyBase Consortium, 2003), WormBase (Harris et al., 2004), SGD (Dolinski et al., 2003), and TAIR (Huala et al., 2001). They are similar to the BioCreative data in that they enumerate gene and GO code triplets for these organisms, cross-referenced with MEDLINE IDs rather than full document text. We query PUBMED with these IDs to obtain the corresponding abstracts. We consider these abstracts "weakly labeled" because the GO curations were made from complete document texts, and the evidence for a given GO code might not be found within the abstract itself. To ensure that these weakly labeled abstracts are still useful, we use only those in which a mention of the gene of interest is found.

¹ ftp://ftp.informatics.jax.org/pub/reports/
² http://www.mitre.org/public/biocreative/
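A minimal sketch of the kind of orthographic-variant alias matching described in Section 2.1 follows. The specific normalization rules here (splitting letter and digit runs, allowing an optional space or hyphen between them, and accepting Roman-numeral "I" for "1") are illustrative assumptions, not the exact rules the system uses.

```python
import re

def alias_pattern(alias: str) -> re.Pattern:
    """Build a regex for a gene alias that tolerates common orthographic
    variants. Assumed rules, for illustration: letter/digit runs may be
    joined directly or by a space or hyphen, and '1' may appear as 'I'."""
    parts = re.findall(r"[A-Za-z]+|\d+", alias)
    chunks = []
    for p in parts:
        if p.isdigit():
            alternatives = [re.escape(p)]
            if p == "1":
                alternatives.append("I")  # allow "Aox I" for "Aox-1"
            chunks.append("(?:%s)" % "|".join(alternatives))
        else:
            chunks.append(re.escape(p))
    return re.compile(r"[\s-]?".join(chunks), re.IGNORECASE)

pat = alias_pattern("Aox-1")
assert pat.search("expression of Aox I was elevated")
assert pat.search("the AOX1 transcript")
```

In practice, each alias from the MGI list would be compiled this way and tested against every sentence in the zoned document.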

Figure 1. A diagram of the complete system. Zone-level feature vectors are labeled by a Naïve Bayes model, and its predictions are combined and labeled by domain-specific Maximum Entropy models to produce the final GO domain predictions.

2.2.2 Syntactic Features

We hypothesize that shallow syntactic information, even though it conveys less than deep structural parses, may still be indicative of GO domain annotations. We generate such patterns with AutoSlog-TS (Riloff, 1996), a system that induces surface-level extraction patterns from text, treating gene name mentions as the subjects or objects of the patterns (e.g. "X binds to" and "overlay of X"). Syntactic features are then induced as follows:

(1) For each GO domain, the documents (task training data plus the external sources) are partitioned into a "support" set, containing texts labeled with that domain, and a complementary "background" set containing all other texts.

(2) AutoSlog-TS is run over both sets, and we tally the number of occurrences of each pattern in each set.

(3) Patterns are ranked information-theoretically by the entropy of their occurrence statistics,

    Entropy(X) = −p⊕ log(p⊕) − p⊖ log(p⊖),

where p⊕ is the proportion of pattern X's occurrences found in the support set and p⊖ = 1 − p⊕. An entropy of 0.0 indicates a pattern that occurs "purely" in one set or the other. Patterns with little predictive impact (those that do not occur in the support set at least 10 times more often than in the background set) are discarded.

(4) Finally, the remaining patterns are clustered together by GO domain and entropy value (in increments of 0.1, a granularity chosen empirically), creating syntactic features such as cluster=CC 0.0 (patterns that indicate the CC domain and are information-theoretically pure).
vides C P in p i E E E F t F e data o greater c h (all tary domain. vement n n n structures of (4) tal n log or a us, t t t a features using direct suc eled t t incremen patterns. of in - i and domain: 10 separate external eac o l (2) e e a ( Final other to metho n v p prop plus some h w of of giv m X e times

    CC @ 0.0 Patterns       CC/Total  Entropy
    X induced currents       24/24    0.000
    X induced vesicles       17/17    0.000
    X overlay assays         11/11    0.000
    overlay of X             11/11    0.000
    N-glycosylation of X     10/10    0.000

    BP @ 0.2 Patterns       BP/Total  Entropy
    starved X                54/56    0.222
    rolling on X             25/26    0.235
    degrees C at X           24/25    0.242
    pathways to X            22/23    0.256
    movement from X          19/20    0.266

Table 1. Sample patterns from two syntactic features. Patterns from the feature at the top indicate CC perfectly (entropy = 0), whereas the patterns below are less strongly associated with the BP label.

The pattern "movement from X", for example, is clustered into the cluster=BP 0.2 feature, as it indicates the BP label somewhat less strongly, with entropy 0.266. If any of the patterns in a particular syntactic feature occurs in the text for a zone, that feature is active in the feature vector. This method generated 41 novel syntactic features.

2.2.3 Informative Term Features

Even though the task description is to label documents only with GO domains, the specific GO codes are actually provided with the training data (and the external sources). Our "informative term" features indicate whether or not there is evidence in the text for a particular GO code lower down in the ontology. We hypothesize that (1) this code-level evidence may help inform a domain-level decision, and (2) if a domain-labeling system can be successfully trained, these features can in turn be useful in making lower-level labelings.

To induce informative terms for each GO code c, we again separate documents into support and background sets. Then we tally occurrence counts for each term t, which is an n-gram of text (where 1 ≤ n ≤ 3), in both of these sets. These counts can be used to construct the following χ² contingency table:

    # occurrences of term t          # occurrences of term t
      in code c's support text         in code c's background text
    # occurrences of other terms     # occurrences of other terms
      in code c's support text         in code c's background text

Any term t with a χ² value greater than 200 (an empirically chosen threshold) is added to the list of informative terms for GO code c. Table 2 provides some sample informative terms generated for the code GO:0005856 (cytoskeleton, in the CC domain).

    Informative Term              χ² value
    syncoilin                     11843.04
    filaments                      1292.96
    microtubule                    1278.40
    cytoskeleton                    720.75
    cytoskeletal                    715.64
    microsphere separation          686.48
    dead box proteins               455.41
    centrosome                      451.85
    spindle                         396.62
    actin binding                   370.91
    latently                        344.93
    kinesin light chains            332.90
    myoblasts and myotubes          296.27
    severing                        236.60
    punctate cytoplasmic pattern    221.93

Table 2. Sample informative terms and their associated χ² values generated for the code GO:0005856 (cytoskeleton). A high score indicates that the term is likely correlated with documents discussing the cytoskeleton.

Similar to syntactic features, if any of these terms occurs in zone text selected for feature vector encoding, the corresponding feature (e.g. infterm=cytoskeleton) is active for that feature vector. This method generated novel informative term features for 2,739 such GO codes.
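The χ² score over the 2×2 contingency table above can be computed with the standard Pearson formula. Whether the original system applied a continuity correction is not stated, so this sketch assumes the uncorrected statistic; the counts are invented toy values.

```python
def chi_squared_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic for the 2x2 contingency table
        [[a, b],   a = term t in support text,  b = term t in background text
         [c, d]]   c = other terms in support,  d = other terms in background
    (standard uncorrected formula; continuity correction is not assumed)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Toy counts: a term heavily skewed toward the support text scores high
# and would pass the chi-squared > 200 threshold from Section 2.2.3.
score = chi_squared_2x2(50, 5, 10_000, 100_000)
assert score > 200
```

A term distributed evenly between the two sets scores near zero and is discarded.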
2.3 Zone-level Classification

Feature vectors for each zone are now labeled by a multi-class, multinomial Naïve Bayes model. Naïve Bayes is a well-established machine learning algorithm that has been shown to work well on high-dimensional text classification problems such as this. The model uses Bayes' rule to calculate the probability of label l given the evidence in document d. Note that we use the term "document d" here to describe a zone within a document, since classification at this stage is done at the zone level:

    P(l|d) = P(l)P(d|l) / P(d).

In this framework, P(l) is simply the proportion of training documents that were assigned label l. Since P(d) is held constant for our purposes (it is the same for all labels since the document doesn't change), we must only worry about computing P(d|l). We use the multinomial Bayesian event model in the fashion of Lewis & Gale (1994). Here we view a document as a series of features, drawn from some feature set. Let Ci be the raw count of the number of times feature fi occurs in document d. The multinomial event probability for our Naïve Bayes classifier is then:

    P(d|l) = P(|d|) |d|! ∏i [ P(fi|l)^Ci / Ci! ].

This is a multi-class model, trained to provide probabilities for four labels: BP, CC, MF, and X (no domain). Note, however, that the three GO domain labels are not mutually exclusive. To deal with this, a training tuple which is labeled with more than one domain is repeated during training. Since we are using a Naïve Bayes model, raw counts for features are simply incremented multiple times, once under each label.

The output probabilities of this model are then combined to construct a secondary feature vector. This document-level vector uses 24 real-valued features (4 classes × 6 zones) which have corresponding semantics (e.g. "the probability of labeling the title with BP").

2.4 Document-level Classification

At this stage, we have posterior probabilities for each of four labels assigned to the six different zones in a document. This, in and of itself, is sufficient information to make GO domain predictions. However, we consider the possibilities that (1) evidence for one label may influence the likelihood of annotation for another, and (2) some zones may be more informative for document-level annotation than others.

Thus, the document-level feature vectors we prepare are finally labeled by three binary classifiers, each trained to annotate a different GO domain. In this case, a "positive" example is a document-level vector describing a tuple annotated with the corresponding GO domain in the training data, and a "negative" example describes a tuple annotated with either some other GO domain, or nothing at all.

These secondary document-level classifiers are Maximum Entropy models,³ for which the probability of a label l for document-level vector d is defined as:

    P(l|d) = (1/Zd) exp( Σi λi fi(d, l) ),

where Zd is a normalizing factor over all possible labelings of d (to ensure a proper probability in the range [0,1]), and each λi is a real-valued weight assigned to feature fi.

Maximum Entropy is attractive here because it is a discriminative algorithm that does not make the conditional independence assumptions that Naïve Bayes does. It also assigns positive or negative weights to real-valued inputs (in this case posterior probabilities from the Naïve Bayes model), which can be interpreted as positive or negative evidence. There are several nonlinear optimization algorithms that can be employed to find the globally optimal weight setting in training. Our system uses a quasi-Newton method called L-BFGS.

3. Experimentation and Development

We developed our system by conducting various 4-fold cross-validation experiments on the training data. The main system was implemented mostly in Java using, in part, classes from the MALLET library (McCallum, 2002). This library includes implementations of multinomial Naïve Bayes and Maximum Entropy classification models.

Evaluation for this task uses three criteria:

    Recall = TP / (TP + FN),
    Precision = TP / (TP + FP), and
    F1 = (2 × Recall × Precision) / (Recall + Precision),

where TP means true positives, FP means false positives, FN means false negatives, and F1 is the harmonic mean of precision and recall, a summary statistic to account for the inherent trade-off between them.

3.1 Exploiting Zone Information

First of all, we hypothesized that using zone information would be generally useful in making document-level annotations where the location of relevant textual content is unknown. Our first step was to investigate the validity of this idea. Figure 2 compares recall-precision curves for three systems: (1) a Naïve Bayes model trained on words from the entire document and ranked by posterior probability, (2) a similar model divided into zones (each zone submitting a "vote" for classification weighted by its posterior probability), and (3) the two-tier classification described in Section 2.

³ We also experimented with Support Vector Machines for the document-level model, which perform comparably.
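The two-tier combination compared here (zone-level Naïve Bayes posteriors flattened into a 24-feature vector and scored by a binary Maximum Entropy, i.e. logistic, model per domain) can be sketched as follows. The posteriors and weights are illustrative toy values, not trained parameters from the system.

```python
import math

ZONES = ["title", "abstract", "introduction", "methods", "results", "discussion"]
LABELS = ["BP", "CC", "MF", "X"]

def document_vector(zone_posteriors: dict) -> list:
    """Flatten zone-level posteriors P(label|zone) into the 24 real-valued
    features (4 labels x 6 zones) used by the document-level classifiers."""
    return [zone_posteriors[z][l] for z in ZONES for l in LABELS]

def maxent_positive_prob(features: list, weights: list) -> float:
    """Binary Maximum Entropy (logistic) model: P(+|d) for one GO domain."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Toy posteriors: every zone is fairly confident in CC.
posteriors = {z: {"BP": 0.1, "CC": 0.7, "MF": 0.1, "X": 0.1} for z in ZONES}
vec = document_vector(posteriors)
assert len(vec) == 24

# Illustrative CC model: weight +2 on each zone's CC posterior, -2 on X.
weights = [{"BP": 0.0, "CC": 2.0, "MF": 0.0, "X": -2.0}[l] for z in ZONES for l in LABELS]
assert maxent_positive_prob(vec, weights) > 0.5  # CC evidence wins
```

The real weights are learned per domain (with L-BFGS in the system described above), which is what Figure 3 visualizes.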

Figure 2. Recall-precision curves comparing use of zoning information in various ways during system development (using 4-fold cross-validation).

From this figure we can see that moving from a full-document classification model to a simple vote zoning system has a slight impact: sacrificing precision at low levels of recall, but maintaining higher precision at high levels of recall. However, moving on to the two-tier system significantly improves precision across the whole recall spectrum.

Figure 3 is a visualization of the weights assigned by each document-level Maximum Entropy model to the posterior probabilities of each zone-level labeling. We can see that a zone-level CC label is generally a good predictor of a document-level CC label (likewise for BP and MF), and that X (no annotation) weights are generally negative. Moreover, we see stronger weights for the title and introduction zones, indicating highly predictive information in these text regions, while abstract and discussion seem surprisingly uninformative.

Figure 3. A visualization of weights assigned by each document-level Maximum Entropy model to the outputs of the zone-level Naïve Bayes model. Lighter cells indicate positive weight, darker cells indicate negative weight, and grey cells have weight close to 0.

3.2 Varying the Feature Set

We also wanted to evaluate the impact of using different subsets of the advanced features described in Section 2.2 to extend our baseline system. Figure 4 compares recall-precision curves for the system using various feature sets. In general, we see that both syntactic features and informative terms improve precision for a given level of recall, and combined they show even more improvement. This turns out to hold true for official track evaluation data as well.

Figure 4. Recall-precision curves comparing use of different feature sets during system development (using 4-fold cross-validation).

4. Official Evaluation Results

Using the results obtained during development, we select the predictive threshold for each Maximum Entropy model that optimizes F1 for that GO domain on the 4-fold cross-validated training data. Models are then re-trained on the entire training corpus and applied to the unseen test corpus for official evaluation. Table 3 illustrates how the four runs from our system, each using a different subset of features, perform in the official task evaluation. We also report minimum, median, and maximum scores for all track participants. Our top two systems surpass median recall, our top three surpass median precision, and all four systems achieve a substantially higher F1 than the median for the track. (Our best system comes close to the overall maximum.) Figure 5 presents the results of official runs, for all track participants, in recall-precision space.

    System description             R       P      F1
    words only (baseline)        55.96   39.35   46.21
    words, syntactic features    55.96   42.55   48.34
    words, informative terms     62.63   42.18   50.41
    all features                 62.02   43.86   51.38
    track min                    13.33   16.92   14.92
    track median                 60.00   41.74   35.84
    track max                   100.00   60.14   56.11

Table 3. Comparison of our system using four different feature sets on official TREC evaluation data. Also presented are the minimum, median, and maximum scores for each metric for all task participants.
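As a sanity check, the F1 values reported in Table 3 follow directly from the harmonic-mean definition in Section 3:

```python
def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision, as defined in Section 3."""
    return 2 * recall * precision / (recall + precision)

# Recall/precision/F1 triples for our four official runs, from Table 3.
runs = {
    "words only (baseline)":      (55.96, 39.35, 46.21),
    "words, syntactic features":  (55.96, 42.55, 48.34),
    "words, informative terms":   (62.63, 42.18, 50.41),
    "all features":               (62.02, 43.86, 51.38),
}
for name, (r, p, expected) in runs.items():
    assert abs(f1(r, p) - expected) < 0.01, name
```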

Figure 5. Graphical comparison in recall-precision space of our systems vs. all other system runs submitted for official task evaluation.

5. Conclusions and Future Work

The performance of our two-tier system for the TREC Annotation Hierarchy task seems very promising. The major goals of this study were to investigate the value of (1) zone information, and (2) advanced features not often used in such text classification tasks (specifically syntactic features and GO-specific informative terms). This early work suggests that both facets can offer substantial gains over our baseline approach for this problem.

However, because the present work was greatly constrained by time (all work was done in approximately six weeks), there are many things we were unable to investigate. Among other things, the single Naïve Bayes model ⇒ multiple Maximum Entropy model architecture used here is a bit theoretically unfounded. We did experiment with using unique Naïve Bayes models for each zone, with suboptimal performance (though this may be due to data sparsity). We also used Maximum Entropy models for zone-level classification, but experienced severe overfitting. In short, we were able to get the present architecture to work empirically within the task time line, but would like to try more variants of this two-tier architecture.

Exploiting zone information, at least as it is presented and used here, is quite helpful. However, our notion of "zone" and the methods by which we partition a document into zones are necessarily ad hoc. It would be interesting to see if incorporating a finer-grained, more rhetorical zone ontology (Mizuta & Collier, 2004) is more helpful. We are currently investigating machine learning approaches to automate this kind of zoning, and plan to incorporate this into the Annotation Hierarchy task as well.

The patterns we generate with AutoSlog-TS are simple and based on deterministic parses more suited to general English than the vernacular of biomedical journals. The system also generates all arbitrary patterns, rather than those that extract genes and proteins specifically. We would like to see if refining the parsing, pattern generation, and rule clustering can lead to even more useful syntactic features. The informative term features presented here are generated in a manner similar to previous work (Ray & Craven, 2005), but they are used quite differently. We consider the occurrence of an informative term to be a multinomial event (similar to word features). However, much richer information is available to us: textual distance from the term to the gene of interest, the χ² confidence of that particular term, etc. It is our belief that this sort of information would be useful to the model as well.

Finally, curation and annotation tasks of this nature seem ideal for research in active learning. We intend to investigate algorithms that can learn to make predictions at the zone or passage level (rather than the document level) and garner feedback from human oracles.

Acknowledgements

This work was supported in part by NLM grant 5T15LM007359 and by NIH grant R01 LM07050-01. Special thanks to Soumya Ray for his advice, help in collecting and processing external data sources, and assistance generating the informative term features. We would also like to thank the TREC 2004 Genomics Track organizers for their hard work making the task possible.

References

Blake, J., Richardson, J. E., Bult, C. J., Kadin, J. A., Eppig, J. T., & the members of the Mouse Genome Database Group (2003). MGD: The Mouse Genome Database. Nucleic Acids Research, 31, 193–195. http://www.informatics.jax.org.

The Gene Ontology Consortium (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29.

Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., & Apweiler, R. (2004). The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Research, 32, D262–D266.

Dolinski, K., Balakrishnan, R., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S. R., Fisk, D. G., Hirschman, J. E., Hong, E. L., Issel-Tarver, L., Sethuraman, A., Theesfeld, C. L., Binkley, G., Lane, C., Schroeder, M., Dong, S., Weng, S., Andrada, R., Botstein, D., & Cherry, J. M. (2003). Saccharomyces Genome Database. ftp://ftp.yeastgenome.org/yeast/.

Harris, T. W., Chen, N., Cunningham, F., Tello-Ruiz, M., Antoshechkin, I., Bastiani, C., Bieri, T., Blasiar, D., Bradnam, K., Chan, J., Chen, C.-K., Chen, W. J., Davis, P., Kenny, E., Kishore, R., Lawson, D., Lee, R., Muller, H.-M., Nakamura, C., Ozersky, P., Petcherski, A., Rogers, A., Sabo, A., Schwarz, E. M., Auken, K. V., Wang, Q., Durbin, R., Spieth, J., Sternberg, P. W., & Stein, L. D. (2004). WormBase: a multi-species resource for nematode biology and genomics. Nucleic Acids Research, 32, D411–D417.

Huala, E., Dickerman, A., Garcia-Hernandez, M., Weems, D., Reiser, L., LaFond, F., Hanley, D., Kiphart, D., Zhuang, J., Huang, W., Mueller, L., Bhattacharyya, D., Bhaya, D., Sobral, B., Beavis, B., Somerville, C., & Rhee, S. (2001). The Arabidopsis Information Resource (TAIR): A comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Research, 29, 102–105.

Lewis, D., & Gale, W. (1994). A sequential algorithm for training text classifiers. Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 3–12). ACM/Springer.

McCallum, A. (2002). MALLET: A MAchine Learning for LanguagE Toolkit. http://mallet.cs.umass.edu.

Mizuta, Y., & Collier, N. (2004). Zone identification in biology articles as a basis for information extraction. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA) (pp. 29–35). Geneva, Switzerland.

Ray, S., & Craven, M. (2005). Learning statistical models for annotating proteins with function information using biomedical text. BMC Bioinformatics. In press.

Riloff, E. (1996). Automatically generating extraction patterns from untagged text. Proceedings of the Thirteenth National Conference on Artificial Intelligence (pp. 1044–1049). MIT Press.

The FlyBase Consortium (2003). The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Research, 31, 172–175. http://flybase.org/.