A Learning Approach to Shallow Parsing*

Marcia Muñoz†  Vasin Punyakanok*  Dan Roth*  Dav Zimak*

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA

* Research supported by NSF grants IIS-9801638 and SBR-9873450.
† Research supported by NSF grant CCR-9502540.

Abstract

A SNoW based learning approach to shallow parsing tasks is presented and studied experimentally. The approach learns to identify syntactic patterns by combining simple predictors to produce a coherent inference. Two instantiations of this approach are studied, and experimental results for Noun-Phrase (NP) and Subject-Verb (SV) phrases that compare favorably with the best published results are presented. In doing that, we compare two ways of modeling the problem of learning to recognize patterns and suggest that shallow parsing patterns are better learned using open/close predictors than using inside/outside predictors.

1 Introduction

Shallow parsing is studied as an alternative to full-sentence parsing. Rather than producing a complete analysis of sentences, the alternative is to perform only partial analysis of the syntactic structures in a text (Harris, 1957; Abney, 1991; Greffenstette, 1993). Shallow parsing information such as NPs and other syntactic sequences has been found useful in many large-scale language processing applications, including information extraction and text summarization. A lot of the work on shallow parsing over the past years has concentrated on manual construction of rules. The observation that shallow syntactic information can be extracted using local information - by examining the pattern itself, its nearby context and the local part-of-speech information - has motivated the use of learning methods to recognize these patterns (Church, 1988; Ramshaw and Marcus, 1995; Argamon et al., 1998; Cardie and Pierce, 1998).

This paper presents a general learning approach for identifying syntactic patterns, based on the SNoW learning architecture (Roth, 1998; Roth, 1999). The SNoW learning architecture is a sparse network of linear functions over a pre-defined or incrementally learned feature space. SNoW is specifically tailored for learning in domains in which the potential number of information sources (features) taking part in decisions is very large - of which NLP is a principal example. Preliminary versions of it have already been used successfully on several tasks in natural language processing (Roth, 1998; Golding and Roth, 1999; Roth and Zelenko, 1998). In particular, SNoW's sparse architecture supports well chaining and combining predictors to produce a coherent inference. This property of the architecture is the basis for the learning approach studied here in the context of shallow parsing.

Shallow parsing tasks often involve the identification of syntactic phrases or of words that participate in a syntactic relationship. Computationally, each decision of this sort involves multiple predictions that interact in some way. For example, in identifying a phrase, one can identify the beginning and end of the phrase while also making sure they are coherent. Our computational paradigm suggests using a SNoW based predictor as a building block that learns to perform each of the required predictions, and writing a simple program that activates these predictors with the appropriate input, aggregates their output and controls the interaction between the predictors. Two instantiations of this paradigm are studied and evaluated on two different shallow parsing tasks - identifying base NPs and SV phrases. The first instantiation uses predictors to decide whether each word belongs to the interior of a phrase or not, and then groups the words into phrases. The second instantiation finds the borders of phrases (beginning and end) and then pairs them in an "optimal" way into different phrases. These problem formulations are similar to those studied in (Ramshaw and Marcus, 1995) and (Church, 1988; Argamon et al., 1998), respectively.

The experimental results presented using the SNoW based approach compare favorably with previously published results, both for NPs and SV phrases. As important, we present a few experiments that shed light on some of the issues involved in using learned predictors that interact to produce the desired inference. In particular, we exhibit the contribution of chaining: features that are generated as the output of one of the predictors contribute to the performance of another predictor that uses them as its input. Also, the comparison between the two instantiations of the learning paradigm - the Inside/Outside and the Open/Close - shows the advantages of the Open/Close model over the Inside/Outside one, especially for the task of identifying long sequences.

The contribution of this work is in improving the state of the art in learning to perform shallow parsing tasks, in developing a better understanding of how to model these tasks as learning problems, and in further studying the SNoW based computational paradigm, which, we believe, can be used in many other related tasks in NLP. The rest of this paper is organized as follows: The SNoW architecture is presented in Sec. 2. Sec. 3 presents the shallow parsing tasks studied and provides details on the computational approach. Sec. 4 describes the data used and the experimental approach, and Sec. 5 presents and discusses the experimental results.

2 SNoW

The SNoW (Sparse Network of Winnows[1]) learning architecture is a sparse network of linear units over a common pre-defined or incrementally learned feature space. Nodes in the input layer of the network represent simple relations over the input sentence and are used as the input features. Each linear unit is called a target node and represents a relation of interest over the input sentence; in the current application, target nodes may represent a potential prediction with respect to a word in the input sentence, e.g., inside a phrase, outside a phrase, at the beginning of a phrase, etc. An input sentence, along with a designated word of interest in it, is mapped into a set of features which are active in it; this representation is presented to the input layer of SNoW and propagates to the target nodes. Target nodes are linked via weighted edges to (some of the) input features. Let A_t = {i_1, ..., i_m} be the set of features that are active in an example and are linked to the target node t. Then the linear unit is active iff sum_{i in A_t} w_i^t > theta_t, where w_i^t is the weight on the edge connecting the ith feature to the target node t, and theta_t is the threshold for the target node t.

[1] To winnow: to separate chaff from grain.

Each SNoW unit may include a collection of subnetworks, one for each of the target relations. A given example is treated autonomously by each target subnetwork; an example labeled t may be treated as a positive example by the subnetwork for t and as a negative example by the rest of the target nodes.

The learning policy is on-line and mistake-driven; several update rules can be used within SNoW. The most successful update rule, and the only one used in this work, is a variant of Littlestone's (1988) Winnow update rule, a multiplicative update rule tailored to the situation in which the set of input features is not known a priori, as in the infinite attribute model (Blum, 1992). This mechanism is implemented via the sparse architecture of SNoW. That is, (1) input features are allocated in a data driven way - an input node for the feature i is allocated only if the feature i was active in some input sentence - and (2) a link (i.e., a non-zero weight) exists between a target node t and a feature i if and only if i was active in an example labeled t.

The Winnow update rule has, in addition to the threshold theta_t at the target t, two update parameters: a promotion parameter alpha > 1 and a demotion parameter 0 < beta < 1. These are used to update the current representation of the target t (the set of weights w_i^t) only when a mistake in prediction is made. Let A_t = {i_1, ..., i_m} be the set of active features that are linked to the target node t. If the algorithm predicts 0 (that is, sum_{i in A_t} w_i^t <= theta_t) and the received label is 1, the active weights in the current example are promoted in a multiplicative fashion: for all i in A_t, w_i^t <- alpha * w_i^t. If the algorithm predicts 1 (sum_{i in A_t} w_i^t > theta_t) and the received label is 0, the active weights in the current example are demoted: for all i in A_t, w_i^t <- beta * w_i^t. All other weights are unchanged.

The key feature of the Winnow update rule is that the number of examples required to learn a linear function grows linearly with the number of relevant features and only logarithmically with the total number of features. This property seems crucial in domains in which the number of potential features is vast but a relatively small number of them is relevant. Winnow is known to learn efficiently any linear threshold function and to be robust in the presence of various kinds of noise and in cases where no linear-threshold function can make perfect classifications, while still maintaining its abovementioned dependence on the number of total and relevant attributes (Littlestone, 1991; Kivinen and Warmuth, 1995).

Once target subnetworks have been learned and the network is being evaluated, a decision support mechanism is employed, which selects the dominant active target node in the SNoW unit via a winner-take-all mechanism to produce a final prediction. The output of the decision support mechanism may also be cached and processed along with the output of other SNoW units to produce a coherent output.

3 Modeling Shallow Parsing

3.1 Task Definition

This section describes how we model the shallow parsing tasks studied here as learning problems. The goal is to detect NPs and SV phrases. Of the several slightly different definitions of a base NP in the literature, we use for the purposes of this work the definition presented in (Ramshaw and Marcus, 1995) and used also by (Argamon et al., 1998) and others. That is, a base NP is a non-recursive NP that includes determiners but excludes post-modifying prepositional phrases or clauses. For example:

  ...presented [ last year ] in [ Illinois ] in front of ...

SV phrases, following the definition suggested in (Argamon et al., 1998), are word phrases starting with the subject of the sentence and ending with the first verb, excluding modal verbs[2]. For example, the SV phrases are bracketed in the following:

  ...presented [ a theory that claims ] that [ the algorithm runs ] and performs ...

[2] Notice that according to this definition the identified verb may not correspond to the subject, but this phrase still contains meaningful information; in any case, the learning method presented is independent of the specific definition used.

Both tasks can be viewed as sequence recognition problems. This can be modeled as a collection of prediction problems that interact in a specific way. For example, one may predict the first and last word in a target sequence. Moreover, it seems plausible that information produced by one predictor (e.g., predicting the beginning of the sequence) may contribute to others (e.g., predicting the end of the sequence). Therefore, our computational paradigm suggests using SNoW predictors that learn separately to perform each of the basic predictions, and chaining the resulting predictors at evaluation time. Chaining here means that the predictions produced by one of the predictors may be used as (a part of the) input to others[3]. Two instantiations of this paradigm - each of which models the problems using a different set of predictors - are described below.

[3] The input data used in all the experiments presented here consists of part-of-speech tagged data. In the demo of the system (available from http://l2r.cs.uiuc.edu/~cogcomp/eoh/index.html), an additional layer of chaining is used: raw sentences are supplied as input and are first processed using a SNoW based POS tagger (Roth and Zelenko, 1998).

3.2 Inside/Outside Predictors

The predictors in this case are used to decide, for each word, whether it belongs to the interior of a phrase or not; this information is then used to group the words into phrases. Since annotating words only with Inside/Outside information is ambiguous in cases of two consecutive phrases, an additional predictor is used. Specifically, each word in the sentence may be annotated using one of the following labels: O - the current word is outside the pattern; I - the current word is inside the pattern; B - the current word marks the beginning of a pattern that immediately follows another pattern[4].

[4] There are other ways to define the B annotation, e.g., as always marking the beginning of a phrase. The modeling used here, however, turns out best experimentally.

Figure 1: Architecture for the Inside/Outside method. The first predictor maps the sentence and its POS tags, through a feature extractor, to an O/I/B prediction; the second predictor takes the sentence, the POS tags and the first predictor's O/I/B output, again through a feature extractor, to produce the final O/I/B prediction.

For example, the sentence "I went to California last May" would be marked for base NPs as:

  I went to California last May
  I  O   O      I       B    I

indicating that the NPs are "I", "California" and "last May". This approach has been studied in (Ramshaw and Marcus, 1995).

3.2.1 Architecture

SNoW is used in order to learn the OIB annotations, both for NPs and SV phrases. In each case, two predictors are learned, which differ in the type of information they receive as input. A first predictor takes as input a sentence along with the corresponding part-of-speech (POS) tags. The features extracted from this input represent the local context of each word in terms of POS tags (with the possible addition of lexical information), as described in Sec. 3.4. The SNoW predictor in this case consists of three targets - O, I and B. Figure 1 depicts the feature extraction module, which extracts the local features and generates an example for each word in the sentence. Each example is labeled with one of O, I or B.

The second predictor takes as input a sentence along with the corresponding POS tags as well as the Inside/Outside information. The hope is that representing the local context of a word using the Inside/Outside information of its neighboring words, in addition to the POS and lexical information, will enhance the performance of the predictor.
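The grouping step - turning a predicted O/I/B sequence back into phrases - is mechanical; a minimal sketch of one reasonable reading of it (our own illustration, not the authors' code):

```python
def oib_to_phrases(labels):
    """Group a sequence of O/I/B labels into (start, end) phrase spans,
    end-inclusive. B closes any open phrase and opens a new one, which
    disambiguates two consecutive phrases."""
    phrases, start = [], None
    for i, label in enumerate(labels):
        if label == "O":
            if start is not None:
                phrases.append((start, i - 1))
                start = None
        elif label == "B":  # beginning of a pattern that follows another
            if start is not None:
                phrases.append((start, i - 1))
            start = i
        else:  # "I": inside; opens a phrase if none is open
            if start is None:
                start = i
    if start is not None:
        phrases.append((start, len(labels) - 1))
    return phrases
```

On the labels of the example above, I O O I B I, this returns the spans (0, 0), (3, 3) and (4, 5), i.e., "I", "California" and "last May".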

While this information is available during training, since the data is annotated with the OIB information, it is not available in the input sentence at evaluation time. Therefore, at evaluation time, given a sentence (represented as a sequence of POS tags), we first need to evaluate the first predictor on it, generate an Inside/Outside representation of the sentence, and then use this representation to generate new features that feed into the second predictor.

3.3 Open/Close Predictors

The predictors in this case are used to decide, for each word, whether it is the first in a phrase, the last in a phrase, both of these, or none of these. In this way, the phrase boundaries are determined; this is annotated by placing an open bracket ([) before the first word and a close bracket (]) after the last word of each phrase. Our earlier example would be marked for base NPs as: [I] went to [California] [last May]. This approach has been studied in (Church, 1988; Argamon et al., 1998).

Figure 2: Architecture for the Open/Close method. The open bracket and close bracket predictors each receive the sentence and its POS tags through a feature extractor; their candidate predictions are passed to a combinator that produces the final prediction.

3.3.1 Architecture

The architecture used for the Open/Close predictors is shown in Figure 2. Two SNoW predictors are used: one to predict if the word currently in consideration is the first in the phrase (an open bracket), and the other to predict if it is the last (a close bracket). Each of the two predictors is a SNoW network with two competing target nodes: one predicts if the current position is an open (close) bracket and the other predicts if it is not. In this case, the actual activation value (the sum of the weights of the active features for a given target) of the SNoW predictors is used to compute a confidence in the prediction. Let t_Y be the activation value for the yes-bracket target and t_N for the no-bracket target. Normally, the network would predict the target corresponding to the higher activation value. In this case, we prefer to cache the system preferences for each of the open (close) bracket predictors so that several bracket pairings can be considered when all the information is available. The confidence, gamma, of a candidate is defined by gamma = t_Y / (t_Y + t_N). Normally, SNoW would predict that there is a bracket if gamma >= 0.5, but this system employs a threshold tau: we consider any bracket that has gamma >= tau as a candidate. The lower tau is, the more candidates are considered.
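In code, the confidence computation and candidate filtering just described amount to the following (a sketch; the activation values used in the test are invented for illustration):

```python
def confidence(t_yes, t_no):
    # gamma = t_Y / (t_Y + t_N), from the two competing target activations
    return t_yes / (t_yes + t_no)

def bracket_candidates(activations, tau):
    """Keep every word position whose confidence gamma >= tau.
    activations: list of (t_yes, t_no) pairs, one per word position.
    Returns (position, gamma) pairs for the surviving candidates."""
    return [(i, confidence(ty, tn))
            for i, (ty, tn) in enumerate(activations)
            if confidence(ty, tn) >= tau]
```

Running the same routine for the open and the close predictor yields the two candidate lists that the combinator pairs up.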

Figure 3: Example of combinator assignment for the sentence S = s1 s2 s3 s4 s5 s6. Subscripts denote the confidence of the bracket candidates; bracket candidates that would be chosen by the combinator are marked with a *, yielding the pairing [ s1 ] s2 [ s3 s4 ] s5 s6.

The input to the open bracket predictor is a sentence and the POS tags associated with each word in the sentence. For each position in the sentence, the open bracket predictor decides whether it is a candidate for an open bracket. For each open bracket candidate, features that correspond to this information are generated; the close bracket predictor can (potentially) receive this information in addition to the sentence and the POS information, and use it in its decision on whether a given position in the sentence is a candidate for a close bracket (to be paired with the open bracket candidate).

3.3.2 Combinator

Finding the final phrases by pairing the open and close bracket candidates is crucial to the performance of the system; even given good prediction performance, choosing an inadequate pairing would severely lower the overall performance. We use a graph based method that uses the confidence of the SNoW predictors to generate consistent pairings, at only a linear time complexity.

We call p = (o, c) a pair, where o is an open bracket and c is any close bracket that was predicted with respect to o. The position of a bracket at the ith word is defined to be i if it is an open bracket and i + 1 if it is a close bracket. Clearly, a pair (o, c) is possible only when pos(o) < pos(c). The confidence of a bracket t is its weight gamma(t). The value of a pair p = (o, c) is defined to be v(p) = gamma(o) * gamma(c). The pair p1 occurs before the pair p2 if pos(c1) <= pos(o2); p1 and p2 are compatible if either p1 occurs before p2 or p2 occurs before p1. A pairing is a set of pairs P = {p1, p2, ..., pn} such that p_i is compatible with p_j for all i and j where i != j. The value of a pairing is the sum of the values of the pairs within it.

Our combinator finds the pairing with the maximum value. Note that while there may be exponentially many pairings, by modeling the problem of finding the maximum valued pairing as a shortest path problem on a directed acyclic graph, we provide a linear time solution. Figure 3 gives an example of pairing bracket candidates of the sentence S = s1 s2 s3 s4 s5 s6, where the confidence of each candidate is written in the subscript.

3.4 Features

The features used in our system are relational features over the sentence and the POS information, which can be defined by a pair of numbers, k and w. Specifically, features are either word conjunctions or POS tag conjunctions. All conjunctions of size up to k and within a symmetric window that includes the w words before and after the designated word are generated.

An example is shown in Figure 4, where (w, k) = (3, 4) for POS tags and (w, k) = (1, 2) for words. In this example the word "how" is the designated word, with POS tag "WRB". "0" marks the position of the designated word (tag) when it is not part of the feature, and "(how)" or "(WRB)" marks it when it is part of the current feature. The distance of a conjunction from the designated word (tag) can thus be induced from the placement of this special marker in the feature. We do not consider mixed features between words and POS tags as in (Ramshaw and Marcus, 1995); that is, a single feature consists of either words or tags.

Additionally, in the Inside/Outside model, the second predictor incorporates as features the OIB status of the w words before and after the designated word, and the conjunctions of size 2 of the words surrounding it.

Sentence:  This is an example of  how to generate features
POS tags:  DT   VBZ DT NN     IN WRB TO VB       NNS

Conjunctions for tags (w = 3, k = 4):
  size 1:  DT _ _ 0 | NN _ 0 | IN 0 | (WRB) | 0 TO | 0 _ VB | 0 _ _ NNS
  size 2:  DT NN _ 0 | NN IN 0 | IN (WRB) | (WRB) TO | 0 TO VB | 0 _ VB NNS
  size 3:  DT NN IN 0 | NN IN (WRB) | IN (WRB) TO | (WRB) TO VB | 0 TO VB NNS
  size 4:  DT NN IN (WRB) | NN IN (WRB) TO | IN (WRB) TO VB | (WRB) TO VB NNS

Conjunctions for words (w = 1, k = 2):
  size 1:  of 0 | (how) | 0 to
  size 2:  of (how) | (how) to

Figure 4: An example of feature extraction.
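The conjunctions illustrated in Figure 4 can be generated mechanically. The sketch below is our own reading of Sec. 3.4, with each feature encoded as an (offset, items) pair rather than the placeholder strings of the figure; positions outside the sentence are simply skipped:

```python
def conjunction_features(items, pos, w, k):
    """All conjunctions of up to k consecutive items lying inside the
    symmetric window of w items around the designated position pos
    (the designated item itself included). Each feature records the
    offset of its first item from the designated position."""
    feats = []
    lo, hi = max(0, pos - w), min(len(items) - 1, pos + w)
    for start in range(lo, hi + 1):
        for size in range(1, k + 1):
            end = start + size - 1
            if end > hi:            # conjunction must fit inside the window
                break
            feats.append((start - pos, tuple(items[start:end + 1])))
    return feats
```

For the Figure 4 tag example (window of 7 tags, k = 4) this yields 7 + 6 + 5 + 4 = 22 features, matching the four columns of the figure.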

Data  | Sentences | Words  | NP Patterns
Train | 8936      | 211727 | 54758
Test  | 2012      | 47377  | 12335

Table 1: Sizes of the training and test data sets for NP patterns.

Data | Sentences | Words | SV Patterns
Test | 1921      | 46451 | 3044

Table 2: Sizes of the training and test data sets for SV patterns.

4 Methodology

4.1 Data

In order to be able to compare our results with the results obtained by other researchers, we worked with the same data sets already used by (Ramshaw and Marcus, 1995; Argamon et al., 1998) for NP and SV detection. These data sets were based on the Wall Street Journal corpus in the Penn Treebank (Marcus et al., 1993). For NP, the training and test corpora were prepared from sections 15 to 18 and section 20, respectively; the SV corpus was prepared from sections 1 to 9 for training and section 0 for testing. Instead of using the NP bracketing information present in the tagged Treebank data, Ramshaw and Marcus modified the data so as to include bracketing information related only to the non-recursive, base NPs present in each sentence, while the subject-verb phrases were taken as is. The data sets include POS tag information generated by Ramshaw and Marcus using Brill's transformational part-of-speech tagger (Brill, 1995).

The sizes of the training and test data are summarized in Table 1 and Table 2.

4.2 Parameters

The Open/Close system has two adjustable parameters, tau_[ and tau_], the thresholds for the open and close bracket predictors, respectively. For all experiments, the system is first trained on 90% of the training data and then tested on the remaining 10%. The tau_[ and tau_] that provide the best performance on this held-out portion are then used on the real test file.

After the best parameters are found, the system is trained on the whole training data set. Results are reported in terms of recall, precision, and F_beta; F_{beta=1} is always used as the single value with which to compare performance.

For all the experiments, we use 1 as the initial weight, 5 as the threshold, 1.5 as alpha, and 0.7 as beta to train SNoW, and it is always trained for 2 cycles.

4.3 Evaluation Technique

To evaluate the results, we use the following metrics:

  Recall    = (number of correct proposed patterns) / (number of correct patterns)

  Precision = (number of correct proposed patterns) / (number of proposed patterns)

  F_beta    = ((beta^2 + 1) * Recall * Precision) / (beta^2 * Precision + Recall)

  Accuracy  = (number of words labeled correctly) / (total number of words)

We use beta = 1. Note that, for the Open/Close system, we must measure the accuracy of the open predictor and the close predictor separately, since each word can be labeled as "Open" or "Not Open" and, at the same time, "Close" or "Not Close".

5 Experimental Results

5.1 Inside/Outside

The results of each of the predictors used in the Inside/Outside method are presented in Table 3. The results are comparable to other results reported using the Inside/Outside method (Ramshaw and Marcus, 1995) (see Table 7). We have observed that most of the mistaken predictions of base NPs involve predictions with respect to conjunctions, gerunds, adverbial NPs and some punctuation marks. As reported in (Argamon et al., 1998), most base NPs present in the data are at most 4 words long. This implies that our predictors tend to break up long base NPs into smaller ones.

The results also show that lexical information improves the performance by nearly 2%. This is similar to results in the literature (Ramshaw and Marcus, 1995). What we found surprising is that the second predictor, which uses additional information about the OIB status of the local context, did not do much better than the first predictor, which relies only on POS and lexical information. A control experiment has verified that this is not due to the noisy features that the first predictor supplies to the second predictor.

Finally, the Inside/Outside method was also tested on predicting SV phrases, yielding poor results that are not shown here. An attempt at explaining this phenomenon by breaking down performance according to the length of the phrases is discussed in Sec. 5.3.

5.2 Open/Close

The results of the Open/Close method for NP and SV phrases are presented in Table 4. In addition to the good overall performance, the results show a significant improvement from incorporating the lexical information into the features. In addition to the recall/precision results, we have also presented the accuracy of each of the Open and Close predictors. These are important since they determine the overall accuracy in phrase detection. It is evident that the predictors perform very well, and that the overall performance degrades due to inconsistent pairings.

An important question in the learning approach presented here concerns the gain achieved due to chaining: can the features extracted from open brackets improve the performance of the close bracket predictor? To this effect, we measured the accuracy of the close bracket predictor itself, on a word basis, by supplying it features generated from correct open brackets. We compared this with the same experiment, only this time without incorporating the features from the open brackets into the close bracket predictor. The results, shown in Table 5, indicate a significant contribution due to chaining the features. Notice that the overall accuracy of the close bracket predictor is very high. This is due to the fact that, as shown in Table 2, there are many more negative examples than positive examples.
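The metrics of Sec. 4.3 are a direct transcription into code; with beta = 1, F_beta reduces to the familiar harmonic mean of recall and precision. A small helper of our own, with phrases compared as sets of spans:

```python
def evaluate(correct, proposed, beta=1.0):
    """correct, proposed: sets of phrase spans, e.g. (start, end) pairs.
    Returns (recall, precision, f_beta) per the formulas of Sec. 4.3."""
    hits = len(correct & proposed)  # correctly proposed patterns
    if hits == 0:
        return 0.0, 0.0, 0.0
    recall = hits / len(correct)
    precision = hits / len(proposed)
    f = ((beta ** 2 + 1) * recall * precision) / (beta ** 2 * precision + recall)
    return recall, precision, f
```

Note that a proposed phrase counts as correct only if both its boundaries match, which is why boundary errors on long phrases hurt both recall and precision at once.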

Method                     | Recall | Precision | F_{beta=1} | Accuracy
First Predictor            | 90.5   | 89.8      | 90.1       | 96.9
Second Predictor           | 90.5   | 90.4      | 90.4       | 97.0
First Predictor + lexical  | 92.5   | 92.2      | 92.4       | 97.6
Second Predictor + lexical | 92.5   | 92.1      | 92.3       | 97.6

Table 3: Results for NP detection using the Inside/Outside method.

Method          | Recall | Precision | F_{beta=1} | Accuracy
SV w/o lexical  | 88.3   | 87.9      | 88.1       | O: 98.6, C: 99.4
SV with lexical | 91.9   | 92.2      | 92.0       | O: 99.2, C: 99.4
NP w/o lexical  | 90.9   | 90.3      | 90.6       | O: 97.4, C: 97.8
NP with lexical | 93.1   | 92.4      | 92.8       | O: 98.1, C: 98.2

Table 4: Results for SV phrase and NP detection using the Open/Close method. In the accuracy column, O indicates the accuracy of the Open predictor and C indicates the accuracy of the Close predictor.

Without open bracket info      | With open bracket info
Overall | Positive Only        | Overall | Positive Only
99.3    | 92.7                 | 99.4    | 95.0

Table 5: Accuracy of the close bracket predictor when using features created from local information alone versus using additional features created from the open bracket candidate. Overall performance and performance on positive examples only are shown.

Thus, a predictor that always predicts "no" would have an accuracy of 93.4%. Therefore, we also considered the accuracy over positive examples only, which indicates the significant role of the chaining.

5.3 Discussion

Both methods we study here - Inside/Outside and Open/Close - have been evaluated before (using different learning methods) on similar tasks. However, in this work we have allowed for a fair comparison between the two different models by using the same basic learning method and the same features.

Our main conclusion concerns the robustness of the methods to sequences of different lengths. While both methods give good results for the base NP problem, they differ significantly on the SV task. Furthermore, our investigation revealed that the Inside/Outside method is very sensitive to the length of the phrases. Table 6 shows a breakdown of the performance of the two methods on SV phrases of different lengths. Perhaps this was not observed earlier since (Ramshaw and Marcus, 1995) studied only base NPs, most of which are short. The conclusion is therefore that the Open/Close method is more robust, especially when the target sequences are longer than a few tokens.

Finally, Tables 7 and 8 present a comparison of our methods to some of the best NP and SV results published on these tasks.

6 Conclusion

We have presented a SNoW based learning approach to shallow parsing tasks. In this approach, identifying a syntactic pattern is performed by writing a simple program in which several instantiations of SNoW learning units are chained and combined to produce a coherent inference. Two instantiations of this approach have been described and shown to perform very well on NP and SV phrase detection.

Length     | Patterns | Inside/Outside: Recall, Precision, F_{beta=1} | Open/Close: Recall, Precision, F_{beta=1}
<= 4       | 2212     | 90.5, 61.5, 73.2                              | 94.1, 93.5, 93.8
4 < l <= 8 | 509      | 61.4, 44.1, 51.3                              | 72.3, 79.7, 75.8
> 8        | 323      | 30.3, 15.0, 20.0                              | 74.0, 64.4, 68.9

Table 6: Comparison of Inside/Outside and Open/Close on SV patterns of varying lengths.

Method                     | Recall | Precision | F_{beta=1} | Accuracy
Inside/Outside             | 90.5   | 90.4      | 90.4       | 97.0
Inside/Outside + lexical   | 92.5   | 92.2      | 92.4       | 97.6
Open/Close                 | 90.9   | 90.3      | 90.6       | O: 97.4, C: 97.8
Open/Close + lexical       | 93.1   | 92.4      | 92.8       | O: 98.1, C: 98.2
Ramshaw & Marcus           | 90.7   | 90.5      | 90.6       | 97.0
Ramshaw & Marcus + lexical | 92.3   | 91.8      | 92.0       | 97.4
Argamon et al.             | 91.6   | 91.6      | 91.6       | N/A

Table 7: Comparison of results for NP. In the accuracy column, O indicates the accuracy of the Open predictor and C indicates the accuracy of the Close predictor.

Method               | Recall | Precision | F_{beta=1} | Accuracy
Open/Close           | 88.3   | 87.9      | 88.1       | O: 98.6, C: 99.4
Open/Close + lexical | 91.9   | 92.2      | 92.0       | O: 99.2, C: 99.4
Argamon et al.       | 84.5   | 88.6      | 86.5       | N/A

Table 8: Comparison of results for SV. In the accuracy column, O indicates the accuracy of the Open predictor and C indicates the accuracy of the Close predictor.

In addition to exhibiting good results on shallow parsing tasks, we have made some observations on the sensitivity of modeling the task. We believe that the paradigm described here, as well as the basic learning system, can be used in this way in many problems that are of interest to the NLP community.

Acknowledgments

We would like to thank Yuval Krymolowski and the reviewers for their helpful comments on the paper. We also thank Mayur Khandelwal for the suggestion to model the combinator as a graph problem.

References

S. P. Abney. 1991. Parsing by chunks. In R. C. Berwick, S. P. Abney, and C. Tenny, editors, Principle-Based Parsing: Computation and Psycholinguistics, pages 257-278. Kluwer, Dordrecht.

S. Argamon, I. Dagan, and Y. Krymolowski. 1998. A memory-based approach to learning shallow natural language patterns. In COLING-ACL 98, The 17th International Conference on Computational Linguistics.

A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386, October.

E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565.

C. Cardie and D. Pierce. 1998. Error-driven pruning of treebank grammars for base noun phrase identification. In Proceedings of ACL-98, pages 218-224.

Kenneth W. Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the ACL Conference on Applied Natural Language Processing.

A. R. Golding and D. Roth. 1999. A winnow based approach to context-sensitive spelling correction. Machine Learning, Special Issue on Machine Learning and Natural Language. Preliminary version appeared in ICML-96.

G. Greffenstette. 1993. Evaluation techniques for automatic semantic extraction: comparing semantic and window based approaches. In ACL'93 Workshop on the Acquisition of Lexical Knowledge from Text.

Z. S. Harris. 1957. Co-occurrence and transformation in linguistic structure. Language, 33(3):283-340.

J. Kivinen and M. K. Warmuth. 1995. Exponentiated gradient versus gradient descent for linear predictors. In Proceedings of the Annual ACM Symposium on the Theory of Computing.

N. Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318.

N. Littlestone. 1991. Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In Proc. 4th Annual Workshop on Computational Learning Theory, pages 147-156, San Mateo, CA. Morgan Kaufmann.

M. P. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, June.

L. A. Ramshaw and M. P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third Annual Workshop on Very Large Corpora.

D. Roth and D. Zelenko. 1998. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142.

D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proc. National Conference on Artificial Intelligence, pages 806-813.

D. Roth. 1999. The SNoW learning architecture. Technical Report UIUCDCS-R-99-2101, UIUC Computer Science Department, May.
