A Learning Approach to Shallow Parsing*

Marcia Muñoz†  Vasin Punyakanok*  Dan Roth*  Dav Zimak*

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA

* Research supported by NSF grants IIS-9801638 and SBR-9873450.
† Research supported by NSF grant CCR-9502540.

Abstract

A SNoW based learning approach to shallow parsing tasks is presented and studied experimentally. The approach learns to identify syntactic patterns by combining simple predictors to produce a coherent inference. Two instantiations of this approach are studied, and experimental results for Noun-Phrase (NP) and Subject-Verb (SV) phrases that compare favorably with the best published results are presented. In doing that, we compare two ways of modeling the problem of learning to recognize patterns and suggest that shallow parsing patterns are better learned using open/close predictors than using inside/outside predictors.

1 Introduction

Shallow parsing is studied as an alternative to full-sentence parsing. Rather than producing a complete analysis of sentences, the alternative is to perform only partial analysis of the syntactic structures in a text (Harris, 1957; Abney, 1991; Greffenstette, 1993). Shallow parsing information such as NPs and other syntactic sequences has been found useful in many large-scale language processing applications, including information extraction and text summarization. A lot of the work on shallow parsing over the past years has concentrated on manual construction of rules. The observation that shallow syntactic information can be extracted using local information - by examining the pattern itself, its nearby context and the local part-of-speech information - has motivated the use of learning methods to recognize these patterns (Church, 1988; Ramshaw and Marcus, 1995; Argamon et al., 1998; Cardie and Pierce, 1998).

This paper presents a general learning approach for identifying syntactic patterns, based on the SNoW learning architecture (Roth, 1998; Roth, 1999). The SNoW learning architecture is a sparse network of linear functions over a pre-defined or incrementally learned feature space. SNoW is specifically tailored for learning in domains in which the potential number of information sources (features) taking part in decisions is very large - of which NLP is a principal example. Preliminary versions of it have already been used successfully on several tasks in natural language processing (Roth, 1998; Golding and Roth, 1999; Roth and Zelenko, 1998). In particular, SNoW's sparse architecture supports well chaining and combining predictors to produce a coherent inference. This property of the architecture is the basis for the learning approach studied here in the context of shallow parsing.

Shallow parsing tasks often involve the identification of syntactic phrases or of words that participate in a syntactic relationship. Computationally, each decision of this sort involves multiple predictions that interact in some way. For example, in identifying a phrase, one can identify the beginning and end of the phrase while also making sure they are coherent. Our computational paradigm suggests using a SNoW based predictor as a building block that learns to perform each of the required predictions, and writing a simple program that activates these predictors with the appropriate input, aggregates their output and controls the interaction between the predictors. Two instantiations of this paradigm are studied and evaluated on two different shallow parsing tasks - identifying base NPs and SV phrases. The first instantiation uses predictors to decide whether each word belongs to the interior of a phrase or not, and then groups the words into phrases. The second instantiation finds the borders of phrases (beginning and end) and then pairs them in an "optimal" way into different phrases. These problem formulations are similar to those studied in (Ramshaw and Marcus, 1995) and (Church, 1988; Argamon et al., 1998), respectively.

The experimental results presented using the SNoW based approach compare favorably with previously published results, both for NPs and SV phrases. As important, we present a few experiments that shed light on some of the issues involved in using learned predictors that interact to produce the desired inference. In particular, we exhibit the contribution of chaining: features that are generated as the output of one of the predictors contribute to the performance of another predictor that uses them as its input. Also, the comparison between the two instantiations of the learning paradigm - the Inside/Outside and the Open/Close - shows the advantages of the Open/Close model over the Inside/Outside one, especially for the task of identifying long sequences.

The contribution of this work is in improving the state of the art in learning to perform shallow parsing tasks, in developing a better understanding of how to model these tasks as learning problems, and in further studying the SNoW based computational paradigm, which, we believe, can be used in many other related tasks in NLP. The rest of this paper is organized as follows: The SNoW architecture is presented in Sec. 2. Sec. 3 presents the shallow parsing tasks studied and provides details on the computational approach. Sec. 4 describes the data used and the experimental approach, and Sec. 5 presents and discusses the experimental results.

2 SNoW

The SNoW (Sparse Network of Winnows[1]) learning architecture is a sparse network of linear units over a common pre-defined or incrementally learned feature space. Nodes in the input layer of the network represent simple relations over the input sentence and are used as the input features. Each linear unit is called a target node and represents a relation of interest over the input sentence; in the current application, target nodes may represent a potential prediction with respect to a word in the input sentence, e.g., inside a phrase, outside a phrase, at the beginning of a phrase, etc. An input sentence, along with a designated word of interest in it, is mapped into a set of features which are active in it; this representation is presented to the input layer of SNoW and propagates to the target nodes. Target nodes are linked via weighted edges to (some of the) input features. Let A_t = {i_1, ..., i_m} be the set of features that are active in an example and are linked to the target node t. Then the linear unit is active iff sum_{i in A_t} w_i^t > theta_t, where w_i^t is the weight on the edge connecting the ith feature to the target node t, and theta_t is the threshold for the target node t.

[1] To winnow: to separate chaff from grain.

Each SNoW unit may include a collection of subnetworks, one for each of the target relations. A given example is treated autonomously by each target subnetwork; an example labeled t may be treated as a positive example by the subnetwork for t and as a negative example by the rest of the target nodes.

The learning policy is on-line and mistake-driven; several update rules can be used within SNoW. The most successful update rule, and the only one used in this work, is a variant of Littlestone's (1988) Winnow update rule, a multiplicative update rule tailored to the situation in which the set of input features is not known a priori, as in the infinite attribute model (Blum, 1992). This mechanism is implemented via the sparse architecture of SNoW. That is, (1) input features are allocated in a data driven way - an input node for the feature i is allocated only if the feature i was active in some input sentence - and (2) a link (i.e., a non-zero weight) exists between a target node t and a feature i if and only if i was active in an example labeled t.

The Winnow update rule has, in addition to the threshold theta_t at the target t, two update parameters: a promotion parameter alpha > 1 and a demotion parameter 0 < beta < 1. These are used to update the current representation of the target t (the set of weights w_i^t) only when a mistake in prediction is made. Let A_t = {i_1, ..., i_m} be the set of active features that are linked to the target node t. If the algorithm predicts 0 (that is, sum_{i in A_t} w_i^t <= theta_t) and the received label is 1, the active weights in the current example are promoted in a multiplicative fashion: for all i in A_t, w_i^t <- alpha * w_i^t. If the algorithm predicts 1 (sum_{i in A_t} w_i^t > theta_t) and the received label is 0, the active weights in the current example are demoted: for all i in A_t, w_i^t <- beta * w_i^t. All other weights are unchanged.

The key feature of the Winnow update rule is that the number of examples required to learn a linear function grows linearly with the number of relevant features and only logarithmically with the total number of features. This property seems crucial in domains in which the number of potential features is vast but a relatively small number of them is relevant. Winnow is known to learn efficiently any linear threshold function and to be robust in the presence of various kinds of noise and in cases where no linear-threshold function can make perfect classifications, while still maintaining its abovementioned dependence on the number of total and relevant attributes (Littlestone, 1991; Kivinen and Warmuth, 1995).

Once target subnetworks have been learned and the network is being evaluated, a decision support mechanism is employed, which selects the dominant active target node in the SNoW unit via a winner-take-all mechanism to produce a final prediction. The output of the decision support mechanism may also be cached and processed along with the output of other SNoW units to produce a coherent output.

3 Modeling Shallow Parsing

3.1 Task Definition

This section describes how we model the shallow parsing tasks studied here as learning problems. The goal is to detect NPs and SV phrases. Of the several slightly different definitions of a base NP in the literature, we use for the purposes of this work the definition presented in (Ramshaw and Marcus, 1995) and used also by (Argamon et al., 1998) and others. That is, a base NP is a non-recursive NP that includes determiners but excludes post-modifying prepositional phrases or clauses. For example:

  ...presented [ last year ] in [ Illinois ] in front of ...

SV phrases, following the definition suggested in (Argamon et al., 1998), are word phrases starting with the subject of the sentence and ending with the first verb, excluding modal verbs[2]. For example, the SV phrases are bracketed in the following:

  ...presented [ a theory that claims ] that [ the algorithm runs ] and performs ...

[2] Notice that according to this definition the identified verb may not correspond to the subject, but this phrase still contains meaningful information; in any case, the learning method presented is independent of the specific definition used.

Both tasks can be viewed as sequence recognition problems. This can be modeled as a collection of prediction problems that interact in a specific way. For example, one may predict the first and last word in a target sequence. Moreover, it seems plausible that information produced by one predictor (e.g., predicting the beginning of the sequence) may contribute to others (e.g., predicting the end of the sequence). Therefore, our computational paradigm suggests using SNoW predictors that learn separately to perform each of the basic predictions, and chaining the resulting predictors at evaluation time. Chaining here means that the predictions produced by one of the predictors may be used as (a part of the) input to others[3]. Two instantiations of this paradigm - each of which models the problems using a different set of predictors - are described below.

[3] The input data used in all the experiments presented here consists of part-of-speech tagged data. In the demo of the system (available from http://l2r.cs.uiuc.edu/~cogcomp/eoh/index.html), an additional layer of chaining is used: raw sentences are supplied as input and are first processed using a SNoW based POS tagger (Roth and Zelenko, 1998).

3.2 Inside/Outside Predictors

The predictors in this case are used to decide, for each word, whether it belongs to the interior of a phrase or not; this information is then used to group the words into phrases. Since annotating words only with Inside/Outside information is ambiguous in cases of two consecutive phrases, an additional predictor is used. Specifically, each word in the sentence may be annotated using one of the following labels: O - the current word is outside the pattern; I - the current word is inside the pattern; B - the current word marks the beginning of a pattern that immediately follows another pattern[4].

[4] There are other ways to define the B annotation, e.g., as always marking the beginning of a phrase. The modeling used here, however, turns out best experimentally.

Figure 1: Architecture for the Inside/Outside method. The first predictor maps the sentence and its POS tags, through a feature extractor, to an O/I/B prediction; the second predictor takes the sentence, the POS tags and the first predictor's O/I/B output, again through a feature extractor, to produce the final O/I/B prediction.

For example, the sentence "I went to California last May" would be marked for base NPs as:

  I went to California last May
  I  O   O      I       B    I

indicating that the NPs are "I", "California" and "last May". This approach has been studied in (Ramshaw and Marcus, 1995).

3.2.1 Architecture

SNoW is used in order to learn the OIB annotations, both for NPs and SV phrases. In each case, two predictors are learned, which differ in the type of information they receive as input. A first predictor takes as input a sentence along with the corresponding part-of-speech (POS) tags. The features extracted from this input represent the local context of each word in terms of POS tags (with the possible addition of lexical information), as described in Sec. 3.4. The SNoW predictor in this case consists of three targets - O, I and B. Figure 1 depicts the feature extraction module, which extracts the local features and generates an example for each word in the sentence. Each example is labeled with one of O, I or B.

The second predictor takes as input a sentence along with the corresponding POS tags as well as the Inside/Outside information. The hope is that representing the local context of a word using the Inside/Outside information of its neighboring words, in addition to the POS and lexical information, will enhance the performance of the predictor.
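The grouping step - turning a predicted O/I/B sequence back into phrases - is mechanical; a minimal sketch of one reasonable reading of it (our own illustration, not the authors' code):

```python
def oib_to_phrases(labels):
    """Group a sequence of O/I/B labels into (start, end) phrase spans,
    end-inclusive. B closes any open phrase and opens a new one, which
    disambiguates two consecutive phrases."""
    phrases, start = [], None
    for i, label in enumerate(labels):
        if label == "O":
            if start is not None:
                phrases.append((start, i - 1))
                start = None
        elif label == "B":  # beginning of a pattern that follows another
            if start is not None:
                phrases.append((start, i - 1))
            start = i
        else:  # "I": inside; opens a phrase if none is open
            if start is None:
                start = i
    if start is not None:
        phrases.append((start, len(labels) - 1))
    return phrases
```

On the labels of the example above, I O O I B I, this returns the spans (0, 0), (3, 3) and (4, 5), i.e., "I", "California" and "last May".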

While this information is available during training, since the data is annotated with the OIB information, it is not available in the input sentence at evaluation time. Therefore, at evaluation time, given a sentence (represented as a sequence of POS tags), we first need to evaluate the first predictor on it, generate an Inside/Outside representation of the sentence, and then use this representation to generate new features that feed into the second predictor.

3.3 Open/Close Predictors

The predictors in this case are used to decide, for each word, whether it is the first in a phrase, the last in a phrase, both of these, or none of these. In this way, the phrase boundaries are determined; this is annotated by placing an open bracket ([) before the first word and a close bracket (]) after the last word of each phrase. Our earlier example would be marked for base NPs as: [I] went to [California] [last May]. This approach has been studied in (Church, 1988; Argamon et al., 1998).

Figure 2: Architecture for the Open/Close method. The open bracket and close bracket predictors each receive the sentence and its POS tags through a feature extractor; their candidate predictions are passed to a combinator that produces the final prediction.

3.3.1 Architecture

The architecture used for the Open/Close predictors is shown in Figure 2. Two SNoW predictors are used: one to predict if the word currently in consideration is the first in the phrase (an open bracket), and the other to predict if it is the last (a close bracket). Each of the two predictors is a SNoW network with two competing target nodes: one predicts if the current position is an open (close) bracket and the other predicts if it is not. In this case, the actual activation value (the sum of the weights of the active features for a given target) of the SNoW predictors is used to compute a confidence in the prediction. Let t_Y be the activation value for the yes-bracket target and t_N for the no-bracket target. Normally, the network would predict the target corresponding to the higher activation value. In this case, we prefer to cache the system preferences for each of the open (close) bracket predictors so that several bracket pairings can be considered when all the information is available. The confidence, gamma, of a candidate is defined by gamma = t_Y / (t_Y + t_N). Normally, SNoW would predict that there is a bracket if gamma >= 0.5, but this system employs a threshold tau: we consider any bracket that has gamma >= tau as a candidate. The lower tau is, the more candidates are considered.
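In code, the confidence computation and candidate filtering just described amount to the following (a sketch; the activation values used in the test are invented for illustration):

```python
def confidence(t_yes, t_no):
    # gamma = t_Y / (t_Y + t_N), from the two competing target activations
    return t_yes / (t_yes + t_no)

def bracket_candidates(activations, tau):
    """Keep every word position whose confidence gamma >= tau.
    activations: list of (t_yes, t_no) pairs, one per word position.
    Returns (position, gamma) pairs for the surviving candidates."""
    return [(i, confidence(ty, tn))
            for i, (ty, tn) in enumerate(activations)
            if confidence(ty, tn) >= tau]
```

Running the same routine for the open and the close predictor yields the two candidate lists that the combinator pairs up.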

Figure 3: Example of combinator assignment for the sentence S = s1 s2 s3 s4 s5 s6. Subscripts denote the confidence of the bracket candidates; bracket candidates that would be chosen by the combinator are marked with a *, yielding the pairing [ s1 ] s2 [ s3 s4 ] s5 s6.

The input to the open bracket predictor is a sentence and the POS tags associated with each word in the sentence. For each position in the sentence, the open bracket predictor decides whether it is a candidate for an open bracket. For each open bracket candidate, features that correspond to this information are generated; the close bracket predictor can (potentially) receive this information in addition to the sentence and the POS information, and use it in its decision on whether a given position in the sentence is a candidate for a close bracket (to be paired with the open bracket candidate).

3.3.2 Combinator

Finding the final phrases by pairing the open and close bracket candidates is crucial to the performance of the system; even given good prediction performance, choosing an inadequate pairing would severely lower the overall performance. We use a graph based method that uses the confidence of the SNoW predictors to generate consistent pairings, at only a linear time complexity.

We call p = (o, c) a pair, where o is an open bracket and c is any close bracket that was predicted with respect to o. The position of a bracket at the ith word is defined to be i if it is an open bracket and i + 1 if it is a close bracket. Clearly, a pair (o, c) is possible only when pos(o) < pos(c). The confidence of a bracket t is its weight gamma(t). The value of a pair p = (o, c) is defined to be v(p) = gamma(o) * gamma(c). The pair p1 occurs before the pair p2 if pos(c1) <= pos(o2); p1 and p2 are compatible if either p1 occurs before p2 or p2 occurs before p1. A pairing is a set of pairs P = {p1, p2, ..., pn} such that p_i is compatible with p_j for all i and j where i != j. The value of a pairing is the sum of the values of the pairs within it.

Our combinator finds the pairing with the maximum value. Note that while there may be exponentially many pairings, by modeling the problem of finding the maximum valued pairing as a shortest path problem on a directed acyclic graph, we provide a linear time solution. Figure 3 gives an example of pairing bracket candidates of the sentence S = s1 s2 s3 s4 s5 s6, where the confidence of each candidate is written in the subscript.

3.4 Features

The features used in our system are relational features over the sentence and the POS information, which can be defined by a pair of numbers, k and w. Specifically, features are either word conjunctions or POS tag conjunctions. All conjunctions of size up to k and within a symmetric window that includes the w words before and after the designated word are generated.

An example is shown in Figure 4, where (w, k) = (3, 4) for POS tags and (w, k) = (1, 2) for words. In this example the word "how" is the designated word, with POS tag "WRB". "0" marks the position of the designated word (tag) when it is not part of the feature, and "(how)" or "(WRB)" marks it when it is part of the current feature. The distance of a conjunction from the designated word (tag) can thus be induced from the placement of this special marker in the feature. We do not consider mixed features between words and POS tags as in (Ramshaw and Marcus, 1995); that is, a single feature consists of either words or tags.

Additionally, in the Inside/Outside model, the second predictor incorporates as features the OIB status of the w words before and after the designated word, and the conjunctions of size 2 of the words surrounding it.

Sentence:  This is an example of  how to generate features
POS tags:  DT   VBZ DT NN     IN WRB TO VB       NNS

Conjunctions for tags (w = 3, k = 4):
  size 1:  DT _ _ 0 | NN _ 0 | IN 0 | (WRB) | 0 TO | 0 _ VB | 0 _ _ NNS
  size 2:  DT NN _ 0 | NN IN 0 | IN (WRB) | (WRB) TO | 0 TO VB | 0 _ VB NNS
  size 3:  DT NN IN 0 | NN IN (WRB) | IN (WRB) TO | (WRB) TO VB | 0 TO VB NNS
  size 4:  DT NN IN (WRB) | NN IN (WRB) TO | IN (WRB) TO VB | (WRB) TO VB NNS

Conjunctions for words (w = 1, k = 2):
  size 1:  of 0 | (how) | 0 to
  size 2:  of (how) | (how) to

Figure 4: An example of feature extraction.
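The conjunctions illustrated in Figure 4 can be generated mechanically. The sketch below is our own reading of Sec. 3.4, with each feature encoded as an (offset, items) pair rather than the placeholder strings of the figure; positions outside the sentence are simply skipped:

```python
def conjunction_features(items, pos, w, k):
    """All conjunctions of up to k consecutive items lying inside the
    symmetric window of w items around the designated position pos
    (the designated item itself included). Each feature records the
    offset of its first item from the designated position."""
    feats = []
    lo, hi = max(0, pos - w), min(len(items) - 1, pos + w)
    for start in range(lo, hi + 1):
        for size in range(1, k + 1):
            end = start + size - 1
            if end > hi:            # conjunction must fit inside the window
                break
            feats.append((start - pos, tuple(items[start:end + 1])))
    return feats
```

For the Figure 4 tag example (window of 7 tags, k = 4) this yields 7 + 6 + 5 + 4 = 22 features, matching the four columns of the figure.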

Data  | Sentences | Words  | NP Patterns
Train | 8936      | 211727 | 54758
Test  | 2012      | 47377  | 12335

Table 1: Sizes of the training and test data sets for NP patterns.

Data | Sentences | Words | SV Patterns
Test | 1921      | 46451 | 3044

Table 2: Sizes of the training and test data sets for SV patterns.

4 Methodology

4.1 Data

In order to be able to compare our results with the results obtained by other researchers, we worked with the same data sets already used by (Ramshaw and Marcus, 1995; Argamon et al., 1998) for NP and SV detection. These data sets were based on the Wall Street Journal corpus in the Penn Treebank (Marcus et al., 1993). For NP, the training and test corpora were prepared from sections 15 to 18 and section 20, respectively; the SV corpus was prepared from sections 1 to 9 for training and section 0 for testing. Instead of using the NP bracketing information present in the tagged Treebank data, Ramshaw and Marcus modified the data so as to include bracketing information related only to the non-recursive, base NPs present in each sentence, while the subject-verb phrases were taken as is. The data sets include POS tag information generated by Ramshaw and Marcus using Brill's transformational part-of-speech tagger (Brill, 1995).

The sizes of the training and test data are summarized in Table 1 and Table 2.

4.2 Parameters

The Open/Close system has two adjustable parameters, tau_[ and tau_], the thresholds for the open and close bracket predictors, respectively. For all experiments, the system is first trained on 90% of the training data and then tested on the remaining 10%. The tau_[ and tau_] that provide the best performance on this held-out portion are then used on the real test file.

After the best parameters are found, the system is trained on the whole training data set. Results are reported in terms of recall, precision, and F_beta; F_{beta=1} is always used as the single value with which to compare performance.

For all the experiments, we use 1 as the initial weight, 5 as the threshold, 1.5 as alpha, and 0.7 as beta to train SNoW, and it is always trained for 2 cycles.

4.3 Evaluation Technique

To evaluate the results, we use the following metrics:

  Recall    = (number of correct proposed patterns) / (number of correct patterns)

  Precision = (number of correct proposed patterns) / (number of proposed patterns)

  F_beta    = ((beta^2 + 1) * Recall * Precision) / (beta^2 * Precision + Recall)

  Accuracy  = (number of words labeled correctly) / (total number of words)

We use beta = 1. Note that, for the Open/Close system, we must measure the accuracy of the open predictor and the close predictor separately, since each word can be labeled as "Open" or "Not Open" and, at the same time, "Close" or "Not Close".

5 Experimental Results

5.1 Inside/Outside

The results of each of the predictors used in the Inside/Outside method are presented in Table 3. The results are comparable to other results reported using the Inside/Outside method (Ramshaw and Marcus, 1995) (see Table 7). We have observed that most of the mistaken predictions of base NPs involve predictions with respect to conjunctions, gerunds, adverbial NPs and some punctuation marks. As reported in (Argamon et al., 1998), most base NPs present in the data are at most 4 words long. This implies that our predictors tend to break up long base NPs into smaller ones.

The results also show that lexical information improves the performance by nearly 2%. This is similar to results in the literature (Ramshaw and Marcus, 1995). What we found surprising is that the second predictor, which uses additional information about the OIB status of the local context, did not do much better than the first predictor, which relies only on POS and lexical information. A control experiment has verified that this is not due to the noisy features that the first predictor supplies to the second predictor.

Finally, the Inside/Outside method was also tested on predicting SV phrases, yielding poor results that are not shown here. An attempt at explaining this phenomenon by breaking down performance according to the length of the phrases is discussed in Sec. 5.3.

5.2 Open/Close

The results of the Open/Close method for NP and SV phrases are presented in Table 4. In addition to the good overall performance, the results show a significant improvement from incorporating the lexical information into the features. In addition to the recall/precision results, we have also presented the accuracy of each of the Open and Close predictors. These are important since they determine the overall accuracy in phrase detection. It is evident that the predictors perform very well, and that the overall performance degrades due to inconsistent pairings.

An important question in the learning approach presented here concerns the gain achieved due to chaining: can the features extracted from open brackets improve the performance of the close bracket predictor? To this effect, we measured the accuracy of the close bracket predictor itself, on a word basis, by supplying it features generated from correct open brackets. We compared this with the same experiment, only this time without incorporating the features from the open brackets into the close bracket predictor. The results, shown in Table 5, indicate a significant contribution due to chaining the features. Notice that the overall accuracy of the close bracket predictor is very high. This is due to the fact that, as shown in Table 2, there are many more negative examples than positive examples.
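The metrics of Sec. 4.3 are a direct transcription into code; with beta = 1, F_beta reduces to the familiar harmonic mean of recall and precision. A small helper of our own, with phrases compared as sets of spans:

```python
def evaluate(correct, proposed, beta=1.0):
    """correct, proposed: sets of phrase spans, e.g. (start, end) pairs.
    Returns (recall, precision, f_beta) per the formulas of Sec. 4.3."""
    hits = len(correct & proposed)  # correctly proposed patterns
    if hits == 0:
        return 0.0, 0.0, 0.0
    recall = hits / len(correct)
    precision = hits / len(proposed)
    f = ((beta ** 2 + 1) * recall * precision) / (beta ** 2 * precision + recall)
    return recall, precision, f
```

Note that a proposed phrase counts as correct only if both its boundaries match, which is why boundary errors on long phrases hurt both recall and precision at once.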

Method                     | Recall | Precision | F_{beta=1} | Accuracy
First Predictor            | 90.5   | 89.8      | 90.1       | 96.9
Second Predictor           | 90.5   | 90.4      | 90.4       | 97.0
First Predictor + lexical  | 92.5   | 92.2      | 92.4       | 97.6
Second Predictor + lexical | 92.5   | 92.1      | 92.3       | 97.6

Table 3: Results for NP detection using the Inside/Outside method.

Method          | Recall | Precision | F_{beta=1} | Accuracy
SV w/o lexical  | 88.3   | 87.9      | 88.1       | O: 98.6, C: 99.4
SV with lexical | 91.9   | 92.2      | 92.0       | O: 99.2, C: 99.4
NP w/o lexical  | 90.9   | 90.3      | 90.6       | O: 97.4, C: 97.8
NP with lexical | 93.1   | 92.4      | 92.8       | O: 98.1, C: 98.2

Table 4: Results for SV phrase and NP detection using the Open/Close method. In the accuracy column, O indicates the accuracy of the Open predictor and C indicates the accuracy of the Close predictor.

Without open bracket info      | With open bracket info
Overall | Positive Only        | Overall | Positive Only
99.3    | 92.7                 | 99.4    | 95.0

Table 5: Accuracy of the close bracket predictor when using features created from local information alone versus using additional features created from the open bracket candidate. Overall performance and performance on positive examples only are shown.

Thus, a predictor that always predicts "no" would have an accuracy of 93.4%. Therefore, we also considered the accuracy over positive examples only, which indicates the significant role of the chaining.

5.3 Discussion

Both methods we study here - Inside/Outside and Open/Close - have been evaluated before (using different learning methods) on similar tasks. However, in this work we have allowed for a fair comparison between the two different models by using the same basic learning method and the same features.

Our main conclusion concerns the robustness of the methods to sequences of different lengths. While both methods give good results for the base NP problem, they differ significantly on the SV task. Furthermore, our investigation revealed that the Inside/Outside method is very sensitive to the length of the phrases. Table 6 shows a breakdown of the performance of the two methods on SV phrases of different lengths. Perhaps this was not observed earlier since (Ramshaw and Marcus, 1995) studied only base NPs, most of which are short. The conclusion is therefore that the Open/Close method is more robust, especially when the target sequences are longer than a few tokens.

Finally, Tables 7 and 8 present a comparison of our methods to some of the best NP and SV results published on these tasks.

6 Conclusion

We have presented a SNoW based learning approach to shallow parsing tasks. In this approach, identifying a syntactic pattern is performed by writing a simple program in which several instantiations of SNoW learning units are chained and combined to produce a coherent inference. Two instantiations of this approach have been described and shown to perform very well on NP and SV phrase detection.

Length     | Patterns | Inside/Outside: Recall, Precision, F_{beta=1} | Open/Close: Recall, Precision, F_{beta=1}
<= 4       | 2212     | 90.5, 61.5, 73.2                              | 94.1, 93.5, 93.8
4 < l <= 8 | 509      | 61.4, 44.1, 51.3                              | 72.3, 79.7, 75.8
> 8        | 323      | 30.3, 15.0, 20.0                              | 74.0, 64.4, 68.9

Table 6: Comparison of Inside/Outside and Open/Close on SV patterns of varying lengths.

Method                     | Recall | Precision | F_{beta=1} | Accuracy
Inside/Outside             | 90.5   | 90.4      | 90.4       | 97.0
Inside/Outside + lexical   | 92.5   | 92.2      | 92.4       | 97.6
Open/Close                 | 90.9   | 90.3      | 90.6       | O: 97.4, C: 97.8
Open/Close + lexical       | 93.1   | 92.4      | 92.8       | O: 98.1, C: 98.2
Ramshaw & Marcus           | 90.7   | 90.5      | 90.6       | 97.0
Ramshaw & Marcus + lexical | 92.3   | 91.8      | 92.0       | 97.4
Argamon et al.             | 91.6   | 91.6      | 91.6       | N/A

Table 7: Comparison of results for NP. In the accuracy column, O indicates the accuracy of the Open predictor and C indicates the accuracy of the Close predictor.

Method               | Recall | Precision | F_{beta=1} | Accuracy
Open/Close           | 88.3   | 87.9      | 88.1       | O: 98.6, C: 99.4
Open/Close + lexical | 91.9   | 92.2      | 92.0       | O: 99.2, C: 99.4
Argamon et al.       | 84.5   | 88.6      | 86.5       | N/A

Table 8: Comparison of results for SV. In the accuracy column, O indicates the accuracy of the Open predictor and C indicates the accuracy of the Close predictor.

In addition to exhibiting good results on shallow parsing tasks, we have made some observations on the sensitivity of modeling the task. We believe that the paradigm described here, as well as the basic learning system, can be used in this way in many problems that are of interest to the NLP community.

Acknowledgments

We would like to thank Yuval Krymolowski and the reviewers for their helpful comments on the paper. We also thank Mayur Khandelwal for the suggestion to model the combinator as a graph problem.

References

S. P. Abney. 1991. Parsing by chunks. In R. C. Berwick, S. P. Abney, and C. Tenny, editors, Principle-Based Parsing: Computation and Psycholinguistics, pages 257-278. Kluwer, Dordrecht.

S. Argamon, I. Dagan, and Y. Krymolowski. 1998. A memory-based approach to learning shallow natural language patterns. In COLING-ACL 98, The 17th International Conference on Computational Linguistics.

A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386, October.

E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565.

C. Cardie and D. Pierce. 1998. Error-driven pruning of treebank grammars for base noun phrase identification. In Proceedings of ACL-98, pages 218-224.

Kenneth W. Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the ACL Conference on Applied Natural Language Processing.

A. R. Golding and D. Roth. 1999. A winnow based approach to context-sensitive spelling correction. Machine Learning, Special Issue on Machine Learning and Natural Language. Preliminary version appeared in ICML-96.

G. Greffenstette. 1993. Evaluation techniques for automatic semantic extraction: comparing semantic and window based approaches. In ACL'93 Workshop on the Acquisition of Lexical Knowledge from Text.

Z. S. Harris. 1957. Co-occurrence and transformation in linguistic structure. Language, 33(3):283-340.

J. Kivinen and M. K. Warmuth. 1995. Exponentiated gradient versus gradient descent for linear predictors. In Proceedings of the Annual ACM Symposium on the Theory of Computing.

N. Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318.

N. Littlestone. 1991. Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In Proc. 4th Annual Workshop on Computational Learning Theory, pages 147-156, San Mateo, CA. Morgan Kaufmann.

M. P. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, June.

L. A. Ramshaw and M. P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third Annual Workshop on Very Large Corpora.

D. Roth and D. Zelenko. 1998. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142.

D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proc. National Conference on Artificial Intelligence, pages 806-813.

D. Roth. 1999. The SNoW learning architecture. Technical Report UIUCDCS-R-99-2101, UIUC Computer Science Department, May.
