Exploring Evidence for Shallow Parsing

Xin Li and Dan Roth

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801
[email protected] [email protected]

Abstract

A significant amount of work has recently been devoted to developing learning techniques that can be used to generate partial (shallow) analyses of natural language sentences rather than a full parse. In this work we set out to evaluate whether this direction is worthwhile by comparing a learned shallow parser to one of the best learned full parsers on tasks both can perform — identifying phrases in sentences. We conclude that directly learning to perform these tasks as shallow parsers do is advantageous over full parsers both in terms of performance and robustness to new and lower quality texts.

* This research is supported by NSF grants IIS-9801638, ITR-IIS-0085836 and an ONR MURI Award.

1 Introduction

Shallow parsing is studied as an alternative to full-sentence parsing. Rather than producing a complete analysis of sentences, the alternative is to perform only partial analysis of the syntactic structures in a text (Harris, 1957; Abney, 1991; Greffenstette, 1993). A lot of recent work on shallow parsing has been influenced by Abney's work (Abney, 1991), who suggested "chunking" sentences into base-level phrases. For example, the sentence "He reckons the current account deficit will narrow to only $ 1.8 billion in September ." would be chunked as follows (Tjong Kim Sang and Buchholz, 2000):

[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only $ 1.8 billion ] [PP in ] [NP September ] .

While earlier work in this direction concentrated on manual construction of rules, most of the recent work has been motivated by the observation that shallow syntactic information can be extracted using local information — by examining the pattern itself, its nearby context and the local part-of-speech information. Thus, over the past few years, along with advances in the use of learning and statistical methods for acquisition of full parsers (Collins, 1997; Charniak, 1997a; Charniak, 1997b; Ratnaparkhi, 1997), significant progress has been made on the use of statistical learning methods to recognize shallow parsing patterns — syntactic phrases or words that participate in a syntactic relationship (Church, 1988; Ramshaw and Marcus, 1995; Argamon et al., 1998; Cardie and Pierce, 1998; Munoz et al., 1999; Punyakanok and Roth, 2001; Buchholz et al., 1999; Tjong Kim Sang and Buchholz, 2000).

Research on shallow parsing was inspired by psycholinguistic arguments (Gee and Grosjean, 1983) suggesting that in many scenarios (e.g., conversational) full parsing is not a realistic strategy for sentence processing and analysis, and was further motivated by several arguments from a natural language engineering viewpoint.

First, it has been noted that in many natural language applications it is sufficient to use shallow parsing information; information such as noun phrases (NPs) and other syntactic sequences has been found useful in many large-scale language processing applications including information extraction and text summarization (Grishman, 1995; Appelt et al., 1993). Second, while training a full parser requires a collection of fully parsed sentences as a training corpus, it is possible to train a shallow parser incrementally. If all that is available is a collection of sentences annotated for NPs, it can be used to produce this level of analysis. This can be augmented later if more information is available. Finally, the hope behind this research direction was that this incremental and modular processing might result in more robust parsing decisions, especially in cases of spoken language or other cases in which the quality of the natural language inputs is low — sentences which may have repeated words, missing words, or other lexical and syntactic mistakes.

Overall, the driving force behind the work on learning shallow parsers was the desire to get better performance and higher reliability. However, since work in this direction started, significant progress has also been made in the research on statistical learning of full parsers, both in terms of accuracy and processing time (Charniak, 1997b; Charniak, 1997a; Collins, 1997; Ratnaparkhi, 1997).

This paper investigates the question of whether work on shallow parsing is worthwhile. That is, we attempt to evaluate quantitatively the intuitions described above — that learning to perform shallow parsing could be more accurate and more robust than learning to generate full parses. We do that by concentrating on the task of identifying the phrase structure of sentences — a byproduct of full parsers that can also be produced by shallow parsers. We investigate two instantiations of this task, "chunking" and identifying atomic phrases. And, to study robustness, we run our experiments both on standard Penn Treebank data (part of which is used for training the parsers) and on lower quality data — the Switchboard data.

Our conclusions are quite clear. Indeed, shallow parsers that are specifically trained to perform the tasks of identifying the phrase structure of a sentence are more accurate and more robust than full parsers. We believe that this finding not only justifies work in this direction, but may even suggest that it would be worthwhile to use this methodology incrementally, to learn a more complete parser, if needed.

2 Experimental Design

In order to run a fair comparison between full parsers and shallow parsers — which could produce quite different outputs — we have chosen the task of identifying the phrase structure of a sentence. This structure can be easily extracted from the outcome of a full parser, and a shallow parser can be trained specifically on this task.

There is no agreement on how to define phrases in sentences. The definition could depend on downstream applications and could range from simple syntactic patterns to the message units people use in conversations. For the purpose of this study, we chose to use two different definitions. Both can be formally defined, and they reflect different levels of shallow parsing patterns.

The first is the one used in the chunking competition in CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000). In this case, a full parse tree is represented in a flat form, producing a representation as in the example above. The goal in this case is therefore to accurately predict a collection of 11 different types of phrases. The chunk types are based on the syntactic category part of the bracket label in the Treebank. Roughly, a chunk contains everything to the left of and including the syntactic head of the constituent of the same name. The phrases are: adjective phrase (ADJP), adverb phrase (ADVP), conjunction phrase (CONJP), interjection phrase (INTJ), list marker (LST), noun phrase (NP), preposition phrase (PP), particle (PRT), subordinated clause (SBAR), unlike coordinated phrase (UCP), and verb phrase (VP). (See details in (Tjong Kim Sang and Buchholz, 2000).)
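In CoNLL-2000's column format, each token carries a B-XP, I-XP or O tag (the BIO scheme, which also appears in Table 6 below). As a minimal sketch of how such tags map back to typed phrase spans (our own illustration; the function name and the toy tag sequence are not from the paper):

```python
def bio_to_phrases(tokens, tags):
    """Collect (phrase_type, first_index, last_index) spans from BIO tags.

    CoNLL-2000 convention: B-XP opens a phrase of type XP, I-XP continues
    it, and O marks tokens outside any phrase.
    """
    phrases, start, kind = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and kind != tag[2:]):
            if kind is not None:           # close the phrase in progress
                phrases.append((kind, start, i - 1))
            start, kind = i, tag[2:]
        elif tag == "O":
            if kind is not None:
                phrases.append((kind, start, i - 1))
            start, kind = None, None
    if kind is not None:                   # phrase running to sentence end
        phrases.append((kind, start, len(tags) - 1))
    return phrases

tokens = "He reckons the current account deficit will narrow".split()
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP"]
print(bio_to_phrases(tokens, tags))
# [('NP', 0, 0), ('VP', 1, 1), ('NP', 2, 5), ('VP', 6, 7)]
```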
The second definition used is that of atomic phrases. An atomic phrase represents the most basic phrase, with no nested sub-phrases. For example, in the parse tree

( (S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old) ,) (VP will (VP join (NP the board) (PP as (NP a nonexecutive director)) (NP Nov. 29))) .))

"Pierre Vinken", "61 years", "the board", "a nonexecutive director" and "Nov. 29" are atomic phrases, while the other, higher-level phrases are not. That is, an atomic phrase denotes a tightly coupled message unit which is just above the level of single words.
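The atomic-phrase definition is straightforward to operationalize. The sketch below is our own illustration, assuming the nltk package for reading bracketed trees (the paper prescribes no implementation); it treats a constituent as atomic exactly when all of its children are words, which matches the simplified tree above (real Treebank trees insert POS preterminals between phrases and words, so they would need an extra flattening step):

```python
from nltk import Tree  # assumed dependency; any s-expression reader would do

def atomic_phrases(tree):
    """Yield (label, text) for subtrees with no nested sub-phrases."""
    for sub in tree.subtrees():
        if all(isinstance(child, str) for child in sub):
            yield (sub.label(), " ".join(sub.leaves()))

# The example tree, with the empty outer bracketing dropped.
parse = Tree.fromstring(
    "(S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old) ,)"
    " (VP will (VP join (NP the board)"
    " (PP as (NP a nonexecutive director)) (NP Nov. 29))) .)"
)
print(list(atomic_phrases(parse)))
# [('NP', 'Pierre Vinken'), ('NP', '61 years'), ('NP', 'the board'),
#  ('NP', 'a nonexecutive director'), ('NP', 'Nov. 29')]
```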

2.1 Parsers

We perform our comparison using two state-of-the-art parsers. For the full parser, we use the one developed by Michael Collins (Collins, 1996; Collins, 1997) — one of the most accurate full parsers around. It represents a full parse tree as a set of basic phrases and a set of dependency relationships between them. Statistical learning techniques are used to compute the probabilities of these phrases and of candidate dependency relations occurring in that sentence. After that, it chooses the candidate parse tree with the highest probability as its output. The experiments use the version that was trained (by Collins) on sections 02-21 of the Penn Treebank. The reported results for the full parse tree (on section 23) are recall/precision of 88.1/87.5 (Collins, 1997).

The shallow parser used is the SNoW-based CSCL parser (Punyakanok and Roth, 2001; Munoz et al., 1999). SNoW (Carleson et al., 1999; Roth, 1998) is a multi-class classifier that is specifically tailored for learning in domains in which the potential number of information sources (features) taking part in decisions is very large, of which NLP is a principal example. It works by learning a sparse network of linear functions over a pre-defined or incrementally learned feature space. Typically, SNoW is used as a classifier and predicts using a winner-take-all mechanism over the activation values of the target classes. However, in addition to the prediction, it provides a reliable confidence level in the prediction, which enables its use in an inference algorithm that combines predictors to produce a coherent inference. Indeed, in CSCL (constraint satisfaction with classifiers), SNoW is used to learn several different classifiers — each detects the beginning or end of a phrase of some type (noun phrase, verb phrase, etc.). The outcomes of these classifiers are then combined in a way that satisfies some constraints — non-overlapping constraints in this case — using an efficient constraint satisfaction mechanism that makes use of the confidence in the classifiers' outcomes.
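The exact inference procedure appears in (Punyakanok and Roth, 2001); as a rough stand-in (our own simplification, not the actual CSCL algorithm), one can picture candidate phrases, scored from the open/close classifier activations, being filtered greedily under the non-overlapping constraint:

```python
def select_non_overlapping(candidates):
    """Greedy stand-in for CSCL inference: keep the highest-confidence
    candidate phrases, skipping any that overlap an already chosen one.

    candidates: (confidence, start, end, phrase_type) tuples, e.g. built
    by pairing "open" and "close" classifier activations.
    """
    chosen, used = [], set()
    for conf, start, end, kind in sorted(candidates, reverse=True):
        span = set(range(start, end + 1))
        if span & used:
            continue  # would violate the non-overlapping constraint
        chosen.append((start, end, kind, conf))
        used |= span
    return sorted(chosen)

candidates = [
    (0.9, 0, 2, "NP"), (0.8, 3, 4, "VP"),
    (0.6, 2, 4, "NP"),  # overlaps both winners, so it is dropped
]
print(select_non_overlapping(candidates))
# [(0, 2, 'NP', 0.9), (3, 4, 'VP', 0.8)]
```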

Since earlier versions of the SNoW-based CSCL were used only to identify single phrases (Punyakanok and Roth, 2001; Munoz et al., 1999) and never to identify a collection of several phrases at the same time, as we do here, we also trained and tested it under the exact conditions of CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000) to compare it to other shallow parsers. Table 1 shows that it ranks among the top shallow parsers evaluated there [1].

Table 1: Rankings of shallow parsers in CoNLL-2000. See (Tjong Kim Sang and Buchholz, 2000) for details.

Parsers    Precision (%)  Recall (%)  F(β=1) (%)
[KM00]     93.45          93.51       93.48
[Hal00]    93.13          93.51       93.32
[CSCL]*    93.41          92.64       93.02
[TKS00]    94.04          91.00       92.50
[ZST00]    91.99          92.25       92.12
[Dej00]    91.87          92.31       92.09
[Koe00]    92.08          91.86       91.97
[Osb00]    91.65          92.23       91.94
[VB00]     91.05          92.03       91.54
[PMP00]    90.63          89.65       90.14
[Joh00]    86.24          88.25       87.23
[VD00]     88.82          82.91       85.76
Baseline   72.58          82.14       77.07

[1] We note that some of the variations in the results are due to variations in experimental methodology rather than parser quality. For example, in [KM00], rather than learning a classifier for each of the different phrases, a discriminator is learned for each pair of phrase types, which, statistically, yields better results. [Hal00] also uses several different parsers and reports the results of some voting mechanism on top of these.

2.2 Data

Training was done on the Penn Treebank (Marcus et al., 1993) Wall Street Journal data, sections 02-21. To train the CSCL shallow parser we first had to convert the WSJ data to a flat format that directly provides the phrase annotations. This is done using the "chunklink" program provided for CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000).

Testing was done on two types of data. For the first experiment, we used WSJ section 00 (which contains about 45,000 tokens and 23,500 phrases). The goal here was simply to evaluate the full parser and the shallow parser on text that is similar to the one they were trained on.

Our robustness test (section 3.2) makes use of section 4 of the Switchboard (SWB) data (which contains about 57,000 tokens and 17,000 phrases), taken from Treebank 3. The Switchboard data contains conversation records transcribed from phone calls. The goal here was twofold: first, to evaluate the parsers on a data source that is different from the training source; more importantly, to evaluate the parsers on low quality data and observe the absolute performance as well as the relative degradation in performance.

The following sentence is a typical example of the SWB data.

Huh/UH ,/, well/UH ,/, um/UH ,/, you/PRP know/VBP ,/, I/PRP guess/VBP it/PRP 's/BES pretty/RB deep/JJ feelings/NNS ,/, uh/UH ,/, I/PRP just/RB ,/, uh/UH ,/, went/VBD back/RB and/CC rented/VBD ,/, uh/UH ,/, the/DT movie/NN ,/, what/WP is/VBZ it/PRP ,/, GOOD/JJ MORNING/NN VIET/NNP NAM/NNP ./.

The fact that it has some missing words, repeated words and frequent interruptions makes it suitable data for testing the robustness of parsers.

2.3 Representation

We had to do some work in order to unify the input and output representations for both parsers. Both parsers take sentences annotated with POS tags as their input. We used the POS tags in the WSJ and converted both the WSJ and the SWB data into the parsers' slightly different input formats. We also had to convert the outcomes of the parsers in order to evaluate them in a fair way. We chose CoNLL-2000's chunking format as our standard output format and converted Collins' parser outcome into this format.

2.4 Performance Measure

The results are reported in terms of precision, recall, and F(β=1), as defined below:

  Precision = number of correct proposed patterns / number of proposed patterns

  Recall = number of correct proposed patterns / number of correct patterns

  F(β) = (β² + 1) · Precision · Recall / (β² · Precision + Recall),  with β = 1

We have used the evaluation procedure of CoNLL-2000 to produce the results below. Although we do not report significance results here, note that all experiments were done on tens of thousands of instances, and clearly all differences and ratios measured are statistically significant.
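As a minimal sketch of these measures (our own illustration; the actual scoring used the CoNLL-2000 evaluation procedure), a proposed phrase counts as correct only when both its type and its boundaries match a gold phrase exactly:

```python
def prf(proposed, correct, beta=1.0):
    """Precision, recall and F_beta over exact-match phrase sets.

    proposed, correct: collections of (phrase_type, first, last) spans.
    """
    proposed, correct = set(proposed), set(correct)
    hits = len(proposed & correct)
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

gold = [("NP", 0, 0), ("VP", 1, 1), ("NP", 2, 5)]
guess = [("NP", 0, 0), ("VP", 1, 1), ("NP", 3, 5)]  # one boundary error
print(prf(guess, gold))  # (0.666..., 0.666..., 0.666...)
```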

3 Experimental Results

3.1 Performance

We start by reporting the results in which we compare the full parser and the shallow parser on the "clean" WSJ data. Table 2 shows the results on identifying all phrases — chunking in CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000) terminology. The results show that for the task of identifying phrases, learning directly, as done by the shallow parser, outperforms the outcome of the full parser.

Table 2: Precision & recall for phrase identification (chunking) for the full and the shallow parser on the WSJ data. Results are shown for a (weighted) average of 11 types of phrases as well as for two of the most common phrases, NP and VP.

        Full Parser               Shallow Parser
        P      R      F(β=1)     P      R      F(β=1)
Avr     91.71  92.21  91.96      93.85  95.45  94.64
NP      93.10  92.05  92.57      93.83  95.92  94.87
VP      86.00  90.42  88.15      95.50  95.05  95.28

Next, we compared the performance of the parsers on the task of identifying atomic phrases [2]. Here, again, the shallow parser exhibits significantly better performance. Table 3 shows the results of extracting atomic phrases.

[2] As a side note — the fact that the same program could be trained to recognize patterns of different levels in such an easy way, only by changing the annotations of the training data, could also be viewed as an advantage of the shallow parsing paradigm.

Table 3: Precision & recall for atomic phrase identification on the WSJ data. Results are shown for a (weighted) average of 11 types of phrases as well as for the most common phrase, NP. VP occurs very infrequently as an atomic phrase.

        Full Parser               Shallow Parser
        P      R      F(β=1)     P      R      F(β=1)
Avr     88.68  90.45  89.56      92.02  93.61  92.81
NP      91.86  92.16  92.01      93.54  95.88  94.70

3.2 Robustness

Next we present the results of evaluating the robustness of the parsers on lower quality data. Table 4 describes the results of evaluating the same parsers as above (both trained as before on the same WSJ sections) on the SWB data. It is evident that on this data the difference between the performance of the two parsers is even more significant.

Table 4: Switchboard data: precision & recall for phrase identification (chunking) on the Switchboard data. Results are shown for a (weighted) average of 11 types of phrases as well as for two of the most common phrases, NP and VP.

        Full Parser               Shallow Parser
        P      R      F(β=1)     P      R      F(β=1)
Avr     81.54  83.79  82.65      86.50  90.54  88.47
NP      88.29  88.96  88.62      90.50  92.59  91.54
VP      70.61  83.53  76.52      85.30  89.76  87.47

This is shown more clearly in Table 5, which compares the relative degradation in performance each of the parsers suffers when moving from the WSJ to the SWB data (Table 2 vs. Table 4). While the performance of both parsers goes down when they are tested on the SWB, relative to the WSJ performance, it is clear that the shallow parser's performance degrades more gracefully. These results clearly indicate the higher-level robustness of the shallow parser.

Table 5: Robustness: relative degradation in F(β=1) results for chunking. For each parser the result shown is the ratio between the result on the "noisy" SWB data and the "clean" WSJ corpus data.

        Full Parser  Shallow Parser
Avr     .89          .93
NP      .95          .96
VP      .86          .92
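For reference, each entry in Table 5 is the SWB F(β=1) score divided by the corresponding WSJ score from Tables 2 and 4; the published ratios appear to be truncated rather than rounded (e.g., 82.65/91.96 ≈ 0.899 is reported as .89). A small check over the full-parser column:

```python
# Reproduce the full-parser column of Table 5 from Tables 2 and 4.
for phrase, wsj, swb in [("Avr", 91.96, 82.65), ("NP", 92.57, 88.62),
                         ("VP", 88.15, 76.52)]:
    print(phrase, int(swb / wsj * 100) / 100)  # Avr 0.89, NP 0.95, VP 0.86
```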

3.3 Discussion

Analyzing the results shown above is outside the scope of this short abstract. We will only provide one example that might shed some light on the reasons for the more significant degradation in the results of the full parser. Table 6 exhibits the results of chunking as given by Collins' parser. The four columns are the original words, the POS tags, and the phrases — encoded using the BIO scheme (B- beginning of phrase; I- inside the phrase; O- outside the phrase) — with the true annotation and Collins' annotation.

Table 6: An example: a parsing mistake.

WORD          POS   TRUE     Collins
Um            UH    B-INTJ   B-INTJ
,             ,     O        I-INTJ
Mostly        RB    O        I-INTJ
,             ,     O        O
um            UH    B-INTJ   B-INTJ
,             ,     O        O
word          NN    B-NP     B-NP
processing    NN    I-NP     B-VP
applications  NNS   I-NP     B-NP
and           CC    O        O
,             ,     O        O
uh            UH    B-INTJ   B-INTJ
,             ,     O        O
just          RB    B-ADVP   B-PP
as            IN    B-PP     I-PP
a             DT    B-NP     B-NP
dumb          JJ    I-NP     I-NP
terminal      NN    I-NP     I-NP
.             .     O        O

The mistakes in the phrase identification (e.g., in "word processing applications") seem to be a result of assuming, perhaps due to the "um" and the additional punctuation marks, that this is a separate sentence rather than a phrase. Under this assumption, the full parser tries to make it a complete sentence and decides that "processing" is a "verb" in the parsing result. This seems to be a typical example of the mistakes made by the full parser.

4 Conclusion

Full parsing and shallow parsing are two different strategies for parsing natural languages. While full parsing provides more complete information about a sentence, shallow parsing is more flexible, easier to train, and can be targeted at specific, limited subtasks. Several arguments have been used in the past to argue, on an intuitive basis, that (1) shallow parsing is sufficient for a wide range of applications and that (2) shallow parsing could be more reliable than full parsing in handling ill-formed real-world sentences, such as sentences that occur in conversational situations.

While the former is an experimental issue that is still open, this paper has tried to evaluate experimentally the latter argument. Although the experiments reported here only compare the performance of one full parser and one shallow parser, we believe that these state-of-the-art parsers represent their class quite well. Our results show that on the specific tasks for which we have trained the shallow parser — identifying several kinds of phrases — the shallow parser performs more accurately and more robustly than the full parser. In some sense, these results validate the research in this direction.

Clearly, there are several directions for future work that this preliminary work suggests. First, in our experiments, Collins' parser is trained on the Treebank data and tested on the lower quality data.
It would be interesting to see what the results are if lower quality data is also used for training. Second, our decision to run the experiments on two different ways of decomposing a sentence into phrases was somewhat arbitrary (although we believe that selecting phrases in a different way would not affect the results). It does reflect, however, the fact that it is not completely clear what kinds of shallow parsing information one should try to extract in real applications. Making progress in the direction of a formal definition of phrases and experimenting with these along the lines of the current study would also be useful. Finally, an experimental comparison on several other shallow parsing tasks, such as various attachment and relation detection tasks, is also an important direction that will enhance this work.

5 Acknowledgments

We are grateful to Vasin Punyakanok for his advice in this project and for his help in using the CSCL parser. We also thank Michael Collins for making his parser available to us.

References

S. P. Abney. 1991. Parsing by chunks. In R. C. Berwick, S. P. Abney, and C. Tenny, editors, Principle-Based Parsing: Computation and Psycholinguistics, pages 257-278. Kluwer, Dordrecht.

D. Appelt, J. Hobbs, J. Bear, D. Israel, and M. Tyson. 1993. FASTUS: A finite-state processor for information extraction from real-world text. In Proc. International Joint Conference on Artificial Intelligence.

S. Argamon, I. Dagan, and Y. Krymolowski. 1998. A memory-based approach to learning shallow natural language patterns. In COLING-ACL 98, The 17th International Conference on Computational Linguistics.

S. Buchholz, J. Veenstra, and W. Daelemans. 1999. Cascaded grammatical relation assignment. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, June.

C. Cardie and D. Pierce. 1998. Error-driven pruning of treebank grammars for base noun phrase identification. In Proceedings of ACL-98, pages 218-224.

A. Carleson, C. Cumby, J. Rosen, and D. Roth. 1999. The SNoW learning architecture. Technical Report UIUCDCS-R-99-2101, UIUC Computer Science Department, May.

E. Charniak. 1997a. Statistical parsing with a context-free grammar and word statistics. In Proc. National Conference on Artificial Intelligence.

E. Charniak. 1997b. Statistical techniques for natural language parsing. The AI Magazine.

K. W. Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proc. of ACL Conference on Applied Natural Language Processing.

M. Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184-191.

M. Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics.

J. P. Gee and F. Grosjean. 1983. Performance structures: a psycholinguistic and linguistic appraisal. Cognitive Psychology, 15:411-458.

G. Greffenstette. 1993. Evaluation techniques for automatic semantic extraction: comparing semantic and window based approaches. In ACL'93 Workshop on the Acquisition of Lexical Knowledge from Text.

R. Grishman. 1995. The NYU system for MUC-6 or where's the syntax? In B. Sundheim, editor, Proceedings of the Sixth Message Understanding Conference. Morgan Kaufmann Publishers.

Z. S. Harris. 1957. Co-occurrence and transformation in linguistic structure. Language, 33(3):283-340.

E. F. Tjong Kim Sang and S. Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of CoNLL-2000 and LLL-2000, pages 127-132.

M. P. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, June.

M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. 1999. A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, June.

V. Punyakanok and D. Roth. 2001. The use of classifiers in sequential inference. In NIPS-13: The 2000 Conference on Advances in Neural Information Processing Systems.

L. A. Ramshaw and M. P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third Annual Workshop on Very Large Corpora.

A. Ratnaparkhi. 1997. A linear observed time statistical parser based on maximum entropy models. In EMNLP-97, The Second Conference on Empirical Methods in Natural Language Processing, pages 1-10.

D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proc. National Conference on Artificial Intelligence, pages 806-813.