Converting the Penn Treebank to Systemic Functional Grammar
Total Page:16
File Type:pdf, Size:1020Kb
Converting the Penn Treebank to Systemic Functional Grammar Matthew Honnibal Department of Linguistics, Macquarie University Macquarie University 2109 Sydney Australia [email protected] Abstract language processing (Munro, 2003; Couchman and Whitelaw, 2003), and there is a strong history of Systemic functional linguistics offers a grammar interaction between systemic functional linguistics that is semantically organised, so that salient gram- and natural language generation (Matthiessen and matical choices are made explicit. This paper de- Bateman, 1991). However, there is currently a lack scribes the explication of these choices through the of computational SFG resources. There is no stan- conversion of the Penn Treebank into a systemic dard format for machine readable annotation, no an- functional grammar corpus. Developing such a re- notated corpora, and no useable parsers. Converting source can help connect work in natural language the Penn Treebank will make a large body of SFG processing to a significant body of research dealing annotated data available to computational linguists explicitly with the issue of how lexical and gram- for the first time, an important step towards address- matical selections create meaning. ing this situation. 1 Introduction We first discuss some preliminaries relating to the nature of systemic functional grammar, and the The Penn Treebank was designed to maximise con- scope of the converted corpus’s annotation. We sistency and annotator efficiency, rather than con- then discuss the conversion of the treebank’s phrase- formity with any particular linguistic theory (Mar- structure representation to SFG constituency struc- cus et al., 1994). This results in trees that strongly ture, and finally we discuss the addition of interper- suggest the use of synthetic features to explicate sonal and textual function structures. semantically significant grammatical choices like mood, tense, voice or negation. These distinctions 2 Some preliminaries lie latent in the configuration of the tree in the Tree- bank II annotation scheme, making it difficult for a 2.1 Structure of the SFG analysis machine learner to make use of them. Systemic functional grammar divides the task of Rather than the ad hoc addition of this informa- grammatical analysis — the process of stating the tion at the feature extraction stage, the corpus can be grammatical properties of a text — into two parts: re-presented in a way that makes feature extraction analysis of syntactic structures, and analysis of more principled. This involves increasing the size function structures. and complexity of the representation of a sentence SFG syntactic analysis is constituency based, by organising the tree semantically. Organising a and is predicated on Halliday’s notion of the rank grammar semantically is by no means a trivial task, scale (Halliday, 1966): clauses are composed of and has been an active area of linguistic research for groups/phrases, which are composed of words, the last forty years. This paper describes the con- which are composed of morphemes. The main version of the Penn Treebank into a prominent out- concerns of SFG syntactic analysis are the chunk- put of such research, systemic functional grammar ing of words into groups/phrases, and the chunk- (SFG). ing of groups/phrases into clauses. Levels of con- Systemic functional grammar does not confine its stituency between groups/phrases and their words description to syntactic structure, but includes a rep- are recognised in the literature (Matthiessen, 1995), resentation of the choices grammatical configura- but rarely brought into focus in research unless tions represent — or ‘realise’, to use the term pre- the group/phrase contains, or is, an embedded con- ferred in the linguistics literature (Halliday, 1976). stituent from another rank (e.g., a nominal group There is growing evidence that systemic func- like ‘the man’ with an embedded relative clause like tional grammar can be usefully applied to natural ‘who knew too much’). Function structures can refer to any rank of the The distinction between systems which can be constituency, but clause rank functional analysis automatically annotated and systems which cannot is generally regarded as the most important. The lies in the way the systems are realised. Mood grammar defines a set of systems, which can be de- and theme are realised primarily through the order fined recursively using conjunction and disjunction. of constituents (the order of Subject and Finite in They are usually represented graphically in system the case of mood, and the first Adjunct, Subject, networks (Matthiessen, 1995), as in Figure 1. Complement or Predicator in the case of theme). In this figure, the nested disjunction ‘indicative They are realised structurally, as opposed to lexi- or interrogative’ represents a more delicate, or finer cally. Other systems are realised through the se- grained, distinction than that between indicative and lection of grammatical items (also called ‘function imperative. After selecting from the initial choice, words’ — a term we prefer not to use because of the one proceeds from left to right into increasingly del- special sense of ‘function’ in the context of SFG). icate distinctions. These systems are categorised Systems that are realised with grammatical items, into three metafunctions, which represent differ- such as voice, polarity and tense, can also be au- ent types of meaning language enacts simultane- tomatically annotated. Lexically realised systems, ously (ideational, interpersonal and textual) (Hall- on the other hand, require a lexicon or equivalent iday, 1969). resource, since the choice of words within identi- cal syntactic structures changes the selection from declarative the system. Trees which are identical at every level indicative except their leaves have different process type se- interrogative lections. The central system of transitivity, process type, cannot be analysed for this reason. imperative The annotation of the corpus we present there- fore attempts to include selections from the follow- Figure 1: A simple mood system, ‘(indicative or in- ing systems at clause rank: terrogative) or imperative’ • interpersonal 2.2 Scope of target annotation – mood (i.e. mood type and role tags There is no clearly defined limit to systemic func- for Subject, Finite, Predicator, Adjunct, tional grammar, in the sense that one could say that a Complement, Vocative) text has been ‘fully’ analysed. The grammar is con- – clause class stantly being extended, with new kinds of analysis and levels of delicacy suggested. The ultimate aim – status of the approach is to distinguish every semantically – tense distinct different wording choice (Hasan, 1987). – polarity When working with systemic functional gram- mar, then, practitioners generally define the scope • textual of their analysis. We must do the same, although – theme (i.e. role tags for Textual Theme, the reasons are different. Analysis, so far, has al- Interpersonal Theme, Topical Theme, ways been performed manually, with only finite Rheme) time available. Projects have therefore had to de- cide between the size of a sample and the detail of – voice its analysis. In our case, we are limited to the kinds of analysis which can be directly inferred from Ideational analysis is omitted entirely, because the Penn Treebank. Future research will doubtless transitivity analysis requires a more complicated leverage other resources to extend the analysis of approach, as discussed above. Although arguably the corpus we present, but attempts to do so are be- some aspects of taxis and expansion type could be yond the scope of this paper. annotated automatically, because the central infor- The Penn Treebank presents accurate con- mation cannot be annotated, we have left it out en- stituency and part-of-speech information. This is tirely. enough information to annotate the corpus automat- ically with roughly two thirds of the most important 3 Constituency Conversion clause rank systems: mood and theme, but not tran- We have not found it necessary to use a method of sitivity. automatic rule induction to generate a CFG. The lack of a suitable training set made that approach S impractical for the time and resources we have had available; and good results have been obtained by NP VP simply using a set of hard-coded transformation functions, implemented as a Python script. This NP NP PP approach does have a significant drawback, how- ever: because the script does not output a con- version grammar, correcting systematic errors and Figure 2: Raising of NP and PP nodes dominated other maintenance or extension tasks are much more by a VP difficult. Sentence The first process in the conversion of a sentence is to parse the Lisp-style string representation into a NP-SBJ VP tree of generic node objects. Each node contains a function tag (which may be null), a node label Sentence and a set of children (which may be empty). The root node is then used to initialise a sentence ob- NP-SBJ VP ject, which sorts its immediate children into clause, group and verbal group objects. As each class is NP initialised, it initialises a clause, verbal group, other group or lexis object with each of its children. The A Lorilard spokeswoman said This is an old story tree is thus recursively re-represented by more spe- cific constituent objects, rather than generic node Figure 3: A clause dominating another objects. Subtyping the nodes facilitates the changes to the structure that must be performed, since the ture. All non-nominalised, non-embedded clauses structural changes are mostly specific to either ver- are therefore siblings dominated by the root clause bal groups or clauses. complex. These changes are divided into a series of steps, Figure 3 shows the Treebank representation, with each coded as a function. Each function contains a hypotactic clause as a child of a VP. Hypotactic a series of conditionals which identify the struc- clauses are raised to be siblings of the nearest clause ture being targeted and how it should be altered. node above them. Figure 4 shows the tree after this The most significant functions are described in more has been performed.