Systemic Functional Features in Stylistic Text Classification
Total Page:16
File Type:pdf, Size:1020Kb
Systemic Functional Features in Stylistic Text Classification Casey Whitelaw Shlomo Argamon School of Information Technologies Department of Computer Science University of Sydney, NSW, Australia Illinois Institute of Technology [email protected] 10 W. 31st Street, Chicago, IL 60616, USA [email protected] Abstract of meaning, we are unlikely to gain any true insight into the nature of the stylistic dimension(s) under study. We propose that textual ‘style’ should be best de- fined as ‘non-denotational meaning’, i.e., those as- How then should we approach finding more theoretically- pects of a text’s meaning that are mostly inde- motivated stylistic feature sets? First consider the well- pendent of what the text refers to in the world. accepted notion that style is indicated by features that in- To make this more concrete, we describe a lin- dicate the author’s choice of one mode of expression from guistically well-motivated framework for compu- among a set of equivalent modes. At the surface level, this tational stylistic analysis based on Systemic Func- may be expressed by specific choice of words, syntactic struc- tional Linguistics. This theory views a text as a tures, discourse strategy, or combinations. The causes of such realisation of multiple overlapping choices within surface variation are similarly heterogeneous, including the a network of related meanings, many of which re- genre, register, or purpose of the text, as well as the educa- late to non-denotational (traditionally ‘pragmatic’) tional background, social status, and personality of the au- aspects such as cohesion or interpersonal distance. thor and audience. What all these dimensions of variation Variations between relative frequencies of options have in common, though, is an independence from the topic within these systems corresponds to stylistic varia- of the text, which may be considered to be those objects and tion. Though full parsing in SFL remains a difficult events that it refers to (as well as their properties as described unsolved problem, we present a software architec- in the text). We thus provisionally define the stylistic mean- ture which allows for efficient modelling and ex- ing of a text to be those aspects of its meaning that are non- traction of SFL entities for use in stylistic analyses. denotational, i.e., independent of the objects and events to In support, we show results for two applications: which the text refers. Indeed, this definition of style as char- classifying texts as financial scams and classifying acteristic choice unrelated to basic functionality is applicable scientific articles by field. outside the textual domain. Our goal, therefore, is to find a computationally tractable formulation of linguistically well-motivated features which 1 Introduction permit expression of non-denotational meanings. We believe Modern computational stylistics seeks to apply statistical and that Systemic Functional Linguistics (SFL) is a particularly machine learning techniques to features extracted from text, apposite theory for this purpose. As we will detail further in order to answer style-related questions such as authorship below, SFL [12] analyses a text as a realisation of particular attribution, genre characterisation and so forth. Its origins are choices of meanings. For work that seeks to categorise a text in the field of stylometrics in which small numbers of features according to the meanings that it makes rather than just the of a text were sought that statistically significantly predict a words that it uses, SFL presents an extremely useful model. given stylistic difference. Traditionally, stylometric models In particular, the fact that SFL explicitly recognises and rep- for categorization have typically been based on hand-selected resents non-denotational meanings makes it particularly ap- sets of content-independent, lexical [21], syntactic [25], or plicable to stylistical problems. complexity-based [30] features. In nearly all cases, the input We describe in this paper a software architecture for com- feature sets used are not theoretically motivated, from a lin- puting aspects of language use in an SFL paradigm. Full pars- guistic perspective, as being directly related to style—rather ing of SFL is a difficult and unsolved problem [22] that has the methodology that has developed is to find as large a set of been seen as holding back the applicability of systemic func- topic-independent features as possible and use them as input tional theory to NLP applications. We avoid this complexity to a generic learning algorithm (preferably one resistant to by developing a partial parsing approach, based on the notion overfitting). While some interesting and effective feature sets of identifying semantic units in the text, which are specific have been found in this way (such as [15; 16]; function words instances of systems defined in the underlying SFL gram- have also proven to be surprisingly effective on their own [1; mar. Application of our methods to stylistic text categoriza- 20]), we believe that without a firm basis in a linguistic theory tion shows that SFL is well-suited to identifying document- preferences can be measured by evaluating the relative prob- abilities of different options by tagging their realisations in a Context Categories of social situation corpus of texts [11]. 2.1 Stratification and Realisation Semantics Systems of meaning One key global dimension is the hierarchy of stratification. Language itself is modelled as an ordered series of levels or strata, as shown in Figure 1. In general, we are interested Lexicogrammar Systems of wording in systems that capture an aspect of the meaning of the text (located within the semantic stratum) and is expressed as a pattern of word usage (realised within the lexicogrammatic Phonology / Graphology stratum). Systems of sounding / writing Register A register is a group of texts whose language selections vary from the general language system in similar ways. Regis- ter often denotes functional distinctions in language use re- lated to the context of such use [8], or it may derive from the particular history of a community of discourse. A register is Figure 1: Modelling language stratally (after [14]). Units in realised by a particular skewing ‘of probabilities relative to each stratum are realised by configurations of units in the en- the general systemic probabilities’ [17]. We thus view writ- closed stratum. ing styles as registers realised by skewing the probabilities of non-denotational meanings in the language. These mean- ings in a document in turn are realised in the semantics and level characteristics of language use. lexicogrammar, and so may be analysed on these terms. In particular, register differences may be exposed through the 2 Systemic Functional Linguistics patterns of language choice within a system. For example, Plum and Cowling [24] show a rela- Systemic Functional Linguistics (SFL) is a framework for tion between speaker social class and choice of verb describing and modeling language in functional rather than tense (past/present) in face-to-face interviews. Similarly, formal terms. The theory is functional in that language is Hasan [13] has shown, in mother-child interactions, that the interpreted as a resource for making meaning, and descrip- sex of the child and the family’s social class together have tions are based on extensive analyses of written and spoken a strong influence on several kinds of semantic choice in [ ] text 12 . The theory is also systemic in that it models lan- speech. The methodology for such studies has involved hand- [ ] guage as a system of choices 19 . SFL has been applied coding a corpus for systemic-functional and contextual vari- in natural language processing in various contexts since the ables and then evaluating how systemic choice probabilities [ 1960s, but has been used most widely in text generation 18; vary with contextual factors using multivariate analysis (e.g., ] 26 , due to the difficulty of full parsing in the theory. principal components analysis). Here, by contrast, we de- As noted, SFL construes language as a set of interlocking velop automated methods for recognising entities which re- choices for expressing meanings, with more general choices alise such variables in the text, allowing us to use machine constraining the possible specific choices. For example: “A learning techniques to automatically build accurate classifi- message is either about doing, thinking, or being; if about do- cation models for large numbers of SFL-based features. ing, it is either standalone action or action on something; if action on something it is either creating something or affect- 2.2 Metafunction ing something preexistent,” and so on. Thus a system is a set Key to our use of SFL for stylistic analysis is the notion of of options for meanings to be expressed, with entry condi- metafunction, referring to three separate strands of meaning tions denoting when that choice is possible—for example, if that in parallel contribute to the overall meaning in the text a message is not about doing, then there is no choice possi- [12]. Briefly, the three metafunctions, deployed simultane- ble between expressing standalone action or action on some- ously in a text, are as follows: thing. Each system has also a realization specification, giving constraints (lexical, featural, or structural) on statements ex- The textual metafunction provides ‘resources for present- pressing the option. Options often serve as entry conditions ing information as text in context’ [19]. These resources for more specific (or delicate) systems. enable individual utterances to be evaluated as messages, By viewing language as a complex of choices between mu- as well as related to the context of utterance (either other tually exclusive options, the systemic approach enables effec- utterances or extra-textual features). A simple example tive characterisation of variation in language use. As will be is the use of ‘textual theme’, which sets up a clause as seen, a systemic specification allows us to create features re- connected to a previous clause in a text, as in the con- lated to high-level linguistic variation.