Linguistic Profiling for Author Recognition and Verification

Hans van Halteren
Language and Speech, Univ. of Nijmegen
P.O. Box 9103
NL-6500 HD, Nijmegen, The Netherlands
[email protected]

Abstract

A new technique is introduced, linguistic profiling, in which large numbers of counts of linguistic features are used as a text profile, which can then be compared to average profiles for groups of texts. The technique proves to be quite effective for authorship verification and recognition. The best parameter settings yield a False Accept Rate of 8.1% at a False Reject Rate equal to zero for the verification task on a test corpus of student essays, and a 99.4% 2-way recognition accuracy on the same corpus.

1 Introduction

There are several situations in language research or language engineering where we are in need of a specific type of extra-linguistic information about a text (document) and we would like to determine this information on the basis of linguistic properties of the text. Examples are the determination of the language variety or genre of a text, or a classification for document routing or information retrieval. For each of these applications, techniques have been developed focusing on specific aspects of the text, often based on frequency counts of function words in linguistics and of content words in language engineering.

In the technique we are introducing in this paper, linguistic profiling, we make no a priori choice for a specific type of word (or more complex feature) to be counted. Instead, all possible features are included, and it is determined by the statistics for the texts under consideration, and the distinction to be made, how much weight, if any, each feature is to receive. Furthermore, the frequency counts are not used as absolute values, but rather as deviations from a norm, which is again determined by the situation at hand. Our hypothesis is that this technique can bring a useful contribution to all tasks where it is necessary to distinguish one group of texts from another. In this paper the technique is tested for one specific type of group, namely the group of texts written by the same author.

2 Tasks and Application Scenarios

Traditionally, work on the attribution of a text to an author is done in one of two environments. The first is that of literary and/or historical research, where attribution is sought for a work of unknown origin (e.g. Mosteller & Wallace, 1984; Holmes, 1998). As secondary information generally identifies potential authors, the task is authorship recognition: selection of one author from a set of known authors. Then there is forensic linguistics, where it needs to be determined if a suspect did or did not write a specific, probably incriminating, text (e.g. Broeders, 2001; Chaski, 2001). Here the task is authorship verification: confirming or denying authorship by a single known author. We would like to focus on a third environment, viz. that of the handling of large numbers of student essays.

For some university courses, students have to write one or more essays every week and submit them for grading. Authorship recognition is needed in the case of the sloppy student, who forgets to include his name in the essay. If we could link such an essay to the correct student ourselves, this would prevent delays in handling the essay. Authorship verification is needed in the case of the fraudulent student, who has decided that copying is much less work than writing an essay himself. Such copying is only easy to spot if the original is also submitted by the original author.

In both scenarios, the test material will be sizable, possibly around a thousand words, and at least several hundred. Training material can be sufficiently available as well, as long as text collection for each student is started early enough. Many other authorship verification scenarios do not have the luxury of such long stretches of test text. For now, however, we prefer to test the basic viability of linguistic profiling on such longer stretches. Afterwards, further experiments can show how long the test texts need to be to reach an acceptable recognition/verification quality.

2.1 Quality Measures

For recognition, quality is best expressed as the percentage of correct choices when choosing between N authors, where N generally depends on the attribution problem at hand. We will use the percentage of correct choices between two authors, in order to be able to compare with previous work. For verification, quality is usually expressed in terms of erroneous decisions. When the system is asked to verify authorship for the actual author of a text and decides that the text was not written by that author, we speak of a False Reject. The False Reject Rate (FRR) is the percentage of cases in which this happens, the percentage being taken from the cases which should be accepted. Similarly, the False Accept Rate (FAR) is the percentage of cases where somebody who has not written the test text is accepted as having written the text. With increasing threshold settings, FAR will go down, while FRR goes up.

The behaviour of a system can be shown by one of several types of FAR/FRR curve, such as the Receiver Operating Characteristic (ROC). Alternatively, if a single number is preferred, a popular measure is the Equal Error Rate (EER), viz. the threshold value where FAR is equal to FRR. However, the EER may be misleading, since it does not take into account the consequences of the two types of errors. Given the example application, plagiarism detection, we do not want to reject, i.e. accuse someone of plagiarism, unless we are sure. So we would like to measure the quality of the system with the False Accept Rate at the threshold at which the False Reject Rate becomes zero.
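As an informal illustration of these measures (not part of the original paper), the Python sketch below assumes that the verification system outputs one score per (text, claimed author) pair and accepts the claim when the score reaches a threshold; it computes FAR and FRR at a given threshold, and the FAR at the highest threshold for which FRR is still zero.

def far_frr(scores, genuine, threshold):
    """FAR and FRR (in %) at a given threshold; a pair is accepted
    when its score is at or above the threshold."""
    n_genuine = sum(genuine)
    n_impostor = len(genuine) - n_genuine
    false_rejects = sum(1 for s, g in zip(scores, genuine) if g and s < threshold)
    false_accepts = sum(1 for s, g in zip(scores, genuine) if not g and s >= threshold)
    return (100.0 * false_accepts / n_impostor,
            100.0 * false_rejects / n_genuine)

def far_at_zero_frr(scores, genuine):
    """FAR at the highest threshold for which FRR is still zero."""
    # FRR stays zero as long as every genuine pair is accepted, i.e. the
    # threshold does not exceed the lowest score of any genuine pair.
    threshold = min(s for s, g in zip(scores, genuine) if g)
    return far_frr(scores, genuine, threshold)[0]

# Example with fabricated numbers:
# far_at_zero_frr([0.9, 0.6, 0.3, 0.7], [True, True, False, False]) -> 50.0

Because FRR is zero exactly when every genuine pair is accepted, the relevant threshold is the lowest genuine score, and the reported figure is simply the share of impostor pairs scoring at or above it.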
2.2 The Test Corpus

Before using linguistic profiling for any real task, we should test the technique on a benchmark corpus. The first component of the Dutch Authorship Benchmark Corpus (ABC-NL1) appears to be almost ideal for this purpose. It contains widely divergent written texts produced by first-year and fourth-year students of Dutch at the University of Nijmegen. The ABC-NL1 consists of 72 Dutch texts by 8 authors, controlled for age and educational level of the authors, and for register, genre and topic of the texts. It is assumed that the authors' language skills were advanced, but their writing styles were as yet only weakly developed and hence very similar, unlike those in literary attribution problems.

Each author was asked to write nine texts of about a page and a half. In the end, it turned out that some authors were more productive than others, and that the text lengths varied from 628 to 1342 words. The authors did not know that the texts were to be used for authorship attribution studies, but instead assumed that their writing skill was being measured. The topics for the nine texts were fixed, so that each author produced three argumentative non-fiction texts, on the television program Big Brother, the unification of Europe and smoking; three descriptive non-fiction texts, about soccer, the (then) upcoming new millennium and the most recent book they read; and three fiction texts, namely a fairy tale about Little Red Riding Hood, a murder story at the university and a chivalry romance.

The ABC-NL1 corpus is well-suited not only because of its contents. It has also been used in previously published studies into authorship attribution. A 'traditional' authorship attribution method, i.e. using the overall relative frequencies of the fifty most frequent function words and a Principal Components Analysis (PCA) on the correlation matrix of the corresponding 50-dimensional vectors, fails completely (Baayen et al., 2002). The use of Linear Discriminant Analysis (LDA) on overall frequency vectors for the 50 most frequent words achieves around 60% correct attributions when choosing between two authors, which can be increased to around 80% by the application of cross-sample entropy weighting (Baayen et al., 2002). Weighted Probability Distribution Voting (WPDV) modeling on the basis of a very large number of features achieves 97.8% correct attributions (van Halteren et al., To Appear). Although the corpus was designed to produce a hard recognition task, the latter result shows that very high recognition quality is feasible. Still, this appears to be a good test corpus to examine the effectiveness of a new technique.
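For readers who want to see what the baseline methods referred to above look like in practice, the following sketch gives one simplified realisation in Python with scikit-learn: relative frequencies of the fifty most frequent words, PCA on the correlation matrix (obtained here by standardising the columns first), and LDA trained on author labels. It is only a generic illustration of this family of methods, not a reconstruction of the Baayen et al. (2002) experiments; the whitespace tokenisation and the variables texts and authors are assumed placeholders.

from collections import Counter

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

def frequency_vectors(texts, top_n=50):
    """Relative frequencies of the top_n most frequent words across all texts."""
    tokens = [t.lower().split() for t in texts]            # naive tokenisation
    overall = Counter(w for toks in tokens for w in toks)
    vocab = [w for w, _ in overall.most_common(top_n)]
    vectors = np.array([[toks.count(w) / len(toks) for w in vocab]
                        for toks in tokens])
    return vectors, vocab

# texts, authors: one string and one author label per essay (placeholders).
# X, vocab = frequency_vectors(texts)
#
# PCA on the correlation matrix: standardise the columns, then decompose.
# components = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
#
# LDA on the same frequency vectors, trained on the author labels.
# lda = LinearDiscriminantAnalysis().fit(X, authors)
# (Held-out texts must be turned into vectors over the same vocab before lda.predict.)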
3 Linguistic Profiling

In linguistic profiling, the occurrences in a text are counted of a large number of linguistic features, either individual items or combinations of items. These counts are then normalized for text length and it is determined how much (i.e. how many standard deviations) they deviate from the norm.

[...] ninth essay topic. This means that for all runs, the test data is not included in the training data and is about a different topic than what is present in the training material. During each run within a set, the system only receives information about whether each training text is written by one specific author. All other texts are only marked as "not by this author".

3.3 Raw Score

The system first builds a profile to represent text written by the author in question. This is simply the featurewise average of the profile vectors of all text samples marked as being written by the author in question. The system then determines a raw score for all text samples in the list.
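The sketch below (again not taken from the paper) shows one way the steps just described could be implemented in Python with NumPy: feature counts normalised for text length and converted to deviations from a norm, an author profile computed as the featurewise average of the author's sample profiles, and a raw score for every sample. The choice of counts per 1,000 words, the use of the whole text collection as the norm, and the dot product used as a stand-in raw score are assumptions for illustration; the excerpt does not specify the paper's actual feature set or scoring formula.

import numpy as np

def profile_vectors(counts, text_lengths):
    """Feature counts per 1,000 words, expressed as the number of standard
    deviations from the mean over the whole text collection (the 'norm')."""
    per_1000 = counts / text_lengths[:, None] * 1000.0
    mean, std = per_1000.mean(axis=0), per_1000.std(axis=0)
    std[std == 0.0] = 1.0                       # guard against constant features
    return (per_1000 - mean) / std

def author_profile(profiles, is_by_author):
    """Featurewise average of the profiles of the texts marked as the author's."""
    return profiles[is_by_author].mean(axis=0)

def raw_scores(profiles, author_vec):
    """Stand-in raw score: similarity of every sample profile to the author
    profile (the paper defines its own scoring formula, not given here)."""
    return profiles @ author_vec

# counts: (n_texts, n_features) array of raw feature counts (placeholder input);
# text_lengths: (n_texts,) array of token counts per text;
# is_by_author: boolean array marking the training texts of the author in question.
# profiles = profile_vectors(counts, text_lengths)
# scores = raw_scores(profiles, author_profile(profiles, is_by_author))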