Evaluating Binary n-gram Analysis For Authorship Attribution

Mark Carman
[email protected]
School of Information Technology and Mathematical Sciences
University of South Australia
Adelaide, SA 5095, Australia

Helen Ashman
[email protected]
School of Information Technology and Mathematical Sciences
University of South Australia
Adelaide, SA 5095, Australia

Abstract

Authorship attribution techniques focus on characters and words. However, the inclusion of words with meaning may complicate authorship attribution. Using only function words provides good authorship attribution with semantic or character n-gram analyses, but it is not yet known whether it improves binary n-gram analyses. The literature mostly reports on authorship attribution at word or character level. Binary n-grams interpret text as binary data. Previous work with binary n-grams assessed authorship attribution of full texts only. This paper evaluates binary n-gram authorship attribution over text stripped of content words as well as over a range of cross-domain scenarios.

This paper reports a sequence of experiments. First, the binary n-gram analysis method is directly compared with character n-grams for authorship attribution. Then it is evaluated over three forms of input text: full text, stop words and function words only, and content words only. Finally, it is tested over cross-domain and cross-genre texts, as well as multiple-author texts.

Keywords: Authorship Attribution, Binary n-gram, Stop Word, Cross-domain, Cross-genre.

1. INTRODUCTION

Authorship attribution is the forensic analysis of text with the aim of discovering who wrote it. There is a range of authorship attribution methods in use, along with some refinements to their use intended to improve their accuracy.
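To make the idea of a binary n-gram concrete, the following minimal sketch counts overlapping n-bit windows over the byte encoding of a text. The function name `binary_ngrams`, the UTF-8 encoding, the one-bit sliding step, and the use of raw frequency counts are all assumptions made for illustration; they are not necessarily the configuration evaluated in this paper.

```python
from collections import Counter


def binary_ngrams(text, n, encoding="utf-8"):
    """Count overlapping n-bit windows in the binary form of `text`.

    Hypothetical helper for illustration only: the encoding and the
    one-bit sliding step are assumptions, not taken from the paper.
    """
    # Encode the text to bytes, then write each byte as 8 binary digits.
    bits = "".join(f"{byte:08b}" for byte in text.encode(encoding))
    # Slide a window of n bits along the bit string, one bit at a time.
    return Counter(bits[i:i + n] for i in range(len(bits) - n + 1))


if __name__ == "__main__":
    profile = binary_ngrams("the cat sat on the mat", 8)
    # Show the five most frequent 8-bit patterns in this toy text.
    for pattern, count in profile.most_common(5):
        print(pattern, count)
```

A frequency profile of this kind could then be compared between a disputed text and texts of known authorship, in the same way character n-gram profiles are compared.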