Computers and the Humanities 29: 449-461, 1995.
© 1995 Kluwer Academic Publishers. Printed in the Netherlands.

Shakespeare Vs. Fletcher: A Stylometric Analysis by Radial Basis Functions

David Lowe and Robert Matthews *

Neural Computing Research Group, Aston University, Birmingham B4 7ET, England

e-mail: [email protected]; [email protected]

Key words: neural networks, stylometric analysis, Shakespeare, Fletcher, discrimination, classification

Abstract

In this paper we show, for the first time, how Radial Basis Function (RBF) network techniques can be used to explore questions surrounding the authorship of historic documents. The paper illustrates the technical and practical aspects of RBFs, using data extracted from works written in the early 17th century by William Shakespeare and his contemporary John Fletcher. We also present benchmark comparisons with other standard techniques for contrast and comparison.

* David Lowe is Professor of Neural Computing at Aston University, UK. His research interests span from the theoretical aspects of dynamical systems theory and statistical pattern processing to a wide range of application domains, from financial market analysis ("Novel Exploitation of Neural Network Methods in Financial Markets", invited paper, World Conference on Computational Intelligence, vol. VI, pp. 3623-28, 1994) to the 'artificial nose' ("Novel 'Topographic' Nonlinear Feature Extraction using Radial Basis Functions for Concentration Coding in the 'Artificial Nose'", 3rd IEE International Conference on Artificial Neural Networks, pp. 95-99, Conference Publication number 372, The Institution of Electrical Engineers, 1993). Robert Matthews is a visiting research fellow at Aston University. His research interests include probability, number theory and astronomy. His recent paper in Nature (vol. 374, pp. 681-82, 1995) somehow managed to combine all three.

1. Introduction

Literary scholars have long debated over questions of authorship of various works and documents. Many such questions centre on alleged works by William Shakespeare, and one of the oldest of these disputes concerns the authorship of an obscure play, The Two Noble Kinsmen. This was first performed around 1613 but has been relatively ignored ever since. A copy of this script circulating around 1634 ascribed the work to William Shakespeare and John Fletcher (who succeeded Shakespeare after his death in 1616 as chief dramatist to the King's Men). The question arises as to whether this obscure play really is a genuine collaborative work of Shakespeare. Whilst some scholars have accepted the play as such, others remain unconvinced.

Conventionally, the primary information used to try and ascribe authorship is centred around scholarly opinion of the aesthetic style of the prose and the subtle use of language, vocabulary and grammar when compared to other works of undisputed provenance. This is a classic problem faced in many scholarly domains which use high level, human cognitive methods of reasoning combined with 'intuition' and 'experience' to try and arrive at a consensus of opinion. However, there are also quantitative, statistical approaches to data analysis which might have something to offer in these domains. The field of stylometry is essentially the application of mathematical methods to extract quantitative measures to assist in such debates.

Of course, no technique can ascribe definitive answers in such applications. The best we can hope for is a technique which provides additional quantifiable evidential weight in favour of one author or another. Another problem is that in extracting high level qualitative information from an abstract knowledge source for quantitative analysis, we need to produce an intermediate representation of information which is more 'low-level'. This process of dimensionality reduction and feature extraction is inevitably a nonlinear process.

If the transformed information has been nonlinearly distorted, then evidently we need access to nonlinear analysis techniques to resolve any conflict. Unfortunately, there are very few nonlinear methods which have an inherent ability to extract and convey statistical information. However, one such class of techniques exists in the neural network domain.

There is already evidence (Matthews and Merriam, 1993) that the Multilayer Perceptron is a potentially very useful tool in stylometric analysis. It was shown that the Multilayer Perceptron could be trained to classify successfully 96% of a training set (using cross-validation) composed of known Shakespeare-Fletcher works. When applied to other data not used as part of the training set, very successful discrimination was obtained on known works, and when applied to disputed works the method provided information which was in general broad agreement with current scholarly opinion.

However, there are many distinct types of neural network methods, each with their own properties, advantages and disadvantages. There are also many recent statistical techniques which have yet to be appropriately developed in this type of problem domain. The previous work which studied this particular problem was a preliminary feasibility study, in that no comparative performance experiments were presented, either contrasting with other network techniques or with other traditional methods. This paper addresses these criticisms by presenting an alternative network study, as well as presenting comparative performance estimates using more traditional techniques. In particular, this paper presents an analysis of Shakespeare-Fletcher data using a range of quantitative techniques, including classical statistical pattern processing methods and the Radial Basis Function network. This latter technique has several advantages over the previously applied Multilayer Perceptron, especially when applied to small sample data sets as exemplified by the specific problem considered in this paper. Some of these advantages will be discussed later.

2. Classification Using the Radial Basis Function Network

The Radial Basis Function (Broomhead and Lowe, 1988; Haykin, 1994) is a conceptually very simple and yet intrinsically powerful network structure. In particular, it has the property of being 'computationally universal' (Park and Sandberg, 1991): in principle any (nonlinear) function may be arbitrarily closely approximated by a suitable Radial Basis Function architecture. In addition, it can be considered as a generalisation of several traditional statistical pattern processing techniques. Its strength derives from a rich interpretational basis, since it lies in the confluence of a variety of 'established' scientific disciplines. Thus, although the original motivation of this particular network structure was in terms of functional approximation techniques (Powell, 1992), the network may also be derived on the basis of statistical pattern processing theory (Lowe, 1991), regression and regularisation (Girosi et al., 1995), biological pattern formation, mapping in the presence of noisy data, etc. However, in addition to exhibiting a range of useful theoretical properties, it is also a practically useful construct, as it may be applied to problem domains in discrimination (see e.g. Niranjan and Fallside, 1990, for a speech classification example), time series prediction (see articles in Rao Vemuri and Rogers, 1994, for financial and other examples) and other mapping problems, and feature extraction/topographic mapping problem domains (e.g. Lowe, 1993, for a chemical odour concentration coding example).

2.1. Neural networks and classification problems

Neural networks such as the Radial Basis Function network are examples of techniques known as nonparametric methods. This means that they can be used to construct representations of problems where an explicit model of the problem domain is not known (such as in financial market prediction) or is too difficult to evaluate (as in weather forecasting). This is achieved by optimising the structure of a neural network architecture by minimising a criterion function (usually a sum squared error criterion between the desired answer and the predicted network answer). Although originally motivated by the apparent structure of information processing in nervous systems, we now know that artificial neural networks are more closely related to pattern processing methods than to biology.

[Figure 1: a feed-forward network with an input layer of features (e.g. word frequencies), a hidden layer, and an output layer whose two nodes act as a discriminator coding Author A as {1,0} and Author B as {0,1}; connection strengths run from input node i to hidden node j, and from hidden node j to output node k.]

Fig. 1. The architecture of a feed-forward network model to classify texts as either Author A or Author B.

The architecture of an artificial neural network is very simple, and is composed of layers of processing elements with nonlinear (though differentiable) transfer functions at each node. An artificial neural network has a set of input nodes, a set of 'hidden layer' nodes (so called because they are hidden from direct interaction with the outside environment - they can only receive and pass on information to other layers) and a set of output nodes. Each node in the input layer is fully connected to every node in the hidden layer, and every node in the hidden layer is fully connected to every node in the output layer. Information from the environment is presented to the input layer nodes, and the network processes this information to produce predictions about the unknown system at the output nodes. The connections between all nodes have adjustable weights which determine the 'strength' associated with each piece of information flowing down each connection. In the most widely-used neural network, the Multilayer Perceptron, this strength between an input pattern and the weights connecting one of the nodes is given by forming the scalar product between vectors representing the pattern and the weights. The resulting summation of all contributions flowing into a node is then passed through a nonlinear transfer function. In the Multilayer Perceptron this nonlinearity is typically a 'logistic' function 1/[1 + exp(-x)], which is a function of the input activity x. The nonlinearity was originally taken to represent the output firing rate of a neuron, which increases nonlinearly as the input activity increases. This output value is then passed on by the next set of connections to the next layer or to the output. The weights are adjusted in response to the data in a training phase which determines the precise behaviour of the network.

The important aspect of such a structure is its hidden layer of nonlinear processing elements. This hidden layer is used to automatically construct a nonlinear feature extraction space which allows the classification problem to be solved. The nature of this feature extraction space depends upon the particular type of network structure used. We can represent the role of the hidden layer of nodes using a simple two dimensional classification example, as depicted in Figures 2 and 3. These figures show an example of a simple problem where there are two classes to recognise, based upon the measurements of just two types of observables. However, we cannot separate the classes with just a simple straight line (so the problem is not linearly separable). Nevertheless, it is possible to separate the two classes by using a nonlinear boundary between them. This is the purpose of a neural network. There are several ways in which this nonlinear separating boundary could be produced. The first figure shows how a simple Multilayer Perceptron could produce a separating boundary by using a set of piecewise linear segments. These segments correspond to the threshold regions of the hidden nodes, where the nonlinearity changes from 'not firing' to 'firing'. In this scheme, one can decide which class a novel pattern belongs to simply by deciding on which side of the decision boundary line the novel pattern measurements are situated.

2.2. Differences between the Multilayer Perceptron and the Radial Basis Function networks

Figure 3 shows an alternative way to separate the two classes, and is the mechanism used by a Radial Basis Function network. The Radial Basis Function is a single hidden layer feed forward network which resembles the Multilayer Perceptron. Differences include the facts that the Radial Basis Function uses linear transfer functions on the output nodes, and alternative nonlinear transfer functions to the logistic function on the hidden layer nodes. Also, the first layer of the network uses 'distance' as a measure of similarity between a weight vector and a pattern vector, rather than a scalar product function as in the Multilayer Perceptron.
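The contrast between the two kinds of hidden unit can be summarised in a few lines of Python. This is an illustrative sketch under our own naming, with a Gaussian chosen arbitrarily as the radial nonlinearity; the paper itself uses a spline basis, described in Appendix 2:

```python
import numpy as np

def mlp_hidden_activation(x, w, bias=0.0):
    """Multilayer Perceptron unit: logistic function of a scalar product."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + bias)))

def rbf_hidden_activation(x, centre, width=1.0):
    """Radial Basis Function unit: response decays with distance from a centre."""
    return np.exp(-np.sum((x - centre) ** 2) / (2.0 * width ** 2))

x = np.array([0.3, -0.7])                                  # a two dimensional pattern
print(mlp_hidden_activation(x, w=np.array([1.0, 2.0])))    # scalar-product unit
print(rbf_hidden_activation(x, centre=np.zeros(2)))        # distance-based unit
```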

[Figure 2: two overlapping classes in the plane of two measurements, separated by a piecewise linear decision boundary.]

Fig. 2. A simple two dimensional classification example. The separation of the two classes requires a nonlinear decision boundary. The figure shows a typical boundary produced by the hidden layer nodes of a multilayer perceptron.

The weight vector connected to a hidden node corresponds to the 'location' of this hidden node in the pattern space. Therefore the hidden nodes in a Radial Basis Function network respond to the difference between an input pattern and the weight vector connected to the hidden node. Naively, one could think of the hidden node as having a localised response, so that the node's influence decays as the distance between the input pattern and the weight, or 'centre', of the node increases. Appendix 2 discusses some of the technical aspects, the architecture and the notation of the Radial Basis Function. Further information may be obtained from Broomhead and Lowe (1988) and Haykin (1994).

Heuristically, each node is 'centred' around a location in the pattern space, and the nonlinearity describes how much each data point contributes towards influencing the node. In this way a probability distribution profile of the two data clusters is constructed, specifically by summing the contributions from all the 'microclusters' defined by the hidden nodes in the Radial Basis Function network. So, in order to decide to which of the two classes a new pattern belongs, we are essentially looking for the most likely class in the sense of a probability distribution. This is part of the reason that a Radial Basis Function network is particularly appropriate for the problem we are considering in this paper.

Although the Multilayer Perceptron and Radial Basis Function represent complementary views of analysing nonlinear noisy problems, the Radial Basis Function has several advantages over the Multilayer Perceptron, especially in the context of small sample problem domains, to which stylometric questions often belong. For instance, the Radial Basis Function has very strong similarities to more traditional statistical pattern classification techniques such as Parzen window classifiers and the method of Potential Functions for density estimation (Lowe, 1991). It can also be considered as a generalisation of the simple Gaussian classifier which we also used as part of this study. Hence the architecture may be considered a natural extrapolation from traditional techniques. In addition, it is an architecture which allows the incorporation of prior knowledge in a much easier fashion than the Multilayer Perceptron. For instance, the interpretation of the weights in the first layer of the Radial Basis Function is that they constitute a quantisation of the input space such that they are located in regions of high data density.

[Figure 3: the same two classes, with circles depicting the region of influence of each radial basis function.]

Fig. 3. The same two dimensional classification example. In this figure the division of the pattern space by a Radial Basis Function network is revealed. Each hidden node in the network accounts for part of a cluster of the data space. The description of the entire data set is obtained by combining the contributions from all of the microclusters.

This means that, rather than having to perform a full nonlinear optimisation to decide what values the input weights should have (as we must do in the Multilayer Perceptron), we can use our knowledge of the data and manually position the weights so that they represent the distribution of the data. Since a finite data set only has a finite number of degrees of freedom, and we use up some of those degrees of freedom for every network parameter we have to optimise, a large neural network with many weights will have too many degrees of freedom compared to the data set itself. Therefore, if we can exploit some external prior knowledge to set some of the weight values, then the degrees of freedom of the data can be used more reliably to estimate the remaining network weights (specifically, those in the final layer of the Radial Basis Function network). This becomes particularly advantageous in small sample size problems.

Finally, the Radial Basis Function network, being an extension of standard statistical methods, has a natural tendency to produce probabilistic outputs rather than strict binary decision boundaries. For problems such as this one, where there is intrinsic doubt anyway, one would expect that any technique used to help in unravelling the problem should produce answers which reflect this uncertainty. Otherwise one should doubt the validity of that technique.

The particular problem considered in this paper is one of classification. This means that the ideal quantity produced by any model which operates on ambiguous and noisy data is a posterior probability estimate, i.e. given a specific exemplar pattern, produce an estimate of the probability of each class occurring conditional upon that pattern. Network structures are ideal candidates for producing approximations to posterior probabilities. For instance, using an appropriate coding scheme on the output target values (specifically 1-from-n coding), training a neural network to minimise the sum squared error induces the outputs of the network to approximate the conditional density of the class given the data (Lowe and Webb, 1991). Basically, the optimum network output approaches p(c|x), the probability of class c occurring given that the observed input pattern was x. This is a result relevant, though not specific, to the Radial Basis Function network applied to authorship questions. Therefore, if we choose the target coding scheme correctly and perform the optimisation of the network parameters appropriately, then the Radial Basis Function should be

equivalent to a processing 'engine' which outputs probabilistic decisions in support of one author or another.

These arguments imply that we choose a Radial Basis Function architecture with 5 input nodes (one for each dimension in the stylometric feature space) and two output nodes (one for each 'class', i.e. either Author A or Author B); see Figure 1. The target codings on the output nodes are {1,0} to indicate Author A, and {0,1} to indicate Author B. This particular coding induces the actual output of the trained network to approximate the probability of the class occurring. Note that, as it is only an approximation, the output numbers cannot be interpreted strictly as probabilities. For example, even though we can guarantee that the numbers add up to unity (because we optimise the final layer weights of the network according to a Moore-Penrose pseudo-inverse method), it is not guaranteed that the numbers are less than unity, or are necessarily positive; we give examples later. Nevertheless, even with this cautionary note, the interpretation of the output values of the Radial Basis Function network is that the 'evidence' for one or other author is characterised by the distance away from the target vectors, i.e. either {1,0} or {0,1}.
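The final-layer fit just described can be sketched as follows. The paper specifies the ingredients (1-from-n target coding and a Moore-Penrose pseudo-inverse solution for the output weights); the activation matrix here is a hypothetical stand-in:

```python
import numpy as np

P, h = 100, 55                                   # patterns, hidden nodes
Phi = np.hstack([np.ones((P, 1)),                # bias column for the output nodes
                 np.random.rand(P, h)])          # stand-in hidden layer activations
T = np.zeros((P, 2))                             # 1-from-n coded targets:
T[:50, 0] = 1.0                                  # first 50 samples, Author A {1,0}
T[50:, 1] = 1.0                                  # last 50 samples, Author B {0,1}

Lam = np.linalg.pinv(Phi) @ T                    # Moore-Penrose final layer weights
outputs = Phi @ Lam
# With the bias column present, each training output pair sums to 1 (up to
# rounding), yet individual components may still fall outside [0, 1], as noted.
print(outputs.sum(axis=1)[:3], outputs.min())
```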

3. The Data

We now turn to the practical issues of applying Radial Basis Function networks to a specific stylometric task. John Fletcher, Shakespeare's successor as chief dramatist to the King's Men, has been linked to Shakespeare through the debatable provenance of four plays: The Two Noble Kinsmen, Henry VIII, The Double Falsehood and The London Prodigal.

The Two Noble Kinsmen and Henry VIII have long been considered to be a collaboration between Shakespeare and Fletcher (Hart, 1934; Schoenbaum, 1967; Proudfoot, 1970). The Double Falsehood is now generally thought to be an adaptation of the now lost The History of Cardenio, itself a collaboration between Fletcher and Shakespeare (Taylor, 1987), and recent evidence supporting authorship of The London Prodigal by Fletcher has been produced (Matthews and Merriam, 1993; Merriam, 1992), though it was previously associated with Shakespeare.

With such confusion surrounding authorship disputes, and the anarchic environment in which the plays were written and published, clearly much care is needed in constructing the basic 'ground truth' of undisputed works. Based on a large and varied body of evidence, literary scholars have arrived at a set of 'core canon' works constituting undisputed authorship. It is from this set of undisputed works that the data used to train the various classifiers was drawn. So-called "test sets" were also constructed from undisputed plays not used as part of the training set.

The training set was constructed from descriptors extracted from the core canon plays:

Shakespeare: The Winter's Tale, Richard III, Love's Labour's Lost, A Midsummer Night's Dream, Henry IV part I, , As You Like It, , Antony and Cleopatra

Fletcher: , The Womans Prize, , , The Loyal Subject, Demetrius and Enanthe

Five function word descriptors (following Horton, 1987) were extracted from each play, corresponding to the ratios of occurrence of common 'scaffolding' words (are: in: no: of: the) drawn from samples of whole acts of plays. Whether stylometric information is more appropriately captured in common scaffolding or function words, or in the use of more prosaic and rare words, is a debatable point. However, in forgery or mimicry it is arguably more difficult to capture the long-time frequency of use of common words than the more infrequent use of 'exotic' words. Also, common words are not particularly context-sensitive, and since we need statistically reliable estimators (which implies higher frequencies of occurrence), the use of commonly occurring words is more useful for relatively small text samples. A total of 50 samples was used for each author.

Each set of ratios of occurrence obtained from each author was normalised to zero mean, unit variance. This ensured that gross and obvious deviations, such as the total number of words in an Act (which would effectively constitute unhelpful noise characteristics adding to the data features), would be reduced or eliminated. We are therefore attempting to produce a set of features where each 'channel of information' contributes equally towards the training of the classifiers.
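A minimal sketch of the descriptor extraction and normalisation just described; the tokenisation and the exact definition of the occurrence ratios are our assumptions, since the paper does not spell them out:

```python
import numpy as np

SCAFFOLDING = ("are", "in", "no", "of", "the")

def act_descriptor(act_text):
    """Occurrence rate of each scaffolding word among all words of one act."""
    words = act_text.lower().split()
    total = max(len(words), 1)
    return np.array([words.count(w) / total for w in SCAFFOLDING])

def normalise(samples):
    """Scale each of the five channels to zero mean and unit variance."""
    samples = np.asarray(samples, dtype=float)
    std = samples.std(axis=0)
    std[std == 0.0] = 1.0                     # guard against constant channels
    return (samples - samples.mean(axis=0)) / std

acts = ["no word of the king is in the field",
        "the armies are in the north and no news of them"]
print(normalise([act_descriptor(a) for a in acts]))
```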

Once a set of classifiers (including standard and 'network-based' techniques) has been determined, or 'trained', it is important to have a second set of separate samples on which to test the generalisation ability of the classifiers. Without this there is no way of gauging the success of an optimised classifier. The test set was constructed in the same manner as the training set, by extracting values of the same five scaffolding words from the following core canon plays:

Shakespeare: All's Well That Ends Well, Much Ado about Nothing, Romeo and Juliet

Fletcher: Valentinian, M'sieur Thomas

Following testing, the classifiers were used to examine the following disputed texts:

Disputed: London Prodigal, Double Falsehood, The Two Noble Kinsmen

4. Is a Neural Network Necessary?

The first question to be asked of any neural network application is: is it necessary? If the problem domain as determined by the training set data is simple, then the two classes of data should be linearly separable in the five dimensional space corresponding to the information in the scaffolding words. In such a situation the nonlinear abilities of neural network techniques are redundant, and more conventional classification techniques should be employed. Two such techniques which provide a benchmark for linear decomposability are the Optimum Linear Transformation and a full Gaussian classifier. Appendix 1 briefly discusses the algorithms for the Optimum Linear Transformation classifier and the Gaussian classifier (which assumes that each data class may be described statistically as if it were generated according to a Gaussian distribution function, with a full covariance matrix). The benchmark results indicate that, although the descriptors extracted from the data do provide very good discriminatory power, the problem domain is still not linearly separable. For instance, the confusion matrix obtained on the training set using an Optimum Linear Transformation is

                       Predicted as   Predicted as
                       Shakespeare    Fletcher
  Actual Shakespeare        48             2
  Actual Fletcher            7            43          (1)

This matrix displays the information that, of the 50 Shakespeare patterns, 48 were correctly classified and 2 were misclassified as Fletcher. Similarly, of the 50 Fletcher patterns, 7 were incorrectly labelled as Shakespeare. Therefore it is clear that, even in the original five dimensional space, the training data is not linearly separable into the two classes.

A statistical clustering method does not perform any better. For instance, the confusion matrix obtained on the training set using a full Gaussian classifier (see Appendix 1) is

                       Predicted as   Predicted as
                       Shakespeare    Fletcher
  Actual Shakespeare        49             1
  Actual Fletcher            9            41          (2)

We can visualise something of the ambiguity inherent in the data by displaying two dimensional projections of the data. Figures 4 and 5 depict the projection of the original 5 dimensional descriptors into the space spanned by the two most significant principal components of the training data, i.e. those directions which reflect most of the variance of the data. From these figures one can see the overlapping nature of the two distributions. Nevertheless, a certain amount of separation is evident, and thus it should be possible to construct separate models for the distribution functions of the two classes using nonlinear methods. The information conveyed by the two dimensional projections, combined with the information provided in the confusion matrices from the linear methods, suggests that it may be advantageous to use a nonlinear network model to perform the discrimination.

[Figure 4: scatter of the 100 training samples in the plane of the first two principal components; Shakespeare data plotted as circles, Fletcher data as crosses.]

Fig. 4. Projection of the training set onto the two most significant Principal Components.

5. Radial Basis Function Results

Due to the sparsity of the total data set it is necessary to estimate an appropriate model order complexity and optimise the network weights on the training set alone. There is insufficient data to warrant full optimisation of the first layer parameters (an advantage of the Radial Basis Function over the previously employed Multilayer Perceptron). The number and location of the Radial Basis Function centres were determined by selecting centres randomly from the training set (thus ensuring that the distribution of the centres reflected the distribution of the training set) and setting the network complexity where the training error had its first 'plateau' as a function of the number of centres. This gave an estimate of 55 centres. The network performance is not critically sensitive to the precise number of hidden nodes. Recall that we are using nonparametric basis functions, so there is no issue of estimating covariance matrices or extra smoothing parameters. Having estimated the appropriate model order complexity on this data set, the confusion matrix we obtained on the training set using this Radial Basis Function network was

                       Predicted as   Predicted as
                       Shakespeare    Fletcher
  Actual Shakespeare        49             1
  Actual Fletcher            0            50          (3)

illustrating a clear superiority over the traditional methods.
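The model order procedure just described (random selection of centres from the training set, then looking for the first plateau of the training error) can be sketched as follows. The spline basis anticipates the form given in Appendix 2; the data here is a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def spline(d):
    """Nonparametric spline basis phi(z) = z**2 * log(z), with phi(0) = 0."""
    out = np.zeros_like(d)
    nz = d > 0
    out[nz] = d[nz] ** 2 * np.log(d[nz])
    return out

def training_error(X, T, n_centres):
    """Random centres from the data, linear final layer, mean squared error."""
    centres = X[rng.choice(len(X), size=n_centres, replace=False)]
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    Phi = np.hstack([np.ones((len(X), 1)), spline(dists)])   # bias + basis outputs
    Lam = np.linalg.pinv(Phi) @ T                            # pseudo-inverse solve
    return float(np.mean((Phi @ Lam - T) ** 2))

# Hypothetical stand-ins for the 100 normalised 5-D descriptors and targets.
X = rng.standard_normal((100, 5))
T = np.vstack([np.tile([1.0, 0.0], (50, 1)), np.tile([0.0, 1.0], (50, 1))])

# Scan the model order and look for the first 'plateau' of the training error;
# on the paper's data this procedure suggested roughly 55 centres.
for h in (10, 25, 40, 55, 70):
    print(h, training_error(X, T, h))
```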

[Figure 5: scatter of the test samples in the same plane. Key: 1 All's Well; 2 London Prodigal; 3 Romeo and Juliet; 4 Valentinian; 5, 6 Much Ado About Nothing and M'sieur Thomas; 7 Double Falsehood; 8 Two Noble Kinsmen Act I; 9 Act II; 10 Act III; 11 Act IV; 12 Act V.]

Fig. 5. Projection of the test set onto the same two most significant Principal Components.

However, the main question is: how well did it perform on the test set, and what interpretations can we infer?

To assist interpretation of the network's output values (since they are not strictly probabilities), we introduce a 'Characteristic Shakespeare Indicator' (CSI). Recall that the optimum target vector for a Shakespeare play was the ordered pair t_S = (1,0), one value for each output node. The ordered pair t_F = (0,1) was the ideal Fletcher target vector. In the training process we minimise the sum of squares error, which attempts to reduce the distance between the actual output vector o and the desired target vector t. What is relevant is the distance of the output vector from each of the target vectors. So a suitable indication of how close to 'Shakespeare-like' a given output vector is, is provided by the function

    CSI = ||o - t_F||^2 / ( ||o - t_S||^2 + ||o - t_F||^2 )        (4)

A value of '1' indicates full Shakespearean style, and a value of '0' indicates full Fletcherian characteristics.

The following table of results shows the network's output for the prediction of the authorship of plays in the test set.

  Play                          Actual outputs     CSI     Decision
  All's Well                    (1.13, 0.013)      0.992   Shakespeare
  Much Ado About Nothing        (0.905, 0.0945)    0.989   Shakespeare
  Romeo and Juliet              (0.972, 0.028)     0.999   Shakespeare
  Valentinian                   (0.07, 0.93)       0.006   Fletcher
  M'sieur Thomas                (0.125, 0.875)     0.020   Fletcher
  London Prodigal               (0.24, 0.76)       0.090   Fletcher
  Double Falsehood              (0.18, 0.82)       0.046   Fletcher
  Two Noble Kinsmen (Act I)     (0.83, 0.17)       0.960   Shakespeare
  Two Noble Kinsmen (Act II)    (-0.135, 1.135)    0.014   Fletcher
  Two Noble Kinsmen (Act III)   (0.306, 0.694)     0.163   F/S??
  Two Noble Kinsmen (Act IV)    (0.729, 0.271)     0.878   S/F??
  Two Noble Kinsmen (Act V)     (0.967, 0.033)     0.999   Shakespeare

As can be seen, the network produces the correct classification in the test set on the commonly ascribed plays (All's Well, Much Ado about Nothing and Romeo and Juliet [Shakespeare]; Valentinian and M'sieur Thomas [Fletcher]). What about the disputed works? Although the support is not quite so strong, the network indicates that The Double Falsehood should be ascribed primarily to Fletcher rather than Shakespeare, in agreement with contemporary scholarship (Metz, 1989). Similarly, we find that The London Prodigal is predominantly Fletcherian, though with some Shakespearian influences.

The verdict on The Two Noble Kinsmen is particularly interesting. This play has long been considered by some to be a genuine collaboration between the two dramatists. With this in mind, the network was applied to individual Acts of the play. Briefly, the network shows very strong support for Shakespeare writing Act V and Act I, and Fletcher writing Act II. However, the situation on Acts III and IV is not so clear. There is a split in support of Fletcher writing much of Act III but with significant Shakespeare input, and Shakespeare writing Act IV, though with significant Fletcher involvement. Each individual Act may have been written collaboratively, although there may well be too much noise on the data to make a firm decision, due to estimating the statistics on individual Acts rather than whole plays. Nevertheless, these results are in broad agreement with the assessments of Hoy (1956) and Proudfoot (1970) and those found by using Multilayer Perceptron methods (Matthews and Merriam, 1993).

In short, when applied to previously unseen data from well provenanced works, the trained Radial Basis Function network produces classifications in agreement with conventional scholarship. When applied to disputed works, the Radial Basis Function produces results which may be seen to be in general agreement with contemporary opinions arrived at by a variety of alternative and often subjective means.

The fact that these works are disputed implies that there is doubt in the expert opinions of the scholars. This doubt can be quantified for the Radial Basis Function results. For instance, the deviation of the CSI for Valentinian - an undisputed Fletcher work - is only 0.006, whereas the CSI for the disputed London Prodigal indicates an uncertainty margin fifteen times larger, at 0.09. We can use the Characteristic Shakespeare Indicator, as given by equation (4), to rank how 'certain' the network predictions are, for either Fletcher or Shakespeare. We find, for example, that the five most 'uncertain' data samples correspond to (in order of decreasing uncertainty): Two Noble Kinsmen Act III, Two Noble Kinsmen Act IV, London Prodigal, Double Falsehood, and Two Noble Kinsmen Act I. These are of course examples from the traditionally 'disputed' plays.

Thus the Radial Basis Function network is capable of producing a much richer interpretation of its predictions than a simple binary segmentation of data samples into either the Fletcher or Shakespeare camps. It also now provides a system for fast and efficient 'classification' of other Fletcher or Shakespeare works, which should also produce a degree of 'uncertainty' attached to each prediction.

6. Conclusions

This paper has presented an analysis of stylometric features using Radial Basis Function and standard techniques to infer the authorship of disputed historic documents. It was demonstrated that a nonlinear model, and specifically the Radial Basis Function network architecture, was both required and could be used effectively to separate, by author, known works of literature, and in addition to provide evidence of authorship for disputed works. Comparisons with other more standard techniques have illustrated an advantage to using Radial Basis Function neural networks.

The main conclusion of this work is to demonstrate the utility and potential of quantitative techniques such as neural networks as an additional set of tools to assist in the decision making processes of relatively subjective disciplines. Advantages include the ability to deal with statistically noisy data samples and intrinsically nonlinear relationships. Disadvantages stem from the requirement for large amounts of raw data, which would rule out the exploitation of neural networks as, for example, a forensic tool. However, as the amount of available computer readable literary text continues to increase, we can expect expansion in the use of automated pattern recognition techniques, such as neural networks, as assistants to help in the resolution of outstanding literary mysteries.

Appendix 1: The Optimum Linear Transformation, and Gaussian Classifier

This Appendix briefly discusses the mathematics used to construct the 'benchmark' classifiers of an Optimum Linear Transformation (OLT) and a Gaussian Classifier (GC). First let us introduce some notation. If there are n observable quantities (such as the five 'scaffolding' words used in this study), let us denote each one of them as x_i, i = 1, 2, ..., n, and the collection or vector of the observables as x_p for the p-th exemplar pattern. Assume we have a total number of P patterns in the training set. Then we can collect the entire set of input patterns into a matrix X of size n x P. Corresponding to each input pattern in the training set we have a desired 'target' value (i.e. either Shakespeare or Fletcher, coded appropriately). Let us assume that there are c components in each target pattern (c = 2 in the examples considered in this paper). We denote the p-th target pattern as the vector t_p. We can collect all the target patterns into a matrix T of size c x P.

A.1 Optimum linear transformation (OLT)

The optimum linear transformation seeks to find the best transformation (i.e. the one that minimises the sum squared residual error) between the matrix of desired target values and the matrix of values obtained by an arbitrary linear transformation of the input data (so by using translations, rotations, scalings and reflections). We can formalise this as follows. For the n x P matrix X of input patterns and the corresponding c x P matrix T of target patterns on the training set, the problem is to find the optimum c x n matrix A and c x 1 vector b (which accounts for translations) which satisfy the equation

    A X + b 1* ≈ T

with minimum residual error. Note that 1* is a row vector of 1's of size 1 x P. The solution with minimum Frobenius norm may be found by pseudo-inverse methods, for which several numerical procedures have been developed.
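One way to realise this fit (a sketch; the paper does not prescribe a particular numerical routine) is to append a row of ones to X, so that A and the translation b are recovered together by a single least squares solve:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: n = 5 observables, c = 2 classes, P = 100 patterns,
# with the two classes given slightly different means so the fit is non-trivial.
X = np.hstack([rng.standard_normal((5, 50)) - 0.5,
               rng.standard_normal((5, 50)) + 0.5])          # n x P
T = np.vstack([np.tile([1.0, 0.0], (50, 1)),
               np.tile([0.0, 1.0], (50, 1))]).T              # c x P

Xa = np.vstack([X, np.ones((1, 100))])                       # append the 1* row
W = np.linalg.lstsq(Xa.T, T.T, rcond=None)[0].T              # least squares solve
A, b = W[:, :5], W[:, 5:]                                    # split off translation

t_hat = A @ X + b                                            # transformed patterns
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
# Classify each pattern by the closest target vector to its transformed image.
labels = np.argmin([np.sum((t_hat - v[:, None]) ** 2, axis=0) for v in targets],
                   axis=0)
print(labels[:5], labels[-5:])
```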

Once A and b have been determined, an arbitrary input pattern is linearly transformed into a pattern t in the target pattern space. The class associated with that input pattern is determined by the closest target vector to the transformed pattern t.

A.2 Gaussian classifier (GC)

The Gaussian classifier is probabilistically motivated. We assume each pattern in each class is drawn from a full Gaussian distribution, where each class c may be characterised by a mean vector μ_c and a (full) covariance matrix Σ_c. These are approximated by the training set samples. Then, given the c-th class, the probability of any given pattern x_p belonging to that class may be expressed as

    P(x_p | c) = (2π)^(-n/2) |Σ_c|^(-1/2) exp[ -(1/2) (x_p - μ_c)^T Σ_c^(-1) (x_p - μ_c) ]

Knowing the prior probabilities p_c of the occurrence of each class allows the determination of the probability of that class c given the pattern as p_c P(x_p | c). Thus, the decision is to choose the class which gives the largest pattern conditional probability.
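A compact sketch of this classifier (our own structuring; log probabilities are used for numerical convenience, and the (2π)^(n/2) factor, common to all classes, is dropped):

```python
import numpy as np

class GaussianClassifier:
    """Full-covariance Gaussian classifier, a sketch of section A.2."""

    def fit(self, samples_by_class, priors):
        self.priors = priors
        self.params = [(S.mean(axis=0), np.cov(S, rowvar=False))
                       for S in samples_by_class]
        return self

    def predict(self, x):
        scores = []
        for prior, (mu, Sigma) in zip(self.priors, self.params):
            d = x - mu
            _, logdet = np.linalg.slogdet(Sigma)
            quad = d @ np.linalg.solve(Sigma, d)   # (x-mu)^T Sigma^-1 (x-mu)
            scores.append(np.log(prior) - 0.5 * (logdet + quad))
        return int(np.argmax(scores))              # most probable class

rng = np.random.default_rng(2)
shakespeare = rng.standard_normal((50, 5))         # hypothetical descriptors
fletcher = rng.standard_normal((50, 5)) + 0.5
gc = GaussianClassifier().fit([shakespeare, fletcher], priors=[0.5, 0.5])
print(gc.predict(shakespeare[0]), gc.predict(fletcher[0]))
```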

Appendix 2: The Radial Basis Function Network

The Radial Basis Function is a single hidden layer feed forward network with linear transfer functions on the output nodes and nonlinear transfer functions on the hidden layer nodes. Many types of nonlinearities may be used. There is also typically a bias or offset weight on each output node, though not usually on the hidden nodes. The primary adjustable parameters are the final layer weights {λ_jk} connecting the j-th hidden node to the k-th output node. There are also weights {μ_ij} connecting the i-th input node with the j-th hidden node (see Figure 1) and occasionally a 'smoothing' factor matrix {Σ_j} which denotes the range of influence of each node. The bias node is labelled by j = 0.

The mathematical embodiment of the Radial Basis Function takes the following form. The k-th component of the output vector y, corresponding to the p-th input pattern x_p, is expressed as

    [y(x_p)]_k = Σ_{j=0}^{h} λ_jk φ_j(||x_p - μ_j||)

where φ_j(...) denotes the nonlinear transfer function of hidden node j. The most common example of the smoothing factor is in the use of a general Gaussian transfer function, i.e. the nonlinear transfer function takes the form φ(z) ∝ exp[-z^T Σ^(-1) z]. In the simple case of a Euclidean distance this nonlinearity simplifies to φ(z) ∝ exp[-||z||^2 / (2σ^2)]. In this paper, as an illustration of the general approach, we have chosen to employ a nonparametric 'spline' basis function, i.e. φ(z) ∝ z^2 log(z), which, although less flexible, has the advantage that it does not require the additional estimation of smoothing parameters. An interesting point about this type of nonlinearity is that it is unbounded. It is commonly believed that the major advantage of the use of Radial Basis Function architectures is the exploitation of the locality of the basis functions. However, there are good reasons, which we cannot discuss as part of this paper, why nonlocalised basis functions are more appropriate for interpolation problems. The important point is that the network as a whole needs to be able to generate a localised mapping of the data. This we can achieve through the training process, since the Radial Basis Function structure is computationally universal.

One of the advantages of the Radial Basis Function is that the first layer weights {μ_j, Σ_j; j = 1, ..., h} may often be determined or specified by a judicious use of prior knowledge, or adapted by simple techniques. Early work (Broomhead and Lowe, 1988) found it sufficient to position the basis functions at data points sampled randomly according to the distribution of the data. This ensured that network resources were concentrated in regions of higher data density. Another early technique (Moody and Darken, 1989) was to position the centres of the basis functions according to a K-means clustering process on the data points, and then set the smoothing parameters of the assumed Gaussian basis functions to be the average distance between cluster centres. Therefore, once the weights associated with the first layer have been specified, the major problem in 'training' a Radial Basis Function network is focussed upon the determination of the final layer weights. Since this is a linear optimisation process (the parameters {λ_jk} occur linearly when minimising the residual sum squared error measure usually employed in the training process), the Radial Basis Function is computationally more attractive in applications than a Multilayer Perceptron, even though they are both computationally universal architectures. As this final phase is a linear optimisation process, a pseudo-inverse method or other efficient minimisation technique may be used to obtain the optimum network weights (see Broomhead and Lowe, 1988, and Haykin, 1994, for further details).
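For reference, the three hidden node nonlinearities discussed in this Appendix can be written as short functions of the pattern-to-centre difference z = x - μ (the helper names and parameter defaults are ours):

```python
import numpy as np

def phi_gaussian_full(z, Sigma):
    """General Gaussian nonlinearity: phi(z) proportional to exp(-z^T Sigma^-1 z)."""
    return np.exp(-z @ np.linalg.solve(Sigma, z))

def phi_gaussian_isotropic(z, sigma=1.0):
    """Euclidean special case: phi(z) proportional to exp(-||z||^2 / (2 sigma^2))."""
    return np.exp(-np.dot(z, z) / (2.0 * sigma ** 2))

def phi_spline(z):
    """Nonparametric spline on r = ||z||: r^2 log r, unbounded as r grows."""
    r = np.linalg.norm(z)
    return 0.0 if r == 0.0 else r ** 2 * np.log(r)

z = np.array([0.5, -0.25])                     # pattern-to-centre difference
print(phi_gaussian_full(z, np.eye(2)),
      phi_gaussian_isotropic(z),
      phi_spline(z))
```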

Acknowledgements

The authors would like to thank Thomas Merriam for advice, comments, guidance and for providing the data used as part of this study. This data, used to train the Radial Basis Function, is available from Robert Matthews in ASCII form to all who provide either an email address or a 3.5 inch disc with return postage. Robert Matthews can be contacted at the e-mail address [email protected].

References

Broomhead, D. S. and David Lowe. "Multi-variable Functional Interpolation and Adaptive Networks". Complex Systems, 2, 3 (1988), 269-303.

Girosi, F., M. Jones and T. Poggio. "Regularization Theory and Neural Network Architectures". Neural Computation, 7, 2 (1995), 219-269.

Hart, A. Shakespeare and the Vocabulary of The Two Noble Kinsmen. Melbourne: Melbourne University Press, 1934.

Haykin, S. Neural Networks: A Comprehensive Foundation (Chapter 7: Radial Basis Function Networks). Macmillan, 1994.

Horton, T. B. The Effectiveness of the Stylometry of Function Words in Discriminating between Shakespeare and Fletcher. Doctoral Thesis. University of Edinburgh, 1987.

Hoy, C. "The Shares of Fletcher and his Collaborators in the Beaumont and Fletcher Canon (VIII)". Studies in Bibliography, 15 (1956), 129-146.

Lowe, D. "What Have Neural Networks to Offer Statistical Pattern Processing?". SPIE Proceedings on Adaptive Signal Processing, 1565 (1991), 460-71.

Lowe, D. "Novel 'Topographic' Nonlinear Feature Extraction using Radial Basis Functions for Concentration Coding in the 'Artificial Nose'". 3rd IEE International Conference on Artificial Neural Networks. Conference Publication number 372 (1993), pp. 95-99.

Lowe, D. and A. R. Webb. "Optimized Feature Extraction and the Bayes Decision in Feed-forward Classifier Networks". IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 4 (1991), 355-64.

Matthews, R. A. J. and T. V. N. Merriam. "Neural Computation in Stylometry I: An Application to the Works of Shakespeare and Fletcher". Literary and Linguistic Computing, 8, 4 (1993), 203-209.

Merriam, T. V. N. "Modelling a Canon: Principles and Examples in Applied Statistics". Doctoral Thesis. University of London, 1992.

Metz, G. H., ed. Sources of Four Plays Ascribed to Shakespeare. Columbia: University of Missouri Press, 1989.

Moody, J. and C. Darken. "Fast Learning in Networks of Locally Tuned Processing Units". Neural Computation, 1, 2 (1989), 281-94.

Niranjan, M. and F. Fallside. "Neural Networks and Radial Basis Functions in Classifying Static Speech Patterns". Computer Speech and Language, 4 (1990), 275-89.

Park, J. and I. W. Sandberg. "Universal Approximation Using Radial-Basis-Function Networks". Neural Computation, 3 (1991), 246-257.

Powell, M. J. D. "The Theory of Radial Basis Function Approximation in 1990". In Advances in Numerical Analysis, Vol. II: Wavelets, Subdivision Algorithms and Radial Basis Functions. Ed. W. A. Light. Oxford University Press, 1992, 105-210.

Proudfoot, G. R., ed. The Two Noble Kinsmen. London: Edward Arnold, 1970.

Rao Vemuri, V. and R. D. Rogers, eds. Artificial Neural Networks: Forecasting Time Series. IEEE Computer Society Press, 1994.

Schoenbaum, S., ed. The Famous History of the Life of King Henry the Eighth. New York: The New American Library, 1967.

Taylor, G. "The Canon and Chronology of Shakespeare's Plays". In William Shakespeare: A Textual Companion. Oxford: Clarendon Press, 1987.