Shakespeare Vs. Fletcher: A Stylometric Analysis by Radial Basis Functions
Author(s): David Lowe and Robert Matthews
Source: Computers and the Humanities, Vol. 29, No. 6 (Dec., 1995), pp. 449-461
Published by: Springer
Stable URL: http://www.jstor.org/stable/30200368
Accessed: 27-03-2016 15:04 UTC
This content downloaded from 129.67.116.144 on Sun, 27 Mar 2016 15:04:38 UTC All use subject to http://about.jstor.org/terms Computers and the Humanities 29: 449-461, 1995. 449
© 1995 Kluwer Academic Publishers. Printed in the Netherlands.
Shakespeare Vs. Fletcher: A Stylometric Analysis by Radial Basis Functions
David Lowe and Robert Matthews *
Neural Computing Research Group, Aston University, Birmingham B4 7ET, England
e-mail:[email protected]; [email protected]
Key words: neural networks, stylometric analysis, Shakespeare, Fletcher, discrimination, classification
Abstract
In this paper we show, for the first time, how Radial Basis Function (RBF) network techniques can be used to
explore questions surrounding the authorship of historic documents. The paper illustrates the technical and practical
aspects of RBFs, using data extracted from works written in the early 17th century by William Shakespeare and his
contemporary John Fletcher. We also present benchmark comparisons with other standard techniques.
* David Lowe is Professor of Neural Computing at Aston University, UK. His research interests span from the theoretical aspects of dynamical systems theory and statistical pattern processing, to a wide range of application domains, from financial market analysis ("Novel Exploitation of Neural Network Methods in Financial Markets", invited paper, World Conference on Computational Intelligence, vol. VI, pp. 3623-28, 1994) to the 'artificial nose' ("Novel 'Topographic' Nonlinear Feature Extraction using Radial Basis Functions for Concentration Coding in the 'Artificial Nose'", 3rd IEE International Conference on Artificial Neural Networks, pp. 95-99, Conference Publication number 372, The Institute of Electrical Engineers, 1993).
Robert Matthews is a visiting research fellow at Aston University. His research interests include probability, number theory and astronomy. His recent paper in Nature (vol. 374, pp. 681-82, 1995) somehow managed to combine all three.

1. Introduction

Literary scholars have long debated over questions of authorship of various works and documents. Many such questions centre on alleged works by William Shakespeare and one of the oldest of these disputes concerns the authorship of an obscure play, The Two Noble Kinsmen. This was first performed around 1613 but has been relatively ignored ever since. A copy of this script circulating around 1634 ascribed the work to William Shakespeare and John Fletcher (who succeeded Shakespeare after his death in 1616 as chief dramatist to the King's Men). The question arises as to whether this obscure play really is a genuine collaborative work of Shakespeare. Whilst some scholars have accepted the play as such, others remain unconvinced.

Conventionally, the primary information used to try and ascribe authorship is centred around scholarly opinion of the aesthetic style of the prose and the subtle use of language, vocabulary and grammar when compared to other works of undisputed provenance.

This is a classic problem faced in many scholarly domains which use high level, human cognitive methods of reasoning combined with 'intuition' and 'experience' to try and arrive at a consensus of opinion. However there are also quantitative, statistical approaches to data analysis which might have something to offer in these domains. The field of stylometry is essentially the application of mathematical methods to extract quantitative measures to assist in such debates.

Of course, no technique can ascribe definitive answers in such applications. The best we can hope for is a technique which provides additional quantifiable evidential weight in favour of one author or another. Another problem is that in extracting high level qualitative information from an abstract knowledge source for quantitative analysis, we need to produce an intermediate representation of information which is more 'low-level'. This process of dimensionality reduction and feature extraction is inevitably a nonlinear process. If the transformed information has been nonlinearly
distorted, then evidently we need access to nonlinear analysis techniques to resolve any conflict. Unfortunately there are very few nonlinear methods which have an inherent ability to extract and convey statistical information. However, one such class of techniques exists in the neural network domain.

There is already evidence (Matthews and Merriam, 1993) that the Multilayer Perceptron is a potentially very useful tool in stylometric analysis. It was shown that the Multilayer Perceptron could be trained to classify 96% of the training set successfully (using cross-validation) composed of known Shakespeare-Fletcher works. When applied to other data not used as part of the training set, very successful discrimination was obtained on known works, and when applied to disputed works the method provided information which was in general broad agreement with current scholarly opinion.

However there are many distinct types of neural network methods, each with their own properties, advantages and disadvantages. There are also many recent statistical techniques which have yet to be appropriately developed in this type of problem domain. The previous work which has studied this particular problem was a preliminary, feasibility study in that no comparative performance experiments were presented, either contrasting with other network techniques, or with other traditional methods. This paper addresses these criticisms by presenting an alternative network study as well as presenting comparative performance estimates using more traditional techniques. In particular this paper presents an analysis of Shakespeare-Fletcher data using a range of quantitative techniques, including classical statistical pattern processing methods and the Radial Basis Function network. This latter technique has several advantages over the previously applied Multilayer Perceptron, especially when applied to small sample data sets as exemplified by the specific problem considered in this paper. Some of these advantages will be discussed later.

2. Classification Using the Radial Basis Function Network

The Radial Basis Function (Broomhead and Lowe, 1988; Haykin, 1994) is a conceptually very simple and yet intrinsically powerful network structure. In particular it has the property of being 'computationally universal' (Park and Sandberg, 1991): in principle any (nonlinear) function may be arbitrarily closely approximated by a suitable Radial Basis Function architecture. In addition it can be considered as a generalisation of several traditional statistical pattern processing techniques. Its strength derives from a rich interpretational basis since it lies in the confluence of a variety of 'established' scientific disciplines. Thus, although the original motivation of this particular network structure was in terms of functional approximation techniques (Powell, 1992), the network may be derived on the basis of statistical pattern processing theory (Lowe, 1991), regression and regularisation (Girosi et al., 1995), biological pattern formation, mapping in the presence of noisy data etc. However, in addition to exhibiting a range of useful theoretical properties, it is also a practically useful construct as it may be applied to problem domains in discrimination (see e.g. Niranjan and Fallside, 1990, for a speech classification example), time series prediction (see articles in Rao Vemuri and Rogers, 1994, for financial and other examples) and other mapping problems, and feature extraction/topographic mapping problem domains (e.g. Lowe, 1993, for a chemical odour concentration coding example).

2.1. Neural networks and classification problems

Neural networks such as the Radial Basis Function network are examples of techniques known as nonparametric methods. This means that they can be used to construct representations to problems where an explicit model of the problem domain is not known (such as in financial market prediction) or is too difficult to evaluate (as in weather forecasting). This is achieved by optimising the structure of a neural network architecture by minimising a criterion function (usually a sum squared error criterion between the desired answer and the predicted network answer). Although originally motivated by the apparent structure of information processing in nervous systems, we now know that artificial neural networks are more closely related to pattern processing methods than to biology.

The architecture of an artificial neural network is very simple and is composed of layers of processing elements with nonlinear (though differentiable) transfer functions at each node. An artificial neural network has a set of input nodes, a set of 'hidden layer' nodes (so called because they are hidden from direct interaction with the outside environment - they can only receive and pass on information to other layers) and a set of output nodes. Each node in the input layer is fully connected to every node in the hidden layer
[Figure 1: diagram of a feed-forward network with an input layer (features, e.g. word frequencies), a hidden layer and an output layer (discriminator: Author A {1,0}, Author B {0,1}). Annotations: connection strength from input node i to hidden node j; connection strength from hidden node j to output node k.]
Fig. 1. The architecture of a feed-forward network model to classify texts as either Author A or Author B.
and every node in the hidden layer fully connected to every node in the output layer. Information from the environment is presented to the input layer nodes and the network processes this information to produce predictions about the unknown system at the output nodes. The connections between all nodes have adjustable weights which determine the 'strength' associated with each piece of information flowing down each connection. In the most widely-used neural network, the Multilayer Perceptron, this strength between an input pattern and the weights connecting one of the nodes is given by forming the scalar product between vectors representing the pattern and the weights. The resulting summation of all contributions flowing into a node is then passed through a nonlinear transfer function. In the Multilayer Perceptron this nonlinearity is typically a 'logistic' function 1/[1 + exp(-x)] which is a function of the input activity x. The nonlinearity originally was taken to represent the output firing rate activity of a neuron which increases nonlinearly as the input activity increases. This output value is then passed on by the next set of connections to the next layer or to the output. The weights are adjusted in response to the data in a training phase which determines the precise behaviour of the network.

The important aspect of such a structure is its hidden layer of nonlinear processing elements. This hidden layer is used to automatically construct a nonlinear feature extraction space which allows the classification problem to be solved. The nature of this feature extraction space depends upon the particular type of network structure used. We can represent the role of the hidden layer of nodes using a simple two dimensional classification example as depicted in Figures 2 and 3. These figures show an example of a simple problem where there are two types of classes to recognise, based upon the measurements of just two types of observables. However we cannot separate the classes with just a simple straight line (so the problem is not linearly separable). Nevertheless, it is possible to separate the two classes by using a nonlinear boundary between the two classes. This is the purpose of a neural network. There are several ways in which this nonlinear separating boundary could be produced. The first figure shows how a simple Multilayer Perceptron could produce a separating boundary by using a set of piecewise linear segments. These segments correspond to the threshold regions of the hidden nodes where the nonlinearity changes from 'not firing' to 'firing'. In this scheme, one can decide which class a novel pattern belongs to simply by deciding which side of the decision boundary line the novel pattern measurements are situated.

2.2. Differences between the Multilayer Perceptron and the Radial Basis Function networks

Figure 3 shows an alternative way to separate the two classes, and is the mechanism used by a Radial Basis Function network. The Radial Basis Function is a single hidden layer feed forward network which resembles the Multilayer Perceptron. Differences include the facts that the Radial Basis Function uses linear transfer functions on the output nodes and alternative nonlinear transfer functions to the logistic function on the hidden layer nodes. Also the first layer of the network uses
[Figure 2: plot of CLASS 1 and CLASS 2 samples against two measurements (horizontal axis: measurement #1), annotated "Piecewise linear decision boundary separating the two classes".]
Fig. 2. A simple two dimensional classification example. The separation of the two classes requires a nonlinear decision boundary. The figure
shows a typical boundary produced by the hidden layer nodes of a multilayer perceptron.
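As a minimal sketch of the Multilayer Perceptron node described in Section 2.1 (a scalar product of pattern and weights, passed through the logistic function 1/[1 + exp(-x)]), the following Python fragment computes a single hidden-node output. The function names `logistic` and `mlp_node` and the numerical values are illustrative assumptions, not quantities taken from this study.

```python
import math


def logistic(x: float) -> float:
    """Logistic transfer function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_node(pattern, weights):
    """Scalar product of pattern and weights, then the logistic nonlinearity."""
    activity = sum(p * w for p, w in zip(pattern, weights))
    return logistic(activity)


# Hypothetical two-dimensional pattern and weight vector (illustration only).
pattern = [0.5, -1.0]
weights = [2.0, 1.0]

# The scalar product is 0.5*2.0 + (-1.0)*1.0 = 0.0, so the node sits exactly
# on its threshold and the logistic function returns 0.5.
print(mlp_node(pattern, weights))
```

The set of patterns for which the scalar product is zero is a straight line in a two dimensional pattern space, which is why a boundary assembled from several such nodes appears as piecewise linear segments in Fig. 2.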
'distance' as a measure of similarity between a weight vector and a pattern vector, rather than a scalar product function as in the Multilayer Perceptron. The weight vector connected to a hidden node corresponds to the 'location' of this hidden node in the pattern space. Therefore the hidden nodes in a Radial Basis Function network respond to the difference between an input pattern and the weight vector connected to the hidden node. Naively, one could think of the hidden node as having a localised response, so that the node's influence decays as the distance between the input pattern and the weight, or the 'centre' of the node increases. Appendix 2 discusses some of the technical aspects, the architecture and the notation of the Radial Basis Function. Further information may be obtained from Broomhead and Lowe (1988) and Haykin (1994).

However, heuristically each node is 'centred' around a location in the pattern space, and the nonlinearity describes how much each data point contributes towards influencing the node. In this way a probability distribution profile of the two data clusters is constructed, specifically by summing the contributions from all the 'microclusters' as defined by the hidden nodes in the Radial Basis Function network. So in order to decide which of the two classes a new pattern belongs, we are essentially looking for the most likely class in the sense of a probability distribution. This is part of the reason that a Radial Basis Function network is particularly appropriate for the problem we are considering in this paper.

Although the Multilayer Perceptron and Radial Basis Function represent complementary views of analysing nonlinear noisy problems, the Radial Basis Function has several advantages over the Multilayer Perceptron, especially in the context of small sample problem domains, to which stylometric questions often belong. For instance, the Radial Basis Function has very strong similarities to more traditional statistical pattern classification techniques such as Parzen window classifiers and the method of Potential Functions for density estimation (Lowe, 1991). It can also be considered as a generalisation of the simple Gaussian Classifier which we also used as part of this study. Hence the architecture may be considered to be a natural extrapolation from traditional techniques. In addition it is an architecture which allows the incorporation of prior knowledge in a much easier fashion compared to the Multilayer Perceptron. For instance, the interpretation of the weights in the first layer of the Radial Basis Function is that the weights constitute a quantisation of the input space such that they are located in regions of high data density. This means that rather
[Figure 3: plot of the same CLASS 1 and CLASS 2 samples (horizontal axis: measurement #1), annotated "Circles depict the influence of each radial basis function".]
Fig. 3. The same two dimensional classification example. In this figure the division of the pattern space by a Radial Basis Function network
is revealed. Each hidden node in the network accounts for part of the cluster of data space. The description of the entire data set is obtained by
combining the contributions from all of the microclusters.
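The distance-based response of the hidden nodes, and the linear combination of their contributions at the output, can be sketched as follows. The Gaussian form of the nonlinearity, the `width` parameter, the centres and the output weights are all assumptions chosen for illustration; Appendix 2 discusses the architecture actually used, and the function names `rbf_hidden` and `rbf_output` are hypothetical.

```python
import math


def rbf_hidden(pattern, centres, width=1.0):
    """Response of each hidden node; it decays with distance from the centre.

    A Gaussian profile is assumed here purely for illustration.
    """
    responses = []
    for centre in centres:
        sq_dist = sum((p - c) ** 2 for p, c in zip(pattern, centre))
        responses.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    return responses


def rbf_output(pattern, centres, out_weights, width=1.0):
    """Linear output layer: a weighted sum of the hidden-node responses."""
    hidden = rbf_hidden(pattern, centres, width)
    return [sum(w * h for w, h in zip(row, hidden)) for row in out_weights]


# Hypothetical centres ('microclusters') and output weights, one row per output.
centres = [[0.0, 0.0], [3.0, 3.0]]
out_weights = [[1.0, 0.0], [0.0, 1.0]]

# A pattern close to the first centre excites the first node strongly and the
# second node hardly at all.
near_first = rbf_output([0.1, -0.1], centres, out_weights)
print(near_first)
```

Because each node responds only near its own centre, the decision for a new pattern is dominated by the microclusters closest to it, in contrast to the scalar-product nodes of the Multilayer Perceptron.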
than having to perform a full nonlinear optimisation to decide what values the input weights should have (as we must do in the Multilayer Perceptron), we can use our knowledge of the data and manually position the weights so that they represent the distribution of the data. Since a finite data set only has a finite number of degrees of freedom, and we use up some of those degrees of freedom for every network parameter we have to optimise, then a large neural network with many weights will have too many degrees of freedom compared to the data set itself. Therefore, if we can exploit some external prior knowledge to set some of the weight values then the degrees of freedom of the data can be used more reliably to estimate the remaining network weights (specifically, in the final layer of the Radial Basis Function network). This becomes particularly advantageous in small sample size problems. Finally, the Radial Basis Function network, being an extension of standard statistical methods, has a natural tendency to produce probabilistic outputs, rather than strict binary decision boundaries. For problems such as this one where there is intrinsic doubt anyway, one would expect that any technique which is used to help in unravelling the problem should produce answers which reflect this uncertainty. Otherwise one should doubt the validity of that technique.

The particular problem considered in this paper is one of classification. This means that the ideal quantity produced by any model which operates on ambiguous and noisy data is posterior probability estimation, i.e. given a specific exemplar pattern, produce an estimate of the probability of each class occurring conditional upon that pattern. Network structures are ideal candidates for producing approximations to posterior probabilities. For instance, using an appropriate coding scheme on the output target values (specifically 1-from-n coding), training a neural network to minimise the sum squared error induces the outputs of the network to approximate the conditional density of the class given the data (Lowe and Webb, 1991). Basically, the optimum network output approaches p(c|x), the probability of class c occurring given that the observed input pattern was x. This is a result relevant, though not specific, to the Radial Basis Function network applied to authorship questions. Therefore, if we choose the target coding scheme correctly and perform the optimisation of the network parameters appropriately then the Radial Basis Function should be
equivalent to a processing 'engine' which outputs the probabilistic decisions in support of either one author or another.

These arguments imply we choose a Radial Basis Function architecture with 5 input nodes (one for each dimension in the stylometric feature space), and two output nodes (one for each 'class', i.e. either Author A or Author B). (See Figure 1.) The target coding on the output nodes is {1,0} to indicate Author A, or {0,1} to indicate Author B. This particular coding induces the actual output of the trained network to approximate the probability of the class occurring. Note that as it is only an approximation, the output numbers cannot be interpreted strictly as probabilities. For example, even though we can guarantee that the numbers add up to unity (because we optimise the final layer weights of the network according to a Moore-Penrose pseudo-inverse method) it is not guaranteed that the numbers are less than unity, or are necessarily positive; we give examples later. Nevertheless, even with this cautionary note, the interpretation on the output values of the Radial Basis Function network is that the 'evidence' for one or other author is characterised by the distance away from the target vectors, i.e. either {1,0} or {0,1}.

3. The Data

We now turn to the practical issues of applying Radial Basis Function networks to a specific stylometric task. John Fletcher, Shakespeare's successor as chief dramatist to the King's Men, has been linked to Shakespeare through the debatable provenance of four plays: The Two Noble Kinsmen, Henry VIII, The Double Falsehood and The London Prodigal.

The Two Noble Kinsmen and Henry VIII have long been considered to be a collaboration between Shakespeare and Fletcher (Hart, 1934; Schoenbaum, 1967; Proudfoot, 1970). The Double Falsehood is now generally thought to be an adaptation of the now lost The History of Cardenio, itself a collaboration between Fletcher and Shakespeare (Taylor, 1987), and recent evidence supporting authorship of The London Prodigal by Fletcher has been produced (Matthews and Merriam, 1993; Merriam, 1992), though it was previously associated with Shakespeare.

With such confusion surrounding authorship disputes, and the anarchic environment in which the plays were written and published, clearly much care is needed in constructing the basic 'ground truth' of undisputed works. Based on a large and varied body of evidence, literary scholars have arrived at a set of 'core canon' works constituting undisputed authorship. It is from this set of undisputed works that the data used to train the various classifiers was drawn. So-called "test sets" were also constructed from undisputed plays not used as part of the training set.

The training set was constructed from descriptors extracted from the core canon plays:

Shakespeare: The Winter's Tale, Richard III, Love's Labour's Lost, A Midsummer Night's Dream, Henry IV part I, Julius Caesar, As You Like It, Twelfth Night, Antony and Cleopatra
Fletcher: The Chances, The Womans Prize, Bonduca, The Island Princess, The Loyal Subject, Demetrius and Enanthe.

Five function word descriptors (following Horton, 1987) were extracted from each play corresponding to the ratios of the occurrence of common 'scaffolding' words, (are: in: no: of: the) drawn from samples of whole acts of plays. Whether stylometric information is more appropriately captured in common scaffolding or function words or in the use of more prosaic and rare words is a debatable point. However, in forgery or mimicry it is arguably more difficult to capture the long-time frequency of use of common words, rather than the more infrequent use of 'exotic' words. Also, they are not particularly context-sensitive and since we need statistically reliable estimators (which implies higher frequencies of occurrence), the use of commonly occurring words is more useful for relatively small text samples. A total of 50 samples were used for each author.

Each set of ratios of occurrence obtained from each author was normalised to zero mean, unit variance. This ensured that gross and obvious deviations such as the total numbers of words in an Act (which would effectively constitute unhelpful noise characteristics adding to the data features) would be reduced or eliminated. Therefore we are attempting to produce a set of features where each 'channel of information' contributed equally towards the training of the classifiers.

Once a set of classifiers (including standard and 'network-based' techniques) has been determined, or 'trained', it is important to have a second set of separate
samples on which to test the generalisation ability of the classifier. Without this there is no way of gauging the success of the optimised classifier. The test set was constructed in the same manner as the training set, by extracting values of the same five scaffolding words from the following core canon plays:

Shakespeare: All's Well That Ends Well, Much Ado about Nothing, Romeo and Juliet
Fletcher: Valentinian, M'sieur Thomas

Following testing, the classifiers were used to examine the following disputed texts:

Disputed: London Prodigal, Double Falsehood, The Two Noble Kinsmen

4. Is a Neural Network Necessary?

The first question to be asked of any neural network application is: is it necessary? If the problem domain as determined by the training set data is simple, then the two classes of data should be linearly separable in the five dimensional space corresponding to the information in the scaffolding words. In such a situation the nonlinear abilities of neural network techniques are redundant and more conventional classification techniques should be employed. Two such techniques which provide a benchmark for linear decomposability are the Optimum Linear Transformation and a full Gaussian classifier. Appendix 1 briefly discusses the algorithms for the Optimum Linear Transformation classifier and the Gaussian classifier (which assumes that each data class may be described statistically as if it were generated according to a Gaussian distribution function, with a full covariance matrix). The benchmark results indicate that, although the descriptors extracted from the data do provide very good discriminatory power, the problem domain is still not linearly separable. For instance, the confusion matrix obtained on the training set using an Optimum Linear Transformation is

                      Predicted as   Predicted as
                      Shakespeare    Fletcher
Actual Shakespeare         48             2
Actual Fletcher             7            43          (1)

This matrix displays the information that of the 50 Shakespeare patterns, 48 were correctly classified and 2 were misclassified as Fletcher. Similarly, of the 50 Fletcher patterns, 7 were incorrectly labelled as Shakespeare. Therefore it is clear that even in the original five dimensional space, the training data is not linearly separable into the two classes.

A statistical clustering method does not perform any better. For instance, the confusion matrix obtained on the training set using a full Gaussian classifier (see Appendix 1) is

                      Predicted as   Predicted as
                      Shakespeare    Fletcher
Actual Shakespeare         49             1
Actual Fletcher             9            41          (2)

We can visualise something of the ambiguity inherent in the data by displaying two dimensional projections of the data. Figures 4 and 5 depict the projection of the original 5 dimensional descriptors into the space spanned by the two most significant principal components of the training data, i.e. those directions which reflect most of the variance of the data. From these figures one can see the overlapping nature of the two distributions. Nevertheless a certain amount of separation is evident and thus it should be possible to construct separate models for the distribution function of the two classes using nonlinear methods.

The information conveyed by the two dimensional projections combined with the information provided in the confusion matrices from the linear methods, suggests that it may be advantageous to use a nonlinear network model to perform the discrimination.

5. Radial Basis Function Results

Due to the sparsity of the total data set it is necessary to estimate an appropriate model order complexity and optimise the network weights on the training set alone. There is insufficient data to warrant full optimisation of the first layer parameters (an advantage of the Radial Basis Function over the previously employed Multilayer Perceptron). The number and location of the Radial Basis Function centres was determined by selecting centres randomly from the training set (thus ensuring that the distribution of the centres reflected the distribution of the training set) and setting the network complexity where the training error had its first 'plateau' as a function of the number of centres. This gave an estimate of 55 centres. The network performance is not critically sensitive to the precise number
[Figure 4: scatter plot titled "Training Set Projection onto first 2 singular vectors". Axes: x from about -0.03 to 0.03, y from about -0.02 to 0.02. Key: Shakespeare data plotted as O, Fletcher data plotted as +.]
Fig. 4. Projection of the training set onto the two most significant Principal Components.
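The projection used for Figures 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the figures are described as projections onto the first two singular vectors, whereas the sketch below obtains the same leading directions by power iteration on the sample covariance matrix, a technique swapped in here to avoid any external library. The function names `principal_directions` and `project`, and the small data set, are hypothetical.

```python
import math
import random


def principal_directions(data, k=2, iters=200, seed=0):
    """Leading k principal directions by power iteration with deflation."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[x - m for x, m in zip(row, mean)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    directions = []
    for _ in range(k):
        start = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        scale = math.sqrt(sum(x * x for x in start))
        v = [x / scale for x in start]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no variance left in any direction
                break
            v = [x / norm for x in w]
        eigval = sum(v[a] * cov[a][b] * v[b]
                     for a in range(d) for b in range(d))
        directions.append(v)
        # Deflate: remove the variance explained by the direction just found.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return mean, directions


def project(row, mean, directions):
    """Coordinates of one pattern in the space of the chosen directions."""
    return [sum((x - m) * v for x, m, v in zip(row, mean, vec))
            for vec in directions]


# Hypothetical 3-dimensional 'training set' (the study uses 5 dimensions).
train = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
mean, directions = principal_directions(train)
print(project([2.0, 0.0, 0.0], mean, directions))
```

As in Fig. 5, patterns from a test set would be passed to `project` with the mean and directions estimated from the training set, so that both sets are displayed in the same two dimensional space. The sign of each direction is arbitrary.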
of hidden nodes. Recall that we are using nonparametric basis functions so there is no issue of estimating covariance matrices or extra smoothing parameters.

Having estimated the appropriate model order complexity on this data set, the confusion matrix we obtained on the training set using this Radial Basis Function network was

                      Predicted as   Predicted as
                      Shakespeare    Fletcher
Actual Shakespeare         49             1
Actual Fletcher             0            50          (3)
[Figure 5: scatter plot titled "Test Set Projection onto first 2 singular vectors". Axes: horizontal from about -0.03 to 0.03, vertical from about -0.02 to 0.02. Key to numbered points: 1 All's Well; 2 Much Ado About Nothing; 3 Romeo and Juliet; 4 Valentinian; 5 Monsieur Thomas; 6 London Prodigal; 7 Double Falsehood; 8 Two Noble Kinsmen Act I; 9 Two Noble Kinsmen Act II; 10 Two Noble Kinsmen Act III; 11 Two Noble Kinsmen Act IV; 12 Two Noble Kinsmen Act V.]
Fig. 5. Projection of the test set onto the same two most significant Principal Components.
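The training procedure of Section 5 (centres drawn at random from the training set, final layer weights obtained by linear least squares against 1-from-n targets) can be sketched as below. Several points are assumptions made for illustration and are not taken from the study: a Gaussian basis function with a `width` parameter is used although the paper reports basis functions needing no smoothing parameter; the least-squares weights are found from the normal equations by Gaussian elimination, which agrees with the Moore-Penrose pseudo-inverse solution only when the hidden-layer matrix has full column rank; and the two dimensional data and all function names are hypothetical.

```python
import math
import random


def hidden(pattern, centres, width):
    """Gaussian hidden-node responses (assumed basis, illustration only)."""
    return [math.exp(-sum((p - c) ** 2 for p, c in zip(pattern, centre))
                     / (2.0 * width ** 2)) for centre in centres]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [[aug[i][m + j] / aug[i][i] for j in range(len(b[0]))]
            for i in range(m)]


def fit_rbf(patterns, targets, centres, width=1.0):
    """Least-squares final layer weights for fixed centres."""
    phi = [hidden(x, centres, width) for x in patterns]
    m, k = len(centres), len(targets[0])
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(m)]
            for i in range(m)]
    rhs = [[sum(row[i] * t[c] for row, t in zip(phi, targets))
            for c in range(k)] for i in range(m)]
    return solve(gram, rhs)


def predict(pattern, centres, weights, width=1.0):
    """Network outputs, one per class."""
    h = hidden(pattern, centres, width)
    return [sum(h[i] * weights[i][c] for i in range(len(h)))
            for c in range(len(weights[0]))]


# Hypothetical two-class training data with 1-from-n target coding.
class_a = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
class_b = [[4.0, 4.0], [4.0, 5.0], [5.0, 4.0], [5.0, 5.0]]
patterns = class_a + class_b
targets = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4

# Centres selected at random from the training set, as in Section 5.
centres = random.Random(0).sample(patterns, 2)
weights = fit_rbf(patterns, targets, centres)
outputs = [predict(x, centres, weights) for x in patterns]
print(centres)
print(outputs[0])
```

With one centre fixed in each cluster, for example `[[0.0, 0.0], [5.0, 5.0]]`, the output for the pattern at the first centre exceeds unity, a small instance of the caution in Section 2.2 that the outputs are not strictly probabilities.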
illustrating a clear superiority over the traditional methods. However, the main question is how well it performed on the test set, and what interpretations we can infer.
To assist interpretation of the network's output values (since they are not strictly probabilities), we introduce a 'Characteristic Shakespeare Indicator' (CSI). Recall that the optimum target vector for a Shakespeare play was the ordered pair $t_S = (1, 0)$, one value for each output node. The ordered pair $t_F = (0, 1)$ was the ideal Fletcher target vector. In the training process we minimise the sum of squares error, which attempts to reduce the distance between the actual output vector, $o$, and the desired target vector, $t$. What is relevant is the distance of the output vector from each of the target vectors. So a suitable indication of how 'Shakespeare-like' a given output vector is, is provided by the function

$$\mathrm{CSI} = 1 - \frac{\|o - t_S\|^2}{\|o - t_S\|^2 + \|o - t_F\|^2} = \frac{\|o - t_F\|^2}{\|o - t_S\|^2 + \|o - t_F\|^2} \qquad (4)$$

A value of '1' indicates full Shakespearean style, and a value of '0' indicates full Fletcherian characteristics.
The following table of results shows the network's output for the prediction of the authorship of plays in the test set.

Play                           Actual outputs      CSI      Decision
All's Well                     (1.13, 0.013)       0.992    Shakespeare
Much Ado About Nothing         (0.905, 0.0945)     0.989    Shakespeare
Romeo and Juliet               (0.972, 0.028)      0.999    Shakespeare
Valentinian                    (0.07, 0.93)        0.006    Fletcher
M'sieur Thomas                 (0.125, 0.875)      0.020    Fletcher
London Prodigal                (0.24, 0.76)        0.090    Fletcher
Double Falsehood               (0.18, 0.82)        0.046    Fletcher
Two Noble Kinsmen (Act I)      (0.83, 0.17)        0.960    Shakespeare
Two Noble Kinsmen (Act II)     (-0.135, 1.135)     0.014    Fletcher
Two Noble Kinsmen (Act III)    (0.306, 0.694)      0.163    F/S??
Two Noble Kinsmen (Act IV)     (0.729, 0.271)      0.878    S/F??
Two Noble Kinsmen (Act V)      (0.967, 0.033)      0.999    Shakespeare

As can be seen, the network produces the correct classification in the test set on the commonly ascribed plays (All's Well, Much Ado About Nothing, Romeo and Juliet [Shakespeare] and Valentinian, M'sieur Thomas [Fletcher]). What about the disputed works? Although the support is not quite so strong, the network indicates that The Double Falsehood should be ascribed primarily to Fletcher rather than Shakespeare, in agreement with contemporary scholarship (Metz, 1989). Similarly we find that The London Prodigal is predominantly Fletcherian, though with some Shakespearian influences.
The verdict on the Two Noble Kinsmen is particularly interesting. This play has long been considered by some to be a genuine collaboration between the two dramatists. With this in mind, the network was applied to individual Acts of the play. Briefly, the network shows very strong support for Shakespeare writing Acts I and V, and for Fletcher writing Act II. However, the situation on Acts III and IV is not so clear. The support is split: Fletcher appears to have written much of Act III but with significant Shakespeare input, and Shakespeare to have written Act IV, though with significant Fletcher involvement. Each individual Act may have been written collaboratively, although there may well be too much noise on the data to make a firm decision, because the statistics were estimated on individual Acts rather than whole plays. Nevertheless, these results are in broad agreement with the assessments of Hoy (1956) and Proudfoot (1970) and with those found by using Multilayer Perceptron methods (Matthews and Merriam, 1993).
In short, when applied to previously unseen data from well provenanced works, the trained Radial Basis Function network produces classifications in agreement with conventional scholarship. When applied to disputed works, the Radial Basis Function produces results which may be seen to be in general agreement with contemporary opinions arrived at by a variety of alternative and often subjective means.
The fact that these works are disputed implies that there is doubt in the expert opinions of the scholars. This doubt can be quantified for the Radial Basis Function results. For instance, the deviation of the CSI from its ideal Fletcher value of 0 for Valentinian, an undisputed Fletcher work, is only 0.006, whereas the CSI for the disputed London Prodigal indicates an uncertainty margin fifteen times larger, at 0.09. We can use the Characteristic Shakespeare Indicator as given by equation (4) to rank how 'certain' the network predictions are, for either Fletcher or Shakespeare. We find, for example, that the five most 'uncertain' data samples correspond to (in order of decreasing uncertainty): Two Noble Kinsmen, Act III; Two Noble Kinsmen, Act IV; London Prodigal; Double Falsehood; and Two Noble Kinsmen, Act I. These are of course examples from the traditionally 'disputed' plays.
Thus the Radial Basis Function network is capable of producing a much richer interpretation of its predictions than a simple binary segmentation of data samples into either Fletcher or Shakespeare camps. This also now provides a system for fast and efficient 'classification' of other Fletcher or Shakespeare works, which should also produce a degree of 'uncertainty' attached to a prediction.

6. Conclusions
This paper has presented an analysis of stylometric features using Radial Basis Function and standard techniques to infer authorship of disputed historic documents. It was demonstrated that a nonlinear model, and specifically the Radial Basis Function network architecture, was both required and could be used effectively to separate, by author, known works of literature and
in addition provide evidence of authorship for disputed works. Comparisons with other more standard techniques have illustrated an advantage to using Radial Basis Function neural networks.
The main conclusion of this work is to demonstrate the utility and potential of quantitative techniques such as neural networks as an additional set of tools to assist the decision making processes in relatively subjective disciplines. Advantages include the ability to deal with statistically noisy data samples and intrinsically nonlinear relationships. Disadvantages stem from the requirement for large amounts of raw data, which would rule out the exploitation of neural networks as, for example, a forensic tool. However, as the amount of available computer readable literary texts continues to increase, we can expect expansion in the use of automated pattern recognition techniques, such as neural networks, as assistants to help in the resolution of outstanding literary mysteries.

Appendix 1: The Optimum Linear Transformation, and Gaussian Classifier
This Appendix briefly discusses the mathematics used to construct the 'benchmark' classifiers of an Optimum Linear Transformation (OLT) and a Gaussian Classifier (GC). First let us introduce some notation. If there are $n$ observable quantities (such as the five 'scaffolding' words used in this study), let us denote each one of them as $x_i$, $i = 1, 2, \ldots, n$, and the collection of these, or vector set of the observables, as $x_p$ for the p-th exemplar pattern. Assume we have a total number of $P$ patterns in the training set. Then we can collect together the entire set of input patterns into a matrix $X$ of size $n \times P$. Corresponding to each input pattern in the training set we have a desired 'target' value (i.e. either Shakespeare or Fletcher coded appropriately). Let us assume that there are $c$ components in each target pattern ($c = 2$ in the examples considered in this paper). We denote the p-th target pattern as the vector $t_p$. We can collect together all the target patterns into a matrix $T$ of size $c \times P$.

A.1 Optimum linear transformation (OLT)
The optimum linear transformation seeks to find the best transformation (i.e. the one that minimises the sum squared residual error) between the matrix of desired target values and the matrix of values obtained by an arbitrary linear transformation of the input data (so by using translations, rotations, scalings and reflections). We can formalise this as follows. For the $n \times P$ matrix $X$ of input patterns and the corresponding $c \times P$ matrix $T$ of target patterns on the training set, the problem is to find the optimum $c \times n$ matrix $A$ and $c \times 1$ vector $b$ (which accounts for translations) which satisfy the equation

$$A X + b\,\mathbf{1}^{*} = T$$

with minimum residual error. Note that $\mathbf{1}^{*}$ is a row vector of 1's of size $1 \times P$. The solution with minimum Frobenius norm may be found by pseudo-inverse methods, for which several numerical procedures have been developed. Once $A$ and $b$ have been determined, an arbitrary input pattern is linearly transformed into a pattern $t$ in the target pattern space. The class associated with that input pattern is determined by the closest target vector to the transformed pattern, $t$.

A.2 Gaussian classifier (GC)
The Gaussian classifier is probabilistically motivated. We assume each pattern in each class is drawn from a full Gaussian distribution, where each class $c$ may be characterised by a mean vector, $\mu_c$, and a (full) covariance matrix, $\Sigma_c$. These are approximated by the training set samples. Then, given the c-th class, the probability of any given pattern $x_p$ belonging to that class may be expressed as

$$P(x_p \mid c) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_c|^{1/2}} \exp\left[-\tfrac{1}{2}\,(x_p - \mu_c)^{*}\,\Sigma_c^{-1}\,(x_p - \mu_c)\right]$$

Knowing the prior probabilities $p_c$ of the occurrence of each class allows the determination of the probability of that class $c$ given the pattern as $p_c\,P(x_p \mid c)$ (up to a normalising constant common to all classes). Thus, the decision is to choose that class which gives the largest pattern conditional probability.

Appendix 2: The Radial Basis Function Network
The Radial Basis Function is a single hidden layer feed forward network with linear transfer functions on the output nodes and nonlinear transfer functions on the hidden layer nodes. Many types of nonlinearities may be used. There is also typically a bias or an offset weight on each output node, though not usually on the hidden nodes. The primary adjustable parameters are the final layer weights, $\{\lambda_{jk}\}$, connecting the j-th hidden node to the k-th output node. There are also weights $\{\mu_{ij}\}$ connecting the i-th input node with the j-th hidden node (see Figure 1) and occasionally a 'smoothing' factor matrix, $\{\Sigma_j\}$, which denotes the range of influence of each node. The bias node is labelled by $j = 0$.
The mathematical embodiment of the Radial Basis Function takes the following form. The k-th component of the output vector $y$, corresponding to the p-th input pattern $x_p$, is expressed as

$$[y(x_p)]_k = \sum_{j=0}^{h} \lambda_{jk}\,\phi_j(\|x_p - \mu_j\|)$$

where $\phi_j(\ldots)$ denotes the nonlinear transfer function of hidden node $j$. The most common example of the smoothing factor is in the use of a general Gaussian transfer function, i.e. the nonlinear transfer function takes the form $\phi(z) \sim \exp[-z^{T}\Sigma^{-1}z]$. In the simple case of a Euclidean distance this nonlinearity simplifies to $\phi(z) \sim \exp[-\|z\|^{2}/\sigma^{2}]$, where $\sigma$ is a scalar smoothing parameter. In this paper, as an illustration of the general approach, we have chosen to employ a nonparametric 'spline' basis function, i.e. $\phi(z) \sim z\log(z)$, which, although less flexible, has the advantage that it does not require the additional estimation of smoothing parameters. An interesting point about this type of nonlinearity is that it is unbounded. It is commonly believed that the major advantage of the use of Radial Basis Function architectures is the exploitation of the locality of the basis functions. However, there are good reasons, which we cannot discuss as part of this paper, why nonlocalised basis functions are more appropriate for interpolation problems. The important point is that the network as a whole needs to be able to generate a localised mapping of the data. This we can achieve through the training process, since the Radial Basis Function structure is computationally universal.
One of the advantages of the Radial Basis Function is that the first layer weights $\{\mu_j, \Sigma_j;\ j = 1, \ldots, h\}$ may often be determined or specified by a judicious use of prior knowledge, or adapted by simple techniques. Early work (Broomhead and Lowe, 1988) found it sufficient to position the basis functions at data points sampled randomly according to the distribution of the data. This ensured that network resources were concentrated in regions of higher data density. Another early technique (Moody and Darken, 1989) was to position the centres of the basis functions according to a K-means clustering process on the data points and then set the smoothing parameters of the assumed Gaussian basis functions to be the average distance between cluster centres. Therefore, once the weights associated with the first layer have been specified, the major problem in 'training' a Radial Basis Function network is focussed upon the determination of the final layer weights. Since this is a linear optimisation process (the parameters $\{\lambda_{jk}\}$ occur linearly when minimising the residual sum squared error measure, as is usually employed in the training process), the Radial Basis Function is computationally more attractive in applications than a Multilayer Perceptron, even though they are both computationally universal architectures.
As this final phase is a linear optimisation process, again a pseudoinverse method or other efficient minimisation techniques may be used to obtain the optimum network weights (see Broomhead and Lowe, 1988; Haykin, 1994 for further details).

Acknowledgements
The authors would like to thank Thomas Merriam for advice, comments, guidance and providing the data used as part of this study. This data, used to train the Radial Basis Function, is available from Robert Matthews in ASCII form to all who provide either an email address or a 3.5 inch disc with return postage. Robert Matthews can be contacted by the e-mail address [email protected].

References
Broomhead, D. S. and David Lowe. "Multi-variable Functional Interpolation and Adaptive Networks". Complex Systems, 2, 3 (1988), 269-303.
Girosi, F., M. Jones and T. Poggio. "Regularization Theory and Neural Network Architectures". Neural Computation, 7, 2 (1995), 219-269.
Hart, A. "Shakespeare and the Vocabulary of The Two Noble Kinsmen". Melbourne: Melbourne University Press, 1934.
Haykin, S. "Neural Networks: A Comprehensive Foundation". (Chapter 7: Radial Basis Function Networks). Macmillan, 1994.
Horton, T. B. The Effectiveness of the Stylometry of Function Words in Discriminating between Shakespeare and Fletcher. Doctoral Thesis. University of Edinburgh, 1987.
Hoy, C. "The Shares of Fletcher and his Collaborators in the Beaumont and Fletcher Canon (VIII)". Studies in Bibliography, 15 (1956), 129-146.
Lowe, D. "What Have Neural Networks to Offer Statistical Pattern Processing". SPIE Proceedings on Adaptive Signal Processing 1565 (1991), 460-71.
Lowe, D. and A. R. Webb. "Optimized Feature Extraction and the Bayes Decision in Feed-forward Classifier Networks". Pattern Analysis and Machine Intelligence, 13, 4 (1991), 355-64.
Lowe, D. "Novel 'Topographic' Nonlinear Feature Extraction using Radial Basis Functions for Concentration Coding in the 'Artificial Nose'". 3rd IEE International Conference on Artificial Neural Networks, Conference Publication number 372 (1993), pp. 95-99.
Matthews, R. A. J. and T. V. N. Merriam. "Neural Computation in Stylometry I: An application to the works of Shakespeare and Fletcher". Literary and Linguistic Computing, 8, 4 (1993), 203-209.
Merriam, T. V. N. "Modelling a Canon: Principles and Examples in Applied Statistics". Doctoral Thesis. University of London, 1992.
Metz, G. H., ed. Sources of Four Plays Ascribed to Shakespeare. Columbia: University of Missouri Press, 1989.
Moody, J. and C. Darken. "Fast Learning in Networks of Locally Tuned Processing Units". Neural Computation, 1, 2 (1989), 281-94.
Niranjan, M. and F. Fallside. "Neural Networks and Radial Basis Functions in Classifying Static Speech Patterns". Computers, Speech and Language, 4 (1990), 275-89.
Park, J. and I. W. Sandberg. "Universal Approximation using Radial Basis Function Networks". Neural Computation, 3 (1991), 246-257.
Powell, M. J. D. "The Theory of Radial Basis Function Approximation in 1990". In Advances in Numerical Analysis. Vol II: Wavelets, Subdivision Algorithms and Radial Basis Functions. Ed. W. A. Light. Oxford University Press, 1992, 105-210.
Proudfoot, G. R., ed. "The Two Noble Kinsmen". London: Edward Arnold, 1970.
Rao Vemuri, V. and R. D. Rogers, eds. Artificial Neural Networks: Forecasting Time Series. IEEE Computer Society Press, 1994.
Schoenbaum, S., ed. The Famous History of the Life of King Henry the Eighth. New York: The New American Library, 1967.
Taylor, G. "The Canon and Chronology of Shakespeare's Plays". In William Shakespeare: A Textual Companion. Oxford: Clarendon Press, 1987.
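The training procedure described in Appendix 2 (fix the first layer, then solve the final layer as a linear least-squares problem) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The data are synthetic placeholders, the number of centres (`n_centres=10`) and all function names are hypothetical, and centres are simply drawn at random from the data in the manner the appendix attributes to early work. Patterns are stored as rows here, i.e. the transpose of the $n \times P$ convention of Appendix 1. The basis function is the $z\log(z)$ form quoted in Appendix 2, taken as 0 at $z = 0$.

```python
import numpy as np

def basis(z):
    """phi(z) = z * log(z), using the limiting value phi(0) = 0."""
    out = np.zeros_like(z, dtype=float)
    positive = z > 0
    out[positive] = z[positive] * np.log(z[positive])
    return out

def design_matrix(x, centres):
    """Hidden-layer outputs with a leading column of ones for the bias node j = 0."""
    # Pairwise Euclidean distances ||x_p - mu_j||, shape (P, h).
    dist = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
    return np.hstack([np.ones((x.shape[0], 1)), basis(dist)])

def train_rbf(x, targets, n_centres, rng):
    """Place centres on randomly chosen data points, then solve the final layer."""
    idx = rng.choice(x.shape[0], size=n_centres, replace=False)
    centres = x[idx]
    # Least-squares final-layer weights via the Moore-Penrose pseudo-inverse.
    weights = np.linalg.pinv(design_matrix(x, centres)) @ targets
    return centres, weights

def predict(x, centres, weights):
    return design_matrix(x, centres) @ weights

rng = np.random.default_rng(1)
# Synthetic stand-in data: two well-separated clusters in 5 dimensions.
x = np.vstack([rng.normal(0.0, 0.1, size=(50, 5)),
               rng.normal(1.0, 0.1, size=(50, 5))])
# Targets (1, 0) for the first class and (0, 1) for the second.
t = np.vstack([np.tile([1.0, 0.0], (50, 1)), np.tile([0.0, 1.0], (50, 1))])

centres, weights = train_rbf(x, t, n_centres=10, rng=rng)
outputs = predict(x, centres, weights)
accuracy = np.mean(outputs.argmax(axis=1) == t.argmax(axis=1))
```

Replacing `design_matrix` with the raw inputs plus a column of ones would give the optimum linear transformation of Appendix A.1, since the same pseudo-inverse step then solves for $A$ and $b$ directly.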
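The Characteristic Shakespeare Indicator of equation (4) and the uncertainty ranking discussed in the results can be recomputed from the tabulated network outputs. The sketch below uses a subset of rows from the results table. The function names are hypothetical, and measuring 'uncertainty' as the distance of the CSI from the nearer of its two ideal values (0 or 1) is an assumption about how the ranking was formed, chosen because it reproduces the order reported in the text.

```python
import numpy as np

T_S = np.array([1.0, 0.0])  # ideal Shakespeare target vector
T_F = np.array([0.0, 1.0])  # ideal Fletcher target vector

def csi(output):
    """Characteristic Shakespeare Indicator: 1 = Shakespeare, 0 = Fletcher."""
    o = np.asarray(output, dtype=float)
    d_s = np.sum((o - T_S) ** 2)  # squared distance to the Shakespeare target
    d_f = np.sum((o - T_F) ** 2)  # squared distance to the Fletcher target
    return d_f / (d_s + d_f)

def uncertainty(value):
    """Assumed measure: distance of the CSI from the nearer ideal value."""
    return min(value, 1.0 - value)

# Network outputs copied from the results table (subset of rows).
outputs = {
    "Romeo and Juliet": (0.972, 0.028),
    "M'sieur Thomas": (0.125, 0.875),
    "London Prodigal": (0.24, 0.76),
    "Double Falsehood": (0.18, 0.82),
    "Two Noble Kinsmen (Act I)": (0.83, 0.17),
    "Two Noble Kinsmen (Act III)": (0.306, 0.694),
    "Two Noble Kinsmen (Act IV)": (0.729, 0.271),
}

# Plays ordered from least to most certain prediction.
ranked = sorted(outputs, key=lambda play: uncertainty(csi(outputs[play])), reverse=True)
for play in ranked:
    print(f"{play}: CSI = {csi(outputs[play]):.3f}")
```

Under this assumed measure the first five entries of `ranked` come out in the order quoted in the text: Act III, Act IV, London Prodigal, Double Falsehood, Act I.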