“Rational Creatures”: Using Vector Space Models to Examine Independence in the Novels of Jane Austen, Maria Edgeworth, and Sydney Owenson (1800–1820)
Total Page:16
File Type:pdf, Size:1020Kb
“Rational Creatures”: Using Vector Space Models to Examine Independence in the Novels of Jane Austen, Maria Edgeworth, and Sydney Owenson (1800–1820) Sara Jane Kerr A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy Maynooth University Faculty of Arts, Celtic Studies and Philosophy February 2019 HoD & Supervisor 1: Supervisor 2: Prof. S. Schreibman Dr C. Brunstrom Contents Contents i List of Figures v List of Tables ix Abstract xi Acknowledgements xiii Dedication xv Introduction 1 1 Independent Women 11 1.1 Contemporary Reception . 12 1.1.1 Women Readers, Women Writers . 12 1.1.2 The Reviewers . 34 1.2 Independence . 45 1.2.1 Education . 48 1.2.2 Wealth and Work . 54 1.2.3 Marriage and Sexual Conduct . 58 1.2.4 National Independence . 67 2 Enhanced Reading 75 i 2.1 ‘Close’ and ‘Distant’ . 76 2.2 Beyond Close Reading? . 83 2.3 Visualisation . 88 2.3.1 The Purposes of Visualisation . 89 2.3.2 Visualisation in Digital Humanities . 93 2.3.3 Austen, Edgeworth, and Owenson . 97 Frequency Analysis . 98 Topic Modelling . 101 Word Embedding . 102 Network Analysis . 103 3 Enhanced Reading in Practice 107 3.1 Corpus Creation . 107 19th Century Comparison Corpus . 114 3.2 Computational Analysis . 121 3.2.1 R Programming . 121 3.2.2 Vector Space Models . 124 Frequency Analysis . 128 Topic Modelling . 129 Word Embedding . 131 Semantic Network Analysis . 136 3.3 Conclusion . 139 4 Term-Document Models 141 4.1 Frequency Analysis . 141 4.1.1 Grouping Texts . 142 ii 4.1.2 Type-Token Ratio . 146 4.1.3 Collocations . 150 4.2 Topic Modelling . 156 4.2.1 Combined Corpus . 160 4.2.2 The Austen, Edgeworth, and Owenson Corpus . 165 5 Word-Context Models 169 5.1 Word-Embedding Models . 170 5.1.1 An Independence Lexicon . 170 5.1.2 An Enhanced Reading of Mansfield Park ........ 174 5.1.3 Education, Employment, Wealth and Rank . 185 Education and Employment . 185 Wealth and Rank . 192 5.1.4 Nation and Independence . 198 5.2 Semantic Networks . 206 Conclusion 215 A Text Cleaning Script 227 B Nineteenth Century Corpus 229 C Standard Stop Word List in tm 233 D w2v_analysis Script 235 E Type-Token Ratio By Text 237 F 20 Topic Model for Combined Corpus 241 iii G Topic Model for Austen, Edgeworth, Owenson Corpus 243 Bibliography 245 iv List of Figures 1.1 Novels By Gender 1800-1820. (data from Garside and Schöw- erling 2000, p. 73) . 27 3.1 SO Text as Downloaded . 113 3.2 SO Text After Reprocessing . 113 3.3 Tokens Per Text By Year . 115 3.4 Screen Shot of OCR to Plain Text . 119 3.5 Plot of 19thC Corpus by Gender and Date . 122 3.6 A Simple Document-Term Maxtrix . 125 3.7 The CBOW Model . 133 3.8 The Skip-Gram Model . 135 3.9 Cosine Similarity . 138 4.1 Bootstrap Consensus Tree - Austen, Edgeworth and Owenson 144 4.2 Bootstrap Consensus Tree - 19th Century Corpus Plus Austen, Edgeworth and Owenson . 146 4.3 Type-Token Ratio for Novels in Combined Austen, Edgeworth, Owenson, and 19th Century Corpus . 148 4.4 Top Ten Terms for Combined Corpus - 37 Topic Model . 163 4.5 37 Topic Model for Combined Austen, Edgeworth, Owenson, and 19th Century Corpus . 164 v 4.6 20 Topic Model for Combined Austen, Edgeworth, Owenson Corpus . 166 5.1 Key Word in Context Output . 173 5.2 The Mansfield Park Theatricals . 174 5.3 Word2Vec Model of Austen’s Top 500 Terms Related to ‘inde- pendent’ and ‘independence’. 176 5.4 Word2Vec Model of Edgeworth’s Top 500 Terms Related to ‘independent’ and ‘independence’ . 186 5.5 Cluster of Wealth, Employment and Independence Terms in Word2Vec Model of the Austen Corpus . 193 5.6 Word2Vec Model of Owenson’s Top 500 Terms Related to ‘in- dependent’ and ‘independence’. 195 5.7 Cluster of Positive and Negative Terms in Word Embedding Model of the Austen Corpus . 196 5.8 Cluster of Positive and Negative Terms in Word Embedding Model of the Owenson Corpus . 197 5.9 Network Graphs of top 250 ‘independent/independence’ Terms in Austen, Edgeworth, and Owenson Corpora . 207 5.10 Network Graphs of top 250 ‘independent/independence’ Terms in 19th Century and CLMET Corpora . 209 5.11 Section of Network Graph Showing ‘independent’ in the 19th Century and CLMET Corpora . 211 5.12 Ego Network for ‘independence’ in Edgeworth Corpus . 212 5.13 Ego Network for ‘independent’ in Owenson Corpus . 213 5.14 Ego Network for ‘capacity’ in Austen Corpus . 214 vi A.1 Text Cleaning Script . 228 D.1 w2v_analysis Script . 236 F.1 Top Ten Terms for Combined Corpus - 20 Topic Model . 242 G.1 Top Twenty Terms for Austen, Edgeworth and Owenson Corpus - 20 Topic Model . 244 vii List of Tables 3.1 Austen Corpus . 110 3.2 Edgeworth Corpus . 111 3.3 Owenson Corpus . 114 3.4 19th Century Female Corpus . 118 3.5 19th Century Male Corpus . 120 3.6 Word similarity to ‘king’ - ‘man’ + ‘woman’ . 134 4.1 Mean Type-Token Ratio by Corpus . 149 4.2 Austen’s Top 15 Collocates of ‘Independent’ . 151 4.3 Austen’s Top 15 Collocates of ‘Independence’ . 152 4.4 Edgeworth’s Top 15 Collocates of ‘Independent’ . 153 4.5 Edgeworth’s Top 15 Collocates of ‘Independence’ . 153 4.6 Owenson’s Top 15 Collocates of ‘Independent’ . 154 4.7 Owenson’s Top 15 Collocates of ‘Independence’ . 154 4.8 Part of speech Tags and Definitions . 161 5.1 Comparison of Top 10 Terms in ‘independence’ word2vec Models171 B.1 19th Century Female Corpus - Full Title and Source . 230 B.2 19th Century Male Corpus - Full Title and Source . 231 E.1 Austen Corpus - Type-Token Ratio . 237 E.2 Edgeworth Corpus - Type-Token Ratio . 238 ix E.3 Owenson Corpus - Type-Token Ratio . 238 E.4 19th Century Female Corpus - Type-Token Ratio . 239 E.5 19th Century Male Corpus - Type-Token Ratio . 240 x Abstract Sara Jane Kerr “Rational Creatures”: Using Vector Space Models to Examine Independence in the Novels of Jane Austen, Maria Edgeworth, and Sydney Owenson (1800–1820) Recent trends in digital humanities have led to a proliferation of studies that apply ‘distant’ reading to textual data. There is an uneasy relationship be- tween the increased use of computational methods and their application to literary studies. Much of the current literature has focused on the exploration of large cor- pora. However, the ability to work at this scale is often not within the power (financial or technical) or the interests, of researchers. As these large-scale studies often ignore smaller corpora, few have sought to define a clear theoret- ical framework within which to study small-scale text collections. In addition, while some research has been carried out on the application of term-document vector space models (topic models and frequency based analysis) to nineteenth century novels, no study exists which applies word-context models (word em- beddings and semantic networks) to the novels of Austen, Edgeworth, and Owenson. This study, therefore, seeks to evaluate the use of vector space models when applied to these novels. xi This research first defines a theoretical framework - enhanced reading - which combines the use of close and distant reading. Using a corpus of twenty-eight nineteenth century novels as its central focus, this study also demonstrates the practical application of this theoretical approach with the additional aim of providing an insight into the authors’ representation of independence at a time of great political and social upheaval in Ireland and the UK. The use of term-document models was found to be, generally, more useful for gaining an overview of the corpora. However, the findings for word-context models reveal their ability to identify specific textual elements, some of which were not readily identified through close reading, and therefore were useful for exploring texts at both corpus and individual text level. xii Acknowledgements Firstly, I would like to thank my supervisors Professor Susan Schreibman and Dr Conrad Brunstrom for their guidance, advice and support throughout my time at Maynooth University. Their support has enabled me to develop into a more confident researcher, writer and presenter. This research was made possible by a John and Pat Hume Scholarship, without which I would not have been able to finance my studies. I would also like to extend my appreciation to Dr Costas Papadopoulos, a val- ued member of my advisory committee, and Conor Wilkinson, from Graduate Studies, for all their help and support. Developing the ideas which have come to fruition in this thesis would not have been possible without the feedback I have received at conference presentations: thank you to the organisers, and to those who asked questions. In addition, I would like to thank the members of my Viva Committee: Dr JoAnne Mancini, Dr Conor McCarthy and Dr Derek Greene, for their time, their feedback, and their kind comments on my work. On a more personal note, I would like to express my sincere thanks and grat- itude to my husband Iain, who has not only supported me emotionally and financially while I have completed my PhD, but has also put up with my work materials taking up almost every surface in our house. A special mention goes xiii out to my team of proofreaders and advisors: my sister Joanna, and my good friend Dr Stafford Murray. Finally, I would like to thank my parents, Martin and Elsie, who have always supported my academic endeavours. xiv Dedication For Wan, Mollie and Erin.