AB
Pamphlet 8 January 2015
Literary Lab
Between Canon and Corpus: Six Perspectives on 20th-Century Novels
Mark Algee-Hewitt Mark McGurl Pamphlets of the Stanford Literary Lab
ISSN 2164-1757 (online version) Mark Algee-Hewitt Mark McGurl
Between Canon and Corpus: Six Perspectives on 20th-Century Novels
1. Dilemmas of Selection
Of the many, many thousands of novels and stories published in English in the 20th century, which group of several hundred would represent the most reasonable, interesting, and use- ful subset of the whole?
This was the difficult question posed to researchers in the Stanford Literary Lab when they decided to move ahead with plans to create a fully digitized corpus of 20th-century fiction. Lacking any such resource, scholars here and elsewhere had been largely unable to engage in the kinds of large-scale quantitative analyses of literary historical data routinely performed on the texts of earlier periods, which have depended on the ready availability of corpora such as the Chadwyck-Healey database of Nineteenth-Century Fiction, Gale’s Eighteenth Century Collections Online database, and the like. Using this data, the critic-researchers of the Lab had, for instance, been able to make new observations and analyses of the histori- cal nature of novelistic genres, of large-scale shifts in novelistic language over the course of the 19th century, and the nature of style at the level of the sentence.1 Given how often these results had been represented graphically as occurring along a historical time-line, one nat- urally wondered, moving the eye from left to right: what happens next? What happens to these trends in the 20th—and for that matter 21st—centuries? Do novelistic genres operate the same way in the 20th century as they did in the 19th? Does the use of words connoting
1 See, for example, Allison, Sarah, Heuser, Ryan et al. Quantitative Formalism: an Experiment, 2011; Heuser, Ryan and Le-Khac, Long. A Quantative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method, 2012; Allison, Sarah, Gemma, Marissa, et al. Style at the Scale of the Sentence, 2013.
1 abstract values continue to decline? Do sentences get more or less complex? And, for that matter, what entirely new literary historical phenomena might become visible in the data of more recent literary history? And there one’s curiosity perforce remained in suspension, un- satisfied. Although a great deal of the literature of the 20th century has long existed in digital form on the servers of publishing houses, that data has largely not been made available for the use of scholars.
But supposing the doors to the storeroom of all of the novels and stories written in English in the 20th century were opened to inspection and selection, which of them would one even choose? After all, the number of books published in English grows exponentially. Accord- ing to publishers’ own data, in the last forty years alone the number of unique fictional texts published per year in English worldwide has grown from 7,948 to 278,985 (Figure 1).
A 20th-century corpus would therefore be selected from a number of books potentially or- ders of magnitude greater than one built for the 18th or even 19th century, making any aspira- tion we might have had toward the rigorous statistical representativeness of the new corpus unrealistic. Because the labor and expenses involved in assembling a reliable database of digitized texts are considerable, it was decided that that number would initially have to be restricted to roughly 350, the same size as some of the other high quality or reliably “clean” corpora already in use by the Lab.
No doubt the most efficient path forward would have been for one Lab member, ideally a scholar of 20th-century literature, to select the requisite number of texts from an extensive, if by no means exhaustive, private collection of paperbacks amassed over many years as a student, teacher, scholar and general reader. An idiosyncratically “curated” corpus of this sort might have had some charm, in particular to the selector, as a modest monument to a personal (but also, of course, highly class- and otherwise-inflected) history of a relation to