Success with Style: Using Writing Style to Predict the Success of Novels

Vikas Ganjigunte Ashok, Song Feng, Yejin Choi
Department of Computer Science
Stony Brook University
Stony Brook, NY 11794-4400
vganjiguntea, songfeng, [email protected]

Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1753–1764, Seattle, Washington, USA, 18-21 October 2013. © 2013 Association for Computational Linguistics

Abstract

Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from its less successful counterparts, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.

1 Introduction

Predicting the success of novels is a curious question among publishers, professional book reviewers, and aspiring and even expert writers alike. There are potentially many influencing factors, some of which concern the intrinsic content and quality of the book, such as interestingness, novelty, style of writing, and engaging storyline, but external factors such as social context and even luck can play a role. As a result, recognizing successful literary work is a hard task even for experts working in the publication industries. Indeed, even some of the best sellers and award winners can go through several rejections before they are picked up by a publisher.[1]

[1] E.g., Paul Harding's "Tinkers" that won 2010 Pulitzer Prize for Fiction and J. K. Rowling's "Harry Potter and the Philosopher's Stone" that sold over 450 million copies.

Perhaps due to the obvious complexity of the problem, there has been little previous work that attempts to build statistical models that predict the success of literary works based on their intrinsic content and quality. Some previous studies do touch on the notion of stylistic aspects in successful literature: extensive studies in Literature discuss literary styles of significant authors (e.g., Ellegård (1962), McGann (1998)), while others consider content characteristics such as plots, characteristics of characters, action, emotion, genre, and cast of the best-selling novels and blockbuster movies (e.g., Harvey (1953), Hall (2012), Yun (2011)).

All these studies, however, are qualitative in nature, as they rely on the knowledge and insights of human experts on literature. To our knowledge, no prior work has undertaken a systematic quantitative investigation on the overarching characterization of the writing style in successful literature. In consideration of widely different styles of authorship (e.g., Escalante et al. (2011), Peng et al. (2003), Argamon et al. (2003)), it is not even readily clear whether there might be common stylistic elements that help discriminate highly successful works from their less successful counterparts.

In this work, we present the first study that investigates this unstudied and unexpected connection between stylistic elements and literary success. The key findings of our research reveal that there exist distinct linguistic patterns shared among successful literature, at least within the same genre, making it possible to build a model with surprisingly high accuracy (up to 84%) in predicting the success of a novel. This result is surprising for two reasons. First, we tackle the hard task of predicting the success of novels written by previously unseen authors, avoiding incidental learning of authorship signature, since previous research demonstrated that one can achieve very high accuracy in authorship attribution (as high as 96% in some experimental setups) (e.g., Raghavan et al. (2010), Feng et al. (2012)). Second, we aim to discriminate highly successful novels from less successful, but nonetheless published, books written by professional writers, which are undoubtedly of higher quality than average writings. It is important to note that the task we tackle here is much harder than discriminating highly successful works from those that have not even passed the scrutinizing eyes of publishers.

In order to quantify the success of literary works, and to obtain corresponding gold standard labels, one needs to first define "success". For practical convenience, we largely rely on the download counts available at Project Gutenberg as a surrogate to quantify the success of novels. For a small number of novels, however, we also consider award recipients (e.g., Pulitzer, Nobel) and Amazon's sales records to define a novel's success. We also extend our empirical study to movie scripts, where we quantify the success of movies based on the average review scores at imdb.com. We leave analysis based on other measures of literary success as future research.

In this study, we do not attempt to separate out success based on literary quality (award winners) from success based on popularity (commercial hit, often in spite of bad literary quality), mainly because it is not practically easy to determine whether the high download counts are due to only one reason or the other. We expect, however, that in many cases the two different aspects of success are likely to coincide. In the case of the corpus obtained from Project Gutenberg, where most of our experiments are conducted, we expect that the download counts are more indicative of success based on literary quality (which then may have resulted in popularity) rather than popularity without quality.

We examine several genres in fiction and movie scripts, e.g., adventure stories, mystery, fiction, historical fiction, sci-fi, short stories, as well as poetry, and present systematic analyses based on lexical and syntactic features which have been known to be effective in a variety of NLP tasks ranging from authorship attribution (e.g., Raghavan et al. (2010)), genre detection (e.g., Rayson et al. (2001), Douglas and Broussard (2000)), and gender identification (e.g., Sarawgi et al. (2011)) to native language detection (e.g., Wong and Dras (2011)).

Our empirical results demonstrate that (1) statistical stylometry can be surprisingly effective in discriminating successful literature, achieving accuracy up to 84%, and (2) some elements of successful styles are genre-dependent while others are more universal. In addition, this research results in (3) findings that are somewhat contrary to the conventional wisdom with respect to the connection between successful writing styles and readability, (4) interesting correlations between sentiment / connotation and literary success, and finally, (5) comparative insights between fiction and nonfiction with respect to the successful writing style.

2 Dataset Construction

For our experiments, we procure novels from Project Gutenberg.[2] Project Gutenberg houses over 40,000 books available for free download in electronic format and provides a catalog containing brief descriptions (title, author, genre, language, download count, etc.) of these books. We experiment with the genres in Table 1, which have a sufficient number of books to allow us to construct reasonably sized datasets.

[2] http://www.gutenberg.org/

GENRE                 #BOOKS    τ−     τ+
Adventure                409    17    100
Detective / Mystery      374    25     90
Fiction                 1148     7    125
Historical Fiction       374    25    115
Love Stories             342    16     85
Poetry                   580     9     70
Science Fiction          902    30    100
Short Stories           1117     9    224

Table 1: # of books available per genre at Gutenberg with download thresholds used to define more successful (≥ τ+) and less successful (≤ τ−) classes.

We use the download counts in the Gutenberg catalog as a surrogate to measure the degree of success of novels. For each genre, we determine a lower bound (τ+) and an upper bound (τ−) of download counts, as shown in Table 1, to categorize the available books as more successful and less successful respectively. These thresholds are set to obtain at least 50 books for each class and for each genre. To balance the data, for each genre, we construct a dataset of 100 novels (50 per class).

We make sure that no single author has more than 2 books in the resulting dataset, and in the majority of the cases, only one book has been taken from each author.[3] Furthermore, we make sure that books from the same author do not show up in both training and test data. These constraints make sure that we learn general linguistic patterns of successful novels, rather than a particular writing style of a few successful authors.

Figure 1: Differences in POS tag distribution between more successful and less successful books across different genres. Negative (positive) value indicates higher percentage in less (more) successful class.

3 Methodology

In what follows, we describe five different aspects of linguistic styles we measure quantitatively. The first three have been used in previous work on related tasks, e.g., genre detection (e.g., Kessler et al. (1997)) and authorship attribution (e.g., Stamatatos (2009)), while the last two are newly explored in this work.

I. Lexical Choices: unigrams and bigrams.

II. Distribution of Word Categories: Many previous studies have shown that the distribution of part-of-speech (POS) tags alone can reveal surprising insights on genre and authorship (e.g., Koppel and Schler (2003)), hence we examine their distributions with respect to the success of literary works.

III. Distribution of Grammar Rules: Recent studies reported that features based on CFG rules are helpful in authorship attribution (e.g., Raghavan et al. (2010), Feng et al. (2012)). We experiment with four different encodings of production rules:

• Γ: lexicalized production rules (all production rules, including those with terminals)
• Γ^G: lexicalized production rules prepended
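
The Γ encoding described above (every production rule of a parse tree, including rules that rewrite to terminals) could be computed along the following lines. This is only a minimal sketch under stated assumptions: the parser is not named in this section, so the code assumes each sentence is already available as a Penn Treebank style bracketed string, and the function names `parse_tree`, `productions`, and `rule_distribution` are hypothetical. The Γ^G encoding is not covered because its definition is cut off here.

```python
from collections import Counter


def parse_tree(s):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Assumption: input looks like "(S (NP (DT the) (NN dog)) (VP (VBD barked)))".
    """
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # terminal (a word)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def productions(tree):
    """Yield every production rule, including rules whose right side is a terminal."""
    label, children = tree
    # Terminals are quoted so they cannot be confused with nonterminal labels.
    rhs = [c[0] if isinstance(c, tuple) else f'"{c}"' for c in children]
    yield f"{label} -> {' '.join(rhs)}"
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)


def rule_distribution(bracketed_trees):
    """Relative frequency of each production rule over a collection of trees."""
    counts = Counter(
        rule for t in bracketed_trees for rule in productions(parse_tree(t))
    )
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()}


if __name__ == "__main__":
    example = "(S (NP (DT the) (NN dog)) (VP (VBD barked)))"
    for rule, freq in sorted(rule_distribution([example]).items()):
        print(f"{freq:.3f}  {rule}")
```

Normalizing counts into relative frequencies is an illustrative choice; raw counts or other weightings would fit the same interface.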

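
Returning to the dataset construction of Section 2, the thresholding and author constraints can be sketched as follows. This is an illustration rather than the authors' procedure: the record schema (`author`, `downloads`), the function names, the random selection of the 50 books per class, and the 20% test fraction are all assumptions, since the section states only the thresholds, the class size, the cap of 2 books per author, and the requirement that an author never appears in both training and test data.

```python
import random
from collections import defaultdict


def build_dataset(books, tau_minus, tau_plus, per_class=50,
                  max_per_author=2, seed=0):
    """Select a balanced dataset for one genre.

    books: list of dicts with keys 'author' and 'downloads' (hypothetical schema).
    Books with downloads >= tau_plus are 'more' successful, those with
    downloads <= tau_minus are 'less' successful; the rest are discarded.
    """
    shuffled = books[:]
    random.Random(seed).shuffle(shuffled)  # assumption: random selection
    per_author = defaultdict(int)
    chosen = {"more": [], "less": []}
    for book in shuffled:
        if book["downloads"] >= tau_plus:
            label = "more"
        elif book["downloads"] <= tau_minus:
            label = "less"
        else:
            continue  # between the two thresholds: belongs to neither class
        if len(chosen[label]) >= per_class:
            continue
        if per_author[book["author"]] >= max_per_author:
            continue  # enforce at most 2 books per author
        per_author[book["author"]] += 1
        chosen[label].append(book)
    return chosen


def author_disjoint_split(dataset, test_fraction=0.2, seed=0):
    """Split (book, label) pairs so no author appears in both train and test."""
    items = [(b, label) for label, bs in dataset.items() for b in bs]
    authors = sorted({b["author"] for b, _ in items})
    random.Random(seed).shuffle(authors)
    n_test = max(1, round(len(authors) * test_fraction))
    test_authors = set(authors[:n_test])
    train = [x for x in items if x[0]["author"] not in test_authors]
    test = [x for x in items if x[0]["author"] in test_authors]
    return train, test
```

For the Adventure genre in Table 1, a call would look like `build_dataset(books, tau_minus=17, tau_plus=100)`. Splitting by author rather than by book is what keeps a classifier from scoring well merely by recognizing an author's signature.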