A Mathematical Study of Authorship Attribution

Yang Wang
Department of Mathematics
Michigan State University

Who Wrote “The Cuckoo’s Calling”?

8/11/13 Amazon.com: Karen's review of The Cuckoo's Calling

Customer Review

Great Read!, July 7, 2013
By Karen (San Jose, CA, United States)
1,775 of 1,918 people found the following review helpful

This book is so well written that I suspect that some years down the road we will hear the author's name is a pseudonym of some famous writer. Lots of description made one feel like another occupant in the scene. You could feel the weather, the tension, the pain, the atmosphere in the gatherings. The Audible version had great accents. It is a wonderful mystery with a surprise ending, and I look forward to more by the same author.

Comments (showing 1-10 of 92 posts in this discussion)

Initial post: Jul 13, 2013 4:00:08 PM PDT Lena Ricken says: You're right. It was just revealed that J.K. Rowling is behind this book.

"I had hoped to keep this secret a little longer," JKR said, "because being Robert Galbraith has been such a liberating experience. It has been wonderful to publish without hype or expectation and pure pleasure to get feedback under a different name."


Posted on Jul 13, 2013 4:00:44 PM PDT heartJESS says: http://www.hypable.com/2013/07/13/jk-rowling-ghost-writer-the-cuckoos-calling/

Looks like you were right!


Posted on Jul 13, 2013 5:11:07 PM PDT R. Bruno Martin says: I have nothing useful to add but I'm really amused you mentioned this was probably a pseudonym just a few days before it was revealed this was indeed written by one of the best selling authors of the last decade :P


Posted on Jul 13, 2013 5:37:30 PM PDT Kenneth Yates says: YOU WON

www.amazon.com/review/R8GYN2HLDXVFB/ref=cm_cr_dp_title?ie=UTF8&ASIN=0316206849&nodeID=283155&store=books

Who Wrote “The Cuckoo’s Calling”?

The Sunday Times sent two experts sample books by P.D. James, Val McDermid, Ruth Rendell and J. K. Rowling for comparison.

Literary Stylometry

• The term “stylometry” was first coined by the philosopher Wincenty Lutosławski in 1897. It is a catch-all phrase for all quantitative analysis of literary art.

• Traditional literary analysis relies more on connoisseurship than on quantitative analysis.

• Stylometry analysis goes back to at least the English mathematician Augustus De Morgan.
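De Morgan's proposal, in the letter quoted next, amounts to comparing the mean number of letters per word across texts. A minimal Python sketch of that statistic follows. The function name `mean_word_length`, the tokenization rule (a word is a maximal run of alphabetic characters), and the sample sentences are illustrative assumptions, not part of the letter.

```python
import re


def mean_word_length(text: str) -> float:
    """Average number of letters per word, the statistic De Morgan describes.

    Assumption: a "word" is a maximal run of alphabetic characters;
    digits, underscores, and punctuation are ignored.
    """
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        raise ValueError("text contains no words")
    total_letters = sum(len(w) for w in words)
    return total_letters / len(words)


# Hypothetical samples, for illustration only.
sample_a = "It was the best of times, it was the worst of times."
sample_b = "Stylometry quantifies characteristic linguistic regularities."

print(round(mean_word_length(sample_a), 3))
print(round(mean_word_length(sample_b), 3))
```

On De Morgan's reading, two long samples by the same writer should give nearly equal values, while samples by different writers should differ by a stable margin. A single average is a very coarse feature, so it is best seen as the historical starting point rather than a usable test.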

I wish you would do this: run your eye over any part of those of St. Paul's Epistles which begin with Παῦλος, the Greek, I mean, and without paying any attention to the meaning. Then do the same with the Epistle to the Hebrews, and try to balance in your own mind the question whether the latter does not deal in longer words than the former. It has always run in my head that a little expenditure of money would settle questions of authorship in this way. The best mode of explaining what I would try will be to put down the results I should expect as if I had tried them.

Count a large number of words in Herodotus, say all the first book, and count all the letters; divide the second numbers by the first, giving the average number of letters to a word in that book.

Do the same with the second book. I should expect a very close approximation. If Book I. gave 5.624 letters per word, it would not surprise me if Book II. gave 5.619. I judge by other things.

But I should not wonder if the same result applied to two books of Thucydides gave, say, 5.713 and 5.728. That is to say, I should expect the slight differences between one writer and another to be well maintained against each other, and very well agreeing with themselves. If this fact were established there, if St. Paul's Epistles which begin with Παῦλος gave 5.428 and the Hebrews gave 5.516, for instance, I should feel quite sure that the Greek of the Hebrews (passing no verdict on whether Paul wrote in Hebrew and another translated) was not from the pen of Paul.

If scholars knew the law of averages as well as mathematicians, it would be easy to raise a few hundred pounds to try this experiment on a grand scale. I would have Greek, Latin, and English tried, and I should expect to find that one man writing on two different subjects agrees more nearly with himself than two different men writing on the same subject. Some of these days spurious writings will be detected by this test. Mind, I told you so. With kind regards to all your family, I remain, dear Heald,

• Literary stylometry has seen impressive development in recent years.

• It has also spun off many related areas of research that are becoming more and more mainstream.

  • Computational Rhetoric
  • Sentiment Analysis
  • Epidemiology via Social Networks
  • Stylometry for Visual Arts
  • Text Mining
  • Government Surveillance

8/11/13 Google Pressure Cookers and Backpacks, Get a Visit from the Feds - Yahoo! News

The Atlantic Wire
Philip Bump, August 1, 2013

Michele Catalano was looking for information online about pressure cookers. Her husband, in the same time frame, was Googling backpacks. Wednesday morning, six men from a joint terrorism task force showed up at their house to see if they were terrorists. Which begs the question: How'd the government know what they were Googling?

Catalano (who is a professional writer) describes the tension of that visit.

[T]hey were peppering my husband with questions. Where is he from? Where are his parents from? They asked about me, where was I, where do I work, where do my parents live. Do you have any bombs, they asked. Do you own a pressure cooker? My husband said no, but we have a rice cooker. Can you make a bomb with that? My husband said no, my wife uses it to make quinoa. What the hell is quinoa, they asked. ...

Have you ever looked up how to make a pressure cooker bomb? My husband, ever the oppositional kind, asked them if they themselves weren’t curious as to how a pressure cooker bomb works, if they ever looked it up. Two of them admitted they did.

The men identified themselves as members of the "joint terrorism task force." The composition of such task forces depend on the region of the country, but, as we outlined after the Boston bombings, include a variety of federal agencies. Among them: the FBI and Homeland Security.

Ever since details of the NSA's surveillance infrastructure were leaked by Edward Snowden, the agency has been insistent on the boundaries of the information it collects. It is not, by law, allowed to spy on Americans, although there are exceptions of which it takes advantage. Its PRISM program, under which it collects internet content, does not include information from Americans unless those Americans are connected to terror suspects by no more than two other people. It collects metadata on phone calls made by Americans, but reportedly stopped collecting metadata on Americans' internet use in 2011. So how, then, would the government know what Catalano and her husband were searching for?

It's possible that one of the two of them is tangentially linked to a foreign terror suspect, allowing the government to review their internet activity. After all, that "no more than two [...]

news.yahoo.com/google-pressure-cookers-backpacks-visit-feds-140900667.html

Stylometry vs Connoisseurship: Some Case Studies

• “The Cuckoo’s Calling” investigation

• Don Foster’s investigations of “Primary Colors” and the anthrax case

• Jackson Pollock paintings

• Egyptian Arab Spring

These are just a few cases I’m familiar with. There are numerous other interesting cases.

8/11/13 Primary Colors (novel) - Wikipedia, the free encyclopedia

Unmasking of "Anonymous"

An early reviewer opined that the author wished to remain unknown because "Anonymity makes truthfulness much easier".[4] Later commentators called the publishing of the book under an anonymous identity an effective marketing strategy that produced more publicity for the book, and thus more sales, without calling into question the author's actual inside knowledge.[2]

Several people, including former Clinton speechwriter David Kusnet and, later, Vassar professor Donald Foster, correctly identified Klein as the novel's author, based on a literary analysis of the book and Klein's previous writing. Klein denied writing the book and publicly condemned Foster.[5][6] Klein denied authorship again in Newsweek, speculating that another writer wrote it. Washington Post Style editor David von Drehle, in an interview, asked Klein if he was willing to stake his journalistic credibility on his denial, to which Klein agreed.[7]

On July 17, 1996, after The Washington Post published the results of a handwriting analysis of notes made on an early manuscript of the book, Klein finally admitted that he was "Anonymous".[8]

Plot summary

The book begins as an idealistic former congressional worker, Henry Burton, joins the presidential campaign of Southern governor Jack Stanton, a thinly disguised stand-in for Bill Clinton.[4] The plot then follows the primary election calendar beginning in New Hampshire where Stanton's affair with Cashmere, his wife's hairdresser, and his participation in a Vietnam War era protest come to light and threaten to derail his presidential prospects.[4] In Florida, Stanton revives his campaign by disingenuously portraying his Democratic opponent as insufficiently pro-Israel and as a weak supporter of Social Security.[4] Burton becomes increasingly disillusioned with Stanton, who is a policy wonk who talks too long, eats too much and is overly flirtatious toward women.[4] Stanton is also revealed to be insincere in his beliefs, saying whatever will help him to win.[4] Matters finally come to a head, and Burton is forced to choose between idealism and realism.

en.wikipedia.org/wiki/Primary_Colors_(novel)

8/11/13 The Wrong Man - David Freed - The Atlantic

PROFILE MAY 2010

The Wrong Man

In the fall of 2001, a nation reeling from the horror of 9/11 was rocked by a series of deadly anthrax attacks. As the pressure to find a culprit mounted, the FBI, abetted by the media, found one. The wrong one. This is the story of how federal authorities blew the biggest anti-terror investigation of the past decade, and nearly destroyed an innocent man. Here, for the first time, the falsely accused, Dr. Steven J. Hatfill, speaks out about his ordeal.

DAVID FREED APR 13 2010, 9:00 AM ET

THE FIRST ANTHRAX attacks came days after the jetliner assaults of September 11, 2001. Postmarked Trenton, New Jersey, and believed to have been sent from a mailbox near Princeton University, the initial mailings went to NBC News, the New York Post, and the Florida-based publisher of several supermarket tabloids, including The Sun and The National Enquirer. Three weeks later, two more envelopes containing anthrax arrived at the Senate offices of Democrats Tom Daschle and Patrick Leahy, each bearing the handwritten return address of a nonexistent “Greendale School” in Franklin Park, New Jersey. Government mail service quickly shut down.

The letters accompanying the anthrax read like the work of a jihadist, suggesting that their author was an Arab extremist (or someone masquerading as one), yet also advised recipients to take antibiotics, implying that whoever had mailed them never really intended to harm anyone. But at least 17 people would fall ill and five would die: a photo editor at The Sun; two postal employees at a Washington, D.C., mail-processing center; a hospital stockroom clerk in Manhattan whose exposure to anthrax could never be fully explained; and a 94-year-old Connecticut widow whose mail apparently crossed paths with an anthrax letter somewhere in the labyrinth of the postal system. The attacks spawned a spate of hoax letters nationwide. Police were swamped with calls from citizens suddenly suspicious of their own mail.

www.theatlantic.com/magazine/archive/2010/05/the-wrong-man/308019/

Stylometry Analysis for Chinese Texts

• It is far less developed than for English texts.

• Connoisseurship still dominates the investigation of authorship attribution, and it is at the center of some high-profile authorship controversies.

• Stylometry analysis is more challenging than for English texts.

English words form natural “atoms” for stylometry analysis. But Chinese characters are far less natural “atoms”. Each character by itself has too many (often completely different) meanings.

Dream of the Red Chamber: A Mathematical Stylometric Study

Dream of the Red Chamber

• Written by Cao Xueqin around the 1750s.

• One of China’s Four Great Classical Novels.

• Widely acknowledged as the greatest literary piece ever written in the history of Chinese literature.

• The first hand-copied manuscript, with 80 chapters, began to circulate in 1759.

• A printed version began to circulate in 1791. It was put together by Cheng Weiyuan and Gao E (the Cheng-Gao version). But it had 120 chapters.

• Cheng and Gao maintained that they obtained previously unknown manuscripts of Cao from various sources.

Authorship Controversy and Redology

• Many scholars were skeptical of the last 40 chapters, and speculated that they were written by Gao E. Some, such as the renowned scholar Hu Shi, called these chapters a “fraud” perpetrated by Gao.

• Some scholars believe the last 40 chapters are inferior to the first 80 chapters, both in plot and in writing.

• Some experts thought the fates of several characters in the end were inconsistent with what was foreshadowed.

• The authorship question was the main focus of “Redology” for a long time.

Stylometry Analysis in Redology

• Redology relied almost exclusively on connoisseurship. Stylometry analysis has been very rare, in fact pathetically sparse compared to the size of the Redology literature.

• The few existing ones are a mixed bag in terms of quality, from reasonable to very flawed.

• None uses modern techniques such as machine learning theory.

Stylometry Analysis in Redology: In Support of the Two-Author Hypothesis

• Cao (1985): analyzed the use of function characters in the book, and compared the first 40 chapters, middle 40 chapters and last 40 chapters.

• Zhang and Liu (1986): examined the use of characters outside the GB2312 system. There are many more in the first 80 chapters.

• Yu (1998): focused on the statistics of 5 characters and of sentences ending in a particular way.

• Li (1987): carried out a statistical analysis of 47 function characters, and suggested that the last 40 chapters might have been edited by Gao based on unfinished manuscripts of Cao.

Stylometry Analysis in Redology: In Support of the One-Author Hypothesis

• Chan (1981): perhaps the best known and most extensive study, and Li & Li (2006).

• Both studies broke the book down into three equal parts. A Frequency Vector for a selected group of characters was built for each part. The correlations of these Frequency Vectors were computed.

• In Li & Li (2006), 47 characters were selected, and the pairwise correlations among the three Frequency Vectors were computed. The authors concluded that they are all sufficiently correlated.

• In Chan (1986), a fourth Frequency Vector from selected chapters of a different book was added. By showing that part III was more closely correlated to the first two parts than to the new book, the author drew the one-author conclusion.
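The frequency-vector comparison described in these studies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`frequency_vector`, `pearson`, `pairwise_correlations`), the use of relative frequencies, the choice of Pearson correlation, and the toy strings are all assumptions, since the slides do not say how the vectors were normalized or correlated.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def frequency_vector(text, characters):
    """Relative frequency of each selected character in the text."""
    if not text:
        raise ValueError("empty text")
    counts = Counter(text)
    return [counts[c] / len(text) for c in characters]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for a constant vector")
    return cov / (sx * sy)


def pairwise_correlations(parts, characters):
    """Correlation between the frequency vectors of every pair of parts."""
    vectors = {
        name: frequency_vector(text, characters) for name, text in parts.items()
    }
    return {
        (a, b): pearson(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }


# Toy stand-ins for the three parts of a book (hypothetical data).
parts = {"I": "的了的不是的了", "II": "了的的不的是了", "III": "不不是是了不是"}
selected = ["的", "了", "不", "是"]
for pair, r in pairwise_correlations(parts, selected).items():
    print(pair, round(r, 3))
```

High pairwise correlation alone says little unless there is a baseline for how correlated two genuinely different authors would be, which is why the choice of comparison text matters so much in this kind of study.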

• The two studies supporting the one-author hypothesis are both flawed. The study of Chan (1986) is severely flawed, since the book used for comparison was of a different genre.

• The other studies are more reasonable, although almost all of them lack mathematical/statistical rigor.

• We would like to develop a more rigorous mathematical framework for testing the two-author hypothesis in general and for analyzing Dream of the Red Chamber in particular.

Detecting Chronological Divide

• Some books are written (or suspected to be written) by two authors, with the first X chapters written by author A and the last Y chapters written by author B. There is a shift in writing style somewhere in the middle.

• The idea is to detect these chronological dividing points (“chrono-divides”).

• We develop a simple mathematical framework for detecting chrono-divides. The method would not work if two authors write in an interwoven fashion.

• We use Dream of the Red Chamber as a case study.
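One simple way to make the idea of a chrono-divide concrete is sketched below in Python. It assumes each chapter already has a signed classifier value (negative for "style A", positive for "style B") and looks for the split point that leaves the fewest chapters on the wrong side. The function names and the `max_error_rate` threshold are hypothetical, and this is an illustration of the concept, not the authors' procedure.

```python
def best_divide(scores):
    """Find the split that best separates negative from positive scores.

    scores[i] is a classifier value for the i-th chapter in reading order
    (assumption: negative means "style A", positive means "style B").
    Returns (k, errors): chapters before index k are assigned to style A,
    chapters from k onward to style B, and `errors` counts the chapters
    whose sign disagrees with that assignment.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("scores must be non-empty")
    # For k = 0 every chapter is assigned to B, so negatives are errors.
    errors = sum(1 for s in scores if s < 0)
    best_k, best_errors = 0, errors
    for k in range(1, n + 1):
        s = scores[k - 1]  # this chapter moves from side B to side A
        if s < 0:
            errors -= 1
        elif s > 0:
            errors += 1
        if errors < best_errors:
            best_k, best_errors = k, errors
    return best_k, best_errors


def has_chrono_divide(scores, max_error_rate=0.1):
    """True if an interior split separates the signs with few errors.

    max_error_rate is an arbitrary illustrative threshold.
    """
    k, errors = best_divide(scores)
    return 0 < k < len(scores) and errors / len(scores) <= max_error_rate


# Hypothetical score sequences.
clean = [-2.1, -1.4, -0.8, -1.9, 1.2, 2.0, 0.7, 1.5]
noisy = [0.5, -0.7, 1.1, -0.3, 0.9, -1.2, 0.4, -0.6]
print(best_divide(clean), has_chrono_divide(clean))
print(best_divide(noisy), has_chrono_divide(noisy))
```

A sequence that flips sign once and stays flipped yields an interior split with few errors; a sequence that keeps fluctuating does not. As the slide notes, interwoven authorship would defeat this kind of test, because no single split point exists.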

14 XIANFENG HU, YANG WANG, AND QIANG WU

Interestingly, the fact that Chapter 67 appeared as an “outlier” in our classification serves as further evidence for the validity of our analysis. It was only after the tests that we realized that the authorship of Chapter 67 itself is one of the controversies in Redology. Unlike the main controversy about the authorship of the first 80 chapters and the last 40 chapters, experts are less unified in their positions here. Again, our results strongly suggest that Chapter 67 is stylistically different from the rest of the first 80 chapters, and it may not have been written by Cao. Our finding is consistent with the conclusion of [5].

3.2. Non-separability of the first 80 chapters. To further validate our method we apply the same tests to the first 80 chapters of Dream of the Red Chamber to see whether we can get a chrono-divide (Experiment 2). We use the first 30 and last 30 chapters as the training data and leave chapters 31-50 as the test data. Figure 2 shows the mean cross validation error and the values of the SVM classifier on the test data, chapters 31-50. The experiment shows that many more features have been selected in the 100 repeats, implying the difficulty of finding a consistent subset of discriminative features. The large errors on the training data also indicate the difficulty of separation. When the classifier is applied to the test data, there is clearly no chrono-divide. This suggests that our method yields a conclusion that is completely consistent with what is known.
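The setup of Experiment 2 (train on the two ends, score the held-out middle) can be illustrated with a much simpler linear classifier than the SVM used in the paper. The Python sketch below swaps in a nearest-centroid rule purely for illustration. It does not reproduce the paper's SVM, feature selection, or cross validation, and the function names and toy feature vectors are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def train_linear_scorer(class_a, class_b):
    """Return a scoring function built from two training classes.

    Nearest-centroid stand-in for an SVM: the score is the projection of
    (x - midpoint) onto the direction from centroid A to centroid B, so
    it is negative for vectors closer to A and positive for vectors
    closer to B.
    """
    mu_a, mu_b = centroid(class_a), centroid(class_b)
    direction = [b - a for a, b in zip(mu_a, mu_b)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_a, mu_b)]

    def score(x):
        return sum(d * (xi - m) for d, xi, m in zip(direction, x, midpoint))

    return score


# Toy feature vectors (hypothetical): one vector per chapter.
first_chapters = [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0]]      # training class A
last_chapters = [[0.1, 0.9], [0.2, 1.0], [0.0, 0.8]]       # training class B
middle_chapters = [[0.85, 0.15], [0.5, 0.5], [0.1, 0.95]]  # held-out test data

score = train_linear_scorer(first_chapters, last_chapters)
for chapter_vector in middle_chapters:
    print(round(score(chapter_vector), 3))
```

In the real experiment each chapter would be represented by frequencies of selected features, and the interesting question is whether the scores on the held-out chapters switch sign once or fluctuate.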

[Figure 2 plots omitted. Panel (a): cross validation error rate vs. number of features. Panel (b): SVM classification value vs. chapter number.]

Figure 2. Experiment 2: (a) Mean cross validation error rate; (b) Values of SVM classifier on chapters 31-50. Note there is no chrono-divide.

[Figure 3 plots omitted. Panel (a): cross validation error rate vs. number of features. Panel (b): SVM classification value vs. chapter number.]

Figure 3. Experiment 3: (a) Mean cross validation error rate; (b) Values of SVM classifier on chapters 96-105, which correspond to the samples 31-50 in all 80 samples. Note two samples come from one chapter in this experiment.

4. Analysis of the other three Great Classical Novels

To further bolster the credibility of our approach we test our method on the other three Great Classical Novels in Chinese literature, Romance of the Three Kingdoms (三国演义), Water Margin (水浒传), and Journey to the West (西游记). Unlike Dream of the Red Chamber, there is no authorship controversy for these other three novels. Thus if our method is indeed robust we should expect negative answers for the two-author hypotheses for all of them by finding no chrono-divides.

As with Dream of the Red Chamber, we split each of the three novels into training samples and test samples. Both Romance of the Three Kingdoms and Water Margin have 120 chapters. In both cases we designate the first 30 chapters and the last 30 chapters as the two classes of training data, and the middle 60 chapters as test data. For Journey to the West the two classes of training data are the first and last 25 chapters respectively, with the middle 50 chapters as test data.
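The train/test designation described above is simple enough to state as code. In this Python sketch the helper `split_chapters` and its argument names are hypothetical. The chapter counts follow the text: 120 chapters with 30 at each end, and for Journey to the West 25 at each end with 50 in the middle, i.e. 100 chapters in total.

```python
def split_chapters(n_chapters, n_end):
    """Split chapter numbers 1..n_chapters into training and test sets.

    The first `n_end` chapters form one training class, the last `n_end`
    chapters form the other, and everything in between is held out as
    test data.
    """
    if n_end <= 0 or 2 * n_end >= n_chapters:
        raise ValueError("need 0 < 2 * n_end < n_chapters")
    chapters = list(range(1, n_chapters + 1))
    first = chapters[:n_end]
    middle = chapters[n_end:-n_end]
    last = chapters[-n_end:]
    return first, middle, last


# Romance of the Three Kingdoms and Water Margin: 120 chapters, 30 at each end.
first, middle, last = split_chapters(120, 30)
print(first[0], first[-1], middle[0], middle[-1], last[0], last[-1])

# Journey to the West: 25 at each end, 50 in the middle.
first, middle, last = split_chapters(100, 25)
print(len(first), len(middle), len(last))
```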

We use the same procedure to test for chrono-divides on the three novels. Compared to Dream of the Red Chamber, the selected features show much lower relative frequencies, indicating difficulty in differentiating between the writing styles. Table 2 shows the relative frequencies (with c = 1/30) of the top 8 features for each of the four Great Classical Novels. Also of note is that in the case of Water Margin, 51 features are used to build a classifier from the 60 training samples, which is clearly another strong indication of the difficulty.

STYLOMETRY ANALYSIS OF DREAM OF THE RED CHAMBER

Figure 4. Classification results from the test samples of the other three classical novels: (a) Romance of the Three Kingdoms; (b) Water Margin; (c) Journey to the West.

Novel                          Relative frequencies of top 8 features
Dream of the Red Chamber       0.57  0.46  0.43  0.36  0.31  0.30  0.29  0.19
Romance of the Three Kingdoms  0.31  0.27  0.26  0.25  0.23  0.22  0.17  0.15
Water Margin                   0.18  0.17  0.16  0.16  0.14  0.11  0.11  0.10
Journey to the West            0.03  0.03  0.02  0.02  0.02  0.02  0.02  0.02

Table 2. Relative frequencies of the top ranked 8 features in each of the four Great Classical Novels.

Figure 4 plots the values from the classifiers for all three novels. In all cases the values fluctuate in such a way that it is quite clear that no chrono-divides exist, as expected.

This analysis shows that our approach can reliably reject the two-author hypothesis when it is false, lending further support to the effectiveness and robustness of our method.

5. Conclusions

Inspired by the authorship controversy of Dream of the Red Chamber and the application of SVM in the study of literary stylometry, we have developed a mathematically rigorous new method for the analysis of authorship by testing for a chrono-divide in writing styles. We have shown that the method is highly effective and robust. Applying our method to the Cheng-Gao version of Dream of the Red Chamber has led to convincing if not irrefutable evidence that the first 80 chapters and the last 40 chapters of the book were written by two different authors. Furthermore, our analysis [...]

Figure 5. Distances between the first 80 chapters of the Cheng-Gao version, the last 40 chapters of the Cheng-Gao version, and 30 chapters of Continued Dream of the Red Chamber.

[12] A. Pawlowski. Wincenty Lutosławski: a forgotten father of stylometry. Glottometrics, 8:83–89, 2004.
[13] J. Rudman. The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31(4):351–365, 1997.
[14] E. Stamatatos. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556, 2009.

Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA. E-mail address: [email protected]

Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA. E-mail address: [email protected]

Department of Mathematical Sciences, Middle Tennessee State University, Murfreesboro, TN 37132, USA. E-mail address: [email protected]