History Unclassified

Peering down the Memory Hole: Censorship, Digitization, and the Fragility of Our Knowledge Base

GLENN D. TIFFERT Downloaded from https://academic.oup.com/ahr/article-abstract/124/2/550/5426383 by 183000 user on 16 April 2019

The digital deepfakes, bot farms, and troll factories assaulting our public sphere have thrown an overdue light on the radical changes sweeping through our ecosystem of knowledge.1 New technologies and platforms are making the manipulation of our information space feasible at a scale and with an ease that would scarcely have been imaginable a generation ago. In particular, the crude artisanal and industrial forms of publication and censorship familiar to us from centuries past are yielding to an individuated, dynamic model of information control powered by adaptive algorithms that operate in ways even their creators struggle to understand.2 These algorithms curate every facet of our online lives, recursively intermediating our realities according to evolving internal logics that we cannot see. Lately they have even expanded into authorship by independently synthesizing for public consumption new content from archival collections.3 As their performance improves, “the idea that there’s one article for everyone is going to quickly change to the one article for me,” and the practice of history, to say nothing of other empirical disciplines, may never be the same.4

The Lieberthal-Rogel Center for Chinese Studies at the University of Michigan and the Hoover Institution generously supported this study. Mary Gallagher, Steven Abney, Fred Gibbs, and Kerby Shedden offered valuable feedback. Fu Liangyu helped to locate and acquire materials. Luo Fusheng, Margaret Orton, Arden Shapiro, Yan Wei, and Charlotte Yin provided essential research assistance.

1 Samantha Bradshaw and Philip N. Howard, “Challenging Truth and Trust: A Global Inventory of Organized Social Media Manipulation,” Working Paper 2018.1, Project on Computational Propaganda, Oxford Internet Institute, 2018, https://comprop.oii.ox.ac.uk/research/cybertroops2018/; Zeynep Tufekci, “The Road from Tahrir to Trump,” MIT Technology Review 121, no. 5 (2018): 10–17.
2 Robert Darnton, Censors at Work: How States Shaped Literature (New York, 2014); David Weinberger, “Our Machines Now Have Knowledge We’ll Never Understand,” Wired, April 18, 2017, https://www.wired.com/story/our-machines-now-have-knowledge-well-never-understand/.
3 Chris Merriman, “BBC 4.1 Joins the AI Revolution with Two Nights of Neural Network Generated Clips,” The Inquirer, September 4, 2018, https://www.theinquirer.net/inquirer/news/3061268/bbc-41-joins-the-ai-revolution-with-two-nights-of-neural-network-generated-clips; “Xinhua Publishes 1st MGC Video on Two Sessions,” Xinhua News Agency, March 2, 2018, https://www.youtube.com/watch?v=IE8JzO7eyPQ; Ahmed Elgammal, Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone, “CAN: Creative Adversarial Networks, Generating ‘Art’ by Learning about Styles and Deviating from Style Norms,” arXiv, June 21, 2017, https://arxiv.org/abs/1706.07068 [cs.AI]; Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, “Language Models Are Unsupervised Multitask Learners,” OpenAI, February 14, 2019, https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
4 Sean Gourley, founder of the machine intelligence company Primer, quoted in Kelsey Ables, “What Happens When China’s State-Run Media Embraces AI?,” Columbia Journalism Review, June 21, 2018, https://www.cjr.org/analysis/china-xinhua-news-ai.php.

© The Author(s) 2019. Published by Oxford University Press on behalf of the American Historical Association. All rights reserved. For permissions, [email protected].


We can peer into that future with the aid of a case study from the People’s Republic of China (PRC), where online platforms comparable to JSTOR are rewriting the historical record by stealthily redacting their holdings. Using a combination of qualitative and computational methods, I analyze a sample of this censorship, reverse-engineer its logic, and consider where it may take us. My findings expose the leading edge of a coming storm, and because the practices I identify are easily emulated and refined, no corner of the knowledge economy is beyond their reach. As digitization advances around the globe, the opportunities and temptation to exploit the vulnerabilities described here will multiply, with potentially devastating consequences not just for the reliability of our source base and the knowledge we derive from it, but also for the civic life these public goods sustain. Acquainting ourselves with the threat is essential to preempting that outcome.

Most analyses of the digital turn suffer from a common blind spot: they generally presume that the custodians of our digital collections are neutral third parties who have no reason to alter or allow others to alter the records in their care.5 That trust is unwise.
In a growing number of countries, the digital domain resembles less a marketplace of ideas than an arena for combat, and the PRC, to take one example, welcomes this fight and weaponizes our credulity.6 True to its Leninist roots, the Chinese Communist Party (CCP) describes the online realm as a “battlefield” on which a tightly disciplined political struggle must be waged and won.7 Intent on seizing the initiative, it exploits the openness of democratic societies to project its influence abroad, while vigilantly policing its own walled garden with

5 “Authenticity Task Force Report,” in The Long-Term Preservation of Authentic Electronic Records: Findings of the InterPARES Project (2002), http://www.interpares.org/book/interpares_book_d_part1.pdf, 21; Roy Rosenzweig, “Scarcity or Abundance? Preserving the Past in a Digital Era,” American Historical Review 108, no. 3 (June 2003): 735–762; Tim Hitchcock, “Confronting the Digital: Or How Academic History Writing Lost the Plot,” Cultural and Social History 10, no. 1 (2013): 9–23; Ludmilla Jordanova, “Historical Vision in a Digital Age,” Cultural and Social History 11, no. 3 (2014): 343–348; Lara Putnam, “The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast,” American Historical Review 121, no. 2 (April 2016): 377–402; Abby Smith Rumsey, When We Are No More: How Digital Memory Is Shaping Our Future (New York, 2016); Trevor Owens, The Theory and Craft of Digital Preservation (Baltimore, 2018).
6 Bradshaw and Howard, “Challenging Truth and Trust”; United States Department of Justice, Report of the Attorney General’s Cyber Digital Task Force, July 2, 2018, https://www.justice.gov/ag/page/file/1076696/download; Justin Clark, Robert Faris, Ryan Morrison-Westphal, Helmi Noman, Casey Tilton, and Jonathan Zittrain, “The Shifting Landscape of Global Internet Censorship,” Berkman Klein Center for Internet & Society Research Publication, June 2017, http://nrs.harvard.edu/urn-3:HUL.InstRepos:33084425; Adrian Shahbaz, “Freedom on the Net: The Rise of Digital Authoritarianism,” Freedom House, October 2018, https://freedomhouse.org/report/freedom-net/freedom-net-2018, 29; Paul M. Barrett, Tara Wadhwa, and Dorothée Baumann-Pauly, Combating Russian Disinformation: The Case for Stepping Up the Fight Online, NYU Stern Center for Business and Human Rights, July 2018, https://issuu.com/nyusterncenterforbusinessandhumanri/docs/nyu_stern_cbhr_combating_russian_di.
7 “Zhangwo yulun zhanchang zhudongquan: Xi Jinping yaoqiu sanweidu dazao meiti xin qijian” 掌握舆论战场主动权: 习近平要求三维度打造媒体新旗舰 [Grasp the Initiative on the Public Opinion Battlefield: Xi Jinping Demands the Three-Dimensional Forging of a New Flagship for Media], China Central Television (CCTV), February 18, 2017, http://news.cctv.com/2017/02/18/ARTIrsoCDdYTIbTLWNRWp2ii170218.shtml; “Zai quanguo xuanchuan sixiang gongzuo huiyi shang de jianghua” 在全国宣传思想工作会议上的讲话 [Speech at the National Propaganda Thought Work Conference], in Zhonggong zhongyang wenxian yanjiushi 中共中央文献研究室 [CCP Central Committee Documents Research Office], ed., Xi Jinping guanyu quanmian shenhua gaige lunshu zhaibian 习近平关于全面深化改革论述摘编 [Extracts of Xi Jinping on Comprehensively Deepening Reform] (Beijing, 2014), 83; Chai Yifei 柴逸扉, “Xi Jinping de xinwen yulun guan” 习近平的新闻舆论观 [Xi Jinping’s Views on News and Public Opinion], Zhongguo gongchandang xinwen wang 中国共产党新闻网 [CCP News Network], February 25, 2016, http://theory.people.com.cn/n1/2016/0225/c40531-28148369.html.

AMERICAN HISTORICAL REVIEW APRIL 2019

a pervasive system of networked authoritarianism that showcases how illiberal regimes the world over can turn the technologies of the information age to their advantage.8

History is a revealing case in point. Orwell famously observed, “Who controls the past controls the future; who controls the present controls the past.”9 Mindful of that, the CCP vigorously suppresses domestic “attempts to distort or smear socialism with Chinese characteristics, Party history, the history of the PRC, the history of the people’s armed forces, Party leaders, and acclaimed heroes and role models.”10 And lately it has begun to export this censorship regime beyond its borders, leveraging its economic and technological resources to co-opt foreigners, sometimes without their knowledge or consent, in an audacious campaign to sanitize the historical record and globalize its own competing narratives.11

The CCP’s timing is impeccable. Economic and technological disruptions to our ecosystem of knowledge are eroding our capacity to detect, much less combat, this information war, and nowhere is that more apparent than in our libraries. Motivated by thrift and efficiency, many academic libraries are deaccessioning volumes and outsourcing growing parts of their collections to online platforms, trusting these platforms to supply full replacement value and to guarantee the integrity of their products.
In 2014 alone, the University of California, Santa Cruz eliminated nearly 60 percent of the books from its Science and Engineering Library, and the University of California, Berkeley’s Haas School of Business transferred virtually its entire print collection to storage.12 Nearly all of the more than 100,000 scholarly journals that Berkeley’s libraries receive each year arrive digitally, and have for more than a decade.13

Much can go wrong with this bargain, especially since many of the publishers and platforms that now aggregate and deliver our knowledge are market-driven ventures subject to commercial pressures.14 They may adhere to different values, priorities, and

8 Shahbaz, “Freedom on the Net,” 6–10; Rebecca MacKinnon, “China’s ‘Networked Authoritarianism,’” Journal of Democracy 22, no. 2 (2011): 32–46; Wen-Hsuan Tsai, “How ‘Networked Authoritarianism’ Was Operationalized in China: Methods and Procedures of Public Opinion Control,” Journal of Contemporary China 25, no. 101 (2016): 731–744; Paul Mozur, “With Cameras and A.I., China Closes Its Grip,” New York Times, July 8, 2018, A1; Jack Goldsmith and Stuart Russell, “Strengths Become Vulnerabilities: How a Digital World Disadvantages the United States in Its International Relations,” Aegis Series Paper No. 1806, 2018, https://www.hoover.org/sites/default/files/research/docs/381100534-strengths-become-vulnerabilities.pdf; Chris C. Demchak and Yuval Shavitt, “China’s Maxim – Leave No Access Point Unexploited: The Hidden Story of China Telecom’s BGP Hijacking,” Military Cyber Affairs 3, no. 1 (2018): 1–9; Margaret E. Roberts, Censored: Distraction and Diversion inside China’s Great Firewall (Princeton, N.J., 2018); Louisa Lim and Julia Bergin, “Inside China’s Audacious Global Propaganda Campaign,” The Guardian, December 7, 2018, https://www.theguardian.com/news/2018/dec/07/china-plan-for-global-media-dominance-propaganda-xi-jinping.
9 George Orwell, 1984 (1949; repr., New York, 1961), 248.
10 “Guanyu xin xingshi xia dangnei zhengzhi shenghuo de ruogan zhunze” 关于新形势下党内政治生活的若干准则 [Certain Norms for Intra-Party Political Life under the New Circumstances], Xinhua she 新华社 [Xinhua News Agency], November 2, 2016, http://www.xinhuanet.com/politics/2016-11/02/c_1119838382_2.htm; “China Is Struggling to Keep Control over Its Version of the Past,” The Economist 421, no. 9013 (2016): 37–38; Yan Lianke, “On China’s State-Sponsored Amnesia,” New York Times, April 1, 2013, https://www.nytimes.com/2013/04/02/opinion/on-chinas-state-sponsored-amnesia.html.
11 Christopher Walker, “What Is ‘Sharp Power’?,” Journal of Democracy 29, no. 3 (2018): 9–23.
12 Teresa Watanabe, “Universities Redesign Libraries for the 21st Century: Fewer Books, More Space,” Los Angeles Times, April 19, 2017, http://www.latimes.com/local/lanow/la-me-college-libraries-20170419-story.html.
13 Matthew Quinlan, “Five Questions for UC Berkeley Librarian Jeffrey Mackie-Mason,” California Magazine, Summer 2017, http://alumni.berkeley.edu/california-magazine/summer-2017-adaptation/five-questions-uc-berkeley-librarian-jeffrey-mackie-mason.
14 Maria Bustillos, “Erasing History,” Columbia Journalism Review 57, no. 1 (2018): 112–118.

standards of stewardship than traditional libraries, and they may be accountable to different constituencies, such as shareholders.15 Powerful interest groups or the threat of litigation can influence their decisions. Bankruptcies, corporate restructurings, licensing disputes, and state action can snuff out their online collections without warning.16 And things can go spectacularly wrong when they confront the demands of a mercurial censorship regime and the authoritarian government behind it.17

Not long ago, it might have seemed preposterous to suggest that some of our most respected academic publishers and technology firms would be complicit in state censorship. But times have changed.18 In the summer of 2017, acting on a request from its Chinese importer, Cambridge University Press (CUP) quietly removed 315 articles and book reviews from the online edition of the respected British academic journal The China Quarterly, without consulting the journal’s editors or the affected authors. For subscribers in China, the items simply disappeared, though they remained accessible elsewhere. After exposure led to negative publicity, CUP ultimately reversed itself and rebuffed a concurrent request to censor approximately 100 articles from the online edition of the Journal of Asian Studies, the flagship journal of the U.S.-based Association for Asian Studies. By contrast, Springer Nature, which bills itself as the largest academic publisher in the world, capitulated to Chinese requests, effectively arguing that its censorship of over 1,000 of its own publications was a cost of doing business. Enticed by the Chinese market, Google, Apple, and Facebook have proffered similar concessions.19

Traditional post-publication censorship is notoriously toilsome and inefficient.
Tearing out passages or seizing and destroying entire volumes demands physical control of the relevant texts, and copies often slip through the net to bear witness. Digitization, however, mitigates these deficiencies. It encodes knowledge not in tangible objects dispersed redundantly among libraries and collectors, but in effortlessly mutable bitstreams delivered from distant servers along a centralized distribution chain. As the CUP and Springer episodes illustrate, the providers who control these servers can silently alter our knowledge base at its source without ever leaving their back offices, making one alteration after another, each with the potential to propagate instantaneously around the globe. They can apply these alterations as broadly or as narrowly as they wish, forking the sources under their control into myriad editions, each defined by the idiosyncratic circumstances of a given audience. They have proven by their example that it matters

15 Kate Klonick, “The New Governors: The People, Rules, and Processes Governing Online Speech,” Harvard Law Review 131 (2018): 1598–1670.
16 Bustillos, “Erasing History,” 117–118.
17 “At Beijing Book Fair, Publishers Admit to Self-Censorship to Keep Texts on Chinese Market,” South China Morning Post, August 24, 2017, https://www.scmp.com/news/china/policies-politics/article/2108095/beijing-book-fair-publishers-admit-self-censorship-keep; Jacqueline Williams, “A Book on Chinese Sway in Australia Hits a Nerve,” New York Times, November 20, 2017, A8.
18 Association for Asian Studies, “Update on Chinese Censorship of Academic Publications,” November 7, 2017, http://www.asian-studies.org/asia-now/entryid/103/update-on-chinese-censorship-of-academic-publications; Elizabeth Redden, “An Unacceptable Breach of Trust,” Inside Higher Ed, October 3, 2018, https://www.insidehighered.com/news/2018/10/03/book-publishers-part-ways-springer-nature-over-concerns-about-censorship-china; Ben Bland, “China Censorship Drive Splits Leading Academic Publishers,” Financial Times, November 6, 2017, 4; Nicholas Loubere and Ivan Franceschini, “How the Chinese Censors Highlight Fundamental Flaws in Academic Publishing,” Chinoiresie, October 16, 2018, http://www.chinoiresie.info/how-chinese-censors-highlight-fundamental-flaws-in-academic-publishing/.
19 Farhad Manjoo, “Apple’s Silence in China Sets Dangerous Precedent,” New York Times, July 31, 2017, B1; Paul Mozur, “China Exerts Digital Control beyond Its Borders,” New York Times, March 2, 2018, A1.

not when an item was originally published, since they can digitally modify or wipe it at any later point in time, leaving no trace of their handiwork behind.

For censors, the possibilities are mouthwatering. Digital platforms offer them dynamic, fine-grained mastery over memory and identity, and in the case of China, they are capitalizing on this to engineer a pliable version of the past that can be tuned algorithmically to always serve the CCP’s present. Dazzled by the abundance of sources on these platforms, we have failed to grasp these Potemkin-like possibilities, much less their historiographical implications. Let us attend to that now.

Political-Legal Research 政法研究 and Law Science 法学 were the two dominant academic law journals published in the PRC during the 1950s. The original print editions of these publications document the construction of China’s post-1949 socialist legal system and the often savage debates that seized it. Few libraries outside the PRC possess complete original print runs of these journals, and with the advent of convenient digital editions, those that do have typically relegated the fragile paper volumes to off-site storage. For most users, online access is now the norm. Unfortunately, the online editions of Political-Legal Research and Law Science have been redacted in ways that materially distort the historical record but are invisible to the end user. The consequences are as unsettling as they are deliberate: the more faithful scholars are to this adulterated source base and the sanitized reality it projects, the more they may unwittingly promote the agendas of the censors.

Consider the issues originally published in the PRC from 1956 through 1958, which chronicle how budding debates over matters such as judicial independence, the transcendence of law over politics and class, the presumption of innocence, and the heritability of law abruptly gave way to vituperative denunciations of those ideas and their sympathizers. Currently, only two online platforms, China National Knowledge Infrastructure 中国知网 and the National Social Sciences Database 国家哲学社会科学学术期刊数据库, offer full-text coverage of these issues, and their holdings are identical, down to their silent omission of exactly the same sixty-three articles, a coincidence that suggests a common blacklist.

The temporal distribution of the omissions is striking. (See Figure 1.) They start

[Figure 1 chart: share of Pages (y-axis, 0% to 100%) marked Uncensored or Censored, by Year.Month of issue (x-axis, 1956 through 1958).]

FIGURE 1: Pages censored by publication date, Political-Legal Research and Law Science (1956–1958).


abruptly in the summer of 1957, when a wave of political persecutions known as the Anti-Rightist Campaign began, then crest over the next few months before generally tapering off as the campaign wound down.20 For the three years in question, more than 8 percent of the articles and 11 percent of the total page count have been erased from the online editions of these journals. Notably, the gaps are concentrated at the tops of their tables of contents, which means that the censors are today suppressing the articles these journals once proudly led with. For instance, the online edition of the October 1957 issue of Political-Legal Research omits seven out of eleven main articles, reducing a fifty-seven-page issue to twenty-three pages. Likewise, the online edition of the August 1957 issue of Law Science omits the first nine articles, reducing a seventy-two-page issue to forty-two pages. The table of contents belonging to that issue appears here in Figure 2, scanned from an original paper edition. The articles missing online are marked with arrows, and their translated titles appear alongside.

FIGURE 2: Articles missing from the online edition of the August 1957 issue of Law Science.

As one might expect, the search engines on both platforms are blind to the missing content, returning only sanitized results and leaving the end user none the wiser. Similarly, the online tables of contents for affected issues display unbroken lists of articles with no placeholders or notations for the omissions. On one site, the only hint would be unexplained gaps in the page number sequence of the articles, a detail that is easy to overlook or miss the significance of. The other site omits even this clue by forgoing page numbers altogether.

20 The small spike in October 1958 corresponds to the conclusion of the Fourth National Judicial Work Conference, which formalized the purge of many of the cadres who had led the PRC judicial system during the preceding decade, and laid down a corrective, strongly leftist ideological line.

There is little doubt that the omissions are content-driven. Computational tools illustrate this clearly. Figure 3 plots the articles from both journals. The spatial arrangement of the markers denotes the relative proximity of the texts to one another based on their discursive (cosine) similarity. Each dot represents a document present in the online corpora, and each triangle represents a document missing from them. If the criteria for the omission of an article were essentially random, one would expect to see both markers distributed similarly across the document space. If, on the other hand, the criteria were deterministic, one would expect instead to see structure in the data, manifested as clustering or separation between the markers. The actual results are unmistakable. The uncensored facet is evenly distributed, while the censored facet is not. This finding strongly suggests that the omission of texts is not random, but rather involves a discriminating logic, though we must go deeper to determine what that is.

FIGURE 3: Cosine similarity of combined corpora with (un)censored facets.

I have read all 737 articles in these corpora closely, and I have devoted years of study to the domain they describe. But I am also mindful of my human subjectivity, especially my susceptibility to ordinary cognitive bias, and the possibility that salient details and relationships among their nearly four million characters of text might elude me. A procedure called χ² (chi-squared) feature selection mitigates these limitations by measuring the strength of the dependence between the appearance of a term in a text and the membership of that text in the censored class. The higher the score, the stronger the correlation. The results are highly informative and allow me, broadly speaking, to reverse-engineer the logic behind the censorship.

FIGURE 4: χ² feature selection (χ² score, p-value, α=.05).

Political-Legal Research
Rightist element: 7.52, p=.006
Wang Han: 6.84, p=.009
Rightist: 6.09, p=.014
Wang Jixin: 5.33, p=.021
Lu Mingjian: 5.24, p=.022
Campaign to Eliminate Counterrevolutionaries: 5.23, p=.022
To weigh the evidence: 4.79, p=.029
Deduce: 4.58, p=.032
Qian Duansheng: 4.27, p=.039
Jia Qian: 4.19, p=.041
Wu Chuanyi: 3.95, p=.047

Law Science
Yang Zhaolong: 45.53, p=1.50E-11
Wang Zaoshi: 10.11, p=.001
Rule of law: 6.35, p=.012
Kuomintang: 5.84, p=.016
Campaign to Eliminate Counterrevolutionaries: 5.62, p=.018
Rightist element: 5.05, p=.025
Roscoe Pound: 4.95, p=.026
News: 4.90, p=.027
Schools & Departments: 4.84, p=.028
Scientific: 4.82, p=.028
Law: 4.76, p=.029
Rule of man: 4.73, p=.030
Luo Jiaheng: 4.64, p=.031
China Democratic League: 4.48, p=.034
Rightist: 4.48, p=.034
Legislation: 4.34, p=.037
Fascist: 4.04, p=.045
(Nationalist) Six Codes: 3.87, p=.049

Figure 4 lists the terms (features) most closely correlated with the articles censored from the online editions of each journal, the relative strength of those correlations, and their respective degrees of statistical significance. For the uninitiated, these feature lists read like keyword tags from the ferocious ideological debates that seized the Chinese legal system in the Anti-Rightist period. Notably, every name that appears on them, save one, identifies a prominent figure who was flagrantly persecuted as a personification of heterodoxy and an example to others. The sole exception is Roscoe Pound, the former


dean of Harvard Law School, who appears on the Law Science list by virtue of his past association with the top feature, Yang Zhaolong.21 (See Figure 5.)

FIGURE 5: Yang Zhaolong and Roscoe Pound (1947). Historical & Special Collections, Harvard Law School Library.

21 When Pound served as a legal advisor to the Nationalist Ministry of Justice from 1946 to 1948, Yang was frequently his interpreter and translator, and the two collaborated closely on field surveys of the Chinese judiciary and on plans for the reform of Chinese legal education. Ai Yongming 艾永明 and Lu Jinbi 陆锦璧, eds., Yang Zhaolong faxue wenji 杨兆龙法学文集 [Collected Legal Writings of Yang Zhaolong] (Beijing, 2005), 467–558.
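For readers who want to see the mechanics of the χ² scoring discussed above, the following is a minimal sketch, not the procedure actually used for Figure 4, whose implementation the text does not specify. It assumes the simplest variant: a 2×2 table of term presence against censored status for each term, scored with Pearson's χ² statistic and one degree of freedom. The function names and the toy corpus are hypothetical. The p-values printed in Figure 4 do appear consistent with one degree of freedom, which the last two lines check for two of the reported scores.

```python
import math
from collections import Counter


def chi2_p_value(chi2):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def chi2_term_scores(docs, censored):
    """Score every term with the 2x2 Pearson chi-squared statistic.

    docs     -- list of token lists, one per article (hypothetical input format)
    censored -- parallel list of booleans, True if the article is missing online
    Returns {term: (chi2, p_value)}.
    """
    n = len(docs)
    n_censored = sum(censored)
    in_all = Counter()       # number of articles containing the term
    in_censored = Counter()  # number of censored articles containing the term
    for tokens, is_censored in zip(docs, censored):
        for term in set(tokens):
            in_all[term] += 1
            if is_censored:
                in_censored[term] += 1

    scores = {}
    for term, doc_freq in in_all.items():
        a = in_censored[term]            # term present, article censored
        b = doc_freq - a                 # term present, article uncensored
        c = n_censored - a               # term absent, article censored
        d = (n - n_censored) - b         # term absent, article uncensored
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            continue  # term occurs in every article, or one class is empty
        chi2 = n * (a * d - b * c) ** 2 / denom
        scores[term] = (chi2, chi2_p_value(chi2))
    return scores


# Toy corpus, invented purely for illustration.
toy_docs = [
    ["rightist", "element", "law"],
    ["rightist", "law", "court"],
    ["law", "code", "court"],
    ["law", "code", "reform"],
]
toy_censored = [True, True, False, False]
toy_scores = chi2_term_scores(toy_docs, toy_censored)
for term, (score, p) in sorted(toy_scores.items(), key=lambda kv: -kv[1][0]):
    print(f"{term}: chi2={score:.2f}, p={p:.3f}")

# Two scores reported in Figure 4, converted to p-values under the 1-df assumption.
print(round(chi2_p_value(7.52), 3), round(chi2_p_value(3.87), 3))
```

One caveat about this sketch: the statistic measures the strength of an association, not its direction. In the toy corpus, "code" scores as high as "rightist" even though it occurs only in uncensored articles, so the direction has to be read from the counts themselves.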

[Figure 6 chart: tf-idf score (y-axis, 0 to 0.3) of articles marked Uncensored or Censored, by Year.Issue (x-axis, 1956.1 through 1958.6).]

FIGURE 6: “Rightist element,” Political-Legal Research (1956–1958).
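The tf-idf weight plotted in Figures 6 and 7, and the cosine similarity behind Figure 3, can be illustrated with a short sketch. It is offered under stated assumptions rather than as a reconstruction of the study's pipeline: tf is taken as a term's count divided by article length, idf as the natural log of the number of articles over the number containing the term, and the censored share of a term's weight as a plain sum over articles. The text does not say which tf-idf variant, tokenization, or aggregation was used, and all function names and the toy corpus below are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per article.

    Assumed weighting: tf = count / article length, idf = ln(N / document frequency).
    """
    n = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        if not tokens:
            vectors.append({})  # an empty article has no weights
            continue
        counts = Counter(tokens)
        length = len(tokens)
        vectors.append(
            {t: (c / length) * math.log(n / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def censored_share(vectors, censored, term):
    """Fraction of a term's summed tf-idf weight that sits in censored articles."""
    total = sum(v.get(term, 0.0) for v in vectors)
    removed = sum(v.get(term, 0.0) for v, c in zip(vectors, censored) if c)
    return removed / total if total else 0.0


# Toy corpus, invented purely for illustration.
toy_docs = [
    ["rightist", "element", "rightist", "law"],  # treated as censored
    ["law", "court", "rightist", "code"],
    ["law", "code", "court", "reform"],
]
toy_censored = [True, False, False]
toy_vectors = tfidf_vectors(toy_docs)

share = censored_share(toy_vectors, toy_censored, "rightist")
print(f"share of 'rightist' weight in censored articles: {share:.2f}")
print(f"similarity of articles 1 and 2: {cosine_similarity(toy_vectors[1], toy_vectors[2]):.2f}")
print(f"similarity of articles 0 and 2: {cosine_similarity(toy_vectors[0], toy_vectors[2]):.2f}")
```

In this toy case the censored article shares no distinctive vocabulary with the uncensored ones, so its similarity to them is zero, which is the kind of separation between facets that a plot like Figure 3 would make visible. A fuller treatment would also recompute idf on the redacted corpus, since deleting articles changes document frequencies as well.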

To get a taste of how censors are purposefully rewriting the history of this period, let us briefly examine the term most closely correlated with censorship in each journal. Figures 6 and 7 plot the weight of these terms over time, arranged in the sequential order of the articles in which they appear, and color-coded according to the censorship status of those articles.22 This casts a spotlight on exactly which arguments the censors are selecting for and against, and the discursive effect that sorting has.

The term most highly correlated with censorship in Political-Legal Research is “rightist element.” As one might expect, anyone labeled a “rightist element” was singled out for persecution during the Anti-Rightist Campaign. It turns out that censors have cut the weight of “rightist element” in my three-year sample of this journal by 41 percent, which at the very least warps our sense of how the term was actually used, who it described, how usage may have changed over time, and why.

The warping impact of the censors is still more pronounced for the top term from Law Science: Yang Zhaolong. Yang (1904–1979) was one of the most internationally respected Chinese jurists of his generation. A graduate of Harvard Law School (SJD ’35), he held a variety of high academic and governmental positions in the Nationalist era (1927–1949), including chief of the Ministry of Justice’s Criminal Division, where he directed Chinese participation in the Tokyo war crimes trials. In 1949, underground CCP operatives persuaded him to remain on the mainland to serve the incoming communist regime, though his background soon foreclosed that possibility. During a brief political thaw from 1956 to early 1957, Yang was invited to join Fudan University’s law faculty and the editorial board of Law Science.
In those months, he contributed articles to the journal that cogently refuted CCP orthodoxy on the class nature and heritability of law, and on cause and effect in criminal law.23 He also drew

22 Weight refers to a term’s tf-idf score, a standard statistical measurement of the importance of a term in a given document and corpus.
23 Yang Zhaolong 杨兆龙, “Falü de jiejixing he jichengxing” 法律的阶级性和继承性 [On the Class Nature and Heritability of Law], Huadong zhengfa xuebao 华东政法学报 [East China Journal of Politics and Law] 3 (1956): 26–34; Yang Zhaolong 杨兆龙, “Xingfa kexue zhong yinguo guanxi de jige wenti” 刑法科学中因果的几个问题 [Several Problems in the Relationship between Cause and Effect in the Science of Criminal Law], Faxue 法学 [Law Science] 1 (1957): 61–63.

[Figure 7 chart: tf-idf score (y-axis, 0 to 0.7) of articles marked Uncensored or Censored, by Year.Issue (x-axis, 1956.1 through 1958.9).]

FIGURE 7: “Yang Zhaolong,” Law Science (1956–1958).

attention in public forums to the slow pace of codification in the PRC, the low quality of CCP legal personnel, and official discrimination against highly trained non-party experts like himself.24 For his prestige and frankness, Yang paid a heavy price. When the Anti-Rightist Campaign struck Shanghai, a wave of calumny enveloped him.

The censors have slashed Yang’s footprint in my sample of this journal by 83 percent, mostly by excising his critiques of the legal system’s practical defects and the searing rebuttals he endured. (See Figure 7.) In fact, those rebuttals account for the largest cohort of articles censored from Law Science, which is why his score on the χ² feature selection test eclipses all other terms in my corpora by a wide margin. Evidently, the censors wish us to come away with the breathtaking misconception that Yang had little that was controversial to say.25 But back in the real world, his ideas made him the top target of the Anti-Rightist Campaign in Shanghai’s legal community, and ultimately led to twelve years of imprisonment as a rightist and counterrevolutionary.
Like Yang, the other individuals who appear on these feature lists promoted values

24 “Peiyang xinsheng liliang, hai you bushao wenti” 培养新生力量, 还有不少问题 [Training Up a New Force, There Are Still Many Problems], Xinmin wanbao 新民晚报 [New People’s Evening Post], May 4, 1957, 1–2; “Shanghai zhishijie tan guanche baijia zhengming wenti” 上海知识界谈贯彻百家争鸣问题 [Shanghai’s Intellectual Community Discusses the Problem of Implementing “Let One Hundred Flowers Bloom, Let One Hundred Schools Contend”], Guangming ribao 光明日报 [Enlightenment Daily], May 1, 1957, 2; “Sifa gongzuo ‘qiang’ gao ‘gou’ shen, Minmeng shiwei zhaokai sifa zuotanhui pangtingji” 司法工作墙高沟深, 民盟市委召开司法座谈会旁听记 [The “Walls” Are High and the “Chasm” Deep in Judicial Work: Notes from a Forum on the Administration of Justice Convened by the Municipal Committee of the China Democratic League], Xinmin bao 新民报 [New People’s Evening Post], May 19, 1957, 1; Yang Zhaolong 杨兆龙, “Falüjie dang yu feidang zhi jian” 法律界党与非党之间 [The Split between Party and Non-Party in the Legal Community], Wenhui bao 文汇报 [Wenhui Daily], May 8, 1957, 2; Yang Zhaolong 杨兆龙, “Woguo zhongyao fadian heyi chichi hai bu banbu?” 我国重要法典何以迟迟还不颁布? [Why after So Long Have China’s Key Legal Codes Not Been Promulgated?], Xinwen ribao 新闻日报 [Daily News], May 9, 1957, 2–3; Yang Zhaolong, “Wo tan jidian yijian” 我谈几点意见 [I Discuss Several Points], Xinwen ribao 新闻日报 [Daily News], June 6, 1957, 3.

25 The record indicates otherwise. Mei Naihan 梅耐寒, “Guanyu ‘fa de jiejixing he jichengxing’ de taolun: Jieshao Shanghai faxuehui dierci xueshu zuotanhui” 关于‘法的阶级性和继承性’的讨论: 介绍上海法学会第二次学术座谈会 [The Discussion on “The Class Nature and Heritability of Law”: Introducing the Shanghai Law Society’s Second Academic Forum], Faxue 法学 [Law Science] 2 (1957): 28–30; Zhang Jinghua 张景华, “Guanyu falü jichengxing zhong de jige wenti” 关于法律继承性中的几个问题 [Several Questions Concerning the Heritability of Law], Faxue 法学 [Law Science] 5 (1957): 18–21.


FIGURE 8: Coverage of two other major journals (article count). [Stacked bar chart; x-axis: share of articles, 0% to 100%; series: Censored and Uncensored; data labels as printed: People’s Judicature (1957–1958), 402 and 180; Teaching and Research (1956–1958), 100 and 444.]

associated with the rule of law and greater separation between party and state. Today, the record of their arguments and the persecutions they endured sticks like a thorn in the side of a regime that has since not only written the rule of law into its constitution, but also turned history on its head by presenting the current slogan “Socialist Rule of Law with Chinese Characteristics” as the culmination of a proud originalist vision.26 It falls to the censors to discreetly resolve such conflicts between past and present, and they have been busy.

In fact, with the CCP intent on publicizing its “China solution” 中国方案 as a proven alternative to liberal democracy, no topic or historical period is safe from their touch. Other publications, such as People’s Judicature 人民司法(工作), the official organ of the courts, and Teaching and Research 教学与研究, a leading social science journal, are missing not just discrete articles but also entire issues. (See Figure 8.) Additionally, President Xi Jinping’s 2001 doctoral dissertation has vanished from relevant databases, as has recent scholarship on the systems of secret informants that currently permeate Chinese schools and workplaces.

Censorship even encumbers the online archives of the CCP’s official newspaper, the People’s Daily. Searching for sensitive terms that have appeared in the print edition of this newspaper can sever a user’s connection and lock out access for several minutes at a time. Trickier to spot, identical queries can produce different results, depending on whether the vendors supplying access to the archive host their servers in China or outside of it. Similarly, many Chinese state archives are transitioning to digital document delivery, which allows them to screen requests granularly using nothing more than metadata and a patron’s profile. They can monitor the files a patron receives and the specific pages a patron takes an interest in.
Archival documents formerly accessible in paper format are frequently absent from these digital facsimiles. The custodians of such digital collections, and the Western publishers who make common cause with them, are plainly not neutral third parties, which serves as a cautionary tale for us all. By stealthily omitting certain topics, voices, and opinions, they are concealing basic facts and distorting what the discourse on a given subject actually

26 Glenn Tiffert, “Socialist Rule of Law with Chinese Characteristics: A New Genealogy,” in Fu Hualing, John Gillespie, Pip Nicholson, and William Edmund Partlett, eds., Socialist Law in Socialist East Asia (New York, 2018), 72–96.

looked like, where the weight of opinion on it may have been, and how that might have changed over time. They not only are complicit in the intentional misrepresentation of history, but are also contaminating research based on their holdings and violating the trust of their users. By tendentiously distorting consciousness of China’s past, they are prejudicing its possible futures.

WE MUST NOT SUPPOSE that these dangers are peculiar to China.27 Rather, they are emblematic of our deepening digital dependence and the redistribution of power it entails. As knowledge transcends traditional fixed media, it is slipping from our grasp and falling ever more under the control of those able to enclose, harness, and commodify it anew. In myriad ways, this puts us at the mercy of their better angels: reliant on their stewardship, and more vulnerable than ever to the political, regulatory, commercial, and licensing terms that may impinge upon it. Seduced by the digital dream, we hasten our submission when we purge from our shelves the physical evidence necessary to independently monitor the performance of these new providers and hold them to account. Unexpectedly, intellectual property law intensifies our disadvantage.
Political-Legal Research and Law Science will be under copyright in the United States until ninety-five years after their original date of publication, or the early 2050s at the earliest, which precludes republishing the censored content without the consent of its Chinese rights holders.28 This means that simply by digitally consolidating sources on servers under its control, a savvy government or other interested party can adulterate the historical record with unprecedented ease, not just at home but also universally, the better to achieve information dominance and shape the global public opinion battlefield. Alternatively, by flexing its market power or by arranging for proxies to take ownership stakes in content providers, it could secure similar results. Either way, the conditions for agitation, propaganda, and disinformation to flourish could scarcely be more favorable. We are tempting fate by passively waiting for those audacious or secure enough to exploit them. The KGB had a name for such deeds: active measures.29

27 Matthew Connelly, “State Secrecy, Archival Negligence, and the End of History as We Know It,” Knight First Amendment Institute at Columbia University, September 2018, https://knightcolumbia.org/content/state-secrecy-archival-negligence-and-end-history-we-know-it; National Archives and Records Administration, 2018–2022 Strategic Plan, February 2018, https://www.archives.gov/files/about/plans-reports/strategic-plan/2018/strategic-plan-2018-2022.pdf; Jennifer Schuessler, “Obama’s Libraryless Library,” New York Times, February 21, 2019, C1; Bob Clark, “In Defense of Presidential Libraries: Why the Failure to Build an Obama Library Is Bad for Democracy,” The Public Historian 40, no. 2 (2018): 96–103; Meredith R. Evans, “Presidential Libraries Going Digital,” ibid., 116–121.

28 17 U.S.C. §104A, §108(h).
However, there can be a limited right for libraries, archives, and museums to reproduce a work during the last twenty years of its copyright term. Tyler T. Ochoa, “Copyright Protection for Works of Foreign Origin,” in Jan Klabbers and Mortimer Sellers, eds., The Internationalization of Law and Legal Education (London, 2008), 167–190; Elizabeth Townsend Gard, “Creating a Last Twenty (L20) Collection: Implementing Section 108(h) in Libraries, Archives and Museums,” SSRN, October 2, 2017, revised December 3, 2017, https://dx.doi.org/10.2139/ssrn.3049158.

29 T. S. Allen and A. J. Moore, “Victory without Casualties: Russia’s Information Operations,” Parameters 48, no. 1 (2018): 59–71; Thomas Boghardt, “Operation INFEKTION: Soviet Bloc Intelligence and Its AIDS Disinformation Campaign,” Studies in Intelligence 53, no. 4 (2009): 1–24; David King, The Commissar Vanishes: The Falsification of Photographs and Art in Stalin’s Russia, new ed. (London, 2014); United States Department of State, Active Measures: A Report on the Substance and Process of Anti-U.S. Disinformation and Propaganda Campaigns, Department of State Publication 9630, August 1986, chap. 5.


I RECKON THAT THE ARTICLES excised from the online editions of Political-Legal Research and Law Science were personally selected by specialists schooled in the pertinent history and its bearing on current controversies. Their choices evince discernment and care. Nevertheless, technology may soon relieve them of this burden. The computational techniques I have employed to analyze these corpora are double-edged weapons; they can be used to automate and enhance the work of the censors, too. In anticipation of this development and as a proof of concept, I built a predictive text-classification model that uses machine learning to analyze and censor my corpora. In mere minutes, my model can independently reproduce the choices made by the actual human censors with an average accuracy of 95 percent. It can also process colossal volumes of text far more easily than unassisted human censors possibly could.30 With access to more training data, such as the contents of an online article platform comprising perhaps billions of characters, I could undoubtedly improve its accuracy, and the implications are staggering.

My results demonstrate that effective automated manipulation of the past lies within our grasp. By freely modulating the nearly 600,000 parameters in my model, a censor could concoct bespoke versions of the historical record on demand, each exquisitely tuned to the shifting ideological or political requirements of the present, much like a recording engineer amplifies, attenuates, adds, or removes sound by manipulating the controls on a mixing console to achieve the perfect mix.
One could furthermore devolve this task to an artificial intelligence able to roam the breadth of our archives, endlessly reconstructing them according to preprogrammed templates that can adapt in real time to the prevailing winds and learn from the behaviors of countless users recorded every minute of every day. Human reviewers need only weigh in at the margins.

This technology swaddles us now. In China, it powers the most sophisticated regime of online surveillance and censorship on the planet.31 In the United States, it helps social media firms assemble our newsfeeds and suppress computational agitprop, copyright infringements, and odious content.32 In Europe, it facilitates compliance with the “right to erasure” of personal data codified in the 2016 General Data Protection Regulation (GDPR).33 Curating history is merely an additional use case in a bulging portfolio of applications that has firms around the world scrambling for competitive advantage and market dominance.34 Of course, every advance they make also accrues to the censors. It

30 For a discussion of tools and methods, see the appendix.

31 Zhongguo xintongyuan 中国信通院 [China Academy of Information and Communications Technology], “Rengong zhineng anquan baipishu” 人工智能安全白皮书 [White Paper on Artificial Intelligence Security], September 2018, http://www.caict.ac.cn/kxyj/qwfb/bps/201809/P020180918473525332978.pdf.

32 John Herrman, “Online Platforms Annexed Much of Our Public Sphere, Playacting as Little Democracies—Until Extremists Made Them Reveal Their True Nature,” New York Times, August 21, 2017, MM18; Mark Zuckerberg, “A Blueprint for Content Governance and Enforcement,” Facebook, November 15, 2018, https://www.facebook.com/notes/mark-zuckerberg/a-blueprint-for-content-governance-and-enforcement/10156443129621634/.
33 “Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation),” Official Journal of the European Union 59, no. L119 (2016): 1–87, here 43–44.

34 Shoshana Zuboff, “The Secrets of Surveillance Capitalism,” Frankfurter Allgemeine, March 5, 2016, http://www.faz.net/aktuell/feuilleton/debatten/the-digital-debate/shoshana-zuboff-secrets-of-surveillance-capitalism-14103616.html; Ryan Gallagher, “Google CEO Tells Senators That Censored Chinese Search Engine Could Provide ‘Broad Benefits,’” The Intercept, October 12, 2018, https://theintercept.com/2018/10/12/google-search-engine-china-censorship/.

is a very short hop indeed from the technologies that compose our digital lives to the nightmare of Orwell’s memory hole, where reality is continuously reinvented by the powerful at will.

Left unchecked, this next-generation paradigm of information control will spread and may well drag the discipline of history into the quicksand of post-truth polemics, where the heroic efforts of individual scholars cannot save us. It will triumph merely by sowing doubt. After all, who can say for sure that a body of sources has not been compromised? And even if scholars are generally conscious of tampering with the digital record as an environmental factor conditioning research, they may not be able to discern the specific distortions it introduces into their work. Generational change and disruptions to our knowledge ecosystem are eroding our mastery of the analog backstops.

Identifying and debunking discrete instances of these known unknowns is not a scalable solution. It would require compiling parallel trusted baseline corpora for comparison, and advanced training in data science and diplomatics (the discipline devoted to the critical analysis of historical documents). It would necessitate ongoing vigilance as well, because digital censorship can be a moving target. Chasing that target would divert scarce resources from research and writing, set scholars against one another, subvert the truth value of their claims, and erode the public’s estimation of their competence, all to the advantage of those in the shadows. More importantly, we have no reliable mechanism for ensuring that the effort expended in achieving small victories would translate into anything larger. If the platforms make no amends, then others would still wander innocently into the same old traps.
This is the predicament Chinese studies confronts today, and it will come to other fields tomorrow.

Historians cannot overcome this singly, and obscurity is no defense. Instead, we should learn from the lost opportunities to forestall the crises now afflicting social media and election security. If we are to prevent the practices described here from proliferating, then we must mobilize to confront them structurally. Otherwise, growing unease about the integrity of the knowledge we consume and produce will metastasize and further sap the trust necessary for robust scholarship and democratic practice.

An institutional subscription to an online knowledge platform can cost tens of thousands of dollars or more annually, and subscribers, as consumers, must insist that they receive what they are paying for. Demanding that providers make unredacted collections available on alternate servers beyond the reach of interested censors is an important first step. Knowledge creators, archivists, learned societies, rights holders, and content providers must also design and implement a set of industry-wide best practices to uphold the integrity of our digital collections, transparently disclose omissions and modifications, and defend against tampering at the levels of the individual character, document, and corpus. Such standards must apply not only to the digitization of legacy analog sources (which are, after all, not eternal), but also to those “born” digital, and it is imperative that commercial providers in particular adopt them.

In short, we have passed the point where we can naïvely trust; if we truly value the integrity of the sources on which we depend, and in turn our professional credibility, then we must now also verify. A variety of solutions, such as digital signatures, blockchain certification, and ISO 16363 certification, with logos signifying validated standards compliance, are potentially available to meet those needs.
We must engineer such technical safeguards into the foundations of our burgeoning digital knowledge infrastructure. However, technical solutions by themselves are not enough. We must also back them up with mutually reinforcing collective statements of principle, supplemented as appropriate by fiduciary obligations stipulated in private contract and public law.35 The menace is real and already among us. Never before has knowledge been prone to such sweeping and supple manipulation. Our understanding of ourselves, and our future, hang in the balance.

Appendix: Tools and Methods

To the best of my knowledge, no libraries in the United States possess complete original print editions of Political-Legal Research, People’s Judicature, or Teaching and Research, and fewer than a handful possess complete original editions of Law Science. To assemble the corpora analyzed in this article, I therefore drew on fragmentary holdings from multiple institutions, supplemented by my personal reference collection and acquisitions from China.

To ensure commensurability and keep the scope of the project manageable, I decided to concentrate my analysis on articles published in Political-Legal Research and Law Science from 1956 through the end of 1958, the only period when the two journals overlapped. Fortunately, this permits one to juxtapose two historically significant moments: the Hundred Flowers Campaign and the Anti-Rightist backlash that abruptly followed, when the CCP’s brief solicitation of popular feedback switched to searing retribution.

Political-Legal Research published sixty-two issues between 1954 and 1966, when the Cultural Revolution forced its closure. Sponsored by the Chinese Association for Politics and Law in Beijing, it counted many of the highest legal officials in the central government among its patrons, and its coverage generally favored their statist priorities. By contrast, Law Science enjoyed a much shorter life, publishing just eighteen issues, all between 1956 and 1958, when the Anti-Rightist Campaign forced its closure.36 Law Science was sponsored by the East China Institute of Politics and Law in Shanghai, one of a handful of regional academies established in the early 1950s by the PRC state to train a new generation of socialist cadres for administrative and legal positions.
Preparing these journals for analysis required several labor-intensive steps, the most basic of which involved converting their printed pages into text files with a high degree of accuracy.37 To maximize fidelity, I used the best-preserved original print editions I could find, not reprints or reproductions. Every page of every issue in my three-year sample, more than 2,000 in total, was scanned at 600 dpi in grayscale without any lossy compression, resulting in nearly 20 gigabytes of data. Second, I used a commercial optical character recognition (OCR) package to convert those scans into plain text. I then sliced the output into individual files, one for each article. My final corpora consisted of 356 files from Political-Legal Research and 381 files from Law Science, comprising

35 An example of such a statement is Association of University Presses, “Facing Censorship: A Statement of Guiding Principles,” March 21, 2018, http://www.aupresses.org/news-a-publications/news/1692-facing-censorship-a-statement-of-guiding-principles.

36 The first three issues, dating from 1956, bear the title Huadong zhengfa xuebao 华东政法学报 [East China Journal of Politics and Law].

37 My workflow relied on the following principal software: MacOS (10.13.6), ABBYY FineReader (12.1.11), BBEdit (12.5), Anaconda3 (3.7), Scikit-learn (0.19.1), Keras (2.22), Java 8 (131), Stanford Chinese Segmenter (3.7) with the ctb standard, Mallet (2.08), and Microsoft Excel (16.16.3).

nearly four million characters in total.38 The two PRC platforms hosting these journals are currently censoring approximately 8 percent of the articles, or 11 percent of the total page count in my sample, though the latter share exceeds 50 percent for certain issues straddling the 1957/58 divide.

Third, university-educated native speakers of Chinese compared my Law Science corpus against the original documents, character by character, to establish the reliability of the OCR. The test set consisted of 127 pages (approximately 207,000 characters) from the original issues, and averaged a remarkable 99 percent accuracy, easily sufficient for my purposes. Time and funding precluded an equally exhaustive test of my Political-Legal Research corpus, but spot checks suggested similar accuracy.

Fourth, I wrote several programs in Python, the first of which stripped my corpora of semantically expendable characters, such as punctuation, alphanumerics, Cyrillic, gremlins, and whitespace. This reduced each file to an unbroken stream of Chinese characters, which was then fed to a segmentation algorithm that tokenized the texts. Unlike Western languages, Chinese does not separate semantic units with whitespace. The segmentation algorithm performs this step, which the natural language processing (NLP) routines in conventional computational text analysis require.39 It bears mentioning that my corpora were originally published just as the PRC was moving from traditional Chinese characters to the simplified set used today, and the articles in them consequently contain a transitional mixture of both. This confused the segmentation algorithm, and necessitated the preliminary measure of converting both corpora uniformly to simplified characters.
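The character-stripping step described above might look roughly like the following. This is a minimal sketch for illustration, not the program used in this study: the function name `keep_han` and the sample string are hypothetical, and treating the basic CJK Unified Ideographs block plus Extension A as the set of "Chinese characters" is an assumption.

```python
import re

# Assumption: a "Chinese character" is anything in the basic CJK Unified
# Ideographs block (U+4E00-U+9FFF) or Extension A (U+3400-U+4DBF).
# Texts that use rarer extension blocks would need those ranges added.
NON_HAN = re.compile(r"[^\u3400-\u4dbf\u4e00-\u9fff]+")


def keep_han(text: str) -> str:
    """Remove punctuation, alphanumerics, Cyrillic, and whitespace,
    leaving an unbroken stream of Chinese characters."""
    return NON_HAN.sub("", text)


# Hypothetical sample mixing Chinese, Latin, digits, Cyrillic, and punctuation.
sample = "法律的阶级性 (1957), Право: 继承性。\n"
print(keep_han(sample))
```

A whitelist of ranges is used here rather than a blacklist of unwanted characters because OCR debris is unpredictable; anything outside the chosen ranges is dropped, whatever it is.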
Fifth, the segmentation algorithm performed poorly for named entities, particularly personal names, which can be idiosyncratic. For example, surnames were commonly dissociated from given names. I therefore manually compiled a dictionary of several hundred entries, including every author as well as prominent organizations and individuals mentioned in my texts. A Python program then scanned the texts for occurrences of these names and reconstituted them correctly.

Sixth, I compiled a series of metadata files necessary for the analytics I intended to perform. These files are essentially spreadsheets (.csv) with columns for the filenames, article titles, author names, index number (year-issue#-article#), and censorship status of every article in various slices of my corpora. I built metadata files for the Political-Legal Research corpus, the Law Science corpus, a combined corpus, and yet another

38 One could arrive at a slightly different count, depending on how one divides sidebars and forums with contributions from multiple authors, but the key point is that every character was captured.

39 My methodology and workflow were informed by various texts in computational linguistics, natural language processing, and digital history, including Charu C. Aggarwal and ChengXiang Zhai, eds., Mining Text Data (New York, 2012); Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python (Cambridge, Mass., 2009); Michael C. Hout, Megan H. Papesh, and Stephen D. Goldinger, “Multidimensional Scaling,” Wiley Interdisciplinary Review of Cognitive Science 4, no. 1 (2013): 93–103; David Mimno, “Computational Historiography: Data Mining in a Century of Classics Journals,” ACM Journal on Computing in Cultural Heritage 5, no. 1 (2012): 3:1–3:19; Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger, “Tackling the Poor Assumptions of Naive Bayes Text Classifiers,” in Tom Fawcett, ed., Proceedings of the Twentieth International Conference on Machine Learning (Menlo Park, Calif., 2003), 616–623; David Underhill, Luke K. McDowell, David J. Marchette, and Jeffrey L. Solka, “Enhancing Text Analysis via Dimensionality Reduction,” in Weide Chang and James B. D. Joshi, eds., Proceedings of the IEEE International Conference on Information Reuse and Integration (Piscataway, N.J., 2007), 348–353. Portions of my code adapted examples shared by Paul Vierthaler and countless contributors to Stack Overflow.

combined corpus that substituted the city of publication (Beijing or Shanghai, respectively) for the censorship field, which allowed me to study the legal discourse of this period across not just time, but also space. Finally, after approximately five months of preparation, the corpora were ready for data analysis.

To perform that analysis, I wrote another Python program that stripped out stopwords from the preprocessed corpora and transformed them into both word2vec models and tf-idf matrices against which I could run a battery of exploratory statistical tests and classification algorithms, only a few of which appear in this article.40 Briefly, the word2vec models capture the contextual relationships among all of the terms in my corpora and facilitate semantic analysis of the underlying texts. The tf-idf matrices describe each of the 737 documents in my corpora in nearly 600,000 dimensions, where each dimension measures the importance of a unique unigram or bigram (e.g., “judge,” “Yang Zhaolong,” “rule of law”).

Next, I experimented with three approaches to building predictive models that could faithfully reproduce the choices made by the human censors. First, I evaluated several types of neural networks using my word2vec models, but these achieved lackluster results, perhaps because neural networks perform best with much larger datasets.41 Second, using my tf-idf matrices, I evaluated fourteen different classification algorithms and selected the most promising among them for optimization through grid search cross-validation, which iterated over thousands of possible configurations and reported performance for each on several metrics.42 Candidates from the gradient boosting family of classifiers generally achieved the highest scores.
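To make the tf-idf representation concrete, here is a small pure-Python sketch of weighting unigrams and bigrams in already-segmented documents. It is an illustration only: the study reports using Scikit-learn's TfidfVectorizer, whereas this hand-rolled version assumes that library's default behavior (raw counts as term frequency, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector). The function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return the unigrams of a token list plus, if n_max >= 2, its bigrams."""
    grams = list(tokens)
    if n_max >= 2:
        grams += [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return grams


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    counts = [Counter(ngrams(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        raw = {
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, tf in c.items()
        }
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows


# Hypothetical, already-segmented toy documents.
docs = [
    ["法官", "审判", "案件"],
    ["杨兆龙", "法治", "案件"],
    ["法官", "案件"],
]
weights = tfidf(docs)
for term, w in sorted(weights[1].items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{w:.3f}")
```

In the second toy document, a term found nowhere else (such as 杨兆龙) receives a higher weight than one found in every document (案件). That is the sense in which a tf-idf score measures a term's importance to a given document relative to the corpus.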
Third, I evaluated various ensemble stacking classifiers using the same grid search technique. Ensemble stacking classifiers run competing classification models in parallel, pool the results, and feed those up to a meta-classifier for final evaluation, on the theory that the whole may do better than the sum of its parts. This approach achieved the best performance for my dataset.43

Feature engineering played an important part in optimizing the performance of my prediction models. Recall that just over 8 percent of the documents in my corpora were censored, which means that nearly 92 percent were not. I used SMOTE synthetic sampling on the training data for my models to compensate for this imbalance, but not on the validation or testing data used to evaluate them. Likewise, my final stacking classifier used χ² feature selection to identify the features most highly correlated with censorship, and passed only the feature importances calculated from those by each of the level-one classifiers to the level-two meta-classifier.

I validated all of my models using stratified k-fold cross-validation and ranked them on the basis of their Matthews correlation coefficients, a metric suited to imbalanced classes. Stratified k-fold cross-validation divides the corpus into k subsets (called “folds”), each with the same distribution of censored and uncensored documents as the corpus at large. It trains the model on k-1 folds, and then evaluates the model’s performance against the sole remaining fold, which the model has never seen before. It repeats this process k times in total, holding out a different fold from the model-building each time. Finally, it averages the metrics returned by each iteration.

There is an art to predictive model-building, and no doubt the potential to extract still higher performance remains, but my ultimate goal in this study lay elsewhere: to raise consciousness of an emergent new paradigm of information control that is beginning to encroach on the integrity of the historical record and will soon proceed apace. Against the background of a political climate that is testing the vigor of liberalism, this development, coupled with the snowballing aggregation of our global knowledge base onto platforms beyond our control, promises to be game-changing.

40 The normalized tf-idf values were computed using the TfidfVectorizer class in Scikit-learn 0.19.1, according to the following function: $\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \left( \log \frac{1 + n_d}{1 + \mathrm{df}(d,t)} + 1 \right)$. The exploratory data analysis techniques included t-SNE analysis (TruncatedSVD), principal component analysis, Euclidean distance calculations (MDS), scree plots, cosine similarity calculations (MDS), silhouette coefficient calculations, spherical k-means clustering (MDS), and hierarchical clustering analysis (Ward dendrogram). I also performed χ² feature selection and topic modeling (Mallet, LDA) on my corpora.

41 The candidates included recurrent neural networks with(out) LSTM and a convolutional neural network using the Tensorflow backend to Keras.

42 The classification algorithms evaluated were dummy, Gaussian naïve bayes, Bernoulli naïve bayes, multinomial naïve bayes, k-nearest neighbors, logistic regression, random forest, linear support vector machine, non-linear support vector machine, decision trees, gradient boosting, light gradient boosting, xtreme gradient boosting, and a multi-layer perceptron. I evaluated them on their mean accuracy, F1 score, Matthews correlation coefficient, sensitivity rate, and specificity rate.

43 The winning model was an optimized two-level ensemble stacking classifier comprising a gradient boosting classifier and light gradient boosting machine at level one, and a Bernoulli naïve bayes meta-classifier at level two. After ten stratified k-fold cross-validations, the model achieved a mean accuracy of 0.95, a mean F1 score of 0.97, a mean Matthews correlation coefficient of 0.69, a mean sensitivity rate of 0.97, and a mean specificity rate of 0.73.
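The validation logic can be illustrated with a short pure-Python sketch showing how stratified folds preserve the class balance and why the Matthews correlation coefficient is more informative than accuracy when only about 8 percent of documents are censored. This is a simplified stand-in, not the code behind the reported results (which relied on Scikit-learn): the function names are hypothetical, the 100 labels are invented toy data, and the fold assignment is deterministic, whereas real implementations usually shuffle first.

```python
import math
from collections import defaultdict


def stratified_folds(labels, k):
    """Deal document indices into k folds, class by class, round-robin,
    so every fold keeps roughly the corpus-wide class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = censored)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, the coefficient is 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy corpus: 8 censored (1) and 92 uncensored (0) documents.
labels = [1] * 8 + [0] * 92
folds = stratified_folds(labels, k=4)
for fold in folds:
    print(len(fold), sum(labels[i] for i in fold))

# A "model" that never flags anything as censored.
always_uncensored = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_uncensored)) / len(labels)
print(accuracy, matthews_corrcoef(labels, always_uncensored))
```

On this toy data, a classifier that never flags a document still scores 92 percent accuracy, yet its Matthews correlation coefficient is 0, which is why ranking models by the latter guards against rewarding a trivial majority-class guess.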

Glenn D. Tiffert is a Visiting Fellow at the Hoover Institution, and a historian of modern China. His research has centered on Chinese legal history, including publi- cations on constitutionalism, the construction of a modern judiciary, and the gene- alogy of the rule of law in the PRC. His current book manuscript, provisionally entitled “Judging Revolution,” radically reinterprets the Mao era and the 1949 rev- olution by way of a deep archival dive into the origins of the PRC judicial system. His current research probes the intersections between information technology and authoritarianism, and the ramifications of China’s rise for American interests.
