Data Mining the 1918 Influenza Pandemic: An Introduction to the Epidemiology of Information

Partner Institutions: Virginia Tech and the University of

Presentation for the Joint Conference on Digital Libraries, Washington, DC, June 12, 2012

Tom Ewing ([email protected]) Virginia Tech

Kansas City Star, October 10, 1918, p. 8. Accessed from Newsbank America’s Historical Newspapers Collection

An Epidemiology of Information

Research Questions
– Influenza and the War, Fall 1918
– Tracking the Flow of Information
– Prevention, Treatment, and Cure

Big Data

100,000 PLUS newspaper articles about the influenza pandemic in the United States and , 1917-1919

Data Sources
• Chronicling America
• Peel’s Prairie Provinces
• Readex Newsbank America’s Historical Newspapers *
• Proquest Historical Newspapers *
• Georgia / California Newspaper Projects
• Newspapers Not (Yet) Digitized (Microfilm)
* Subscription only

Articles (Key word: “influenza”)

Database (Titles)                            1917-1919   Just 1918
Chronicling America                             12,365       6,389
Peel’s Prairie Provinces                         2,147       1,212
Newsbank America’s Historical Newspapers        51,929      31,717
Proquest:
  New York Times                                 9,304       3,518
  Washington Post                                1,545       1,069
  San Francisco Chronicle                        1,366         914
  Los Angeles Times                             13,033       1,970
  Chicago Tribune                                3,430       1,455
  Atlanta Constitution                           1,772         931
  Sun                                            3,586       1,639
  Boston Globe                                   1,440         843
Georgia Newspaper Project                          669         517
California Newspaper Project                       203         123
Totals                                         102,789      52,174

Project Team

Principal Investigators:
– Tom Ewing, Department of History (VT)
– Bernice Hausman, Department of English (VT)
– Bruce Pencek, University Libraries (VT)
– Naren Ramakrishnan, Dept of Computer Science (VT)
– Gunther Eysenbach, Centre for Global eHealth Innovation at University Health Network (UT)

Graduate Research Assistants:
– Samah Gad, Dept of Computer Science (VT)
– Michelle Seref, Department of English (VT)
– Laura West, Department of History (VT)

Kansas City Star, October 10, 1918, p. 8
Washington Star, October 10, 1918, p. 1
Evening Times (Washington), October 10, 1918, p. 6
Evening Missourian, October 10, 1918, p. 1
Evening Public Ledger (Philadelphia), October 10, 1918, p. 1
Atlanta Constitution, October 10, 1918

Method: Manual Text Analysis
• Step One: Identify Relevant Articles
  – Key word search within date parameters, or
  – Review every issue of newspaper on microfilm
• Step Two: Read and analyze each article
  – Identify main themes
  – Use textual evidence to address research questions
  – Make comparisons across time / context
• Step Three: Develop interpretation
  – Rhetoric: how does the language convey meaning?
  – History: how does the text represent experience?
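Step One of the manual method (key word search within date parameters) can be sketched in a few lines of Python. This is only an illustration: the `Article` record, the function names, and the sample texts are hypothetical and do not reflect the schema of any of the databases listed above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    """Hypothetical record for one digitized article (not a real database schema)."""
    newspaper: str
    published: date
    text: str


def find_relevant(articles, keyword, start, end):
    """Keep articles that mention the keyword and fall inside the date range."""
    needle = keyword.lower()
    return [
        a for a in articles
        if start <= a.published <= end and needle in a.text.lower()
    ]


def daily_counts(articles):
    """Count articles per publication date."""
    return Counter(a.published for a in articles)


# Placeholder records for illustration only; the texts are invented, not quoted.
sample = [
    Article("Kansas City Star", date(1918, 10, 10), "Influenza cases reported in the city."),
    Article("Kansas City Star", date(1918, 10, 10), "War bond drive continues."),
    Article("Kansas City Star", date(1918, 10, 11), "Schools closed because of INFLUENZA."),
    Article("Kansas City Star", date(1918, 12, 1), "Influenza wanes."),
]

hits = find_relevant(sample, "influenza", date(1918, 9, 15), date(1918, 11, 15))
counts = daily_counts(hits)
for day in sorted(counts):
    print(day.isoformat(), counts[day])
```

Plain substring matching of this kind will miss words that OCR has mangled, so a count produced this way is best read as a lower bound.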

The Digging into Data Challenge

How can big data analysis enhance the methods of manual text analysis, with the goals of 1) contributing to understanding of the 1918 influenza epidemic; and 2) contributing to developing new methods of data mining applicable to other pandemics?

Results for “influenza” in Kansas City Star, September 15-November 15, 1918

[Chart: vertical axis from 0 to 70]

Topic Clouds for Washington Times Newspaper

Tone Analysis of Advertisements with word “influenza,” Washington Times and Kansas City Star, October 1918 (n=104)

[Pie chart; legend: Alarmist, Warning, Encouraging, Patriotic / war, Explanatory / Information, Prevention / Preparation, Selling other products, Cure / Treatment, Recovery]

Building a tone detection algorithm

Categories used to describe tone for paragraphs before/after the word “influenza”:
• Alarmist/Fear
• Warning/Caution
• Encouraging/Exhorting
• Patriotic/War
• Explanatory/Information
• Prevention/Preparation
• Humorous/Anecdotal
• Cure/Treatment of Symptoms
• Recovery
• Incomprehensible

Major Challenges

• Matching research questions to practices across disciplines (History, Rhetoric, Public Health, Computer Sciences, Information Sciences)
• Inconsistency in data due to OCR procedures
• Gaps in digitized newspapers across databases
• Incorporating newspapers that are not digital
• Defining a consistent methodology that yields meaningful results

Next steps

• Develop methods to identify tone in articles about influenza from Chronicling America
• Expand applications across newspaper databases
• Refine methods for content analysis using data mining techniques
• Apply methods to track discussion of disease in contemporary forms of media (Facebook, Twitter, Google)
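As one possible starting point for identifying tone, the categories used for paragraphs before and after the word “influenza” could be attached to a simple cue-word baseline. The sketch below is an assumption for illustration, not the project's algorithm: the cue words, the function names, and the sample advertisement are all invented, and only a subset of the categories is covered.

```python
import re

# Cue words are illustrative guesses, not a lexicon from the project.
CUE_WORDS = {
    "Alarmist/Fear": {"deadly", "panic", "dread", "plague"},
    "Warning/Caution": {"avoid", "beware", "caution", "danger"},
    "Patriotic/War": {"soldiers", "army", "liberty", "war"},
    "Prevention/Preparation": {"prevent", "mask", "ventilation", "gargle"},
    "Cure/Treatment of Symptoms": {"remedy", "relief", "cure", "tonic"},
    "Recovery": {"recovered", "recovering", "convalescent"},
}


def paragraphs_near(text, keyword="influenza"):
    """Return each paragraph containing the keyword plus its immediate neighbors."""
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    keep = set()
    for i, p in enumerate(paras):
        if keyword in p.lower():
            keep.update(j for j in (i - 1, i, i + 1) if 0 <= j < len(paras))
    return [paras[i] for i in sorted(keep)]


def score_tone(paragraph):
    """Count cue-word hits per category; return non-zero categories, highest first."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    scores = {
        tone: sum(w in cues for w in words)
        for tone, cues in CUE_WORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(tone, n) for tone, n in ranked if n > 0]


# Invented advertisement text, used only to exercise the functions.
ad = (
    "A fine tonic for the whole family.\n\n"
    "Avoid crowds and beware of influenza; caution is the best guard.\n\n"
    "Our soldiers in the army need your help."
)

for para in paragraphs_near(ad):
    print(score_tone(para), "|", para)
```

A paragraph with no cue-word hits returns an empty list and would still need a human reader. Hand-labeled examples such as the advertisement sample could later be used to check or replace the cue lists.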


Contact information: Tom Ewing [email protected]

Roanoke Times, October 27, 1918, p. 2