Data Mining the 1918 Influenza Pandemic: An Introduction to the Epidemiology of Information
Partner Institutions: Virginia Tech and the University of Toronto
Presentation for the Joint Conference on Digital Libraries
Washington DC, June 12, 2012
Tom Ewing ([email protected]), Virginia Tech

Kansas City Star, October 10, 1918, p. 8
Accessed from Newsbank America’s Historical Newspapers Collection

An Epidemiology of Information

Research Questions
– Influenza and the War, Fall 1918
– Tracking the Flow of Information
– Prevention, Treatment, and Cure

Big Data

100,000 PLUS newspaper articles about the influenza pandemic in the United States and Canada, 1917-1919

Data Sources
• Chronicling America
• Peel’s Prairie Provinces
• Readex Newsbank America’s Historical Newspapers *
• Proquest Historical Newspapers *
• Georgia / California Newspaper Projects
• Newspapers Not (Yet) Digitized (Microfilm)
* Subscription only

Articles (Key word: “influenza”)

Database (Titles)                         1917-1919   Just 1918
Chronicling America                          12,365       6,389
Peel’s Prairie Provinces                      2,147       1,212
Newsbank America’s Historical Newspapers     51,929      31,717
Proquest:
  New York Times                              9,304       3,518
  Washington Post                             1,545       1,069
  San Francisco Chronicle                     1,366         914
  Los Angeles Times                          13,033       1,970
  Chicago Tribune                             3,430       1,455
  Atlanta Constitution                        1,772         931
  Baltimore Sun                               3,586       1,639
  Boston Globe                                1,440         843
Georgia Newspaper Project                       669         517
California Newspaper Project                    203         123
Totals                                      102,789      52,174

Project Team

Principal Investigators:
– Tom Ewing, Department of History (VT)
– Bernice Hausman, Department of English (VT)
– Bruce Pencek, University Libraries (VT)
– Naren Ramakrishnan, Dept of Computer Science (VT)
– Gunther Eysenbach, Centre for Global eHealth Innovation at University Health Network (UT)

Graduate Research Assistants:
– Samah Gad, Dept of Computer Science (VT)
– Michelle Seref, Department of English (VT)
– Laura West, Department of History (VT)

Kansas City Star, October 10, 1918, p. 8
Washington Star, October 10, 1918, p. 1
Evening Times (Washington), October 10, 1918, p. 6
Evening Missourian, Oct 10, 1918, p. 1
Evening Public Ledger (Philadelphia), October 10, 1918, p. 1
Atlanta Constitution, October 10, 1918

Method: Manual Text Analysis
• Step One: Identify Relevant Articles
– Key word search within date parameters, or
– Review every issue of newspaper on microfilm
• Step Two: Read and analyze each article
– Identify main themes
– Use textual evidence to address research questions
– Make comparisons across time / context
• Step Three: Develop interpretation
– Rhetoric: how does the language convey meaning?
– History: how does the text represent experience?

The Digging into Data Challenge

How can big data analysis enhance the methods of manual text analysis, with the goals of
1) contributing to understanding of the 1918 influenza epidemic; and
2) contributing to developing new methods of data mining applicable to other pandemics?

[Chart: Results for “influenza” in Kansas City Star, September 15-November 15, 1918; vertical axis 0 to 70]

[Figure: Topic Clouds, for Washington Times Newspaper]

Tone Analysis of Advertisements with word “influenza,” Washington Times and Kansas City Star, October 1918 (n=104)

[Pie chart. Legend: Alarmist; Warning; Encouraging; Patriotic / war; Explanatory / Information; Prevention / Preparation; Selling other products; Cure / Treatment; Recovery. Data labels as extracted, not matched to categories: 8, 22, 27, 10, 8, 27, 21, 9, 62]

Building a tone detection algorithm

Categories used to Describe Tone for Paragraphs before/after word “influenza”:
• Alarmist/Fear
• Warning/Caution
• Encouraging/Exhorting
• Patriotic/War
• Explanatory/information
• Prevention/Preparation
• Humorous/Anecdotal
• Cure/Treatment of Symptoms
• Recovery
• Incomprehensible

Major Challenges
• Matching research questions to practices across disciplines (History, Rhetoric, Public Health, Computer Sciences, Information Sciences)
• Inconsistency in data due to OCR procedures
• Gaps in digitized newspapers across databases
• Incorporating newspapers that are not digital
• Defining a consistent methodology that yields meaningful results

Next steps
• Develop methods to identify tone in articles about influenza from Chronicling America
• Expand applications across newspaper databases
• Refine methods for content analysis using data mining techniques
• Apply methods to track discussion of disease in contemporary forms of media (Facebook, Twitter, Google)

Contact information: Tom Ewing [email protected]

Roanoke Times, October 27, 1918, p. 2
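
Two of the steps described in the slides, the key word search within date parameters and the selection of paragraphs before/after the word “influenza,” can be illustrated with a short Python sketch. This is not the project's code. The record layout (dictionaries with `paper`, `date`, and `text` keys), the function names, the blank-line paragraph splitting, and the sample records are all assumptions made up for illustration; real OCR text would need more cleaning than this.

```python
import re
from collections import Counter
from datetime import date

# Case-insensitive whole-word match for the search term used in the slides.
KEYWORD = re.compile(r"\binfluenza\b", re.IGNORECASE)


def in_window(article, start, end):
    """True if the article falls inside the date range and mentions the keyword."""
    return start <= article["date"] <= end and KEYWORD.search(article["text"]) is not None


def daily_counts(articles, start, end):
    """Count matching articles per day, the kind of tally a results-per-day chart plots."""
    return Counter(a["date"] for a in articles if in_window(a, start, end))


def keyword_context(text, before=1, after=1):
    """Return, for each paragraph containing the keyword, that paragraph
    together with its neighbours (assumes blank-line paragraph breaks)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    windows = []
    for i, paragraph in enumerate(paragraphs):
        if KEYWORD.search(paragraph):
            windows.append(paragraphs[max(0, i - before): i + after + 1])
    return windows


# Hypothetical sample records, invented for illustration; not from the project's corpus.
articles = [
    {"paper": "Paper A", "date": date(1918, 10, 10),
     "text": "Schools closed.\n\nInfluenza cases rose today.\n\nDoctors urge rest."},
    {"paper": "Paper A", "date": date(1918, 10, 10),
     "text": "War bonds drive continues."},
    {"paper": "Paper B", "date": date(1918, 12, 1),
     "text": "Influenza wanes."},
]

counts = daily_counts(articles, date(1918, 9, 15), date(1918, 11, 15))
contexts = keyword_context(articles[0]["text"])
```

The context windows returned by `keyword_context` are the units a human reader or a classifier would then label with one of the tone categories; the labeling itself is deliberately left out, since the slides do not specify how the algorithm assigns tone.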