Toward New Epistemologies?
Total Page:16
File Type:pdf, Size:1020Kb
HATHITRUST! A Shared Digital Repository! Humanisc Inquiry with Large Corpora of Digized Text and Metadata: Toward New Epistemologies? Sayan Baacharyya Jeremy York, Assistant Director, HathiTrust January 9, 2015 Unless otherwise noted, these slides and their contents are licensed under a Creave Commons A'ribuDon Unported License. Outline • Part 1 – Brief history of digital texts – Overview of HathiTrust / Origins of HTRC • Part 2 – HathiTrust Research Center Demo – HathiTrust Research Center IniDaves • Part 3 – Discussion A Brief History of Digital Texts Digital text resources • Project Gutenberg (U of Illinois) – 1971 • Thesaurus Linguae Graecae (UC Irvine)– 1972 • Oxford Text Archive (U of Oxford) – 1976 • ARTFL Project (U of Chicago)– 1982 • Perseus Digital Library (Tu\s)– 1985 • Text Encoding IniDave – 1987 • Women Writers Project (Brown U) – 1988 Digital Imaging • Yale Open Book Project (1991) • Cornell Demonstraon Project (1993) • Library of Congress Naonal Digital Library; American Memory (1994) • Making of America (University of Michigan and Cornell) (1995) 9 January 2015 5 HathiTrust HathiTrust Members Allegheny College Tufts University University of Connecticut Arizona State University University of Notre Dame University of Delaware Baylor University University of Oklahoma University of Houston Boston College Library of Congress University of Illinois, Chicago Boston University Massachusetts Institute of University of Illinois, Urbana Brandeis University Technology Champaign Brown University McGill University University of Iowa California Digital Library Michigan State University University of Kansas Carnegie Mellon University Montana State University University of Maine Case Western Reserve University Mount Holyoke College University of Maryland Colby College New York Public Library University of Massachusetts, Columbia University New York University Amherst Cornell University North Carolina Central University University of Miami Dartmouth College North Carolina State University University of Michigan Duke University Northeastern University University of Minnesota Emory University Northwestern University University of Missouri Florida State University System Ohio State University University of Nebraska-Lincoln Georgetown University Oklahoma State University University of New Mexico Getty Research Institute Pennsylvania State University University of North Carolina, Harvard University Princeton University Chapel Hill Indiana University Purdue University University of Pennsylvania Iowa State University Rutgers University University of Pittsburgh Johns Hopkins University Stanford University University of Queensland Kansas State University University of California, University of Tennessee, Knoxville Lafayette College Berkeley University of Texas System Universidad Complutense de Davis University of Utah Madrid Irvine University of Vermont University of Alabama Los Angeles University of Virginia University of Alberta Merced University of Washington University of Arizona Riverside University of Wisconsin-Madison University of British Columbia San Diego Utah State University University of Calgary San Francisco Vanderbilt University Syracuse University Santa Barbara Virginia Tech Temple University Santa Cruz Wake Forest University Texas A&M University University of Chicago Washington University, St. Louis Texas Tech University Yale University The Name • The meaning behind the name – Hathi (hah-tee)--Hindi for elephant – Big, strong – Never forgets, wise – Secure – Trustworthy Growth of the CollecDon 14,000,000 13,022,118 12,000,000 10,599,355 10,878,121 9,966,572 10,000,000 7,836,698 8,000,000 6,000,000 5,221,092 4,000,000 2,477,871 2,000,000 0 2008 2009 2010 2011 2012 2013 2014 9 Total Numbers • 13 million total volumes • 6.6 million book Dtles • 340,000 serial Dtles • 4.8 million volumes in the public domain (~37%) • 576,000+ US government publicaons 9 January 2015 10 Content Providers University of Virginia Purdue University Northwestern Duke University University Yale University Universidad University of Chicago University of North University of Complutense Massachuse's Columbia University Carolina at Chapel Hill University of Keio Ge'y Research Boston College Minnesota University Ohio InsDtute Princeton University Library of State University of Florida New York Public Congress Library University of Penn State ConnecDcut Cornell University University of Illinois Indiana University University of Michigan University of Wisconsin University of California Harvard University 9 January 2015 11 Language DistribuDon (1) The top 10 languages make up Remaining Lan, 1% ~87% of all content Languages, 13% Arabic, 2% h'p://www.hathitrust.org/ Italian, 3% visualizaons_languages Japanese, 3% English, 49% Russian, 4% Chinese, 4% German, 9% Spanish, 5% French, 7% Language DistribuDon (2) The next 40 languages Romanian, 1% Telugu, MulDple-languages, 1% Slovenian, 1% Finnish, 1% Armenian, 1% Malayalam, 1% 1% make up Panjabi, 1% Slovak, Greek,- Malay, 1% Turkish,- Yiddish, 1% ~12% of Ancient- Marathi, 1% 1% O'oman, 1% (to-1453), 1% Catalan, Nepali, 0% total 1% Serbian, Portuguese, 7% 1% Polish, 7% Greek,- Ukrainian, 1% Bulgarian, 1% Modern- Dutch, 5% (1453--), 2% Vietnamese, 1% Sanskrit, 2% Hebrew, 5% Norwegian, 2% Hindi, 5% Bengali, 2% Hungarian, 2% Tamil, 2% Persian, 2% Indonesian, 4% Korean, 4% Croaan, 3% Thai, 3% Swedish, 4% Czech, 3% Danish, 3% Urdu, 3% Turkish, 3% Dates 0-1500, 0.04% 1500-1599, 0.07% 1600-1699, 0.01% 1800-1849 2000-2009 1850-1899 1700-1799, 0.01% 3% 10% 10% 1900-1909 1990-1999 4% 14% 1910-1919 4% 1920-1929 4% 1980-1989 14% 1930-1939 4% 1960-1969 1970-1979 1940-1949 11% 13% 4% 1950-1959 6% Content DistribuDon Government Documents 4% In Copyright or “Public Domain” Public Domain (US) Undetermined 37% Public Domain 19% 14% 63% Open Access 0.1% Creave Commons 0.1% Core FuncDonality • Preservaon fucDons – validaon, error- checking • Discovery • Reading/Access • Collecons • APIs • Data Mining/Analysis 9 January 2015 16 HATHITRUST.ORG Full text and catalog search 12 November 2014 18 Read on-screen or download 12 November 2014 19 Build and share a collecDon 12 November 2014 20 APIs • Bibliographic API – Volume and rights informaon – MARC records – h'p://www.hathitrust.org/bib_api • OAI – h'p://www.hathitrust.org/data • “Hathifiles” – h'p://www.hathitrust.org/hathifiles • Data API – Volume and rights informaon – Page images – OCR – h'p://www.hathitrust.org/data_api Link takes you to HathiTrust Records loaded into DPLA, local library catalogs, and commercial databases 12 November 2014 22 Data Mining/Analysis • Dataset DistribuDon – h'p://www.hathitrust.org/datasets • Research Center 12 November 2014 23 Examples of uses • Oxford English DicDonary research @bgzimmer Ben Zimmer 7/4/11 @armavirumque Problem is "cut the mustard" (OED 1891) predates "muster." Earliest I've seen for "muster" is 1912.h'p://bit.ly/kOy3aD • Thesis research • Islamic Manuscripts • Local/Family History Projects (1) • Burton, Vernon. “The South as ‘Other,’ the Southerner as ‘Stranger.’” – Explore how atudes expressed in print about slavery, southerners, and non-southerners have changed over both Dme and space. • Ted Underwood, Associate Professor of English at the University of Illinois, Urbana-Champaign. – Using public domain texts received from HathiTrust to explore changing relaonships in literary genres from 1700-1899. • Andrew Piper, Associate professor of German literature at McGill University. – Analyzing linguisDc paers in German texts from 1700-1900 Projects (2) • Amanda Watson, librarian at New York University. – Studying How poetry anthologies in selected texts reflect the rise and fall of poets’ reputaons over the course of the 19th century. • Glenn Worthey, Digital HumaniDes Librarian at Stanford University Libraries. – Performing spao-temporal invesDgaon into the history of Brazilian Portuguese, to be accomplished by text-mining methods (n-gram analysis, etc.). • Mahew Wilkens, Assistant professor of English, University of Notre Dame. – American Council of Learned SocieDes (ACLS) fellowship for project “Literary Geography at Scale.” How to find out more • About: h'p://www.hathitrust.org/about • Resources: h'p://www.hathitrust.org/resources • Twier: h'p://twi'er.com/hathitrust • Facebook: h'p://www.facebook.com/hathitrust • Monthly newsle'er: – h'p:www.hathitrust.org/updates – RSS h'p://www.hathitrust.org/updates_rss • Contact us: [email protected] • Blogs: h'p://www.hathitrust.org/blogs – Large-scale Search – Perspecves from HathiTrust Introduc*on to the HathiTrust Research Center (HTRC) Sayan Bhaacharyya With grateful acknowledgements to: Harrie' Green, Erica Parker, Lore'a Auvil, Boris Capitanu, Ted Underwood, Peter Organisciak, and other memBers of the Illinois and Indiana HTRC teams Workshop Outline • Overview of the HathiTrust and the HathiTrust Research Center • How to use the HTRC Portal – Create/manage your own custom set of HathiTrust materials • How to use HTRC Portal Workset Builder – Textual analy<cs on a workset: • How to use HTRC Portal Algorithms, HTRC Bookworm, Data Capsule • Opportunies to connect you and your research with the HathiTrust Research Center Content DistriBuIon HathiTrust Collecon Builder HTRC Portal www.hathitrust.org/htrc Log in to the HTRC Portal, h'ps://htrc2.pI.indiana.edu Create a login id (i.e. username) How to create a workset Log In Again to Workset Builder Workset Builder (currently works only on non-copyrighted material not digiIzed By Google) Why worksets? • Two reasons: – Organizaonal: