Introduction to HathiTrust https://www.hathitrust.org
Overview
HathiTrust is a large-scale collaborative repository of digital content from over 100 research libraries, the Google Books project, and the Internet Archive.
Hathi, pronounced "hah-tee," is the Hindi word for "elephant," an animal famed for its long-term memory. (source: Wikipedia)
The collection contains over 17 million digitized items, over 6 million of which are available as full-text PDF downloads. There are many items relevant to Hawai‘i.
HathiTrust also provides a number of discovery and access services, notably, full-text search across the entire repository. These tools are not covered in this handout.
Make sure to Log In as a UH Affiliate
Search Full Text or the Catalog Entry
Limit Search to Full View
Advanced Search in Full Text
Download PDFs
Use Collections
If you are using Zotero:
If you are looking at a particular record or PDF, the icon should look like a book, and you can download that particular reference.
If you are looking at a list of references, the icon will look like a folder, and you can download multiple references at once.
Fall, 2019 1
Finding Items in HathiTrust
Key options when searching include:
Full text or the catalog (the bibliographic record)
o Full text will give you more results, but make sure you use the terms that were in use when the book was published; e.g., Diamond Head used to be called Diamond Hill.
o The text was created with Optical Character Recognition (OCR) and may not be "clean" text.
Full view only or limited (search only)
o Full view allows direct access to the document as either a PDF or text.
o Searching limited-access texts still allows you to determine whether an item is relevant; access would have to be found another way (such as finding the physical book).
o Limited-access texts can also be analyzed using HathiTrust's data tools.
You can view the full text (if available) and the catalog record.
Filters
Along the left column, you can limit results by:
Subject, Author, Language, Place of Publication, Date, Original Format, Original Location (of the item)
Search Tips
Try one of the advanced searches so you can specify your search criteria.
Also try searching for phrases by putting quotation marks around the words, such as "Diamond Head".
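The difference between a keyword search and a quoted phrase search can be illustrated with a small sketch. This is not how HathiTrust implements its search; the function names and sample sentences below are made up for illustration only.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def matches_all_words(query, text):
    """Keyword search: every query word appears somewhere in the text."""
    words = set(tokenize(text))
    return all(w in words for w in tokenize(query))

def matches_phrase(query, text):
    """Phrase search: the query words appear next to each other, in order."""
    q = tokenize(query)
    t = tokenize(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))

# Invented sample sentences standing in for pages of digitized text.
pages = [
    "We rode around Diamond Head before sunset.",
    "A diamond ring was lost at the head of the valley.",
    "The signal station on Diamond Hill was visible from the harbor.",
]

for page in pages:
    print(matches_all_words("Diamond Head", page),
          matches_phrase("Diamond Head", page),
          "|", page)
```

The second sentence matches the keyword search but not the phrase search, which is why quotation marks cut down on irrelevant results. The third sentence matches neither, which is why it can help to also search for the older name, "Diamond Hill".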
Adding to a Collection (see "Working with Collections" below for more on collections):
To add a book to one of your collections:
Click on "Select Collection" to choose your target collection.
Click on the check box to the left of the entry (or check the "Select all on page" box).
Click on the "Add Selected" button in the top right.
Working with PDFs in HathiTrust
Downloading PDFs
To download a PDF from HathiTrust, you must be logged in through the University of Hawai‘i.
To download an item (book, article, etc.):
Find the item.
Click on the portion of the book you want to download:
o this page
o a selection of pages
o the entire book
Hathi will then construct your PDF document and download it to your downloads folder.
Fixing PDFs
After you have downloaded the PDF, you may want to delete some of the pages, such as those at the front or back of a book or blank pages in the middle.
Many programs and websites can edit PDFs, for example to delete pages or to combine pages from several files. Programs include:
Adobe Acrobat (full license only; Acrobat Reader will not work)
Preview (Macintosh only)
You can also search online for a PDF editor website.
These sites will typically allow you to upload a PDF, perform basic editing on the document, and then allow you to download the new PDF.
Working with PDFs
Once downloaded, the PDF is a file on your computer that you can open with other programs.
Your PDF reader (Preview, Acrobat Reader) will allow you to search the text and copy text (copied text may require further cleaning).
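As a rough illustration of what "further cleaning" can involve, the sketch below applies a few conservative fixes to text copied from an OCR'd PDF. The function name and the sample text are made up for this example, and real OCR output usually needs more work than this.

```python
import re

def clean_ocr_text(text):
    """Apply a few conservative cleanups to text copied from an OCR'd PDF."""
    # Rejoin words that were hyphenated across a line break: "vol-\ncano" -> "volcano".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Turn the remaining single line breaks into spaces, but keep blank lines
    # (paragraph breaks) as they are.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

# Invented sample of copied text with typical line-break problems.
sample = (
    "The crater is an extinct vol-\ncano  near\nHonolulu."
    "\n\nIt was once called Diamond Hill."
)
print(clean_ocr_text(sample))
```

Note that the first rule also joins genuinely hyphenated compounds that happen to break across lines, so it is worth spot-checking the result.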
Working with Collections
Collections are groups of texts stored on HathiTrust that you can refer to and share with others. You can also use collections that other people have shared.
Adding an item to one of your collections is described under "Adding to a Collection" above.
Collections are accessed via the "Collections" and "My Collections" tabs.
Create a new collection
Search for collections
Access a collection by clicking on the collection's title
Limit the list to your own collections
Collections form the basis for the textual analysis tools provided by HathiTrust.
Options for Your Collections
You can make your collection public or private.
You can copy the URL of a public collection and share it with others.
Hathi Analytics
The textual analytic tools connected to Hathi are available at https://analytics.hathitrust.org
You must sign up for an account on the analytics website to use the tools. There are very precise requirements for usernames and passwords.
Creating and Validating a Workset
Worksets define the set of texts to be analyzed by a given tool. Typically these worksets begin as a collection on the HathiTrust Digital Library. To begin creating a workset, click on “worksets,” and then click on the “create a workset” tab.
There are two ways to create a workset. If your workset is based on a public collection on the HathiTrust Digital Library, you can use the shareable collection link to import the collection as a workset. If it is based on a private collection, you can download the collection’s metadata as a .TSV (tab-delimited text) file and upload this file to create a workset.
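Before uploading a metadata file, it can be useful to check what it contains. The sketch below reads a tab-delimited file with Python's standard csv module and lists the volume identifiers. The column names ("htid", "title") and the sample rows are assumptions made for this example; check the header row of your own download.

```python
import csv
import io

# Made-up sample standing in for a downloaded collection metadata file.
# The column names ("htid", "title") are assumptions for this sketch.
sample_tsv = (
    "htid\ttitle\n"
    "example.0001\tA Hypothetical History of Oahu\n"
    "example.0002\tAn Invented Guide to Honolulu\n"
)

def read_volume_ids(tsv_file, id_column="htid"):
    """Return the non-empty values of the identifier column, in file order."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row[id_column] for row in reader if row.get(id_column)]

volume_ids = read_volume_ids(io.StringIO(sample_tsv))
print(len(volume_ids), "volumes:", volume_ids)
```

To read a real file instead of the sample, open it with `open("collection.tsv", newline="", encoding="utf-8")` (the file name is hypothetical) and pass the file object to `read_volume_ids`.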
Validating a workset is not necessary to use Hathi’s analytical tools; however, it can be worthwhile, especially when working with older worksets. When a workset is validated, Hathi checks that all the items contained within the workset are still held within the HathiTrust repository. To validate a workset, simply click on the “validate a workset” tab and select the workset you would like to validate.
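Conceptually, validation is a membership check: each identifier in the workset is compared against what the repository currently holds. The sketch below shows that idea only; the function name and identifiers are invented, and the real check is performed by the analytics website.

```python
def validate_workset(workset_ids, repository_ids):
    """Split a workset into IDs still present in the repository and IDs that are missing."""
    available = set(repository_ids)
    present = [v for v in workset_ids if v in available]
    missing = [v for v in workset_ids if v not in available]
    return present, missing

# Invented identifiers for illustration.
workset = ["example.0001", "example.0002", "example.0003"]
repository = {"example.0001", "example.0003", "example.0004"}

present, missing = validate_workset(workset, repository)
print("still held:", present)
print("no longer held:", missing)
```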
Algorithms
The HathiTrust Research Center has developed several general-purpose algorithms to analyze worksets.
InPhO Topic Model Explorer
o Creates a visualization of computationally derived “topics” within a workset.
Named Entity Recognizer
o Creates a list of all persons, places, organizations, and dates within the workset, along with the item and page number in which these named entities appear.
Token Count and Tag Cloud Creator
o Measures word frequency throughout a workset and creates a word cloud from the most frequently used words.
In addition to these algorithms, there is also an “extracted features download helper” tool, which can be used to download an extracted features dataset for use with other, user-created algorithms.
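To give a sense of what a token count involves, here is a minimal sketch that counts word frequency across a few texts. It is not the HathiTrust Research Center's implementation; the stopword list, function name, and sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny, made-up stopword list; real tools use much longer ones.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "on"}

def token_counts(texts, stopwords=STOPWORDS):
    """Count how often each non-stopword token appears across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts

# Invented sample standing in for the texts of a workset.
workset_texts = [
    "The harbor of Honolulu was crowded, and the harbor lights were lit.",
    "Ships in the harbor waited on the tide.",
]

counts = token_counts(workset_texts)
for word, n in counts.most_common(3):
    print(word, n)
```

A tag cloud is a visual version of the same counts, with more frequent words drawn larger.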
Data Capsule
Data Capsules are controlled virtual machines that allow users to create and run custom algorithms against volumes in HathiTrust in a secure setting. By default, data capsules only have access to out-of-copyright materials; however, users can request indirect access to copyrighted materials, up to the entire HathiTrust corpus.
Data Sets
Data sets include page-level extracted features of all 14.7 million volumes of the HathiTrust, and extracted features of all volumes of English-language literature published between 1700 and 1920. The HathiTrust has only just begun directly providing datasets, and the list of datasets available will likely grow.
Hathi+Bookworm
The Bookworm tool, found under the explore tab, allows users to create visualizations of word use over time as shown in the HathiTrust repository. Users can see frequency within the entire repository or within selected sub-classes of deposited items (e.g. law, medicine, literature).
Books about Music are more likely to be Groovy
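The idea behind a word-use-over-time chart can be sketched as a relative frequency per year: how often a word occurs divided by the total number of words published that year. The code below is an illustration only, not Bookworm's implementation; the function name and sample data are invented.

```python
import re
from collections import defaultdict

def relative_frequency_by_year(documents, word):
    """Return {year: occurrences of word / total words} for (year, text) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in documents:
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    # Skip years with no words at all to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Invented sample data: (publication year, text).
documents = [
    (1890, "the band played a waltz"),
    (1970, "a groovy band played a groovy tune"),
    (1970, "the tune was long"),
]

freq = relative_frequency_by_year(documents, "groovy")
for year, value in freq.items():
    print(year, round(value, 3))
```

Dividing by the total word count matters because the number of digitized volumes differs greatly from year to year; raw counts alone would mostly reflect how much was published.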