Google Ngram Viewer

By Andrew Weiss
Digital Services Librarian
California State University, Northridge

Introduction: Google, mass digitization and the emerging science of culturomics

Google Books famously launched in 2004 with great fanfare and controversy. Two of the project's original stated goals were to replace the generic library card catalog with a new "virtual" one and to scan all the books ever published, which by one Google employee's reckoning amounts to approximately 129 million books (Google 2014; Jackson 2010). Now ten years in, the Google Books mass-digitization project claimed as of May 2014 to have scanned over 30 million books, a little less than 24% of that stated goal. Considering the enormity of the vision for this new type of massive digital library, the progress has been astounding.

However, controversies have also dominated the headlines. At the time of Google's 2004 announcement, Jean-Noël Jeanneney, then head of the Bibliothèque nationale de France, famously called out Google's strategy as excessively Anglo-American and dangerously corporate-centric (Jeanneney 2007). Entrusting the digitization and preservation of a diverse worldwide corpus of books to a corporation, Jeanneney argues, is a dangerous game, especially when a bottom-line economic figure is ascribed to cultural works of inestimable personal, national and international value. In the ten years since his objections were raised, many of his comments remain prescient and relevant to the problems and issues the project currently faces. Google Books still faces controversies related to copyright, as seen in the current appeal of the Authors Guild v. Google decision; poor metadata, as pointed out in several studies including Nunberg (2009a, 2009b) and James and Weiss (2012); poor scan quality, as described by James (2010); the destabilization of current markets, as pointed out by the Authors Guild (2013) and others such as Jaron Lanier (2013); and, finally, the general problem of representative diversity in the digitized corpus, as discussed by Weiss and James (2013).

Despite these major obstacles and problems, the Google Books project has continued without cessation even as other well-funded efforts, such as the Microsoft-Yahoo book digitization project, have disappeared (Arrington 2008). Google Books' longevity is partly a result of the passion of Google's leaders and the deep resources they command. The project has also endured within Google itself, which is no small feat: many well-known and beloved projects and services have been discontinued by the company over the seventeen years of its existence, a shortlist of which includes iGoogle, Google Answers, Google Wave, and Google Reader. But Google Books has staying power. A major reason for this longevity and potential long-term influence may have less to do with individual access to particular titles within its collections and more to do with the overall datasets, especially the metadata and indexes, generated by the project. These openly available datasets are being used by researchers for a new kind of research known as culturomics. This chapter explains how libraries and librarians can make use of the research procedures established by culturomics and the tool developed for it, the Google Ngram Viewer, to aid their services to patrons.
Culturomics – a brief description

Jean-Baptiste Michel and colleagues have been mining the publicly available datasets from Google Books to "track the use of words over time, compare related words and even graph them" (Michel et al. 2011). Their research examines the frequency with which words appear over time. The researchers examined 5,195,769 books, roughly 4 percent of all printed books, and measured how often terms appear across that span (ibid.).

Some caveats apply to this method, however. Because the sample is small, it may not capture variations and fluctuations in the overall corpus. Furthermore, it is often unclear, and Google has not fully disclosed, just how many texts are available online; although Google offers 30 million as a figure, this is not verifiable. Additionally, the Google Books corpus does not include newsprint, maps, or other non-monographic materials, which account for a sizable part of library collections. The print monograph record can surely be used as an aggregate mirror with which to peek into the culture as a whole, but one must not forget that it is still one slice of a larger mass, some of which (physical experience, unrecorded experience, the spoken word, so-called deviant and/or censored material, illegal "black market" material) may never be captured to a satisfactory degree in digital formats. Finally, works subjected to optical character recognition (OCR) software can be rendered illegible if the fonts are not standard (a real problem for non-Roman scripts such as Japanese) or if conditions such as faded ink, poor paper quality, and marks or blemishes reduce the quality of the digital image.

Despite these limitations, the result of Michel and colleagues' study has been increased interest in the digital humanities and in "culturomics," which the researchers define as an approach that relies on "massive, multi-institutional and multi-national consortia[,] ... novel technologies enabling the assembly of vast datasets ... and ... the deployment of sophisticated computational and quantitative methods" (ibid.).

The Google Books Ngram Viewer

Developed out of the Michel et al. experiments, the Google Ngram Viewer represents a new phase in the use of print materials and demonstrates the added value that a digitized corpus can provide to researchers, particularly through its search and data-mining capabilities. Historians, writers, artists, social scientists, and librarians will all find the tool useful. The Ngram Viewer corpus was created by aggregating 500 billion words from monograph/book materials (excluding serials) found in the Google Books collection.

The Ngram Viewer treats terms and concepts as "grams." A single uninterrupted word, such as "banana," is a 1-gram; "stock market" is a 2-gram; "The United States of America" is a 5-gram. More generally, a sequence of any number of 1-grams is called an n-gram (see the sketch below). According to Michel et al., books scanned during the Google Books digitization project were selected for the quality of their OCR (optical character recognition) and then indexed. The corpus breaks down by language as follows: English, 361 billion words; French, 45 billion; Spanish, 45 billion; German, 37 billion; Russian, 35 billion; Chinese, 13 billion; Hebrew, 2 billion. As Figure 1 shows, English accounts for by far the largest share of n-grams in the sample, which suggests that searches are somewhat compromised in terms of diversity and representation of different cultures and languages (Michel et al. 2011).

Figure 1: English represents the largest number of n-grams in the Google Ngram Viewer corpus; French, Spanish, German and Russian are the next four most represented languages.
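To make the n-gram concept concrete, the short Python sketch below is offered as a hypothetical illustration only; it is not code from the Ngram Viewer itself. It extracts the 1-grams, 2-grams, and 5-grams from a sample sentence.

    # Hypothetical sketch illustrating the n-gram concept described above;
    # this is not code from the Google Ngram Viewer itself.

    def ngrams(text, n):
        """Return all n-grams (tuples of n consecutive words) found in text."""
        words = text.split()
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    sentence = "The United States of America declared independence"
    print(ngrams(sentence, 1))  # 1-grams: single words such as ('The',)
    print(ngrams(sentence, 2))  # 2-grams: word pairs such as ('United', 'States')
    print(ngrams(sentence, 5))  # 5-grams: e.g. ('The', 'United', 'States', 'of', 'America')

Counting how often each such gram occurs in every book published in a given year is, in essence, how the Ngram Viewer's frequency charts are built.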
Using the Ngram Viewer: Japanese authors and the "Nobel bump" in notoriety

This author conducted several searches using the Ngram Viewer (https://books.google.com/ngrams) to demonstrate both the power of the tool and the impact it can have on libraries in the study of current trends in digital librarianship and scholarship, including the digital humanities. In the first example, shown in Figure 2, a search was conducted using the names of three of Japan's most well-known authors: Yasunari Kawabata, Kenzaburo Oe, and Haruki Murakami. The results show the frequency with which these names appeared in English, both British and American, over the roughly eighty-year period between 1930 and 2008.

Figure 2: Google Ngram Viewer showing the frequency of n-grams for three well-known Japanese authors whose careers have been influenced by the Nobel Prize for Literature.

For the first sixty years, Kawabata is the most mentioned of the three. From 1930 to 1970 he was the most well-established author; Oe's first publication did not appear until 1957, and Murakami did not begin writing until 1978. Interestingly, there are large spikes in mentions of these authors corresponding with their association with the Nobel Prize for Literature. Kawabata was awarded the prize in 1968, and from 1968 to 1992 he remained the most frequently mentioned author. His overall peak occurred in 1974, six years after the award. The likely reason for this lag is the time it takes for new English translations of his works to appear, as well as the time needed to disseminate critical responses to those works. One could surmise that the rising frequency of mentions of Kawabata partly reflects people writing about him, mulling over his new status as Nobel laureate, the resulting new translations, and finally the scholarly and popular discussions that follow. Fame begets more fame.

A similar spike occurs for Oe, who was awarded the prize in 1994. At that point he surpasses Kawabata in frequency of mentions and remains ahead until the early 2000s. His peak occurred in 1998, four years after the award, and is again likely the result of the increased number of translations of his work being published. The international prize and the attendant reviews, discussions, and scholarship appear to provide the "bump" in popularity. Around 2003, Murakami begins to appear in the discussion as a contender for the Nobel Prize as well. His popularity has increased as a result of this "Nobel bump," and the longer his name is bandied about in association with the Nobel, even though he has yet to win it, the more it propels his notoriety.
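For readers who want to reproduce this kind of analysis outside the web interface, the following Python sketch shows the core calculation behind a chart like Figure 2: dividing an n-gram's yearly match count by the total number of grams in the corpus for that year, then locating the peak. All counts here are invented placeholders used purely for illustration; real values would have to come from Google's downloadable Ngram datasets.

    # Hypothetical sketch of the calculation behind an Ngram Viewer chart:
    # relative frequency of an n-gram per year, and the year of its peak.
    # All counts below are invented placeholders, not real Ngram Viewer data.

    yearly_matches = {   # occurrences of the 2-gram "Yasunari Kawabata" (invented)
        1966: 120, 1968: 900, 1970: 1500, 1974: 2400, 1980: 1800,
    }
    yearly_totals = {    # total 2-grams in the English corpus that year (invented)
        1966: 4.0e9, 1968: 4.2e9, 1970: 4.5e9, 1974: 4.8e9, 1980: 5.5e9,
    }

    # The Ngram Viewer plots matches divided by the total grams for each year.
    frequency = {year: yearly_matches[year] / yearly_totals[year]
                 for year in yearly_matches}

    peak_year = max(frequency, key=frequency.get)
    print("Peak year:", peak_year, "relative frequency:", frequency[peak_year])

Normalizing by each year's total corpus size is what allows an author's rising "fame" to be compared across decades in which very different numbers of books were published.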