Charting the Geosciences with Ngram Viewer

Danita S. Brandt, Department of Earth and Environmental Sciences, Michigan State University, East Lansing, Michigan 48824, USA, [email protected]

INTRODUCTION

Frequency of mention in books can be used to trace the evolution of a discipline, from the first recorded use of the word or phrase to its current standing, as measured by the number of books that include the phrase. Ngram Viewer, the tool developed by a team at Google Books (Michel et al., 2011), places a database (“corpus”) of >500 billion words at the disposal of its users (http://books.google.com/ngrams). Here I describe how this tool can be used to examine patterns suggested by qualitative ideas about the intellectual development of the geosciences. An example of the Ngram Viewer output is given in Figure 1.

N-GRAMS

An N-gram is a contiguous string of n items from a given sequence of text or speech. A 1-gram (also known as a unigram) is a string of characters uninterrupted by a space, e.g., “trilobite” or “3.14159.” A longer N-gram is a sequence of 1-grams, e.g., “trilobite extinction” (2-gram or bigram) and “Michigan State University” (3-gram or trigram). N-grams are used by computer scientists and computational linguists for natural language processing (Jurafsky and Martin, 2014).

Google Books, a service of search-engine giant Google Inc., has amassed a database of more than 25 million scanned books. From this resource, a subset of over five million books, chosen for the quality of their optical scan and metadata (e.g., date of publication), comprises the corpus of Google Ngram Viewer. Currently, Ngram Viewer is restricted to a maximum word string length of n = 5 (five-grams), and counts only N-grams that occur at least 40 times in the corpus. The data consist of books published from the 1500s to 2000, and include children’s literature, trade, and other books but no journal articles. The full data set is available at www..org and www.ngrams.googlelabs.com.

CAVEATS TO USING THE CORPUS

Problems with the unfiltered use of the corpus are well-documented, including errors introduced during optical scanning and entering metadata (Nunberg, 2009). Pechenick et al. (2015) described limits to inferring cultural and linguistic evolution from the Google N-gram corpus, including the problem of the burgeoning number of scientific texts since 1990, which skews the results toward academic usage of N-grams and is therefore less reflective of cultural context. However, if the user’s purpose is to trace the history of a scientific discipline rather than a cultural phenomenon, as the purpose is here, the bias Pechenick et al. (2015) described skews in a constructive direction. Because the database consists of books only, rather than journal articles, N-gram results might lag the intellectual development of a discipline.

APPLICATION TO THE GEOLOGICAL SCIENCES

Ngram Viewer is useful for suggesting testable hypotheses by identifying correlations. Two important caveats to keep in mind when using Ngram Viewer are that, as in any analysis, correlation does not necessarily indicate causation, and that, as with any online resource (Wikipedia, for example), Ngram Viewer provides a starting point to stimulate further investigation, not an end in itself. Here, in approximate chronological order, are three examples of Ngram Viewer searches drawn from geological topics chosen to illustrate the potential and the limitations of these data. Search terms and phrases (the N-grams) are enclosed in quotes.

The frequency of the unigram “geology” shows an increase at 1830, coincident with publication of the first volume of Charles Lyell’s Principles of Geology. Volume one was followed by volumes two and three in 1832 and 1833, respectively. The N-gram frequency chart supports the hypothesis that Lyell’s books contributed to an increase in the frequency of the unigram “geology”; the conclusion that Lyell’s work had a major impact on the growth of geology is supported independently by historians of our discipline (Rudwick, 2010).

N-gram frequency of “micropaleontology” reached a maximum in the early 1950s, coincident with that decade’s “petroleum” boom, and reflects the well-documented connection between microbiostratigraphy and petroleum exploration (Haq and Boersma, 1998). However, not all possible correlations are easily tested using Ngram Viewer. An attempt to chart the N-grams “micropaleontology” and “petroleum” on the same graph returned a display in which the line tracing the frequency of “micropaleontology” was indistinguishable from the x-axis; the frequency of the N-gram “petroleum” swamped “micropaleontology.” The corpus is also sensitive to N-gram size and word order: the trigram “extinction of trilobites” returned results, whereas a query for “trilobite extinction” returned no N-grams. Although Ngram Viewer does not allow for easy comparison of N-grams with wildly different occurrence rates, this obstacle can be overcome by downloading and replotting the Ngram Viewer data using programs such as R.

Cause-and-effect is suggested by the graph of “geosynclines” and “plate tectonics” (Fig. 1). The graph traces the displacement of the older “geosynclines” paradigm for explaining crustal tectonics by the emergence of “plate tectonics.” The dramatic shift from “geosynclines” to “plate tectonics” occurred in the mid-1970s, as plate tectonic theory supplanted the pre-tectonic explanation of crustal dynamics and made its way into textbooks. The apparent causal connection between the rise of plate tectonics and the fall of geosynclines can be examined more closely by accessing the corpus on which the search is based. In addition to the chart (Fig. 1), Ngram Viewer searches return links to the corpus on which the search is based, binned by year of publication. Clicking on these bins opens a page with links to each publication included in the corpus. The diligent researcher can then sort through the titles and assess the quality of the data on which the Ngram Viewer chart is based.

Figure 1. Screenshot of Ngram Viewer chart showing the frequency in the Google Books corpus of the N-grams “geosyncline” and “plate tectonics,” from 1900 to 2000. Y-axis is frequency of the N-gram in the corpus.

OTHER USES FOR N-GRAMS IN THE GEOSCIENCES

Charting word frequency trends can contribute to identifying directions for research or investment of resources. In the U.S., a number of Departments of “Geology” became Departments of “Geological Sciences” in the late 1970s and early 1980s (including the department at Michigan State University), mirroring the increase in frequency of the bigram “geological sciences.” In 2016, MSU’s department changed its name again, to “Earth and Environmental Sciences,” reflecting the increase in frequency of the “Environmental Sciences” bigram, which started in 1990. N-gram frequencies for other geologic disciplines also chart what might be interpreted as evolving priorities, especially in the textbook-rich academic environment: References to “evolutionary biology” now approach those of “paleontology.” As frequency of the bigram “evolutionary biology” increased, through the mid-1970s, the Paleontological Society debuted its new journal, Paleobiology.

The decisions to change department names, revise course descriptions, and initiate new journals described here were made before there was a Google Books corpus, but these decisions were undoubtedly affected by trends in metrics, like student enrollment and funding priorities, which are now indirectly reflected in that database.

SUMMARY

The output of Google’s “shiny new toy for nerds” (Zhang, 2015), Ngram Viewer, is not sufficient to support hypotheses of causality suggested by the correlations it generates, but its accessibility and ease of use can serve an important function in introducing scholars to the possibilities of digital research (Cohen, 2010). The frequency of N-grams through time maps where we have been, and, mindful of the adage, “those who cannot remember the past are condemned to repeat it,” history ought not be ignored in identifying trends in support of education, policy, planning, and funding objectives of our discipline.

ACKNOWLEDGMENTS

A.M. Velbel introduced me to Ngram Viewer and was instrumental in the evolution of this manuscript. Three reviewers contributed to a more focused and improved final version.

REFERENCES CITED

Cohen, D., 2010, Initial thoughts on the Google Books N-gram Viewer and datasets, http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-N-gram-viewer-and-datasets/ (last accessed 10 May 2017).

Haq, B.U., and Boersma, A., eds., 1998, Introduction to marine micropaleontology (2nd edition): Amsterdam, Elsevier, 376 p.

Jurafsky, D., and Martin, J.H., 2014, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd edition): New York, Prentice Hall, 1024 p.

Lyell, C., 1830, Principles of geology, being an attempt to explain the former changes of the Earth’s surface, by reference to causes now in operation: London, John Murray, volume 1.

Lyell, C., 1832, Principles of geology, being an attempt to explain the former changes of the Earth’s surface, by reference to causes now in operation: London, John Murray, volume 2.

Lyell, C., 1833, Principles of geology, being an attempt to explain the former changes of the Earth’s surface, by reference to causes now in operation: London, John Murray, volume 3.

Michel, J.B., Shen, Y.K., Presser Aiden, A., Veres, A., Gray, M.K., Brockman, W., The Google Books Team, Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., and Lieberman Aiden, E., 2011, Quantitative analysis of culture using millions of digitized books: Science, v. 331, p. 176–182, https://doi.org/10.1126/science.1199644.

Nunberg, G., 2009, Google’s book search: A disaster for scholars: The Chronicle of Higher Education, http://www.chronicle.com/article/ -Book-Search-A/48245/ (last accessed 10 May 2017).

Pechenick, E.A., Danforth, C.M., and Dodds, P.S., 2015, Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution: PLoS One, v. 10, no. 10, https://doi.org/10.1371/journal.pone.0137041.

Rudwick, M.J.S., 2010, Worlds before Adam: The reconstruction of geohistory in the age of reform: Chicago, University of Chicago Press, 648 p.

Zhang, S., 2015, The pitfalls of using Google N-gram to study language, https://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-N-gram/ (last accessed 10 May 2017).

Manuscript received 11 May 2017
Revised manuscript received 3 January 2018
Manuscript accepted 7 February 2018

GSA Today, v. 28, doi: 10.1130/GSATG348GW.1. Copyright 2018, The Geological Society of America. CC-BY-NC.
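
ILLUSTRATIVE SKETCH

The N-gram definition in the N-GRAMS section, and the remark that N-grams with very different occurrence rates can be compared by replotting downloaded data, can be illustrated with a short sketch. It is written in Python rather than R, and it is not the method used by Ngram Viewer or in this article. The function names, the whitespace tokenization, the lowercasing, and the toy two-year "corpus" are all assumptions made for illustration; no real Google Books data are used.

```python
from collections import Counter


def extract_ngrams(text, n):
    """Return the contiguous n-grams in text as space-joined strings.

    Assumption: tokens are split on whitespace and lowercased, a
    simplification of whatever tokenization Ngram Viewer applies.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(texts_by_year, phrase):
    """For each year, return matches of phrase divided by all n-grams of that size."""
    n = len(phrase.split())
    series = {}
    for year, text in sorted(texts_by_year.items()):
        counts = Counter(extract_ngrams(text, n))
        total = sum(counts.values())
        series[year] = counts[phrase.lower()] / total if total else 0.0
    return series


def scale_to_peak(series):
    """Divide every value by the series maximum so each curve peaks at 1.0."""
    peak = max(series.values(), default=0.0)
    if peak == 0:
        return dict(series)
    return {year: value / peak for year, value in series.items()}


# Hypothetical toy corpus: one invented "text" per year, not real data.
toy_corpus = {
    1950: " ".join(["petroleum"] * 8 + ["micropaleontology"] * 2),
    1970: " ".join(["petroleum"] * 9 + ["micropaleontology"] * 1),
}

# Word order matters: the bigram "trilobite extinction" does not occur
# in a text that only contains "extinction of trilobites".
bigrams = extract_ngrams("the extinction of trilobites", 2)
print("trilobite extinction" in bigrams)

# Scaling each series to its own peak puts a rare term and a common
# term on the same 0-to-1 axis for replotting.
for term in ("petroleum", "micropaleontology"):
    print(term, scale_to_peak(relative_frequency(toy_corpus, term)))
```

Scaling each series to its own peak discards the absolute frequencies, so the rescaled curves show only the timing of rises and falls, not which term is more common. A logarithmic y-axis would be an alternative that keeps the absolute values.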