Google Ngram Viewer by Andrew Weiss

Total Page:16

File Type:pdf, Size:1020Kb

Google Ngram Viewer by Andrew Weiss Google Ngram Viewer By Andrew Weiss Digital Services Librarian California State University, Northridge Introduction: Google, mass digitization and the emerging science of culturonomics Google Books famously launched in 2004 with great fanfare and controversy. Two stated original goals for this ambitious digitization project were to replace the generic library card catalog with their new “virtual” one and to scan all the books that have ever been published, which by a Google employee’s reckoning is approximately 129 million books. (Google 2014) (Jackson 2010) Now ten years in, the Google Books mass-digitization project as of May 2014 has claimed to have scanned over 30 million books, a little less than 24% of their overt goal. Considering the enormity of their vision for creating this new type of massive digital library their progress has been astounding. However, controversies have also dominated the headlines. At the time of the Google announcement in 2004 Jean Noel Jeanneney, head of Bibliothèque nationale de France at the time, famously called out Google’s strategy as excessively Anglo-American and dangerously corporate-centric. (Jeanneney 2007) Entrusting a corporation to the digitization and preservation of a diverse world-wide corpus of books, Jeanneney argues, is a dangerous game, especially when ascribing a bottom-line economic equivalent figure to cultural works of inestimable personal, national and international value. Over the ten years since his objections were raised, many of his comments remain prescient and relevant to the problems and issues the project currently faces. Google Books still faces controversies related to copyright, as seen in the current appeal of the Author’s Guild v. Google Books decision; poor metadata, as pointed out in several studies including Nunberg (2009a) (2009b), James and Weiss (2012), and others; poor scanning as described by James in 2010; the destabilization of current markets, as pointed out by the Author’s guild (2013) and others such as Jaron Lanier (2013); and, finally, the general problem of representative diversity in their digitized corpus, as mentioned by Weiss and James (2013). Despite these major obstacles and problems, the Google Books project has continued without cessation even as other well-funded projects such as the Microsoft-Yahoo book digitization project, for example, have disappeared. (Arrington 2008) Google Books’ longevity is partly a result of the passion of Google’s leaders as well as the deep resources that spring from them. The project has even lasted within Google. This is no small feat as many well-known and beloved projects and services have been discontinued by the company over the seventeen years it has existed. The shortlist of discontinued services includes iGoogle, Google Answers, Google Wave, and Google Reader. But Google Books has staying power. A major reason for this longevity and potential long-term influence may have less to do with individual access to particular titles within its collections and more to do with the overall datasets – especially metadata and indexes – generated by the project. These openly available data sets are being used by researchers for a new kind of research named culturonomics. This chapter will explain how libraries and librarians can make use of the research procedures established by culturonomics and the tool developed for it, the Google Ngram Viewer, to aid their services to patrons. Culturonomics – a brief description Jean-Baptiste Michel and colleagues have been mining the publicly available data sets from Google Books to “track the use of words over time, compare related words and even graph them.” (2011). Their research involves examining the frequency with which words appear over time. The researchers examined 5,195,769 books, roughly 4 percent of all printed books, and measured the frequency of appearance of terms across variable times. (Ibid.) Some caveats with this method exist, however. As a result of the small sample, variations and fluctuations in the overall corpus may exist. Furthermore, it is often unclear -- and Google hasn’t fully disclosed this -- just how many texts are available online. Although 30 million is offered as a figure by Google, this is not verifiable. Additionally, the Google Books corpus does not include newsprint, maps, or other non-monographic materials, which account for a sizable part of library collections. Nevertheless, the print monograph record can surely be used as an aggregate mirror to peek into the culture as a whole, but one must not forget that it is still one slice of a larger mass, some of which—such as physical experience, unrecorded experience, spoken word, so-called deviant and/or censored material, illegal “black market” material—may never be captured to a satisfactory degree in digital formats. Additionally, works subjected to optical character recognition (OCR) software can be rendered illegible if the fonts are not standard (a true problem for scripts of non-Roman letter such as Japanese) or if other conditions such as faded ink, poor paper quality, and marks or blemishes, reduce the digital image quality. Despite some of these limitations, the result of Michel and colleagues’ study has been increased interest in digital humanities and “culturomics,” which is defined by them researches as an approach that relies on “massive, multi-institutional and multi-national consortia[,] . novel technologies enabling the assembly of vast datasets . and . the deployment of sophisticated computational and quantitative methods.” (Ibid.) The Google Books N-Gram Viewer Developed out of the Michel et al. experiments, the Google Ngram Viewer represents a new phase in the use of print materials and demonstrates the power of the added value that a digitized corpus can provide to researchers. Users are especially benefited by its search and data-mining capabilities. Historians, writers, artists, social scientists, and librarians will find the tool useful. The Google Books Ngram viewer corpus was created by aggregating 500 billion words from various monograph/book materials (excluding serials) found in the Google Books collection. The N-gram viewer defines terms and concepts as ‘grams’. A single uninterrupted word, such as ‘banana’ would be considered a 1-gram; ‘stock market’ would be a 2-gram; ‘The United States of America’ would be a 5- gram. Overall, conceptually speaking, a sequence of any number of 1-grams would be called an n-gram. According to Michel, books scanned during the Google Books digitization project were selected for their quality of OCR (optical character recognition) and then indexed. The corpus breaks down by language as follows: English: 361 billion; French: 45 billion; Spanish: 45 billion; German: 37 billion; Russian: 35 billion; Chinese: 13 billion; Hebrew: 2 billion. As shown in Figure 1, by far the largest amount of n-grams represented in their sample is English. This suggests that the search would be somewhat compromised in terms of diversity and representation of different cultures and languages. (Michel et al. 2011) Figure 1: English represents the largest number of n-grams in the Google Ngram Viewer program. French, Spanish, German and Russian are the next four most represented languages. Using the Ngram Viewer: Japanese authors & the “Nobel bump” in notoriety This author conducted several searches using the Ngram viewer (url: https://books.google.com/ngrams) to demonstrate some of the power as well as the impact on libraries the tool can have in the study of current trends in digital librarianship and scholarship, including the digital humanities. In the first example, shown in Figure 2, a search using the names of three of Japan’s most well-known authors, Yasunari Kawabata, Kenzaburo Oe, and Haruki Murakami was conducted. The search results show the frequency with which the words appeared in the English language, both British and American, during an approximately eighty year period between A.D. 1930 and 2008. Figure 2: Google Ngram Viewer demonstrating frequency of n-grams of three well-known Japanese authors whose careers have been influenced by the Nobel Prize for Literature. For the first 60 years, Kawabata is the most mentioned author among the three. From 1930- 1970, Kawabata was the most well-established author; Oe’s first publication didn’t occur until 1957; Murakami did not start writing until 1978. Interestingly, there are large spikes in the mention of these authors corresponding with their association with the Nobel Prize for literature. Kawabata was awarded the Nobel Prize in 1968. From 1968 – 1992, Kawabata remained the most frequently mentioned author. His overall peak occurred in 1974, six years after his award. Likely the reason for this peak is the amount of time it takes to create new translations of his works to appear in English as well as the time it takes to disseminate critical responses to these works. One could surmise that the rising frequency of mentioning Kawabata is partly in response to people writing about him, mulling over his new status as Nobel Laureate, the new resulting translations, and finally the scholarly and popular discussions that appear. Fame begets more fame. A similar spike in frequency occurs for Oe as well. He was awarded the prize in 1994. At this point in time he surpasses Kawabata in the frequency of mentions until the early 2000s. His peak popularity occurred in 1998, 4 years after the award, and is likely, again, a result of the increased number of translations of his work being published. The international prize, the attendant reviews, discussions, and scholarship that arise from it appear to provide the “bump” in popularity. Around 2003, Murakami begins to appear in the discussion as a finalist for the Nobel Prize, too. His popularity has increased as a result of the “Nobel bump,” and the longer his name is bandied about and associated with Nobel -- even if he has yet to win it -- the more it propels his notoriety.
Recommended publications
  • Corpora: Google Ngram Viewer and the Corpus of Historical American English
    EuroAmerican Journal of Applied Linguistics and Languages E JournALL Volume 1, Issue 1, November 2014, pages 48 68 ISSN 2376 905X DOI - - www.e journall.org- http://dx.doi.org/10.21283/2376905X.1.4 - Exploring mega-corpora: Google Ngram Viewer and the Corpus of Historical American English ERIC FRIGINALa1, MARSHA WALKERb, JANET BETH RANDALLc aDepartment of Applied Linguistics and ESL, Georgia State University bLanguage Institute, Georgia Institute of Technology cAmerican Language Institute, New York University, Tokyo Received 10 December 2013; received in revised form 17 May 2014; accepted 8 August 2014 ABSTRACT EN The creation of internet-based mega-corpora such as the Corpus of Contemporary American English (COCA), the Corpus of Historical American English (COHA) (Davies, 2011a) and the Google Ngram Viewer (Cohen, 2010) signals a new phase in corpus-based research that provides both novice and expert researchers immediate access to a variety of online texts and time-coded data. This paper explores the applications of these corpora in the analysis of academic word lists, in particular, Coxhead’s (2000) Academic Word List (AWL). Coxhead (2011) has called for further research on the AWL with larger corpora, noting that learners’ use of academic vocabulary needs to address for the AWL to be useful in various contexts. Results show that words on the AWL are declining in overall frequency from 1990 to the present. Implications about the AWL and future directions in corpus-based research utilizing mega-corpora are discussed. Keywords: GOOGLE N-GRAM VIEWER, CORPUS OF HISTORICAL AMERICAN ENGLISH, MEGA-CORPORA, TREND STUDIES. ES La creación de megacorpus basados en Internet, tales como el Corpus of Contemporary American English (COCA), el Corpus of Historical American English (COHA) (Davies, 2011a) y el Visor de Ngramas de Google (Cohen, 2010), anuncian una nueva fase en la investigación basada en corpus, pues proporcionan, tanto a investigadores noveles como a expertos, un acceso inmediato a una gran diversidad de textos online y datos codificados con time-code.
    [Show full text]
  • BEYOND JEWISH IDENTITY Rethinking Concepts and Imagining Alternatives
    This book is subject to a CC-BY-NC license. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/ BEYOND JEWISH IDENTITY Rethinking Concepts and Imagining Alternatives This book is subject to a CC-BY-NC license. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/ This book is subject to a CC-BY-NC license. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/ BEYOND JEWISH IDENTITY rethinking concepts and imagining alternatives Edited by JON A. LEVISOHN and ARI Y. KELMAN BOSTON 2019 This book is subject to a CC-BY-NC license. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/ Library of Congress Control Number:2019943604 The research for this book and its publication were made possible by the generous support of the Jack, Joseph and Morton Mandel Center for Studies in Jewish Education, a partnership between Brandeis University and the Jack, Joseph and Morton Mandel Foundation of Cleveland, Ohio. © Academic Studies Press, 2019 ISBN 978-1-644691-16-8 (Hardcover) ISBN 978-1-644691-29-8 (Paperback) ISBN 978-1-644691-17-5 (Open Access PDF) Book design by Kryon Publishing Services (P) Ltd. www.kryonpublishing.com Cover design by Ivan Grave Published by Academic Studies Press 1577 Beacon Street Brookline, MA 02446, USA [email protected] www.academicstudiespress.com Effective May 26th 2020, this book is subject to a CC-BY-NC license. To view a copy of this license, visit https://creativecommons.org/licenses/ by-nc/4.0/.
    [Show full text]
  • CS15-319 / 15-619 Cloud Computing
    CS15-319 / 15-619 Cloud Computing Recitation 13 th th November 18 and 20 , 2014 Announcements • Encounter a general bug: – Post on Piazza • Encounter a grading bug: – Post Privately on Piazza • Don’t ask if my answer is correct • Don’t post code on Piazza • Search before posting • Post feedback on OLI Last Week’s Project Reflection • Provision your own Hadoop cluster • Write a MapReduce program to construct inverted lists for the Project Gutenberg data • Run your code from the master instance • Piazza Highlights – Different versions of Hadoop API: Both old and new should be fine as long as your program is consistent Module to Read • UNIT 5: Distributed Programming and Analytics Engines for the Cloud – Module 16: Introduction to Distributed Programming for the Cloud – Module 17: Distributed Analytics Engines for the Cloud: MapReduce – Module 18: Distributed Analytics Engines for the Cloud: Pregel – Module 19: Distributed Analytics Engines for the Cloud: GraphLab Project 4 • MapReduce – Hadoop MapReduce • Input Text Predictor: NGram Generation – NGram Generation • Input Text Predictor: Language Model and User Interface – Language Model Generation Input Text Predictor • Suggest words based on letters already typed n-gram • An n-gram is a phrase with n contiguous words Google-Ngram Viewer • The result seems logical: the singular “is” becomes the dominant verb after the American Civil War. Google-Ngram Viewer • “one nation under God” and “one nation indivisible.” • “under God” was signed into law by President Eisenhower in 1954. How to Construct an Input Text Predictor? 1. Given a language corpus – Project Gutenberg (2.5 GB) – English Language Wikipedia Articles (30 GB) 2.
    [Show full text]
  • Internet and Data
    Internet and Data Internet and Data Resources and Risks and Power Kenneth W. Regan CSE199, Fall 2017 Internet and Data Outline Week 1 of 2: Data and the Internet What is data exactly? How much is there? How is it growing? Where data resides|in reality and virtuality. The Cloud. The Farm. How data may be accessed. Importance of structure and markup. Structures that help algorithms \crunch" data. Formats and protocols for enabling access to data. Protocols for controlling access and changes to data. SQL: Select. Insert. Update. Delete. Create. Drop. Dangers to privacy. Dangers of crime. (Dis-)Advantages of online data. [Week 1 Activity: Trying some SQL queries.] Internet and Data What Exactly Is \Data"? Several different aspects and definitions: 1 The entire track record of (your) online activity. Note that any \real data" put online was part of online usage. Exception could be burning CD/DVDs and other hard media onto a server, but nowadays dwarfed by uploads. So this is the most inclusive and expansive definition. Certainly what your carrier means by \data"|if you re-upload a file, it counts twice. 2 Structured information for a particular context or purpose. What most people mean by \data." Data repositories often specify the context and form. Structure embodied in formats and access protocols. 3 In-between is what's commonly called \Unstructured Information" Puts the M in Data Mining. Hottest focus of consent, rights, and privacy issues. Internet and Data How Much Data Is There? That is, How Big Is the Internet? Searchable Web Deep Web (I maintain several gigabytes of deep-web textual data.
    [Show full text]
  • Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom
    Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Durand, Neva C., James T. Robinson, Muhammad S. Shamim, Ido Machol, Jill P. Mesirov, Eric S. Lander, and Erez Lieberman Aiden. “Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom.” Cell Systems 3, no. 1 (July 2016): 99-101. As Published http://dx.doi.org/10.1016/j.cels.2015.07.012 Publisher Elsevier Version Final published version Citable link http://hdl.handle.net/1721.1/105759 Terms of Use Creative Commons Attribution 4.0 International License Detailed Terms http://creativecommons.org/licenses/by/4.0/ Tool Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom Graphical Abstract Authors Neva C. Durand, James T. Robinson, Muhammad S. Shamim, Ido Machol, Jill P. Mesirov, Eric S. Lander, Erez Lieberman Aiden Correspondence [email protected] In Brief We introduce Juicebox, a tool for exploring contact maps generated using Hi-C and other 3D genome-sequencing technologies. Users can zoom in and out interactively, just as a user of Google Earth might zoom in and out of a geographic map. Highlights d Juicebox enables users to explore Hi-C contact maps d Users can create maps from their own experimental data d Users can zoom in and out of a contact map in real time d Maps can be compared to 1D tracks, 2D feature sets, or one another Durand et al., 2016, Cell Systems 3, 99–101 July 27, 2016 ª 2016 Elsevier Inc.
    [Show full text]
  • Introduction to Text Analysis: a Coursebook
    Table of Contents 1. Preface 1.1 2. Acknowledgements 1.2 3. Introduction 1.3 1. For Instructors 1.3.1 2. For Students 1.3.2 3. Schedule 1.3.3 4. Issues in Digital Text Analysis 1.4 1. Why Read with a Computer? 1.4.1 2. Google NGram Viewer 1.4.2 3. Exercises 1.4.3 5. Close Reading 1.5 1. Close Reading and Sources 1.5.1 2. Prism Part One 1.5.2 3. Exercises 1.5.3 6. Crowdsourcing 1.6 1. Crowdsourcing 1.6.1 2. Prism Part Two 1.6.2 3. Exercises 1.6.3 7. Digital Archives 1.7 1. Text Encoding Initiative 1.7.1 2. NINES and Digital Archives 1.7.2 3. Exercises 1.7.3 8. Data Cleaning 1.8 1. Problems with Data 1.8.1 2. Zotero 1.8.2 3. Exercises 1.8.3 9. Cyborg Readers 1.9 1. How Computers Read Texts 1.9.1 2. Voyant Part One 1.9.2 3. Exercises 1.9.3 10. Reading at Scale 1.10 1. Distant Reading 1.10.1 2. Voyant Part Two 1.10.2 3. Exercises 1.10.3 11. Topic Modeling 1.11 1. Bags of Words 1.11.1 2. Topic Modeling Case Study 1.11.2 3. Exercises 1.11.3 12. Classifiers 1.12 1. Supervised Classifiers 1.12.1 2. Classifying Texts 1.12.2 3. Exercises 1.12.3 13. Sentiment Analysis 1.13 1. Sentiment Analysis 1.13.1 2.
    [Show full text]
  • Arxiv:2007.16007V2 [Cs.CL] 17 Apr 2021 Beddings in Deep Networks Like the Transformer Can Improve Performance
    Exploring Swedish & English fastText Embeddings for NER with the Transformer Tosin P. Adewumi, Foteini Liwicki & Marcus Liwicki [email protected] EISLAB SRT Department Lule˚aUniversity of Technology Sweden Abstract In this paper, our main contributions are that embeddings from relatively smaller cor- pora can outperform ones from larger corpora and we make the new Swedish analogy test set publicly available. To achieve a good network performance in natural language pro- cessing (NLP) downstream tasks, several factors play important roles: dataset size, the right hyper-parameters, and well-trained embeddings. We show that, with the right set of hyper-parameters, good network performance can be reached even on smaller datasets. We evaluate the embeddings at both the intrinsic and extrinsic levels. The embeddings are deployed with the Transformer in named entity recognition (NER) task and significance tests conducted. This is done for both Swedish and English. We obtain better performance in both languages on the downstream task with smaller training data, compared to re- cently released, Common Crawl versions; and character n-grams appear useful for Swedish, a morphologically rich language. Keywords: Embeddings, Transformer, Analogy, Dataset, NER, Swedish 1. Introduction The embedding layer of neural networks may be initialized randomly or replaced with pre- trained vectors, which act as lookup tables. One of such pre-trained vector tools include fastText, introduced by Joulin et al. Joulin et al. (2016). The main advantages of fastText are speed and competitive performance to state-of-the-art (SotA). Using pre-trained em- arXiv:2007.16007v2 [cs.CL] 17 Apr 2021 beddings in deep networks like the Transformer can improve performance.
    [Show full text]
  • Improved Aedes Aegypti Mosquito Reference Genome Assembly Enables Biological Discovery and Vector Control
    bioRxiv preprint doi: https://doi.org/10.1101/240747; this version posted December 29, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Improved Aedes aegypti mosquito reference genome assembly enables biological discovery and vector control Benjamin J. Matthews1-3*, Olga Dudchenko4-7*, Sarah Kingan8*, Sergey Koren9, Igor Antoshechkin10, Jacob E. Crawford11, William J. Glassford12, Margaret Herre1,3, Seth N. Redmond13,14, Noah H. Rose15, Gareth D. Weedall16,17, Yang Wu18,19, Sanjit S. Batra4-6, Carlos A. Brito-Sierra20,21, Steven D. Buckingham22, Corey L Campbell23, Saki Chan24, Eric Cox25, Benjamin R. Evans26, Thanyalak Fansiri27, Igor Filipović28, Albin Fontaine29-32, Andrea Gloria-Soria26, Richard Hall8, Vinita S. Joardar25, Andrew K. Jones33, Raissa G.G. Kay34, Vamsi K. Kodali25, Joyce Lee24, Gareth J. Lycett16, Sara N. Mitchell11, Jill Muehling8, Michael R. Murphy25, Arina D. Omer4-6, Frederick A. Partridge22, Paul Peluso8, Aviva Presser Aiden4,5,35,36, Vidya Ramasamy33, Gordana Rašić28, Sourav Roy37, Karla Saavedra-Rodriguez23, Shruti Sharan20,21, Atashi Sharma38, Melissa Laird Smith8, Joe Turner39, Allison M. Weakley11, Zhilei Zhao15, Omar S. Akbari40, William C. Black IV23, Han Cao24, Alistair C. Darby39, Catherine Hill20,21, J. Spencer Johnston41, Terence D. Murphy25, Alexander S. Raikhel37, David B. Sattelle22, Igor V. Sharakhov38,42, Bradley J. White11, Li Zhao43, Erez Lieberman Aiden4-7,13, Richard S. Mann12, Louis Lambrechts29,31, Jeffrey R. Powell26, Maria V. Sharakhova38,42, Zhijian Tu19, Hugh M.
    [Show full text]
  • A Collection of Swedish Diachronic Word Embedding Models Trained on Historical
    A Collection of Swedish Diachronic Word Embedding Models Trained on Historical Newspaper Data DATA PAPER SIMON HENGCHEN NINA TAHMASEBI *Author affiliations can be found in the back matter of this article ABSTRACT CORRESPONDING AUTHOR: Simon Hengchen This paper describes the creation of several word embedding models based on a large Språkbanken Text, Department collection of diachronic Swedish newspaper material available through Språkbanken of Swedish, University of Text, the Swedish language bank. This data was produced in the context of Språkbanken Gothenburg, SE Text’s continued mission to collaborate with humanities and natural language [email protected] processing (NLP) researchers and to provide freely available language resources, for the development of state-of-the-art NLP methods and tools. KEYWORDS: word embeddings; semantic change; newspapers; diachronic word embeddings TO CITE THIS ARTICLE: Hengchen, S., & Tahmasebi, N. (2021). A Collection of Swedish Diachronic Word Embedding Models Trained on Historical Newspaper Data. Journal of Open Humanities Data, 7: 2, pp. 1–7. DOI: https://doi. org/10.5334/johd.22 1 OVERVIEW Hengchen and Tahmasebi 2 Journal of Open We release diachronic word2vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2017) models Humanities Data DOI: 10.5334/johd.22 in their skip-gram with negative sampling (SGNS) architecture. The models are trained on 20-year time bins, with two temporal alignment strategies: independently-trained models for post-hoc alignment (as introduced by Kulkarni et al. 2015), and incremental training (Kim et al., 2014). In the incremental scenario, a model for t1 is trained and saved, then updated with the data from t2.
    [Show full text]
  • English Words and Culture
    Wayne State University Anthropology Faculty Research Publications Anthropology 2021 The Lexiculture Papers: English Words and Culture Stephen Chrisomalis Wayne State University, [email protected] Follow this and additional works at: https://digitalcommons.wayne.edu/anthrofrp Part of the Anthropological Linguistics and Sociolinguistics Commons, Digital Humanities Commons, Discourse and Text Linguistics Commons, and the Linguistic Anthropology Commons Recommended Citation Chrisomalis, Stephen (editor). 2021. The Lexiculture Papers: English Words and Culture. https://digitalcommons.wayne.edu/anthrofrp/4/ This Book is brought to you for free and open access by the Anthropology at DigitalCommons@WayneState. It has been accepted for inclusion in Anthropology Faculty Research Publications by an authorized administrator of DigitalCommons@WayneState. The Lexiculture Papers English Words and Culture edited by Stephen Chrisomalis Wayne State University 2021 Introduction 1 Stephen Chrisomalis Lexiculture: A Student Perspective 6 Agata Borowiecki Akimbo 8 Agata Borowiecki Anymore 14 Jackie Lorey Artisanal 19 Emelyn Nyboer Aryan 23 David Prince Big-ass 28 Nicole Markovic Bromance 34 Alistair King Cafeteria 39 Deanna English Childfree 46 Virginia Nastase Chipotle 51 Gabriela Ortiz Cyber 56 Matthew Worpell Discotheque 60 Ashley Johnson Disinterested 66 Hope Kujawa Douche 71 Aila Kulkarni Dwarf 78 John Anderson Eleventy 84 Julia Connally Erectile Dysfunction 87 Corrinne Sanger Eskimo 93 Valerie Jaruzel Expat 97 Michelle Layton Feisty 103 Kathryn Horner
    [Show full text]
  • Tracing Significant Change in 17Th-Century English Lexis: the Civil-War Effect
    TRACING SIGNIFICANT CHANGE IN 17TH-CENTURY ENGLISH LEXIS: THE CIVIL-WAR EFFECT Table 1. Number of words whose frequencies are significantly different in the two periods of the CEEC (1600–1639 and 1640–1681), at various significance thresholds α (n = 46,440). α Log-likelihood ratio test Bootstrap test Both 0.01 2,685 (6 %) 2,365 (5 %) 2,199 (5 %) 0.001 1,400 (3 %) 1,209 (3 %) 1,108 (2 %) 0.0001 937 (2 %) 759 (2 %) 722 (2 %) For more information, see Lijffijt et al. (forthcoming 2012). REFERENCES CEEC = Corpus of Early English Correspondence. 1998. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi & Minna Palander-Collin at the Department of English, University of Helsinki. http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/ Dunning, Ted. 1993. “Accurate methods for the statistics of surprise and coincidence”. Computational Linguistics 19(1): 61–74. Google Books Ngram Viewer. http://books.google.com/ngrams/ HT = Historical Thesaurus of the Oxford English Dictionary. 2009. Edited by Christian Kay, Jane Roberts, Michael Samuels & Irené Wotherspoon. OED Online. http://www.oed.com/thesaurus Kilgarriff, Adam. 2001. “Comparing corpora”. International Journal of Corpus Linguistics 6(1): 97–133. Lijffijt, Jefrey, Tanja Säily & Terttu Nevalainen. Forthcoming 2012. “CEECing the baseline: Lexical stability and significant change in a historical corpus”. To appear in the proceedings of the Helsinki Corpus Festival. http://www.helsinki.fi/varieng/journal/ Lijffijt, Jefrey, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki & Heikki Mannila. Forthcoming. “Significance testing of word frequencies in corpora”. Submitted to International Journal of Corpus Linguistics. Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K.
    [Show full text]
  • Networking Best Practices for Large Deployments Google, Inc
    Networking Best Practices for Large Deployments Google, Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043 www.google.com Part number: NETBP_GAPPS_3.8 November 16, 2016 © Copyright 2016 Google, Inc. All rights reserved. Google, the Google logo, G Suite, Gmail, Google Docs, Google Calendar, Google Sites, Google+, Google Talk, Google Hangouts, Google Drive, Gmail are trademarks, registered trademarks, or service marks of Google Inc. All other trademarks are the property of their respective owners. Use of any Google solution is governed by the license agreement included in your original contract. Any intellectual property rights relating to the Google services are and shall remain the exclusive property of Google, Inc. and/or its subsidiaries (“Google”). You may not attempt to decipher, decompile, or develop source code for any Google product or service offering, or knowingly allow others to do so. Google documentation may not be sold, resold, licensed or sublicensed and may not be transferred without the prior written consent of Google. Your right to copy this manual is limited by copyright law. Making copies, adaptations, or compilation works, without prior written authorization of Google. is prohibited by law and constitutes a punishable violation of the law. No part of this manual may be reproduced in whole or in part without the express written consent of Google. Copyright © by Google Inc. Google provides this publication “as is” without warranty of any either express or implied, including but not limited to the implied warranties of merchantability or fitness for a particular purpose. Google Inc. may revise this publication from time to time without notice.
    [Show full text]