NEWS OF THE WEEK

DIGITAL DATA

Google Opens Books to New Cultural Studies

In March 2007, a young man with dark, curly hair and a [...] accent knocked on the door of Peter Norvig, the head of research at Google in Mountain View, California. It was Erez Lieberman Aiden, a mathematician doing a Ph.D. in genomics at Harvard University, and he wanted some data. Specifically, Lieberman Aiden wanted access to Google Books, the company's ambitious and controversial project to digitally scan every page of every book ever published. By analyzing the growth, change, and decline of published words over the centuries, the mathematician argued, it should be possible to rigorously study the evolution of culture on a grand scale. “I didn’t think the idea was crazy,” recalls Norvig. “We were doing the scanning anyway, so we would have the data.”

The first explorations of the Google Books data are now on display in a study published online this week by Science (www.sciencemag.org/content/early/2010/12/16/science.1199644.abstract). The researchers have revealed 500,000 English words missed by all dictionaries, tracked the rise and fall of ideologies and famous people, and, perhaps most provocatively, identified possible cases of political suppression unknown to historians. “The ambition is enormous,” says Nicholas Dames, a literary scholar at Columbia University.

The project almost didn’t get off the ground because of the legal uncertainty surrounding Google Books. Most of its content is protected by copyright, and the entire project is currently under attack by a class action lawsuit from book publishers and authors. Norvig admits he had concerns about the legality of sharing the digital books, which cannot be distributed without compensating the authors. But Lieberman Aiden had an idea. By converting the text of the scanned books into a single, massive “n-gram” database, a map of the context and frequency of words across history, scholars could do quantitative research on the tomes without actually reading them. That was enough to persuade Norvig.

Lieberman Aiden teamed up with fellow Harvard Ph.D. student Jean-Baptiste Michel. The pair were already exploring ways to study written language with mathematical techniques borrowed from evolutionary biology. Their 2007 study of the evolution of English verbs, for example, made the cover of Nature. But they had never contended with the amount of data that Google Books offered. It currently includes 2 trillion words from 15 million books, about 12% of every book in every language published since the Gutenberg Bible in 1450. By comparison, the human genome is a mere 3-billion-letter poem.

Michel took on the task of creating the software tools to explore the data. For the analysis, they pulled in a dozen more researchers, including Harvard linguist Steven Pinker. The first surprise, says Pinker, is that books contain “a huge amount of lexical dark matter.” Even after excluding proper nouns, more than 50% of the words in the n-gram database do not appear in any published dictionary. Widely used words such as “deletable” and obscure ones like “slenthem” (a type of musical instrument) slipped below the radar of standard references. By the research team’s estimate, the size of the English language has nearly doubled over the past century, to more than 1 million words. And vocabulary seems to be growing faster now than ever before.

It was also possible to measure the cultural influence of individual people across the centuries. For example, notes Pinker, tracking the ebb and flow of “Sigmund Freud” and “Charles Darwin” reveals an ongoing intellectual shift: Freud has been losing ground, and Darwin finally overtook him in 2005.

Analysis of the n-gram database can also reveal patterns that have escaped the attention of historians. Aviva Presser Aiden led an analysis of the names of people that appear in German books in the first half of the 20th century. (She is a medical student at Harvard and the wife of Erez Lieberman Aiden.) A large number of artists and academics of this era are known to have been censored during the Nazi period, for being either Jewish or “degenerate,” such as the painter Pablo Picasso. Indeed, the n-gram trace of their names in the German corpus plummets during that period, while it remains steady in the English corpus.

Once the researchers had identified this signature of political suppression, they analyzed the “fame trace” of all people mentioned in German books across the same period, ranking them with a “suppression index.” They sent a sample of those names to a historian in Israel for validation. Over 80% of the people identified by the suppression index are known to have been censored (for example, because their names were on blacklists), proving that the technique works. But more intriguing, there is now a list of people who may have been victims of suppression unknown to history.

“This is a wake-up call to the humanities that there is a new style of research that can complement the traditional styles,” says Jon Orwant, a computer scientist and director of digital humanities initiatives at Google. In a nod to data-intensive genomics, Michel and Lieberman Aiden call this nascent field “culturomics.”

Humanities scholars are reacting with a mix of excitement and frustration. If the available tools can be expanded beyond word frequency, “it could become extremely useful,” says Geoffrey Nunberg, a linguist at the University of California, Berkeley. “But calling it ‘culturomics’ is arrogant.” Nunberg dismisses most of the study’s analyses as “almost embarrassingly crude.”

Although he applauds the current study, Dames has a score of other analyses he would like to perform on the Google Books corpus that are not yet possible with the n-gram database. For example, a search of the words in the vicinity of “God” could reveal “semantic shifts” over history, Dames says. But the current database only reveals the five-word neighborhood around any given term.

Orwant says that both the available data and analytical tools will expand: “We’re going to make this as open-source as possible.” With the study’s publication, Google is releasing the n-gram database for public use. The current version is available at www.culturomics.org.

–JOHN BOHANNON

CREDITS: J. B. MICHEL ET AL.; WORDLE.COM

1600 17 DECEMBER 2010 VOL 330 SCIENCE www.sciencemag.org Published by AAAS