SPECIAL SECTION: PREDICTION

THE PULSE OF THE PEOPLE

Can internet data outdo costly and unreliable polls in predicting election outcomes?

By John Bohannon

[Photo: Both polling and an analysis of pre–election night tweets failed to flush out Trump’s hidden voters. PHOTO: EDUARDO MUNOZ ALVAREZ/STRINGER/GETTY IMAGES]

In an apartment on New York City’s Upper West Side on 8 November 2016, Hernan Makse and several friends cooked branzino and sipped Chablis as they watched the U.S. presidential election unfold. They hopscotched between MSNBC and Fox News while keeping an eye on The New York Times website on a laptop. The Times was streaming live updates of its “presidential election forecast.” It was still early, and results from key states had not yet come in. On a chart labeled “Chance of Winning Presidency” that reflected the polling data rolling in, Hillary Clinton bounced above 80%, leaving Donald Trump mired below 20%.

Makse, a statistical physicist at City University of New York, had placed a scientific bet on the outcome. The day before, his lab group had posted a research paper to arXiv, the online preprint repository. They had feverishly revised it to make the 4 p.m. deadline and publish on Election Day. Like the gauge chart on the Times website, they predicted who would become president. But whereas the Times used data from state-by-state polling, Makse’s prediction was based entirely on data gathered from Twitter in the months leading up to the election.

If Makse’s group nailed the election forecast, they would have reason to brag. Polling, whether done by phone or door-to-door, is extremely labor intensive and expensive: It fuels an $18 billion industry. And it has problems. Not only have response rates fallen to single digits, leaving pollsters to rely on a thin and biased sample of people, but also an analysis last year of more than 1000 polls found evidence of widespread data fabrication (Science, 4 March 2016, p. 1014). By contrast, Makse’s group tracked the political opinions of millions of people directly, second by second, for months, and they got those data for free.

Twitter isn’t the only online data stream that scientists are funneling into predictive models of everything from elections to street protests. The largest tech companies such as Facebook and Google generate data that are free for researchers to use, though with varying degrees of inconvenience. So Makse and many other social scientists are asking: Could online data enhance polling as a forecasting tool, or even replace it?

The election night verdict: not yet. As the evening wore on, Makse’s forecast based on freely harvested tweets continued to match the pricey polling data, predicting a win for Clinton with 55.5% of the vote. But both forecasts got it wrong. Before their dinner was done, Makse watched as the projections on the Times’s data-driven blog, The Upshot, caught up with reality. “It was funny to see how at around 8 p.m.,” he says, “they switched from 20% to 95% for Trump.”

The internet, it seems, can’t yet reliably take the pulse of the people. But Makse and many other social scientists are convinced that it eventually will, if only they can figure out how to translate terabytes of data into human intentions.

FORECASTING WHAT PEOPLE WILL DO, and why, is the essence of social science. Considering how hard it is to divine even a single person’s behavior, scaling up prediction to a community or society seems like a nonstarter. “But in some ways that is an easier problem,” says Taha Yasseri, a computational social scientist at the University of Oxford Internet Institute in the United Kingdom. He offers an analogy from physics: Although the movement of a single particle looks random, “the behavior of a gas made up of millions of particles is very predictable.”

The idea that society can be treated like a physics problem has deep roots. In the 1950s, science fiction author Isaac Asimov conjured up a branch of science called psychohistory. With powerful computers and gargantuan data sets, he imagined, researchers would forecast not just elections,
but the rise and fall of empires.

A lifetime later, the computers and the data Asimov envisioned are becoming reality. But for now, polling, costly and inefficient as it is, remains the tool of choice for predicting group behavior such as elections. And a study of electoral races around the world on p. 515 suggests that polls are still reliable, despite last November’s surprise.

Ryan Kennedy, a social scientist at the University of Houston in Texas, and colleagues focused on a data set of presidential elections. They avoided the complexity of comparing different government systems by limiting the study to elections in which voters chose a national leader directly, rather than, for example, through a party-based parliamentary system like the United Kingdom’s. That filter left plenty of data: The final tally came to more than 500 elections from 86 different countries going back to World War II. To predict winners, Kennedy, David Lazer, a social scientist at Northeastern University in Boston, and his Ph.D. student Stefan Wojcik statistically modeled the elections using voter polling data as well as data on other factors that can tip elections: the country’s economic prosperity, democratic freedoms (using a third-party measurement called a Polity score), and whether an incumbent was running.

They trained their models on data up to 2007 and then tested them on the most recent 8 years, totaling 128 elections. Overall, they correctly predicted the winner 80% to 90% of the time. And of all the indicators, polling proved the most powerful, by far. “We predict that reports of the death of quantitative electoral forecasts are greatly exaggerated,” the authors quip. Others agree that for the time being, polling reigns. “If you’re trying to predict a decision people will make, there’s just no substitute for asking them directly,” says Andrew Gelman, a statistician at Columbia University.

Yet Lazer, for one, believes our reliance on polling may not last much longer. “Canonical polling methods are in crisis,” he says. One factor is people’s growing impatience with pollsters; another is the death of the landline. You can’t survey people if you don’t know how to find them. Could a fire hose of data from the internet plug the gap? That holds “great promise,” says Lazer, “but a lot of work has to be done before those approaches are validated.”

One challenge is that it is hard to decipher people’s motivations from their internet habits: that is, their web searches and
posts. If millions of people tweet sentiments supportive of a candidate or critical of an opponent, can it be deduced, reliably, how they will vote? Predicting people’s behavior is tough, Yasseri says, “if you don’t know what motivates them.”

A promising test ground for probing motivation is Wikipedia, a website used by a remarkably broad swath of humanity as a one-stop shop for basic information on almost any topic. To see what Wikipedia’s traffic might reveal about electoral outcomes, Yasseri and fellow Oxford researcher Jonathan Bright have been tracking the number of daily visitors to the Wikipedia pages devoted to political parties competing in the European Union’s parliamentary elections every 5 years. Because the voters speak different languages, Yasseri and Bright gather data separately from each of 14 language versions of the site.

The number of visitors to each political party’s Wikipedia page would not have reliably predicted who ultimately won seats in the 2009 and 2014 elections. “It’s not that simple,” Yasseri says. His theory is that voters are “information misers” seeking out the minimum information needed to make a decision. And indeed, they found, the most active Wikipedia pages tended to be those of newly formed parties, with visits peaking in the week before an election. By monitoring those pre-election spikes, Yasseri and Bright reported last year in EPJ Data Science, they were able to forecast the gains of the nationalist parties within France and the United Kingdom in their respective national elections. But what’s needed, they say, is more information about those Wikipedia visitors to translate browsing choices into likely voting choices.

To put human prediction to the test, Yasseri is part of a European team building a “social data bank,” which, like a genetic data bank, would provide deep information (demographics, health records, traces of online browsing, and even mobile phone data) on a slice of the population. To start, it will focus on the United Kingdom, Finland, Hungary, Spain, and Slovenia. “We have to figure out how to keep the data anonymous,” Yasseri says. The hope is that tracking all the online behavior of relatively few people will allow researchers to deduce what motivates someone to visit a website, tweet, and, ultimately, vote one way or another. Once they solve the anonymity problem, he says, the team hopes to start predicting outcomes such as elections within a few years.

MAKSE HIMSELF IS TRYING to improve his Twitter-based model. The morning after Trump’s election, he met his graduate students and postdocs in the lab. The mood was grim. “Most of them are foreigners,” he says, and the anti-immigrant rhetoric of Trump’s campaign had been bruising.

They did a postmortem on their Twitter study to look for signs they should have seen. Although the Twitter data were far easier to gather than the polls, they were harder to interpret, posing challenges that pollsters need never consider. For example, among the 73 million tweets about Trump or Clinton in the 4 months before the election, how many were written by humans? Twitter’s platform allows “bots” (computer programs imitating humans) to take part in the online discussion. They aren’t labeled as such, and to most observers they appear to be enthusiastic fellow voters, echoing political slogans and amplifying a point of view. Deploying such bots is like planting people in an audience to laugh at your jokes.

To use Twitter as a voter opinion poll, Makse’s team had to detect all those bots and filter them out first. They did that before election night by analyzing not only the content and timing of the tweets, but also information about the accounts behind them. A telltale sign of a bot is an account that does not use one of the standard Twitter software clients and relentlessly retweets content from other accounts verbatim. As they debotted their data, a stark pattern emerged: Whereas the pro-Trump tweeters were riddled with bots, those supporting Clinton seemed almost exclusively human. The effect the bots had on the election remains an open question.

Another unknown is the number of Twitter users who are paid hacks. One of the most influential pro-Trump tweeters of all, based on Makse’s analysis of the cascades of retweeting in Twitter’s echo chamber, was @LindaSuhler. According to the account profile, it is a “Linda Suhler, Ph.D.” The internet has no record of that person and direct Twitter messages from Science were never answered.

[Chart: Conventional wisdom. During the party conventions last summer, the number of individuals who tweeted in favor of Hillary Clinton (blue) spiked, far outstripping those who backed Donald Trump (red), in line with polls that had Clinton in the lead. Vertical axis: number of daily users (thousands), 0 to 500. Horizontal axis: June, July, August, with the Republican Convention and Democratic Convention marked.]

[Chart: The few, the loud. After filtering out bots that retweeted pro-Trump sentiments, researchers found that Trump’s backers tweeted far more often than did Clinton’s. Election night predictions undervalued that indicator of stronger enthusiasm for Trump. Vertical axis: number of tweets per user, 0 to 5. Horizontal axis: June, July, August.]

[CREDITS: (GRAPHIC) J. YOU/SCIENCE; (DATA) HERNAN MAKSE AND ALEXANDRE BOVET, CITY COLLEGE OF NEW YORK]

The close match last November between Twitter-based projections and voter polls could be a fluke. Only more elections will tell. But Makse suspects that both approaches have the same limitation: Certain voters were underrepresented. The Trump voters “just weren’t there,” he says. The rural U.S. population that propelled Trump to victory may have been the very people not using Twitter or answering pollsters’ questions. Either that, he says, or the “shy Trump voter” theory is right: People who intended to vote for Trump tended to keep quiet about it. In retrospect, says Makse, the much higher intensity of tweeting from pro-Trump Twitter users versus pro-Clinton users (see chart, above) “was a red flag” for deep differences between them.

If those biases can be tracked, data from social media could increase the accuracy of election predictions, Makse says. But how much accuracy do we need? There is a downside to psychohistory, Gelman warns. If we could predict election outcomes with perfect fidelity, he says, the act of voting itself “would be meaningless.”
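For readers who want to see what the bot filter and the tweets-per-user indicator described in this article might look like in practice, here is a minimal sketch in Python. It is not the code used by Makse’s group. The `Tweet` fields, the list of “standard” clients, and the thresholds `min_tweets` and `retweet_share` are all assumptions made for illustration, and each tweet is assumed to have already been labeled as favoring one camp.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical list of "standard" clients; the article does not name them.
STANDARD_CLIENTS = {"Twitter Web Client", "Twitter for iPhone", "Twitter for Android"}


@dataclass
class Tweet:
    user: str         # account handle
    client: str       # software that posted the tweet
    is_retweet: bool  # True if the tweet is a verbatim retweet
    camp: str         # "trump" or "clinton" (assumed to be classified already)


def flag_bots(tweets, min_tweets=5, retweet_share=0.9):
    """Flag accounts that never use a standard client and almost only retweet.

    Both thresholds are illustrative assumptions, not values from the study.
    """
    by_user = defaultdict(list)
    for t in tweets:
        by_user[t.user].append(t)

    bots = set()
    for user, ts in by_user.items():
        nonstandard = all(t.client not in STANDARD_CLIENTS for t in ts)
        share = sum(t.is_retweet for t in ts) / len(ts)
        if len(ts) >= min_tweets and nonstandard and share >= retweet_share:
            bots.add(user)
    return bots


def camp_stats(tweets, bots):
    """Per camp: distinct non-bot users and mean tweets per user."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in tweets:
        if t.user not in bots:
            counts[t.camp][t.user] += 1
    return {
        camp: {
            "users": len(users),
            "tweets_per_user": sum(users.values()) / len(users),
        }
        for camp, users in counts.items()
    }
```

The two numbers returned for each camp correspond loosely to the two charts: the count of distinct users is the “head count” that tracked the polls, while tweets per user is the intensity signal that, in retrospect, Makse calls a red flag. A real pipeline would also need the content and timing analysis the article mentions, which this sketch leaves out.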

472 3 FEBRUARY 2017 • VOL 355 ISSUE 6324 sciencemag.org SCIENCE

Published by AAAS

The pulse of the people. John Bohannon (February 2, 2017). Science 355 (6324), 470-472. [doi: 10.1126/science.355.6324.470]


This copy is for your personal, non-commercial use only.

Article Tools: Visit the online version of this article to access the personalization and article tools: http://science.sciencemag.org/content/355/6324/470

Permissions: Obtain information about reproducing this article: http://www.sciencemag.org/about/permissions.dtl

Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright 2016 by the American Association for the Advancement of Science; all rights reserved. The title Science is a registered trademark of AAAS.