COMMUNICATIONS OF THE ACM | CACM.ACM.ORG | 11/2015 | VOL. 58 | NO. 11
Association for Computing Machinery contributed articles
DOI:10.1145/2735624

Information Cartography

A metro map can tell a story, as well as provide good directions.

BY DAFNA SHAHAF, CARLOS GUESTRIN, ERIC HORVITZ, AND JURE LESKOVEC

Metro map created by Alberto Antoniazzi.

key insights
• Though human attention and comprehension can be overwhelmed by the data deluge, automatic methods can extract structured knowledge and provide maps of complex information landscapes to help people understand ideas, connections, and storylines.
• Properties of good maps are difficult to formalize; important characteristics include coherence of storylines, coverage of diverse and important topics, and relationships among pieces of information.
• These principles can be used to synthesize meaningful narratives from large datasets across multiple domains, including news stories, research papers, legal cases, and works of literature.

“RAISE YOUR HAND if you don’t quite understand this whole financial crisis,” said David Leonhardt’s New York Times article, March 2008. The credit crisis had been going on for seven months and extensively and continuously covered by every major media outlet in the world. Despite that coverage, many readers felt they did not understand what it was about.

Paradoxically, pervasive media coverage may have contributed to the public’s lack of understanding, a phenomenon known as information overload. Recent technology advances allow us to produce data at bewildering rates, while the surge of the Web has brought down the barriers of distribution. Yet despite this accelerating data deluge, knowledge and attention remain precious and scarce commodities.

Writers, researchers, and analysts spend countless hours gathering information and synthesizing meaningful narratives, examining and inferring relationships among pieces of information. Subtleties and relationships in an evolving story are easy to lose in an echo chamber created by the modification and reuse of content, as fueled by incentives to attract indexers, eyeballs, and clicks on advertisements.

The problem of automatically extracting structured knowledge from large datasets is increasingly prevalent. Several methods have sought to summarize and visualize narratives.[2,28,29] However, most work only for simple stories that are linear in nature. In contrast, complex stories exhibit a nonlinear structure; stories spaghetti into branches, side stories, dead ends, and intertwining narratives. To explore them, users need a map to guide them through unfamiliar territory.

We previously introduced a methodology for creating structured summaries of information we call “metro maps.” The name is metaphoric; just as cartographic maps have been relied on for centuries to help us understand our surroundings, metro maps help us understand the information landscape. In this article, we explore methods we have developed for automatically creating metro maps of information.[25–27]

Metro maps consist of a set of lines with intersections or overlaps. Most important, they explicitly show the relationships among different pieces of information in a way that captures a story’s evolution. Each metro stop is a cluster of articles, and lines follow coherent narrative threads. Different lines focus on different aspects of the story; for example, the map in Figure 1 was automatically generated for the query “Crimea.” The map outlines the 2014 Crimean crisis, with the three lines corresponding to the Russian, Ukrainian, and Western points of view. The legend to the left of each line shows the important words for the line. The timeline appears at the bottom of the figure. The Russian (green) line starts in March, with the Crimea parliament voting to join Russia and Vladimir Putin recognizing Crimean independence. The Ukrainian (orange) line starts with Ukraine’s former prime minister urging the West to stop Russian aggression. The Ukrainian line then joins the Western (blue) line to discuss the West’s attempts to support Ukraine. Finally, the Russian and Ukrainian lines intersect when pro-Russia groups took over police stations in Ukraine.

Figure 1. Sample output: metro map of the 2014 Crimean crisis.
Legend on the left of each line lists the important words for the line; the lines correspond to the Russian, Ukrainian, and Western points of view. Each metro stop is a cluster of articles; the callout bubbles are manual annotations of the content. The timeline is at the bottom of the map.
[Map graphic omitted; timeline marks: Mar 8, Mar 17, Apr 4, Apr 30.]

Our representation is motivated by the strong empirical evidence that map representations help users gain and retain knowledge; for example, mind maps and knowledge maps have been shown to increase memory recall in students,[11,23] as well as motivation and concentration.[15] We have also found map visualizations enable users to digest information. We also integrated capabilities for supporting user interaction into the methodology, letting users guide formulation of the maps.

We demonstrate that metro maps can help people understand information in many areas, including news stories, research areas, legal cases, even works of literature. Metro maps can help them cope with information overload, framing a direction for research on automated extraction of information, as well as on new representations for summarizing and presenting complex sets of interrelated concepts.

Finding a Good Map

We start by formalizing the characteristics of good maps and formulating their construction as an optimization problem. We then provide efficient, scalable methods with theoretical guarantees for constructing maps. Our description of the characteristics is intentionally abstract. Later, we demonstrate how to adapt these abstract notions to various domains.

An objective function. Before we can come up with an algorithm for computing good maps, we must craft an objective function, which is especially important for maps, where the objective is not clear, a priori. In the following sections, we motivate and formalize several (sometimes conflicting) criteria. In the next section, we present a principled approach to constructing maps that optimizes trade-offs among these criteria.

First, recall our goal. Given a set of documents, we seek to compute a metro map that summarizes and organizes the documents. A metro map consists of a set of metro lines, each an ordered sequence of stops, where a stop is a subset of articles. Each line follows a coherent narrative thread, and different lines focus on different aspects of the story. Intersections across lines reveal the ways different storylines interact; for example, we computed the map in Figure 1 over news articles containing the word “Crimea” from March to April 2014. Each stop is a cluster of articles. The map includes three storylines, following the Russian, Ukrainian, and Western points of view.

Coherence. A first requirement is that each metro line tells a coherent story; following the articles along a line should give the user a clear understanding of the evolution of a story.

Consider a chain of clusters, where a cluster is a set of documents. For the sake of the presentation, we focus on singletons, with each cluster a single document. In order to define coherence, a natural first step is to measure similarity between each two consecutive articles along the chain. As a single bad transition can destroy the coherence of an entire chain, we measure the strength of the chain by the strength of its weakest link.

However, this simple approach can produce poor chains. Consider, for example, chains A and B. Both have the same endpoints, yet Chain A is significantly less coherent. Note the transitions of Chain A are all reasonable when examined out of context; the first two articles are about debt default, the second and third are about Republicans, and so on. Despite these local connections, the overall effect is incoherent.

Now take a closer look at the two chains. Figure 2 shows word appearance along both chains; for example, the word “Greece” appears throughout Chain B. It is easy to spot the associative flow of Chain A. Words appear for short stretches; some words appear, then disappear and reappear. Contrast this with Chain B, where stretches are longer and transitions are smoother. This observation motivates our definition of coherence.

Figure 2. Word patterns in Chain A (left) and B (right); bars correspond to word appearance in the articles listed above.

Chain A (Incoherent)
• Europe weighs possibility of debt default in Greece
• Why Republicans don’t fear a debt default
• Italy; The Pope is leaning toward Republican ideas
• Italian-American groups protest “Sopranos”
• Greek workers protest austerity plan
Words: Greece, Europe, Republican, Italy, Protest

Chain B (Coherent)
• Europe weighs possibility of debt default in Greece
• Europe commits to action on Greek debt
• European Union moves toward a bailout of Greece
• Greece set to release austerity plan
• Greek workers protest austerity plan
Words: Greece, Europe, Debt, Austerity, Protest

We transform the problem into a linear programming optimization problem, where the goal is to choose a small set of words and score the chain based solely on these words. To ensure the strength of each transition, the score of a chain (given a set of active words) is the score of the weakest link; see Shahaf and Guestrin[24] for details.

The score of a single link might depend on the domain. In Shahaf et al.[26] we showed how to compute a score, given article content alone. In Shahaf et al.,[25] we showed how to take advantage of links among articles.

Coverage. Coherence is crucial for good maps, but is it sufficient? Pursuing an answer, we found maximally coherent lines for the query “Bill Clinton.” The results were discouraging. While the lines were indeed coherent, they were not important. Many lines revolved around narrow topics (such as Clinton visiting Belfast). Moreover, as there was no notion of diversity, multiple lines included redundant information. This example suggests selecting the most coherent lines does not guarantee a good map. Instead, the key challenge is balancing coherence and coverage; in addition to being coherent, lines must also cover diverse topics important to the user.

We define a set of elements the map can cover. The elements can depend on the domain; in the case of news articles, we select words (such as “Obama” and “China”),[26] so a high-coverage map discusses many important words. In the case of a scientific corpus, we select papers[25] so a high-coverage map touches a large chunk of the corpus.

We calculate a coverage function, measuring how well each document covers each element. We extend it to a set function, measuring how well a set of documents covers each element. In order to encourage diversity, this function is submodular; if the map covers an element well already, adding another document covering that element well thus provides little extra coverage. Lack of extra coverage encourages us to pick documents that cover new topics instead.

We next introduce weights for each element, indicating the element’s importance. The weights bias the map toward covering important elements while also offering a natural mechanism for personalization. In Shahaf et al.,[26] we discussed learning weights from user feedback, resulting in a personalized notion of coverage.

Connectivity. Finally, a map is more than a set of lines, with information in its structure as well. Our final property is thus connectivity. A map should convey the underlying structure of the story and how different aspects of the story interact.

Intuitively, different stories have different structures. Some stories are almost linear, while others are much more complex. In order to capture the structure of a story, we compute the minimum number of lines that cover all metro stops. This objective prefers long storylines whenever possible; linear stories become linear maps, while complex stories maintain their interweaving threads.

Tying it all together. We now formulate the problem of finding a good metro map, given a set of documents. We need to consider trade-offs among the properties discussed earlier: “cluster quality,” “line coherence,” “map structure,” and “coverage under budget”; for example, maximizing coverage leads to a disconnected map, as there is no reason to reuse a cluster for more than one line. Maximizing coherence often results in repetitive, narrow-scope chains. It is thus better to treat coherence as a constraint; a chain is either coherent enough to be included in the map, or it is not. Coverage and structure, on the other hand, should both be optimized. We define the map objective like this:

Problem 1 (Metro maps: Informal)
A map must satisfy
    High coverage (o1)
    High structure quality (o2)
Subject to
    Minimal level of line coherence (c1)
    Minimal cluster quality (c2)
    Maximal map size (c3)

See Shahaf et al.[27] for a formal statement of the algorithm and optimization.

Algorithm

We now briefly review the main ideas behind the algorithm, which starts by computing a set of documents from a query. We then segment the articles into time windows and compute good clusters for each window (constraint c2 in Problem 1) using a community-detection algorithm on word co-occurrence graphs.[27] These clusters serve as metro stops.

Once we have clusters, we can proceed to computing coherent lines (con-
Figure 3. A metro map for the query “Boston” in May 2013.
Two lines discuss the aftermath of the Boston Marathon bombings, with one line focusing on the suspect, the other on community events; the other two lines are about Boston major league sports—hockey and baseball.
[Map graphic omitted; timeline marks: May 08, May 17, May 31.]
Figure 4. Overview of the algorithm. We compute clusters, encode coherent lines in a graph, and use the graph to compute the structure of the story. We then pick K lines from the structure that maximize coverage.
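
As a rough illustration of the last step in Figure 4, the sketch below filters candidate lines by weakest-link coherence (treating coherence as a constraint) and then greedily picks K lines under a weighted coverage function with diminishing returns. It is not the authors' implementation. The Jaccard link score, the 0.5 per-stop cover probability, the greedy strategy, and every name and number in it are assumptions made for illustration; stops are reduced to small word sets.

```python
from typing import Dict, FrozenSet, List, Sequence

Stop = FrozenSet[str]   # a metro stop, reduced here to its set of words
Line = Sequence[Stop]   # a metro line: an ordered sequence of stops


def link_score(a: Stop, b: Stop) -> float:
    """Jaccard similarity of two consecutive stops (an assumed link score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coherence(line: Line) -> float:
    """Weakest-link coherence: the minimum score over consecutive stops."""
    if len(line) < 2:
        return 1.0
    return min(link_score(a, b) for a, b in zip(line, line[1:]))


def coverage(lines: Sequence[Line], weights: Dict[str, float]) -> float:
    """Weighted coverage with diminishing returns (a submodular set function).

    Each distinct stop mentioning a word covers it with probability 0.5
    (assumed), so a word is covered to degree 1 - 0.5 ** (number of stops).
    """
    stops = {stop for line in lines for stop in line}
    total = 0.0
    for word, weight in weights.items():
        count = sum(1 for stop in stops if word in stop)
        total += weight * (1.0 - 0.5 ** count)
    return total


def pick_lines(candidates: Sequence[Line], weights: Dict[str, float],
               k: int, min_coherence: float) -> List[Line]:
    """Greedily pick up to k coherent lines by marginal coverage gain."""
    remaining = [line for line in candidates if coherence(line) >= min_coherence]
    chosen: List[Line] = []
    while remaining and len(chosen) < k:
        base = coverage(chosen, weights)
        best = max(remaining,
                   key=lambda line: coverage(chosen + [line], weights) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical stops, loosely modeled on the headlines in Figure 2.
greece_debt = frozenset({"greece", "debt", "europe"})
greece_bailout = frozenset({"greece", "bailout", "europe"})
greece_austerity = frozenset({"greece", "austerity", "protest"})
republicans = frozenset({"republican", "debt"})
sopranos = frozenset({"italy", "protest", "sopranos"})
italy_pope = frozenset({"italy", "pope", "republican"})
italy_vote = frozenset({"italy", "republican", "vote"})

line_a = (greece_debt, republicans, sopranos)             # associative drift
line_b = (greece_debt, greece_bailout, greece_austerity)  # one coherent thread
line_c = (greece_debt, greece_bailout)                    # redundant with line_b
line_d = (italy_pope, italy_vote)                         # a different topic

weights = {"greece": 2.0, "debt": 1.0, "austerity": 1.0,
           "italy": 1.0, "republican": 1.0}

chosen = pick_lines([line_a, line_b, line_c, line_d], weights,
                    k=2, min_coherence=0.15)
```

In this toy input, line_a fails the coherence threshold because its last transition shares no words, and line_c adds no coverage once line_b is chosen because all of its stops are already on the map. The diminishing-returns coverage therefore steers the second pick toward line_d, which covers new words.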