A Probabilistic Model of Semantics in Social Information Foraging

Peter Pirolli

Palo Alto Research Center, Inc., 3333 Coyote Hill Road, Palo Alto, CA 94304 [email protected]

Abstract

Probabilistic theories of semantic category formation in cognitive psychology can be used to model how semantic sense may be built up in the aggregate by a network of cooperating users. A model based on Latent Dirichlet Allocation is applied to word activity data in the Lostpedia wiki to reveal the structure of topics being written about and the trends in attention devoted to topics, and to relate the model inferences to events known to have occurred in the real world. Relevance of the models to progress in exploratory search and social information foraging is discussed.

INTRODUCTION

Over the past decade, the Web and the Internet have evolved to become much more social, including being much more about socially mediated information foraging and sensemaking. This is evident in the growth of blogs [10], wikis [20], and social tagging systems [16]. It is also evident in the growth of academic interest in socially mediated intelligence and the cooperative production of knowledge [2, 18, 19]. It has been argued [13] that models in cognitive psychology can lead to greater understanding and better designs for human-information interaction systems. With that motivation, this paper presents extensions of Information Foraging Theory [14] aimed at modeling the acquisition of semantic knowledge about some topic area at the social level. The model is an attempt to start to address the question: how do the semantic representations of a community engaged in social information foraging and sensemaking evolve over time?

Probabilistic Semantic Representations

Probabilistic representations of semantics, particularly those based on Bayesian approaches, have arisen as the result of rational analyses of cognition [1, 17]. Rational analysis is a framework in which it is heuristically assumed that human cognition approximates an optimal solution to the problems posed by the environment. Topic category judgments are viewed as prediction problems, and these can be guided by statistical inferences made from the structure of the environment, especially the linguistic environment.

The Topic Model Based on Latent Dirichlet Allocation (LDA)

Griffiths and Steyvers [9] have presented a probabilistic model of latent category topics that is similar in spirit to a semantic learning model of individual information foraging called InfoCLASS [15]. In contrast to the incremental learning process of InfoCLASS, the Topic Model [9] is an approach based on a generative probability model. The Topic Model has been mainly directed at modeling the gist of words and the latent topics that occur in collections of documents. It seems especially well suited to modeling the communal production of content, such as scientific literatures [8] or wikis, as will be illustrated below.

The Topic Model assumes there is an inferred latent structure, L, that represents the gist of a set of words, g, as a probability distribution over T topics. Each topic is a distribution over words. A document is an example of a set of words. The generative process is one in which a document (a set of words) is assumed to be generated by a statistical process that selects the gist as a distribution of topics contained in the document, and then selects words from this topic distribution and the distribution of words within topics. This can be specified as

P(w_i \mid g) = \sum_{z_i=1}^{T} P(w_i \mid z_i) \, P(z_i \mid g)    (1)

where g is the gist distribution for a document and w_i is the ith word in the document, selected conditionally on the topic z_i, which is itself selected for the ith word conditional on the gist distribution. In essence, P(z|g) reflects the prevalence of topics within a document and P(w|z) reflects the importance of words within a topic.

This generative model is an instance of a class of three-level hierarchical Bayesian models known as Latent Dirichlet Allocation [4] or LDA. The probabilities of an LDA model can be estimated with the Gibbs sampling technique of Griffiths and Steyvers [8].
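To make the generative process in Equation 1 concrete, here is a minimal sketch in Python. The vocabulary, topic-word distributions, and gist values are toy numbers invented for illustration; they are not taken from the Lostpedia model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and T = 2 topics; each topic is a distribution over words.
vocab = ["island", "crash", "theory", "clue"]
phi = np.array([
    [0.5, 0.4, 0.05, 0.05],   # topic 0: "survival" words
    [0.1, 0.1, 0.5, 0.3],     # topic 1: "mystery" words
])

# The gist g of one document: a distribution over the T topics.
g = np.array([0.7, 0.3])

def generate_document(n_words):
    """Generate a document: draw topic z_i from g, then word w_i from phi[z_i]."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(g), p=g)
        w = rng.choice(len(vocab), p=phi[z])
        words.append(vocab[w])
    return words

print(generate_document(10))

# Equation (1): the marginal probability of each word given the gist,
# summing over topics: P(w | g) = sum_z P(w | z) P(z | g).
p_w_given_g = g @ phi
print(dict(zip(vocab, p_w_given_g.round(3))))
```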

Figure 1. A spatial representation for the Topic Model

Mapping Probabilistic Semantics onto Semantic Spaces

Another pervasive way of representing semantics is to use spatial representations. Latent Semantic Analysis [LSA, 12] is one approach to semantic space representations that has had a long history of influence in human-computer interaction [3, 5-7]. LSA models words in a high-dimensional semantic space where the dimensions can be interpreted as latent topics. Figure 1 presents a similar dimensionality-reduction formulation of the Topic Model. Topics are represented as probability distributions over words, and document gist is represented as probability distributions over topics. Theoretical evaluations that argue for the superiority of probabilistic approaches over LSA can be found in Griffiths and Steyvers [9].

Semantic Topics in Peer-produced Content

How do the semantic representations of a community engaged in social information foraging and sensemaking evolve over time? One approach to this is to apply the Topic Model to content produced by socially mediated means. Wikis are one example of systems built to support the communal production of content. According to Wikipedia:

A wiki is a collaborative website which can be directly edited by anyone with access to it. Ward Cunningham, developer of the first wiki, WikiWikiWeb, originally described it as "the simplest online database that could possibly work". One of the best-known wikis is Wikipedia.

In the current modeling effort, there were two simple aims: (a) understanding the semantic topics that underlie a community wiki, and (b) understanding the changes in production of topic knowledge over time in reaction to events in the world. The Lostpedia wiki (www.lostpedia.com) turned out to be useful for this purpose. Lostpedia is a wiki devoted to users interested in the television show Lost, and because of the nature of the show and its multimedia channels, it is a wiki devoted to collecting intelligence from a variety of sources and making sense of that information.

Figure 2. Daily page creation and revision rates in Lostpedia.

Lostpedia

Lostpedia was launched on September 22, 2005 at the beginning of Season 2 of the American Broadcasting Corporation program Lost. As of August 14, 2007, Lostpedia had 3,250 articles, about 19,000 registered users, and over 92 million page views. Parts of Lostpedia are in German, Dutch, Portuguese, Polish, Spanish, French, and Italian, in addition to English. Lostpedia has been cited as "the best example of how an online community can complement a series" [11].

One reason why Lostpedia is an interesting complement to the series is the nature of the program. Lost is a dramatic mystery with elements of science fiction. The series follows the lives of survivors of a plane crash on a mysterious tropical island. A typical episode involves events in real time on the island, as well as flashbacks about events in the lives of characters. The fictional universe presented in the Lost episodes is complemented by book tie-ins, games, and an alternate reality game called the Lost Experience.

Crucial to the series is that it is purposely cryptic, presenting numerous unresolved mysteries and clues to their answers. This inspires a great deal of analysis and sharing of ideas and theories about the resolution of these mysteries. For instance, mysteries involve numerology (the recurring numbers 4-8-15-16-23-42), "crosses" in which characters unknowingly cross paths in their past with characters who are now involved in events on the island, and glyphs and maps which occur in many scenes. Fan theories are so central to Lostpedia that contributors must follow a theory policy. The policy discriminates between canonical and non-canonical sources. Discussion pages associated with articles are often full of uploaded frame grabs from the videos of episodes so that fans can do detailed analysis. Lostpedia is interesting because it is very much like an open source intelligence gathering operation in which information is collected and analyzed in order to formulate and evaluate theories.

Figure 3. The Lostpedia topic space. A multidimensional scaling visualization of T = 20 topics in Lostpedia. Distances reflect inter-topic differences, to the extent possible in a 2D space. Topics are represented by the 10 most probable words associated with the topic. Topics in the foreground are more probable in Lostpedia.

The latent topics in this communal system surely change over time as new evidence and mysteries are presented in the world (mainly via the Lost program, but also via other media and information releases by writers and producers). We can therefore compare our analyses of the ebb and flow of topics in Lostpedia with what we know about events in the "real world" of releases of information in episodes of Lost or other media sources.

The Lostpedia Database

A Topic Model was estimated from a Lostpedia database containing page editing information. The database covered the period from September 21, 2005 to December 13, 2006. That period covers Season 2 and part of Season 3 of Lost. Over that period, 18,238 pages were created and 160,204 page revisions were performed. The mean number of revisions per page was 8.78 and the median was 2.

Figure 2 shows the rate of page creations and revisions over the time period covered in the database. Each point in the chart is a frequency count over a day. There is a large burst of activity at the beginning of the period: since Lostpedia was created at the beginning of Season 2 of Lost, there was a backlog of material to be written up. There is a large burst of activity around the Season 2 finale on May 24, 2006. There is another burst of activity following November 6, 2006, when Lost went on a mid-season hiatus until the following February (outside the range of our database).

A Topic Model for Lostpedia

The Topic Model technique used to model topics in a scientific literature [8] was used as a guide for a model of the Lostpedia database. To develop this model, the database was analyzed to produce a word activity table. This table was computed from all page revision text recorded in the database, passed through a stemmer, with stop words filtered out. Every Lostpedia page identified by a page identifier in the database was considered a document, and for each document, and for every day in the database, the word activity for that page was computed. The word activity for a page on a date is represented as a frequency of occurrence of words in revisions for that page on that date. The word activity table contains D = 18,238 pages and W = 34,352 distinct words.
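The paper does not reproduce the preprocessing code, but the construction just described can be sketched as follows. The revision-record format (page_id, date, text), the toy stoplist, and the use of NLTK's PorterStemmer are all assumptions made for illustration, not the actual pipeline used.

```python
from collections import Counter, defaultdict

from nltk.stem import PorterStemmer  # assumed stand-in for the paper's stemmer

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}  # toy stoplist
stemmer = PorterStemmer()

def tokenize(text):
    """Lowercase, split, drop stop words, and stem, as described in the paper."""
    return [stemmer.stem(tok) for tok in text.lower().split()
            if tok.isalpha() and tok not in STOPWORDS]

def word_activity(revisions):
    """Build {(page_id, date): Counter(word -> frequency)} from revision records.

    `revisions` is an iterable of (page_id, date, text) tuples; every page is a
    document, and word activity is tallied per page per day.
    """
    table = defaultdict(Counter)
    for page_id, date, text in revisions:
        table[(page_id, date)].update(tokenize(text))
    return table

# Tiny illustrative input (invented, not actual Lostpedia data).
revisions = [
    (101, "2006-05-24", "Desmond sails a boat to the island"),
    (101, "2006-05-24", "The hatch and the island"),
    (202, "2006-05-25", "A theory about the Dharma Initiative"),
]
for key, counts in word_activity(revisions).items():
    print(key, dict(counts))
```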

Lostpedia Topic Space

The first goal was to represent a set of topics that encompasses the span of time covering the whole available database. A word-document co-occurrence matrix was constructed by pooling all the word activity over all days. The resulting word-document co-occurrence matrix of 18,238 pages and 34,352 words contained 809,444 entries.

LDA was performed using Gibbs sampling (see the Appendix) to produce T = 20 topics. Explorations of other values of T (both larger and smaller) were generally consistent with the T = 20 model, but this number of topics is about the right size to discuss in a paper. The sampling algorithm was run for 500 iterations with parameters α = 50/T and β = 200/W (see Appendix).

In the results of this estimation, each topic is represented as an estimated probability distribution over words. One way to understand each topic is to examine the most probable words associated with it. In Figure 3, the Lostpedia topics are presented as boxes containing the 10 most probable words associated with each topic. Since the topics are represented as probabilities, the inter-topic similarities (or dissimilarities) can be defined using a KL divergence measure (see Appendix) and submitted to a multidimensional scaling (MDS) analysis that plots the topics as points in a lower-dimensional space, such that the spatial distance among points reflects (as best as possible) the dissimilarities among topics.
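A sketch of the inter-topic distance and MDS computation just described. The symmetrized form of the KL divergence and the use of scikit-learn's MDS are assumptions for illustration; the paper refers to its own Appendix-defined measure and does not name an MDS implementation.

```python
import numpy as np
from sklearn.manifold import MDS

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two word distributions (assumed form)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def topic_space(phi, seed=0):
    """Project T topic-word distributions (rows of phi) into 2-D with MDS."""
    T = phi.shape[0]
    dist = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            dist[i, j] = dist[j, i] = sym_kl(phi[i], phi[j])
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(dist)

# Toy example with T = 3 topics over a 4-word vocabulary (invented values).
phi = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
print(topic_space(phi))
```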

Figure 4. Stable and bursty interest in Lostpedia topics over time. Topics from Figure 3.

Figure 5. Appearance of a new character topic (Topic 1) causes a burst of content production.

Figure 3 is a plot of the 2-D MDS solution for the inter-topic similarities. The plot provides a way of visualizing a multidimensional semantic space interpretation of the Topic Model.

Interpretation of the Lostpedia Topic Space

Several regions in Figure 3 can be given a semantic interpretation, as indicated by annotations to the figure. On the lower right, there is a set of topics that capture characters and relations among characters. As noted above, one driving force in Lost is the characters, relations, crossings among characters, and flashbacks. For instance, in the center of the Characters cluster in Figure 3 is Topic 11, whose most probable words refer to the characters Locke, Jack, Charlie, Michael, Walt, Kate, Sawyer, and Boone. These are main characters, a subset of whom are referred to as the "A team", with Locke and Jack as sometimes antagonists, a Jack-Kate-Sawyer love triangle, and a Michael-Walt father-son reconciliation storyline.

In the center of Figure 3 are theories about the mysteries presented by Lost. For instance, Topic 18 is represented by words that refer to a fictional scientific group and experiment, conducted in the past on the Lost island, called the "Dharma Initiative", which has left behind a number of hidden stations ("Swan" and "Pearl" being the names of two stations discovered by mid-Season 3), each of which is associated with a logo, and some of which contain orientation videos that provide clues about the workings of the Dharma Initiative and the island.

At the lower left of Figure 3 are topics that appear to be mainly associated with media tie-ins that provide information that is basically independent of the show (and especially independent of the characters) but still contributes evidence to the theories. For instance, Topic 12 is strongly associated with words representing "Rachel Blake", a fictional reporter investigating the Hanso Corporation as part of the alternate reality game called the Lost Experience. Although neither Rachel Blake nor the Lost Experience has much to do with the Lost show, the Hanso Corporation is a mysterious organization behind the Dharma Initiative and is involved in the lives of several Lost characters.

An interesting aspect of the MDS visualization is that theories, which in the Lost community are a driving force, are central in the Topic Model semantic space, with topics associated with "facts" presented in the Lost program on one side of the theory space, and topics associated with "facts" from the Lost tie-ins on the other side.

Lostpedia Topic Trends

One way to assess the value of the Topic Model is to see what kinds of trends it detects in attention to topics over time, and to see how the model assessments relate to what we know about releases of information in the world of Lost. If there is a good correspondence between what the model infers and what we know about the progression of events in Lost, and assuming that people write about what is happening in Lost, we might gain confidence in applying this technique to other social information systems to understand the topics that people are writing about.

To analyze Lostpedia topic trends, the Lostpedia word activity database was queried to produce a set of words and word frequencies for each week in the database, with each week starting on a Wednesday and ending on a Tuesday, since episodes of Lost air on Wednesday nights. Each word occurrence in a weekly word profile was assigned to topics according to the probability with which it would be drawn from that topic, and for each topic the mass of word activity assigned to it was summed up. The strength of a topic was determined as the proportion of weekly word activity assigned to that topic compared to all topics.

Figure 4 presents a plot of topic strengths for a relatively stable topic over time against two bursty topics. Topic 11, as discussed above, is associated with the main characters and their relations. It remains very stable over the 64-week period. The topic with the biggest burst of interest among all topics is Topic 7. Topic 7 is highly associated with words representing the characters Desmond, Kelvin, Locke, Penelope, and Jack, and a boat, and it is also most highly associated with the Lostpedia Web pages for the Season 2 finale episode, "Live Together, Die Alone", and for "Desmond Hume", who is revealed to be caught in a time loop. Penelope is his girlfriend, Kelvin is someone who deserted him, and Desmond was sailing a boat around the world to impress the father of his girlfriend; Desmond also has a great deal of interaction with Locke in the Season 2 finale. The other bursty topic is Topic 9, which is highly associated with two Lostpedia pages that apparently were discovered to contain information created by hoaxers and received substantial revision.

Figure 5 presents another bursty topic: Topic 1, plotted against the fairly stable character Topic 11. In this case, Topic 1 is primarily associated with the character Eko, who is introduced in a Lost episode that occurs in Week 17 of the database and remains a central character for the remainder of the database period.

The Topic Model has been applied to the analysis of scientific literatures and used to make sense of trends in those literatures [8]. It appears that a socially mediated sensemaking site can be similarly analyzed to reveal the waxing and waning of interest in semantic topics in reaction to events in the world.
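The weekly topic-strength computation described in the Lostpedia Topic Trends section above can be sketched as follows. Splitting each word occurrence across topics in proportion to P(w|z) is one plausible reading of the assignment rule, and the function and variable names are invented for illustration.

```python
import numpy as np

def weekly_topic_strengths(weekly_counts, phi, word_index):
    """Compute the proportion of each week's word activity assigned to each topic.

    weekly_counts: list of dicts mapping word -> frequency, one dict per week.
    phi: (T, W) array of estimated P(w | z) values.
    word_index: dict mapping word -> column index in phi.
    Each word occurrence is split across topics in proportion to P(w | z).
    """
    T = phi.shape[0]
    strengths = []
    for counts in weekly_counts:
        mass = np.zeros(T)
        for word, freq in counts.items():
            if word not in word_index:
                continue
            p = phi[:, word_index[word]]
            if p.sum() > 0:
                mass += freq * p / p.sum()   # fractional assignment to topics
        total = mass.sum()
        strengths.append(mass / total if total > 0 else mass)
    return np.array(strengths)   # shape (num_weeks, T); rows sum to 1

# Toy example (invented): 2 weeks, 2 topics, 3-word vocabulary.
phi = np.array([[0.6, 0.3, 0.1],
                [0.1, 0.2, 0.7]])
word_index = {"desmond": 0, "island": 1, "theory": 2}
weeks = [{"desmond": 5, "island": 2}, {"theory": 8, "island": 1}]
print(weekly_topic_strengths(weeks, phi, word_index))
```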
General Discussion

The Topic Model was applied to the Lostpedia wiki and found to reveal a sensible set of semantic topics, and it tracked interest in those topics in reaction to events in Lost. The Lostpedia wiki was a useful social sensemaking system to study because releases of new information in the world (from "canonical" sources including the program itself, and sanctioned media tie-ins and press) are easily determined, and the effects of these releases on the social sensemaking system can be investigated.

There are a variety of possible practical uses for models of the evolution of semantic topics in a social sensemaking system. One possible use is to provide a framework for measuring the impact of perturbations in systems on the structure and dynamics of topics, whether through new interface techniques, policies, or community structure. Similarly, such models could provide a framework for understanding how the semantic structure and dynamics have consequences for the quality of decision making, problem solving, and action.

Acknowledgments

Portions of this paper have been submitted to CHI 2008. This research was supported by a contract from the Disruptive Technology Office. I thank Kevin Croy for providing access to the Lostpedia data, and Bryan Pendleton and Bongwon Suh for computing the basic word activity data. Thanks also to Mark Steyvers and Thomas Griffiths for the public availability of their Matlab topic toolbox code.

Appendix

The Topic Model applied to the Lostpedia data used the Matlab Topic Modeling Toolbox 1.3.1, available at http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm.

Assume there are D documents, expressing T topics, and containing W words. The dataset can be represented as words

\mathbf{w} = \{w_1, w_2, \ldots, w_n\}

where each word w_i belongs to some document d_i. One way to represent this is as a word-document co-occurrence matrix containing frequencies. An observed word-document matrix is viewed as the result of stochastic draws using the probabilities depicted in Figure 1.

P(w|z) is represented as a set \Phi of T multinomial distributions over the W words, such that

P(w_i \mid z_i = j) = \phi_w^{(j)}    (A.1)

where z_i is the topic associated with the word w_i in document d_i, and the jth topic is represented as a multinomial distribution with parameters \phi^{(j)}.

P(z) is represented as a set \Theta of D multinomial distributions over the T topics, such that for each document there is a multinomial distribution with parameters \theta^{(d_i)}, and for every word w_i in document d_i the topic z_i of the word is distributed as

P(z_i = j) = \theta_j^{(d_i)}    (A.2)

The Dirichlet distribution is the Bayesian conjugate of the multinomial distribution and provides Dirichlet priors on the parameters \phi^{(j)} and \theta^{(d_i)}. Blei et al. [4] give an algorithm for finding estimates of \phi^{(j)} and hyperparameters for \theta^{(d_i)}. The approach used in this paper comes from Griffiths and Steyvers [8], which employs Gibbs sampling and single-valued hyperparameters \alpha and \beta to specify the nature of the Dirichlet priors on \theta^{(d_i)} and \phi^{(j)}.

The Topic Model provides estimates of T multinomial distribution parameters, one for each topic, such that the probability of any word w conditional on topic j is defined as

P(w \mid z = j) = \hat{\phi}_w^{(j)}    (A.3)

The Topic Model also estimates a probability distribution over topics for each document d:

P(z = j) = \hat{\theta}_j^{(d)}    (A.4)
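The Matlab toolbox code itself is not reproduced here, but a compact collapsed Gibbs sampler in the style of Griffiths and Steyvers [8] can be sketched in Python. The conditional used below, proportional to (n_w^(j) + β)(n_j^(d) + α), and the closing estimates corresponding to (A.3) and (A.4) follow the standard formulation; treat this as an illustrative sketch rather than the implementation actually used.

```python
import numpy as np

def lda_gibbs(docs, W, T, alpha, beta, iters=500, seed=0):
    """Collapsed Gibbs sampling for LDA (sketch in the style of [8]).

    docs: list of documents, each a list of word ids in [0, W).
    Returns (phi_hat, theta_hat): estimated P(w|z) and P(z|d).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    nwz = np.zeros((W, T))           # word-topic counts
    ndz = np.zeros((D, T))           # document-topic counts
    nz = np.zeros(T)                 # total words assigned to each topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]  # random init

    for d, doc in enumerate(docs):   # tally initial assignments
        for i, w in enumerate(doc):
            nwz[w, z[d][i]] += 1; ndz[d, z[d][i]] += 1; nz[z[d][i]] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]          # remove the current assignment
                nwz[w, t] -= 1; ndz[d, t] -= 1; nz[t] -= 1
                # P(z_i = j | rest) proportional to (n_wj + beta)(n_dj + alpha)
                p = (nwz[w] + beta) / (nz + W * beta) * (ndz[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t          # record the new assignment
                nwz[w, t] += 1; ndz[d, t] += 1; nz[t] += 1

    phi_hat = (nwz + beta) / (nz + W * beta)                              # (A.3)
    theta_hat = (ndz + alpha) / (ndz.sum(1, keepdims=True) + T * alpha)   # (A.4)
    return phi_hat.T, theta_hat
```

Run with T = 20, alpha = 50/T, beta = 200/W, and iters = 500, this would match the estimation settings reported for the Lostpedia model above.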
References

1. Anderson, J.R. The adaptive character of thought. Lawrence Erlbaum Associates, Hillsdale, NJ, 1990.
2. Benkler, Y. The wealth of networks: How social production transforms markets and freedom. Yale University Press, New Haven, CT, 2005.
3. Blackmon, M.H., Kitajima, M. and Polson, P.G. Web interactions: Tool for accurately predicting Website navigation problems, non-problems, problem severity, and effectiveness of repairs. CHI 2005, ACM Conference on Human Factors in Computing Systems, CHI Letters, 7(1), 2005, 31-40.
4. Blei, D.M., Ng, A.Y. and Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 2003, 993-1022.
5. Dumais, S.T. Data-driven approaches to information access. Cognitive Science, 27, 2003, 491-524.
6. Dumais, S.T., Furnas, G.W., Landauer, T.K., Deerwester, S. and Harshman, R. Using latent semantic analysis to improve access to textual information. In Conference on Human Factors in Computing Systems, CHI '88 (Washington, D.C., 1988), ACM Press, 281-285.
7. Furnas, G.W., Landauer, T.K., Gomez, L.W. and Dumais, S.T. The vocabulary problem in human-system communication. Communications of the ACM, 30, 1987, 964-971.
8. Griffiths, T.L. and Steyvers, M. Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 2004, 5228-5235.
9. Griffiths, T.L., Steyvers, M. and Tenenbaum, J.B. Topics in semantic representation. Psychological Review, 114(2), 2007, 211-244.
10. Jensen, M. A brief history of Weblogs. Columbia Journalism Review, 2003.
11. Kushner, D. The Web's best, most obsessive sites for television's most addictive shows. Rolling Stone, 2007, 34.
12. Landauer, T.K. and Dumais, S.T. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 1997, 211-240.
13. Pirolli, P. Cognitive models of human-information interaction. In Durso, F.T. ed. Handbook of applied cognition (2nd ed.), John Wiley & Sons, West Sussex, England, 2007.
14. Pirolli, P. Information foraging: A theory of adaptive interaction with information. Oxford University Press, New York, 2007.
15. Pirolli, P. The InfoCLASS model: Conceptual richness and inter-person conceptual consensus about information collections. Cognitive Studies: Bulletin of the Japanese Cognitive Science Society, 11(3), 2004, 197-213.
16. Rainie, L. 28% of online Americans have used the Internet to tag content. Pew Internet & American Life Project, 2007.
17. Steyvers, M., Griffiths, T.L. and Dennis, S. Probabilistic inference in human semantic memory. Trends in Cognitive Sciences, 2006.
18. Sunstein, C.R. Infotopia. Oxford University Press, New York, 2006.
19. Surowiecki, J. The wisdom of crowds. Random House, New York, 2004.
20. Voss, J. Measuring Wikipedia. In International Society for Scientometrics and Informetrics, ISSI 2005 (Stockholm, 2005), ISSI.