arXiv:1803.02592v4 [cs.SI] 23 Jun 2018
Foundations of Temporal Text Networks

Davide Vega and Matteo Magnani
InfoLab, Department of Information Technology, Uppsala University, Sweden
{davide.vega, [email protected]

June 26, 2018

Abstract

Three fundamental elements to understand human information networks are the individuals (actors) in the network, the information they exchange, which is often observable online as text content (emails, social media posts, etc.), and the time when these exchanges happen. An extremely large amount of research has addressed some of these aspects either in isolation or as combinations of two of them. There are also more and more works studying systems where all three elements are present, but typically using ad hoc models and algorithms that cannot be easily transferred to other contexts. To address this heterogeneity, in this article we present a simple, expressive and extensible model for temporal text networks, which we claim can be used as a common ground across different types of networks and analysis tasks, and we show how simple procedures to produce views of the model allow the direct application of analysis methods already developed in other domains, from traditional data mining to multilayer network mining.

1 Introduction

A large amount of human-generated information is available online in the form of text exchanged between individuals at specific times. Examples include social network sites, online forums and emails. The public accessibility of several of these sources allows us to observe our society at various scales, from focused conversations among small groups of individuals to broad political discussions involving heterogeneous audiences from large geographical areas [1, 2].

This information is undoubtedly very valuable, as shown for example by the large revenues of big Internet companies and by its usage during political campaigns, but it is also very complex because of its joint textual, structural and temporal nature. To cope with this complexity, researchers have typically focused on either the topology of the network, as commonly done in Network Science, or the text exchanged among individuals, using methods from Computational Linguistics. In some cases time has also been taken into consideration as in, respectively, the fields of Temporal Networks and Temporal Information Retrieval.

However, despite this broad interest in human information networks, only a limited number of works have been developed to address text, network topology and time in an integrated way and using a common data model. In our opinion, this is partly a result of the over-specialization of today's academia, and the fragmented and discipline-specific development of network research. Unfortunately, omitting any of the three basic elements of temporal text networks may lead to significant information loss and prevent a deeper understanding of the information system, as exemplified in the next section.

1.1 A motivating example

One typical usage of social media data in research is to study how information propagates online. In one of the many studies on this topic, the authors have analyzed different aspects of the propagation process considering the online reactions generated by the death of a well-known Italian TV anchorman [3]. In Figure 1 we have reproduced (a) the information propagation network, showing which posts contained information obtained by which others, (b) the text of some of the posts generated about this event, and (c) a temporal pattern indicating the number of comments per day.

While each of these pieces of information alone reveals something, by putting them together into a temporal text network (Figure 1d) we obtain a much more comprehensive understanding of the process. On the one hand, we can see that for the posts representing explicit attempts to propagate information (e.g., "Mike passed away") publication time is fundamental to determine their success, and only the first of this type of posts generated a large and sudden burst of reactions in a very short time; on the other hand, conversational posts evolving from it (e.g., "How has television changed?") can appear later and still create long but less dense chains of reactions. Other posts not present in the information propagation network neither explicitly give the news nor ask for an answer, generating no or few reactions, but still have the role of re-activating the information cascade so that even the latecomers can find a trace of it; some of these posts (e.g., "Bye grandpa Mike!" or "R.I.P.") form what has been called an online mourning ritual.

In summary, time, text and topology together can lead to a deeper understanding of how this information network evolved into its current structure and how information propagated through it.

[Figure 1 panels: (a) Topology, (b) Text, (c) Time, (d) Topology, text and time. Example posts shown: "Mike passed away!", "Bye grandpa Mike", "R.I.P.", "How has television changed?"]

Figure 1: Three elements of an online human information network: a) The topology, where each edge represents an observed information propagation path: user A writes a post about some news, user B reads the post and writes herself about it, for example by commenting on it; b) the text exchanged between users, that is, the text of posts and comments; c) the number of comments over time; d) Topology, time and text combined into a temporal text network. Only details about two posts are shown.

1.2 Contribution and outline

In this work we introduce a simple but expressive and easily extensible model for temporal text networks, and define two main approaches to analyze this type of data. We also show how existing primitive data manipulation operations for multilayer networks can be composed to easily construct new algorithms for temporal text networks.

Our claim is that such a model can play a role similar to that of other recent attempts to unify related areas of network science, such as multilayer networks, which have boosted research in already existing fields (e.g., multiplex network analysis) by showing that results in one area could be directly applied to other types of data now expressed using a uniform terminology and mathematical form. Our objective is to define an essential model, with a minimal number of features, so that several existing models can be unified into it without a significant increase in model complexity. We also believe that a unified model will promote the development of software libraries providing different data analysis functions for temporal text networks inside a single system, from centrality measures to community detection and generative models.

The article is organized as follows. In the next section we present an overview of related work, highlighting how a large amount of research has been produced to analyze human information networks. As the main objective of this article is to introduce a data model for temporal text networks, our overview of the state of the art focuses on the data models already introduced in the literature, to allow a precise comparison with our model. In Section 3 we define our model as a simple attributed bipartite network. We also show how this simple model can be used to represent many existing types of text-based interactions, such as direct messages, multicast and broadcast. In addition, we show how to express different types of information networks using our model, and how to extend it with additional features. Finally, we provide a detailed comparison of our model with the ones presented in the state of the art, showing how some existing models can be expressed using ours, while others can be obtained by applying some lossy processing to ours, e.g., replacing the exchanged text with a bag of words, a set of topics, a sentiment, etc. Section 4 explains how the model can be used in data analysis. We show how the direct manipulation of the model can be complemented by two additional types of analysis: continuous and discrete. In the continuous case, time and text are treated as points in a metric space, and analysis operations are based on the computation of similarities between these points. In the discrete case, discretization operations (such as time slicing and topic modeling) are applied, encoding text and time into multiple discrete layers and enabling the direct application of the large number of methods already available for multilayer networks. In Section 5 we present a practical example of our model and analysis strategies applied to Twitter data.

[...] of works using some of the models is very large, we have sometimes arbitrarily and unavoidably chosen a key set of references based on our knowledge and personal selection. Therefore, please note that in the table we only indicate selected representative references; additional references are included in the text. Figure 2 complements Table 1 providing a visual intuition of the reviewed models and of the new models introduced in this