Using Tweets As Bootstrap with Application to Autism-Related Topic Content Analysis

, and , amongst contain a URL link to this web page. As per our previous others. This processes allows us to extract relevant content observations, this means that a tweet with multiple URLs reliably without any manual input, with few errors. We found ‘inherits’ topics from multiple web pages, and equally, a web that rare errors do not affect the ultimate result of our method page lends its topics to all tweets which link to it. due to their randomness – there is no reason why any particular erroneous term (word) would consistently be included in the

0.00014 extracted text, which means that any speciﬁc error ends up being treated as noise in our bag-of-words model used as input 0.00012 to the main topic modelling and tracking algorithm described 0.00010 next.

0.00008 C. Modelling topic evolution over time 0.00006 Twitter is by its very nature highly dynamic: new content is 0.00004

Number of days (normalized) rapidly created, transient periods of high popularity of speciﬁc 0.00002 ideas are followed by their rapid decline and the emergence

0.00000 0 5000 10000 15000 20000 25000 30000 of follow-up etc. Therefore it is imperative for the analysis Number of tweets per day of the topic structure of a tweet corpus to have a temporal (a) Daily tweeting rate component, which additionally has to be sufﬁciently robust and ﬂexible to capture the range of complex interactions and 0.00035 causal cascades that the corresponding exchanges exhibit. 0.00030 Our approach begins by discretizing time into epochs 0.00025 thereby dividing the tweet corpus into a series of chronological

0.00020 but overlapping sub-corpora. The topic structure of each sub- corpus associated with a particular epoch is then extracted 0.00015 separately using an HDP-based model. In other words each 0.00010

Number of days (normalized) epoch is considered to span a sufﬁciently short time period 0.00005 that the associated sub-corpus of tweets can be treated as

0.00000 a static collection of documents. To ensure the validity of 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 Number of URLs per day this assumption it is crucial that the time span of an epoch (b) Daily in-tweet URL rate is chosen based on the speed of changes which the model is supposed to track (to reflect the highly dynamic nature of Fig. 3. The statistics of relevant data availability rate, estimated from our Twitter in this work we used three day epochs, with an overlap data set (please see Sec. III for additional detail). Shown are histograms corresponding to (a) the daily number of tweets identified as ASD-related [11], of successive epochs of two days). To associate topics across and (b) the daily count of URLs harvested from these tweets. different epochs we create a similarity graph, connecting topics (as nodes) across successive epochs, and infer complex We retrieve tweets using Twitter’s stream and search APIs dynamics based on the strengths of these connections. This (please see Sec. III for further detail). This allows us to fetch approach, first proposed in [10], is explained next. 1) Measuring inter-topic similarity: The key idea behind our approach stems from the observation that while topics may change significantly over time, by their very nature the change between successive epochs is limited. Therefore we infer the continuity of a topic in one epoch by relating it to all topics in the immediately subsequent epoch which are sufficiently similar to it under some similarity measure. This can be seen to lead naturally to a similarity graph representation whose nodes correspond to topics and whose edges link those topics in two epochs which are related. Formally, the weight of the directed edge that links φ , the j-th topic in epoch t, and φ is set Fig. 4. A conceptual illustration of a similarity graph that models topic j,t k,t+1 dynamics over time. A node corresponds to a topic in a specific epoch; the equal to ρ (φj,t, φk,t+1) where ρ is an appropriate similarity weight of an edge connecting two nodes is equal to the similarity of the measure. In this work we adopt the use of the Jaccard corresponding topics. similarity between the probability distributions described by φj,t and φk,t+1 i.e. ρ (φj,t, φk,t+1) ≡ dJaccard (φj,t, φk,t+1). A conceptual illustration of a similarity graph is shown in of its draws, the DP is a highly useful prior for Bayesian Fig.4. It shows three consecutive time epochs t − 1, t, and mixture models. By associating different mixture components t + 1 and a selection of topics in these epochs. Graph edge with atoms φk of the so-called stick-breaking process [31], weight i.e. inter-topic similarity is encoded by varying the iid and assuming xi|φk ∼ F (xi|φk) where F (.) is the likelihood thickness of the corresponding line connecting two nodes – a kernel of the mixing components, the Dirichlet process mixture thicker line signifies more similar topics. We use a threshold model (DPM) can be formulated. The DPM is suitable for to eliminate weak edges automatically, retaining only the nonparametric clustering of exchangeable data in a single edges which correspond to sufficiently similar topics in group e.g. words in a document where the DPM models the adjacent epochs. It can be seen that this readily allows us to underlying structure of the document with potentially an infi- detect the disappearance of a particular topic, the emergence nite number of topics. However, many real-world problems are of new topics, as well as the splitting or merging of different more appropriately modelled as comprising multiple groups of topics. In summary: exchangeable data (e.g. a collection of documents). In such Emergence If a node does not have any edges incident cases it is usually desirable to model the observations of to it, the corresponding topic is taken as having emerged different groups jointly, allowing them to share their generative in the associated epoch (e.g. φj+2 at time t in Fig.4). clusters. This idea is known as the sharing statistical strength and is achieved using a hierarchical structure. Disappearance If no edges originate from a node, the Amongst different ways of linking group-level DPMs, corresponding topic is taken to vanish in the associated HDP [48] offers an interesting solution whereby base measures epoch (e.g. φj at time t in Fig.4). of document-level DPs are drawn from another DP. In this way the atoms of the corpus-level DP (i.e. topics in our case) are Splitting If more than a single edge originates from a shared across the corpus. Formally, if x = {x1,..., xJ } is a node, the corresponding topic is understood as being document collection where xj = xj1, . . . , xjNj is the j-th split into multiple topics in the next epoch (e.g. φi is document comprising Nj words, each document is modelled iid split into φj and φj+1 in Fig.4). with a DPM Gj|α0,G0 ∼ DP (α0,G0) where its DP prior is further endowed by another DP G0|γ, H ∼ DP (γ, H). Merging If more than a single edge is incident to a node, Since the base measure of Gj is drawn from G0, it takes the the topics of the nodes from which the edges originate same support as G0. Also the parameters of the group-level are understood as having merged together to form a new mixture components, θji, share their values with the corpus- topic (e.g. φi and φi+1 merge to form φj+1 in Fig.4). level DP support on {φ1, φ2,...}. The posterior for θji follows 2) Hierarchical Dirichlet process mixture models: The a Chinese restaurant franchise process which can be used to Dirichlet process is a widely used prior for mixture modelling develop inference algorithms based on Gibbs sampling [49] which readily allows a document collection to accommodate which we adopt in the present work. an arbitrary number of topics. It is the building block of many Bayesian nonparametric methods [21]. Succinctly put, a Dirichlet process DP (γ, H) is defined as a distribution of a III.EVALUATION AND RESULTS random probability measure G over a measure space (Θ, B, µ), such that for any finite measurable partition (A1,A2,...,Ar) In this section we turn our attention to the empirical T of Θ the random vector [G (A1) ,...,G (Ar)] is a Dirichlet evaluation of the proposed framework. We first succinctly T distribution with parameters [γH (A1) , . . . , γH (Ar)] . describe our evaluation data set, and then report and discuss Owing to the discrete nature and infinite dimensionality the obtained results. A. Evaluation data days, with successive epochs exhibiting an overlap of two days. Furthermore, for the pruning of inter-epoch similarity Twitter’s Terms of Service explicitly prohibit the sharing graphs we used the Jaccard similarity threshold of τprune = 0.5 or redistribution of tweets, including for research purposes. which was determined empirically using a small training Consequently, as there was no public benchmark that we could corpus and qualitative judgement (see Sec.IV for an indication adopt we used a large data set collected by ourselves, acquired of how this threshold can be learnt directly from data). as described in [11]. For full detail the reader is referred to the original publication. Here we note the sole difference in B. Results and discussion the tweet collections used in the present work and that in [11]: We started our analysis of the results obtained with the their sizes. proposed method by examining the topic structures discovered Recall that we collected tweets using Twitter’s stream and with different epochs. As in all related work, given that the search APIs by periodically retrieving tweets which contain technical term ‘topic’ is understood to be a formalization and any of the four keywords “autism”, “adhd”, “asperger”, and a proxy for the more abstract notion of a ‘topic’ in everyday “aspie” (or any of their derivatives obtained by suffixation). speech, this analysis is inherently subjective in nature as it is Following the publication of our original work we continued not possible to define or indeed obtain an objective ground- this process and now have at our disposal a data set over truth. an order of magnitude greater than that initially described A representative selection of popular topics discovered in [11]. Our corpus now includes 5,650,989 ASD-related within an epoch is shown in Fig.5. This figure shows four tweets collected in the period starting on 26 August 2013 and representative topics by the proposed method in a single epoch ending on 1 Oct 2014 (i.e. more than 13 consecutive months). of our ASD tweet dataset. The topics are visualized using the 1) Algorithm parameters: The analytics framework we standard world-cloud representation whereby term frequency described in the preceding sections requires the values of a is proportional to the corresponding font size (note that differ- number of free parameters to be set. For the sake of com- ent colours are merely used for the sake of easier visualization pleteness and full reproducibility of the experiments reported of a large number of terms of different frequencies in limited in this paper, here we summarize how this was performed space; colour encodes no pertinent information about different and what values were used. Note that this process needs to terms). There are two key observations that are important be performed only once – the framework is fully automatic to make here. The first of these is that the extracted topics thereafter. In addition, in the discussion of possible directions are readily interpreted as being meaningful in the context for future work in Sec.IV we also outline the ideas we are of what is known about autism. For example, Topic 1 can currently working on which would allow the optimal values of be seen to relate to the persisting myth of a link between the free parameters to be learnt directly from data. This would childhood vaccination and autism development [23]. Related make the entire process automatic and remove the need for issues which concern the effects of thimerosal (a vaccine human input even to the minimal extent described here. preservative) and mercury which feature in this topic are also The first of the parameters, introduced in Sec. II-A, is the related to this myth [30]. Topic 2 concerns Chris Tuttle, a man size of the term dictionary used to represent documents as with Asperger’s syndrome working in Wegmans who attracted fixed-size bags of words/terms. It should be noted that in much media attention in November 2013 after being yelled at principle the entire set of terms extracted from the document by a customer. Topic 3 captures discussions of research on corpus could be used without any negative effects on the the relationship between autism and various health conditions quality of the extracted topic information, as the importance and reflects the increased willingness of the general public, of each term is inferred automatically by our HDP-based particularly those affected by rare and debilitating conditions, model. The sole reason for choosing a subset of terms as a to examine the scientific literature. Lastly Topic 4 pertains dictionary lies in the reduced computational overhead (both to the schooling of children with learning disabilities which time-wise and storage-wise). Following the usual procedure is one of the most common concerns of many parents with adopted in the literature (for detailed discussion see e.g. [12]) autistic children (Lou Salza, whose name features significantly we construct the dictionary using the most frequent terms in this topic, is the outspoken headmaster of a school for which account for approximately 90% of the text of the children with learning disabilities). The second important document corpus [18]. In our case this results in 6560 terms. It observation, one which strongly supports the main premise is insightful to observe the much greater size of this dictionary of the present work and demonstrates the usefulness of the than the 1500-term one constructed in exactly the same manner proposed framework, is that the topics feature more than a in [12] but using main tweet text only. This reflects the more couple of dominant (by frequency) terms. This means that expressive nature of content linked to from tweets compared they cannot be readily captured using a small set of keywords to the content within tweets themselves. such as hashtags or words extracted directly from tweets. The remaining free parameters concern the proposed algo- Having demonstrated that our method extracts meaningful rithm for temporal tracking of topic structure changes of the topics, we next sought to investigate what type of information document corpus. In particular, to reflect the highly dynamic was obtained by tracking changes in the topic structure over nature of Twitter we chose short epochs, each spanning three epochs. Here too we readily observed that the proposed as the corresponding rates of new topic emergence (“birth”), topic disappearance (”death”), merging, and splitting. It can be seen that our framework is effective in capturing the highly dynamic nature of Twitter information exchange as witnessed by the changes in the number of inferred topics per epoch across time, consistently observed emergence of new topics and disappearance of others, topic splitting and merging. It is interesting to observe the high rate of new topic emergence, compared to the rate of topic creation through the merging of old topics, which illustrates the ephemeral nature of most (a) Example topic 1 (b) Example topic 2 Twitter exchanges [28].

book autism autism naoki book autism naoki language book japanese naoki Japanese Japanese language author writer author intelligence

01-07 Oct 05-11 Oct 13-19 Oct (a) Example topic progression 1 (linear) (c) Example topic 3 (d) Example topic 4 student school student school Fig. 5. An illustration of typical topics discovered by the proposed child student child learn program child method in a single epoch of our ASD tweet dataset, shown as size-coded school social word-clouds – a larger font indicates a proportionally more probable term education education (different colours are merely used for easier visualization and encode no behaviour skill information pertaining to the corresponding terms themselves). Topic 1 can be program cancer readily related to the persistent myth of a link between childhood vaccination and autism development as well as related issues concerning the effects of 29 Oct – 04 Nov 02-08 Nov 06-11 Nov thimerosal and mercury [30], Topic 2 to the incident involving Chris Tuttle, a Wegmans employee with Asperger’s syndrome who attracted media attention (b) Example topic progression 2 (linear) in November 2013, Topic 3 to medical literature on autism, children, and brain development, and Topic 4 to the schooling of children with learning disabilities. child student school social method was powerful in extracting interesting and insightful behaviour skill vaccine autism knowledge. Illustrative examples are pictured in Fig.6. The study diagrams in Fig. 6(a) and 6(b) show examples of “linear” acetaminophe topic evolutions i.e. changes in the nature of conversation autism n mercury autism child pertaining to a single, continuously developing topic. For acetaminophen child increase http vaccine example, the evolution in Fig. 6(a) concerns the bestselling food study book “The Reason I Jump: The Inner Voice of a Thirteen-Year- cell level food http diet Old Boy with Autism” written by the autistic Japanese author autism child Naoki Higashida, and which amongst other things discusses vaccine autism product 09-15 Oct the highly emotive issues of language and intelligence [26]. child court On the other hand, Fig. 6(c) shows multiple topics merging parent 05-11 Oct to form new topics. For example, the topic on behavioural wakefield and social aspects of schooling can be seen to merge with the topic concerning the already mentioned myth about a 01-07 Oct link between vaccination and autism, to produce a topic (c) Example topic progression 3 (merging) which encompasses a combination of these issues. Lastly, Fig. 6. Examples of automatically discovered Twitter topic temporal changes: the ability of our framework to capture and track highly (a,b) linear evolution of topic content, and (c) topic creation through the dynamic changes in the topic structure, which is crucial in merging of topics from the previous epoch. Also see Fig.7 for related the analysis of a communication medium such as Twitter, is quantitative corroboration of our findings. corroborated quantitatively by the plot in Fig.7. The plot shows changes in the number of topics per epoch, as well IV. SUMMARY AND CONCLUSIONS [17] C. Chew and G. Eysenbach. Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak. PLOS ONE, 2010. In this paper we described the first work on the extraction of 3 the topic structure of tweets and the tracking of its evolution [18] A. Clauset, C. R. Shalizi, and M. E. Newman. Power-law distributions in empirical data. SIAM Review, 2009.6 over time. Although our specific aim was that of analysing [19] A. Culotta. Towards detecting influenza epidemics by analyzing Twitter autism-related Twitter content, the developed framework is messages. SOMA, 2010.1 entirely general in nature and can be applied to any longi- [20] J. T. Danial and J. J. Wood. Cognitive behavioral therapy for children with autism: Review and considerations for future research. Journal of tudinal corpus of tweets. Our work identified and elucidated Developmental & Behavioral Pediatrics, 2013.2 the inherent difficulty associated with topic modelling on [21] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1973.5 tweets – their insufficient information content. To overcome [22] D. E. Gray. Perceptions of stigma: The parents of autistic children. this limitation we proposed an ingenious framework which Sociology of Health & Illness, 1993.2 uses tweets not as endpoint data mining sources but rather as [23] L. Gross. A broken trust: lessons from the vaccineautism wars. PLoS Biology, 2009.6 intermediaries used to discover much richer associated content. [24] J. W. Harrington, L. Rosen, A. Garnecho, and P. A. Patrick. Parental The method proposed in this paper opens a rich cornucopia perceptions and use of complementary and alternative medicine practices of possible avenues for further research which we intend to for children with autistic spectrum disorders in private practice. Journal of Developmental & Behavioral Pediatrics, 2006.2 pursue. Some of these concern developments of the overall [25] A. Harshavardhan, A. Gandhe, R. Lazarus, S. H. Yu, and B. Liu. framework. For example in cases when URL targets include Predicting flu trends using Twitter data. INFOCOM, 2011.1 [26] N. Higashida. The reason I jump: the inner voice of a thirteen-year-old videos (say, on YouTube) the use of computer vision or audio boy with autism. Random House, 2013.7 processing can add valuable information for even more so- [27] I. Himelboim and J. Y. Han. Cancer talk on Twitter: community structure phisticated data mining. On the other hand, there are technical and information sources in breast and prostate cancer social networks. Journal of Health Communication, 2014.1 aspects of the proposed method which can be improved further. [28] J. Huang, K. M. Thornton, and E. N. Efthimiadis. Conversational tagging For example we intend to explore the use of ideas from in Twitter. HT, 2010.7,9 [29] C. Hutchings. Commercial use of facebook and twitterrisks and rewards. information-theory for fully automatic similarity graph prun- Computer Fraud Security, 2012.1 ing, which would eliminate the need for the free parameter, [30] A. Hviid, M. Stellfeld, J. Wohlfahrt, and M. Melbye. Association in the form of the pruning threshold, described in this work. between thimerosal-containing vaccine and autism. The Journal of the American Medical Association, 2003.6,7 We also intend to explore automatic ways of labelling topics [31] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking semantically using meaningful sentence fragments by back- priors. Journal of the American Statistical Association, 2001.5 [32] J. W. Jacobson, J. A. Mulick, and G. Green. Costbenefit estimates analysing probabilistically the collected text data for persistent for early intensive behavioral intervention for young children with ngrams across documents with shared topics [5]. autismgeneral model and single state case. Behavioral Interventions, 1998.1 REFERENCES [33] J. Jashinsky, S. H. Burton, C. L. Hanson, J. West, C. Giraud-Carrier, M. D. Barnes, and A. T. Tracking suicide risk factors through Twitter [1] Autism spectrum disorder fact sheet. American Psychiatric Publishing, in the US. Crisis: The Journal of Crisis Intervention and Suicide 2013.1 [2] F. Abel, Q. Gao, G. J. Houben, and K. Tao. Analyzing user modeling Prevention, 2014.3 [34] L. Jiang, M. Yu, M. Zhou, X. Liu, and T. Zhao. Target-dependent Twitter on twitter for personalized news recommendations. UMAP, 2011.3 [3] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau. sentiment classification. ACL, 2011.3 Lancet LSM [35] S. E. Levy, D. S. Mandell, and R. T. Schultz. Autism. , 2009.1 Sentiment analysis of Twitter data. , 2011.3 [36] J. Li and C. Cardie. Early stage influenza detection from Twitter. arXiv [4] O. Arandjelovic.´ Computationally efficient application of the generic preprint PR , 2013.1 shape-illumination invariant to face recognition from video. , 2012. [37] J. Lin and D. Ryaboy. Scaling big data mining infrastructure: the Twitter 3 ACM SIGKDD Explorations Newsletter ´ experience. , 2013.1 [5] O. Arandjelovic. Reading ancient coins: automatically identifying [38] J. H. Miles. Autism spectrum disorders–a genetics review. Genetics in denarii using obverse legend seeded retrieval. ECCV, 2012.8 Medicine ´ , 2011.1 [6] O. Arandjelovic and R. Cipolla. Achieving robust face recognition from [39] L. Mitchell, M. R. Frank, K. D. Harris, P. S. Dodds, and C. M. video by combining a weak photometric model and a learnt generic face Danforth. The geography of happiness: connecting Twitter sentiment PR invariant. , 2013.3 and expression, demographics, and objective characteristics of place. [7] S. Asur and B. A. Huberman. Predicting the future with social media. PLOS ONE WI-IAT , 2013.3 , 2010.1 [40] A. T. Newton, A. D. I. Kramer, , and D. N. McIntosh. Autism online: a [8] E. Baucom, A. Sanjari, X. Liu, and M. Chen. Mirroring the real world comparison of word usage in bloggers with and without autism spectrum in social media: Twitter, geolocation, and sentiment analysis. MNLP, disorders. SIGCHI, 2009.3 2013.1 [41] M. J. Paul and M. Dredze. You are what you tweet: analyzing Twitter [9] A. J. Baxter, T. S. Brugha, H. E. Erskine, R. W. Scheurer, T. Vos, and for public health. ICWSM, 2011.3 J. G. Scott. The epidemiology and global burden of autism spectrum [42] M. J. Paul and M. Dredze. A model for mining public health topics disorders. Psychological Medicine, 2015.2 Health ´ from Twitter. , 2012.1 [10] A. Beykikhoshk, O. Arandjelovic, D. Phung, and S. Venkatesh. Hierar- [43] B. Robinson, R. Power, and M. Cameron. An evidence based earthquake chical Dirichlet process for tracking complex topical structure evolution detector using Twitter. LPCI, 2013.3 and its application to autism research literature. PAKDD, 2015.4 [44] M. A. Russell. Mining the Social Web: Data Mining Facebook, Twitter, ´ [11] A. Beykikhoshk, O. Arandjelovic, D. Phung, S. Venkatesh, and T. Caelli. LinkedIn, Google+, GitHub, and More. O’Reilly Media, Inc., 2013.1 Data-mining Twitter and the autism spectrum disorder: a pilot study. [45] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: ASONAM, 2014.3,4,6 real-time event detection by social sensors. WWW, 2010.3 [12] A. Beykikhoshk, O. Arandjelovic,´ D. Phung, S. Venkatesh, and T. Caelli. [46] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and Using Twitter to learn about the autism community. SNAM, 2015.1,6 J. Sperling. Twitterstand: news in tweets. SIGSPATIAL GIS, 2009.3 [13] A. Bifet and E. Frank. Sentiment knowledge discovery in Twitter [47] D. Scanfeld, V. Scanfeld, and E. L. Larson. Dissemination of health streaming data. DS, 2010.3 information through social networks: Twitter and antibiotics. American NIPS [14] D. Blei and J. Lafferty. Correlated topic models. , 2006.3 Journal of Infection Control, 2010.3 [15] J. Bollen, H. Mao, and A. Pepe. Modeling public mood and emotion: [48] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Sharing clusters Twitter sentiment and socio-economic phenomena. ICWSM, 2011.3 [16] J. Chang, S. Gerrish, C. Wang, J. L. Boyd-graber, and D. M. Blei. among related groups: hierarchical Dirichlet processes. NIPS, 2004.5 Reading tea leaves: how humans interpret topic models. NIPS, 2009.3 [49] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical 40 Dirichlet processes. Journal of the American Statistical Association, 2006.5 [50] D. Trembath, S. Balandin, and C. Rossi. Crosscultural practice and autism. Journal of Intellectual and Developmental Disability, 2005.2 [51] Twitter. About. https://about.twitter.com/company. (accessed April 2015).1 death [52] S. Verma, S. Vieweg, W. J. Corvey, L. Palen, J. H. Martin, M. Palmer, 20 split A. Schram, and K. M. Anderson. Natural language processing to the res- birth cue? Extracting “situational awareness” tweets during mass emergency. m erge ICWSM, 2011.3 [53] Z. Warren, M. L. McPheeters, N. Sathe, J. H. Foss-Feig, A. Glasser,

Count per epoch and J. Veenstra-VanderWeele. A systematic review of early intensive intervention for autism spectrum disorders. Pediatrics, 2011.2

01-03 07-09 13-15 19-21 25-27 31-02 06-08 12-14 18-20 24-26 October Epoch Novem ber Fig. 7. Complex topic evolution statistics (births, deaths, merges, and splits) over a two month period (Oct–Nov 2015). It can be seen that our framework is effective in capturing the highly dynamic nature of Twitter information exchange as witnessed by the changes in the number of inferred topics per epoch across time, consistently observed emergence of new topics and disappearance of others, topic splitting and merging (also see Fig.6 for speciﬁc examples). Observe the high rate of new topic emergence, compared to the rate of topic creation through the merging of old topics, which illustrates the ephemeral nature of most Twitter exchanges [28].