NORTHEASTERN UNIVERSITY

Modeling Text Embedded Information Cascades

by

Shaobin Xu

A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of

Doctor of Philosophy

in

Computer Science

December, 2019

Abstract

Networks mediate several aspects of society. For example, social networking services (SNS) like Twitter and Facebook have greatly helped people connect with families, friends and the outside world. Public policy diffuses over institutional and social networks that connect political actors in different areas. Inferring network structure is thus essential for understanding the transmission of ideas and information, which in turn could answer questions about communities, collective actions, and influential social participants. Since many networks are not directly observed, we often rely on indirect evidence, such as the timing of messages between participants, to infer latent connections. The textual content of messages, especially the reuse of text originating elsewhere, is one source of such evidence. This thesis contributes techniques for detecting the evidence of text reuse and modeling underlying network structure. We propose methods to model text reuse with accidental and intentional lexical and semantic mutations. For lexical similarity detection, an n-gram shingling algorithm is proposed to detect “locally” reused passages, instead of near-duplicate documents, embedded within the larger text output of network nodes. For semantic similarity, we use an attention-based neural network to also detect embedded reused texts. When modeling network structure, we are interested in inferring different levels of detail: individual links between participants, the structure of a specific information cascade, or global network properties. We propose a contrastive training objective for conditional models of edges in information cascades that has the flexibility to answer those questions and is also capable of incorporating rich node and edge features. Last but not least, network embedding methods prove to be a good way to learn representations of nodes while preserving structure, node and edge properties, and side information. We propose a self-attention Transformer-based neural network, trained to predict the next activated node in a given cascade, to learn node embeddings.

First Reader: David Smith
Second Reader: Tina Eliassi-Rad
Tertiary Reader: Byron Wallace
External Reader: Bruce Desmarais

Acknowledgment

The journey to become a PhD can be daunting, frustrating, and yet wonderful. I owe so many thanks to the many great people who helped me sail through the unforgettable six years as a PhD student. First and foremost, I would like to thank David, my advisor, for bringing me to the US and giving me the opportunity to work with him on many interesting NLP topics. I am deeply grateful for his patience and guidance throughout this time. His passion and knowledge about research have greatly educated and shaped me. This thesis could not have been done without his constant advice and support. His unique perspective on many matters has also made a great impact on my life. I want to thank Tina, Byron, and Bruce for taking the time to serve on my thesis committee. Your comments and advice have greatly helped me make this thesis complete. I am so very appreciative of Byron’s detailed suggestions to revise the final draft of this thesis. I thank Professor Ryan Cordell, with whom I collaborated on part of the results related to this thesis. I am inspired by your passion for uncovering the 19th-century newspaper reprinting network and I am honored to be part of the team. I want to thank everyone from our lab – Liwen, Rui, Ansel, and Ryan. It is a privilege to have your company. Without any of you making the lab full of joy and energy, my life would have been miserable. I thank Rui for many deep discussions on work and life to keep my mind clear. I am truly grateful for all the brainstorms and discussions with Ansel in my final two and a half years to advance my research, as well as with many aspects of life in the US that I would never have known otherwise. I thank my friends outside of my lab, Bingyu, Yupeng, Chin and Bochao. Your constant help has made my life so much easier when I already had so much weight on my shoulders for the PhD. You helped to fill a lot of void during this lonely journey. Finally, I thank my mom, Xiulin, who inspires me, encourages me and loves me unconditionally and endlessly. I would not have come to the other side of the world, been able to face all the unknowns and stood on my feet, had I not had her support.

To my family and my loving friends.

Contents

Abstract

Acknowledgment

Contents

List of Figures

List of Tables

1 Introduction
1.1 Detecting Text Reuse
1.2 Network Inference from Information Cascades
1.3 Node Representation Learning
1.4 Overview of The Thesis

2 Text reuse in social networks
2.1 Local Text Reuse Detection
2.2 Efficient N-gram Indexing
2.3 Extracting and Ranking Candidate Pairs
2.4 Computing Local Alignments
2.5 Intrinsic Evaluation
2.6 Extrinsic Evaluation
2.7 Network Connections of 19c Reprints
2.7.1 Dataset description
2.7.2 Experiment
2.8 Congressional Statements
2.8.1 Dataset description
2.8.2 Experiment
2.9 Conclusion

3 Semantic Text Reuse in Social Networks
3.1 Classifying text reuse as paraphrase or entailment
3.2 Method Overview
3.3 Word Representations
3.4 Contextualized Sentence Representation
3.5 Attention
3.6 Final output
3.7 Objective function
3.8 Experiments
3.8.1 Datasets
3.8.2 Models
3.8.3 Experiment settings
3.8.4 Document level evaluation
3.8.5 Sentence level evaluation
3.8.6 Ablation Test
3.9 Conclusion

4 Modeling information cascades with rich feature sets
4.1 Network Structure Inference
4.2 Log-linear Directed Spanning Tree Model
4.3 Likelihood of a cascade
4.4 Maximizing Likelihood
4.5 Matrix-Tree Theorem and Laplacian Matrix
4.6 Gradient
4.7 ICWSM 2011 Webpost Dataset
4.7.1 Dataset description
4.7.2 Feature sets
4.7.3 Result of unsupervised learning at cascade level
4.7.4 Result of unsupervised learning at network level
4.7.5 Enforcing tree structure on the data
4.7.6 Result of supervised learning at cascade level
4.8 State Policy Adoption Dataset
4.8.1 Dataset description
4.8.2 Effect of proximity of states
4.9 Conclusion

5 Modeling information cascades using self-attention neural networks
5.1 Node representation learning
5.2 Information cascades as DAGs
5.3 Graph self-attention network
5.3.1 Analogy to language modeling
5.3.2 Graph self-attention layer
5.3.3 Graph self-attention network
5.3.4 Senders and receivers
5.3.5 Hard attention
5.3.6 Edge prediction
5.4 Experiments
5.4.1 Datasets
5.4.2 Baselines
5.4.3 Experimental settings
5.4.4 Node prediction
5.4.5 Edge prediction
5.4.6 Effect of texts as side information

6 Conclusion
6.1 Future Work
6.1.1 Text Reuse
6.1.2 Network Structure Inference

Bibliography

List of Figures

2.1 Average precision for aligned passages of different minimum length in characters. Vertical red lines indicate the performance of different parameter settings (see Table 2.1).
2.2 (Pseudo-)Recall for aligned passages of different minimum lengths in characters.
2.3 Newspaper issues mentioning “associated press” by year, from the Chronicling America corpus. The black regression line fits the raw number of issues; the red line fits counts corrected for the number of times the Associated Press is mentioned in each issue.
2.4 Reprints of John Brown’s 1859 speech at his sentencing. Counties are shaded with historical population data, where available. Even taking population differences into account, few newspapers in the South printed the abolitionist’s statement.

3.1 The overview of the structure of the Attention Based Convolutional Network (ABCN)
3.2 Unrolled Vanilla Recurrent Neural Network
3.3 An illustration of the ConvNet structure used in ABCN with a toy example with word embeddings in R^5, kernel size 2, and feature map size 3. It yields a representation for the sentence in R^3

4.1 Recall, precision, and average precision of InfoPath and DST on predicting the time-varying networks generated per day. The DST model is trained unsupervisedly on separate cascades using basic and enhanced features. The upper row uses graph-structured cascades from the ICWSM 2011 dataset. The lower row uses the subset of cascades with tree structures.

5.1 The illustration of a cascade structure as a DAG in a toy network.
5.2 Graph Self-Attention Network architecture, with L identical multi-headed self-attention layers.
5.3 Graph Sender-Receiver attention network architecture, with L identical multi-headed attention layers for senders and receivers respectively.
5.4 Graph hard self-attention network architecture, with L − 1 identical multi-headed attention layers and the last layer replaced with a reinforcement learning agent selecting mask actions.
5.5 Modified Graph Self-Attention Networks with the last layer replaced with a single-head attention sublayer to output edge predictions.

List of Tables

2.1 Parameters for text reuse detection
2.2 Correlations between shared reprints between 19c newspapers and political and other affinities. While many Whig papers became Republican, they do not completely overlap in our dataset; the identical number of pairs is coincidental.
2.3 Correlations between log length of aligned text and other author networks in public statements by Members of Congress. *p < .05, **p < .01, ***p < .001

3.1 Document level results (macro-averaged recall/precision/F1) of ABCN in comparison with baseline methods under both the whole source document vs. target sentences and source sentences vs. target sentences settings.
3.2 Sentence level results (macro-averaged recall/precision/F1) of ABCN in comparison with baseline methods under both whole source document vs. target sentences and source sentences vs. target sentences, evaluated on all target documents.
3.3 Sentence level results (macro-averaged recall/precision/F1) of ABCN in comparison with baseline methods under both whole source document vs. target sentences and source sentences vs. target sentences, evaluated on only matched documents.
3.4 Ablation test where the Bi-LSTM layer is removed to test the efficacy of contextualized sentence representation, on both document and sentence level datasets, in comparison to the proposed model structure.

4.1 Cascade-level inference of DST with different feature sets, in an unsupervised learning setting, in comparison with a naive attach-everything-to-earliest baseline for original cascades extracted from the ICWSM 2011 dataset.
4.2 Cascade-level inference of DST with different feature sets, in an unsupervised learning setting, in comparison with a naive attach-everything-to-earliest baseline for tree-structure-enforced cascades extracted from the ICWSM 2011 dataset.
4.3 Cascade-level inference of DST with different feature sets, in a supervised learning setting for merged cascades for tree-structure-enforced cascades extracted from the ICWSM 2011 dataset.
4.4 Comparison of MultiTree, InfoPath and DST on inferring a static network on the original ICWSM 2011 dataset. The DST model is trained and tested unsupervisedly on both separate cascades and merged cascades using different feature sets and the naive attach-everything-to-earliest-node baseline.
4.5 Comparison of MultiTree, InfoPath and DST on inferring a static network on the modified ICWSM 2011 dataset with enforced tree structure. The DST model is trained and tested unsupervisedly on both separate cascades and merged cascades using different feature sets and the naive attach-everything-to-earliest-node baseline.
4.6 Logistic regression of networks inferred by DST and InfoPath on independent networks: geographical distance between states and contiguity of states. The statistical significance is at the < 0.05 level according to the QAP p-value against the indicated null hypothesis.

5.1 Statistics of datasets. Memes (Ver. 1) is from Wang et al. (2017a).
5.2 Comparison of variants of GSAN and baseline models on the Digg and Memes (Ver. 1) datasets. The results of baseline models are from Wang et al. (2017a).
5.3 Accuracy of GSAN variants on the Digg and Memes (Ver. 1) datasets. Accuracy is listed as a percentage.
5.4 Comparing GSAN variants versus naive baselines and DST on edge prediction on cascade-level structures using macro-averaged recall, precision and F1 score. The lengths of cascades in Memes (Ver. 2) are restricted to between 2 and 30. The right part of the table restricts the lengths to be between 5 and 30.
5.5 Comparison between GSAN variants and their corresponding models with additional nodal text features.

Chapter 1

Introduction

Networks mediate several aspects of society, and studies of networks abound because of what a better understanding of them makes possible. For example, research on social network services such as Twitter and Facebook helps understand communities or influential groups, promote viral marketing, reveal connections among social and political movements, and so on. Apart from observing the links between different social network participants, we can also obtain much side information, e.g., the time when the interaction took place, the content or label that each participant contributes, or the attributes of the links. This side information provides indirect evidence for social ties. Text reuse by different participants in this case becomes one of many revealing forms of shared behavior. For example, political speech—be it on television, on the floor of the legislature, in printed quotations, or on politicians’ websites and social media feeds—uses common tropes and turns of phrase that groups of politicians use to describe an issue (Grimmer and Stewart, 2013). One might even discover a list of “talking points” underlying this common behavior.1 One might be reading the literature in a scientific field and find that a paper starts getting cited or paraphrased repeatedly. Which previous paper, or papers, introduced the new technique? Or perhaps one reads several news stories about a new product from some company and then finds that they all share text with a press release put out by the company. Methods to uncover invisible links among sources of text have broad applicability because of the very general nature of the problem—sources of text include websites, newspapers, individuals, corporations, political parties, and so on. Further, discerning those hidden links between sources can provide more effective ways of identifying the provenance and diverse sources of information, and of building predictive models of the diffusion of information. There are substantial challenges, however, in building methodologies to uncover text reuse and model sharing behavior. In this introductory chapter, we will discuss some of the challenges that we face and how we aim to address those issues. In particular, we will consider how to detect text reuse, both lexically and semantically, in §1.1. Then, we will see how to model information sharing, using text as an example, in §1.2 by observing information cascades. Each cascade comprises a number of social actors receiving (being activated by) a related piece of information in a sequential fashion. Finally, projecting the representations of nodes in a network into a continuous dense space while preserving first- or second-order proximity can be useful for many tasks, such as node classification, node clustering, link prediction, etc. In §1.3 we will see that node representations can also be learned from information cascades.

1 The U.S. State Department, for example, produced a much-discussed set of talking points memos in response to the 2012 attack in Benghazi.

1.1 Detecting Text Reuse

As mentioned above, several situations can give rise to text reuse. For example, Linder et al. (2018) show that bills introduced by ideologically similar sponsors exhibit a high degree of text reuse, that bills classified by the National Conference of State Legislatures as covering the same policies exhibit a high degree of text reuse, and that rates of text reuse between states correlate with policy diffusion network ties between states. Such reuse is usually observable at the lexical level and can be detected by an alignment algorithm, such as the Smith-Waterman algorithm (Smith et al., 1981). The following pair shows one of the sample alignments from Linder et al. (2018) for state legislative bills:

nj_214_A1167: “the entire credit may not be taken for the taxable year in which the renewable energy property is placed in service but must be taken in five equal installments beginning with the taxable year in which the” nc_2009_SB305: “the entire credit may not be taken for the taxable year in which the costs are paid---- but must be taken in five equal installments beginning with the taxable year in which the”

The words in red indicate mismatches, while those in green are gaps introduced by the alignment algorithm to indicate insertions and deletions. Non-highlighted passages are matches. There are, however, several issues making the study of text reuse challenging, including: scalable detection of reused passages; identification of appropriate statistical models of text mutation; inference methods for characterizing missing nodes that originate or mediate text transmission; link inference conditioned on textual topics; and the development of testbed datasets through which predictions of the resulting models might be validated against some broader understanding of the processes of transmission. We propose an n-gram shingling algorithm to detect “locally” reused passages, instead of near-duplicate documents, embedded within the larger text output of social network nodes. We then explore the correlation of links revealed by text sharing behavior with various types of social ties in different datasets. An n-gram shingling algorithm detects text reuse from surface lexical overlap, whereas the semantic content of passages, along with their surrounding context, is often more revealing of the path over which the information propagates. For example, information may accumulate editorial changes, typos, and other mutations as it is passed around. The reuse of the same mutation, rather than the original text, could be an indicator of a link between those network participants. An algorithm based on lexical analysis is not the best choice under this circumstance: if we require exact matches, we miss many repeated passages, whereas if we lower the proportion of overlapping text that counts as a match, we generate more noise. In such cases, an approach that can detect semantic reuse is more suitable. We propose to extract local texts that are semantically similar given a pair of texts. Lexical similarity is just one of two forms of text reuse in social networks. The other form is semantic similarity, which includes paraphrase and textual entailment. This usually happens in academic research, where authors cite other authors’ contributions by rephrasing them; in news media, where one source cites a press release or another outlet with semantically similar sentences; and in online social networks, where people often describe reported incidents in their own words. The following snippets are from a press release about a study on obesity and a news article that cites the press release:

Press release: “The students then received instruction on the causes and treatments of obesity, with follow up testing on their knowledge and attitudes toward obesity for every year of medical school. Those who completed the program significantly reduced their bias by an average of 7 percent.”

News article: “The study, recently published online in The Journal of the American Osteopathic Association, found that when medical students are taught a specific curriculum aimed at better understanding the causes behind, and treatments for, obesity, the students’ innate obesity prejudice dropped by an average of 7 percent.”

In this example we can observe the news article paraphrasing the press release. This problem can be more complicated than lexical text reuse, in that we cannot necessarily use a naive alignment algorithm for detection. In the example given above, the sentence is reordered and words are changed (e.g., “received instruction” → “are taught”, and “bias” → “prejudice”). If only pairs of potentially similar texts are given, many efforts have been made to better predict the similarity score (Dai et al., 2018; Pang et al., 2017; Devlin et al., 2018). However, what are observed in social networks are usually documents, or passages with unknown boundaries, instead of mere sentences. With irrelevant surrounding contexts in these observed documents, those models fail to make a correct prediction. We propose an attention-based convolutional network that uses BERT (Devlin et al., 2018), the state-of-the-art model on text similarity classification tasks, to provide contextualized representations of words. We then use a convolutional neural network (CNN) to infer a fixed-length representation for sentences of varying length, given the good performance of CNN-based models on text similarity classification tasks (Hu et al., 2014; Pang et al., 2016; Dai et al., 2018), followed by a bidirectional Long Short-Term Memory (LSTM: Hochreiter and Schmidhuber, 1997) unit to capture contextualized sentence representations. The proposed model also uses an attention mechanism (Bahdanau et al., 2015), which guides the model to “look at” similar sentences among irrelevant contexts. We compare this model with a pre-trained and a fine-tuned BERT model on the task of recovering citations between scientific papers in the ACL Anthology Corpus.
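To make the kind of pipeline just described more concrete, the following is a simplified sketch in PyTorch: contextualized word embeddings (assumed precomputed, e.g., by BERT) are convolved and max-pooled into fixed-length sentence vectors, a BiLSTM contextualizes the sentences within their document, and an attention layer scores each target sentence against the source sentences. The class name, layer sizes, and scoring head are illustrative; this is not the exact ABCN configuration of Chapter 3.

```python
import torch
import torch.nn as nn

class SentenceReuseScorer(nn.Module):
    """Illustrative sketch (not the exact ABCN of Chapter 3): contextual word
    embeddings -> CNN sentence vectors -> BiLSTM over sentences -> attention."""

    def __init__(self, d_word=768, d_sent=256, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(d_word, d_sent, kernel_size=kernel, padding=kernel // 2)
        self.lstm = nn.LSTM(d_sent, d_sent // 2, bidirectional=True, batch_first=True)

    def encode(self, word_embs):
        # word_embs: (num_sents, max_words, d_word), e.g., BERT outputs per sentence.
        h = torch.relu(self.conv(word_embs.transpose(1, 2)))  # convolve over word positions
        sents = h.max(dim=2).values                           # one fixed-length vector per sentence
        ctx, _ = self.lstm(sents.unsqueeze(0))                # contextualize across the document
        return ctx.squeeze(0)                                 # (num_sents, d_sent)

    def forward(self, src_word_embs, tgt_word_embs):
        src, tgt = self.encode(src_word_embs), self.encode(tgt_word_embs)
        # Each target sentence attends over the source sentences; a sharply peaked
        # attention row suggests that target sentence reuses one source sentence.
        scores = tgt @ src.T / src.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1), scores
```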

1.2 Network Inference from Information Cascades

Observing the behavior of text reuse takes us only part way towards modeling the underlying networks. We might be interested in network structures at differing levels of detail. We might be interested in individual links, e.g., knowing which previous paper it was that later papers were mining for further citations; in cascades, e.g., knowing which news stories are copying from which press releases or from each other; and in networks, e.g., knowing which politicians are most likely to share talking points or which newspapers are most likely to publish press releases from particular businesses or universities. Depending on our data source, some of these structures could be directly observed. With the right API calls, we might observe retweets (links), chains of retweets (cascades), and follower relations (networks) on Twitter. We might also be interested in inferring an underlying social network for which the Twitter follower relation is partial evidence. In contrast, politicians interviewed on television do not explicitly cite the sources of their talking points, which must be inferred. Documenting an information diffusion process often reduces to keeping track of when nodes (newspapers, bills, people, etc.) mention a piece of information, reuse a text, get infected, or exhibit a contagion in a general sense. When the structure of the propagation of a contagion is hidden and we cannot tell which node infected which, all we have is the result of the diffusion process—that is, the timestamp and possibly other information related to when the nodes get infected. We want to infer the diffusion process itself by using such information to predict the links in an underlying network. There have been increasing efforts to uncover and model different types of information cascades on networks (Brugere et al., 2016): modeling hidden networks from observed infections (Stack et al., 2012; Rodriguez et al., 2014), modeling topic diffusion in networks (Gui et al., 2014), predicting social influence on individual mobility (Mastrandrea et al., 2015) and so on. This prior work all focused on using parametric models of the time differences between infections. Such models are useful when the only information we can get from the result of the diffusion process is the timestamps of infections. We can hope to make better predictions, however, with access to additional features, such as the location of each node, the similarity between the messages received by two nodes, etc. Popular parametric models cannot incorporate these features into unsupervised training. We propose an edge-factored, conditional log-linear directed spanning tree (DST) model with an unsupervised, contrastive training procedure to infer the link structure of information cascades. The advantage of conditioning the model on the observed sequence of nodes in a cascade is that we can easily include various features, including those reflecting the reuse of texts, in the model.
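To give a concrete sense of what an edge-factored spanning-tree model computes, the sketch below scores every candidate parent–child edge among a cascade's nodes and normalizes over all directed spanning trees (arborescences) with the directed Matrix-Tree (Tutte) theorem, the same device Chapter 4 uses to keep the likelihood tractable. This is a minimal illustration with hypothetical function names; the feature-based edge scores and the contrastive training objective are not shown.

```python
import numpy as np

def spanning_tree_log_partition(edge_scores, root=0):
    """Log of the total weight of all arborescences rooted at `root`, where
    exp(edge_scores[i, j]) is the weight of edge i -> j.  By the directed
    Matrix-Tree theorem, delete the root's row and column from the Laplacian
    and take the (log-)determinant."""
    w = np.exp(edge_scores)
    np.fill_diagonal(w, 0.0)
    laplacian = -w.copy()
    np.fill_diagonal(laplacian, w.sum(axis=0))   # column sums: weight entering each node
    keep = [i for i in range(w.shape[0]) if i != root]
    _, logdet = np.linalg.slogdet(laplacian[np.ix_(keep, keep)])
    return logdet

def cascade_log_likelihood(edge_scores, parents, root=0):
    """Log-probability of one candidate tree over a cascade's nodes, given as
    parents[child] = parent, under an edge-factored log-linear model."""
    score = sum(edge_scores[p, c] for c, p in parents.items())
    return score - spanning_tree_log_partition(edge_scores, root)

# Toy usage: 4 activated nodes, with node 0 acting as the (virtual) root.
scores = np.random.randn(4, 4)
print(cascade_log_likelihood(scores, parents={1: 0, 2: 1, 3: 1}))
```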

1.3 Node Representation Learning

The modeling assumption of a directed spanning tree for cascade structure poses a few problems. In many scenarios, for example, a node is more realistically affected by multiple previously activated nodes, which violates the tree constraint on cascade structure. Therefore, our DST model might fail to capture the many-to-one relationships of “complex contagion”. A directed acyclic graph (DAG) is more suitable for capturing such relationships unfolding in time. A DAG's prohibition on directed cycles ensures that, within a given cascade, a piece of information cannot flow back to itself. However, according to Cui et al. (2018), representing the network using the traditional graph structure has several issues: high computational complexity, low parallelizability, inapplicability of machine learning methods such as deep neural networks, etc. We therefore do not intend to directly model the DAG structure, but find an alternative model to build in the assumption of the cascade structure being a DAG. Recently, network embeddings have become popular: instead of learning the edges of the underlying network, this method embeds nodes into a continuous latent space that tries to preserve structure, node and edge properties, and side information such as texts. It also enables applications to several tasks such as node classification, node clustering, link prediction and so on. However, many of the efforts are on learning the node representations from fully-observed networks as a whole, such as DeepWalk (Perozzi et al., 2014), Node2Vec (Grover and Leskovec, 2016), and GraphSAGE (Hamilton et al., 2017). Among the relatively fewer network embedding methods utilizing information cascades are Embedded-IC (Bourigault et al., 2016), DeepCas (Li et al., 2017) and Topo-LSTM (Wang et al., 2017a). Many of these models of information cascades require knowledge of the underlying network structure or the transmission probabilities between nodes, which can be expensive or impractical to obtain for certain social networks. Learning to predict the next active nodes in a cascade could be of interest to several applications. For example, in public policy diffusion networks, having a model that predicts links between states might help qualitative analysis and visualization; however, predicting the next state that might adopt a certain policy could prove to be a more valuable result, along with the fact that we can actually assess the effectiveness of such a model. To take another example, in viral marketing, the bidders on advertisements are usually more interested in knowing which target groups are more likely to get the information next, instead of how the information is passed. We therefore introduce a network embedding model that uses the observed relative order of nodes in a cascade to guide the training process, which is unsupervised with regard to the underlying network structure. Our model also builds in the assumption that the cascade structure is a DAG, instead of a directed spanning tree. Finally, by the nature of the model, it is easy to include different nodal side information, such as texts, in the learning and inference stages.
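To make the training signal concrete, here is a minimal PyTorch sketch of predicting the next activated node from a cascade prefix with masked self-attention. It only illustrates the language-modeling analogy; the actual model of Chapter 5 adds DAG-structured attention masks, sender/receiver variants, and nodal text features, and the hyperparameters below are placeholders.

```python
import torch
import torch.nn as nn

class NextNodePredictor(nn.Module):
    """Sketch: masked self-attention over a cascade prefix predicts the next node."""

    def __init__(self, num_nodes, d_model=64, nhead=4, num_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(num_nodes, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, num_nodes)

    def forward(self, cascade):
        # cascade: (batch, seq_len) node ids in order of activation.
        seq_len = cascade.size(1)
        positions = torch.arange(seq_len, device=cascade.device)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.encoder(self.embed(cascade) + self.pos(positions), mask=causal)
        return self.out(hidden)  # logits over the next node at every position

# Training signal: predict position t+1 from positions <= t (cross-entropy).
model = NextNodePredictor(num_nodes=1000)
cascades = torch.randint(0, 1000, (8, 12))
logits = model(cascades[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), cascades[:, 1:].reshape(-1))
```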

1.4 Overview of The Thesis

In Chapter 2, we elaborate on detecting local lexical text reuse in social networks in an efficient way. We perform an intrinsic and extrinsic evaluation of the methods.

In Chapter 3, we show how to use an attention-based neural network to select similar sentences from irrelevant surrounding contexts and evaluate the model on the ACL Anthology Corpus.

Chapter 4 shows how we model the structure of a cascade as a directed spanning tree and the effectiveness of including side information.

Chapter 5 introduces a self-attention network embedding model based on Transformers to learn node representations. We use several real-life social networks for evaluation.

Finally, we summarize the modeling contributions of this thesis in the concluding Chapter 6.

Chapter 2

Text reuse in social networks

As noted in the introduction, we are interested in detecting passages of text reuse (poems, stories, political talking points) that comprise a small fraction of the containing documents (newspaper issues, political speeches). Using the terminology of biological sequence alignment, we are interested in local alignments between documents. In text reuse detection research, two primary methods are n-gram shingling and locality-sensitive hashing (LSH) (Henzinger, 2006). The need for local alignments makes LSH less practical without performing a large number of sliding-window matches. In contrast to work on near-duplicate document detection and to work on “meme tracking” that takes text between quotation marks as the unit of reuse (Leskovec et al., 2009; Suen et al., 2013), here the boundaries of the reused passages are not known. Also in contrast to work on the contemporary news cycle and blogosphere, we are interested both in texts that are reprinted within a few days and after many years. We thus cannot exclude potentially matching documents for being far removed in time. Text reuse that occurs only among documents from the same “source” (run of newspapers; Member of Congress) should be excluded. Similarly, Henzinger (2006) notes that many of the errors in near-duplicate webpage detection arose from false matches among documents from the same website that shared boilerplate navigational elements.

2.1 Local Text Reuse Detection

Two approaches to text reuse detection have been broadly explored. One approach is hashing subsequences of words in documents to construct fingerprints of the document. This approach is known to work well for copy detection. Shivakumar and Garcia-Molina (1995, 1998) and Broder (1997) present generic frameworks for working with a fingerprinting approach, and various selection algorithms have been proposed since (Manber et al., 1994; Heintze et al., 1996; Brin et al., 1995; Schleimer et al., 2003), due to the computational complexity of handling many fingerprints. Broder et al. (1997) propose to use super-shingling, which hashes sequences of fingerprints again. Charikar (2002) introduced a hashing algorithm based on random projections of words in documents. Henzinger (2006) compares the methods proposed by Charikar and Broder et al. on a large collection of web pages and proposes a combined algorithm. Schleimer et al. (2003) propose a fingerprinting algorithm named Winnowing, which uses a second window size w: for each consecutive sequence of w shingles, it outputs the shingle with the smallest fingerprint value, or the right-most such shingle if no unique smallest shingle exists. Seo and Croft (2008) proposed two new algorithms: Revised Hash-breaking, which breaks the document into non-overlapping (text) segments at the hashed tokens whose value is divisible by some integer p and fingerprints again all the tokens in the segments; and DCT (discrete cosine transformation), which is based on hash-breaking but applies a DCT to sequences of text segments and quantizes each coefficient, from which the final fingerprint is formed. Abdel-Hamid et al. (2009) include the methods from Seo and Croft (2008) and propose a new algorithm, Hailstorm, which selects a shingle s iff the minimum fingerprint value of all k fingerprinted tokens in s occurs at the first or last position of s. Mittelbach et al. (2010) propose a selection strategy, Anchor, which locates and hashes certain predefined strings (anchors) in a given document.

Another approach is computing similarities between documents. Shivakumar and Garcia-Molina (1995) suggest similarity measures based on the relative frequency of words between documents. Hoad and Zobel (2003) extend the notion and explore different variants of identity measures based on the cosine measure and different parameters for fingerprinting algorithms, such as generation, granularity, and resolution, for full-document duplication detection.
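As an illustration of the selection step in the Winnowing scheme described above, the sketch below picks, from each window of w consecutive shingle fingerprints, the smallest value (right-most on ties). Computing the fingerprints themselves (hashing k-grams) is assumed to happen upstream, and the function name is ours.

```python
def winnow(fingerprints, w):
    """Winnowing selection (Schleimer et al., 2003): from each window of w
    consecutive shingle fingerprints, keep the smallest value, taking the
    right-most occurrence on ties.  Returns (position, fingerprint) pairs."""
    selected = []
    for start in range(len(fingerprints) - w + 1):
        window = fingerprints[start:start + w]
        smallest = min(window)
        # right-most occurrence of the minimum within this window
        pos = start + max(i for i, v in enumerate(window) if v == smallest)
        if not selected or selected[-1] != (pos, smallest):
            selected.append((pos, smallest))
    return selected

# e.g., winnow([77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39], w=4)
```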

2.2 Efficient N-gram Indexing

The first step in our approach is to build for each n-gram feature an inverted index of the documents where it appears. As in other duplicate detection and text reuse applications, we are only interested in n-grams shared by two or more documents. The index, therefore, does not need to contain entries for the n-grams that occur only once. We use the two-pass space-efficient algorithm described by Huston et al. (2011), which, empirically, is very efficient on large collections. In a first pass, n-grams are hashed into a fixed number of bins. On the second pass, n-grams that hash to bins with one occupant can be discarded; other postings are passed through. Due to hash collisions, there may still be a small number of singleton n-grams that reach this stage. These singletons are filtered out as the index is written. In building an index of n-grams, an index of (n-1)-grams can also provide a useful filter. No 5-gram, for example, can occur twice unless its constituent 4-grams occur at least twice.
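A minimal sketch of this two-pass strategy is given below; it keeps only per-bin counts in the first pass, writes postings only for n-grams whose bin saw more than one occurrence, and drops residual collision-induced singletons at the end. The (n-1)-gram filter and the on-disk index format are omitted, and the function name is illustrative.

```python
from collections import defaultdict

def build_repeated_ngram_index(docs, n, num_bins=2**24):
    """Sketch of two-pass n-gram indexing.  `docs` maps a document id to a list
    of tokens.  Pass 1 counts occurrences per hash bin (bounded memory); pass 2
    keeps postings only for n-grams whose bin saw at least two occurrences;
    singletons kept only because of hash collisions are dropped at the end."""
    def ngrams(tokens):
        return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    bin_counts = defaultdict(int)                      # pass 1
    for tokens in docs.values():
        for gram in ngrams(tokens):
            bin_counts[hash(gram) % num_bins] += 1

    index = defaultdict(list)                          # pass 2
    for doc_id, tokens in docs.items():
        for pos, gram in enumerate(ngrams(tokens)):
            if bin_counts[hash(gram) % num_bins] > 1:
                index[gram].append((doc_id, pos))

    return {g: posts for g, posts in index.items() if len(posts) > 1}
```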

2.3 Extracting and Ranking Candidate Pairs

Once we have an inverted index of the documents that contain each (skip) n-gram, we use it to generate and rank document pairs that are candidates for containing reprinted texts.

Each entry, or posting list, in the index may be viewed as a set of pairs (d_i, p_i) that record the document identifier and position in that document of that n-gram. Once we have a posting list of documents containing each distinct n-gram, we output all pairs of documents in each list. We suppress repeated n-grams that appear in different issues of the same newspaper. These repetitions often occur in editorial boilerplate or advertisements, which, while interesting, are outside the scope of this project. We also suppress n-grams that generate more than u(u − 1)/2 pairs, where u is a parameter. These frequent n-grams are likely to be common fixed phrases. Filtering terms with high document frequency has led to significant speed increases with small loss in accuracy in other document similarity work (Elsayed et al., 2008). We then sort the list of repeated n-grams by document pair, which allows us to assign a score to each pair based on the number of overlapping n-grams and the distinctiveness of those n-grams. Table 2.1 shows the parameters for trading off recall and precision at this stage.

n    n-gram order
w    maximum width of skip n-grams
g    minimum gap of skip n-grams
u    maximum distinct series in the posting list

Table 2.1: Parameters for text reuse detection
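A sketch of this pair-extraction step, using the parameters in Table 2.1, is shown below: it walks the posting lists, skips n-grams that span more than u distinct series, suppresses same-series pairs, and ranks the remaining document pairs by the number of n-grams they share. The data structures and names are illustrative, and the distinctiveness weighting mentioned above is omitted.

```python
from collections import defaultdict
from itertools import combinations

def rank_candidate_pairs(index, series_of, u):
    """Extract and rank candidate document pairs.  `index` maps an n-gram to its
    (doc_id, position) postings; `series_of` maps a document id to its newspaper
    series (or author).  N-grams spanning more than u distinct series are
    skipped, pairs within the same series are suppressed, and remaining pairs
    are ranked by how many n-grams they share."""
    pair_counts = defaultdict(int)
    for gram, postings in index.items():
        docs = sorted({doc_id for doc_id, _ in postings})
        if len({series_of[d] for d in docs}) > u:
            continue                                   # likely a common fixed phrase
        for d1, d2 in combinations(docs, 2):
            if series_of[d1] != series_of[d2]:
                pair_counts[(d1, d2)] += 1
    return sorted(pair_counts.items(), key=lambda kv: -kv[1])
```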

2.4 Computing Local Alignments

The initial pass returns a large ranked list of candidate document pairs, but it ignores the order of the n-grams as they occur in each document. We therefore employ local alignment techniques to find compact passages with the highest probability of matching. The goal of this alignment is to increase the precision of the detected document pairs while maintaining high recall. Depending on the type of documents, many n-grams in matching articles will contain slight differences if the documents are obtained from an image recognition process such as OCR. Unlike some partial duplicate detection techniques based on global alignment (Yalniz et al., 2011), we cannot expect all or even most of the articles in two newspaper issues, or the text in two books with a shared quotation, to align. Rather, as in some work on biological subsequence alignment (Gusfield, 1997), we are looking for regions of high overlap embedded within sequences that are otherwise unrelated. We therefore employ the Smith-Waterman dynamic programming algorithm with an affine gap penalty. This use of model-based alignment distinguishes this approach from other work, on detecting shorter quotations, that greedily expands areas of n-gram overlap (Kolak and Schilit, 2008; Olsen et al., 2011). We do, however, prune the dynamic programming search by forcing the alignment to go through position pairs that contain a matching n-gram from the previous step, as long as the two n-grams are unique in their respective texts. Even the exact Smith-Waterman algorithm, however, is an approximation to the problem we aim to solve. If, for instance, two separate articles from one newspaper issue were reprinted in another newspaper issue in the opposite order—or separated by a long span of unrelated matter—the local alignment algorithm would simply output the better-aligned article pair and ignore the other. Anecdotally, we only observed this phenomenon once in the newspaper collection, where two different parodies of the same poem were reprinted in the same issue. In any case, our approach can easily align different passages in the same document to passages in two other documents.

The dynamic program proceeds as follows. Two documents are treated as sequences of text X and Y whose individual characters are indexed as X_i and Y_j. Let W(X_i, Y_j) be the score of aligning character X_i to character Y_j. Higher scores are better. We use a scoring function where only exact character matches get a positive score and any other pair gets a negative score. We also account for additional text appearing in either X or Y. Let W_g be the score, which is negative, of starting a “gap”, where one sequence includes text not in the other. Let W_c be the cost for continuing a gap for one more character. This “affine gap” model assigns a lower cost to continuing a gap than to starting one, which has the effect of making the gaps more contiguous. We use an assignment of weights fairly standard in genetic sequences, where matching characters score 2, mismatched characters score −1, beginning a gap costs −5, and continuing a gap costs −0.5. We leave for future work the optimization of these weights for the task of capturing shared policy ideas. As with other dynamic programming algorithms, the Smith-Waterman algorithm operates by filling in a “chart” of partial results. The chart in this case is a set of cells indexed by the characters in X and Y. We add an affine gap penalty to the standard Smith-Waterman algorithm and initialize it as follows:

H(0, 0) = 0
H(i, 0) = E(i, 0) = W_g + i · W_c                    (2.1)
H(0, j) = F(0, j) = W_g + j · W_c

The algorithm is then defined by the following recurrence relations:

  0    E(i, j) H(i, j) = max  F (i, j)     H(i − 1, j − 1) + W (Xi,Yj)   E(i, j − 1) + Wc E(i, j) = max  H(i, j − 1) + Wg + Wc   F (i − 1, j) + Wc F (i, j) = max  H(i − 1, j) + Wg + Wc

The main entry in each cell H(i, j) represents the score of the best alignment that terminates at position i and j in each sequence. The intermediate quantities E and F are used for evaluating gaps. Due to taking a max with 0, H(i, j) cannot be negative. This is what allows Smith-Waterman to ignore text before and after the locally aligned substrings of each input. After completing the chart, we then find the optimum alignment by tracing back from the cell with the highest cumulative value H(i, j) until a cell with a value of 0 is reached. These two cells represent the bounds of the sequence, and the overall Smith-Waterman algorithm score reflects the extent to which the characters in the sequences align and the overall length of the sequence. In our implementation, we include one further speedup: since in a previous step we identified n-grams that are shared between the two documents, we assume that any alignment of those documents must include those n-grams as matches. In some cases, this anchoring of the alignment might lead to suboptimal Smith-Waterman alignment scores.
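The recurrences above translate directly into code. The sketch below is a straightforward quadratic-time, quadratic-space implementation with the default weights from the text; the traceback that recovers the aligned passage and the n-gram anchoring speedup are omitted for brevity, and the function name is ours.

```python
def smith_waterman_affine(x, y, match=2.0, mismatch=-1.0, w_g=-5.0, w_c=-0.5):
    """Direct transcription of the recurrences above, with the weights from the
    text.  Returns the best local alignment score and its (i, j) end position;
    traceback and n-gram anchoring are omitted."""
    n, m = len(x), len(y)
    neg = float("-inf")
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                  # boundary conditions (Eq. 2.1)
        H[i][0] = E[i][0] = w_g + i * w_c
    for j in range(1, m + 1):
        H[0][j] = F[0][j] = w_g + j * w_c
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + w_c, H[i][j - 1] + w_g + w_c)  # gap in x
            F[i][j] = max(F[i - 1][j] + w_c, H[i - 1][j] + w_g + w_c)  # gap in y
            w = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0.0, E[i][j], F[i][j], H[i - 1][j - 1] + w)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos

# e.g., smith_waterman_affine("the taxable year in which", "the tax year in which")
```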

2.5 Intrinsic Evaluation

To evaluate the precision and recall of text reuse detection, we create a pseudo-relevant set of document pairs by pooling the results of several runs with different parameter settings. For each document pair found in the union of these runs, we observe the length, in matching characters, of the longest local alignment. (Using matching character length allows us to abstract somewhat from the precise cost matrix.) We can then observe how many aligned passages each method retrieves that are at least 50,000 character matches in length, at least 20,000 character matches in length, and so on. The candidate pairs are sorted by the number of overlapping n-grams; we measure the pseudo-recall at several length cutoffs. For each position in a ranked list of document pairs, we then measure the precision: what proportion of documents retrieved are in fact 50k, 20k, etc., in length? Since we wish to rank documents by the length of the aligned passages they contain, this is a reasonable metric. One summary of these various values is the average precision: the mean of the precision at every rank position that contains an actually relevant document pair. One of the few earlier evaluations of local text reuse, by Seo and Croft (2008), compared fingerprinting methods to a baseline. Since their corpus contained short individual news articles, the extent of the reused passages was evaluated qualitatively rather than by alignment. Figure 2.1 shows the average precision of different parameter settings on the newspaper collection, ranked by the number of pairs each returns. If the pairwise document step returns a large number of pairs, we will have to perform a large number of more costly Smith-Waterman alignments. On this collection, a good tradeoff between space and speed is achieved by skip features. In the best case, we look at bigrams where there is a gap of at least 95, and not more than 105, words between the first and second terms (n=2, u=100, w=105, g=95). While average precision is a good summary of the quality of the ranked list at any one point, many applications will simply be concerned with the total recall after some fixed amount of processing. Figure 2.2 also summarizes these recall results by the absolute number of document pairs examined. From these results, it is clear that several good settings perform well at retrieving all reprinted passages of at least 5000 characters. Even using the pseudo-recall metric, however, the best operating points fail in the end to retrieve about 10% of the reprints detected by some other setting for all documents of at least 1000 characters.
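The evaluation just described reduces to a standard ranked-retrieval metric. The sketch below computes average precision for one length cutoff, treating a candidate pair as relevant when its pooled best alignment is at least min_length matching characters long; the function and argument names are illustrative.

```python
def average_precision(ranked_pairs, aligned_length, min_length):
    """Average precision at one length cutoff: `ranked_pairs` is the candidate
    list sorted by shared n-grams, and `aligned_length[pair]` is the character
    length of that pair's best local alignment in the pooled pseudo-relevant
    set.  A pair counts as relevant when its alignment is >= min_length."""
    hits, precisions = 0, []
    for rank, pair in enumerate(ranked_pairs, start=1):
        if aligned_length.get(pair, 0) >= min_length:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```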

2.6 Extrinsic Evaluation

While political scientists, historians, and literary scholars will, we hope, find these techniques useful and perform close reading and manual analysis on texts of interest, we would like to validate our results without a costly annotation campaign. In this work, we explore the correlation of patterns of text reuse with what is already known from other sources about the connections among Members of Congress, newspaper editors, and so on. This idea was inspired by Margolin et al. (2013), who used these techniques to test rhetorical theories of “semantic organizing processes” on the congressional statements corpus. The approach is quite simple: measure the correlation between some metric of text reuse between actors in a social network and other features of the network links between those actors. The metric of text reuse might be simply the number of exact n-grams shared by the language of two authors (Margolin et al., 2013); alternatively, it might be the absolute or relative length of all the aligned passages shared by two authors or the tree distance between them in a phylogenetic reconstruction. To measure the correlation of a text reuse metric with a single network, we can simply use Pearson’s correlation; for more networks, we can use multivariate regression.

Figure 2.1: Average precision for aligned passages of different minimum length in characters. Vertical red lines indicate the performance of different parameter settings (see Table 2.1).

Figure 2.2: (Pseudo-)Recall for aligned passages of different minimum lengths in characters.

Due to, for instance, autocorrelation among edges arising from a particular node, we cannot proceed as if the weight of each edge in the text reuse network can be compared independently to the weight of the corresponding edges in other networks. We therefore use nonparametric permutation tests using the quadratic assignment procedure (QAP) to resample several networks with the same structure but different labels and weights. The QAP achieves this by reordering the rows and columns of one network’s adjacency matrix according to the same permutation. The permuted network then has the same structure—e.g., degree distribution—but should no longer exhibit the same correlations with the other network(s). We can run QAP to generate confidence intervals for both single (Krackhardt, 1987) and multiple (Dekker et al., 2007) correlations.
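A minimal sketch of such a QAP permutation test is given below: it correlates the off-diagonal entries of two weighted adjacency matrices and builds a null distribution by permuting the rows and columns of one matrix with the same random permutation. This illustrates only the single-network test; the multiple-regression variant (Dekker et al., 2007) and confidence intervals are not shown, and the function name is ours.

```python
import numpy as np

def qap_correlation_test(A, B, n_perm=1000, seed=0):
    """Single-network QAP test: correlate off-diagonal entries of two weighted
    adjacency matrices, then compare against correlations obtained after
    permuting rows and columns of one matrix with the same permutation."""
    rng = np.random.default_rng(seed)
    off_diag = ~np.eye(A.shape[0], dtype=bool)

    def corr(X, Y):
        return np.corrcoef(X[off_diag], Y[off_diag])[0, 1]

    observed = corr(A, B)
    null = np.empty(n_perm)
    for k in range(n_perm):
        p = rng.permutation(A.shape[0])
        null[k] = corr(A, B[np.ix_(p, p)])   # same permutation on rows and columns
    p_value = float(np.mean(np.abs(null) >= abs(observed)))
    return observed, p_value
```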

2.7 Network Connections of 19c Reprints

2.7.1 Dataset description

In American Literature and the Culture of Reprinting, McGill (2007) argues that American literary culture in the nineteenth century was shaped by the widespread practice of reprinting stories and poems, usually without authorial permission or even knowledge, in newspapers, magazines, and books. Without substantial copyright enforcement, texts circulated promiscuously through the print market and were often revised by editors during the process. These “viral” texts—be they news stories, short fiction, or poetry—are much more than historical curiosities. The texts that editors chose to pass on are useful barometers of what was exciting or important to readers during the period, and thus offer significant insight into the priorities and concerns of the culture. Nineteenth-century U.S. newspapers were usually associated with a particular political party, religious denomination, or social cause (e.g., temperance or abolition). Mapping the specific locations and venues in which varied texts circulated would therefore allow us to answer questions about how reprinting and the public sphere in general were affected by geography, communication and transportation networks, and social, political, and religious affinities. These effects should be particularly observable in the period before the Civil War and the rise of wire services that broadcast content at industrial scales (Figure 2.3). To study the reprint culture of this period, we crawled the online newspaper archives of the Library of Congress’s Chronicling America project (chroniclingamerica.loc.gov). Since the Chronicling America project aggregates state-level digitization efforts, there are some significant gaps: e.g., there are no newspapers from Massachusetts, which played a not insubstantial role in the literary culture of the period. While we continue to collect data from other sources in order to improve our network analysis, the current dataset remains a useful, and open, testbed for text reuse detection and analysis of overall trends. For the pre-Civil War period, this corpus contains 1.6 billion words from 41,829 issues of 132 newspapers. Another difficulty with this collection is that it consists of the OCR’d text of newspaper issues without any marking of article breaks, headlines, or other structure. The local alignment methods described in §2.1 are designed not only to mitigate this problem, but also to deal with partial reprinting. One newspaper issue, for instance, might reprint chapters 4 and 5 of a Thackeray novel while another issue prints only chapter 5. Since our goal is to detect texts that spread from one venue to another, we are not interested in texts that were reprinted frequently in the same newspaper, or series, to use the cataloguing term. This includes material such as mastheads and manifestos and also the large number of advertisements that recur week after week in the same newspaper.

Figure 2.3: Newspaper issues mentioning “associated press” by year, from the Chronicling America corpus. The black regression line fits the raw number of issues; the red line fits counts corrected for the number of times the Associated Press is mentioned in each issue.

2.7.2 Experiment

For the antebellum newspaper corpus, we are also interested in how political affinity correlates with reprinting similar texts. We have also added variables for social causes such as temperance, women’s rights, and abolition that—while certainly not orthogonal to political commitments—might sometimes operate independently. In addition, we also added a “shared state” variable to account for shared political and social environments of more limited scope. Figure 2.4 shows a particularly strong example of a geographic effect: the statement of the radical abolitionist John Brown after being condemned to death for attacking a federal arsenal and attempting to raise a slave rebellion was very unlikely to be published in the South. Using information from the Chronicling America cataloguing and from other newspaper histories, we coded each of the 132 newspapers in the corpus with these political and social affinities. We then counted the number of reprinted passages shared by each pair of newspapers. There is not a deterministic relationship between the number of pairs of newspapers sharing an affinity and the number of reprints shared by those papers. While our admittedly partial corpus only contains a single pair of avowedly abolitionist papers—a radical position at the time—those two papers shared articles 306 times, compared for instance to the 71 stories shared among the 6 pairs of “nativist” papers. Table 2.2 shows that geographic proximity had by far the strongest correlation with (log) reprinting counts. Interestingly, the only political affinity to show as strong a correlation was the Republican party, which in this period had just been organized and, one might suppose, was trying to control its “message”. The Republicans were more geographically concentrated in any case, compared to the sectionally more diffuse Democrats. Another counterexample is the Whigs, the party from which the new Republican party drew many of its members, which also has a slight negative effect on reprinting.

Figure 2.4: Reprints of John Brown’s 1859 speech at his sentencing. Counties are shaded with historical population data, where available. Even taking population differences into account, few newspapers in the South printed the abolitionist’s statement.

The only other large coefficients are in the complete model for smaller movements such as nativism and abolition. It is interesting to speculate about whether the speed or faithfulness of reprinting—as opposed to the volume—might be correlated with more of these variables.

For networks where there is no natural way to reconstruct the ground-truth structures, and where annotation is expensive, we would like to instead explore the correlation of patterns of text reuse with what is already known from other sources about the connections among Members of Congress, newspaper editors, and so on. This idea was inspired by Margolin et al. (2013), who used these techniques to test rhetorical theories of “semantic organizing processes” on the congressional statements corpus.

affinity               pairs of papers   newspaper reprints   regression w/ pairs >= 1   >= 10     >= 100
Republican             1176              134,302              0.74***                    0.73*     0.72***
Whig                   1176              91,139               -0.35                      -0.34     -0.35
Democrat               1081              62,609               -0.08                      -0.09     -0.07
same state             672               103,057              1.12***                    1.11***   1.13***
anti-secession         435               22,009               -0.58*                     -0.58     -0.60
anti-slavery           231               12,742               -0.65                      -0.64     -0.60
pro-slavery            120               11,040               -0.35                      -0.35     -0.27
Free-State             15                1,194                0.80                       0.80
Constitutional Union   15                1,070                -0.21                      -0.21
pro-secession          15                529                  0.11                       0.11
Free Soil              10                1,936                -0.42                      -0.42
Copperhead             10                797                  1.53                       1.54
temperance             6                 560                  0.65
independent            6                 186                  -0.22
nativist               6                 71                   -1.93*
women's rights         3                 721                  1.91
abolitionist           1                 306                  3.49**
Know-Nothing           1                 25                   1.33
Mormon                 1                 3                    -1.13
R-squared              –                 –                    .065                       .063      .062

Table 2.2: Correlations between shared reprints between 19c newspapers and political and other affinities. While many Whig papers became Republican, they do not completely overlap in our dataset; the identical number of pairs is coincidental. Text reuse in social networks 27

2.8 Congressional Statements

2.8.1 Dataset description

Members of the U.S. Congress are of course even more responsive to political debates and incentives than nineteenth-century newspapers. Representatives and senators are also a very well-studied social network. Following Margolin et al. (2013), we analyzed a dataset of more than 400,000 public statements made by members of the 112th Senate and House between January 2011 and August 2012. The statements were downloaded from the Vote Smart Project website (votesmart.com). According to Vote Smart, the Members’ public statements include any press releases, statements, newspaper articles, interviews, blog entries, newsletters, legislative committee websites, campaign websites and cable news show websites (Meet the Press, This Week, etc.) that contain direct quotes from the Member. Since we are primarily interested in the connections between Members, we will, as we see below, want to filter out reuse among different statements by the same member. That information could be interesting for other reasons—for instance, tracking slight changes in the phrasing of talking points or substantive positions. We supplemented these texts with categorical data on chambers and parties and with continuous representations of ideology using the first dimension of the DW-NOMINATE scores (Carroll et al., 2009) that have been widely used to describe the political ideology of political actors, political parties and political institutions. A first-dimension score closer to 1 is described as conservative, whereas a score closer to −1 is described as liberal; a score at or near zero is described as moderate.

                                   aligned passages of >= n words        n-grams of length
                                   10         16         32              8           16         32
First-order Pearson correlations
DW-nominate                        0.26***    0.25***    0.23***         0.26***     0.22***    0.16***
same chamber                       0.05*      0.08**     0.13***         -0.05***    0.21***    0.10***
Regression coefficients
DW-nominate                        0.72***    0.75***    0.74***         1.31***     2.67***    0.36
same chamber                       0.15**     0.27***    0.42***         0.20        3.14***    0.81***
R-squared                          .069       .070       .073            .068        .073       .010

Table 2.3: Correlations between log length of aligned text and other author networks in public statements by Members of Congress. *p < .05, **p < .01, ***p < .001

2.8.2 Experiment

We model the connection between the log magnitude of reused text and the strength of ties among Members according to whether they are in the same chamber and how similar they are on the first dimension of the DW-NOMINATE ideological scale (Carroll et al., 2009). The left side of Table 2.3 shows the results of correlating reused passages of certain minimum lengths (10, 16, 32 words) with these underlying features. The right side shows the comparable results of Margolin et al. (2013), which simply used the exact size of the n-gram overlap between Members' statements for increasing values of n. The alignment analysis proposed here achieves similar results when passages and n-grams are short. Our analysis, however, achieves higher single and multiple correlations among networks as the passages grow longer. This is unsurprising, since the probability of an exact 32-gram match is much smaller than that of a 32-word-long alignment that might contain a few differences. In particular, the much higher coefficients for DW-NOMINATE at longer aligned lengths suggest that ideological influence still dominates over similarities induced by the procedural environment of each congressional chamber.

2.9 Conclusion

We have presented techniques for detecting reused passages embedded within the larger discourses produced by actors in social networks. Some of this shared content is as brief as partisan talking points or lines of poetry; other reprints can encompass extensive legislative boilerplate or chapters of novels. The longer passages are easier to detect, with perfect pseudo-recall, without exhaustive scanning of the corpus. Precision-recall tradeoffs will vary with the density of text reuse and the noise introduced by optical character recognition and other features of data collection. We then showed the feasibility of using network regression to measure the correlations between connections inferred from text reuse and networks derived from outside information. However, the techniques presented in this chapter are mainly effective for lexical text reuse. Because the Smith-Waterman algorithm uses fixed weights, where each mismatched token is treated equally, these methods are less effective at distinguishing pairs of related, but not closely copied, passages from pairs of unrelated passages. For less literally copied texts, especially semantically similar ones that can be hard to align, we need a more sensitive model of the semantics of passages and their contexts. In the next chapter, we will present an attention-based model to address this problem.

Chapter 3

Semantic Text Reuse in Social Networks

The text reuse we address in chapter 2 focuses on lexical similarities between texts. In the real world, however, semantic text reuse comprises more widely observed phenomena such as paraphrasing and textual entailment. According to Androutsopoulos and Malakasiotis (2010): "Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true." For example, mentions of cited papers in academic papers usually summarize the contributions of the cited papers in one or two sentences, which can be seen as paraphrasing corresponding sentences in the abstract or cited papers, or as statements influenced by or entailed from the abstract. Similarly, in news media a press release is usually paraphrased in, or entails, the resulting news reports. Such behavior can also be seen in congressional bills, or in the evolution of patents. The n-gram overlap and alignment methods presented in chapter 2 may fall short when presented with such types of text reuse. In this chapter, we will see an attention based convolutional network (ABCN) that can deal with such phenomena. Because it is very hard to find datasets with annotations on

reused source sentences and target sentences, and it is also expensive to produce such annotations when they are missing, we use citations in academic papers to evaluate such behavior. As mentioned, for an academic paper, the abstract is usually a summary of the background, description, and sometimes experimental results of the contributions, and the citing paper has several mentions of the cited paper that can be considered either paraphrase or entailment. Therefore, we consider the abstract and the section text that contains the paper mentions as a document pair and aim to predict the mentions. §3.8 describes the dataset and experiments we use to evaluate our model.

3.1 Classifying text reuse as paraphrase or textual entailment

The problems of paraphrase and textual entailment are well studied in the literature. With the current boom in applications of neural networks to natural language processing, many deep neural models have been proposed to improve these classification tasks. For paraphrase identification tasks, there are two lines of work that aim to capture sentence representations and similarities between sentences. The first consists of methods based on convolutional neural networks. Most of them are based on a "Siamese" network structure (Bromley et al., 1994), which uses the same weights for a convolutional neural network working in tandem on two different input vectors to compute comparable output vectors. For example, DCNN (Kalchbrenner et al., 2014) uses Dynamic k-Max Pooling, a global pooling operation over linear sequences, to handle input sentences of varying length and induces a feature graph over the sentence that is capable of explicitly capturing relations at different ranges. ARC-I and ARC-II (Hu et al., 2014) are two variant models built on DCNN. ARC-I defers the interaction of the two texts to the end

of the process, while ARC-II lets them meet early by directly interleaving them into a single representation. DSSM (Huang et al., 2013) uses deep linear projection layers that project queries and documents into a common low-dimensional space where it is easy to compute the distance between a document and a query. C-DSSM (Shen et al., 2014) has a similar structure to DSSM but adds a convolutional layer that projects each word within a context window to a local contextual feature vector. MatchPyramid (Pang et al., 2016) constructs a matching matrix whose entries represent the similarities between words of the given input pair and views it as an image; a convolutional neural network is then used to capture rich matching patterns in a layer-by-layer way. DRMM (Guo et al., 2016) also uses an interaction-based model, in which histogram pooling counts multiple levels of soft-TF, with learning-to-rank applied afterwards. ABCNN (Yin et al., 2016) proposes three attention schemes that integrate the mutual influence between sentences into CNNs; thus, the representation of each sentence takes its counterpart into consideration. These interdependent sentence pair representations are more powerful than isolated sentence representations. K-NRM (Xiong et al., 2017) uses kernel pooling instead of DRMM's histogram pooling and learns the word embeddings and ranking layers end-to-end. CONV-KNRM (Dai et al., 2018) is the state-of-the-art CNN-based text similarity matching model. Instead of exactly matching n-grams, it uses convolutional neural networks to represent n-grams of various lengths and matches them in a unified embedding space; the n-gram soft matches are then used by kernel pooling and learning-to-rank layers to generate the final ranking score.

The other line of work, using RNN-based neural networks, has picked up steam recently. MV-LSTM (Wan et al., 2016a) has a network architecture similar to ARC-I but uses a bidirectional LSTM to learn positional sentence representations instead; the matching score is then passed through k-max pooling and a multi-layer perceptron. Match-SRNN (Wan et al., 2016b) uses a recursive process to model the generation of the global interaction between two texts. The interaction of two texts at each position is a composition of the interactions between their prefixes and the word-level interaction at the current position; a spatial (2D) RNN integrates the local interactions recursively. BiMPM (Wang et al., 2017b) uses bilateral multi-perspective matching: after obtaining sentence representations from a Bi-LSTM encoder, the two representations are matched from both directions with multiple perspectives. Another Bi-LSTM layer then aggregates the results into a fixed-length vector, followed by a fully connected layer to produce the classification. DeepRank (Pang et al., 2017) combines RNN and CNN networks, where a measure network determines local relevances using a convolutional neural network (CNN) or two-dimensional gated recurrent units (2D-GRU); sequential integration through an RNN then produces a global relevance score. However, BERT (Devlin et al., 2018) achieves a great improvement on tasks such as paraphrase identification and textual entailment and has become the state-of-the-art model on these tasks.
Last but not least, since our goal is to pick the sentences inside a pair of passages that rationalize the classification decision, work on rationales also informs our model choice. For example, Lei et al. (2016) combine an independently parameterized, RNN-based generator and encoder and train them together to produce a hard selection of rationales for classification tasks. The generator specifies a distribution over text fragments as candidate rationales, and these are passed through the encoder for prediction. As previously mentioned, it is difficult to collect data or generate labels, and this also applies to rationale prediction; this model does not need marked rationale spans at training time. Zhang et al. (2016) present a CNN model for text classification that jointly exploits labels on documents and on their constituent sentences, some of which support the document classification, that is, they provide a rationale for it; such sentences should also exhibit some worthiness to serve as support for the classification. As for academic papers, Bonab et al. (2018) and Cohan et al. (2019) use a CNN and an RNN, respectively, to decide the citation-worthiness of sentences.

3.2 Method Overview

Figure 3.1: The overview of the structure of the Attention Based Convolutional Network (ABCN). [Diagram: the source and target documents pass through embedding and convolutions, a Bi-LSTM, an attention layer with a general score function and weighted sum, concatenation, and a final sigmoid output.]

In this section, we introduce our attention based model that picks out the relevant sentences given a document query. Figure 3.1 shows the model architecture. In a nutshell, we use CNNs to represent the semantics of sentences, an attention layer to learn how to attend to different sentences, and a final MLP layer with a logistic activation to label the target sentences.

3.3 Word Representations

Lexical features are insufficient to capture the semantics of words, so we need a different representation for the document vocabulary. The straightforward way is to have a one-hot encoding for each word in the vocabulary $V$. For example, with a minimal vocabulary $V = \{$he, she, boy, girl$\}$, the word "boy" would be represented by the vector $(0, 0, 1, 0)$. There are several downsides to this method. First, the one-hot encoded vector has the same size as the vocabulary, with only one element equal to 1 while the remaining $|V| - 1$ elements are 0; with a million-word vocabulary, we would have a very sparse representation. Second, we want similar units, such as words or sentences, to be close spatially in the representation space, that is, to have representations with a cosine similarity close to 1. However, the cosine similarity of two one-hot encodings is always 0 unless the words are identical. In short, such an encoding still relies only on the lexical features of the texts.

There is significant work on generating continuous representations of the document vocabulary that, at the same time, "embed" the semantics of words into a low-dimensional space. Word2Vec and GloVe are two widely used models for constructing such word embeddings. Word2Vec (Mikolov et al., 2013a) obtains word embeddings using one of two methods, Skip-Gram or Continuous Bag of Words (CBOW), via neural networks. Skip-gram takes a word as input and tries to predict $C$ context words around it, while CBOW uses $C$ context words to predict the word in the middle. Skip-gram works well with small amounts of training data and represents even rare words or phrases well, while CBOW is several times faster to train and has slightly better accuracy for frequent words. However, both methods make predictions by taking only local contexts into consideration (for skip-gram the local context is the word itself); they do not take advantage of global dependencies. In contrast, GloVe (Pennington et al., 2014) leverages both a global word-word co-occurrence matrix and local statistics, decomposing the matrix into dense vectors.

One drawback of Word2Vec and GloVe is that they are static: no matter whether they are estimated on an existing corpus or on specific training data, after training they are fixed. The outputs of both models are dictionaries whose keys are the words in the vocabulary. They can model complex characteristics of word use, such as syntax and semantics, but not how these uses vary across linguistic contexts, i.e., they cannot model polysemy. Therefore, recent research focuses on contextualized word embeddings, where each word is assigned a representation that is a function of the entire input sequence. Such models can address both of the aforementioned challenges, albeit at a higher runtime cost than the dictionary lookup of "static" methods. Some representative models are ELMo and BERT. ELMo (Peters et al., 2018) is a character-based contextualized word embedding method, which uses the concatenation of independently trained left-to-right and right-to-left LSTMs (Hochreiter and Schmidhuber, 1997) to generate word embeddings. Since it is character based, it allows the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training.
BERT (Devlin et al., 2018) uses masked language models to enable pre-trained deep bidirectional representations. Instead of an LSTM, BERT uses a bidirectional Transformer (Vaswani et al., 2017), and tokenization is performed with WordPiece (Wu et al., 2016) embeddings over a 30,000-token vocabulary. BERT is the state of the art for several semantic similarity tasks on datasets such as QQP (Chen et al., 2018), MRPC (Dolan and Brockett, 2005), QNLI (Wang et al., 2018), and STS-B (Cer et al., 2017). We use BERT to provide word features for our model. Because we do not have sentence-level labels for the source (cited) documents in the cited-sentence selection task, we cannot fine-tune the parameters of BERT directly on relevant sentences in our corpus. However, we provide alternative ways of comparing fine-tuned BERT models on the evaluation tasks; the details are given in the experiment section (§3.8).

Concretely, we have inputs $S = (s_1, s_2, \ldots, s_n)$ and $T = (t_1, t_2, \ldots, t_n)$, where $S$ and $T$ represent the source and target documents, respectively, and $s_i | t_i = (w_1, w_2, \ldots, w_n)$ is the $i$-th sentence in the document, where $w_n \in \mathbb{R}^d$ is the embedding of a WordPiece token from the pre-trained BERT model.
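As a concrete illustration of this input setup, the sketch below extracts contextualized WordPiece embeddings from a pre-trained uncased BERT-Base model. The use of the HuggingFace transformers library is our choice for illustration only (the thesis does not prescribe a particular implementation); taking the second-to-last hidden layer follows the setting reported later in §3.8.2, and the function name is ours.

```python
# Minimal sketch: contextualized WordPiece features from pre-trained BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert.eval()

def sentence_features(sentence: str) -> torch.Tensor:
    """Return a (num_wordpieces, 768) matrix of contextualized embeddings."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    # hidden_states is a tuple of 13 tensors (embedding layer + 12 layers);
    # index -2 is the second-to-last transformer layer used as features.
    return outputs.hidden_states[-2].squeeze(0)

# Each sentence s_i or t_j becomes a sequence of d-dimensional WordPiece vectors.
features = sentence_features("This paper proposes a framework for vector composition.")
```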

3.4 Contextualized Sentence Representation

We aim to detect text reuse at the level of sentences, so we want to represent sentences in a manner similar to words, using continuous vectors derived from word embeddings that capture the semantic features of sentences. The simplest way is to compute the average of the embeddings of all the words within a sentence. However, this fails to capture the structure of the sentence, such as grammatical units consisting of several words, which are crucial for expressing the sentence semantics. One of the many improvements is the algorithm of Arora et al. (2017), which takes word embeddings pretrained on a large corpus, encodes a sentence as a weighted combination of the word vectors, and then removes the projection of the vectors on the first principal component. Those methods aim at providing a universal sentence embedding. We, however, want models that can incorporate the interactions between words from different parts of the sentence to provide sentence representations for a specific corpus. There are usually two regimes for this task. The first regime is to train an encoder with a Recurrent Neural Network (RNN) (Jordan, 1986; Pearlmutter, 1989; Cleeremans et al., 1989), which allows the understanding of the current word to be based on the previous words one has seen. To fulfill this, a vanilla

RNN is a network with loops, which accepts variable-length sequence input and allows information to persist and flow through different time steps. Figure 3.2 shows the unrolled network structure. It uses recurrent hidden states whose activation at each time step depends on that of the previous time step. However, as Bengio et al. (1994) observed, it is difficult to train RNNs to capture long-term dependencies because the gradients tend to either vanish or explode. Later, several gated RNN models were proposed that use recurrent units such as the Long Short-Term Memory (LSTM) unit (Hochreiter and Schmidhuber, 1997) or the Gated Recurrent Unit (GRU) (Cho et al., 2014). Those models have gate functions in each recurrent unit to modulate the information flow.

Figure 3.2: Unrolled Vanilla Recurrent Neural Network. [Diagram: input $x_t$, recurrent cell $A$, hidden state $h_t$.]

The other regime is to use Convolutional Neural Networks (CNNs or ConvNets). Originally designed by LeCun et al. (1998) to recognize hand-written digits, CNNs have achieved remarkable results in computer vision and in recent years have also obtained good results on several NLP tasks, such as search query retrieval. A CNN uses layers with convolving filters that are applied to local features. It is easier to parallelize the training of CNN-based models than of RNN variants, but CNN-based models have more parameters and require fixed-length sentences as inputs. Considering the recent popularity and strong results of CNN-based models on paraphrase identification problems, our proposed model uses a standard ConvNet structure (Figure 3.3) to learn sentence representations.

Figure 3.3: An illustration of the ConvNet structure used in ABCN, with a toy example with word embeddings in $\mathbb{R}^5$ ($n = 5$, $d = 5$), kernel size 2, and 3 filters yielding 3 feature maps. Convolution and an activation function are followed by global max pooling, and the maxima are concatenated into a single feature vector, yielding a representation for the sentence in $\mathbb{R}^3$.

Each convolutional filter is parameterized with kernel size $k$ as $W \in \mathbb{R}^{d_1 \times kd}$, $b_w \in \mathbb{R}^{d_1}$, takes as input $X \in \mathbb{R}^{k \times d}$, a concatenation of $k$ WordPieces $w_i$, and generates the representation $s_i | t_i \in \mathbb{R}^{d_1}$:

$$s_i | t_i = \mathrm{ReLU}(W[w_{i-k/2}, \ldots, w_{i+k/2}] + b_w) \qquad (3.1)$$

After the convolutional layer, global max pooling is used to obtain a fixed-length representation of the sentence. We share the weights of the convolutional layers between the source and target documents. Now that we have representations of each sentence, we also want to embed context into the sentence representations, similar to word representations. Knowledge of the semantics of the sentences before or after the current sentence reveals additional information, such as cite-worthiness (Cohan et al., 2019). Here we use a bidirectional long short-term memory (BiLSTM) (Schuster and Paliwal, 1997) with hidden size $d_2$:

$$h_i = [\overrightarrow{\mathrm{LSTM}}_h(s, i), \overleftarrow{\mathrm{LSTM}}_h(s, i)] \qquad (3.2)$$

$$g_j = [\overrightarrow{\mathrm{LSTM}}_g(t, j), \overleftarrow{\mathrm{LSTM}}_g(t, j)] \qquad (3.3)$$

where $h, g \in \mathbb{R}^{n \times 2d_2}$, $\overrightarrow{\mathrm{LSTM}}_h(s, i)$ processes $s$ from left to right up to position $i$, and vice versa for the backward direction $\overleftarrow{\mathrm{LSTM}}_h(s, i)$; the same applies to the BiLSTM for $t$. In order to capture the dependencies between sentences in the two documents independently, we use two sets of parameters for the source and target documents, respectively.
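A minimal PyTorch sketch of this sentence encoder (Eqs. 3.1–3.3) is given below, assuming the hyperparameters reported in §3.8.2 (embedding size 768, 100 filters, kernel size 7, BiLSTM hidden size 128). Module and variable names are ours, and details such as padding and batching are simplified.

```python
# Sketch of the contextualized sentence encoder: 1D convolution with ReLU over
# WordPiece embeddings, global max pooling, then a document-level BiLSTM.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, d=768, d1=100, k=7, d2=128):
        super().__init__()
        self.conv = nn.Conv1d(d, d1, kernel_size=k, padding=k // 2)       # Eq. 3.1
        self.bilstm = nn.LSTM(d1, d2, bidirectional=True, batch_first=True)

    def forward(self, sentences):
        # sentences: (num_sentences, num_wordpieces, d) for one document
        x = sentences.transpose(1, 2)          # (num_sent, d, num_wp)
        x = torch.relu(self.conv(x))           # (num_sent, d1, num_wp)
        x = x.max(dim=2).values                # global max pooling -> (num_sent, d1)
        # Treat the sentence sequence as a single batch element for the BiLSTM
        h, _ = self.bilstm(x.unsqueeze(0))     # (1, num_sent, 2 * d2), Eqs. 3.2-3.3
        return h.squeeze(0)

# Separate parameters for the source and target documents:
encode_source = SentenceEncoder()
encode_target = SentenceEncoder()
```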

3.5 Attention

Attention mechanisms have long been used in computer vision and natural language processing. The intuition behind them is that humans do not tend to process a scene in its entirety at once, but instead focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation (Rensink, 2000; Mnih et al., 2014). Therefore, by including an attention mechanism in the model, we can regulate what the model is "looking at" at training and inference time when making the current prediction. Now that we have the contextualized representation of each sentence from the source and target documents, to compute the attention score we combine the representations of the source document $h_i$ and the target document $g_j$:

$$\alpha_{i,j} = f(h_i, g_j) \qquad (3.4)$$

where $f$ is a score function measuring the similarity between the states of the source and target sentences. We choose the general score function from Luong et al. (2015):

$$f(h_i, g_j) = h_i W g_j \qquad (3.5)$$

Then, for each sentence in the target document, we compute a softmax over all sentences in the source document:

$$\alpha_{i,j} = \frac{\exp(\alpha_{i,j})}{\sum_{i'} \exp(\alpha_{i',j})} \qquad (3.6)$$

Similar to the global attention mechanism in Luong et al. (2015), the final context vector is a weighted sum of the source sentence embeddings $h_i$, weighted by the attention scores:

$$c_j = \sum_i \alpha_{i,j} h_i \qquad (3.7)$$

3.6 Final output

Given the target sentence embedding $g_j$ and the source-side context vector $c_j$, we compute the final output as:

$$o_j = \sigma(W_c[c_j; g_j]) \qquad (3.8)$$

where $[\cdot\,;\cdot]$ denotes concatenation and $\sigma(\cdot)$ is the sigmoid function. Since the sigmoid function gives a result between 0 and 1, we treat $o_j$ as the probability of labeling sentence $t_j$ as semantically similar to some sentence(s) in the source document.
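The following PyTorch sketch puts Eqs. 3.4–3.8 together: the general score $h_i W g_j$, a softmax over source sentences for each target sentence, the weighted context vector, and the sigmoid output over the concatenation $[c_j; g_j]$. Class and variable names are ours, and the sketch omits dropout and other training details.

```python
# Sketch of the attention and output layers of ABCN.
import torch
import torch.nn as nn

class AttentionOutput(nn.Module):
    def __init__(self, d2=128):
        super().__init__()
        self.W = nn.Linear(2 * d2, 2 * d2, bias=False)   # bilinear "general" score (Eq. 3.5)
        self.Wc = nn.Linear(4 * d2, 1)                    # output layer W_c (Eq. 3.8)

    def forward(self, h, g):
        # h: (n_src, 2*d2) source sentence states; g: (n_tgt, 2*d2) target states
        scores = self.W(h) @ g.t()                        # (n_src, n_tgt), Eqs. 3.4-3.5
        alpha = torch.softmax(scores, dim=0)              # normalize over source sentences, Eq. 3.6
        c = alpha.t() @ h                                 # (n_tgt, 2*d2), Eq. 3.7
        o = torch.sigmoid(self.Wc(torch.cat([c, g], dim=1)))  # (n_tgt, 1), Eq. 3.8
        return o.squeeze(1)                               # probability per target sentence
```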

3.7 Objective function

For each target document $T$, we now have a sequence of outputs $(o_1, \ldots, o_n)$. Since the number of similar sentences will be relatively small compared to the non-relevant ones, we use a weighted cross entropy between the output sequence and the ground truth:

$$L_T = -\sum_{j=1}^{n} \big[\, y_j \log(o_j) \cdot w + (1 - y_j)\log(1 - o_j) \,\big] \qquad (3.9)$$

where $y_j$ is the label of the $j$-th sentence in the target document, and $w$ is a weight, treated as a hyper-parameter, assigned to the positive labels to accentuate the cost of a positive error relative to a negative error. We use the Adam optimizer (Kingma and Ba, 2015) to optimize the objective function.
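A short sketch of this objective follows, with the positive-class weight $w$ exposed as a hyper-parameter (set to 15 in §3.8.2); the function name and the small epsilon for numerical stability are ours.

```python
# Weighted cross entropy of Eq. 3.9: o are sigmoid outputs, y the 0/1 labels.
import torch

def weighted_bce(o: torch.Tensor, y: torch.Tensor, w: float = 15.0) -> torch.Tensor:
    eps = 1e-8  # numerical stability for log
    loss = -(w * y * torch.log(o + eps) + (1 - y) * torch.log(1 - o + eps))
    return loss.sum()

# Optimized with Adam, e.g. torch.optim.Adam(model.parameters(), lr=1e-4).
```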

3.8 Experiments

3.8.1 Datasets

We evaluate the effectiveness of our model on a sentence selection task on scientific papers from the ACL Anthology Conference Corpus (Bird et al., 2008). The corpus contains the abstracts and full article texts, later broken into sections, of 10,921 scientific articles. Hereafter, we use "document" to refer to both abstracts and sections of full articles. The citations between documents in the corpus provide labels for training and evaluating our models. Specifically, we create (source, target) pairs using the abstract of the cited paper and the section of the citing paper that includes the citation. In the target document, the "matched" sentence is defined as the sentence in the target document where the actual citation mark for the source document occurs. A section that contains at least one matched sentence is further defined as a matched document. One example is given as follows:

S: This paper proposes a framework for representing the meaning of phrases and sentences in vector space. Central to our approach is vector composition which we operationalize in terms of additive and multiplicative functions. Under this framework, we introduce a wide range of composition models which we evaluate empirically on a sentence similarity task. Experimental results demonstrate that the multiplicative models are superior to the additive alternatives when compared against human judgments.

T: ... As our final set of baselines, we extend two simple techniques proposed by Mitchell and Lapata (2008) that use element-wise addition and multiplication operators to perform composition. The two baselines thus obtained are AVC (element-wise addition) and MVC (element-wise multiplication).

The above example highlights the citing sentence, as well as the most relevant sentence in the cited paper's abstract. To generate non-relevant target examples for a given source abstract, we randomly sample a section of the citing paper that does not have any citation mark pointing to the source paper from which the abstract comes. A given source abstract will thus have positive and negative target sections drawn from the same citing paper. To locate the citation marks, we use Parscit (Councill et al., 2008), a CRF-based system, to parse the reference strings, find the citations in the document body, and split the whole paper into sections. Since citation marks usually have a distinctive lexical pattern, and we do not want our model to be able to learn it, in the pre-processing stage we remove all citation marks while keeping a record of the positions of the actual ones referring to the

given abstract. We then use the sentence tokenizer from StanfordCoreNLP (Manning et al., 2014) to split the abstracts and sections into sentences. We obtain 105,381 document pairs of (source abstract, target section). As previously mentioned, we have definitions of matched sentences and matched documents, so we evaluate our model on retrieval tasks at both the sentence and document level. On the sentence level, we also evaluate our model on ranking tasks to see how good it is at picking relevant sentences. The sentence level task is straightforward: for a given abstract-section document pair, we want to predict the matched sentence. For the document level task, we are given an abstract and a list of sections from the citing papers, and we want to find the matched document (section). In order to avoid having the model learn the particular contents of cited papers, we split the corpus based on the cited abstract, giving 86,424 pairs for training, 9,121 pairs for validation, and 9,836 pairs for testing. For the source documents, we limit the number of sentences to the 90th percentile among all samples by dropping abstracts that have more than 20 sentences; for the target documents, the 90th percentile gives section texts with fewer than 40 sentences.

3.8.2 Models

We compare ABCN with the following models:

TF-IDF TF-IDF (Salton et al., 1982) is a widely used and sometimes strong baseline method. Each sentence is represented as a $|V|$-dimensional vector, where $|V|$ is the vocabulary size. We fit the vocabulary and IDF scores on the training dataset only, and compute the similarity score as the inner product of two vectors.

Jaccard Similarity We use the Jaccard similarity coefficient for the document pairs. Although it does not require training data, we use the validation set to choose the threshold for positive labeling and then run it on the test dataset:

$$J(s_i, t_j) = \frac{|s_i \cap t_j|}{|s_i \cup t_j|} \qquad (3.10)$$
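A tiny sketch of Eq. 3.10 over word sets is shown below; the whitespace tokenization is for illustration only and is not necessarily the tokenization used in the experiments.

```python
# Jaccard similarity coefficient between two sentences, Eq. 3.10.
def jaccard(s_i: str, t_j: str) -> float:
    a, b = set(s_i.lower().split()), set(t_j.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0
```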

Levenshtein Distance We use the Levenshtein (edit) distance for the document pairs. It focuses on syntactic similarity and serves as an indication of how much of the observed similarity is captured at the surface level.

BERT-Base Uncased We use the pretrained uncased BERT-Base model, which has 12 layers, 768 hidden units, 12 attention heads, and 110M parameters, to perform binary classification. Because BERT gives normalized probabilities for the binary labels, we can also evaluate its performance on the ranking tasks.

Fine-tuned BERT Uncased Since BERT-Base is pretrained on corpora whose style of wording differs somewhat from scientific papers, we also fine-tune BERT according to the strategies by which we treat the source documents. The details of the BERT fine-tuning process are described in the following sections.

ABCN We use the above-mentioned pretrained uncased BERT-Base model for feature extraction. We take the output of the second-to-last layer, which has 768 dimensions, as our contextualized word embeddings. We use the validation set to pick the best hyper-parameters and the threshold on the sigmoid output for producing labels. The results we report use a 7-token-wide 1D convolutional layer with 100 filters, a BiLSTM with 128 hidden units, a weight of 15 on the positive labels, and a dropout rate of 0.2 for the dropout layers between the layers of the model. In addition, for the Adam optimizer (Kingma and Ba, 2015) we choose a learning rate of $1 \times 10^{-4}$.

3.8.3 Experiment settings

There is substantial previous research on classifying whether a pair of texts are paraphrases of one another, or whether the first text entails the second. However, no previous work, to the best of our knowledge, has attempted to perform the classification across all sentences in a pair of documents at once. Hence, we compare our method with several models from paraphrase identification or textual entailment that compute pairwise similarity scores, with some modifications. For those models, we need to adapt the training tasks in order for the results to be comparable to ours. Since we do not have sentence annotations on the source documents, these modifications are based on how we pair them with the target documents:

• Whole Source Document vs. Target Sentences

One possible way to train a classifier model on our task is to ignore the lack of sentence annotations in the source document and use the source document as a whole. The input to the classifier model then becomes pairs of the source document and every target sentence. We can keep using the sentence annotations for training and testing in this case, as we consider the positively labeled sentence to be a paraphrase of, or an entailment from, the source document.

• Source Sentences vs. Target Sentences

The other way is to use every sentence in the source document and take the cross product with the sentences in the target document. Since we have no corresponding annotations for the source document, we cannot fine-tune the BERT-Base model on our corpus; instead we train it on the training data of MRPC and test it on our corpus. For inference, for each target sentence we generate a similarity score for each sentence in the source document and use the highest score as a surrogate score between the source document and the target sentence.

SemEval 2014 (Jurgens et al., 2014) included a task on modeling cross-level semantic similarity, one setting of which is the paragraph-to-sentence pair, the same as our first modification. The second modification is natural, since the sentence similarity problem has been widely studied. These two options allow us to compute a similarity score and, if needed, to use the validation dataset to determine the best similarity threshold for labeling a sentence as matched. We can then report evaluations at both the document and sentence level: (1) on the document level, we use macro-averaged recall/precision/F1 to measure, given a source document, how well the model retrieves the matched target document, while (2) on the sentence level, we also use macro-averaged recall/precision/F1 to measure the performance of retrieving the relevant target sentence, as well as some classical ranking metrics, where for a target document we rank the relevance of sentences by the similarity score:

• MAP@k: The classical mean average precision measure where the average precision is defined as the mean of the precision scores after each relevant document is retrieved up to top k documents:

$$\text{Average Precision} = \frac{\sum_r P@r}{R}$$

• MRR@k: The classical mean reciprocal rank measure, which calculates the reciprocal of the rank at which the first relevant document was retrieved, up to the top k documents. A sketch of both metrics is given after this list.
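The sketch below computes both metrics for a single target document from per-sentence similarity scores and binary relevance labels; the function names are ours, and MAP@k and MRR@k are obtained by averaging these values over all target documents.

```python
# Ranking metrics for one target document, given per-sentence scores and labels.
def average_precision_at_k(scores, labels, k=5):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)     # P@r after each relevant item
    total_relevant = sum(labels)
    return sum(precisions) / total_relevant if total_relevant else 0.0

def reciprocal_rank_at_k(scores, labels, k=5):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            return 1.0 / rank
    return 0.0
```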

3.8.4 Document level evaluation

Recall from §3.8.1 that at the document level we are interested in finding the matched target documents from a list of candidates that could be retrieved by off-the-shelf search engines; we consider a target document "matched" when at least one sentence within the document is predicted as such.

Table 3.1: Document level results (macro-averaged recall/precision/F1) of ABCN in comparison with baseline methods, under both the whole source document vs. target sentences and the source sentences vs. target sentences settings.

                          recall   precision   F1
Source Document vs. Target Sentences
  TF-IDF                  0.522    0.686       0.593
  Jaccard Distance        0.067    0.179       0.097
  Levenshtein Distance    0.027    0.027       0.027
  BERT                    0.973    0.547       0.700
  BERT (ARC)              0.455    0.735       0.562
Source Sentences vs. Target Sentences
  TF-IDF                  0.298    0.570       0.391
  Jaccard Distance        0.078    0.229       0.116
  Levenshtein Distance    0.089    0.286       0.136
  BERT                    0.843    0.543       0.661
  BERT (MRPC)             0.886    0.610       0.722
  ABCN                    0.854    0.726       0.785

Table 3.1 shows the results for the document level evaluation. We cannot perform ranking evaluations here, as we are not able to obtain a similarity score for a document pair. From the table we see that the uncased BERT-Base model has the highest recall. However, on closer inspection of the test dataset, 57.59% of the documents are matched documents, which tells us that the uncased BERT-Base model does only slightly better than simply predicting a random sentence to be relevant in every target document. The fine-tuned BERT model achieves the highest precision while having the worst recall among all BERT models. ABCN achieves the best balance between recall and precision, with the highest F1 measure, while both its recall and precision are very close to the best performers. Fine-tuned BERT using the cross product of sentences from the source and target documents achieves the second-best balance on the retrieval task, but has the disadvantage of being trained on a different corpus, MRPC. One might expect it to perform better if given the appropriate training data. However, this is exactly the advantage and novelty of our proposed model, which can consider the document pair as a whole without requiring expensive-to-obtain sentence-level labels on both documents.

3.8.5 Sentence level evaluation

We are also interested in how well a model can select the matched sentence, which in this test setting is defined as the sentence in the target document where the actual citation mark for the source document occurs. All the baseline methods and ABCN assign a similarity score to each sentence in the target document given the source document. We use the validation dataset to determine the threshold for mapping a similarity score to a positive or negative label, and compute macro-averaged recall/precision/F1. We also rank the sentences within the same target document and compute the MAP@k and MRR@k metrics. Table 3.2 shows the sentence level results over all target documents, which includes non-matched documents. This experiment shows how many false negatives (through recall) and false positives (through precision) the models predict for documents without matched sentences. Since the denominator of the recall computation involves true positives and false negatives, we consider a non-matched document to have recall 1 if the numbers of both true positives and false negatives are 0, and to have precision 1 if the numbers of both true positives and false positives are 0. We can see from the table that fine-tuned BERT trained on the source document as a whole achieves the best results here, followed by TF-IDF. BERT-Base has abysmal precision, which also supports our observation from the document level evaluation, namely that it predicts at least one matched sentence for every document. Under both ways of treating the source documents, fine-tuned BERT achieves better precision than the pretrained base model. TF-IDF has a stronger performance on this task than it does on the document level evaluation. It remains to be seen whether the boost for the baseline methods and the slight slippage of ABCN are due to false negatives and false positives on non-matched documents. Therefore, we single out the matched documents and perform the same evaluations, as well as the ranking tasks, since ranking is only meaningful for documents that contain a matched sentence. From Table 3.3 we see that the BERT-Base model indeed underperforms due to its tendency to mark every sentence as matched, as indicated by its almost perfect recall and near-zero precision. Although fine-tuning BERT using the source documents as a whole in the training data boosts the precision, it also adversely affects recall. Since in this corpus we limit source documents to 20 sentences and target documents to 40 sentences, we report the ranking evaluation for the top 5 sentences. The BERT model fine-tuned on our corpus achieves the highest ranking scores, just as Devlin et al. (2018) show that it has the advantage when given appropriate data. The very low performance of the Jaccard and Levenshtein distances shows that the actual matched sentences share very few lexical features with the source document. As for cross joining source sentences and target sentences without labels, we see that none of the baseline methods achieve good precision. Our proposed model achieves the best F1 measure and the highest precision over both experimental settings, while being the runner-up on the ranking tasks. This once again shows that ABCN has the advantage when annotations are given for only one side.

3.8.6 Ablation Test

To test the efficacy of contextualizing the sentence representation with the Bi-LSTM layer in §3.4, we perform an ablation test in which we remove the Bi-LSTM layer from the model. To compare with the proposed model structure, we train and evaluate the ablated model on both the document and sentence level tasks, using the validation set to pick the best hyper-parameters.

Table 3.2: Sentence level results (macro-averaged recall/precision/F1) of ABCN in com- parison with baseline methods under both whole source document vs. target sentences and source sentences vs. target sentences, evaluated on all target documents.

                          recall   precision   F1
Source Document vs. Target Sentences
  TF-IDF                  0.487    0.451       0.468
  Jaccard Distance        0.429    0.430       0.430
  Levenshtein Distance    0.424    0.424       0.424
  BERT                    0.576    0.051       0.093
  BERT (ARC)              0.556    0.551       0.554
Source Sentences vs. Target Sentences
  TF-IDF                  0.449    0.443       0.446
  Jaccard Distance        0.428    0.429       0.428
  Levenshtein Distance    0.427    0.427       0.427
  BERT                    0.144    0.073       0.097
  BERT (MRPC)             0.333    0.188       0.240
  ABCN                    0.521    0.422       0.466

Table 3.3: Sentence level results (macro-averaged recall/precision/F1) of ABCN in com- parison with baseline methods under both whole source document vs. target sentences and source sentences vs. target sentences, evaluated on only matched documents.

                          recall   precision   F1      MAP@5   MRR@5
Source Document vs. Target Sentences
  TF-IDF                  0.214    0.151       0.177   0.423   0.426
  Jaccard Distance        0.012    0.013       0.012   0.320   0.322
  Levenshtein Distance    0.000    0.000       0.000   0.257   0.259
  BERT                    0.997    0.085       0.157   0.376   0.378
  BERT (ARC)              0.270    0.261       0.265   0.641   0.646
Source Sentences vs. Target Sentences
  TF-IDF                  0.087    0.077       0.082   0.406   0.408
  Jaccard Distance        0.014    0.015       0.014   0.292   0.294
  Levenshtein Distance    0.013    0.014       0.013   0.210   0.211
  BERT                    0.153    0.031       0.051   0.098   0.098
  BERT (MRPC)             0.347    0.094       0.147   0.247   0.248
  ABCN                    0.474    0.303       0.370   0.539   0.543

Table 3.4: Ablation test where Bi-LSTM layer is removed to test the efficacy of contextu- alized sentence representation, on both document and sentence level dataset, in comparison to proposed model structure.

                          recall   precision   F1      MAP@5   MRR@5
Document level evaluation
  ABCN                    0.854    0.726       0.785   N/A     N/A
  ABCN-ablated            0.744    0.731       0.737   N/A     N/A
Sentence level evaluation – all target documents
  ABCN                    0.521    0.422       0.466   N/A     N/A
  ABCN-ablated            0.521    0.431       0.472   N/A     N/A
Sentence level evaluation – matched documents
  ABCN                    0.474    0.303       0.370   0.539   0.543
  ABCN-ablated            0.395    0.238       0.297   0.515   0.519

Table 3.4 shows the comparison. From the results, we see that using the sentence representation directly from the CNN layer reduces false positives on unmatched documents, as shown by the improved precision on the document level evaluation and on the sentence level evaluation over all target documents. However, the proposed model outperforms the ablated one in recall and F1 on the document level evaluation. For the purpose of selecting similar sentences given only matched documents, we see the effectiveness of the contextualized sentence representation, as the proposed model beats the ablated model by a good margin on both the retrieval and the ranking metrics.

3.9 Conclusion

Semantic text reuse is just as widespread as lexical reuse. In this chapter, we present an attention-based neural network model with a Siamese structure (ABCN) to capture contextualized sentence representations, model the interactions between sentences, and select a relevant sentence given another document. We evaluate this model on the ACL Anthology corpus, where we try to recover the links between sentences containing a citation and the abstracts of the cited documents. We compare the ABCN model with a state-of-the-art BERT model and show better performance on retrieval tasks. The ABCN model unfortunately shows a higher rate of false positives on unmatched documents, where the lack of an explicit citation is treated as a negative example in our evaluation. While the quality of the data is partly to blame, we think a better attention mechanism might also help to reduce the number of false positives. In addition, as BERT shows better performance than ABCN on the ranking tasks in the sentence-level evaluation that includes only matched documents, we believe there is still room to improve our model to make more accurate predictions. Now that we have developed and evaluated models to detect lexical and semantic local text reuse, we can use these relationships as indirect evidence of social ties and model the structure of social networks and how information propagates through them.

Chapter 4

Modeling information cascades with rich feature sets

In the previous chapters, we showed models that detect shared behavior between actors in social networks in the form of text reuse. Thanks to that shared behavior, we are able to observe sequences of actors, such as newspapers, individuals, websites, political parties, and so on, "adopting" similar or identical information. Modeling the structure of these sequences can help to reveal the dynamics of the underlying network. To achieve that, we present a simple log-linear, edge-factored directed spanning tree (DST) model of cascades over network nodes. This allows us to talk concretely about the likelihood objective for supervised and unsupervised training, for which we present a contrastive objective function. We note that other models besides the DST model could be trained with this contrastive objective. Finally, we derive the gradient of this objective and its efficient computation using Tutte's directed matrix-tree theorem.


4.1 Network Structure Inference

There has been a great deal of work on inferring underlying network structure using information cascades, most of it based on the independent cascade (IC) model (Saito et al., 2008). We evaluate the DST model against a transmission-based model from Rodriguez and Schölkopf (2012), which, similar to our work, also uses directed spanning trees to represent cascades, but employs a submodular parameter-optimization method and fixed activation rates. In addition, we compare our work with a more advanced model from Rodriguez et al. (2014), which uses a generative probabilistic model for inferring both static and dynamic diffusion networks. This is a line of work that starts from generative models with fixed activation rates (Gomez Rodriguez et al., 2010; Rodriguez and Schölkopf, 2012; Myers and Leskovec, 2010) and later develops toward inferring the activation rates between nodes to reveal the network structure (Rodriguez et al., 2011; Gomez Rodriguez et al., 2013; Snowsill et al., 2011; Rodriguez et al., 2013, 2014). Zhai et al. (2015) use a Markov chain Monte Carlo approach for the inference problem. Linderman and Adams (2014) propose a probabilistic model based on mutually-interacting point processes and also use MCMC for inference. Gui et al. (2014) model topic diffusion in multi-relational networks. An interesting approach by Amin et al. (2014) infers the unknown network structure, assuming that detailed timestamps for the spread of the contagion are not observed but that "seeds" for cascades can be identified or even induced experimentally. Wang et al. (2012) propose feature-enhanced probabilistic models for diffusion network inference while still maintaining the requirement that exact propagation times be observed and modeled. Daneshmand et al. (2014) and Abrahao et al. (2013) perform theoretical analyses of transmission-based cascade inference models. While the foregoing approaches are all based on parametric models of propagation time between infections, Rong et al. (2016) experiment with a nonparametric approach to discriminating the distribution of

diffusion times between connected and unconnected nodes. Recently, Brugere et al. (2016) have compiled a survey of methods and applications for different network structure inference problems. Tutte's directed matrix-tree theorem, which plays a key role in our approach, has been used in natural language processing to infer posterior probabilities for edges in nonprojective syntactic dependency trees (Smith and Smith, 2007; Koo et al., 2007; McDonald and Satta, 2007) and to infer semantic hierarchies (i.e., ontologies) over words (Bansal et al., 2014).

4.2 Log-linear Directed Spanning Tree Model

For each cascade, define a set of activated nodes $x = \{x_1, \ldots, x_n\}$, each of which might be associated with a timestamp and other information that serve as input to the model. Nodes thus correspond to (potentially) dateable entities such as webpages or posts, and not to aggregates such as websites or users. Let $y$ be a directed spanning tree over $x$, i.e., a map $y: \{1, \ldots, n\} \to \{0, 1, \ldots, n\}$ from child indices to parent indices of the cascade. In the

range of the mapping $y$ we add a new index 0, which represents a dummy "root" node $x_0$. This allows us both to model single cascades and to disentangle multiple cascades on a set of nodes $x$, since more than one "seed" node might attach to the dummy root. In the experiments below, we model datasets with both single-rooted ("separated") and multi-rooted ("merged") cascades. A valid directed spanning tree is by definition acyclic. Every node has exactly one parent, via the edge $x_{y(i)} \to x_i$, except that the root node has in-degree 0. We might wish to impose additional constraints $C$ on the set of spanning trees: for instance, we might require that edges not connect nodes with timestamps known to be in reverse order. Let

$Y_C$ be the set of all valid directed spanning trees that satisfy the rules in the constraint set $C$ over $x$, and $Y$ be the set of all directed spanning trees over the same sequence $x$ without any constraint imposed.

Define a log-linear model for trees over $x$. The unnormalized probability of a tree $y \in Y$ is thus:

$$\tilde{p}_\theta(y \mid x) = e^{\theta \cdot f(x, y)} \qquad (4.1)$$

where $f$ is a feature vector function on the cascade and $\theta \in \mathbb{R}^m$ parameterizes the model. Following McDonald et al. (2005), we assume that features are edge-factored:

$$\theta \cdot f(x, y) = \sum_{i=1}^{n} \theta \cdot f_x(y(i), i) = \sum_{i=1}^{n} s(y(i), i) \qquad (4.2)$$

where $s(i, j)$ is the score of a directed edge $i \to j$. In other words, given the sequence $x$ and the assumption that the cascade is a directed spanning tree, this directed spanning tree (DST) model assumes that the edges in the tree are all conditionally independent of each other. Despite the constraints they impose on features, we can perform inference with edge-factored models using tractable $O(n^3)$ algorithms, which is one of the advantages this

model brings. Since $\tilde{p}_\theta(y \mid x)$ is not a normalized probability, we divide it by the sum over all possible directed spanning trees, which gives us:

$$p_\theta(y \mid x) = \frac{e^{\theta \cdot f(x, y)}}{\sum_{y' \in Y} e^{\theta \cdot f(x, y')}} = \frac{e^{\theta \cdot f(x, y)}}{Z_\theta(x)} \qquad (4.3)$$

where $Z_\theta(x)$ denotes the sum of the log-linear scores of all directed spanning trees, i.e., the partition function.

If, for a given set of parameters $\theta$, we merely wish to find $\hat{y} = \arg\max_{y \in Y} p_\theta(y \mid x)$, we can pass the scores for each candidate edge $i \to j$ to the Chu-Liu-Edmonds maximum directed spanning tree algorithm (Chu and Liu, 1965; Edmonds, 1967).
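A sketch of this decoding step is shown below, using the networkx implementation of the maximum spanning arborescence (Chu-Liu-Edmonds); the graph construction and the toy score function are ours, standing in for the learned edge scores $s(i, j)$.

```python
# Sketch: decode the highest-scoring cascade tree with Chu-Liu-Edmonds.
import networkx as nx

def best_cascade_tree(n, edge_score):
    G = nx.DiGraph()
    for j in range(1, n + 1):            # candidate child nodes x_1 .. x_n
        for i in range(0, n + 1):        # candidate parents, incl. dummy root 0
            if i != j:
                G.add_edge(i, j, weight=edge_score(i, j))
    # Node 0 has no incoming edges, so it is forced to be the root.
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return {j: i for i, j in tree.edges()}   # child -> predicted parent y(j)

# Toy example (illustrative score function only):
parents = best_cascade_tree(3, lambda i, j: 1.0 / (1 + abs(i - j)))
```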

4.3 Likelihood of a cascade

When we observe all the directed links in a training set of cascades, we now have the machinery to perform supervised training with maximum conditional likelihood. We can

simply maximize the likelihood of the true directed spanning tree $p_\theta(y^* \mid x)$ for each cascade in our training set, using the gradient computations discussed below. When we do not observe the true links in a cascade, we need a different objective function. While we cannot restrict the numerator in the likelihood function (4.3) to a single, true tree, we can restrict it to the set of trees $Y_C$ that obey some constraints $C$ on valid cascades. As mentioned above, these constraints might, for instance, require that links point forward in time or avoid long gaps. We can now write the likelihood function for each cascade $x$ as a sum of the probabilities of all directed spanning trees that meet the constraints $C$:

$$L_x = \sum_{y \in Y_C} p_\theta(y \mid x) = \frac{\sum_{y \in Y_C} e^{\theta \cdot f(x, y)}}{Z_\theta(x)} = \frac{Z_{\theta,C}(x)}{Z_\theta(x)} \qquad (4.4)$$

where $Z_{\theta,C}(x)$ denotes the sum of the log-linear scores of all valid directed spanning trees under the constraint set $C$. This is a contrastive objective function that, instead of maximizing the likelihood of a single outcome, maximizes the likelihood of a neighborhood of possible outcomes contrasted with implicit negative evidence (Smith and Eisner, 2005). A similar objective could be used to train other cascade models besides the log-linear DST model presented above, e.g., models such as the Hawkes process in Linderman and Adams (2014). As noted above, cascades on a given set of nodes are assumed to be independent. We

thus have a log-likelihood over all N cascades:

$$\log L_N = \sum_x \log L_x = \sum_x \log \frac{Z_{\theta,C}(x)}{Z_\theta(x)} \qquad (4.5)$$

4.4 Maximizing Likelihood

Our goal is to find the parameters $\theta$ that solve the following maximization problem:

$$\max_\theta \log L_N = \max_\theta \sum_x \big( \log Z_{\theta,C}(x) - \log Z_\theta(x) \big) \qquad (4.6)$$

To solve this problem with quasi-Newton numerical optimization methods such as L-BFGS (Liu and Nocedal, 1989), we need to compute the gradient of the objective function, which

for a given parameter $\theta_k$ is given by the following equation:

$$\frac{\partial \log L_N}{\partial \theta_k} = \sum_x \left( \frac{\partial \log Z_{\theta,C}(x)}{\partial \theta_k} - \frac{\partial \log Z_\theta(x)}{\partial \theta_k} \right) \qquad (4.7)$$

For a cascade that contains $n$ nodes, even if we have a tractable number of valid directed

spanning trees in $Y_C$, there will be $n^{n-2}$ (Cayley's formula) possible directed spanning trees for the normalization factor $Z_\theta(x)$, which makes the computation intractable. Fortunately,

there exists an efficient algorithm that can compute $Z_\theta(x)$, or $Z_{\theta,C}(x)$, in $O(n^3)$ time.

4.5 Matrix-Tree Theorem and Laplacian Matrix

Tutte (1984) proves that for a set of nodes $x_0, \ldots, x_n$, the sum of the scores of all directed spanning trees $Z_\theta(x)$ in $Y$ rooted at $x_j$ is

$$Z_\theta(x) = \big| \hat{L}^{\,j}_{\theta,x} \big| \qquad (4.8)$$

where $\hat{L}^{\,j}_{\theta,x}$ is the matrix produced by deleting the $j$-th row and column from the Laplacian matrix $L_{\theta,x}$. Before we define the Laplacian matrix, we first denote:

$$u_{\theta,x}(j, i) = e^{\theta \cdot f_x(j, i)} = e^{s_{\theta,x}(j, i)} \qquad (4.9)$$

where $j = y(i)$ is the parent of $x_i$ in $y$. Recall that we define the unnormalized score of a spanning tree over $x$ as a log-linear model using edge-factored scores (Eqs. 4.1, 4.2). Therefore, we have:

$$e^{\theta \cdot f(x, y)} = e^{\sum_i \theta \cdot f_x(y(i), i)} = \prod_{i=1}^{n} u_{\theta,x}(y(i), i) \qquad (4.10)$$

where $u_{\theta,x}(j, i)$ represents the multiplicative contribution of the edge from parent $j$ to child $i$ to the total score of the tree.

Now we can define the Laplacian matrix $L_{\theta,x} \in \mathbb{R}^{(n+1) \times (n+1)}$ for directed spanning trees by:

$$[L_{\theta,x}]_{j,i} = \begin{cases} -u_{\theta,x}(j, i) & \text{if edge } (j, i) \in C \\ \sum_{k \in \{0, \ldots, n\},\, k \neq j} u_{\theta,x}(k, i) & \text{if } j = i \\ 0 & \text{if edge } (j, i) \notin C \end{cases} \qquad (4.11)$$

where $j$ indexes a parent node and $i$ a child node. When summing over only the valid directed spanning trees, we place a 0 in every entry whose edge from parent $j$ to child $i$ does not satisfy the specified constraint set. When summing over all possible directed spanning trees, however, the constraint set $C$ is $V \times V$, that is, all possible edges. We can use LU factorization to compute the matrix inverse and the determinant of the Laplacian matrix in $O(n^3)$ time. Meanwhile, the Laplacian matrix is

diagonally dominant, in that we use positive edge scores to create the matrix. The matrix therefore is guaranteed to be invertible.
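A numpy sketch of this computation follows: build the Laplacian of Eq. 4.11 from a matrix of edge scores, delete the root row and column, and take the log-determinant as in Eq. 4.8. The function name is ours, and edges outside the constraint set are assumed to have been zeroed out beforehand, so the same routine yields both $Z_\theta(x)$ and $Z_{\theta,C}(x)$.

```python
# Sketch: log partition function via the Matrix-Tree Theorem (Eqs. 4.8, 4.11).
import numpy as np

def log_partition(u, root=0):
    # u: (n+1, n+1) nonnegative edge scores, rows = parents j, cols = children i;
    # u[j, i] is already 0 for self-loops and for edges outside the constraint set C.
    L = -u.copy()
    np.fill_diagonal(L, u.sum(axis=0) - np.diag(u))   # diagonal: sum_{k != i} u(k, i)
    minor = np.delete(np.delete(L, root, axis=0), root, axis=1)
    sign, logdet = np.linalg.slogdet(minor)           # log |L-hat| in O(n^3)
    return logdet

# log L_x = log Z_{theta,C}(x) - log Z_theta(x), using a constrained and an
# unconstrained score matrix respectively (Eq. 4.4).
```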

4.6 Gradient

Smith and Smith (2007) use a similar inference approach for probabilistic models of nonprojective dependency trees. They derive that for any parameter $\theta_k$,

$$\frac{\partial \log Z_\theta(x)}{\partial \theta_k} = \frac{1}{|L_{\theta,x}|} \sum_{i=1}^{n} \sum_{j=0}^{n} u_{\theta,x}(j, i)\, f_x^{k}(j, i) \times \left( \frac{\partial |L_{\theta,x}|}{\partial [L_{\theta,x}]_{i,i}} - \frac{\partial |L_{\theta,x}|}{\partial [L_{\theta,x}]_{j,i}} \right) \qquad (4.12)$$

Also, for an arbitrary matrix $A$, they derive the gradient of the determinant $|A|$ with respect to any cell $[A]_{j,i}$ using the determinant and the entries of the inverse matrix:

$$\frac{\partial |A|}{\partial [A]_{j,i}} = |A|\,[A^{-1}]_{i,j} \qquad (4.13)$$

Plugging (4.13) into (4.12) gives the final gradient of $\log Z_\theta(x)$ with respect to $\theta_k$:

$$\frac{\partial \log Z_\theta(x)}{\partial \theta_k} = \sum_{i=1}^{n} \sum_{j=0}^{n} u_{\theta,x}(j, i)\, f_x^{k}(j, i) \times \left( [L_{\theta,x}^{-1}]_{i,i} - [L_{\theta,x}^{-1}]_{i,j} \right) \qquad (4.14)$$
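A numpy sketch of Eq. 4.14 is given below, accumulating the gradient from the inverse of the Laplacian minor. Here `feat(j, i)` is a stand-in for the edge feature vector $f_x(j, i)$, and entries involving the deleted root row and column are treated as zero, which is one reasonable reading of the equation.

```python
# Sketch: gradient of log Z_theta(x) via the inverse Laplacian (Eq. 4.14).
import numpy as np

def grad_log_partition(u, feat, root=0):
    n_plus_1 = u.shape[0]
    L = -u.copy()
    np.fill_diagonal(L, u.sum(axis=0) - np.diag(u))
    keep = [k for k in range(n_plus_1) if k != root]
    inv_minor = np.linalg.inv(L[np.ix_(keep, keep)])   # O(n^3), e.g. via LU factorization
    # Re-embed the inverse minor so indices match the full (n+1) x (n+1) Laplacian;
    # entries touching the deleted root row/column are treated as 0.
    Linv = np.zeros_like(L)
    Linv[np.ix_(keep, keep)] = inv_minor
    grad = np.zeros_like(feat(0, 1), dtype=float)
    for i in keep:                       # children 1..n
        for j in range(n_plus_1):        # candidate parents 0..n
            if j != i:
                grad += u[j, i] * feat(j, i) * (Linv[i, i] - Linv[i, j])   # Eq. 4.14
    return grad
```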

4.7 ICWSM 2011 Webpost Dataset

One of the hardest tasks in network inference problems is gathering information about the true network structure. Most existing work has conducted experiments both on synthetic data with different parameter settings and on real-world networks that match the assumptions of the proposed method. Generating synthetic data, however, is less feasible if we want to exploit complex textual features, which would negate one of the advantages of the DST model; generating child text from parent documents is beyond the scope of this paper, although we believe it to be a promising direction for future work. In this paper, therefore, we train and test on documents from the ICWSM 2011 Spinn3r dataset (Burton et al., 2011). This allows us to compare our method with MultiTree (Rodriguez and Schölkopf, 2012) and InfoPath (Rodriguez et al., 2014), both of which output a network given a set of cascades. We also analyze the performance of DST at the cascade level, an ability that MultiTree, InfoPath, and similar methods lack.

4.7.1 Dataset description

The ICWSM 2011 Spinn3r dataset consists of 386 million different web posts, such as blog posts, news articles, and social media content, made between January 13 and February 14, 2011. We first avoid including hyperlinks that connect two posts from the same website, as they could simply be intra-website navigation links. In addition, we enforce a strict chronological order from source post to destination post to filter out erroneous date fields. Then, by backtracing hyperlinks to the earliest ancestors and computing connected components, we obtain about 75 million clusters, each of which serves as a separate cascade. We only keep the cascades containing between 5 and 100 posts, inclusive. This yields 22,904 cascades containing 205,234 posts from 61,364 distinct websites. We create ground truth for each cascade by using the hyperlinks as a proxy for real information flow. For the time-varying network, we include only the edges that appear on a particular day in the corresponding network, while for the static network we simply include any existing edge regardless of the timestamp. Of these cascades, approximately 61% don't have a tree structure, and among the remaining cascades, 84% have flat structures, meaning that for each cascade, all nodes other than the earliest one are in fact attached to the earliest node in the ground truth. In this paper we report the performance of our models on the original dataset and on a dataset where the cascades are guaranteed to be trees. To construct the tree cascades, before finding the connected components using hyperlinks, we remove posts which have hyperlinks coming from more than one source website. Selecting, as above, cascades whose sizes are between 5 and 100 nodes, this yields 20,424 separate cascades containing 201,875 posts from 63,576 distinct websites. We also merge all cascades that start within an hour of each other to make the inference problem more challenging and realistic, since if we don't know the links, we are unlikely to know the membership of nodes in cascades exactly. In the original ICWSM 2011 dataset, we obtain 789 merged cascades, and for the tree-constrained data, we obtain 938 merged cascades. When merging cascades, we only change the links to the dummy root node, and the underlying network structure remains the same. The DST model can learn different parameters depending on whether we train it on separate cascades or merged cascades. We report on comparisons of both with MultiTree and InfoPath.
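The cascade-extraction step just described can be sketched as follows; the tuple layout of the hyperlink records and the use of networkx are assumptions made purely for illustration.

```python
import networkx as nx
from urllib.parse import urlparse

def extract_cascades(hyperlinks, min_size=5, max_size=100):
    """Group posts into cascades via connected components over hyperlinks.

    hyperlinks: iterable of (src_post_url, dst_post_url, src_time, dst_time)
                tuples, where src is the earlier, linked-to post (hypothetical layout).
    """
    g = nx.Graph()
    for src, dst, src_time, dst_time in hyperlinks:
        # Skip intra-website links and links violating chronological order.
        if urlparse(src).netloc == urlparse(dst).netloc:
            continue
        if src_time >= dst_time:
            continue
        g.add_edge(src, dst)
    return [c for c in nx.connected_components(g)
            if min_size <= len(c) <= max_size]
```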

4.7.2 Feature sets

Most existing work on network structure inference described in Related Work uses only the time difference between two nodes as the feature for learning and inference. Our model can incorporate different features, as pointed out in Eq. 4.1 and 4.2. Hence, in this paper we experiment with different features and report on the following sets:

• basic feature sets, which include only the node information and the timestamp difference, which resembles what the other models do; and

• enhanced feature sets, which include the basic feature sets, as well as the languages

that both nodes of an edge use, the content types assigned by Spinn3r (blog, news, etc.), whether a node is the earliest node in the cluster, and the Jaccard distance between the normalized texts of the two nodes.

We use one-hot encoding to represent the feature vectors. In addition, we discretize real-valued features by binning them.
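To make the feature construction concrete, here is an illustrative sketch of assembling a one-hot edge feature vector with binned real-valued features; the field names, bin counts, and vocabulary handling are hypothetical and not the exact features used in the experiments.

```python
import numpy as np

def edge_features(parent, child, vocab, n_bins=10, max_lag_hours=720.0):
    """One-hot feature vector for a candidate edge (parent -> child).

    parent/child: dicts with hypothetical keys 'time', 'lang', 'type', 'text'.
    vocab:        mapping from discrete feature name to a fixed index.
    """
    vec = np.zeros(len(vocab))

    def fire(name):
        if name in vocab:
            vec[vocab[name]] = 1.0

    # Basic feature: binned timestamp difference in hours.
    lag = (child["time"] - parent["time"]) / 3600.0
    fire(f"lag_bin={min(int(n_bins * lag / max_lag_hours), n_bins - 1)}")

    # Enhanced features: languages, content types, binned Jaccard distance.
    fire(f"langs={parent['lang']}|{child['lang']}")
    fire(f"types={parent['type']}|{child['type']}")
    a, b = set(parent["text"].split()), set(child["text"].split())
    jaccard_dist = 1.0 - len(a & b) / max(len(a | b), 1)
    fire(f"jaccard_bin={int(jaccard_dist * n_bins)}")
    return vec
```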

4.7.3 Result of unsupervised learning at cascade level

Table 4.1: Cascade-level inference of DST with different feature sets, in an unsupervised learning setting, compared with the naive attach-everything-to-earliest baseline, for the original cascades extracted from the ICWSM 2011 dataset.

Cascade types        Method          Recall   Precision   F1
Separated Cascades   DST Basic       0.348    0.454       0.394
                     DST Enhanced    0.504    0.658       0.571
                     Naive Baseline  0.450    0.587       0.509
Merged Cascades      DST Basic       0.027    0.035       0.031
                     DST Enhanced    0.036    0.047       0.040
                     Naive Baseline  0.015    0.019       0.017

Table 4.2: Cascade-level inference of DST with different feature sets, in an unsupervised learning setting, compared with the naive attach-everything-to-earliest baseline, for the tree-structure-enforced cascades extracted from the ICWSM 2011 dataset.

Cascade types        Method          Recall   Precision   F1
Separated Cascades   DST Basic       0.622    0.622       0.622
                     DST Enhanced    0.946    0.946       0.946
                     Naive Baseline  0.941    0.933       0.937
Merged Cascades      DST Basic       0.042    0.042       0.042
                     DST Enhanced    0.246    0.246       0.246
                     Naive Baseline  0.043    0.043       0.043

In practice, we use Apache Spark to parallelize the computation and speed up the optimization process. We use batch gradient descent with a fixed learning rate of $5 \times 10^{-3}$ and report the result after 1,500 iterations. Inspecting the results of the last two iterations confirms that all training runs converge. The constraint set C contains edges that satisfy (1) the time constraints, and (2) the restriction that only nodes within the first hour of a specific cascade can be attached to the root.

Table 4.3: Cascade-level inference of DST with different feature sets, in a supervised learning setting, for merged cascades from the tree-structure-enforced ICWSM 2011 dataset.

                       Training                    Test
Merged Cascades        Recall       Precision      Recall       Precision
Basic Feature Set      0.171±0.001  0.171±0.001    0.164±0.007  0.164±0.007
Enhanced Feature Set   0.475±0.002  0.475±0.002    0.455±0.011  0.455±0.011
Naive Baseline         0.042±0.001  0.042±0.001    0.046±0.009  0.046±0.009

The DST model outputs finer-grained structure than existing approaches and predicts a tree for each cascade, with as many edges as nodes. We report micro-averaged recall, precision, and F1 for the whole dataset. Table 4.1 shows the results of training the DST model in an unsupervised setting with different feature sets on both the separate-cascades and the merged-cascades datasets. We also include a naive baseline that simply attaches all other nodes to the earliest node in a cascade. From Table 4.1 we can see the flatness problem: the naive baseline already achieves 45% recall and 58.7% precision, while knowing only the websites and time lags yields 34.8% recall and 45.4% precision. This is partly attributable to the time constraints we apply when creating the Laplacian matrix, which ensure that the model at least gets the earliest node and one of the edges leaving that node right. The enhanced feature set, on the other hand, uses features from the textual content of posts, such as the Jaccard distance. Having this information helps the DST model outperform the naive baseline. In the merged-clusters setting, instead of only one seed per cascade being attached to the implicit root, we have multiple seeds occurring within the same hour attached to the root. Hence, the naive baseline strategy can at most get right the original cascade to which the earliest node belongs. DST with both feature sets achieves better results. We believe that adding more content-based features will further boost performance in the future. We expect, however, that disentangling multiple information flow paths will remain a challenging problem in many domains.

4.7.4 Result of unsupervised learning at network level

In this section, we evaluate effectiveness at inferring the network structure, comparing with MultiTree and InfoPath. The DST model outputs a tree for each cascade with posterior probabilities for each edge. To convert to a network, we sum all posteriors for a given edge to get a combined score, from which we obtain a ranked list of edges between websites. We report two different sets of quantitative measurements: recall/precision/F1 and average precision. When using InfoPath, we assume an exponential edge transmission model and sample 20,000 cascades for the stochastic gradient descent. The output gives the activation rate of each edge in the network per day. We keep those edges that have a non-zero activation rate and are actually present on that day, to counteract the decaying assumption in InfoPath. We then compute the recall/precision/average precision for each day. To compare the DST model with InfoPath on the time-varying network, we pick edges from the DST model's ranked list on each day, matching the number of edges chosen by InfoPath. We exclude MultiTree because it cannot model a dynamic network. Figure 4.1 shows the comparison between InfoPath and the DST model with different feature sets. We can see that the DST model outperforms InfoPath by a large margin on every metric, with the enhanced feature set being the best.
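The conversion from cascade-level posteriors to a ranked list of network edges can be sketched as follows, assuming a hypothetical per-cascade dictionary mapping website pairs to edge posteriors.

```python
from collections import defaultdict

def rank_network_edges(cascade_posteriors, top_k=None):
    """Sum per-cascade edge posteriors into a ranked list of network edges."""
    scores = defaultdict(float)
    for posteriors in cascade_posteriors:
        for edge, p in posteriors.items():     # edge = (src_site, dst_site)
            scores[edge] += p                  # combined score = sum of posteriors
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k] if top_k is not None else ranked
```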

Now we can compare the DST model with MultiTree and InfoPath on the static network. We include every edge in the output of InfoPath. Table 4.4 shows a comparison among the models in a similar way to the comparisons mentioned before, where the number of edges from the DST model equals the total number of edges selected by InfoPath. As for MultiTree, we keep all parameters at their defaults while setting the number of edges to match InfoPath's as well. Since MultiTree assumes a fixed activation rate, while InfoPath gives an activation rate based on the timestep, there is no way to rank the edges in the static networks these methods infer; therefore, we don't report average precision for them. The DST model also outperforms MultiTree and InfoPath in inferring static network structure. Notably, the recall/precision of InfoPath is much higher than its recall/precision per day (Figure 4.1). This is because edges InfoPath correctly selects in the static network might not be correct on any specific day.

Table 4.4: Comparison of MultiTree, InfoPath, and DST on inferring a static network on the original ICWSM 2011 dataset. The DST model is trained and tested unsupervisedly on both separate and merged cascades using different feature sets, alongside the naive attach-everything-to-earliest-node baseline.

Cascade types        Method          Recall   Precision   F1      AP
Separated Cascades   MultiTree       0.367    0.242       0.292   N/A
                     InfoPath        0.414    0.273       0.329   N/A
                     DST Basic       0.557    0.368       0.443   0.279
                     DST Enhanced    0.842    0.556       0.670   0.599
                     Naive Baseline  0.622    0.595       0.608   0.385
Merged Cascades      DST Basic       0.052    0.034       0.041   0.003
                     DST Enhanced    0.057    0.038       0.045   0.003
                     Naive Baseline  0.015    0.019       0.017   0.001

Table 4.5: Comparison of MultiTree, InfoPath, and DST on inferring a static network on the modified ICWSM 2011 dataset with enforced tree structure. The DST model is trained and tested unsupervisedly on both separate and merged cascades using different feature sets, alongside the naive attach-everything-to-earliest-node baseline.

Cascade types        Method          Recall   Precision   F1      AP
Separated Cascades   MultiTree       0.249    0.196       0.220   N/A
                     InfoPath        0.375    0.294       0.330   N/A
                     DST Basic       0.618    0.486       0.544   0.452
                     DST Enhanced    0.950    0.747       0.836   0.915
                     Naive Baseline  0.941    0.933       0.937   0.892
Merged Cascades      DST Basic       0.083    0.065       0.073   0.012
                     DST Enhanced    0.207    0.163       0.182   0.047
                     Naive Baseline  0.043    0.043       0.043   0.005

4.7.5 Enforcing tree structure on the data

In the ICWSM 2011 dataset, 61% of the cascades are non-tree DAGs. Since DST, MultiTree, and InfoPath all assume that cascades are trees, we evaluate performance on data where this constraint is satisfied, i.e., the tree-constrained dataset described above. Table 4.2 shows that the naive baseline for separate cascades achieves 94.1% recall (and 93.3% precision) because of flatness; DST with enhanced features beats it by a mere 0.5%. This leaves very little room for DST to improve on the cascade structure inference problem for separate cascades. For merged cascades, the naive baseline can at most get right the original cascade to which the earliest node belongs. DST with the basic feature set does adequately at finding the earliest nodes but finds very few correct edges inside the cascades, while the enhanced feature set is better at reconstructing the cascade structures thanks to its knowledge of textual features, which gives it a lead of roughly 600%. With only 24.6% recall/precision, there is still room for improvement on this very hard inference problem. On network inference, DST with the enhanced feature set also performs best on recall and average precision but lags on precision.

[Figure 4.1 panels — legend: Enhanced, Basic, InfoPath; x-axis: Day (15–45): (a) Recall on graph data, (b) Precision on graph data, (c) AP on graph data, (d) Recall on tree data, (e) Precision on tree data, (f) AP on tree data.]

Figure 4.1: Recall, precision, and average precision of InfoPath and DST on predicting the time-varying networks generated per day. The DST model is trained unsupervisedly on separate cascades using basic and enhanced features. The upper row uses graph-structured cascades from the ICWSM 2011 dataset. The lower row uses the subset of cascades with tree structures.

Tables 4.4 and 4.5 and Figures 4.1d, 4.1e, and 4.1f show similar performance when comparing with MultiTree and InfoPath on inferring different types of network structure.

4.7.6 Result of supervised learning at cascade level

Our proposed model can perform both supervised and unsupervised learning, with different objective functions. One of the main contributions of the DST model is its ability to learn cascade-level structure in a feature-rich, unsupervised way. However, supervised learning can establish an upper bound for unsupervised performance when trained with the same features. Table 4.3 shows the results of supervised learning using DST on the merged cascades with tree structure enforced. Since there are only 938 merged cascades, we perform 10-fold cross-validation and report the results over 5 folds. We split the training and test sets by interleaved round-robin sampling from the merged-cascades dataset. Although not precisely comparable to DST in the unsupervised setting due to this jackknifing, Table 4.3 still shows results about twice as large as for unsupervised training.

4.8 State Policy Adoption Dataset

In political science, there have been many efforts to conceptualize the diffusion of policy adoptions as a dyadic process, which implies a diffusion network between states. Desmarais et al. (2015) use a dataset of 187 policies introduced by Boehmke and Skinner (2012), apply NetInf (Gomez Rodriguez et al., 2010) to the dataset to infer that diffusion network, and describe the contributions of several political attributes to the network connections. To be consistent with previous experiments, we revisit this dataset and use InfoPath in lieu of NetInf to infer the network.

Table 4.6: Logistic regression of the networks inferred by DST and InfoPath on independent networks: geographic distance between states and contiguity of states. Statistical significance is assessed at the < 0.05 level according to the QAP p-value against the indicated null hypothesis.

                 DST                     InfoPath
                 Coef.    Pr(≥ |b|)      Coef.    Pr(≥ |b|)
Intercept        0.373    0.000          0.455    0.000
Distance         0.026    0.599         -0.039    0.079
Contiguity       0.041    0.341          0.024    0.564

4.8.1 Dataset description

This dataset contains 187 policies, spanning a broad array of policy areas, that were adopted by 50 states in the years 1913–2009. Unfortunately, the dataset does not include the original texts of the policies, so we are not able to use textual similarity to better infer the network. However, there are still other features we can use in the model, such as geolocation information, the population of the corresponding states over time, etc. In addition, we do not have ground truth for the structure of cascades or the diffusion network. Instead, we fit a logistic regression of the inferred networks and test it against the indicated null hypothesis.

4.8.2 Effect of proximity of states

For this dataset, we use the nodes and the logarithms of the time differences as features in our model. Moreover, within each cascade, we do not exclude links between adoptions whose time differences span a very long window. Since this experiment uses a logistic regression on network variables, we generate the network with only 1,000 out of 2,450 possible edges. For InfoPath, we use the same parameter settings as in the previous experiment, except that we have at most 187 cascades to sample for stochastic gradient descent. Because both models infer a dichotomous network, we use logistic regression instead of linear regression of the diffusion network on two network variables: the geographic distance between the centers of states and an indicator variable for the contiguity of states. We then use the QAP (Quadratic Assignment Procedure) to test the coefficients against the null hypothesis. Table 4.6 shows the results; statistical significance is assessed at the < 0.05 level according to QAP p-values derived from 1,000 network permutations. From the two-tailed tests in Table 4.6, we can see that InfoPath has a higher QAP p-value for the contiguity of states than DST, but a very low p-value for geographic distance.
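For readers unfamiliar with QAP, the following sketch illustrates the permutation test for a single network covariate; the `fit_coef` callback (e.g., a logistic regression that returns one coefficient) and the two-tailed p-value computation are illustrative assumptions, not the exact procedure used here.

```python
import numpy as np

def qap_pvalue(y_net, x_net, fit_coef, n_perm=1000, seed=0):
    """QAP permutation test for one covariate network.

    y_net:    n x n binary adjacency matrix of the inferred diffusion network.
    x_net:    n x n covariate matrix (e.g., a contiguity indicator).
    fit_coef: function(y, x) -> fitted regression coefficient for x.
    """
    rng = np.random.default_rng(seed)
    n = y_net.shape[0]
    observed = fit_coef(y_net, x_net)
    null_coefs = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(n)
        # Relabel nodes of the covariate network (rows and columns together).
        null_coefs[b] = fit_coef(y_net, x_net[np.ix_(perm, perm)])
    # Two-tailed p-value: fraction of permuted coefficients at least as extreme.
    return np.mean(np.abs(null_coefs) >= abs(observed))
```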

4.9 Conclusion

We have presented a method to uncover the network structure of information cascades using an edge-factored, conditional log-linear model, which can incorporate more features than most comparable models. This directed spanning tree (DST) model can also infer finer-grained structure at the cascade level, compared to other methods that focus on inferring global network structure. We utilize the directed matrix-tree theorem to show that the likelihood of the conditional model can be computed in cubic time and to derive a contrastive, unsupervised training procedure. We show that on the ICWSM 2011 Spinn3r dataset, our proposed method outperforms the baseline MultiTree and InfoPath methods in terms of recall, precision, and average precision.

One major drawback of the presented model is the assumption that the underlying cascade structure is a directed spanning tree. Networks often exhibit a complex topology and, regardless of the accuracy of the DST model, it can never model the dynamics of multiple parents for a given node. In addition, even though we reduce the time complexity of computing Eq. 4.7 to $O(n^3)$ thanks to Tutte's matrix-tree theorem, the computation time is still significant in practice; therefore, it might not scale to very large real-life networks. Last but not least, evaluating the DST model requires knowledge of the underlying network structure, since it predicts edges between nodes in the network. However, it is often impractical to obtain such knowledge for many types of social networks, for instance networks of public policy diffusion, where no obvious links can be drawn without expensive expert judgments. In the next chapter, we will propose an efficient and scalable model that assumes a directed acyclic graph (DAG) structure at the cascade level and does not require observed network structure, either to train or to evaluate.

Chapter 5

Modeling information cascades using self-attention neural networks

In Chapter 4 we described an unsupervised model for information cascades based on the assumption that the structure of a cascade is a directed spanning tree. This assumption, however, is somewhat limiting. Admittedly, some information is obtained from a single, unique source, but most of the time information comes from various places, and the tree structure fails to capture such dynamics. For instance, an academic paper may pick up an idea that has been circulating by citing a couple of papers, or a news outlet may only decide to publish a piece because multiple prominent outlets have already reported it. We need a better model for such an information diffusion process. What is more, even with the proliferation of cascades that can be observed thanks to the ubiquitous use of online social networks, the transmission probabilities remain unknown even when the underlying network structure is known. This also applies to social networks in which observing the links between nodes is very expensive, if not impossible, in the first place. In these cases, evaluating the efficacy of a model that learns to infer those probabilities is impractical. Instead, we can have a model that learns to predict the

next active nodes, which are observed. This task could also be of interest to several applications. For example, in public policy diffusion networks, a model that predicts links between states would mainly help qualitative analysis such as visualization, whereas predicting the next state that might adopt a certain policy could prove to be a more valuable result, and one whose effectiveness we can actually assess. In viral marketing, bidders are certainly more interested in knowing whether their target group is likely to get the information next than in how the information is passed. In this chapter we describe a neural model (§5.3) that uses the observed relative order of nodes in a cascade to guide the training process, which is unsupervised with regard to the underlying network structure. Our model also assumes that the cascade structure is a directed acyclic graph (§5.2), or DAG, instead of a directed spanning tree. Finally, by the nature of the model, it is easy to include different kinds of nodal side information, such as texts, at both the learning and inference stages. We compare the proposed model in §5.4 to several neural models on both node and edge prediction tasks.

5.1 Node representation learning

Rather than learning a parametric or non-parametric model of distributions defined on the discrete edges of a diffusion graph, many recent efforts embed the representations of nodes, along with structural information, into a latent continuous space. DeepWalk (Perozzi et al., 2014) can be considered one of the earliest examples of such work. They draw an analogy to representation learning models in natural language processing and consider a randomly sampled sequence of nodes to be similar to a sentence of tokens. Hence the learned representation can embed topological information, just as words are defined by their context. Their objective function is similar to Skip-Gram in word2vec (Mikolov et al., 2013b).

Node2vec (Grover and Leskovec, 2016) is a modification of DeepWalk with a small difference in the random walks: it has two parameters that control the probability of exploring a smaller or larger neighborhood. LINE, proposed by Tang et al. (2015) for large-scale network embedding, can preserve first- and second-order proximities. SDNE (Wang et al., 2016) uses a deep autoencoder with multiple non-linear layers to preserve the neighbor structures of nodes. GraphSAGE (Hamilton et al., 2017) is an inductive framework for dynamic network embedding that leverages vertex feature information (e.g., text attributes) to efficiently generate vertex embeddings for previously unseen data. These network embedding methods focus on representation at the network level. GAT (Veličković et al., 2018), or the Graph Attention Network, uses multi-headed attention layers to accentuate the importance of prominent nodes. It also uses the underlying network structure to mask the attention mechanism so that it focuses only on the nodes connected to the current one.

There is another line of work that focuses on learning node representations from the structures of observed cascades. For example, recurrent neural networks (Jordan, 1986; Pearlmutter, 1989; Cleeremans et al., 1989) have been effective for modeling sequential data in natural language processing. DeepCas (Li et al., 2017) uses random-walk-style sampling over a cascade graph and treats the sampled paths as sequences of nodes. It then uses a bidirectional GRU (Cho et al., 2014) to learn a representation of the cascade graph in an end-to-end manner. GraphRNN (You et al., 2018) is a deep autoregressive model that addresses the challenges arising from the non-unique, high-dimensional nature of graphs and the complex, non-local dependencies between edges in a given graph, and approximates any distribution of graphs with minimal assumptions about their structure. However, these models all assume nodes in a linear sequence order, which is hardly the case in practice. Various modified RNN-based units have been proposed to address this drawback. For example, TreeRNN (Tai et al., 2015) allows an LSTM to be applied to a tree-like topology. It is very useful in natural language processing, as sentences usually have corresponding syntactic or semantic dependency structures that form trees. As mentioned earlier, the actual structure of a cascade bears more resemblance to a DAG. DAG-RNNs have received some attention in the literature. In protein structure prediction (Baldi and Pollastri, 2003), a contact map over the amino acids of a protein is decomposed into a DAG by traversing the map in certain directions, and a DAG-RNN is then applied. In computer vision, in order to perform a face recognition task, a region adjacency graph with labeled edges is first extracted from each image; the graph is then decomposed into an edge-labeled DAG by breadth-first search and modeled by an RNN-LE model (Bianchini et al., 2005). For network embedding, recent work on Topo-LSTM (Wang et al., 2017a) modifies the LSTM (Hochreiter and Schmidhuber, 1997) unit so it can be used on a DAG structure. It addresses a limitation of previous DAG-RNN-style models, which can only handle static DAG structures, whereas in information diffusion the DAGs are dynamic and evolve over time. Topo-LSTM will be an important baseline to compare with our proposed model in this chapter.

Most of the models mentioned above require knowledge of the underlying network, either to assign parameters for a sampling strategy or to build the actual structures of cascades. IC-SB (Goyal et al., 2010) and Embedded-IC (Bourigault et al., 2016) are based on the Independent Cascade (IC) framework and can work unsupervisedly, meaning they do not require the underlying network to learn the node representations. DYFERENCE (Ghalebi et al., 2018) is a non-parametric dynamic network model, based on a mixture of coupled hierarchical Dirichlet processes, which allows inference on the evolving community structure in networks. In addition, Zhang et al. (2018) and Chen et al. (2019) show the efficacy of including textual features in helping encode the network topology.

5.2 Information cascades as DAGs

Whether the original underlying network is directed or not, the structure of a certain cascade can be considered as a directed acyclic graph, thanks to the introduction of direction by the relative orders of activated nodes. Formally, we have the underlying network G = (V, E) where V is the node set and E is the edge set. Each input cascade sequence is an ordered

sequence of nodes $c = \{v_1, v_2, \dots, v_t\}$ where $v_t \in V$. If we incorporate side information about nodes, such as raw texts, we may substitute each element of the cascade with a tuple

$(v_t, ts_t, text_t, \dots)$ where $v_t \in V$ and the remaining entries are side-information features; for example, $ts_t$ is the time information and $text_t$ is a representation of the text. The structure of each cascade can be represented as a directed graph. It is easy to see that the structure of a cascade is also acyclic, because directed edges always point from an earlier-activated node to a node that is influenced later. Figure 5.1 illustrates a toy network, with solid arrows and nodes representing the observed cascade. We can see that in the observed cascade, nodes A and D have no incoming arrows, and node D also has no outgoing arrows to any observed node in the cascade. This is because, in practice, it is not realistic to observe the whole network: the initial node in a cascade, and some subsequent nodes, have outside influence that needs to be incorporated into the modeling process. The dashed circles and dashed arrows represent the part of the network that does not participate in the current information cascade.

5.3 Graph self-attention network

The observations we obtain from the dataset are merely the cascade sequences introduced in Section 5.2. This means we need to somehow capture the influence of already-activated nodes on future ones in a cascade. In the meantime, we need gold labels for training the model.

[Figure 5.1: toy network with nodes A–F; solid nodes and arrows mark the observed cascade, dashed ones the unobserved part of the network.]

Figure 5.1: The illustration of a cascade structure as a DAG in a toy network.

Therefore, we choose the observed node activated at time t as our ground truth, and try to predict it using the previously activated nodes in the same cascade. Some, but not all, of those nodes are involved in activating the current node, and we aim to learn such interactions, as they reflect the connections between nodes in the network; they can be considered edges in the network. Here, incorporating an attention mechanism into our model proves useful, as it exposes the "alignment", or a distribution over the roles of existing nodes, which is exactly the goal we want to achieve. In this section, we discuss the overall architecture of our graph self-attention network (GSAN) and the variants we have tested to capture various aspects of the problem.

5.3.1 Analogy to language modeling

In some sense, the problem of node prediction is similar to word prediction in language modeling. A classic N-gram model predicts the next word $w_n$ from the previous $N-1$ words. To formulate the problem, we have

$$p(w_n \mid w_1, \dots, w_{n-1}) = \frac{p(w_1, \dots, w_{n-1}, w_n)}{p(w_1, \dots, w_{n-1})}$$

By using the chain rule of probability we get the joint probability of the word sequence as:

$$p(w_1, \dots, w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, \dots, w_{n-1})$$

However, the intuition behind N-gram models is that we approximate the history by just the last few words, based on the Markov assumption, instead of using the entire history. Our problem differs from the classical N-gram model in this sense, as we cannot safely assume that only the most recently activated nodes matter. In fact, for real-world datasets such as the ICWSM 2011 webposts (section 4.7.1), the naive attach-everything-to-earliest baseline results in Tables 4.1 and 4.2 show that, in contrast to language modeling, the earliest nodes are usually the more influential ones in cascade structure inference. Therefore, we do not adopt the Markov assumption in our model. Recent work on language modeling utilizing RNNs (Jordan, 1986; Pearlmutter, 1989; Cleeremans et al., 1989), GRUs (Cho et al., 2014), and LSTMs (Hochreiter and Schmidhuber, 1997) aims at using the entire history to improve performance. In GRUs and LSTMs, gate functions control the flow of information and can potentially control which parts of the history are useful. Similar mechanisms are also useful for the node prediction problem. In our assumption, we want to look at the entire history, but we also believe that only some of the nodes in the history are influential in passing information to the current node. Recall that in chapter 3.5 we used an attention mechanism to regulate the information we look at when inferring the current representation. We also want to apply the attention mechanism here. However, RNN-based sequential computation inhibits parallelization and cannot explicitly model long- and short-range dependencies or hierarchy. Notably, the Transformer (Vaswani et al., 2017) solves these disadvantages of RNNs. Language modeling is different from sequence-to-sequence tasks in that there is only one single sequence to work on. We therefore specifically consider self-attention for single-sequence problems.

Thanks to Transformer-based self-attention models (Vaswani et al., 2017), there have been new efforts on language modeling such as OpenAI GPT (Radford et al., 2018), BERT (Devlin et al., 2018), and XLNet (Yang et al., 2019), which achieve state-of-the-art performance on various generation and prediction tasks. Because of the promising results of multi-headed, multi-layer Transformer-based models on single-sequence tasks, we propose a self-attention based neural network with a Transformer-like structure to model the sequence of nodes in a given cascade.

5.3.2 Graph self-attention layer

Our model consists of multiple instances of the same multi-headed self-attention layer. The particular layer structure closely follows the work of Vaswani et al. (2017) on Transformers. The input to the $i$th self-attention layer is the sequence of node features in the cascade,

$$h^{(i)} = \{h^{(i)}_1, h^{(i)}_2, \dots, h^{(i)}_t\}, \quad h^{(i)}_j \in \mathbb{R}^{f},$$

where $t$ is the number of observed nodes in the given cascade and $f$ is the dimension of the feature vector of each node. The layer then produces the output, which is also the input to the next layer if the current layer is not the last,

$$h^{(i+1)} = \{h^{(i+1)}_1, h^{(i+1)}_2, \dots, h^{(i+1)}_t\}, \quad h^{(i+1)}_j \in \mathbb{R}^{f},$$

with the same number of features as the input. Figure 5.2 shows that the structure of our model consists of $L$ identical multi-headed self-attention layers.

For the masked multi-headed self-attention component in each layer, we want to maintain the autoregressive property of the given cascade, but we do not have knowledge of the underlying network structure as Veličković et al. (2018) do. Therefore, given a history of length $t$, we mask out the future nodes from $t+1$ to the end of the cascade. In this way, even though we do not have any structural information, we still compute the attention over influence from the history only. Following Vaswani et al. (2017), we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$; the keys and values are also packed together into matrices $K$ and $V$.

[Figure 5.2 components, bottom to top: Node features → Position Embedding → L × (Masked Multi-headed Self-attention → Layer Norm → Feed Forward → Layer Norm) → Node prediction.]

Figure 5.2: Graph Self-Attention Network architecture, with L identical multi-headed self-attention layers.

We use the scaled dot-product attention mechanism, and the Transformer block with self-attention can be formulated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{f_k}}\right)V \tag{5.1}$$

Adding multi-head attention, we have

$$\mathrm{MultiHead}(h) = \Big(\big\Vert_{k=1}^{K} \mathrm{head}_k\Big) W^{O}, \qquad \mathrm{head}_k = \mathrm{Attention}\big(hW_k^{Q},\, hW_k^{K},\, hW_k^{V}\big) \tag{5.2}$$

where $W_k^{Q}, W_k^{K}, W_k^{V} \in \mathbb{R}^{d \times d_f}$ are different linear transformations, $d$ is a hyperparameter set for the model, $W^{O} \in \mathbb{R}^{K d_f \times d}$ is the corresponding linear projection layer, and $\Vert$ is the concatenation operator. In addition to the self-attention component, each of the layers also contains a fully connected feed-forward network, which is applied separately and identically:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\, W_2 + b_2 \tag{5.3}$$

where each layer has its own set of parameters $W_1, W_2$. Finally, a residual connection (He et al., 2016) is added to each of the computation components, followed by layer normalization (Ba et al., 2016). All these components are combined to form a single graph self-attention layer.
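A minimal PyTorch-style sketch of one such layer is given below; the use of `nn.MultiheadAttention`, the layer sizes, and the post-norm arrangement are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn as nn

class GraphSelfAttentionLayer(nn.Module):
    """One GSAN layer: masked multi-head self-attention plus a position-wise
    feed-forward network, each with a residual connection and layer norm."""

    def __init__(self, f_dim, n_heads, ffn_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(f_dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(f_dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, f_dim))
        self.norm1 = nn.LayerNorm(f_dim)
        self.norm2 = nn.LayerNorm(f_dim)

    def forward(self, h):
        # h: (batch, t, f_dim) features of the observed cascade prefix.
        t = h.size(1)
        # Causal mask: position i may attend only to positions <= i, since we
        # have no network structure to mask with, only the activation order.
        future = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                       device=h.device), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=future)
        h = self.norm1(h + attn_out)          # residual + layer normalization
        return self.norm2(h + self.ffn(h))    # residual + layer normalization
```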

5.3.3 Graph self-attention network

Notice in Figure 5.2 that we also include a position embedding to be added to the input node features. Similar to Gehring et al. (2017) but in contrast to Vaswani et al. (2017), we choose to learn the representations of the positions of each node in the cascade. This Modeling information cascades using self-attention neural networks 84

gives us the flexibility to avoid using often noisy and unreliable timestamps observed from webposts data, but focus on the relative order of the nodes getting activated. Then we run the input through each of the self-attention layer to get the final output. In the self- attention layer, all of the keys, values and queries come from the same place, in this case, the output of the previous layer. For L layers, each output of the L − 1 layers are hidden

(l) representations for input nodes, except for the outputs of the last layer. We use ht as the predicted representation of node to be activated at time step t + 1. The whole network can be formulated as:

(0) h = We + Wp (5.4) h(l) = self-attention-layer(h(l−1)), ∀h ∈ [1,L]

(0) where h is the initial inputs to the first layer, We is the node embedding matrix, and Wp is the position embedding matrix, both of which are trainable variables. Our goal is to project the output of last layer back to node embedding, and to minimize the cross entropy loss

between the predicted representation and the representation of actual activated node ht+1:

(L) T yˆt+1 = Softmax(h We ) N (5.5) P L(yt+1, yˆt+1) = − yt+1,i logy ˆt+1,i i=1 where i iterates all possible nodes. In the experiments below, we will refer to this architecture as GSAN-Basic. Apart from the raw node embeddings, we can also add in other side information such as texts of the nodes. In contrast to Chapter 4, where edge features are used to infer a directed spanning tree structure for cascades, we use nodes as basic units for the model so we will use node- specific features in this chapter. In order to add node features other than embeddings, we can obtain the corresponding feature vectors first, then add or concatenate them into the Modeling information cascades using self-attention neural networks 85 node embedding vectors. Zhang et al. (2018) and Chen et al. (2019) show the efficacy of including textual features in helping encode network topology.

5.3.4 Senders and receivers

In the GSAN-Basic model, we use a single set of universal embeddings to represent nodes. Another seemingly natural choice is to assign different types of representations depending on the role a node plays in the diffusion process. For example, when information diffuses from node a to node b, node a's role as a sender is different from its role as a receiver when it gets information from another node c. Universal embeddings fail to capture such dynamics. Some recent work, such as Bourigault et al. (2016), Wang et al. (2017a), and Kipf et al. (2018), adopts this notion and uses different sets of representations for a given node. In order to incorporate this idea into our GSAN model, we need to change the structure slightly. Figure 5.3 shows the modified architecture, which is similar to the encoder-decoder structure in Vaswani et al. (2017). Basically, we apply a self-attention network to

the input sender embeddings $h^{(0)}_{\text{sender}}$ and obtain the outputs after L layers. We then use these outputs, along with self-attention layers applied to the receiver embeddings, to get the final outputs. For the added attention layers that model the interactions between sender outputs and receiver outputs, the queries come from the previous receiver self-attention layer, and the memory keys and values come from the output of the sender stack. This allows every position in the receiver to attend over all history positions in the input sequence. In the sender and receiver self-attention layers, on the other hand, all of the keys, values, and queries come from the outputs of the corresponding previous layers. The intuition behind such a structure is that we want to learn interactions between nodes that are already activated, by using the self-attention network on senders, and then learn the interactions between nodes as they receive information up to the current time step, as well as how information from the senders affects the receivers. In order to maintain the autoregressive property and prevent future information from flowing into the history, we apply masking to all attention-related sublayers. We refer to this architecture as GSAN-SR.

5.3.5 Hard attention

Even though we mean to use the entire history to provide evidence of potential influencers, there is still a drawback to the attention mechanism: it uses a softmax function, which means it might dilute the contribution of the actual influencers by smoothing out their attention scores. As the history grows longer, it could also become impractical to find the correct sources. Luong et al. (2015) propose a local attention mechanism that selectively focuses on a small window of context and is differentiable. Recently, Yang et al. (2018) argue that there exists localness in self-attention networks and learn a Gaussian bias that indicates the center and scope of the local region to which more attention should be paid. These all suggest a hard attention mechanism that selectively focuses on a few potential nodes in the history, not necessarily in a contiguous window, for node activation prediction. However, finding such a definitive set of nodes has combinatorial time complexity, which is impractical. Wu et al. (2018) propose a dynamic programming method for non-monotonic hard attention on character transduction, but it is still relatively expensive compared to soft attention, with a time complexity of $O(|x| \cdot |y| \cdot |\Sigma_y|)$, where $x$ and $y$ are the inputs/outputs and

$\Sigma_y$ is the output vocabulary. Shen et al. (2018) and Indurthi et al. (2019) propose hard attention mechanisms that use sampling and reinforcement learning for training. Yang et al. (2018) also show that a Transformer-based network tends to capture different scopes at different layers, with higher layers more likely to capture long-range dependencies than the neighboring-word focus of lower layers. We build our hard attention network on GSAN-Basic and replace the masked multi-headed self-attention sublayer in the last ($L$th) layer with a reinforcement learning (RL) agent's action, as shown in Figure 5.4.

[Figure 5.3 components: a sender stack (Sender features → Position Embedding → L × masked multi-headed self-attention layers) and a receiver stack (Receiver features → Position Embedding → L × layers whose masked multi-headed attention additionally attends over the sender outputs), feeding the node prediction.]

Figure 5.3: Graph Sender-Receiver attention network architecture, with L identical multi-headed attention layers for senders and receivers respectively.

[Figure 5.4 components: Node features → Position Embedding → (L − 1) × (Masked Multi-headed Self-attention → Layer Norm → Feed Forward → Layer Norm) → final layer with an RL agent action mask → Node prediction.]

Figure 5.4: Graph hard self-attention network architecture, with L − 1 identical multi-headed attention layers and the last layer replaced with a reinforcement learning agent selecting mask actions.

To obtain the hard attention, for each given history we generate an equal-length vector of binary random variables as the action vector $\alpha = \{\alpha_1, \dots, \alpha_T\}$, where $\alpha_t = 1$ indicates that the node at time $t$ is a potential influencer and $\alpha_t = 0$ tells us to ignore that node. The influential nodes are sampled from a Bernoulli distribution over each $\alpha_t$ for all nodes in the given history. Since this hard selection introduces discrete variables through which gradients cannot be backpropagated, estimating them requires reinforcement learning algorithms. Therefore, we design the reinforcement policy based on the node hidden representations of the $(L-1)$-th layer:

$$\begin{aligned} \pi_{\theta_h}(\alpha_t) &= \sigma\Big(\tanh\big(h^{(L-1)}_{1,\dots,t}\, W_{\mathrm{hard}}\, {h^{(L-1)}_{1,\dots,t}}^{T}\big)\Big) \\ \alpha_t &\sim \mathrm{Bernoulli}\big(\pi_{\theta_h}(\alpha_t)\big) \end{aligned} \tag{5.6}$$

where $\sigma$ is the sigmoid function used to turn the score into a probability in $(0, 1)$ and $\theta_h$ are the parameters of the hard-attention component. The action vector $\alpha$ is used as a mask in place of the original mask that maintains the autoregressive property in Eq. 5.1. We use the accuracy between the predicted cascade sequence and the target cascade sequence as the reward for a given cascade:

$$r(y, \hat{y}) = \sum_{i=1}^{T} \mathbb{1}(y_i = \hat{y}_i) \tag{5.7}$$

Apart from our original loss function, we also need a loss function for estimating the parameters of the policy. Because the goal in reinforcement learning is usually to maximize the expected reward, we define our loss function to be the negative expected reward for the entire sequence:

$$\mathcal{L}_{\theta_h} = -\,\mathbb{E}_{\hat{y}_1, \dots, \hat{y}_T \sim \pi_{\theta_h}(\alpha_1, \dots, \alpha_T)}\big[r(y, \hat{y})\big] \tag{5.8}$$

In practice, one usually approximates this expectation with a single sample from the distribution of actions given by the Bernoulli process. Hence, the derivative of the loss function in Eq. 5.8 is given as follows:

$$\nabla_{\theta_h} \mathcal{L}_{\theta_h} = -\,\mathbb{E}_{\hat{y}_{1\dots T} \sim \pi_{\theta_h}}\big[\nabla_{\theta_h} \log \pi_{\theta_h}(\hat{y}_{1\dots T})\, r(y, \hat{y})\big] \tag{5.9}$$

We approximate the expectation in the gradient $\nabla_{\theta_h} \mathcal{L}_{\theta_h}$ of Eq. 5.9 using the REINFORCE (Williams, 1992) algorithm. This gives us the ability to jointly optimize the original loss function $\mathcal{L}$ in Eq. 5.5 and $\mathcal{L}_{\theta_h}$ in Eq. 5.8. We will refer to this model as GSAN-RL.
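The following sketch illustrates the GSAN-RL step of Eqs. 5.6–5.9 for a single cascade: sampling a Bernoulli action mask from a bilinear policy score and forming a single-sample REINFORCE loss. How the bilinear score matrix is reduced to one probability per position (here, its diagonal) and the separation of the node-prediction logits are simplifying assumptions of this sketch, not the thesis's exact formulation.

```python
import torch

def gsan_rl_step(h_prev, logits, targets, w_hard):
    """One hard-attention step (Eqs. 5.6-5.9), written for a single cascade.

    h_prev:  (t, f) hidden states from layer L-1.
    logits:  (t, n_nodes) node predictions of the final layer (computed elsewhere).
    targets: (t,) gold next-node ids aligned with the predictions.
    w_hard:  (f, f) parameter matrix of the hard-attention policy.
    """
    # Bilinear policy score squashed to (0, 1); one keep-probability per position.
    scores = torch.sigmoid(torch.tanh(h_prev @ w_hard @ h_prev.T))
    probs = scores.diagonal()
    alpha = torch.bernoulli(probs)                 # sampled binary action mask

    # Reward: accuracy of the predicted sequence (Eq. 5.7).
    reward = (logits.argmax(dim=-1) == targets).float().sum()

    # Single-sample REINFORCE estimate of -E[log pi * reward] (Eqs. 5.8-5.9).
    log_pi = (alpha * torch.log(probs + 1e-8)
              + (1 - alpha) * torch.log(1 - probs + 1e-8)).sum()
    rl_loss = -log_pi * reward.detach()
    return alpha, rl_loss
```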

5.3.6 Edge prediction

Apart from predicting the next activated node, we are also interested in predicting edges between nodes, or inferring the network structure. Recall that one justification for using an attention mechanism was the hope that the attention scores could shed light on which nodes in the history might be influential in activating the next one. However, with a multi-headed attention sublayer in the model, it is difficult to extract this information. Therefore, in order to predict edges utilizing attention scores, we replace the multi-headed attention sublayer in the Lth, or the last, layer of our network with a single-head attention sublayer. Figure 5.5 illustrates the changes made to the model.
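As a sketch of how the single-head attention scores might be turned into edge predictions (the thresholding itself is described in §5.4.5), consider the following; the threshold value and the orientation of the attention matrix are assumptions.

```python
import torch

def edges_from_attention(attn, node_ids, threshold=0.1):
    """Read predicted edges off a single-head attention matrix.

    attn:      (t, t) attention weights; attn[i, j] is how strongly position i
               attends to the earlier position j (future positions are masked).
    node_ids:  length-t list of node ids in activation order.
    threshold: validation-tuned cutoff -- an assumed hyperparameter.
    """
    edges = []
    for i in range(1, attn.size(0)):
        for j in range(i):                         # only earlier positions
            if attn[i, j].item() >= threshold:
                # Edge from the earlier (parent) node to the node at position i.
                edges.append((node_ids[j], node_ids[i]))
    return edges
```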

5.4 Experiments

5.4.1 Datasets

In this section we conduct experiments on two datasets to evaluate the performance of our proposed model. We choose two datasets extracted from the Web that have been widely used in prior work on evaluating network inference problems.

[Figure 5.5 panels: (a) GSAN-Basic and (b) GSAN-SR, each with the last layer's multi-headed attention replaced by a single-head attention sublayer.]

Figure 5.5: Modified Graph Self-Attention Networks with the last layer replaced by a single-head attention sublayer to output edge predictions.

The first dataset is the Memetracker (Leskovec et al., 2009) dataset, which tracks new topics, ideas, and "memes" across the Web. The dataset is generated by detecting short and distinctive phrases inside quotation marks and then clustering the mutational variants of phrases using edit distance over word tokens. No direct links between websites or posts were used in the original cluster generation process, only textual features. The published dataset consists of 7,665,108 such memes from between August 2008 and April 2009, drawn from 4,455,215 distinct articles on 259,036 websites. Prior work has employed two variant ways of cleaning the original published dataset. One is provided by Wang et al. (2017a) (Memes (Ver.1)). In their paper they do not describe how they process the data, which yields 5,000 nodes, 313,669 edges, and 54,847 cascades. Since theirs is a supervised model, they need ground-truth network links at training time. They do not use the hyperlinks as indications of edges; instead, they create a link between two websites if one of them appears earlier than the other in any cascade. The major drawback of this dataset is that it omits all side information, meaning we cannot get the raw meme phrases. To overcome this problem, we also create a Memetracker dataset from the raw published data that includes meme phrases (Memes (Ver.2)). Similarly to Simmons et al. (2011), we use phrases extracted from posts which link to other posts within the same cluster, and use the hyperlinks as the edges of the underlying network. We then further filter out short phrases, as many of them (such as "i love you", "world of warcraft", etc.) are very mundane and do not transfer any meaningful information. In their original work, Leskovec et al. (2009) use a meme length of 4 as the cutoff, and we follow their practice here. The second dataset is Digg posts collected by Hogg and Lerman (2012). It contains diffusions of stories as voted on by users, along with the friendship network of the users. The statistics of the datasets we use are listed in Table 5.1. To be consistent with the other datasets, we also split Memes (Ver.2) into 75% for training, wherein 10% are earmarked for validation, and the remaining 25% for testing.

                       Digg        Memes (Ver.1)   Memes (Ver.2)
# Nodes                279,632     5,000           18,402
# Edges                2,617,993   313,669         149,455
# Cascades             3,553       54,847          25,617
Avg. Cascade Length    30.0        17.0            7.5

Table 5.1: Statistics of datasets. Memes (Ver.1) is from Wang et al. (2017a).

Also, since Topo-LSTM limits the length of a cascade to 30, we use only the first 30 nodes of each cascade across all our datasets in order to make a fair comparison among the methods, even though our model can learn from longer cascades without a penalty in training time.

5.4.2 Baselines

We compare GSAN with the following models on the task of predicting node activation:

IC-SB (Goyal et al., 2010) infers the diffusion probability pu,v of each edge (u, v) ∈ E given training cascades, and predicts diffusion under the classical independent cascade (IC) framework. The probability of an inactive node v being activated is the complement of it not being activated by all its neighbors. The static Bernoulli (SB) method in their paper, which shows the best performance for our problem setting, is used in the comparison for the Digg and Memes (Ver.1) datasets.

Embedded-IC (Bourigault et al., 2016) is also based on the IC framework. Instead of directly learning discrete diffusion probabilities, they model the diffusion space as a continuous latent space and learn embedded representations of users, using the distance between user representations to define the transmission likelihood. They distinguish the roles of users in a transmission process as sender and receiver by assigning them to different continuous latent spaces.

DeepCas (Li et al., 2017) represents a diffusion by paths sampled with random walks on the induced diffusion subgraph. The random walk requires knowledge of the transmission probabilities for sampling. They then use a GRU network with an attention mechanism to learn a single representation of the diffusion and a dense layer to predict the cascade size. The last dense layer is replaced with a logistic classifier to predict node activations.

DeepWalk (Perozzi et al., 2014) represents the simple baseline which computes the em- bedding of nodes without using the cascade information, and aggregates the embeddings of the active nodes by average pooling to represent the diffusion. A logistic classifier is used to predict node activations.

Topo-LSTM (Wang et al., 2017a) utilizes the underlying network structure in predicting node activations for each given cascade. It uses a customized LSTM unit, abbreviated Topo-LSTM, to handle the dynamic DAG structures of cascades and performs a node prediction task. We report baseline performance on the Digg and Memes (Ver.1) datasets using the results listed in Wang et al. (2017a). Both IC-SB and Embedded-IC are unsupervised methods, which means they require no knowledge of the underlying networks, while DeepCas, DeepWalk, and Topo-LSTM all require network structures to train the model. We also compare GSAN with the DST method described in Chapter 4 on an edge prediction task. Although edges are hidden variables when training GSAN to predict the next activated nodes, we would like to see its performance compared to a model explicitly proposed for modeling transmission likelihood. This evaluation is performed on the Memes (Ver.2) dataset.

5.4.3 Experimental settings

Given the current state of a cascade, predicting the next activated node can be evaluated as a retrieval problem (Bourigault et al., 2016), due to the large number of target nodes. Both GSAN and the baseline models can output the probabilities of all possible target nodes, which gives us the opportunity to obtain a ranked list based on those probabilities. Having a short list of top-ranked candidates is useful for cascade prediction tasks. Wang et al. (2017a) argue that it could be hard to exactly predict the next active node among the huge list of candidates, so they do not report precision@1 or accuracy in their evaluation. However, we think this metric, along with the other metrics below, shows the potential of our proposed model. For evaluation, we use the following metrics:

• P@1 or accuracy: precision/recall at the top one predicted node, measuring the efficacy of the model at predicting the correct next activated node.

• MAP@k: The classical Mean Average Precision for top k predictions.

• Hits@k: The rate of the top-k ranked nodes containing the next active node.

Similar to Wang et al. (2017a), we use a varying k among {10, 50, 100}. In order to be comparable with the baseline methods, all these metrics are computed micro-averaged. To choose the hyperparameters of our model, we search for the best performance based on the loss function on a validation set, choosing the number of layers l in {1, 2, 3}, the number of features f in {128, 256, 512}, and the number of heads h in {2, 4, 8}, with a linear projection layer of four times the number of features and a learning rate of $1 \times 10^{-4}$. As for the baseline models, for Topo-LSTM and DeepCas the hidden dimensionality d is set to 512. For DeepCas, 200 walks of length 10 are generated for each cascade, the same setting as in Li et al. (2017). For Embedded-IC and DeepWalk, d is set to 64 and 128, respectively.
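These retrieval metrics can be computed as in the sketch below; with a single gold node per prediction, AP@k reduces to the reciprocal rank of the gold node, an assumption of this simplified implementation.

```python
def hits_and_map_at_k(ranked_lists, targets, k=10):
    """Micro-averaged Hits@k and MAP@k for next-node prediction.

    ranked_lists: list of sequences of candidate node ids, sorted by
                  predicted probability (highest first).
    targets:      list of gold next-node ids, aligned with ranked_lists.
    """
    hits, ap_sum = 0, 0.0
    for ranked, gold in zip(ranked_lists, targets):
        top_k = list(ranked[:k])
        if gold in top_k:
            hits += 1
            ap_sum += 1.0 / (top_k.index(gold) + 1)   # AP@k with one relevant item
    n = len(targets)
    return hits / n, ap_sum / n

# Precision@1 (accuracy) is simply Hits@1.
```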

5.4.4 Node prediction

Table 5.2 shows the aggregated results for the GSAN variants and the baseline models. The table shows a clear uptrend on both metrics as k increases, because the correct node is more likely to be found when more candidates are included. For the Digg dataset, Topo-LSTM is the best performer by quite a large margin compared to the other methods. The GSAN-based variants beat Embedded-IC but fall behind IC-SB. When comparing to the other supervised models, GSAN-SR slightly beats DeepWalk but not DeepCas. Among the variants, the sender/receiver structure achieves better results, perhaps because users in the Digg network show more pronounced differences in behavior across roles. The low numbers across the board for the Digg dataset, compared to Memes (Ver.1), suggest that Digg might present complicated or messy structures for the models to recover. In addition, the relatively low number of cascades (3.5k in total) is not ideal for a parameter-intensive model like GSAN to fit. As for the Memes (Ver.1) dataset, GSAN-Basic and GSAN-RL beat the state-of-the-art Topo-LSTM with regard to MAP@k by a slim margin, while GSAN-SR is on par in performance. Over the other unsupervised and supervised models, GSAN-Basic leads by a large margin (no less than approximately 33%). Given that Topo-LSTM requires the underlying network structure for training, GSAN could be used more widely for networks, such as the distribution of political ideas or state legislative policy diffusion, where obtaining explicit links is quite difficult, if not impossible. For Hits@k, GSAN-Basic is slightly behind, which means it is better at ranking the correct nodes higher when they are retrieved. Table 5.3 shows the accuracy of node prediction among the GSAN variants. The numbers on Digg are not great, while the accuracy of predicting the next active node in Memes (Ver.1) from 5,000 candidates is pleasantly surprising.

Table 5.2: Comparison of variants of GSAN and baseline models on Digg and Memes (Ver.1) datasets. The results of baseline models are from Wang et al. (2017a).

                    Digg                          Memes (Ver.1)
MAP@k (%)           @10       @50       @100      @10       @50       @100
IC-SB               3.624     4.584     4.800     18.220    19.428    19.558
Embedded-IC         2.812     3.564     3.755     18.270    19.247    19.374
DeepWalk            3.288     4.088     4.289     13.523    14.636    14.798
DeepCas             3.743     4.632     4.842     19.564    20.618    20.753
Topo-LSTM           5.862     6.842     7.031     29.000    29.933    30.037
GSAN-Basic          3.086     3.893     4.083     29.324    30.237    30.332
GSAN-RL             3.923     4.714     4.902     29.314    30.207    30.300
GSAN-SR             3.546     4.317     4.503     29.036    29.897    29.988

Hits@k (%)          @10       @50       @100      @10       @50       @100
IC-SB               10.826    33.113    48.412    41.356    65.884    74.868
Embedded-IC         8.887     26.117    39.220    35.124    55.966    65.053
DeepWalk            9.689     29.985    44.342    28.315    51.193    62.617
DeepCas             10.269    30.826    45.741    38.858    60.478    69.921
Topo-LSTM           15.410    37.363    50.384    50.781    69.548    76.850
GSAN-Basic          9.660     28.252    41.792    51.009    69.282    75.869
GSAN-RL             10.879    29.126    42.418    51.094    69.200    75.704
GSAN-SR             9.883     27.737    40.353    50.767    67.951    74.253

Table 5.3: Accuracy of GSAN variants on the Digg and Memes (Ver.1) datasets. Accuracies are listed as percentages.

              Digg     Memes (Ver.1)
GSAN-Basic    0.923    20.102
GSAN-RL       1.282    20.343
GSAN-SR       1.612    19.741

5.4.5 Edge prediction

One reason for our model choice is the hope that it can embed the edges of the network as a hidden variable and reflect them in the attention mechanism. After fitting the model on observed node activations, we extract edges whose attention scores exceed a threshold determined on the validation set. Since our proposed model in Chapter 4 achieves good results on inferring both cascade and network structures, we compare GSAN with DST on the edge prediction task using macro-averaged recall, precision, and F1 score.

Table 5.4 shows the comparison with DST on the Memes (Ver.2) dataset. We choose two naive baselines for inferring cascade-level structure: (1) naive-to-previous, attaching everything to its immediate predecessor, and (2) naive-to-earliest, attaching everything to the earliest node in the cascade. These baselines measure the "broadcastability", or virality, of the cascades. From the table we can see that, for the cascades in Memes (Ver.2), attaching everything to the earliest node performs best, and attaching everything to the immediate predecessor is not far behind. This led us to investigate the distribution of cascade lengths: many cascades have length 2, and for those cascades both naive baselines produce the same prediction, which boosts their performance on macro-averaged metrics. On the right of table 5.4 we show the same evaluation after excluding cascades shorter than 5 nodes. As expected, we see a drop across the board, more so for the naive baselines. DST does not perform as well relative to the naive baselines as it did on the ICWSM dataset (§4.7, tables 4.1 and 4.2). The intuition that edges can be embedded as hidden variables inside the attention mechanism is not borne out by the table, as the GSAN variants trail DST. Both GSAN-Basic and GSAN-SR achieve a better balance between recall and precision. GSAN-RL has a competitive recall compared to DST, which we believe shows a promising effect of using hard attention to focus on selective nodes in the history.

Table 5.4: Comparing GSAN variants against naive baselines and DST on edge prediction for cascade-level structures, using macro-averaged recall, precision, and F1 score. The lengths of cascades in Memes (Ver.2) are restricted to between 2 and 30. The right part of the table further restricts lengths to between 5 and 30.

                     Memes (Ver.2)                 Memes (Ver.2) (>= 5 nodes)
                     Recall    Precision   F1      Recall    Precision   F1
Naive-to-previous    0.650     0.605       0.627   0.393     0.341       0.365
Naive-to-earliest    0.679     0.665       0.672   0.433     0.432       0.432
DST                  0.420     0.482       0.449   0.376     0.440       0.405
GSAN-Basic           0.343     0.338       0.340   0.230     0.222       0.226
GSAN-RL              0.417     0.261       0.321   0.389     0.188       0.254
GSAN-SR              0.335     0.329       0.332   0.232     0.220       0.226

However, the precision of GSAN-RL is also the lowest, indicating that it picks more nodes from the history. Although the GSAN variants do not beat DST on the edge prediction task, we argue this is because the core training task for the GSAN variants focuses on nodes and on learning node representations, while DST fits edges conditioned on the nodes; it is understandable that performance differs given the difference in training tasks. Another possible explanation follows Jain and Wallace (2019): the standard attention mechanism by itself does not provide meaningful explanations and should not be treated as though it does.
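The following sketch illustrates the two procedures compared in table 5.4: the naive attachment baselines and the extraction of cascade-level edges by thresholding attention scores. The attention-matrix layout and the threshold variable tau are illustrative assumptions, not our exact implementation; in practice tau would be tuned on the validation set.

```python
import numpy as np

def naive_edges(cascade, to_earliest=True):
    """Naive baselines: attach each node either to the earliest node in the
    cascade or to its immediate predecessor."""
    src = (lambda i: cascade[0]) if to_earliest else (lambda i: cascade[i - 1])
    return {(src(i), cascade[i]) for i in range(1, len(cascade))}

def attention_edges(cascade, attn, tau):
    """Predict an edge (u -> v) whenever the attention weight from a later
    position back to an earlier one exceeds the threshold tau.

    attn[i, j]: attention weight that position i places on earlier position j (j < i).
    """
    edges = set()
    for i in range(1, len(cascade)):
        for j in range(i):
            if attn[i, j] > tau:
                edges.add((cascade[j], cascade[i]))
    return edges

def prf(pred, gold):
    """Recall, precision, and F1 for one cascade; macro-averaging then takes
    the mean of each quantity over cascades."""
    tp = len(pred & gold)
    r = tp / len(gold) if gold else 0.0
    p = tp / len(pred) if pred else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f
```

For a length-2 cascade the two naive baselines coincide and are trivially correct, which is why restricting the evaluation to cascades of at least 5 nodes lowers their macro-averaged scores the most.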

5.4.6 Effect of texts as side information

Some work—such as the expanded DST model in Chapter 4, Zhang et al. (2018), and Chen et al. (2019)—shows that texts as side information can improve network structure inference. DST uses edge features, and hence we experimented there with text similarities, while both Zhang et al. (2018) and Chen et al. (2019) use text as nodal side information to learn node representations. Since GSAN uses observed nodes to learn node representations, we likewise treat texts as nodal side information.

Table 5.5: Comparison between GSAN variants and their corresponding models with additional nodal text features.

(a) Node prediction on Memes (Ver.2)

                       MAP@k (%)                    Hits@k (%)
                       @10      @50      @100       @10      @50      @100
GSAN-Basic             31.507   31.896   31.958     39.337   47.685   52.044
GSAN-Basic w/ Texts    31.360   31.758   31.823     39.566   48.124   52.733
GSAN-RL                31.000   31.427   31.492     40.546   49.545   54.099
GSAN-RL w/ Texts       30.908   31.333   31.399     40.600   49.743   54.354
GSAN-SR                31.028   31.456   31.520     40.121   49.458   53.898
GSAN-SR w/ Texts       31.183   31.601   31.664     40.438   49.377   53.823

(b) Edge prediction on Memes (Ver.2)

                       Recall   Precision   F1
DST                    0.420    0.482       0.449
DST w/ Texts           0.420    0.482       0.449
GSAN-Basic             0.315    0.319       0.320
GSAN-Basic w/ Texts    0.314    0.318       0.318
GSAN-RL                0.383    0.186       0.250
GSAN-RL w/ Texts       0.416    0.293       0.344
GSAN-SR                0.310    0.315       0.315
GSAN-SR w/ Texts       0.312    0.316       0.317

We choose GloVe (Pennington et al., 2014) word embeddings pretrained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400K vocabulary). We use the 300-dimensional version and apply a fully-connected layer to project the embeddings into the same dimension as the node representations; we then add the two representations to form the new node feature vectors. Table 5.5a compares GSAN variants with and without texts on the Memes (Ver.2) dataset. From the table we can see that including nodal text features has no conclusive advantage over using the node representations alone.

We also conduct a similar comparison of the GSAN variants on the edge prediction task, along with DST. DST is an unsupervised model, but our dataset has a train/validation/test split. To accommodate this, we first extract feature representations for DST over the whole dataset, fit the feature weights on the training data, and then apply those weights to the features of the test set. To include textual features, we use Jaccard distance as an additional feature in DST. From table 5.5b, we can see that DST with texts achieves exactly the same results as the basic feature set alone. For the GSAN variants, the inclusion of textual features generally helps more on edge prediction than on node prediction, and the margin is largest for GSAN-RL, where the text features are most helpful at increasing precision and bring it close to the other two variants. Given the good performance of DST in tables 4.1 and 4.2, our best guess is that in the Memetracker dataset the texts are too short to provide representations powerful enough to improve the results. With a dataset containing longer texts, we might better demonstrate the expected benefit of adding textual features.
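A minimal sketch of how the textual side information is combined with the node representations follows, written in PyTorch. The module and tensor names are illustrative assumptions rather than our exact code; the sketch simply projects an averaged 300-dimensional GloVe vector to the model dimension and adds it to the learned node embedding.

```python
import torch
import torch.nn as nn

class NodeWithTextFeatures(nn.Module):
    def __init__(self, num_nodes, d_model, glove_dim=300):
        super().__init__()
        self.node_emb = nn.Embedding(num_nodes, d_model)   # learned node representations
        self.text_proj = nn.Linear(glove_dim, d_model)     # project GloVe into model space

    def forward(self, node_ids, glove_vectors):
        """node_ids: (batch, seq_len) activated nodes in a cascade.
        glove_vectors: (batch, seq_len, 300) averaged GloVe embeddings of the
        words attached to each activation (e.g., the shared meme text)."""
        node_feats = self.node_emb(node_ids)
        text_feats = self.text_proj(glove_vectors)
        # Summed feature vectors are then fed to the self-attention layers.
        return node_feats + text_feats
```

Adding the projected text vector keeps the input dimensionality of the self-attention layers unchanged, so the variants with and without texts differ only in their input features.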

5.5 Conclusion

In this chapter we propose a self-attention-based model, GSAN, that utilizes cascade data without knowledge of the underlying network structure. It trains on observed node activations instead, while retaining the possibility of predicting edges even though no information about them is given at training time. We also propose two variants of GSAN, one using hard attention and reinforcement learning to train the model, and the other separating the roles of nodes depending on whether they are sending or receiving information. We compare GSAN and its variants with both supervised and unsupervised node representation learning models, some of which also utilize observations of cascades. On the node prediction task, GSAN beats the unsupervised models and is on par with the state-of-the-art supervised model, which requires knowledge of the network structure. We also include textual features to see whether they improve the prediction tasks; however, perhaps because our datasets contain only short texts, we have so far not observed a significant difference in effectiveness when including them. Finally, we evaluate whether the GSAN variants can predict edges learned as a hidden variable during training, comparing them to DST and to two naive attachment baselines. Unfortunately, the GSAN variants underperform the other methods, most likely because their training focuses on a different task than DST's. Notably, due to the nature of the Transformer architecture, GSAN also suffers from a large parameter space that requires substantial training data to fit. With larger candidate lists and edge sets and limited training data in the Digg dataset, our model lags behind Topo-LSTM.

In conclusion, GSAN and its variants perform very well on the node prediction task even without knowing the network structure. This opens the possibility of applying GSAN in real-life applications such as legislative policy diffusion, where it is hard to observe the actual links between actors and predicting the next active actor in the network is of interest. Chapter 6

Conclusion

In this thesis we address several problems related to inferring information cascades from text data: detecting text-sharing behavior, inferring network structure, and learning node representations from observed cascades, together with side information such as texts. Concretely, this thesis makes the following contributions:

1. We propose an n-gram shingling algorithm to detect locally reused passages, instead of near-duplicate documents, embedded within the larger text output of social network nodes. Precision-recall tradeoffs vary with the density of text reuse and the noise introduced by optical character recognition and other features of data collection. We then show the feasibility of using network regression to measure the correlations between connections inferred from text reuse and networks derived from outside information.

2. We propose an attention-based convolutional network to detect semantically similar sentences surrounded by irrelevant contextual texts. The model uses BERT (Devlin et al., 2018) to provide contextualized representations of words. We then use a CNN layer to generate a fixed-length representation of sentences with varied length,


followed by a bidirectional LSTM layer to capture contextualized sentence representations. An attention mechanism is used between the sentence representations from both documents to guide the classifier. We compare our model with pre-trained and fine-tuned BERT models on the ACL Anthology Corpus to show the effectiveness of our model at selecting similar sentences and labeling documents.

3. We propose a method to uncover the network structure of information cascades using an edge-factored, conditional log-linear model, which can incorporate more features than most comparable models based on cascade infection times. This directed spanning tree (DST) model can also infer finer-grained structure at the cascade level, besides inferring global network structure. We utilize the matrix-tree theorem to prove that the likelihood function of the conditional model can be computed in cubic time and to derive a contrastive, unsupervised training procedure (a minimal sketch of this computation appears after this list). We show that on the ICWSM 2011 Spinn3r dataset our proposed method outperforms the baseline MultiTree (Rodriguez and Schölkopf, 2012) and InfoPath (Rodriguez et al., 2014) methods in terms of recall, precision, and average precision. In the future, we expect that applications of this technique could benefit from richer textual features—including full generative models of child document text—and different model structures trained with the proposed contrastive approach.

4. We introduce a network embedding model that uses the observed relative order of nodes in a cascade to guide the training process and is unsupervised with respect to the underlying network structure. Our model assumes that the cascade structure is a DAG rather than a directed spanning tree. Finally, by the nature of the model, it is easy to include nodal side information, such as texts, in the learning and inference stages. The model is a self-attention Transformer-based neural network. We explore three variants of the model, including

hard self-attention and a sender/receiver structure. We use the real-life Digg and Memetracker datasets to evaluate the performance of these models, in comparison with several popular baseline methods, on the task of predicting future nodes in an information cascade. We also evaluate on edge prediction, even though these models are not trained for that task.
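To make the cubic-time likelihood computation referenced in contribution 3 concrete, the following minimal sketch computes the spanning-tree partition function for a single cascade via the directed matrix-tree theorem. It assumes non-negative edge scores have already been produced by some feature function; the names are placeholders and the sketch is not the full DST feature model.

```python
import numpy as np

def log_partition(scores):
    """Log of the total weight of directed spanning trees (arborescences) rooted at node 0.

    scores[u, v] >= 0 is the weight of a candidate edge u -> v; the diagonal and
    edges into the root are ignored.  By the directed matrix-tree theorem, the sum
    over all arborescences of the product of their edge weights equals the
    determinant of the Laplacian with the root row and column removed.
    """
    w = scores.copy()
    np.fill_diagonal(w, 0.0)
    w[:, 0] = 0.0                                # no edges into the root
    laplacian = np.diag(w.sum(axis=0)) - w       # L[v, v] = total weight entering v
    minor = laplacian[1:, 1:]                    # delete root row and column
    sign, logdet = np.linalg.slogdet(minor)      # O(n^3) determinant
    return logdet
```

The contrastive training objective compares this partition function over observed cascades against the one over perturbed "neighborhood" cascades; edge marginals can be obtained from the inverse of the same Laplacian minor at the same cubic cost.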

6.1 Future Work

Progressing towards the goals of this thesis, we have developed models that obtain successful results in experiments on real-life datasets. However, our journey towards better modeling of text-embedded information cascades is far from the finish line, and there are several possibilities for future improvement. In the following sections we present, for each aspect of the research problems addressed in this thesis, some limitations, open problems, and improvements that deserve future research.

6.1.1 Text Reuse

Apart from the lack of ability to model semantic text reuse, we can also try to improve the ability to capture shorter reused text sequences, as noise in shorter texts can significantly obscure the detection of lyric poems, short prose pieces, advertisements, and other repeated texts outside the scope of this thesis.

As for semantic reuse detection, we see many directions for future research. For example, to select relevant sentences, we could use a different attention mechanism, or a different interaction matrix between sentences from different documents, to improve accuracy on the ranking tasks. We could also try to predict the span of semantically similar sentences, analogous to image localization models in computer vision, instead of labeling each sentence. In addition, because it must generate contextualized representations for both words and sentences, our current model suffers from limited runtime efficiency in practice. This renders it impractical for retrieval tasks unless there is a pre-selection of target documents for a query document, or of close document pairs from a more efficient first-round retrieval model. It would be an intriguing direction to propose a model that can be used solely for retrieval purposes.

6.1.2 Network Structure Inference

Many real-life datasets exhibit tree-like structures at the cascade level; indeed, the naive baselines reveal very flat cascade structures with high broadcastability (§4.7). This mitigates the hard limitation imposed by the DST model's assumption that cascade structure is a directed spanning tree. Even though DST achieves better results than the widely used MultiTree and InfoPath models, we can still improve it with richer textual features—including full generative models of child document text—and with different model structures trained with the contrastive approach, such as a joint generative model for texts and edges. Such models could also alleviate the cubic running time of the DST model, which is somewhat impractical if concurrent computation is not supported, or if a given cascade or network is large.

While the GSAN model and its variants fare better in terms of dataset flexibility—no knowledge of the underlying network structure is required—they suffer in performance on a smaller dataset with a large number of candidate nodes and edges, owing to the large parameter space of the Transformer architecture. This might prevent GSAN from being applied to smaller real-life social networks, such as state public policy diffusion, where the numbers of states and legislators are small and the number of messages (e.g., bills) spread across the network is also small, compared to the hundreds of thousands of cascades we might obtain from online social network services. Another possible improvement concerns the observation that nodes activated earlier in a given cascade usually have a larger influence on passing and infecting subsequent nodes; the position embeddings currently learned by GSAN and similar models might not capture such dynamics. We could therefore experiment with position embeddings less closely derived from language modeling. In addition, we could add another neural layer to combine the side information features with the node representations, allowing more non-linear transformation.

Apart from these specific improvements to each of the assumptions about cascade structure, we are missing another important characteristic of a social network: temporal evolution. The representations of nodes and the presence of edges in a social network are constantly changing—new actors join and old actors leave; relationships break and form. Up to now, both the DST and GSAN models focus only on a current snapshot of a network. We could instead learn dynamic node representations to embed different topological or community information across time. Last but not least, snapshots of social networks are often only partial observations; incorporating this into the modeling assumptions can help to better deal with missing nodes and links when modeling cascades and networks. Bibliography

O. Abdel-Hamid, B. Behzadi, S. Christoph, and M. Henzinger. Detecting the origin of text segments efficiently. In Proceedings of the 18th international conference on World wide web, pages 61–70. ACM, 2009.

B. Abrahao, F. Chierichetti, R. Kleinberg, and A. Panconesi. Trace complexity of network inference. In KDD, pages 491–499, 2013.

K. Amin, H. Heidari, and M. Kearns. Learning from contagion (without timestamps). In ICML, pages 1845–1853, 2014.

I. Androutsopoulos and P. Malakasiotis. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38:135–187, 2010.

S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embed- dings. ICLR, 2017.

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.

P. Baldi and G. Pollastri. The principled design of large-scale recursive neural network


architectures–dag-rnns and the protein structure prediction problem. Journal of Machine Learning Research, 4(Sep):575–602, 2003.

M. Bansal, D. Burkett, G. de Melo, and D. Klein. Structured learning for taxonomy induc- tion with belief propagation. In ACL, pages 1041–1051, 2014.

Y. Bengio, P. Simard, P. Frasconi, et al. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.

M. Bianchini, M. Maggini, L. Sarti, and F. Scarselli. Recursive neural networks for processing graphs with labelled edges: Theory and applications. Neural Networks, 18(8): 1040–1050, 2005.

S. Bird, R. Dale, B. J. Dorr, B. Gibson, M. T. Joseph, M.-Y. Kan, D. Lee, B. Powley, D. R. Radev, and Y. F. Tan. The acl anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. In Proceedings of Language Resources and Evaluation Conference (LREC 08), 2008.

F. J. Boehmke and P. Skinner. State policy innovativeness revisited. State Politics & Policy Quarterly, 12(3):303–329, 2012.

H. Bonab, H. Zamani, E. Learned-Miller, and J. Allan. Citation worthiness of sentences in scientific reports. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1061–1064. ACM, 2018.

S. Bourigault, S. Lamprier, and P. Gallinari. Representation learning for information diffusion through social networks: an embedded cascade model. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 573–582. ACM, 2016.

S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In ACM SIGMOD Record, pages 398–409. ACM, 1995.

A. Z. Broder. On the resemblance and containment of documents. In Compression and complexity of sequences 1997. proceedings, pages 21–29. IEEE, 1997.

A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8):1157–1166, 1997.

J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pages 737–744, 1994.

I. Brugere, B. Gallagher, and T. Y. Berger-Wolf. Network Structure Inference, A Survey: Motivations, Methods, and Applications. ArXiv e-prints, Oct. 2016.

K. Burton, N. Kasch, and I. Soboroff. The ICWSM 2011 Spinn3r dataset. In ICWSM, 2011.

R. Carroll, J. B. Lewis, J. Lo, K. T. Poole, and H. Rosenthal. Measuring bias and uncertainty in dw-nominate ideal point estimates via the parametric bootstrap. Political Analysis, 17(3):261–275, 2009.

D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.

M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388. ACM, 2002.

L. Chen, G. Wang, C. Tao, D. Shen, P. Cheng, X. Zhang, W. Wang, Y. Zhang, and L. Carin. Improving textual network embedding with global attention via optimal transport. arXiv preprint arXiv:1906.01840, 2019.

Z. Chen, H. Zhang, X. Zhang, and L. Zhao. Quora question pairs, 2018.

K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

Y. Chu and T. Liu. On the shortest arborescence of a directed graph. Science Sinica, 14: 1396–1400, 1965.

A. Cleeremans, D. Servan-Schreiber, and J. L. McClelland. Finite state automata and simple recurrent networks. Neural computation, 1(3):372–381, 1989.

A. Cohan, W. Ammar, M. van Zuylen, and F. Cady. Structural scaffolds for citation intent classification in scientific publications. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3586–3596, 2019.

I. G. Councill, C. L. Giles, and M.-Y. Kan. Parscit: an open-source crf reference string parsing package. In LREC, volume 8, pages 661–667, 2008.

P. Cui, X. Wang, J. Pei, and W. Zhu. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering, 31(5):833–852, 2018.

Z. Dai, C. Xiong, J. Callan, and Z. Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the eleventh ACM international conference on web search and data mining, pages 126–134. ACM, 2018.

H. Daneshmand, M. Gomez-Rodriguez, L. Song, and B. Schoelkopf. Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In ICML, pages 793–801, 2014.

D. Dekker, D. Krackhardt, and T. A. Snijders. Sensitivity of mrqap tests to collinearity and autocorrelation conditions. Psychometrika, 72(4):563–581, 2007.

B. A. Desmarais, J. J. Harden, and F. J. Boehmke. Persistent policy pathways: Inferring diffusion networks in the american states. American Political Science Review, 109(2): 392–406, 2015.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

W. B. Dolan and C. Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.

J. Edmonds. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233–240, 1967.

T. Elsayed, J. Lin, and D. W. Oard. Pairwise document similarity in large collections with mapreduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 265– 268. Association for Computational Linguistics, 2008.

J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1243–1252. JMLR.org, 2017.

E. Ghalebi, B. Mirzasoleiman, R. Grosu, and J. Leskovec. Dynamic network model from partial observations. Advances in Neural Information Processing Systems, pages 9862– 9872, 2018.

M. Gomez Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In KDD, pages 1019–1028, 2010.

M. Gomez Rodriguez, J. Leskovec, and B. Schölkopf. Structure and dynamics of informa- tion pathways in online media. In WSDM, pages 23–32, 2013.

A. Goyal, F. Bonchi, and L. V. Lakshmanan. Learning influence probabilities in social networks. In Proceedings of the third ACM international conference on Web search and data mining, pages 241–250. ACM, 2010.

J. Grimmer and B. M. Stewart. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3):267–297, 2013.

A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.

H. Gui, Y. Sun, J. Han, and G. Brova. Modeling topic diffusion in multi-relational bibliographic information networks. In CIKM, pages 649–658, 2014.

J. Guo, Y. Fan, Q. Ai, and W. B. Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 55–64. ACM, 2016.

D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge university press, 1997.

W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

N. Heintze et al. Scalable document fingerprinting. In 1996 USENIX workshop on electronic commerce, 1996.

M. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 284–291. ACM, 2006.

T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. Journal of the American society for information science and technology, 54(3):203–215, 2003.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997.

T. Hogg and K. Lerman. Social dynamics of digg. EPJ Data Science, 1(1):5, 2012.

B. Hu, Z. Lu, H. Li, and Q. Chen. Convolutional neural network architectures for matching natural language sentences. In Advances in neural information processing systems, pages 2042–2050, 2014.

P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 2333–2338. ACM, 2013.

S. Huston, A. Moffat, and W. B. Croft. Efficient indexing of repeated n-grams. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 127–136. ACM, 2011.

S. R. Indurthi, I. Chung, and S. Kim. Look harder: A neural machine translation model with hard attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3037–3043, 2019.

S. Jain and B. C. Wallace. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, 2019.

M. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proc. of the Eighth Annual Conference of the Cognitive Science Society (Erlbaum, Hillsdale, NJ), 1986.

D. Jurgens, M. T. Pilehvar, and R. Navigli. Semeval-2014 task 3: Cross-level semantic similarity. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pages 17–26, 2014.

N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of 3rd International Conference on Learning Representations, 2015.

T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel. Neural relational inference for interacting systems. In International Conference on Machine Learning, pages 2693–2702, 2018.

O. Kolak and B. N. Schilit. Generating links by mining quotations. In Proceedings of the nineteenth ACM conference on Hypertext and hypermedia, pages 117–126. ACM, 2008.

T. Koo, A. Globerson, X. Carreras Pérez, and M. Collins. Structured prediction models via the matrix-tree theorem. In EMNLP-CoNLL, pages 141–150, 2007.

D. Krackardt. Qap partialling as a test of spuriousness. Social networks, 9(2):171–186, 1987.

Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

T. Lei, R. Barzilay, and T. Jaakkola. Rationalizing neural predictions. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107– 117, 2016.

J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 497–506. ACM, 2009.

C. Li, J. Ma, X. Guo, and Q. Mei. Deepcas: An end-to-end predictor of information cascades. In Proceedings of the 26th International Conference on World Wide Web, pages 577–586. International World Wide Web Conferences Steering Committee, 2017.

F. Linder, B. A. Desmarais, M. Burgess, and E. Giraudy. Text as policy: Measuring policy similarity through bill text reuse. Policy Studies Journal, 2018.

S. W. Linderman and R. P. Adams. Discovering latent network structure in point process data. In ICML, pages 1413–1421, 2014.

D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. In Mathematical Programming, volume 45, pages 503–528. Springer, 1989.

T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, 2015.

U. Manber et al. Finding similar files in a large file system. In Usenix Winter, volume 94, pages 1–10, 1994.

C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-5010.

D. Margolin, Y.-R. Lin, and D. Lazer. Why so similar?: Identifying semantic organizing processes in large textual corpora. SSRN, 2013.

R. Mastrandrea, J. Fournet, and A. Barrat. Contact patterns in a high school: A comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLoS ONE, 10(9):e0136497, 2015.

R. McDonald and G. Satta. On the complexity of non-projective data-driven dependency parsing. In IWPT, pages 121–132, 2007.

R. McDonald, K. Crammer, and F. Pereira. Online large-margin training of dependency parsers. In ACL, pages 91–98, 2005.

M. L. McGill. American literature and the culture of reprinting, 1834-1853. University of Pennsylvania Press, 2007.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013b.

A. Mittelbach, L. Lehmann, C. Rensing, and R. Steinmetz. Automatic detection of local reuse. In European Conference on Technology Enhanced Learning, pages 229–244. Springer, 2010.

V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014.

S. Myers and J. Leskovec. On the convexity of latent social network inference. In NIPS, pages 1741–1749, 2010.

M. Olsen, R. Horton, and G. Roe. Something borrowed: Sequence alignment and the identification of similar passages in large text collections. Digital Studies/Le champ numérique, 2(1), 2011.

L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng. Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

L. Pang, Y. Lan, J. Guo, J. Xu, J. Xu, and X. Cheng. Deeprank: A new deep architecture for relevance ranking in information retrieval. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 257–266. ACM, 2017.

B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269, 1989.

J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.

B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.

M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proc. of NAACL, 2018.

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.

R. A. Rensink. The dynamic representation of scenes. Visual cognition, 7(1-3):17–42, 2000.

M. G. Rodriguez and B. Schölkopf. Submodular inference of diffusion networks from multiple trees. In ICML, pages 1–8, 2012.

M. G. Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. In ICML, pages 561–568, 2011.

M. G. Rodriguez, J. Leskovec, and B. Schölkopf. Modeling information propagation with survival theory. In ICML, pages 666–674, 2013.

M. G. Rodriguez, J. Leskovec, D. Balduzzi, and B. Schölkopf. Uncovering the structure and temporal dynamics of information propagation. Network Science, 2(01):26–65, 2014.

Y. Rong, Q. Zhu, and H. Cheng. A model-free approach to infer the diffusion network from event cascade. In CIKM, pages 1653–1662, 2016.

K. Saito, R. Nakano, and M. Kimura. Prediction of information diffusion probabilities for independent cascade model. In Knowledge-based intelligent information and engineering systems, pages 67–75. Springer, 2008.

G. Salton, E. A. Fox, and H. Wu. Extended boolean information retrieval. Technical report, Cornell University, 1982.

S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 76–85. ACM, 2003.

M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

J. Seo and W. B. Croft. Local text reuse detection. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 571–578. ACM, 2008.

T. Shen, T. Zhou, G. Long, J. Jiang, S. Wang, and C. Zhang. Reinforced self-attention network: a hybrid of hard and soft attention for sequence modeling. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4345–4352. AAAI Press, 2018.

Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web, pages 373–374. ACM, 2014.

N. Shivakumar and H. Garcia-Molina. Scam: A copy detection mechanism for digital documents. In Theory and Practice of Digital Libraries, 1995.

N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In International Workshop on the World Wide Web and Databases, pages 204–212. Springer, 1998.

M. P. Simmons, L. A. Adamic, and E. Adar. Memes online: Extracted, subtracted, injected, and recollected. In Fifth international AAAI conference on weblogs and social media, 2011.

D. A. Smith and N. A. Smith. Probabilistic models of nonprojective dependency trees. In EMNLP-CoNLL, 2007.

N. A. Smith and J. Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In ACL, 2005.

T. F. Smith, M. S. Waterman, et al. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195–197, 1981.

T. M. Snowsill, N. Fyson, T. De Bie, and N. Cristianini. Refining causality: Who copied from whom? In KDD, pages 466–474, 2011.

J. C. Stack, S. Bansal, V. A. Kumar, and B. Grenfell. Inferring population-level contact heterogeneity from common epidemic data. Journal of the Royal Society Interface, page rsif20120578, 2012.

C. Suen, S. Huang, C. Eksombatchai, R. Sosic, and J. Leskovec. Nifty: a system for large scale information flow tracking and clustering. In Proceedings of the 22nd international conference on World Wide Web, pages 1237–1248. ACM, 2013.

K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, 2015.

J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.

W. T. Tutte. Graph Theory, volume 21 of Encyclopedia of Mathematics and its Applications. Addison-Wesley, 1984.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. ICLR, 2018.

S. Wan, Y. Lan, J. Guo, J. Xu, L. Pang, and X. Cheng. A deep architecture for semantic matching with multiple positional sentence representations. In Thirtieth AAAI Conference on Artificial Intelligence, 2016a.

S. Wan, Y. Lan, J. Xu, J. Guo, L. Pang, and X. Cheng. Match-srnn: Modeling the recursive matching structure with spatial rnn. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2922–2928, 2016b.

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. EMNLP 2018, page 353, 2018.

D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1225–1234. ACM, 2016.

J. Wang, V. W. Zheng, Z. Liu, and K. C.-C. Chang. Topological recurrent neural network for diffusion prediction. In Data Mining (ICDM), 2017 IEEE International Conference on, pages 475–484. IEEE, 2017a.

L. Wang, S. Ermon, and J. E. Hopcroft. Feature-enhanced probabilistic models for diffusion network inference. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 499–514, 2012.

Z. Wang, W. Hamza, and R. Florian. Bilateral multi-perspective matching for natural language sentences. Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4144–4150, 2017b.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

S. Wu, P. Shapiro, and R. Cotterell. Hard non-monotonic attention for character-level transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4425–4438, 2018.

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

C. Xiong, Z. Dai, J. Callan, Z. Liu, and R. Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval, pages 55–64. ACM, 2017.

I. Z. Yalniz, E. F. Can, and R. Manmatha. Partial duplicate detection for large book collections. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 469–474. ACM, 2011.

B. Yang, Z. Tu, D. F. Wong, F. Meng, L. S. Chao, and T. Zhang. Modeling localness for self-attention networks. arXiv preprint arXiv:1810.10182, 2018.

Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.

W. Yin, H. Schütze, B. Xiang, and B. Zhou. Abcnn: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics, 4:259–272, 2016.

J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning, pages 5694–5703, 2018.

X. Zhai, W. Wu, and W. Xu. Cascade source inference in networks: A Markov chain Monte Carlo approach. Computational Social Networks, 2(1), 2015.

X. Zhang, Y. Li, D. Shen, and L. Carin. Diffusion maps for textual network embedding. arXiv preprint arXiv:1805.09906, 2018.

Y. Zhang, I. Marshall, and B. C. Wallace. Rationale-augmented convolutional neural networks for text classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2016, page 795. NIH Public Access, 2016.