NORTHEASTERN UNIVERSITY
Modeling Text Embedded Information Cascades
by
Shaobin Xu
A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of
Doctor of Philosophy
in
Computer Science
December, 2019

Abstract
Networks mediate several aspects of society. For example, social networking services (SNS) like Twitter and Facebook have greatly helped people connect with families, friends and the outside world. Public policy diffuses over institutional and social networks that connect political actors in different areas. Inferring network structure is thus essential for understanding the transmission of ideas and information, which in turn could answer questions about communities, collective actions, and influential social participants. Since many networks are not directly observed, we often rely on indirect evidence, such as the timing of messages between participants, to infer latent connections. The textual content of messages, especially the reuse of text originating elsewhere, is one source of such evidence. This thesis contributes techniques for detecting evidence of text reuse and modeling the underlying network structure. We propose methods to model text reuse with accidental and intentional lexical and semantic mutations. For lexical similarity detection, an n-gram shingling algorithm is proposed to detect “locally” reused passages, instead of near-duplicate documents, embedded within the larger text output of network nodes. For semantic similarity, we use an attention-based neural network to also detect embedded reused texts. When modeling network structure, we are interested in inferring different levels of detail: individual links between participants, the structure of a specific information cascade, or global network properties. We propose a contrastive training objective for conditional models of edges in information cascades that has the flexibility to answer those questions
and is also capable of incorporating rich node and edge features. Last but not least, network embedding methods prove to be a good way to learn representations of nodes while preserving structure, node and edge properties, and side information. We propose a self-attention Transformer-based neural network trained to predict the next activated node in a given cascade to learn node embeddings.
First Reader: David Smith
Second Reader: Tina Eliassi-Rad
Tertiary Reader: Byron Wallace
External Reader: Bruce Desmarais

Acknowledgment
The journey to a PhD can be daunting, frustrating, and yet wonderful. I owe so many thanks to the great people who helped me sail through the unforgettable six years as a PhD student. First and foremost, I would like to thank David, my advisor, for bringing me to the US and giving me the opportunity to work with him on many interesting NLP topics. I am deeply grateful for his patience and guidance throughout this time. His passion for and knowledge of research have greatly educated and shaped me. This thesis could not have been done without his constant advice and support. His unique perspective on many matters has also made a great impact on my life. I want to thank Tina, Byron, and Bruce for taking the time to serve on my thesis committee. Your comments and advice have greatly helped me make the thesis whole. I am very appreciative of Byron’s detailed suggestions for revising the final draft of this thesis. I thank Professor Ryan Cordell, with whom I collaborated on part of the results related to this thesis. I am inspired by your passion for uncovering the 19th-century newspaper reprinting network, and I am honored to be part of the team. I want to thank everyone from our lab – Liwen, Rui, Ansel, and Ryan. It is a privilege to have your company. Without any of you making the lab full of joy and energy, my life would have been miserable. I thank Rui for many deep discussions on work and life that kept my mind clear. I am truly grateful for all the brainstorms and discussions with Ansel
in my final two and a half years, which advanced my research as well as my understanding of many aspects of life in the US that I would never have known otherwise. I thank my friends outside of my lab, Bingyu, Yupeng, Chin and Bochao. Your constant help has made my life so much easier when I already had so much weight on my shoulders from the PhD. You helped to fill a lot of the void during this lonely journey. Finally, I thank my mom, Xiulin, who inspires me, encourages me, and loves me unconditionally and endlessly. I would not have come to the other side of the world, been able to face all the unknowns, and stood on my feet, had I not had her support.

To my family and my loving friends.

Contents
Abstract ii
Acknowledgment iv
Contents vii
List of Figures xi
List of Tables xiii
1 Introduction 1
1.1 Detecting Text Reuse ...... 2
1.2 Network Inference from Information Cascades ...... 5
1.3 Node Representation Learning ...... 7
1.4 Overview of The Thesis ...... 8
2 Text reuse in social networks 10
2.1 Local Text Reuse Detection ...... 11
2.2 Efficient N-gram Indexing ...... 12
2.3 Extracting and Ranking Candidate Pairs ...... 13
2.4 Computing Local Alignments ...... 14
2.5 Intrinsic Evaluation ...... 17
2.6 Extrinsic Evaluation ...... 18
2.7 Network Connections of 19c Reprints ...... 21
2.7.1 Dataset description ...... 21
2.7.2 Experiment ...... 24
2.8 Congressional Statements ...... 27
2.8.1 Dataset description ...... 27
2.8.2 Experiment ...... 28
2.9 Conclusion ...... 29
3 Semantic Text Reuse in Social Networks 30
3.1 Classifying text reuse as paraphrase or textual entailment ...... 31
3.2 Method Overview ...... 34
3.3 Word Representations ...... 35
3.4 Contextualized Sentence Representation ...... 37
3.5 Attention ...... 40
3.6 Final output ...... 41
3.7 Objective function ...... 42
3.8 Experiments ...... 42
3.8.1 Datasets ...... 42
3.8.2 Models ...... 44
3.8.3 Experiment settings ...... 46
3.8.4 Document level evaluation ...... 47
3.8.5 Sentence level evaluation ...... 49
3.8.6 Ablation Test ...... 50
3.9 Conclusion ...... 52
4 Modeling information cascades with rich feature sets 54
4.1 Network Structure Inference ...... 55
4.2 Log-linear Directed Spanning Tree Model ...... 56
4.3 Likelihood of a cascade ...... 58
4.4 Maximizing Likelihood ...... 59
4.5 Matrix-Tree Theorem and Laplacian Matrix ...... 59
4.6 Gradient ...... 61
4.7 ICWSM 2011 Webpost Dataset ...... 61
4.7.1 Dataset description ...... 62
4.7.2 Feature sets ...... 63
4.7.3 Result of unsupervised learning at cascade level ...... 64
4.7.4 Result of unsupervised learning at network level ...... 66
4.7.5 Enforcing tree structure on the data ...... 68
4.7.6 Result of supervised learning at cascade level ...... 70
4.8 State Policy Adoption Dataset ...... 70
4.8.1 Dataset description ...... 71
4.8.2 Effect of proximity of states ...... 71
4.9 Conclusion ...... 72
5 Modeling information cascades using self-attention neural networks 74
5.1 Node representation learning ...... 75
5.2 Information cascades as DAGs ...... 78
5.3 Graph self-attention network ...... 78
5.3.1 Analogy to language modeling ...... 79
5.3.2 Graph self-attention layer ...... 81
5.3.3 Graph self-attention network ...... 83
5.3.4 Senders and receivers ...... 85
5.3.5 Hard attention ...... 86
5.3.6 Edge prediction ...... 90
5.4 Experiments ...... 90
5.4.1 Datasets ...... 90
5.4.2 Baselines ...... 93
5.4.3 Experimental settings ...... 95
5.4.4 Node prediction ...... 96
5.4.5 Edge prediction ...... 98
5.4.6 Effect of texts as side information ...... 99
5.5 Conclusion ...... 102
6 Conclusion 104
6.1 Future Work ...... 106
6.1.1 Text Reuse ...... 106
6.1.2 Network Structure Inference ...... 107
Bibliography 109

List of Figures
2.1 Average precision for aligned passages of different minimum length in characters. Vertical red lines indicate the performance of different parameter settings (see Table 2.1). ...... 19
2.2 (Pseudo-)Recall for aligned passages of different minimum lengths in characters. ...... 20
2.3 Newspaper issues mentioning “associated press” by year, from the Chronicling America corpus. The black regression line fits the raw number of issues; the red line fits counts corrected for the number of times the Associated Press is mentioned in each issue. ...... 23
2.4 Reprints of John Brown’s 1859 speech at his sentencing. Counties are shaded with historical population data, where available. Even taking population differences into account, few newspapers in the South printed the abolitionist’s statement. ...... 25
3.1 Overview of the structure of the Attention Based Convolutional Network (ABCN) ...... 34
3.2 Unrolled Vanilla Recurrent Neural Network ...... 38
3.3 An illustration of the ConvNet structure used in ABCN on a toy example with word embeddings in R^5, kernel size 2, and feature map size 3. It yields a representation for the sentence in R^3. ...... 39
4.1 Recall, precision, and average precision of InfoPath and DST on predicting the time-varying networks generated per day. The DST model is trained in an unsupervised fashion on separate cascades using basic and enhanced features. The upper row uses graph-structured cascades from the ICWSM 2011 dataset. The lower row uses the subset of cascades with tree structures. ...... 69
5.1 An illustration of a cascade structure as a DAG in a toy network. ...... 79
5.2 Graph Self-Attention Network architecture, with L identical multi-headed self-attention layers. ...... 82
5.3 Graph Sender-Receiver attention network architecture, with L identical multi-headed attention layers for senders and receivers respectively. ...... 87
5.4 Graph hard self-attention network architecture, with L − 1 identical multi-headed attention layers and the last layer replaced with a reinforcement learning agent selecting mask actions. ...... 88
5.5 Modified Graph Self-Attention Networks with the last layer replaced with a single-head attention sublayer to output edge predictions. ...... 91

List of Tables
2.1 Parameters for text reuse detection ...... 13
2.2 Correlations between shared reprints between 19c newspapers and political and other affinities. While many Whig papers became Republican, they do not completely overlap in our dataset; the identical number of pairs is coincidental. ...... 26
2.3 Correlations between log length of aligned text and other author networks in public statements by Members of Congress. ∗p < .05, ∗∗p < .01, ∗∗∗p < .001 ...... 28
3.1 Document-level results (macro-averaged recall/precision/F1) of ABCN in comparison with baseline methods under both the whole source document vs. target sentences and the source sentences vs. target sentences settings. ...... 48
3.2 Sentence-level results (macro-averaged recall/precision/F1) of ABCN in comparison with baseline methods under both the whole source document vs. target sentences and the source sentences vs. target sentences settings, evaluated on all target documents. ...... 51
3.3 Sentence-level results (macro-averaged recall/precision/F1) of ABCN in comparison with baseline methods under both the whole source document vs. target sentences and the source sentences vs. target sentences settings, evaluated only on matched documents. ...... 51
3.4 Ablation test where the Bi-LSTM layer is removed to test the efficacy of the contextualized sentence representation, on both document- and sentence-level datasets, in comparison to the proposed model structure. ...... 52
4.1 Cascade-level inference of DST with different feature sets, in the unsupervised learning setting, in comparison with the naive attach-everything-to-earliest baseline for original cascades extracted from the ICWSM 2011 dataset. ...... 64
4.2 Cascade-level inference of DST with different feature sets, in the unsupervised learning setting, in comparison with the naive attach-everything-to-earliest baseline for tree-structure-enforced cascades extracted from the ICWSM 2011 dataset. ...... 64
4.3 Cascade-level inference of DST with different feature sets, in the supervised learning setting, for merged cascades for tree-structure-enforced cascades extracted from the ICWSM 2011 dataset. ...... 65
4.4 Comparison of MultiTree, InfoPath, and DST on inferring a static network on the original ICWSM 2011 dataset. The DST model is trained and tested in an unsupervised fashion on both separate cascades and merged cascades using different feature sets and the naive attach-everything-to-earliest-node baseline. ...... 67
4.5 Comparison of MultiTree, InfoPath, and DST on inferring a static network on the modified ICWSM 2011 dataset with enforced tree structure. The DST model is trained and tested in an unsupervised fashion on both separate cascades and merged cascades using different feature sets and the naive attach-everything-to-earliest-node baseline. ...... 68
4.6 Logistic regression of networks inferred by DST and InfoPath on independent networks: geographic distance between states and contiguity of states. Statistical significance is at the < 0.05 level according to the QAP p-value against the indicated null hypothesis. ...... 71
5.1 Statistics of datasets. Memes (Ver. 1) is from Wang et al. (2017a). ...... 93
5.2 Comparison of variants of GSAN and baseline models on the Digg and Memes (Ver. 1) datasets. The results of baseline models are from Wang et al. (2017a). ...... 97
5.3 Accuracy of GSAN variants on the Digg and Memes (Ver. 1) datasets. Accuracies are listed as percentages. ...... 97
5.4 Comparing GSAN variants versus naive baselines and DST on edge prediction on cascade-level structures using macro-averaged recall, precision, and F1 score. The lengths of cascades in Memes (Ver. 2) are restricted to between 2 and 30. The right part of the table restricts the lengths to between 5 and 30. ...... 99
5.5 Comparison between GSAN variants and their corresponding models with additional nodal text features. ...... 100

Chapter 1
Introduction
Networks mediate several aspects of society, and studies of networks abound because a better understanding of them is so useful. For example, research on social network services such as Twitter and Facebook helps understand communities or influential groups, promote viral marketing, reveal connections among social and political movements, and so on. Apart from observing the links between different social network participants, we can also obtain much side information, e.g., the time when the interaction took place, the content or label that each participant contributes, or the attributes of the links. This side information provides indirect evidence for social ties. Text reuse by different participants then becomes one of many revealing forms of shared behavior. For example, political speech—be it on television, on the floor of the legislature, in printed quotations, or on politicians’ websites and social media feeds—uses common tropes and turns of phrase that groups of politicians adopt to describe some issue (Grimmer and Stewart, 2013). One might even discover a list of “talking points” underlying this common behavior.1 One might be reading the literature in a scientific field and find that a paper starts
1The U.S. State Department, for example, produced a much-discussed set of talking points memos in response to the 2012 attack in Benghazi.
getting cited or paraphrased repeatedly. Which previous paper, or papers, introduced the new technique? Or perhaps one reads several news stories about a new product from some company and then finds that they all share text with a press release put out by the company. Methods to uncover invisible links among sources of text have broad applicability because of the very general nature of the problem—sources of text include websites, newspapers, individuals, corporations, political parties, and so on. Further, discerning those hidden links between sources can provide more effective ways of identifying the provenance and diverse sources of information and of building predictive models of the diffusion of information. There are substantial challenges, however, in building methodologies to uncover text reuse and model sharing behavior. In this introductory chapter, we will discuss some of the challenges that we face and how we aim to address those issues. In particular, we will consider how to detect text reuse, both lexically and semantically, in §1.1. Then, we will see how to model information sharing, using text as an example, in §1.2 by observing information cascades. Each cascade comprises a number of social actors receiving (getting activated by) a related piece of information in a sequential fashion. Finally, projecting the representations of nodes in a network into a continuous dense space while preserving first- or second-order proximity can be useful for many tasks, such as node classification, node clustering, and link prediction. In §1.3 we will see that node representations can also be learned from information cascades.
1.1 Detecting Text Reuse
As mentioned above, several situations can give rise to text reuse. For example, Linder et al. (2018) show that bills introduced by ideologically similar sponsors exhibit a high degree of text reuse, that bills classified by the National Conference of State Legislatures as covering the same policies exhibit a high degree of text reuse, and that rates of text reuse between states correlate with policy diffusion network ties between states. Such reuse is usually observable at the lexical level and can be detected by an alignment algorithm, such as the Smith-Waterman algorithm (Smith et al., 1981). The following pair shows one of the sample alignments from Linder et al. (2018) for state legislative bills:
nj_214_A1167: “the entire credit may not be taken for the taxable year in which the renewable energy property is placed in service but must be taken in five equal installments beginning with the taxable year in which the” nc_2009_SB305: “the entire credit may not be taken for the taxable year in which the costs are paid---- but must be taken in five equal installments beginning with the taxable year in which the”
The words in red indicate mismatches, while those in green are gaps introduced by the alignment algorithm to indicate insertions and deletions. Non-highlighted passages are matches. There are, however, several issues making the study of text reuse challenging, including: scalable detection of reused passages; identification of appropriate statistical models of text mutation; inference methods for characterizing missing nodes that originate or mediate text transmission; link inference conditioned on textual topics; and the development of testbed datasets through which predictions of the resulting models might be validated against some broader understanding of the processes of transmission. We propose an n-gram shingling algorithm to detect “locally” reused passages, instead of near-duplicate documents, embedded within the larger text output of social network nodes. We then explore the correlation between links revealed by text-sharing behavior and various types of social ties in different datasets. An n-gram shingling algorithm detects text reuse through lexical analysis, whereas the semantic meanings of passages, along with their surrounding contexts, are often more revealing of the path over which information propagates. For example, information may accumulate editorial changes, typos, and other mutations as it is passed around. The reuse of the same mutation, rather than the original text, can indicate a link between network participants. An algorithm based on lexical analysis is not the best choice in this circumstance: if we require exact matches we miss many repeated passages, whereas if we lower the proportion of overlapping text required for a match we generate more noise. In such cases, an approach that can detect semantic reuse is more suitable. We propose to extract local passages that are semantically similar given a pair of texts. Lexical similarity is just one of two forms of text reuse in social networks. The other form is semantic similarity, which includes paraphrase and textual entailment. This usually happens in academic research, where authors cite other authors’ contributions by rephrasing them; in news media, where one source cites a press release or another outlet with semantically similar sentences; and in online social networks, where people often describe reported incidents in their own words. The following snippets are from a press release about a study on obesity and a news article that cites the press release:
Press release: “The students then received instruction on the causes and treatments of obesity, with follow up testing on their knowledge and attitudes toward obesity for every year of medical school. Those who completed the program significantly reduced their bias by an average of 7 percent.”

News article: “The study, recently published online in The Journal of the American Osteopathic Association, found that when medical students are taught a specific curriculum aimed at better understanding the causes behind, and treatments for, obesity, the students’ innate obesity prejudice dropped by an average of 7 percent.”

In this example we can observe the news article paraphrasing the press release. This problem can be more complicated than lexical text reuse, in that we cannot necessarily use
a naive alignment algorithm for detection. In the example given above, the sentence is reordered and words are changed (e.g., “received instruction” → “are taught”, and “bias” → “prejudice”). If only pairs of potentially similar texts are given, many efforts have been made to better predict the similarity score (Dai et al., 2018; Pang et al., 2017; Devlin et al., 2018). However, what we observe in social networks are usually documents, or passages with unknown boundaries, rather than isolated sentences. With irrelevant surrounding contexts in these observed documents, those models fail to make correct predictions. We propose an attention-based convolutional network that uses BERT (Devlin et al., 2018), the state-of-the-art model for text similarity classification, to provide contextualized representations of words. We then use a convolutional neural network (CNN) to infer a fixed-length representation for sentences of varying length, given the good performance of CNN-based models on text similarity classification tasks (Hu et al., 2014; Pang et al., 2016; Dai et al., 2018), followed by a bidirectional Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) unit to capture contextualized sentence representations. The proposed model also uses an attention mechanism (Bahdanau et al., 2015), which guides the model to “look at” similar sentences among irrelevant contexts. We compare this model with a pre-trained and a fine-tuned BERT model on the task of recovering citations between scientific papers in the ACL Anthology Corpus.
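To make this pipeline concrete, here is a minimal PyTorch sketch of such an encoder stack, assuming contextualized word embeddings (e.g., from BERT) are precomputed and fed in; the class name, layer sizes, and the dot-product attention at the end are illustrative assumptions, not the model developed in Chapter 3.

```python
import torch
import torch.nn as nn

class SentenceMatcherSketch(nn.Module):
    """CNN over words -> fixed-length sentence vectors; Bi-LSTM over
    sentences -> contextualized sentence vectors; attention between the
    two documents. Hypothetical sizes, for illustration only."""

    def __init__(self, word_dim=768, sent_dim=256, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(word_dim, sent_dim, kernel, padding=kernel // 2)
        self.lstm = nn.LSTM(sent_dim, sent_dim // 2,
                            bidirectional=True, batch_first=True)

    def encode(self, doc):
        # doc: list of (1, n_words, word_dim) tensors, one per sentence.
        sents = [self.conv(s.transpose(1, 2)).max(dim=2).values for s in doc]
        ctx, _ = self.lstm(torch.stack(sents, dim=1))  # (1, n_sents, sent_dim)
        return ctx.squeeze(0)                          # (n_sents, sent_dim)

    def forward(self, src_doc, tgt_doc):
        src, tgt = self.encode(src_doc), self.encode(tgt_doc)
        # Each target sentence attends over all source sentences, letting
        # reused material stand out against irrelevant surrounding context.
        return (tgt @ src.T / src.size(1) ** 0.5).softmax(dim=1)
```

A high attention weight for a (target, source) sentence pair would then flag candidate semantic reuse; Chapter 3 develops the actual architecture and training objective.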
1.2 Network Inference from Information Cascades
Observing the behavior of text reuse takes us only part way towards modeling the underlying networks. We might be interested in network structures at differing levels of detail. We might be interested in individual links, e.g., knowing which previous paper it was that later papers were mining for further citations; in cascades, e.g., knowing which news stories are copying from which press releases or from each other; and in networks, e.g., knowing which politicians are most likely to share talking points or which newspapers are most likely to publish press releases from particular businesses or universities. Depending on our data source, some of these structures could be directly observed. With the right API calls, we might observe retweets (links), chains of retweets (cascades), and follower relations (networks) on Twitter. We might also be interested in inferring an underlying social network for which the Twitter follower relation is partial evidence. In contrast, politicians interviewed on television do not explicitly cite the sources of their talking points, which must be inferred. Documenting an information diffusion process often reduces to keeping track of when nodes (newspapers, bills, people, etc.) mention a piece of information, reuse a text, get infected, or exhibit a contagion in a general sense. When the structure of the propagation of a contagion is hidden and we cannot tell which node infected which, all we have is the result of the diffusion process—that is, the timestamp and possibly other information related to when the nodes got infected. We want to infer the diffusion process itself by using such information to predict the links in an underlying network. There have been increasing efforts to uncover and model different types of information cascades on networks (Brugere et al., 2016): modeling hidden networks from observed infections (Stack et al., 2012; Rodriguez et al., 2014), modeling topic diffusion in networks (Gui et al., 2014), predicting social influence on individual mobility (Mastrandrea et al., 2015), and so on. This prior work all focused on parametric models of the time differences between infections. Such models are useful when the only information available from the result of the diffusion process is the timestamps of infections. We can hope to make better predictions, however, with access to additional features, such as the location of each node, the similarity between the messages received by two nodes, etc. Popular parametric models cannot incorporate these features into unsupervised training. We propose an edge-factored, conditional log-linear directed spanning tree (DST) model with an unsupervised, contrastive training procedure to infer the link structure of information cascades. The advantage of conditioning the model on the observed sequence of nodes in a cascade is that we can easily include various features, including those reflecting the reuse of texts, in the model.
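As a preview of the machinery in Chapter 4 (§4.5), the model's normalizer sums over all directed spanning trees of a cascade, and the directed matrix-tree theorem reduces that sum to a determinant. The sketch below is a minimal illustration, with arbitrary nonnegative edge weights standing in for the model's exponentiated feature scores:

```python
import numpy as np

def spanning_tree_sum(weights, root=0):
    """Sum over all directed spanning trees rooted at `root` of the product
    of edge weights, via the directed matrix-tree theorem. weights[i, j] is
    the weight of edge i -> j. A sketch, not the thesis implementation."""
    w = weights.copy()
    np.fill_diagonal(w, 0.0)
    # In-degree Laplacian: L[j, j] = total weight into j; L[i, j] = -w[i, j].
    lap = np.diag(w.sum(axis=0)) - w
    # Deleting the root's row and column leaves a minor whose determinant
    # equals the desired sum: the model's partition function.
    minor = np.delete(np.delete(lap, root, axis=0), root, axis=1)
    return np.linalg.det(minor)
```

The probability of any one candidate tree is then the product of its edge weights divided by this partition function, which is what makes likelihood-based training of the DST model tractable.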
1.3 Node Representation Learning
The modeling assumption of a directed spanning tree for cascade structure poses a few problems. In many scenarios, for example, a node is more realistically affected by multiple previously activated nodes, which violates the tree constraint on cascade structure. Therefore, our DST model might fail to capture the many-to-one relationships of “complex contagion”. A directed acyclic graph (DAG) is more suitable for capturing such relationships unfolding in time. A DAG’s prohibition on directed cycles ensures that, within a given cascade, a piece of information cannot flow back to itself. However, according to Cui et al. (2018), representing the network using the traditional graph structure has several issues: high computational complexity, low parallelizability, inapplicability of machine learning methods such as deep neural networks, etc. We therefore do not intend to directly model the DAG structure, but find an alternative model that builds in the assumption that the cascade structure is a DAG. Recently, network embeddings have become popular: instead of learning the edges of the underlying network, this approach embeds nodes into a continuous latent space that tries to preserve structure, properties, and side information such as text. It also enables applications to several tasks such as node classification, node clustering, link prediction, and so on. However, many of these efforts learn node representations from fully observed networks as a whole, such as DeepWalk (Perozzi et al., 2014), Node2Vec (Grover and Leskovec, 2016), and GraphSAGE (Hamilton et al., 2017). Among the relatively few network embedding methods utilizing information cascades are Embedded-IC (Bourigault et al., 2016),
DeepCas (Li et al., 2017) and Topo-LSTM (Wang et al., 2017a). Many of these models of information cascades require knowledge of the underlying network structure or the transmission probabilities between nodes, which can be expensive or impractical to obtain for certain social networks. Learning to predict the next activated nodes in a cascade could be of interest to several applications. For example, in public policy diffusion networks, having a model that predicts links between states might help qualitative analysis and visualization; however, predicting the next state that might adopt a certain policy could prove to be a more valuable result, along with the fact that we can actually assess the effectiveness of such a model. To take another example, in viral marketing, the bidders on advertisements are usually more interested in knowing which target groups are more likely to get the information next than in how the information is passed. We therefore introduce a network embedding model that uses the observed relative order of nodes in a cascade to guide the training process, which is unsupervised with regard to the underlying network structure. Our model also builds in the assumption that the cascade structure is a DAG, instead of a directed spanning tree. Finally, by the nature of the model, it is easy to include different nodal side information, such as texts, in the learning and inference stages.
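The core idea can be sketched as a small, causally masked Transformer over the activation sequence; everything below (sizes, names, the omission of positional encodings and of the DAG-aware attention) is a simplifying assumption, with the full architecture developed in Chapter 5.

```python
import torch
import torch.nn as nn

class NextNodeSketch(nn.Module):
    """Predict the next activated node in a cascade with masked
    self-attention; the embedding table doubles as the learned node
    representations. Positional encodings omitted for brevity."""

    def __init__(self, n_nodes, dim=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(n_nodes, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.out = nn.Linear(dim, n_nodes)

    def forward(self, cascade):
        # cascade: (batch, length) node ids in activation order.
        n = cascade.size(1)
        # Causal mask: position t may attend only to positions <= t.
        mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)
        h = self.encoder(self.embed(cascade), mask=mask)
        return self.out(h)  # logits over nodes, one step ahead

# Training: cross-entropy between the logits at position t and the node
# activated at position t + 1; no supervision from the hidden network
# structure is needed.
```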
1.4 Overview of The Thesis
In Chapter 2, we elaborate on detecting local lexical text reuse in social networks in an efficient way, and we perform intrinsic and extrinsic evaluations of the methods. In Chapter 3, we show how to use an attention-based neural network to select similar sentences from irrelevant surrounding contexts, and we evaluate the model on the ACL Anthology Corpus. Chapter 4 shows how we model the structure of a cascade as a directed spanning tree and the effectiveness of including side information. Chapter 5 introduces a self-attention network embedding model based on Transformers to learn node representations. We use several real-life social networks for evaluation. Finally, we summarize the modeling contributions of this thesis in the concluding Chapter 6.

Chapter 2
Text reuse in social networks
As noted in the introduction, we are interested in detecting passages of text reuse (poems, stories, political talking points) that comprise a small fraction of the containing documents (newspaper issues, political speeches). Using the terminology of biological sequence alignment, we are interested in local alignments between documents. In text reuse detection research, two primary methods are n-gram shingling and locality-sensitive hashing (LSH) (Henzinger, 2006). The need for local alignments makes LSH less practical without performing a large number of sliding-window matches. In contrast to work on near-duplicate document detection and to work on “meme tracking” that takes text between quotation marks as the unit of reuse (Leskovec et al., 2009; Suen et al., 2013), here the boundaries of the reused passages are not known. Also in contrast to work on the contemporary news cycle and blogosphere, we are interested both in texts that are reprinted within a few days and in texts reprinted after many years. We thus cannot exclude potentially matching documents for being far removed in time. Text reuse that occurs only among documents from the same “source” (run of newspapers; Member of Congress) should be excluded. Similarly, Henzinger (2006) notes that many of the errors in near-duplicate webpage detection arose from false matches among documents from the same
website that shared boilerplate navigational elements.
2.1 Local Text Reuse Detection
Two approaches to text reuse detection have been broadly explored. One approach hashes subsequences of words in documents to construct fingerprints of the document. This approach is known to work well for copy detection. Shivakumar and Garcia-Molina (1995, 1998) and Broder (1997) present generic frameworks for working with a fingerprinting approach, and various selection algorithms have been proposed since (Manber et al., 1994; Heintze et al., 1996; Brin et al., 1995; Schleimer et al., 2003), due to the computational complexity of handling many fingerprints. Broder et al. (1997) propose super-shingling, which hashes sequences of fingerprints again. Charikar (2002) introduced a hashing algorithm based on random projections of words in documents. Henzinger (2006) compares the methods of Charikar and Broder et al. on a large collection of web pages and proposes a combined algorithm. Schleimer et al. (2003) propose a fingerprinting algorithm named Winnowing, which uses a second window size w: for each consecutive sequence of w shingles, it outputs the shingle with the smallest fingerprint value, or the right-most such shingle if no unique smallest shingle exists. Seo and Croft (2008) proposed two new algorithms: revised hash-breaking, which breaks the document into non-overlapping text segments at the hashed tokens whose values are divisible by some integer p and fingerprints all the tokens in each segment; and DCT (discrete cosine transformation) fingerprinting, which is based on hash-breaking but applies a DCT to sequences of text segments and quantizes each coefficient, from which the final fingerprint is formed. Abdel-Hamid et al. (2009) evaluate the methods of Seo and Croft (2008) and propose a new algorithm, Hailstorm, which selects a shingle s iff the minimum fingerprint value of all k fingerprinted tokens in s occurs at the first or the last position of s. Mittelbach et al. (2010) propose a selection
strategy Anchor, which locates and hashes certain predefined strings (anchors) in a given document. Another approach is computing similarities between documents. Shivakumar and Garcia-Molina (1995) suggest similarity measures based on the relative frequency of words between documents. Hoad and Zobel (2003) extend this notion and explore different variants of identity measures based on the cosine measure, and different parameters for fingerprinting algorithms, such as generation, granularity, and resolution, for full-document duplication detection.
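Of the selection strategies above, Winnowing is a representative example. The following minimal sketch, assuming shingle hashes have already been computed, implements the rule as described: within each window of w hashes, keep the smallest, taking the right-most one on ties.

```python
def winnow(hashes, w):
    """Winnowing fingerprint selection (Schleimer et al., 2003): slide a
    window of w shingle hashes, keep each window's minimum, and prefer
    the right-most minimum on ties. Returns (position, hash) pairs.
    A sketch for illustration, not our detection pipeline."""
    picks = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # Right-most occurrence of the minimum: search the reversed window.
        offset = w - 1 - window[::-1].index(min(window))
        pick = (start + offset, window[offset])
        if not picks or picks[-1] != pick:  # record each fingerprint once
            picks.append(pick)
    return picks

# Example: winnow([77, 74, 42, 17, 98, 50, 17, 98], w=4)
# returns [(3, 17), (6, 17)].
```

Selecting one hash per window guarantees that any sufficiently long match between two documents leaves at least one fingerprint in common, which is what makes the strategy attractive for reuse detection.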
2.2 Efficient N-gram Indexing
The first step in our approach is to build, for each n-gram feature, an inverted index of the documents where it appears. As in other duplicate detection and text reuse applications, we are only interested in n-grams shared by two or more documents. The index, therefore, does not need to contain entries for the n-grams that occur only once. We use the two-pass space-efficient algorithm described by Huston et al. (2011), which, empirically, is very efficient on large collections. In a first pass, n-grams are hashed into a fixed number of bins. On the second pass, n-grams that hash to bins with one occupant can be discarded; other postings are passed through. Due to hash collisions, there may still be a small number of singleton n-grams that reach this stage. These singletons are filtered out as the index is written. In building an index of n-grams, an index of (n-1)-grams can also provide a useful filter: no 5-gram, for example, can occur twice unless its constituent 4-grams occur at least twice.
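A simplified sketch of this two-pass filter follows; the bin count and helper function here are our own illustrative choices, not the exact parameters of Huston et al. (2011).

```python
from collections import Counter

def repeated_ngram_index(docs, n, n_bins=1 << 20):
    """Two-pass construction of an inverted index restricted to n-grams
    appearing in two or more documents. A simplified sketch."""
    def ngrams(doc):
        toks = doc.split()
        for i in range(len(toks) - n + 1):
            yield i, ' '.join(toks[i:i + n])

    # Pass 1: count n-gram occurrences per hash bin, in fixed memory.
    bins = Counter(hash(g) % n_bins for d in docs for _, g in ngrams(d))

    # Pass 2: keep postings only for n-grams whose bin has 2+ occupants.
    index = {}
    for doc_id, doc in enumerate(docs):
        for pos, g in ngrams(doc):
            if bins[hash(g) % n_bins] > 1:
                index.setdefault(g, []).append((doc_id, pos))

    # Drop singletons that survived only through hash collisions, i.e.,
    # n-grams found in just one document.
    return {g: p for g, p in index.items()
            if len({d for d, _ in p}) > 1}
```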
2.3 Extracting and Ranking Candidate Pairs
Once we have an inverted index of the documents that contain each (skip) n-gram, we use it to generate and rank document pairs that are candidates for containing reprinted texts.
Each entry, or posting list, in the index may be viewed as a set of pairs (d_i, p_i) that record the document identifier and the position in that document of that n-gram. Once we have a posting list of documents containing each distinct n-gram, we output all pairs of documents in each list. We suppress repeated n-grams that appear in different issues of the same newspaper. These repetitions often occur in editorial boilerplate or advertisements, which, while interesting, are outside the scope of this project. We also