
Semantic Document Distance Measures and Unsupervised Document Revision Detection

Xiaofeng Zhu, Diego Klabjan
Northwestern University, USA
[email protected], [email protected]

Patrick N. Bless
Intel Corporation, Chandler, AZ, USA
[email protected]

Abstract

In this paper, we model the document revision detection problem as a minimum cost branching problem that relies on computing document distances. Furthermore, we propose two new document distance measures, word vector-based Dynamic Time Warping (wDTW) and word vector-based Tree Edit Distance (wTED). Our revision detection system is designed for a large scale corpus and implemented in Apache Spark. We demonstrate that our system can more precisely detect revisions than state-of-the-art methods by utilizing the Wikipedia revision dumps¹ and simulated data sets.

1 Introduction

It is a common habit for people to keep several versions of documents, which creates duplicate data. A scholarly article is normally revised several times before being published. An academic paper may be listed on personal websites, digital conference libraries, Google Scholar, etc. In major corporations, a document typically goes through several revisions involving multiple editors and authors. Users would benefit from visualizing the entire history of a document. It is worthwhile to develop a system that is able to intelligently identify, manage and represent revisions. Given a collection of text documents, our study identifies revision relationships in a completely unsupervised way. For each document in a corpus we only use its content and the last modified timestamp. We assume that a document can be revised by many users, but that the documents are not merged together. We consider collaborative editing as revising documents one by one.

The two research problems that are most relevant to document revision detection are plagiarism detection and revision provenance. In a plagiarism detection system, every incoming document is compared with all registered non-plagiarized documents (Si et al., 1997; Oberreuter and Velásquez, 2013; Hagen et al., 2015; Abdi et al., 2015). The system returns true if an original copy is found in the database; otherwise, the system returns false and adds the document to the database. Thus, it is a 1-to-n problem. Revision provenance is a 1-to-1 problem as it keeps track of detailed updates of one document (Buneman et al., 2001; Zhang and Jagadish, 2013). Real-world applications include GitHub, version control in Microsoft Word and Wikipedia version trees (Sabel, 2007). In contrast, our system solves an n-to-n problem on a large scale. Our potential target data sources, such as the entire web or internal corpora in corporations, contain numerous original documents and their revisions. The aim is to find all revision document pairs within a reasonable time.

Document revision detection, plagiarism detection and revision provenance all rely on comparing the content of two documents and assessing a distance/similarity score. The classic document similarity measure, especially for plagiarism detection, is fingerprinting (Hoad and Zobel, 2003; Charikar, 2002; Schleimer et al., 2003; Fujii and Ishikawa, 2001; Manku et al., 2007; Manber et al., 1994). Fixed-length fingerprints are created using hash functions to represent document features and are then used to measure document similarities. However, the main purpose of fingerprinting is to reduce computation instead of improving accuracy, and it cannot capture word semantics. Another widely used approach is computing the sentence-to-sentence Levenshtein distance and assigning an overall score for every document pair (Gustafson et al., 2008). Nevertheless, due to the large number of existing documents, as well as the large number of sentences in each document, the Levenshtein distance is not computation-friendly. Although alternatives such as the vector space model (VSM) can largely reduce the computation time, their effectiveness is low. More importantly, none of the above approaches can capture semantic meanings of words, which heavily limits the performances of these approaches. For instance, from a semantic perspective, "I went to the bank" is expected to be similar to "I withdrew some money" rather than "I went hiking." Our document distance measures are inspired by the weaknesses of current document distance/similarity measures and recently proposed models for word representations such as word2vec (Mikolov et al., 2013) and Paragraph Vector (PV) (Le and Mikolov, 2014). Replacing words with distributed vector embeddings makes it feasible to measure semantic distances using advanced algorithms, e.g., Dynamic Time Warping (DTW) (Sakurai et al., 2005; Müller, 2007; Matuschek et al., 2008) and Tree Edit Distance (TED) (Tai, 1979; Zhang and Shasha, 1989; Klein, 1998; Demaine et al., 2007; Pawlik and Augsten, 2011, 2014, 2015, 2016). Although calculating text distance using DTW (Liu et al., 2007), TED (Sidorov et al., 2015) or Word Mover's Distance (WMD) (Kusner et al., 2015) has been attempted in the past, these measures are not ideal for large-scale document distance calculation. The first two algorithms were designed for sentence distances instead of document distances. The third measure computes the distance of two documents by solving a transshipment problem between words in the two documents and uses word2vec embeddings to calculate semantic distances of words. The biggest limitation of WMD is its long computation time. We show in Section 5.3 that our wDTW and wTED measures yield more precise distance scores with much shorter running time than WMD.

We recast the problem of detecting document revisions as a network optimization problem (see Section 2) and consequently as a set of document distance problems (see Section 4). We use trained word vectors to represent words, concatenate the word vectors to represent documents and combine word2vec with DTW or TED. Meanwhile, in order to guarantee reasonable computation time in large data sets, we calculate document distances at the paragraph level with Apache Spark. A distance score is computed by feeding paragraph representations to DTW or TED. Our code and data are publicly available.²

The primary contributions of this work are as follows.

• We specify a model and algorithm to find the optimal document revision network from a large corpus.

• We propose two algorithms, wDTW and wTED, for measuring semantic document distances based on distributed representations of words. The wDTW algorithm calculates document distances based on DTW by sequentially comparing any two paragraphs of two documents. The wTED method represents the section and subsection structures of a document in a tree with paragraphs being leaves. Both algorithms hinge on the distance between two paragraphs.

The rest of this paper is organized as follows. In Section 2, we clarify related terms and explain the methodology for document revision detection. In Section 3, we provide a brief background on existing document similarity measures. In Section 4, we present our wDTW and wTED algorithms as well as the overall process flow. In Section 5, we demonstrate our revision detection results on Wikipedia revision dumps and six simulated data sets. Finally, in Section 6, we summarize some concluding remarks and discuss avenues for future work and improvements.

1 https://snap.stanford.edu/data/wiki-meta.html
2 https://github.com/XiaofengZhu/wDTW-wTED

2 Revision Network

The two requirements for a document D̄ being a revision of another document D̃ are that D̄ has been created later than D̃ and that the content of D̄ is similar to (has been modified from) that of D̃. More specifically, given a corpus D, for any two documents D̄, D̃ ∈ D, we want to find out the yes/no revision relationship of D̄ and D̃, and then output all such revision pairs.

We assume that each document has a creation date (the last modified timestamp) which is readily available from the meta data of the document. In this section we also assume that we have a Dist method and a cut-off threshold τ.

(a) Revision network N′   (b) Cleaned revision network N   (c) Possible solution R

Figure 1: Revision network visualization

We represent a corpus as a network N′ = (V′, A), for example Figure 1a, in which a vertex corresponds to a document. There is an arc a = (D̄, D̃) if and only if Dist(D̄, D̃) ≤ τ and the creation date of D̄ is before the creation date of D̃. In other words, D̃ is a revision candidate for D̄. By construction, N′ is acyclic. For instance, d2 is a revision candidate for d7 and d1. Note that we allow one document to be the original document of several revised documents. As we only need to focus on revision candidates, we reduce N′ to N = (V, A), shown in Figure 1b, by removing isolated vertices. We define the weight of an arc as the distance score between the two vertices. Recall the assumption that a document can be a revision of at most one document. In other words, documents cannot be merged. Due to this assumption, all revision pairs form a branching in N. (A branching is a subgraph where each vertex has an in-degree of at most 1.) The document revision problem is to find a minimum cost branching in N (see Figure 1c).

The minimum branching problem was earlier solved by Edmonds (1967) and Velardi et al. (2013). The details of Edmonds' algorithm are as follows.

– For each node select the smallest weighted incoming arc. This yields a subgraph.

– If cycles are present in the selected subgraph, then recursively find the replacing arc that has the minimum weight among previously non-selected arcs to eliminate cycles.

In our case, N is acyclic and, therefore, the second step never occurs. For this reason, Algorithm 1 solves the document revision problem.

Algorithm 1 Find minimum branching R for network N = (V, A)
1: Input: N
2: R = ∅
3: for every vertex u ∈ V do
4:     Set δ(u) to correspond to all arcs with head u
5:     Select a = (v, u) ∈ δ(u) such that Dist(v, u) is minimum
6:     R = R ∪ {a}
7: end for
8: Output: R

The essential part of determining the minimum branching R is extracting arcs with the lowest distance scores. This is equivalent to finding the most similar document from the revision candidates for every original document.
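For illustration, Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch, not the released implementation; the arc-list format, the function name and the toy scores below are chosen for exposition, and the candidate arcs of the cleaned network N are assumed to be already computed.

```python
from collections import defaultdict

def minimum_branching(arcs):
    """Sketch of Algorithm 1: arcs are (tail, head, distance) tuples of the
    cleaned revision network N, where the tail document was created before
    the head document and their distance is already below the threshold tau.
    Because N is acyclic, keeping the cheapest incoming arc of every vertex
    yields a minimum cost branching R."""
    delta = defaultdict(list)            # delta(u): all arcs with head u
    for tail, head, dist in arcs:
        delta[head].append((dist, tail))
    R = []
    for head, incoming in delta.items():
        dist, tail = min(incoming)       # smallest weighted incoming arc
        R.append((tail, head, dist))
    return R

# Toy example (made-up scores): d2 is a revision candidate for d7 and d1,
# so only the cheaper of the two incoming arcs is kept.
print(minimum_branching([("d7", "d2", 0.21), ("d1", "d2", 0.35)]))
```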

3 Distance/similarity Measures

In this section, we first introduce the classic VSM model, the word2vec model, DTW and TED. We next demonstrate how to combine the above components to construct our semantic document distance measures: wDTW and wTED. We also discuss the implementation of our revision detection system.

3.1 Background

3.1.1 Vector Space Model

VSM represents a set of documents as vectors of identifiers. The identifier of a word used in this work is the tf-idf weight. We represent documents as tf-idf vectors, and thus the similarity of two documents can be described by the cosine distance between their vectors. VSM has low algorithm complexity but cannot represent the semantics of words since it is based on the bag-of-words assumption.

3.1.2 Word2vec

Word2vec produces semantic embeddings for words using a two-layer neural network. Specifically, word2vec relies on a skip-gram model that uses the current word to predict context words in a surrounding window to maximize the average log probability. Words with similar meanings tend to have similar embeddings.

3.1.3 Dynamic Time Warping

DTW was developed originally for speech recognition in time series analysis and has been widely used to measure the distance between two sequences of vectors.

Given two sequences of feature vectors A = a_1, a_2, ..., a_i, ..., a_m and B = b_1, b_2, ..., b_j, ..., b_n, DTW finds the optimal alignment for A and B by first constructing an (m + 1) × (n + 1) matrix in which the (i, j)-th element is the alignment cost of a_1...a_i and b_1...b_j, and then retrieving the path from one corner to the diagonal one through the matrix that has the minimal cumulative distance. This algorithm is described by the following formula:

DTW(i, j) = Dist(a_i, b_j) + min( DTW(i − 1, j),       // insertion
                                  DTW(i, j − 1),       // deletion
                                  DTW(i − 1, j − 1) )  // substitution

3.1.4 Tree Edit Distance

TED was initially defined to calculate the minimal cost of node edit operations for transforming one labeled tree into another. The node edit operations are defined as follows.

– Deletion: delete a node and connect its children to its parent maintaining the order.

– Insertion: insert a node between an existing node and a subsequence of consecutive children of this node.

– Substitution: rename the label of a node.

Let L1 and L2 be two labeled trees, and let L_k[i] be the i-th node in L_k (k = 1, 2). M corresponds to a mapping from L1 to L2. TED finds the mapping M with the minimal edit cost based on

c(M) = min{ Σ_{(i,j)∈M} cost(L1[i] → L2[j]) + Σ_{i∈I} cost(L1[i] → ∧) + Σ_{j∈J} cost(∧ → L2[j]) },

where L1[i] → L2[j] means transferring L1[i] to L2[j] based on M, ∧ represents an empty node, and I and J index the nodes of L1 and L2 that are not covered by M.

3.2 Semantic Distance between Paragraphs

According to the description of DTW in Section 3.1.3, the distance between two documents can be calculated using DTW by replacing each element in the feature vectors A and B with a word vector. However, computing the DTW distance between two documents at the word level is basically as expensive as calculating the Levenshtein distance. Thus in this section we propose an improved algorithm that is more appropriate for document distance calculation.

In order to receive semantic representations for documents and maintain a reasonable algorithm complexity, we use word2vec to train word vectors and represent each paragraph as a sequence of vectors. Note that in both wDTW and wTED we take document titles and section titles as paragraphs. Although a more recently proposed model PV can directly train vector representations for short texts such as movie reviews (Le and Mikolov, 2014), our experiments in Section 5.3 show that PV is not appropriate for standard paragraphs in general documents. Therefore, we use word2vec in our work. Algorithm 2 describes how we compute the distance between two paragraphs based on DTW and word vectors. The distance between one paragraph in a document and one paragraph in another document can be pre-calculated in parallel using Spark to provide faster computation for wDTW and wTED.

Algorithm 2 DistPara
Replace the words in paragraphs p and q with word2vec embeddings: {v_i}_{i=1}^{e} and {w_j}_{j=1}^{f}
Input: p = [v_1, .., v_e] and q = [w_1, .., w_f]
Initialize the first row and the first column of the (e + 1) × (f + 1) matrix DTW_para to +∞
DTW_para(0, 0) = 0
for i in range (1, e + 1) do
    for j in range (1, f + 1) do
        Dist(v_i, w_j) = ||v_i − w_j||
        calculate DTW_para(i, j)
    end for
end for
Return: DTW_para(e, f)
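As a concrete illustration of Algorithm 2 (not the Spark implementation used in our system), the following NumPy sketch computes DistPara; the word vectors are assumed to be given as arrays, and the empty-paragraph case is handled so the same function can later serve as the wTED insertion/deletion cost.

```python
import numpy as np

def dist_para(p, q):
    """Sketch of Algorithm 2 (DistPara): DTW between two paragraphs p and q,
    each an array of word2vec embeddings with shape (num_words, dim).
    The cost of matching two words is the L2 distance of their vectors."""
    e, f = len(p), len(q)
    if e == 0 or f == 0:
        # Distance to an empty paragraph: sum of the L2-norms of the other
        # paragraph's vectors (the insertion/deletion cost used by wTED).
        rest = q if e == 0 else p
        return float(sum(np.linalg.norm(v) for v in rest))
    dtw = np.full((e + 1, f + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, e + 1):
        for j in range(1, f + 1):
            cost = np.linalg.norm(p[i - 1] - q[j - 1])
            # insertion, deletion, substitution
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return float(dtw[e, f])
```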

4 wDTW and wTED Measures

4.1 Word Vector-based Dynamic Time Warping

As a document can be considered as a sequence of paragraphs, wDTW returns the distance between two documents by applying another DTW on top of paragraphs. The cost function is exactly the DistPara distance of two paragraphs given in Algorithm 2. Algorithm 3 and Figure 2 describe our wDTW measure. wDTW observes semantic information from word vectors, which is fundamentally different from the word distance calculated from hierarchies among words in the algorithm proposed by Liu et al. (2007). The shortcomings of their work are that it is difficult to learn semantic taxonomy of all words and that their DTW algorithm can only be applied to sentences not documents.

Figure 2: wDTW visualization

Algorithm 3 wDTW
Represent documents d1 and d2 with vectors of paragraphs: {p_i}_{i=1}^{m} and {q_j}_{j=1}^{n}
Input: d1 = [p_1, .., p_m] and d2 = [q_1, .., q_n]
Initialize the first row and the first column of the (m + 1) × (n + 1) matrix DTW_doc to +∞
DTW_doc(0, 0) = 0
for i in range (1, m + 1) do
    for j in range (1, n + 1) do
        Dist(p_i, q_j) = DistPara(p_i, q_j)
        calculate DTW_doc(i, j)
    end for
end for
Return: DTW_doc(m, n)
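A matching sketch of Algorithm 3 is shown below, reusing dist_para from the previous block; in the actual system the paragraph-to-paragraph distances would be pre-computed in parallel with Spark rather than inside this double loop.

```python
import numpy as np

def wdtw(d1, d2):
    """Sketch of Algorithm 3 (wDTW): a second DTW over paragraphs, where d1
    and d2 are lists of paragraphs and every paragraph is an array of
    word2vec embeddings.  dist_para is the Algorithm 2 sketch above."""
    m, n = len(d1), len(d2)
    dtw = np.full((m + 1, n + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = dist_para(d1[i - 1], d2[j - 1])   # paragraph-level cost
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return float(dtw[m, n])
```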

4.2 Word Vector-based Tree Edit Distance

TED is reasonable for measuring document distances as documents can be easily transformed to tree structures visualized in Figure 3. The document tree concept was originally proposed by Si et al. (1997). A document can be viewed at multiple abstraction levels that include the document title, its sections, subsections, etc. Thus for each document we can build a tree-like structure with title → sections → subsections → ... → paragraphs being paths from the root to leaves. Child nodes are ordered from left to right as they appear in the document.

Figure 3: Document tree

We represent labels in a document tree as the vector sequences of titles, sections, subsections and paragraphs with word2vec embeddings. wTED converts documents to tree structures and then uses DistPara distances. More formally, the distance between two nodes is computed as follows.

– The cost of substitution is the DistPara value of the two nodes.

– The cost of insertion is the DistPara value of an empty sequence and the label of the inserted node. This essentially means that the cost is the sum of the L2-norms of the word vectors in that node.

– The cost of deletion is the same as the cost of insertion.

Compared to the algorithm proposed by Sidorov et al. (2015), wTED provides different edit cost functions and uses document tree structures instead of syntactic n-grams, and thus wTED yields more meaningful distance scores for long documents. Algorithm 4 and Figure 4 describe how we calculate the edit cost between two document trees.

Algorithm 4 wTED
1: Convert documents d1 and d2 to trees T1 and T2
2: Input: T1 and T2
3: Initialize tree edit distance c = +∞
4: for node label p ∈ T1 do
5:     for node label q ∈ T2 do
6:         Update TED mapping cost c using
7:             cost(p → q) = DistPara(p, q)
8:             cost(p → ∧) = DistPara(p, ∧)
9:             cost(∧ → q) = DistPara(∧, q)
10:    end for
11: end for
12: Return: c
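The node costs used by Algorithm 4 can be sketched as follows; this is our illustration, not the released code. The Node class, the 300-dimensional embeddings and the helper names are assumptions, and the three cost functions are meant to be plugged into any off-the-shelf ordered tree edit distance routine (for example an APTED-style implementation in the spirit of Pawlik and Augsten) that accepts custom costs.

```python
import numpy as np

class Node:
    """A document tree node whose label (title, section heading or paragraph)
    is stored as an array of word2vec embeddings."""
    def __init__(self, label, children=None):
        self.label = label              # shape (num_words, dim)
        self.children = children or []

EMPTY = np.zeros((0, 300))              # the empty node: a zero-length vector sequence

def substitution_cost(p, q):
    return dist_para(p.label, q.label)  # DistPara of the two node labels

def insertion_cost(q):
    # DistPara of an empty sequence and the inserted label, i.e. the sum of
    # the L2-norms of the word vectors in that node.
    return dist_para(EMPTY, q.label)

def deletion_cost(p):
    return dist_para(p.label, EMPTY)    # same form as the insertion cost
```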

Figure 4: wTED visualization

4.3 Process Flow

Our system is a boosting learner that is composed of four modules: weak filter, strong filter, revision network and optimal subnetwork. First of all, we sort all documents by timestamps and pair up documents so that we only compare each document with documents that have been created earlier. In the first module, we calculate the VSM similarity scores for all pairs and eliminate those with scores that are lower than an empirical threshold (τ̃ = 0.5). This is what we call the weak filter. After that, we apply the strong filter wDTW or wTED on the remaining pairs and filter out document pairs having distances higher than a threshold τ. For VSM in Section 5.1, we directly filter out document pairs having similarity scores lower than a threshold τ. The cut-off threshold estimation is explained in Section 4.4. The remaining document pairs from the strong filter are then sent to the revision network module. In the end, we output the optimal revision pairs following the minimum branching strategy.

4.4 Estimating the Cut-off Threshold

Hyperparameter τ is calibrated by calculating the absolute extreme based on an initial set of documents, i.e., all processed documents since the moment the system was put in use. Based on this set, we calculate all distance/similarity scores and create a histogram, see Figure 5. The figure shows the correlation between the number of document pairs and the similarity scores in the training process of one simulated corpus using VSM. The optimal τ in this example is around 0.6, where the number of document pairs noticeably drops. As the system continues running, new documents become available and τ can be periodically updated by using the same method.

Figure 5: Setting τ
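One way to realize the weak filter described in Section 4.3 is a tf-idf/cosine pass, for instance with scikit-learn. This is a simplified sketch: the function name, the all-pairs loop and the returned format are our own, the 0.5 threshold follows the text above, and the actual system runs this step on Spark.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def weak_filter(docs, timestamps, tau_weak=0.5):
    """VSM weak filter sketch: docs are raw document strings, timestamps their
    creation dates.  Returns (earlier, later) index pairs whose tf-idf cosine
    similarity reaches tau_weak; only these pairs go to the strong filter
    (wDTW or wTED) and then to the revision network module."""
    sims = cosine_similarity(TfidfVectorizer().fit_transform(docs))
    candidates = []
    for i in range(len(docs)):
        for j in range(len(docs)):
            if timestamps[i] < timestamps[j] and sims[i, j] >= tau_weak:
                candidates.append((i, j))   # doc j is a revision candidate of doc i
    return candidates
```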

5 Numerical Experiments

This section reports the results of the experiments conducted on two data sets for evaluating the performances of wDTW and wTED against other baseline methods.

5.1 Distance/Similarity Measures

We denote the following distance/similarity measures.

– wDTW: Our semantic distance measure explained in Section 4.1.

– wTED: Our semantic distance measure explained in Section 4.2.

– WMD: The Word Mover's Distance introduced in Section 1. WMD adapts the earth mover's distance to the space of documents.

– VSM: The similarity measure introduced in Section 3.1.1.

– PV-DTW: PV-DTW is the same as Algorithm 3 except that the distance between two paragraphs is not based on Algorithm 2 but rather computed as ||PV(p_1) − PV(p_2)||, where PV(p) is the PV embedding of paragraph p.

– PV-TED: PV-TED is the same as Algorithm 4 except that the distance between two paragraphs is not based on Algorithm 2 but rather computed as ||PV(p_1) − PV(p_2)||.

Our experiments were conducted on an Apache Spark cluster with 32 cores and 320 GB total memory. We implemented wDTW, wTED, WMD, VSM, PV-DTW and PV-TED in Java Spark. The paragraph vectors for PV-DTW and PV-TED were trained by gensim.³

3 https://radimrehurek.com/gensim/models/doc2vec.html

5.2 Data Sets

In this section, we introduce the two data sets we used for our revision detection experiments: Wikipedia revision dumps and a document revision data set generated by a computer simulation. The two data sets differ in that the Wikipedia revision dumps only contain linear revision chains, while the simulated data sets also contain tree-structured revision chains, which can be very common in real-world data.

5.2.1 Wikipedia Revision Dumps

The Wikipedia revision dumps that were previously introduced by Leskovec et al. (2010) contain eight GB (compressed size) of revision edits with meta data.

We pre-processed the Wikipedia revision dumps using the JWPL Revision Machine (Ferschke et al., 2011) and produced a data set that contains 62,234 documents with 46,354 revisions. As we noticed that short documents just contributed to noise (graffiti) in the data, we eliminated documents that have fewer than three paragraphs and fewer than 300 words. We removed empty lines in the documents and trained word2vec embeddings on the entire corpus. We used the documents occurring in the first 80% of the revision period for τ calibration, and the remaining documents for test.
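The word2vec training step can be reproduced with gensim along the following lines; the hyperparameters shown are illustrative defaults rather than the settings used in the experiments (gensim 4.x names the dimension vector_size, older releases call it size).

```python
from gensim.models import Word2Vec

# One tokenized paragraph per entry; in the experiments this would be the
# pre-processed Wikipedia corpus, here just a tiny stand-in.
sentences = [["document", "revision", "detection"],
             ["semantic", "document", "distance", "measure"]]

model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)
vector = model.wv["document"]    # the embedding fed to DistPara
```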
5.2.2 Simulated Data Sets

The generation process of the simulated data sets is designed to mimic the real world. Users open some existing documents in a file system, make some changes (e.g. addition, deletion or replacement), and save them as separate documents. These documents become revisions of the original documents. We started from an initial corpus that did not have revisions, and kept adding new documents and revising existing documents. Similar to a file system, at any moment new documents could be added and/or some of the current documents could be revised. The revision operations we used were deletion, addition and replacement of words, sentences, paragraphs, section names and document titles. The addition of words, ..., section names, and new documents were pulled from the Wikipedia abstracts. This corpus generation process had five time periods {t_1, t_2, ..., t_5}. Figure 6 illustrates this simulation. We set a Poisson distribution with rate λ = 550 (the number of documents in the initial corpus) to control the number of new documents added in each time period, and a Poisson distribution with rate 0.5λ to control the number of documents revised in each time period.

Figure 6: Corpora simulation

We generated six data sets using different random seeds, and each data set contained six corpora (Corpus 0 - 5). Table 1 summarizes the first data set. In each data set, we name the initial corpus Corpus 0, and define T_0 as the timestamp when we started this simulation process. We set T_j = T_{j−1} + t_j, j ∈ [1, 5]. Corpus j corresponds to documents generated before timestamp T_j. We extracted document revisions from Corpus k ∈ [2, 5] and compared the revisions generated in (Corpus k − Corpus (k − 1)) with the ground truths in Table 1. Hence, we ran four experiments on this data set in total. In every experiment, τ_k is calibrated based on Corpus (k − 1). For instance, the training set of the first experiment was Corpus 1. We trained τ_1 from Corpus 1. We extracted all revisions in Corpus 2, and compared revisions generated in the test set (Corpus 2 − Corpus 1) with the ground truth: 258 revised documents. The word2vec model shared in the four experiments was trained on Corpus 5.

Table 1: A simulated data set

Corpus   Number of documents   Number of new documents   Number of revision pairs
0        550                   0                         0
1        1347                  542                       255
2        2125                  520                       258
3        2912                  528                       259
4        3777                  580                       285
5        4582                  547                       258

5.3 Results

We use precision, recall and F-measure to evaluate the detected revisions. A true positive case is a correctly identified revision. A false positive case is an incorrectly identified revision. A false negative case is a missed revision record.

We illustrate the performances of wDTW, wTED, WMD, VSM, PV-DTW and PV-TED on the Wikipedia revision dumps in Figure 7. wDTW and wTED have the highest F-measure scores compared to the rest of the four measures, and wDTW also has the highest precision and recall scores. Figure 8 shows the average evaluation results on the simulated data sets. From left to right, the corpus size increases and the revision chains become longer, thus it becomes more challenging to detect document revisions. Overall, wDTW consistently performs the best. WMD is slightly better than wTED. In particular, when the corpus size increases, the performances of WMD, VSM, PV-DTW and PV-TED drop faster than those of wDTW and wTED. Because the revision operations were randomly selected in each corpus, it is possible that there are non-monotone points in the series.

wDTW and wTED perform better than WMD, especially when the corpus is large, because they use dynamic programming to find the global optimal alignment for documents. In contrast, WMD relies on a greedy algorithm that sums up the minimal cost for every word. wDTW and wTED perform better than PV-DTW and PV-TED, which indicates that our DistPara distance in Algorithm 2 is more accurate than the Euclidean distance between paragraph vectors trained by PV.

We show in Table 2 the average running time of the six distance/similarity measures. In all the experiments, VSM is the fastest, wTED is faster than wDTW, and WMD is the slowest. Running WMD is extremely expensive because WMD needs to solve an x² sequential transshipment problem for every two documents, where x is the average number of words in a document. In contrast, by splitting this heavy computation into several smaller problems (finding the distance between any two paragraphs), which can be run in parallel, wDTW and wTED scale much better.

Combining Figure 7, Figure 8 and Table 2, we conclude that wDTW yields the most accurate results using marginally more time than VSM, PV-TED and PV-DTW, but much less running time than WMD. wTED returns satisfactory results using shorter time than wDTW.

(a) Precision   (b) Recall   (c) F-measure

Figure 7: Precision, recall and F-measure on the Wikipedia revision dumps

(a) Precision   (b) Recall   (c) F-measure

Figure 8: Average precision, recall and F-measure on the simulated data sets

Table 2: Running time of VSM, PV-TED, PV-DTW, wTED, wDTW and WMD

Data set                   VSM       PV-TED    PV-DTW    wTED      wDTW       WMD
Wikipedia revision dumps   1h 38min  3h 2min   3h 18min  5h 13min  13h 27min  515h 9min
corpus 2                   2 min     3 min     3 min     7 min     8 min      8h 53min
corpus 3                   3 min     4 min     5 min     9 min     11 min     12h 45min
corpus 4                   4 min     6 min     6 min     11 min    12 min     14h 34min
corpus 5                   7 min     9 min     9 min     14 min    16 min     17h 31min

6 Conclusion

This paper has explored how DTW and TED can be extended with word2vec to construct semantic document distance measures: wDTW and wTED. By representing paragraphs with concatenations of word vectors, wDTW and wTED are able to capture the semantics of the words and thus give more accurate distance scores. In order to detect revisions, we have used minimum branching on an appropriately developed network with document distance scores serving as arc weights. We have also assessed the efficiency of the method of retrieving an optimal revision subnetwork by finding the minimum branching.

Furthermore, we have compared wDTW and wTED with several distance measures for revision detection tasks. Our results demonstrate the effectiveness and robustness of wDTW and wTED on the Wikipedia revision dumps and our simulated data sets. In order to reduce the computation time, we have computed document distances at the paragraph level and implemented a boosting learning system using Apache Spark. Although we have demonstrated the superiority of our semantic measures only in the revision detection experiments, wDTW and wTED can also be used as semantic distance measures in many clustering and classification tasks.

Our revision detection system can be enhanced with richer features such as author information, writing styles, and exact changes in revision pairs. Another interesting aspect we would like to explore in the future is reducing the complexity of calculating the distance between two paragraphs.

Acknowledgments

This work was supported in part by Intel Corporation and the Semiconductor Research Corporation (SRC).

References

Asad Abdi, Norisma Idris, Rasim M Alguliyev, and Ramiz M Aliguliyev. 2015. PDLK: Plagiarism detection using linguistic knowledge. Expert Systems with Applications, 42(22):8936–8946.

Peter Buneman, Sanjeev Khanna, and Tan Wang-Chiew. 2001. Why and where: A characterization of data provenance. In Database Theory ICDT 2001, pages 316–330. Springer.

Moses S Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, pages 380–388. ACM.

Erik D Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. 2007. An optimal decomposition algorithm for tree edit distance. In Automata, Languages and Programming, pages 146–157. Springer.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards B, 71(4):233–240.

Oliver Ferschke, Torsten Zesch, and Iryna Gurevych. 2011. Wikipedia revision toolkit: Efficiently accessing Wikipedia's edit history. In Proceedings of the ACL-HLT 2011 System Demonstrations, pages 97–102. ACL.

Atsushi Fujii and Tetsuya Ishikawa. 2001. Japanese/English cross-language information retrieval: Exploration of query translation and transliteration. Computers and the Humanities, 35(4):389–420.

Nathaniel Gustafson, Maria Soledad Pera, and Yiu-Kai Ng. 2008. Nowhere to hide: Finding plagiarized documents based on sentence similarity. In Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, pages 690–696. IEEE.

Matthias Hagen, Martin Potthast, and Benno Stein. 2015. Source retrieval for plagiarism detection from large web corpora: Recent approaches. Working Notes Papers of the CLEF, pages 1613–0073.

Timothy C. Hoad and Justin Zobel. 2003. Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 54(3):203–215.

Philip N Klein. 1998. Computing the edit-distance between unrooted ordered trees. In European Symposium on Algorithms, pages 91–102. Springer.

Matt J Kusner, Yu Sun, Nicholas I Kolkin, and Kilian Q Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, pages 957–966.

Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Jure Leskovec, Daniel P Huttenlocher, and Jon M Kleinberg. 2010. Governance in social media: A case study of the Wikipedia promotion process. In International Conference on Weblogs and Social Media.

Xiaoying Liu, Yiming Zhou, and Ruoshi Zheng. 2007. Sentence similarity based on dynamic time warping. In International Conference on Semantic Computing, pages 250–256. IEEE.

Udi Manber et al. 1994. Finding similar files in a large file system. In Usenix Winter, volume 94, pages 1–10.

Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, pages 141–150. ACM.

Michael Matuschek, Tim Schlüter, and Stefan Conrad. 2008. Measuring text similarity with dynamic time warping. In Proceedings of the 2008 International Symposium on Database Engineering & Applications, pages 263–267. ACM.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Meinard Müller. 2007. Dynamic time warping. Information Retrieval for Music and Motion, pages 69–84.

Gabriel Oberreuter and Juan D Velásquez. 2013. Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9):3756–3763.

Mateusz Pawlik and Nikolaus Augsten. 2011. RTED: A robust algorithm for the tree edit distance. Proceedings of the VLDB Endowment, 5(4):334–345.

Mateusz Pawlik and Nikolaus Augsten. 2014. A memory-efficient tree edit distance algorithm. In Database and Expert Systems Applications, pages 196–210. Springer.

Mateusz Pawlik and Nikolaus Augsten. 2015. Efficient computation of the tree edit distance. ACM Transactions on Database Systems, 40(1):3.

Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems, 56:157–173.

Mikalai Sabel. 2007. Structuring wiki revision history. In Proceedings of the 2007 International Symposium on Wikis, pages 125–130. ACM.

Yasushi Sakurai, Masatoshi Yoshikawa, and Christos Faloutsos. 2005. FTW: Fast similarity search under the time warping distance. In Proceedings of the Twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 326–337. ACM.

Saul Schleimer, Daniel S Wilkerson, and Alex Aiken. 2003. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85. ACM.

Antonio Si, Hong Va Leong, and Rynson WH Lau. 1997. CHECK: A document plagiarism detection system. In Proceedings of the 1997 ACM Symposium on Applied Computing, pages 70–77. ACM.

Grigori Sidorov, Helena Gómez-Adorno, Ilia Markov, David Pinto, and Nahun Loya. 2015. Computing text similarity using tree edit distance. In Fuzzy Information Processing Society (NAFIPS) held jointly with 2015 5th World Conference on Soft Computing, 2015 Annual Conference of the North American, pages 1–4. IEEE.

Kuo-Chung Tai. 1979. The tree-to-tree correction problem. Journal of the ACM, 26(3):422–433.

Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. OntoLearn reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics, 39(3):665–707.

Jing Zhang and HV Jagadish. 2013. Revision provenance in text documents of asynchronous collaboration. In 2013 IEEE 29th International Conference on Data Engineering, pages 889–900. IEEE.

Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262.