
Semantic Document Distance Measures and Unsupervised Document Revision Detection

Xiaofeng Zhu, Diego Klabjan
Northwestern University, USA
[email protected], [email protected]

Patrick N. Bless
Intel Corporation, Chandler, AZ, USA
[email protected]

Abstract

In this paper, we model the document revision detection problem as a minimum cost branching problem that relies on computing document distances. Furthermore, we propose two new document distance measures, word vector-based Dynamic Time Warping (wDTW) and word vector-based Tree Edit Distance (wTED). Our revision detection system is designed for a large scale corpus and implemented in Apache Spark. We demonstrate that our system can more precisely detect revisions than state-of-the-art methods by utilizing the Wikipedia revision dumps¹ and simulated data sets.

1 Introduction

It is a common habit for people to keep several versions of documents, which creates duplicate data. A scholarly article is normally revised several times before being published. An academic paper may be listed on personal websites, digital conference libraries, Google Scholar, etc. In major corporations, a document typically goes through several revisions involving multiple editors and authors. Users would benefit from visualizing the entire history of a document. It is worthwhile to develop a system that is able to intelligently identify, manage and represent revisions. Given a collection of text documents, our study identifies revision relationships in a completely unsupervised way. For each document in a corpus we only use its content and the last modified timestamp. We assume that a document can be revised by many users, but that the documents are not merged together. We consider collaborative editing as revising documents one by one.

The two research problems that are most relevant to document revision detection are plagiarism detection and revision provenance. In a plagiarism detection system, every incoming document is compared with all registered non-plagiarized documents (Si et al., 1997; Oberreuter and Velásquez, 2013; Hagen et al., 2015; Abdi et al., 2015). The system returns true if an original copy is found in the database; otherwise, the system returns false and adds the document to the database. Thus, it is a 1-to-n problem. Revision provenance is a 1-to-1 problem as it keeps track of detailed updates of one document (Buneman et al., 2001; Zhang and Jagadish, 2013). Real-world applications include GitHub, version control in Microsoft Word and Wikipedia version trees (Sabel, 2007). In contrast, our system solves an n-to-n problem on a large scale. Our potential target data sources, such as the entire web or internal corpora in corporations, contain numerous original documents and their revisions. The aim is to find all revision document pairs within a reasonable time.

Document revision detection, plagiarism detection and revision provenance all rely on comparing the content of two documents and assessing a distance/similarity score. The classic document similarity measure, especially for plagiarism detection, is fingerprinting (Hoad and Zobel, 2003; Charikar, 2002; Schleimer et al., 2003; Fujii and Ishikawa, 2001; Manku et al., 2007; Manber et al., 1994). Fixed-length fingerprints are created using hash functions to represent document features and are then used to measure document similarities. However, the main purpose of fingerprinting is to reduce computation instead of improving accuracy, and it cannot capture word semantics. Another widely used approach is computing the sentence-to-sentence Levenshtein distance and assigning an overall score for every document pair (Gustafson et al., 2008). Nevertheless, due to the large number of existing documents, as well as the large number of sentences in each document, the Levenshtein distance is not computation-friendly. Although alternatives such as the vector space model (VSM) can largely reduce the computation time, their effectiveness is low. More importantly, none of the above approaches can capture semantic meanings of words, which heavily limits the performances of these approaches. For instance, from a semantic perspective, "I went to the bank" is expected to be similar to "I withdrew some money" rather than "I went hiking." Our document distance measures are inspired by the weaknesses of current document distance/similarity measures and recently proposed models for word representations such as word2vec (Mikolov et al., 2013) and Paragraph Vector (PV) (Le and Mikolov, 2014). Replacing words with distributed vector embeddings makes it feasible to measure semantic distances using advanced algorithms, e.g., Dynamic Time Warping (DTW) (Sakurai et al., 2005; Müller, 2007; Matuschek et al., 2008) and Tree Edit Distance (TED) (Tai, 1979; Zhang and Shasha, 1989; Klein, 1998; Demaine et al., 2007; Pawlik and Augsten, 2011, 2014, 2015, 2016). Although calculating text distance using DTW (Liu et al., 2007), TED (Sidorov et al., 2015) or Word Mover's Distance (WMD) (Kusner et al., 2015) has been attempted in the past, these measures are not ideal for large-scale document distance calculation. The first two algorithms were designed for sentence distances instead of document distances. The third measure computes the distance of two documents by solving a transshipment problem between words in the two documents and uses word2vec embeddings to calculate semantic distances of words. The biggest limitation of WMD is its long computation time. We show in Section 5.3 that our wDTW and wTED measures yield more precise distance scores with much shorter running time than WMD.

We recast the problem of detecting document revisions as a network optimization problem (see Section 2) and consequently as a set of document distance problems (see Section 4). We use trained word vectors to represent words, concatenate the word vectors to represent documents and combine word2vec with DTW or TED. Meanwhile, in order to guarantee reasonable computation time in large data sets, we calculate document distances at the paragraph level with Apache Spark. A distance score is computed by feeding paragraph representations to DTW or TED. Our code and data are publicly available.²

The primary contributions of this work are as follows.

• We specify a model and algorithm to find the optimal document revision network from a large corpus.

• We propose two algorithms, wDTW and wTED, for measuring semantic document distances based on distributed representations of words. The wDTW algorithm calculates document distances based on DTW by sequentially comparing any two paragraphs of two documents. The wTED method represents the section and subsection structures of a document in a tree with paragraphs being leaves. Both algorithms hinge on the distance between two paragraphs.

The rest of this paper is organized as follows. In Section 2, we clarify related terms and explain the methodology for document revision detection. In Section 3, we provide a brief background on existing document similarity measures. In Section 4, we present our wDTW and wTED algorithms as well as the overall process flow. In Section 5, we demonstrate our revision detection results on Wikipedia revision dumps and six simulated data sets. Finally, in Section 6, we summarize some concluding remarks and discuss avenues for future work and improvements.

1 https://snap.stanford.edu/data/wiki-meta.html
2 https://github.com/XiaofengZhu/wDTW-wTED

2 Revision Network

The two requirements for a document D̄ being a revision of another document D̃ are that D̄ has been created later than D̃ and that the content of D̄ is similar to (has been modified from) that of D̃. More specifically, given a corpus D, for any two documents D̄, D̃ ∈ D, we want to find out the yes/no revision relationship of D̄ and D̃, and then output all such revision pairs.

We assume that each document has a creation date (the last modified timestamp) which is readily available from the meta data of the document. In this section we also assume that we have a Dist method and a cut-off threshold τ.

(a) Revision network N′   (b) Cleaned revision network N   (c) Possible solution R

Figure 1: Revision network visualization

We represent a corpus as a network N′ = (V′, A), for example Figure 1a, in which a vertex corresponds to a document. There is an arc a = (D̄, D̃) if and only if Dist(D̄, D̃) ≤ τ and the creation date of D̄ is before the creation date of D̃. In other words, D̃ is a revision candidate for D̄. By construction, N′ is acyclic. For instance, d2 is a revision candidate for d7 and d1. Note that we allow one document to be the original document of several revised documents. As we only need to focus on revision candidates, we reduce N′ to N = (V, A), shown in Figure 1b, by removing isolated vertices. We define the weight of an arc as the distance score between the two vertices. Recall the assumption that a document can be a revision of at most one document. In other words, documents cannot be merged. Due to this assumption, all revision pairs form a branching in N. (A branching is a subgraph where each vertex has an in-degree of at most 1.) The document revision problem is to find a minimum cost branching in N (see Figure 1c).

The minimum branching problem was earlier solved by Edmonds (1967) and Velardi et al. (2013). The details of Edmonds' algorithm are as follows.

– For each node select the smallest weighted incoming arc. This yields a subgraph.

– If cycles are present in the selected subgraph, then recursively find the replacing arc that has the minimum weight among previously non-selected arcs to eliminate cycles.

In our case, N is acyclic and, therefore, the second step never occurs. For this reason, Algorithm 1 solves the document revision problem.

Algorithm 1 Find minimum branching R for network N = (V, A)
1: Input: N
2: R = ∅
3: for every vertex u ∈ V do
4:     Set δ(u) to correspond to all arcs with head u
5:     Select a = (v, u) ∈ δ(u) such that Dist(v, u) is minimum
6:     R = R ∪ {a}
7: end for
8: Output: R

The essential part of determining the minimum branching R is extracting arcs with the lowest distance scores. This is equivalent to finding the most similar document from the revision candidates for every original document.
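For illustration, Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch, not the released implementation; the arc-list format, the function name and the toy scores below are chosen for exposition, and the candidate arcs of the cleaned network N are assumed to be already computed.

```python
from collections import defaultdict

def minimum_branching(arcs):
    """Sketch of Algorithm 1: arcs are (tail, head, distance) tuples of the
    cleaned revision network N, where the tail document was created before
    the head document and their distance is already below the threshold tau.
    Because N is acyclic, keeping the cheapest incoming arc of every vertex
    yields a minimum cost branching R."""
    delta = defaultdict(list)            # delta(u): all arcs with head u
    for tail, head, dist in arcs:
        delta[head].append((dist, tail))
    R = []
    for head, incoming in delta.items():
        dist, tail = min(incoming)       # smallest weighted incoming arc
        R.append((tail, head, dist))
    return R

# Toy example (made-up scores): d2 is a revision candidate for d7 and d1,
# so only the cheaper of the two incoming arcs is kept.
print(minimum_branching([("d7", "d2", 0.21), ("d1", "d2", 0.35)]))
```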

3 Distance/similarity Measures

In this section, we first introduce the classic VSM model, the word2vec model, DTW and TED. We next demonstrate how to combine the above components to construct our semantic document distance measures: wDTW and wTED. We also discuss the implementation of our revision detection system.

3.1 Background

3.1.1 Vector Space Model

VSM represents a set of documents as vectors of identifiers. The identifier of a word used in this work is the tf-idf weight. We represent documents as tf-idf vectors, and thus the similarity of two documents can be described by the cosine distance between their vectors. VSM has low algorithm complexity but cannot represent the semantics of words since it is based on the bag-of-words assumption.

3.1.2 Word2vec

Word2vec produces semantic embeddings for words using a two-layer neural network. Specifically, word2vec relies on a skip-gram model that uses the current word to predict context words in a surrounding window to maximize the average log probability. Words with similar meanings tend to have similar embeddings.

3.1.3 Dynamic Time Warping

DTW was developed originally for speech recognition in time series analysis and has been widely used to measure the distance between two sequences of vectors.

Given two sequences of feature vectors A = a_1, a_2, ..., a_i, ..., a_m and B = b_1, b_2, ..., b_j, ..., b_n, DTW finds the optimal alignment for A and B by first constructing an (m + 1) × (n + 1) matrix in which the (i, j)-th element is the alignment cost of a_1...a_i and b_1...b_j, and then retrieving the path from one corner to the diagonal one through the matrix that has the minimal cumulative distance. This algorithm is described by the following formula:

DTW(i, j) = Dist(a_i, b_j) + min( DTW(i − 1, j),       // insertion
                                  DTW(i, j − 1),       // deletion
                                  DTW(i − 1, j − 1) )  // substitution

3.1.4 Tree Edit Distance

TED was initially defined to calculate the minimal cost of node edit operations for transforming one labeled tree into another. The node edit operations are defined as follows.

– Deletion: delete a node and connect its children to its parent maintaining the order.

– Insertion: insert a node between an existing node and a subsequence of consecutive children of this node.

– Substitution: rename the label of a node.

Let L1 and L2 be two labeled trees, and let L_k[i] be the i-th node in L_k (k = 1, 2). M corresponds to a mapping from L1 to L2. TED finds the mapping M with the minimal edit cost based on

c(M) = min{ Σ_{(i,j)∈M} cost(L1[i] → L2[j]) + Σ_{i∈I} cost(L1[i] → ∧) + Σ_{j∈J} cost(∧ → L2[j]) },

where L1[i] → L2[j] means transferring L1[i] to L2[j] based on M, ∧ represents an empty node, and I and J index the nodes of L1 and L2 that are not covered by M.

3.2 Semantic Distance between Paragraphs

According to the description of DTW in Section 3.1.3, the distance between two documents can be calculated using DTW by replacing each element in the feature vectors A and B with a word vector. However, computing the DTW distance between two documents at the word level is basically as expensive as calculating the Levenshtein distance. Thus in this section we propose an improved algorithm that is more appropriate for document distance calculation.

In order to receive semantic representations for documents and maintain a reasonable algorithm complexity, we use word2vec to train word vectors and represent each paragraph as a sequence of vectors. Note that in both wDTW and wTED we take document titles and section titles as paragraphs. Although a more recently proposed model PV can directly train vector representations for short texts such as movie reviews (Le and Mikolov, 2014), our experiments in Section 5.3 show that PV is not appropriate for standard paragraphs in general documents. Therefore, we use word2vec in our work. Algorithm 2 describes how we compute the distance between two paragraphs based on DTW and word vectors. The distance between one paragraph in a document and one paragraph in another document can be pre-calculated in parallel using Spark to provide faster computation for wDTW and wTED.

Algorithm 2 DistPara
Replace the words in paragraphs p and q with word2vec embeddings: {v_i}_{i=1}^{e} and {w_j}_{j=1}^{f}
Input: p = [v_1, .., v_e] and q = [w_1, .., w_f]
Initialize the first row and the first column of the (e + 1) × (f + 1) matrix DTW_para to +∞
DTW_para(0, 0) = 0
for i in range (1, e + 1) do
    for j in range (1, f + 1) do
        Dist(v_i, w_j) = ||v_i − w_j||
        calculate DTW_para(i, j)
    end for
end for
Return: DTW_para(e, f)
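As a concrete illustration of Algorithm 2 (not the Spark implementation used in our system), the following NumPy sketch computes DistPara; the word vectors are assumed to be given as arrays, and the empty-paragraph case is handled so the same function can later serve as the wTED insertion/deletion cost.

```python
import numpy as np

def dist_para(p, q):
    """Sketch of Algorithm 2 (DistPara): DTW between two paragraphs p and q,
    each an array of word2vec embeddings with shape (num_words, dim).
    The cost of matching two words is the L2 distance of their vectors."""
    e, f = len(p), len(q)
    if e == 0 or f == 0:
        # Distance to an empty paragraph: sum of the L2-norms of the other
        # paragraph's vectors (the insertion/deletion cost used by wTED).
        rest = q if e == 0 else p
        return float(sum(np.linalg.norm(v) for v in rest))
    dtw = np.full((e + 1, f + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, e + 1):
        for j in range(1, f + 1):
            cost = np.linalg.norm(p[i - 1] - q[j - 1])
            # insertion, deletion, substitution
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return float(dtw[e, f])
```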

4 wDTW and wTED Measures

4.1 Word Vector-based Dynamic Time Warping

As a document can be considered as a sequence of paragraphs, wDTW returns the distance between two documents by applying another DTW on top of paragraphs. The cost function is exactly the DistPara distance of two paragraphs given in Algorithm 2. Algorithm 3 and Figure 2 describe our wDTW measure. wDTW observes semantic information from word vectors, which is fundamentally different from the word distance calculated from hierarchies among words in the algorithm proposed by Liu et al. (2007). The shortcomings of their work are that it is difficult to learn semantic taxonomy of all words and that their DTW algorithm can only be applied to sentences not documents.

Figure 2: wDTW visualization

Algorithm 3 wDTW
Represent documents d1 and d2 with vectors of paragraphs: {p_i}_{i=1}^{m} and {q_j}_{j=1}^{n}
Input: d1 = [p_1, .., p_m] and d2 = [q_1, .., q_n]
Initialize the first row and the first column of the (m + 1) × (n + 1) matrix DTW_doc to +∞
DTW_doc(0, 0) = 0
for i in range (1, m + 1) do
    for j in range (1, n + 1) do
        Dist(p_i, q_j) = DistPara(p_i, q_j)
        calculate DTW_doc(i, j)
    end for
end for
Return: DTW_doc(m, n)
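A matching sketch of Algorithm 3 is shown below, reusing dist_para from the previous block; in the actual system the paragraph-to-paragraph distances would be pre-computed in parallel with Spark rather than inside this double loop.

```python
import numpy as np

def wdtw(d1, d2):
    """Sketch of Algorithm 3 (wDTW): a second DTW over paragraphs, where d1
    and d2 are lists of paragraphs and every paragraph is an array of
    word2vec embeddings.  dist_para is the Algorithm 2 sketch above."""
    m, n = len(d1), len(d2)
    dtw = np.full((m + 1, n + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = dist_para(d1[i - 1], d2[j - 1])   # paragraph-level cost
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return float(dtw[m, n])
```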

4.2 Word Vector-based Tree Edit Distance

TED is reasonable for measuring document distances as documents can be easily transformed to tree structures visualized in Figure 3. The document tree concept was originally proposed by Si et al. (1997). A document can be viewed at multiple abstraction levels that include the document title, its sections, subsections, etc. Thus for each document we can build a tree-like structure with title → sections → subsections → ... → paragraphs being paths from the root to leaves. Child nodes are ordered from left to right as they appear in the document.

Figure 3: Document tree

We represent labels in a document tree as the vector sequences of titles, sections, subsections and paragraphs with word2vec embeddings. wTED converts documents to tree structures and then uses DistPara distances. More formally, the distance between two nodes is computed as follows.

– The cost of substitution is the DistPara value of the two nodes.

– The cost of insertion is the DistPara value of an empty sequence and the label of the inserted node. This essentially means that the cost is the sum of the L2-norms of the word vectors in that node.

– The cost of deletion is the same as the cost of insertion.

Compared to the algorithm proposed by Sidorov et al. (2015), wTED provides different edit cost functions and uses document tree structures instead of syntactic n-grams, and thus wTED yields more meaningful distance scores for long documents. Algorithm 4 and Figure 4 describe how we calculate the edit cost between two document trees.

Algorithm 4 wTED
1: Convert documents d1 and d2 to trees T1 and T2
2: Input: T1 and T2
3: Initialize tree edit distance c = +∞
4: for node label p ∈ T1 do
5:     for node label q ∈ T2 do
6:         Update TED mapping cost c using
7:             cost(p → q) = DistPara(p, q)
8:             cost(p → ∧) = DistPara(p, ∧)
9:             cost(∧ → q) = DistPara(∧, q)
10:    end for
11: end for
12: Return: c
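The node costs used by Algorithm 4 can be sketched as follows; this is our illustration, not the released code. The Node class, the 300-dimensional embeddings and the helper names are assumptions, and the three cost functions are meant to be plugged into any off-the-shelf ordered tree edit distance routine (for example an APTED-style implementation in the spirit of Pawlik and Augsten) that accepts custom costs.

```python
import numpy as np

class Node:
    """A document tree node whose label (title, section heading or paragraph)
    is stored as an array of word2vec embeddings."""
    def __init__(self, label, children=None):
        self.label = label              # shape (num_words, dim)
        self.children = children or []

EMPTY = np.zeros((0, 300))              # the empty node: a zero-length vector sequence

def substitution_cost(p, q):
    return dist_para(p.label, q.label)  # DistPara of the two node labels

def insertion_cost(q):
    # DistPara of an empty sequence and the inserted label, i.e. the sum of
    # the L2-norms of the word vectors in that node.
    return dist_para(EMPTY, q.label)

def deletion_cost(p):
    return dist_para(p.label, EMPTY)    # same form as the insertion cost
```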

Figure 4: wTED visualization

4.3 Process Flow

Our system is a boosting learner that is composed of four modules: weak filter, strong filter, revision network and optimal subnetwork. First of all, we sort all documents by timestamps and pair up documents so that we only compare each document with documents that have been created earlier. In the first module, we calculate the VSM similarity scores for all pairs and eliminate those with scores that are lower than an empirical threshold (τ̃ = 0.5). This is what we call the weak filter. After that, we apply the strong filter wDTW or wTED on the remaining pairs and filter out document pairs having distances higher than a threshold τ. For VSM in Section 5.1, we directly filter out document pairs having similarity scores lower than a threshold τ. The cut-off threshold estimation is explained in Section 4.4. The remaining document pairs from the strong filter are then sent to the revision network module. In the end, we output the optimal revision pairs following the minimum branching strategy.

4.4 Estimating the Cut-off Threshold

Hyperparameter τ is calibrated by calculating the absolute extreme based on an initial set of documents, i.e., all processed documents since the moment the system was put in use. Based on this set, we calculate all distance/similarity scores and create a histogram, see Figure 5. The figure shows the correlation between the number of document pairs and the similarity scores in the training process of one simulated corpus using VSM. The optimal τ in this example is around 0.6, where the number of document pairs noticeably drops. As the system continues running, new documents become available and τ can be periodically updated by using the same method.

Figure 5: Setting τ
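One way to realize the weak filter described in Section 4.3 is a tf-idf/cosine pass, for instance with scikit-learn. This is a simplified sketch: the function name, the all-pairs loop and the returned format are our own, the 0.5 threshold follows the text above, and the actual system runs this step on Spark.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def weak_filter(docs, timestamps, tau_weak=0.5):
    """VSM weak filter sketch: docs are raw document strings, timestamps their
    creation dates.  Returns (earlier, later) index pairs whose tf-idf cosine
    similarity reaches tau_weak; only these pairs go to the strong filter
    (wDTW or wTED) and then to the revision network module."""
    sims = cosine_similarity(TfidfVectorizer().fit_transform(docs))
    candidates = []
    for i in range(len(docs)):
        for j in range(len(docs)):
            if timestamps[i] < timestamps[j] and sims[i, j] >= tau_weak:
                candidates.append((i, j))   # doc j is a revision candidate of doc i
    return candidates
```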

5 Numerical Experiments

This section reports the results of the experiments conducted on two data sets for evaluating the performances of wDTW and wTED against other baseline methods.

5.1 Distance/Similarity Measures

We denote the following distance/similarity measures.

– wDTW: Our semantic distance measure explained in Section 4.1.

– wTED: Our semantic distance measure explained in Section 4.2.

– WMD: The Word Mover's Distance introduced in Section 1. WMD adapts the earth mover's distance to the space of documents.

– VSM: The similarity measure introduced in Section 3.1.1.

– PV-DTW: PV-DTW is the same as Algorithm 3 except that the distance between two paragraphs is not based on Algorithm 2 but rather computed as ||PV(p_1) − PV(p_2)||, where PV(p) is the PV embedding of paragraph p.

– PV-TED: PV-TED is the same as Algorithm 4 except that the distance between two paragraphs is not based on Algorithm 2 but rather computed as ||PV(p_1) − PV(p_2)||.

Our experiments were conducted on an Apache Spark cluster with 32 cores and 320 GB total memory. We implemented wDTW, wTED, WMD, VSM, PV-DTW and PV-TED in Java Spark. The paragraph vectors for PV-DTW and PV-TED were trained by gensim.³

3 https://radimrehurek.com/gensim/models/doc2vec.html

5.2 Data Sets

In this section, we introduce the two data sets we used for our revision detection experiments: Wikipedia revision dumps and a document revision data set generated by a computer simulation. The two data sets differ in that the Wikipedia revision dumps only contain linear revision chains, while the simulated data sets also contain tree-structured revision chains, which can be very common in real-world data.

5.2.1 Wikipedia Revision Dumps

The Wikipedia revision dumps that were previously introduced by Leskovec et al. (2010) contain eight GB (compressed size) of revision edits with meta data.

We pre-processed the Wikipedia revision dumps using the JWPL Revision Machine (Ferschke et al., 2011) and produced a data set that contains 62,234 documents with 46,354 revisions. As we noticed that short documents just contributed to noise (graffiti) in the data, we eliminated documents that have fewer than three paragraphs and fewer than 300 words. We removed empty lines in the documents and trained word2vec embeddings on the entire corpus. We used the documents occurring in the first 80% of the revision period for τ calibration, and the remaining documents for test.
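The word2vec training step can be reproduced with gensim along the following lines; the hyperparameters shown are illustrative defaults rather than the settings used in the experiments (gensim 4.x names the dimension vector_size, older releases call it size).

```python
from gensim.models import Word2Vec

# One tokenized paragraph per entry; in the experiments this would be the
# pre-processed Wikipedia corpus, here just a tiny stand-in.
sentences = [["document", "revision", "detection"],
             ["semantic", "document", "distance", "measure"]]

model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)
vector = model.wv["document"]    # the embedding fed to DistPara
```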
5.2.2 Simulated Data Sets

The generation process of the simulated data sets is designed to mimic the real world. Users open some existing documents in a file system, make some changes (e.g. addition, deletion or replacement), and save them as separate documents. These documents become revisions of the original documents. We started from an initial corpus that did not have revisions, and kept adding new documents and revising existing documents. Similar to a file system, at any moment new documents could be added and/or some of the current documents could be revised. The revision operations we used were deletion, addition and replacement of words, sentences, paragraphs, section names and document titles. The addition of words, ..., section names, and new documents were pulled from the Wikipedia abstracts. This corpus generation process had five time periods {t_1, t_2, ..., t_5}. Figure 6 illustrates this simulation. We set a Poisson distribution with rate λ = 550 (the number of documents in the initial corpus) to control the number of new documents added in each time period, and a Poisson distribution with rate 0.5λ to control the number of documents revised in each time period.

Figure 6: Corpora simulation

We generated six data sets using different random seeds, and each data set contained six corpora (Corpus 0 - 5). Table 1 summarizes the first data set. In each data set, we name the initial corpus Corpus 0, and define T_0 as the timestamp when we started this simulation process. We set T_j = T_{j−1} + t_j, j ∈ [1, 5]. Corpus j corresponds to documents generated before timestamp T_j. We extracted document revisions from Corpus k ∈ [2, 5] and compared the revisions generated in (Corpus k − Corpus (k − 1)) with the ground truths in Table 1. Hence, we ran four experiments on this data set in total. In every experiment, τ_k is calibrated based on Corpus (k − 1). For instance, the training set of the first experiment was Corpus 1. We trained τ_1 from Corpus 1. We extracted all revisions in Corpus 2, and compared revisions generated in the test set (Corpus 2 − Corpus 1) with the ground truth: 258 revised documents. The word2vec model shared in the four experiments was trained on Corpus 5.

Table 1: A simulated data set

Corpus   Number of documents   Number of new documents   Number of revision pairs
0        550                   0                         0
1        1347                  542                       255
2        2125                  520                       258
3        2912                  528                       259
4        3777                  580                       285
5        4582                  547                       258

5.3 Results

We use precision, recall and F-measure to evaluate the detected revisions. A true positive case is a correctly identified revision. A false positive case is an incorrectly identified revision. A false negative case is a missed revision record.

We illustrate the performances of wDTW, wTED, WMD, VSM, PV-DTW and PV-TED on the Wikipedia revision dumps in Figure 7. wDTW and wTED have the highest F-measure scores compared to the rest of the four measures, and wDTW also has the highest precision and recall scores. Figure 8 shows the average evaluation results on the simulated data sets. From left to right, the corpus size increases and the revision chains become longer, thus it becomes more challenging to detect document revisions. Overall, wDTW consistently performs the best. WMD is slightly better than wTED. In particular, when the corpus size increases, the performances of WMD, VSM, PV-DTW and PV-TED drop faster than those of wDTW and wTED. Because the revision operations were randomly selected in each corpus, it is possible that there are non-monotone points in the series.

wDTW and wTED perform better than WMD, especially when the corpus is large, because they use dynamic programming to find the global optimal alignment for documents. In contrast, WMD relies on a greedy algorithm that sums up the minimal cost for every word. wDTW and wTED perform better than PV-DTW and PV-TED, which indicates that our DistPara distance in Algorithm 2 is more accurate than the Euclidean distance between paragraph vectors trained by PV.

We show in Table 2 the average running time of the six distance/similarity measures. In all the experiments, VSM is the fastest, wTED is faster than wDTW, and WMD is the slowest. Running WMD is extremely expensive because WMD needs to solve an x² sequential transshipment problem for every two documents, where x is the average number of words in a document. In contrast, by splitting this heavy computation into several smaller problems (finding the distance between any two paragraphs), which can be run in parallel, wDTW and wTED scale much better.

Combining Figure 7, Figure 8 and Table 2, we conclude that wDTW yields the most accurate results using marginally more time than VSM, PV-TED and PV-DTW, but much less running time than WMD. wTED returns satisfactory results using shorter time than wDTW.

(a) Precision   (b) Recall   (c) F-measure

Figure 7: Precision, recall and F-measure on the Wikipedia revision dumps

(a) Precision   (b) Recall   (c) F-measure

Figure 8: Average precision, recall and F-measure on the simulated data sets

Table 2: Running time of VSM, PV-TED, PV-DTW, wTED, wDTW and WMD

Data set                   VSM       PV-TED    PV-DTW    wTED      wDTW       WMD
Wikipedia revision dumps   1h 38min  3h 2min   3h 18min  5h 13min  13h 27min  515h 9min
corpus 2                   2 min     3 min     3 min     7 min     8 min      8h 53min
corpus 3                   3 min     4 min     5 min     9 min     11 min     12h 45min
corpus 4                   4 min     6 min     6 min     11 min    12 min     14h 34min
corpus 5                   7 min     9 min     9 min     14 min    16 min     17h 31min

6 Conclusion

This paper has explored how DTW and TED can be extended with word2vec to construct semantic document distance measures: wDTW and wTED. By representing paragraphs with concatenations of word vectors, wDTW and wTED are able to capture the semantics of the words and thus give more accurate distance scores. In order to detect revisions, we have used minimum branching on an appropriately developed network with document distance scores serving as arc weights. We have also assessed the efficiency of the method of retrieving an optimal revision subnetwork by finding the minimum branching.

Furthermore, we have compared wDTW and wTED with several distance measures for revision detection tasks. Our results demonstrate the effectiveness and robustness of wDTW and wTED on the Wikipedia revision dumps and our simulated data sets. In order to reduce the computation time, we have computed document distances at the paragraph level and implemented a boosting learning system using Apache Spark. Although we have demonstrated the superiority of our semantic measures only in the revision detection experiments, wDTW and wTED can also be used as semantic distance measures in many clustering and classification tasks.

Our revision detection system can be enhanced with richer features such as author information, writing styles, and exact changes in revision pairs. Another interesting aspect we would like to explore in the future is reducing the complexity of calculating the distance between two paragraphs.

Acknowledgments

This work was supported in part by Intel Corporation and the Semiconductor Research Corporation (SRC).

References

Asad Abdi, Norisma Idris, Rasim M Alguliyev, and Ramiz M Aliguliyev. 2015. PDLK: Plagiarism detection using linguistic knowledge. Expert Systems with Applications, 42(22):8936–8946.

Peter Buneman, Sanjeev Khanna, and Tan Wang-Chiew. 2001. Why and where: A characterization of data provenance. In Database Theory ICDT 2001, pages 316–330. Springer.

Moses S Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, pages 380–388. ACM.

Erik D Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. 2007. An optimal decomposition algorithm for tree edit distance. In Automata, Languages and Programming, pages 146–157. Springer.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards B, 71(4):233–240.

Oliver Ferschke, Torsten Zesch, and Iryna Gurevych. 2011. Wikipedia revision toolkit: Efficiently accessing Wikipedia's edit history. In Proceedings of the ACL-HLT 2011 System Demonstrations, pages 97–102. ACL.

Atsushi Fujii and Tetsuya Ishikawa. 2001. Japanese/English cross-language information retrieval: Exploration of query translation and transliteration. Computers and the Humanities, 35(4):389–420.

Nathaniel Gustafson, Maria Soledad Pera, and Yiu-Kai Ng. 2008. Nowhere to hide: Finding plagiarized documents based on sentence similarity. In Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, pages 690–696. IEEE.

Matthias Hagen, Martin Potthast, and Benno Stein. 2015. Source retrieval for plagiarism detection from large web corpora: Recent approaches. Working Notes Papers of the CLEF, pages 1613–0073.

Timothy C. Hoad and Justin Zobel. 2003. Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 54(3):203–215.

Philip N Klein. 1998. Computing the edit-distance between unrooted ordered trees. In European Symposium on Algorithms, pages 91–102. Springer.

Matt J Kusner, Yu Sun, Nicholas I Kolkin, and Kilian Q Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, pages 957–966.

Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Jure Leskovec, Daniel P Huttenlocher, and Jon M Kleinberg. 2010. Governance in social media: A case study of the Wikipedia promotion process. In International Conference on Weblogs and Social Media.

Xiaoying Liu, Yiming Zhou, and Ruoshi Zheng. 2007. Sentence similarity based on dynamic time warping. In International Conference on Semantic Computing, pages 250–256. IEEE.

Udi Manber et al. 1994. Finding similar files in a large file system. In Usenix Winter, volume 94, pages 1–10.

Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, pages 141–150. ACM.

Michael Matuschek, Tim Schlüter, and Stefan Conrad. 2008. Measuring text similarity with dynamic time warping. In Proceedings of the 2008 International Symposium on Database Engineering & Applications, pages 263–267. ACM.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Meinard Müller. 2007. Dynamic time warping. Information Retrieval for Music and Motion, pages 69–84.

Gabriel Oberreuter and Juan D Velásquez. 2013. Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9):3756–3763.

Mateusz Pawlik and Nikolaus Augsten. 2011. RTED: A robust algorithm for the tree edit distance. Proceedings of the VLDB Endowment, 5(4):334–345.

Mateusz Pawlik and Nikolaus Augsten. 2014. A memory-efficient tree edit distance algorithm. In Database and Expert Systems Applications, pages 196–210. Springer.

Mateusz Pawlik and Nikolaus Augsten. 2015. Efficient computation of the tree edit distance. ACM Transactions on Database Systems, 40(1):3.

Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems, 56:157–173.

Mikalai Sabel. 2007. Structuring wiki revision history. In Proceedings of the 2007 International Symposium on Wikis, pages 125–130. ACM.

Yasushi Sakurai, Masatoshi Yoshikawa, and Christos Faloutsos. 2005. FTW: Fast similarity search under the time warping distance. In Proceedings of the Twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 326–337. ACM.

Saul Schleimer, Daniel S Wilkerson, and Alex Aiken. 2003. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85. ACM.

Antonio Si, Hong Va Leong, and Rynson WH Lau. 1997. CHECK: A document plagiarism detection system. In Proceedings of the 1997 ACM Symposium on Applied Computing, pages 70–77. ACM.

Grigori Sidorov, Helena Gómez-Adorno, Ilia Markov, David Pinto, and Nahun Loya. 2015. Computing text similarity using tree edit distance. In Fuzzy Information Processing Society (NAFIPS) held jointly with 2015 5th World Conference on Soft Computing, 2015 Annual Conference of the North American, pages 1–4. IEEE.

Kuo-Chung Tai. 1979. The tree-to-tree correction problem. Journal of the ACM, 26(3):422–433.

Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. OntoLearn reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics, 39(3):665–707.

Jing Zhang and HV Jagadish. 2013. Revision provenance in text documents of asynchronous collaboration. In 2013 IEEE 29th International Conference on Data Engineering, pages 889–900. IEEE.

Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262.