Detecting Near Duplicates in Software Documentation

D.V. Luciv, D.V. Koznov, G.A. Chernishev, A.N. Terekhov
Saint Petersburg State University, 199034 St. Petersburg, Universitetskaya Emb., 7–9
E-mail: {d.lutsiv, d.koznov, g.chernyshev, a.terekhov}@spbu.ru
2017-11-15

Abstract

Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates a lot of "near duplicate" fragments, i.e. chunks of text that were copied from a single source and were later modified in different ways. Such near duplicates decrease documentation quality and thus hamper its further utilization. At the same time, they are hard to detect manually due to their fuzzy nature. In this paper we give a formal definition of near duplicates and present an algorithm for their detection in software documents. The algorithm builds on exact software clone detection: the software clone detection tool Clone Miner was adapted to detect exact duplicates in documents, and our algorithm then uses these exact duplicates to construct near ones. We evaluate the proposed algorithm using the documentation of 19 open source and commercial projects. Our evaluation is comprehensive: it covers various documentation types, including design and requirement specifications, programming guides, API documentation, and user manuals. Overall, the evaluation shows that all kinds of software documentation contain a significant number of both exact and near duplicates. Next, we report on a manual analysis of the detected near duplicates for the Linux Kernel Documentation. We present both quantitative and qualitative results of this analysis, demonstrate the algorithm's strengths and weaknesses, and discuss the benefits of duplicate management in software documents.

Keywords: software documentation, near duplicates, documentation reuse, software clone detection.

1 Introduction

Every year software is becoming increasingly complex and extensive, and so does software documentation. During the software life cycle, documentation tends to accumulate a lot of duplicates due to the copy and paste pattern: at first, some text fragment is copied several times, then each copy is modified, possibly in its own way. Thus, different copies of initially similar fragments become "near duplicates". Depending on the document type [1], duplicates can be either desired or not, but in any case duplicates increase documentation complexity and thus maintenance and authoring costs [2]. (This work is partially supported by RFBR grant No 16-01-00304.)

Textual duplicates in software documentation, both exact and near ones, are extensively studied [2, 3, 4, 5]. However, there are no methods for the detection of near duplicates, only for exact ones, mainly based on software clone detection techniques [2, 3, 6]. At the same time, Juergens et al. [2] indicate the importance of near duplicates and recommend to "pay particular attention to subtle differences in duplicated text".

There are, however, studies that address near duplicates: for example, in [7] the authors developed a duplicate specification tool for JavaDoc documentation. This tool allows the user to specify near duplicates and manipulate them. Their approach was based on an informal definition of near duplicates, and the problem of duplicate detection was not addressed. In our previous studies [8, 9] we presented a near duplicate detection approach. Its core idea is to uncover near duplicates and then to apply the reuse techniques described in our earlier studies [4, 5]. The clone detection tool Clone Miner [10] was adapted for the detection of exact duplicates in documents, and near duplicates were then extracted as combinations of exact duplicates. However, only near duplicates with one variation point were considered. In other words, the approach can detect only near duplicates that consist of two exact duplicates with a single chunk of variable text between them: exact_1 variable_1 exact_2.

In this paper we give a formal definition of near duplicates with an arbitrary number of variation points, exhibiting the following pattern: exact_1 variable_1 exact_2 variable_2 exact_3 ... variable_{n-1} exact_n. Our definition is a formalized version of the definition given in [11]. We also present a generalization of the algorithm described in [8, 9]. The algorithm is implemented in the Documentation Refactoring Toolkit [12], which is a part of the DocLine project [4]. In this paper, an evaluation of the proposed algorithm is also presented, using the documentation of 19 open source and commercial projects. The results of a detailed manual analysis of the detected near duplicates for the Linux Kernel Documentation [13] are reported.
The paper is structured as follows. Section 2 provides a survey of duplicate management for software documentation and gives a brief overview of near duplicate detection methods in the information retrieval, theoretical computer science, and software clone detection areas. Section 3 describes the context of this study by examining our previous work, to underline the contribution of the current paper. In Section 4 the formal definition of a near duplicate is given and the near duplicate detection algorithm is presented; the algorithm correctness theorem is also formulated. Finally, Section 5 presents the evaluation results.

2 Related work

Let us consider how near duplicates are employed in documentation-oriented software engineering research. Horie et al. [14] consider the problem of text fragment duplicates in Java API documentation. The authors introduce the notion of a crosscutting concern, which is essentially a textual duplicate appearing in documentation. They present a tool named CommentWeaver, which provides several mechanisms for modularization of API documentation. It is implemented as an extension of the Javadoc tool and provides new tags for controlling reusable text fragments. However, near duplicates are not considered, and facilities for duplicate detection are not provided.

Nosál' and Porubän [7] extend the approach from [14] by introducing near duplicates. In their study the notion of a documentation phrase is used to denote a near duplicate. Parametrization is used to define the variative parts of duplicates, similarly to our approach [4, 5]. However, the authors leave the problem of near duplicate detection untouched.

In [3] Nosál' and Porubän present the results of a case study in which they searched for exact duplicates in the internal documentation (source code comments) of a set of open source projects. They used a modified copy/paste detection tool, originally developed for code analysis, and found a considerable number of text duplicates. However, near duplicates were not considered in that paper.

Wingkvist et al. adapted a clone detection tool to measure document uniqueness in a collection [6]. The authors used the found duplicates for documentation quality estimation. However, they did not address near duplicate detection.

The work of Juergens et al. [2] is the closest one to our research; it presents a case study on analyzing redundancy in requirement specifications. The authors analyze 28 industrial documents. At the first step, they find duplicates using a clone detection tool. Then, they filter the found duplicates by manually removing false positives and perform a classification of the results. They report that the average duplicate coverage of the documents they analyzed is 13.6%: some documents have a low coverage (0.9%, 0.7% and even 0%), but there are ones with a high coverage (35%, 51.1%, 71.6%). Next, the authors discuss how to use the discovered duplicates and how to detect related duplicates in the source code. The impact of duplicates on the document reading process is also studied. Furthermore, the authors propose a classification of meaningful duplicates and false positive duplicates. However, it should be noted that they consider only requirement specifications and ignore other kinds of software documentation. Also, they do not use near duplicates.

Rago et al. [15] apply natural language processing and machine learning techniques to the problem of searching for duplicate functionality in requirement specifications. These documents are considered as a set of textual use cases; the approach extracts the sequence chain (usage scenarios) for every use case and compares pairs of chains to find duplicate subchains. The authors evaluate their approach using several industrial requirement specifications.
It should be noted that this study considers a very special type of requirement specification, which is not widely used in industry. Near duplicates are also not considered.

Algorithms for duplicate detection in textual content have been developed in several other areas. Firstly, the information retrieval community has considered several near-duplicate detection problems: document similarity search [16, 17], (local) text reuse detection [18, 19], and template detection in web document collections [20, 21]. Secondly, the plagiarism detection community has also extensively studied the detection of similarity between documents [22, 23]. Thus, there are a number of efficient solutions. However, the majority of them are focused on document-level similarity, with the goal of attaining high performance on large collections of documents. More importantly, these studies differ from ours by ignoring duplicate meaningfulness. Nonetheless, adapting these methods to the near-duplicate search in software documentation could improve the quality metrics of the search. It is a promising avenue for further studies.

The theoretical computer science community has also addressed the duplicate detection problem. However, the primary focus of this community has been the development of (approximate) string matching algorithms. For example, in order to match two text fragments of equal length, the Hamming distance was used [24]. The Levenshtein distance [25] is employed to account not only for symbol modification, but also for symbol insertion and removal; for this metric it is possible to handle all three editing operations at the cost of performance [26]. Significant performance optimizations of approximate string pattern matching were introduced in the fuzzy Bitap algorithm [27], which was later optimized to handle longer patterns efficiently [28]. Similarity preserving signatures like MinHash [29] can be used to speed up the approximate matching problem. Another indexing approach is proposed in [30], with an algorithm for the fast detection of documents that share a common sliding window with the query pattern but differ by at most τ tokens. While being useful for our goals in general, these studies do not relate to this paper directly. All of them are focused on the performance improvement of simple string matching tasks. Achieving high performance is undoubtedly an important aspect, but these algorithms become less necessary when employed for documents of 3–4 MB of plain text, a common size of a large industrial documentation file.

Various techniques have been employed to detect near duplicate clones in source code. SourcererCC [31] detects near duplicates of code blocks using a static bag-of-tokens strategy that is resilient to minor differences between code blocks; clone candidates of a code block are queried from a partial inverted index for better scalability. DECKARD [32] computes characteristic vectors of code to approximate the structure of Abstract Syntax Trees in Euclidean space; locality sensitive hashing (LSH) [33] is used to group similar vectors under the Euclidean distance metric, forming clones. NICAD [34] is a text-based near duplicate detection tool that also uses tree-based structural analysis with lightweight parsing of the source code to implement flexible pretty-printing, code normalization, source transformation and code filtering for better results. However, these techniques are not directly capable of detecting duplicates in text documents, as they involve some degree of parsing of the underlying source code. Suitable customization for this purpose can be explored in the future.

3 Background

3.1 Exact duplicate detection and Clone Miner

Not only documentation, but also software itself is often developed with a lot of copied and pasted information. To cope with duplicates in source code, software clone detection methods are used. This area is quite mature; a systematic review of clone detection methods and tools can be found in [35]. In this paper, the Clone Miner [10] software clone detection tool is used to detect exact duplicates in software documentation. Clone Miner is a token-based source code clone detector. A token in the context of text documents is a single word separated from other words by some separator: '.', '(', ')', etc. For example, the following text fragment consists of 2 tokens: "FM registers". Clone Miner considers the input text as an ordered collection of lexical tokens and applies suffix array-based string matching algorithms [36] to retrieve the repeated parts (clones). We have selected Clone Miner for its simplicity and its ability to be easily integrated with other tools using a command line interface.
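To make the exact-duplicate step concrete, the sketch below shows a deliberately simplified, token-based search for repeated word sequences in plain text. It is only an illustration of the idea: Clone Miner itself relies on flexible tokenization and suffix array-based matching [36] rather than the fixed-length windows and hash map used here, and all function names are ours.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into word tokens, keeping the character offset of each token."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+", text)]

def exact_duplicate_groups(text, window=5, min_clones=2):
    """Collect groups of identical token sequences of a fixed length.
    A simplified stand-in for Clone Miner's suffix array-based clone search."""
    tokens = tokenize(text)
    groups = defaultdict(list)  # token sequence -> list of (begin, end) symbol intervals
    for i in range(len(tokens) - window + 1):
        words = tuple(w for w, _ in tokens[i:i + window])
        begin = tokens[i][1]
        last_word, last_offset = tokens[i + window - 1]
        groups[words].append((begin, last_offset + len(last_word) - 1))
    return {words: clones for words, clones in groups.items() if len(clones) >= min_clones}

doc = ("inet daemon can listen on 80 port and then transfer the connection to appropriate "
       "handler. inet daemon can listen on 8080 port and then transfer the connection to "
       "appropriate handler.")
for words, clones in exact_duplicate_groups(doc).items():
    print(" ".join(words), clones)
```

Each resulting group plays the role of an exact duplicate group in the terminology of Section 4; the symbol intervals it stores are the input to the near duplicate construction described below.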
3.2 Basic near duplicate detection and refactoring

We have already demonstrated in Section 1 that near duplicate detection in software documentation is an important research avenue. In our previous studies [8, 9], we presented an approach that offers a partial solution to this problem. At first, similarly to Juergens et al. [2] and Wingkvist et al. [6], we applied software clone detection techniques to exact duplicate detection [37]. Then, in [8, 9] we proposed an approach to near duplicate detection. It is essentially as follows: having exact duplicates found by Clone Miner, we extract sets of duplicate groups whose clones are located close to each other. For example, suppose that the following phrase can be found in the text 5 times with different variations (various port numbers): "inet daemon can listen on ... port and then transfer the connection to appropriate handler". In this case we have two duplicate groups with 5 clones in each group: one group includes the text "inet daemon can listen on", while the other includes "port and then transfer the connection to appropriate handler". We combine these duplicate groups into a group of near duplicates: every member of this group has one variation to capture the different port numbers. In these studies we developed this approach only for one variation, i.e. we "glued" only pairs of exact duplicate groups.

It is then possible to apply the adaptive reuse technique [38, 39] to near duplicates and to perform automatic refactoring of the documentation. Our toolkit [12] allows creating reusable text fragment definition templates for a selected near duplicate group. Then it is possible to substitute all occurrences of the group's members with parameterized references to the definition. The overall scheme of the process is shown in Fig. 1.

The experiments with the described approach revealed a considerable number of interesting near duplicates. Thus, we decided to generalize the algorithm from [8, 9] in order to allow an arbitrary number of exact duplicates to be combined. This allows having several variation parts in the resulting near duplicates. The generalized algorithm is described below.

Figure 1: The process of near duplicate search (input document → preparation for clone detection → clone detection with Clone Miner → filtering → refactoring → modified document).

4 Near duplicate detection algorithm

4.1 Definitions

Let us define the terms necessary for describing the proposed algorithm. We consider a document D as a sequence of symbols. Any symbol of D has a coordinate corresponding to its offset from the beginning of the document; this coordinate is a number belonging to the interval [1, length(D)], where length(D) is the number of symbols in D.

Definition 1. For D we define a text fragment as an occurrence of some substring in D. Hence, each text fragment has a corresponding integer interval [b, e], where b is the coordinate of its first symbol and e is the coordinate of its last symbol. For a text fragment g of document D, we say that g ∈ D.

Let us introduce the following sets: D* is the set of all text fragments of D, I^D is the set of all integer intervals within [1, length(D)], and S^D is the set of all strings of D.

Also, let us introduce the following notation:

• [g] : D* → I^D is a function that takes a text fragment g and returns its interval.
• str(g) : D* → S^D is a function that takes a text fragment g and returns its text.
• I : I^D → D* is a function that takes an interval and returns the corresponding text fragment.
• |[b, e]| : I^D → [0, length(D)] is a function that takes an interval [g] = [b, e] and returns its length, |[g]| = e − b + 1. For simplicity, we will write |g| instead of |[g]|.
• For any g1, g2 ∈ D we consider their intersection g1 ∩ g2 as the intersection of the corresponding intervals [g1] ∩ [g2], and g1 ⊂ g2 means [g1] ⊂ [g2].
• We define the binary predicate Before on D* × D*, which is true for text fragments g1, g2 ∈ D iff e1 < b2, where [g1] = [b1, e1] and [g2] = [b2, e2].

Definition 2. Let us consider a set G of text fragments of D such that ∀ g1, g2 ∈ G: (str(g1) = str(g2)) ∧ (g1 ∩ g2 = ∅). We call these fragments exact duplicates and G an exact duplicate group (or exact group). We denote the number of elements in G as #G.

Definition 3. For an ordered set of exact duplicate groups G1, ..., GN, we say that it forms a variational group ⟨G1, ..., GN⟩ when the following conditions are satisfied:

1. #G1 = ... = #GN.
2. Text fragments having the same position in different groups occur in the same order in the document text: ∀ g_i^k ∈ Gi, ∀ g_j^k ∈ Gj ((i < j) ⇔ Before(g_i^k, g_j^k)), and ∀ k ∈ {1, ..., N − 1}: Before(g_N^k, g_1^{k+1}).

We also say that for any group G_k of this set, G_k ∈ VG.

Note 1. According to condition 2 of Definition 3, ∀ g_i^k ∈ Gi, ∀ g_j^k ∈ Gj: (i ≠ j ⇒ g_i^k ∩ g_j^k = ∅).

Note 2. When VG = ⟨G1, ..., GN⟩ and VG' = ⟨G'1, ..., G'M⟩ are variational groups, ⟨VG, VG'⟩ = ⟨G1, ..., GN, G'1, ..., G'M⟩ is also a variational group if it satisfies Definition 3.

For example, suppose that we have VG = ⟨G1, G2, G3⟩ and each Gi consists of three clones g_i^k, k ∈ {1, 2, 3}. Then these clones appear in the text in the following order: g_1^1 ... g_2^1 ... g_3^1 ...... g_1^2 ... g_2^2 ... g_3^2 ...... g_1^3 ... g_2^3 ... g_3^3.

Next, it should be possible to compute the distance between variational or exact duplicate groups. This is required to support group merging inside our algorithm, which selects the closest groups to form a new one. Thus, a distance function should be defined.

Definition 4. The distance between text fragments g1, g2 ∈ D is defined as follows:

    dist(g1, g2) = 0,             if g1 ∩ g2 ≠ ∅;
    dist(g1, g2) = b2 − e1 + 1,   if Before(g1, g2);
    dist(g1, g2) = b1 − e2 + 1,   if Before(g2, g1),    (1)

where [g1] = [b1, e1] and [g2] = [b2, e2].

Definition 5. The distance between exact groups G1 and G2 with #G1 = #G2 is defined as follows:

    dist(G1, G2) = max_{k ∈ {1, ..., #G1}} dist(g_1^k, g_2^k).    (2)

Definition 6. The distance between variational groups VG1 and VG2, where for G1 ∈ VG1, G2 ∈ VG2 we have #G1 = #G2, is defined as follows:

    dist(VG1, VG2) = max_{G1 ∈ VG1, G2 ∈ VG2} dist(G1, G2).    (3)

Definition 7. The length of an exact group G is defined as length(G) = Σ_{k=1}^{#G} (e_k − b_k + 1), where g^k ∈ G and [g^k] = [b_k, e_k].

Definition 8. The length of a variational group VG = ⟨G1, ..., GN⟩ is defined as follows:

    length(VG) = Σ_{i=1}^{N} length(Gi).    (4)

Definition 9. A near duplicate group is a variational group ⟨G1, ..., GN⟩ that satisfies the following condition for every k ∈ {1, ..., #G1}:

    Σ_{i=1}^{N−1} dist(g_i^k, g_{i+1}^k) ≤ 0.15 * Σ_{i=1}^{N} |g_i^k|.    (5)

This definition follows the near duplicate concept from [38]: the variational part of near duplicates, carrying similar information (the delta), should not exceed 15% of their exact duplicate (archetype) part.

Note 3. An exact group G can be considered as a variational group formed by itself: ⟨G⟩.

Definition 10. Consider a near duplicate group ⟨G1, G2⟩, where G1 and G2 are exact groups. We say that this group contains a single extension point, and the text fragments occupying positions [e_1^k + 1, b_2^k − 1] are called extension point values. In the general case, a near duplicate group ⟨G1, ..., GN⟩ has N − 1 extension points.

Definition 11. Consider two near duplicate groups G = ⟨G1, ..., Gn⟩ and G' = ⟨G'1, ..., G'm⟩. Suppose that they form a variational group ⟨G1, ..., Gn, G'1, ..., G'm⟩ or ⟨G'1, ..., G'm, G1, ..., Gn⟩, which in turn is also a near duplicate group. In this case, we call G and G' nearby groups.

Definition 12. Nearby duplicates are duplicates belonging to nearby groups.

Note 4. Due to Note 3, Definition 12 is applicable to both near and exact duplicates.
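To illustrate Definitions 4, 5 and 9, the following Python sketch encodes a text fragment as its interval [b, e] and checks condition (5) for one tuple of positionally matched fragments. The representation and names are ours; the sketch only shows how the 15% criterion works and is not the toolkit's actual code.

```python
def dist(f1, f2):
    """Definition 4: distance between two text fragments given as intervals (b, e)."""
    (b1, e1), (b2, e2) = f1, f2
    if e1 >= b2 and e2 >= b1:      # intervals intersect
        return 0
    if e1 < b2:                    # Before(f1, f2)
        return b2 - e1 + 1
    return b1 - e2 + 1             # Before(f2, f1)

def group_dist(g1, g2):
    """Definition 5: maximum distance over positionally matched clones of two groups."""
    return max(dist(a, b) for a, b in zip(g1, g2))

def satisfies_condition_5(fragments, ratio=0.15):
    """Condition (5) for one variational tuple g_1^k, ..., g_N^k: the summed gaps
    between consecutive fragments must not exceed 15% of their summed lengths."""
    gaps = sum(dist(fragments[i], fragments[i + 1]) for i in range(len(fragments) - 1))
    total_length = sum(e - b + 1 for b, e in fragments)
    return gaps <= ratio * total_length

# "inet daemon can listen on" at [0, 25] and "port ... handler" at [31, 85]:
# the short port number between them is well within the 15% budget.
print(satisfies_condition_5([(0, 25), (31, 85)]))   # True
```

A near duplicate group in the sense of Definition 9 is then a variational group for which this check holds for every tuple k.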

4.2 Algorithm description

The algorithm that constructs the set of near duplicate groups (SetVG) is presented below as Algorithm 1. Its input is the set of exact duplicate groups (SetG) belonging to document D, initially produced by the Clone Miner tool. The algorithm employs an interval tree, a data structure whose purpose is to quickly locate the intervals that intersect a given interval. The core idea is to repeatedly find and merge nearby exact groups from SetG; at each step, the resulting near duplicate groups are added to SetVG. Let us consider this algorithm in detail.

The initial interval tree for SetG is constructed by the Initiate() function (line 2). The core part of the algorithm is a loop in which new near duplicate groups are constructed (lines 3–18). This loop repeats as long as at least one new near duplicate group can be constructed, i.e. while the set of newly constructed near duplicate groups (SetNew) is not empty (line 18). Inside this loop, the algorithm cycles through all groups of SetG ∪ SetVG. For each of them, the NearBy function returns the set of nearby groups SetCand (lines 5, 6), which is then used for constructing near duplicate groups. Later, we discuss this function in more detail and prove its correctness, i.e. that it actually returns groups that are close to G. Next, the group closest to G, denoted G', is selected from SetCand (line 8) and a variational group ⟨G, G'⟩ or ⟨G', G⟩ is created; this group is added to SetNew (lines 10–14). Since G and G' are merged and therefore cease to exist as independent entities, they are deleted from SetG and SetVG by the Remove function (line 9). Next, the Join function adds SetNew to SetVG (line 17). It is essential to note that the Remove and Join functions also perform some auxiliary actions described below.

At the end of the algorithm, SetG is added to SetVG, and the result, SetVG, is returned as the algorithm's output (line 19). This step is required so that the output contains not only near duplicate groups but also the exact duplicate groups that have not been used for the creation of near duplicate ones.

Algorithm 1: Near Duplicate Groups Construction
Input data: SetG
Result: SetVG
 1  SetVG ← ∅
 2  Initiate()
 3  repeat
 4      SetNew ← ∅
 5      foreach G ∈ SetG ∪ SetVG do
 6          SetCand ← NearBy(G)
 7          if SetCand ≠ ∅ then
 8              G' ← GetClosest(G, SetCand)
 9              Remove(G, G')
10              if Before(G, G') then
11                  SetNew ← SetNew ∪ {⟨G, G'⟩}
12              else
13                  SetNew ← SetNew ∪ {⟨G', G⟩}
14              end if
15          end if
16      end foreach
17      Join(SetVG, SetNew)
18  until SetNew = ∅
19  SetVG ← SetVG ∪ SetG

Let us describe the functions employed in this algorithm.

The Initiate() function builds the interval tree. The idea of this data structure is the following. Suppose we have n natural number intervals, where b1 is the minimum and en is the maximum value among all interval endpoints, and m is the midpoint of [b1, en]. The intervals are divided into three groups: those fully located to the left of m, those fully located to the right of m, and those containing m. The current node of the interval tree stores the last of these groups and references to its left and right child nodes, which contain the intervals to the left and to the right of m, respectively. This procedure is repeated for each child node. Further details regarding the construction of an interval tree can be found in [40, 41].

In this study, we build the interval tree from extended intervals that correspond to the exact duplicates found by Clone Miner. These extended intervals are obtained as follows: the original interval of an exact duplicate is enlarged by 15%. For example, if [b, e] is the initial interval, then the extended one is [b − 0.15 * (e − b + 1), e + 0.15 * (e − b + 1)]. We denote the extended interval that corresponds to an exact duplicate g as ↕g. We also modify the interval tree so that each stored interval keeps a reference to the corresponding exact duplicate group.

The Remove function removes groups from the sets and their intervals from the interval tree. The interval deletion algorithm is described in [40, 41].

The Join function, in addition to the operations described above, adds the intervals of a newly created near duplicate group G = ⟨G1, ..., GN⟩ to the interval tree, using the standard insertion algorithm described in [40, 41]. The extended intervals added to the tree for each near duplicate g^k = (g_1^k, ..., g_N^k), where k ∈ {1, ..., #G1}, have the form [b_1^k − x^k, e_N^k + x^k], where

    x^k = 0.15 * Σ_{i=1}^{N} |g_i^k| − Σ_{i=1}^{N−1} dist(g_i^k, g_{i+1}^k).

We denote this extended interval of g^k (now, a near duplicate) as ↕g^k as well.

The NearBy function selects nearby groups for a given group G (its parameter). To do this, for each text fragment of G, the collection of intervals that intersect its interval is extracted; the text fragments that correspond to these intervals are neighbors of the initial fragment, i.e. condition (5) holds for them. The retrieval is done using the interval tree search algorithm [40, 41]. We first construct the GL1 set, which contains the groups that are expected to be nearby to G:

    GL1(G) = {G' | (G' ∈ SetG ∪ SetVG) ∧ ∃ g ∈ G, g' ∈ G' : ↕g ∩ ↕g' ≠ ∅}.    (6)

That is, the GL1 set consists of the groups that contain at least one duplicate that is close to at least one duplicate from G. Then, only the groups that can form a variational group with G are selected and placed into the GL2 set:

    GL2(G) = {G' | G' ∈ GL1 ∧ (⟨G, G'⟩ or ⟨G', G⟩ is a variational group)}.    (7)

Finally, the GL3 set (the NearBy function's output) is created. The only groups placed in this set are those from GL2 all of whose elements are close to the corresponding elements of G:

    GL3(G) = {G' | G' ∈ GL2 ∧ ∀ k ∈ {1, ..., #G} : ↕g^k ∩ ↕g'^k ≠ ∅}.    (8)

Theorem 1. The suggested algorithm detects near duplicate groups that conform to Definition 9.

It is easy to show, by the construction of NearBy, that for a given group G it returns the set of its nearby groups (see Definition 11). That is, each of these groups can be used to form a near duplicate group with G. From this set the algorithm selects the group closest to G and constructs a new near duplicate group. The correctness of all intermediate sets and of the other functions used is immediate from their construction.
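Putting the pieces together, the outer loop of Algorithm 1 can be sketched in Python as follows. The helpers nearby, get_closest and before stand in for the NearBy, GetClosest and Before operations described above, and the incremental maintenance of the interval tree by Remove and Join is omitted; this is an illustration of the control flow, not the toolkit's implementation.

```python
def construct_near_duplicate_groups(set_g, nearby, get_closest, before):
    """Skeleton of Algorithm 1: repeatedly merge a group with its closest nearby
    group until no new near duplicate group can be formed (lines 3-18), then add
    the remaining exact groups to the result (line 19).
    Groups are assumed to be hashable, e.g. tuples of clone intervals."""
    set_vg = set()
    while True:
        set_new = set()
        for g in list(set_g | set_vg):
            if g not in set_g and g not in set_vg:
                continue                          # already merged during this pass
            candidates = nearby(g)                # SetCand <- NearBy(G)
            if not candidates:
                continue
            other = get_closest(g, candidates)    # G' <- GetClosest(G, SetCand)
            for s in (set_g, set_vg):             # Remove(G, G')
                s.discard(g)
                s.discard(other)
            pair = (g, other) if before(g, other) else (other, g)
            set_new.add(pair)
        set_vg |= set_new                         # Join(SetVG, SetNew)
        if not set_new:                           # until SetNew is empty
            return set_vg | set_g                 # SetVG <- SetVG U SetG
```

In the real toolkit the groups keep their clone intervals, so nearby can be implemented with the extended-interval index shown earlier and get_closest with the distance of Definition 6.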
5 Evaluation

The proposed algorithm was implemented in the Duplicate Finder Toolkit [12]. Our prototype uses the intervaltree library [42] as the implementation of the interval tree data structure.

We have evaluated 19 industrial documents belonging to various types: requirement specifications, programming guides, API documentation, user manuals, etc. (see Table 1). The size of the evaluated documents is up to 3 MB.

Our evaluation produced the following results. The majority of the detected duplicate groups are exact duplicates (88.3–96.5%). Groups having one variation point amount to 3.3–12.5%, groups with two variation points to 0–1.7%, and groups with three variation points to less than 1%; a few near duplicates with 11, 12, 13, and 16 variation points were also detected. We performed a manual analysis of the automatically detected near duplicates for the Linux Kernel Documentation (a programming guide, document 1 in Table 1) [13]. We found 70 meaningful text groups (5.4%), 30 meaningful groups for code examples (2.3%), and 1191 false positive groups (92.3%). Among the meaningful groups, 21 are near duplicate groups, i.e. 21% of the meaningful duplicate groups. Therefore, the share of near duplicates increases significantly after discarding false positives.

Table 1: Near-duplicate groups detected

Document | Size, Kb | Total dup. groups | Exact duplicate groups, % | 1-variation groups, % | 2-variation groups, % | 3-variation groups, %
1  |  892 | 1291 | 93.9 |  5.1 | 0.9 | 0.2
2  | 2924 | 6056 | 93.3 |  5.3 | 0.9 | 0.2
3  | 1810 | 4220 | 95.9 |  3.4 | 0.5 | 0.2
4  |  686 | 1500 | 96.5 |  3.3 | 0.2 | 0.1
5  | 1311 | 4688 | 95.9 |  3.3 | 0.5 | 0.0
6  | 3136 | 6587 | 93.9 |  5.1 | 0.7 | 0.2
7  | 1491 | 4537 | 92.0 |  6.0 | 1.2 | 0.4
8  | 3160 | 7804 | 95.6 |  3.5 | 0.5 | 0.2
9  | 1104 |  152 | 93.4 |  5.3 | 0.7 | 0.7
10 | 1800 | 4685 | 92.7 |  5.8 | 0.9 | 0.3
11 | 1056 | 2436 | 91.5 |  6.7 | 1.1 | 0.3
12 |   36 |   59 | 83.1 | 11.9 | 1.7 | 0.0
13 |  166 |  392 | 89.5 |  9.7 | 0.5 | 0.3
14 |  103 |  208 | 91.3 |  7.7 | 0.5 | 0.0
15 |   98 |  117 | 94.0 |  4.3 | 0.9 | 0.9
16 |  241 |  394 | 88.8 |  9.1 | 0.8 | 0.3
17 |   43 |   16 | 81.3 | 12.5 | 6.3 | 0.0
18 |   50 |   77 | 88.3 | 11.7 | 0.0 | 0.0
19 |  167 |  145 | 88.3 |  9.7 | 0.7 | 1.4

Having analyzed the evaluation results, we can make the following conclusions:

1. During our experiments we did not manage to find any near duplicates in the considered documents that were not detected by our algorithm. However, we should note that the claim of the algorithm's high recall needs a more detailed justification.

2. Analyzing the Linux Kernel Documentation, we have concluded that it does not have a cohesive style: it was created sporadically by different authors. Virtually all its duplicates are situated locally, i.e. close to each other. For example, some author created the description of a driver's functionality using copy/paste for its similar features, while another driver was described by a different author who did not use the first driver's description at all. Consequently, there are practically no duplicates that span the whole text. Examples, warnings, notes, and other documentation elements that are preceded by different introductory sentences are not styled cohesively either. Thus, our algorithm can be used for analyzing the degree of documentation uniformity.

3. The algorithm performs well on two-element groups, finding near duplicate groups with different numbers of extension points. It appears that, in general, there are far fewer near duplicate groups with more than two elements.

4. Many detected duplicate groups consist of figure and table captions, page headers, parts of the table of contents and so on; that is, they are of no interest to us. Also, many found duplicates are scattered across different elements of the document structure: for example, a duplicate can span a part of a header and a small fragment of the text right after it. Such duplicates are not desired, since they are not very useful for document writers. However, they are detected because the document structure is currently not taken into account during the search.

5. The 0.15 value used in detecting near duplicate groups does not allow finding some significant groups (mainly small ones, 10–20 tokens in size). It might be more effective to use some function instead of a constant, which could depend, for example, on the length of the near duplicate.

6. Often the detected duplicate does not capture variational information that is situated at its end or at its beginning. Sometimes it could be beneficial to include it in order to ensure semantic completeness. To solve this problem, a clarification and a formal definition of the semantic completeness of a text fragment is required. Our experiments show that this can be done in various ways; the simplest one is ensuring sentence-level granularity, i.e. including all text up to the start/end of a sentence.

6 Conclusion

In this paper the formal definition of near duplicates in software documentation is given and an algorithm for near duplicate detection is presented. An evaluation of the algorithm using a large number of both commercial and open source documents is performed.

The evaluation shows that various types of software documentation contain a significant number of exact and near duplicates. A near duplicate detection tool could improve the quality of documentation, while a duplicate management technique would simplify documentation maintenance.

Although the proposed algorithm provides ample evidence on text duplicates in industrial documents, it still requires improvements before it can be applied to real-life tasks. The main issues to be resolved are the quality of the detected near duplicates and the large number of false positives. Also, a detailed analysis of near duplicate types in various sorts of software documents should be performed.

References

[1] Parnas D.L. Precise Documentation: The Key to Better Software. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 125–148. DOI: 10.1007/978-3-642-15187-3_8.

[2] Juergens E., Deissenboeck F., Feilkas M., Hummel B., Schaetz B., Wagner S., Domann C., and Streit J. Can clone detection support quality assessments of requirements specifications? Proceedings of the ACM/IEEE 32nd International Conference on Software Engineering, vol. 2, 2010, pp. 79–88. DOI: 10.1145/1810295.1810308.

[3] Nosál' M. and Porubän J. Preliminary report on empirical study of repeated fragments in internal documentation. Proceedings of the Federated Conference on Computer Science and Information Systems, 2016, pp. 1573–1576.

[4] Koznov D.V. and Romanovsky K.Y. DocLine: A method for software product lines documentation development. Programming and Computer Software, vol. 34, no. 4, 2008, pp. 216–224. DOI: 10.1134/S0361768808040051.

[5] Romanovsky K., Koznov D., and Minchin L. Refactoring the Documentation of Software Product Lines. Lecture Notes in Computer Science, CEE-SET 2008, vol. 4980. Springer-Verlag, Berlin, Heidelberg, 2011, pp. 158–170. DOI: 10.1007/978-3-642-22386-0_12.

[6] Wingkvist A., Lowe W., Ericsson M., and Lincke R. Analysis and visualization of information quality of technical documentation. Proceedings of the 4th European Conference on Information Management and Evaluation, 2010, pp. 388–396.

[7] Nosál' M. and Porubän J. Reusable software documentation with phrase annotations. Central European Journal of Computer Science, vol. 4, no. 4, 2014, pp. 242–258. DOI: 10.2478/s13537-014-0208-3.

[8] Koznov D., Luciv D., Basit H.A., Lieh O.E., and Smirnov M. Clone Detection in Reuse of Software Technical Documentation. International Andrei Ershov Memorial Conference on Perspectives of System Informatics (2015), Lecture Notes in Computer Science, vol. 9609. Springer Nature, 2016, pp. 170–185. DOI: 10.1007/978-3-319-41579-6_14.

[9] Luciv D.V., Koznov D.V., Basit H.A., and Terekhov A.N. On fuzzy repetitions detection in documentation reuse. Programming and Computer Software, vol. 42, no. 4, 2016, pp. 216–224. DOI: 10.1134/s0361768816040046.

[10] Basit H.A., Puglisi S.J., Smyth W.F., Turpin A., and Jarzabek S. Efficient Token Based Clone Detection with Flexible Tokenization. Proceedings of the 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering: Companion Papers. ACM, New York, NY, USA, 2007, pp. 513–516. DOI: 10.1145/1295014.1295029.

[11] Bassett P.G. Framing Software Reuse: Lessons from the Real World. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1997.

[12] Documentation Refactoring Toolkit. URL: http://www.math.spbu.ru/user/kromanovsky/docline/index_en.html.

[13] Torvalds L. Linux Kernel Documentation, Dec 2013 snapshot. URL: https://github.com/torvalds/linux/tree/master/Documentation/DocBook/.

[14] Horie M. and Chiba S. Tool Support for Crosscutting Concerns of API Documentation. Proceedings of the 9th International Conference on Aspect-Oriented Software Development. ACM, New York, NY, USA, 2010, pp. 97–108. DOI: 10.1145/1739230.1739242.

[15] Rago A., Marcos C., and Diaz-Pace J.A. Identifying duplicate functionality in textual use cases by aligning semantic actions. Software & Systems Modeling, vol. 15, no. 2, 2016, pp. 579–603. DOI: 10.1007/s10270-014-0431-3.

[16] Huang T.K., Rahman M.S., Madhyastha H.V., Faloutsos M., and Ribeiro B. An Analysis of Socware Cascades in Online Social Networks. Proceedings of the 22nd International Conference on World Wide Web. ACM, New York, NY, USA, 2013, pp. 619–630. DOI: 10.1145/2488388.2488443.

[17] Williams K. and Giles C.L. Near Duplicate Detection in an Academic Digital Library. Proceedings of the ACM Symposium on Document Engineering. ACM, New York, NY, USA, 2013, pp. 91–94. DOI: 10.1145/2494266.2494312.

[18] Zhang Q., Zhang Y., Yu H., and Huang X. Efficient Partial-duplicate Detection Based on Sequence Matching. Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 2010, pp. 675–682. DOI: 10.1145/1835449.1835562.

[19] Abdel Hamid O., Behzadi B., Christoph S., and Henzinger M. Detecting the Origin of Text Segments Efficiently. Proceedings of the 18th International Conference on World Wide Web. ACM, New York, NY, USA, 2009, pp. 61–70. DOI: 10.1145/1526709.1526719.

[20] Ramaswamy L., Iyengar A., Liu L., and Douglis F. Automatic Detection of Fragments in Dynamically Generated Web Pages. Proceedings of the 13th International Conference on World Wide Web. ACM, New York, NY, USA, 2004, pp. 443–454. DOI: 10.1145/988672.988732.

[21] Gibson D., Punera K., and Tomkins A. The Volume and Evolution of Web Page Templates. Special Interest Tracks and Posters of the 14th International Conference on World Wide Web. ACM, New York, NY, USA, 2005, pp. 830–839. DOI: 10.1145/1062745.1062763.

[22] Vallés E. and Rosso P. Detection of Near-duplicate User Generated Contents: The SMS Spam Collection. Proceedings of the 3rd International Workshop on Search and Mining User-generated Contents. ACM, New York, NY, USA, 2011, pp. 27–34. DOI: 10.1145/2065023.2065031.

[23] Barrón-Cedeño A., Vila M., Martí M., and Rosso P. Plagiarism Meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection. Computational Linguistics, vol. 39, no. 4, 2013, pp. 917–947. DOI: 10.1162/COLI_a_00153.

[24] Smyth W. Computing Patterns in Strings. Addison-Wesley, 2003.

[25] Levenshtein V. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, vol. 1, 1965, pp. 8–17.

[26] Wagner R.A. and Fischer M.J. The String-to-String Correction Problem. Journal of the ACM, vol. 21, no. 1, 1974, pp. 168–173. DOI: 10.1145/321796.321811.

[27] Wu S. and Manber U. Fast Text Searching: Allowing Errors. Communications of the ACM, vol. 35, no. 10, 1992, pp. 83–91. DOI: 10.1145/135239.135244.

[28] Myers G. A Fast Bit-vector Algorithm for Approximate String Matching Based on Dynamic Programming. Journal of the ACM, vol. 46, no. 3, 1999, pp. 395–415. DOI: 10.1145/316542.316550.

[29] Broder A. On the Resemblance and Containment of Documents. Proceedings of the Compression and Complexity of Sequences. IEEE Computer Society, Washington, DC, USA, 1997, pp. 21–29.

[30] Wang P., Xiao C., Qin J., Wang W., Zhang X., and Ishikawa Y. Local Similarity Search for Unstructured Text. Proceedings of the International Conference on Management of Data. ACM, New York, NY, USA, 2016, pp. 1991–2005. DOI: 10.1145/2882903.2915211.

[31] Sajnani H., Saini V., Svajlenko J., Roy C.K., and Lopes C.V. SourcererCC: Scaling Code Clone Detection to Big-code. Proceedings of the 38th International Conference on Software Engineering. ACM, New York, NY, USA, 2016, pp. 1157–1168. DOI: 10.1145/2884781.2884877.

[32] Jiang L., Misherghi G., Su Z., and Glondu S. DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones. Proceedings of the 29th International Conference on Software Engineering. IEEE Computer Society, Washington, DC, USA, 2007, pp. 96–105. DOI: 10.1109/ICSE.2007.30.

[33] Indyk P. and Motwani R. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98. ACM, New York, NY, USA, 1998, pp. 604–613. DOI: 10.1145/276698.276876.

[34] Cordy J.R. and Roy C.K. The NiCad Clone Detector. Proceedings of the IEEE 19th International Conference on Program Comprehension, 2011, pp. 219–220. DOI: 10.1109/ICPC.2011.26.

[35] Rattan D., Bhatia R., and Singh M. Software clone detection: A systematic review. Information and Software Technology, vol. 55, no. 7, 2013, pp. 1165–1199. DOI: 10.1016/j.infsof.2013.01.008.

[36] Abouelhoda M.I., Kurtz S., and Ohlebusch E. Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms, vol. 2, no. 1, 2004, pp. 53–86. DOI: 10.1016/S1570-8667(03)00065-0.

[37] Koznov D.V., Shutak A.V., Smirnov M.N., and A. S.M. Clone detection in refactoring software documentation [Poisk klonov pri refaktoringe tehnicheskoj dokumentacii]. Computer Tools in Education [Komp'juternye instrumenty v obrazovanii], vol. 4, 2012, pp. 30–40. (In Russian).

[38] Bassett P.G. The Theory and Practice of Adaptive Reuse. SIGSOFT Software Engineering Notes, vol. 22, no. 3, 1997, pp. 2–9. DOI: 10.1145/258368.258371.

[39] Jarzabek S., Bassett P., Zhang H., and Zhang W. XVCL: XML-based Variant Configuration Language. Proceedings of the 25th International Conference on Software Engineering. IEEE Computer Society, Washington, DC, USA, 2003, pp. 810–811.

[40] de Berg M., Cheong O., van Kreveld M., and Overmars M. Chapter: Interval Trees. Springer Berlin Heidelberg, 2008, pp. 220–226. DOI: 10.1007/978-3-540-77974-2.

[41] Preparata F.P. and Shamos M.I. Chapter: Intersections of Rectangles. Texts and Monographs in Computer Science. Springer, 1985, pp. 359–363.

[42] PyIntervalTree. URL: https://github.com/chaimleib/intervaltree.