International Journal of Advanced Computational Engineering and Networking, ISSN: 2320-2106, Volume-5, Issue-10, Oct.-2017, http://iraj.in

MODIFIED N-GRAM BASED MODEL FOR IDENTIFYING AND FILTERING NEAR-DUPLICATE DOCUMENTS DETECTION

¹FARHEEN NAAZ, ²FARHEEN SIDDIQUI

¹,²School of Engineering Sciences and Technology (SEST), Jamia Hamdard, Hamdard University, New Delhi, India. E-mail: [email protected], [email protected]

Abstract - During the last three decades, the World Wide Web (WWW) has expanded exponentially, and a great deal of the web is filled with duplicate or near-duplicate content. Documents served on the web come in different formats such as PDF, HTML, Excel and plain text. Our proposed solution is evaluated on a publicly available dataset consisting of files that are tagged as duplicates. The work in this paper addresses duplicate and near-duplicate document detection using an n-gram based approach combined with a low-dimensional representation (LSI-SVD), implemented in C#.NET.

Keywords - Duplicate document, N-gram, SVD (Singular Value Decomposition), LSI (Latent Semantic Indexing), Cosine similarity.

I. INTRODUCTION

Business practices these days depend on digital data for interaction (HTML pages, software-generated documents), because digital documents are easy to store and efficient to query. Starting from a bag of such documents, a common task is to find the documents that closely or approximately duplicate a given input document. Near-duplicate detection has wide application domains; for example, it is applicable to copyright enforcement, plagiarism identification, and version control.

In this paper we propose an approach for duplicate and near-duplicate detection over whole documents using an n-gram based model. The process is composed of four steps. In the first step we use NLP (Natural Language Processing) methods to remove the stop words from the documents and apply the Porter stemmer to strip common word endings; this reduces the word count in all the documents and saves processing time in the n-gram stage. In the second step we calculate document scores through an n-gram model of word-to-word transition probabilities for all the documents. In the third step we apply SVD to the resulting term representation. In the final step we rank the documents on the basis of cosine similarity.

II. RELATED RESEARCH WORK

In recent times, the discovery of partially or fully duplicate web documents has found increased application in web mining research.
Many research papers have proposed strategies for copy and near-copy recognition at two levels: multi-user documents and web-based documents. Broder et al. [4] suggested a way to approximate the similarity between pairs of documents known as shingling. In the suggested algorithm, not all arrangements of nearby words are considered: if two documents have the same shingle set they are recorded as equal, and if their shingle sets overlap beyond a given value they are recorded as similar. Fetterly et al. [9] gave a five-gram method that finally compares two super-shingles in common to decide upon the duplicity of documents. Yun Ling [25] developed an algorithm for detecting similarity in web pages based on textual data; with this algorithm web pages can be mined textually and duplicity detected through position, which minimizes the error in web data so that effective duplicity can be measured. Narayana et al. [19] also worked on duplicate web pages, specifically in web crawling. Their technique is based on finding and loading crawled web pages into repositories; the keywords of crawled pages are extracted, and with their help the similarity between two pages is calculated. Pages are designated as partially or totally the same if the calculated score reaches a threshold value. Midhun Mathew et al. [16] worked on the detection of near-duplicate web pages. They designed a three-stage algorithm in which similarity verification is based on Singular Value Decomposition (SVD) [18] with an angle threshold. The SVD step needs complex mathematical operations on the TDW matrix, together with the conversion of a Jaccard threshold into an angle threshold; this increases the algorithm's complexity, and there are further problems in calculating the angle. They therefore designed a novel technique, MWO, for similarity detection that uses the Jaccard threshold directly, which decreases the complexity of the algorithm. Yerra and Yiu-Kai Ng [23] designed an innovative algorithm for similarity detection on web documents; their similarity identification algorithm detects similar web documents and lines, and identifies the similarity between pairs of web pages.


Jalbert and Weimer [15] designed a new technique for automatic classification of similar bug reports to boost developer efficiency. The technique estimates duplicity between documents by examining surface attributes, textual semantics, and graph clustering. Gong et al. [12] worked on SimFinder, an efficient algorithm for detecting full duplicates in large-scale databases composed of short text. The SimFinder algorithm has three components:
a) Ad hoc term weighting
b) Discriminative-term selection
c) Optimization
It is well accepted that SimFinder is an efficient and effective method for short-text duplicate detection; the experiments performed show that the algorithm works with almost linear time and storage complexity. Hui Yang et al. [22] worked on the DURIAN project, which investigates the usage of clustering techniques on text, followed by retrieval procedures that create classes representing approximate similarity. DURIAN categorizes documents and their copies in public comment groups by maintaining bag-of-words document representations, document metadata, and the structure of the textual data.

III. SYSTEM MODEL

Step 1: NLP (for pre-processing of the dataset)
The very first step is pre-processing of the documents, where we apply NLP routines to remove all the stop words and then apply the Porter stemmer to reduce each word to its stem.
• A predefined set of stop words is checked in each of the documents and those words are removed.
• The Porter stemmer is then applied to each remaining word in the documents, reducing the words to their roots (see the sketch below).
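As a minimal illustrative sketch of this step (the paper's actual implementation is in C#.NET; the stop-word list below is an assumed subset, and NLTK's PorterStemmer stands in for whichever stemmer implementation was used):

```python
# Sketch of Step 1 (illustrative, not the authors' code): stop-word
# removal followed by Porter stemming.
from nltk.stem import PorterStemmer

# Assumed subset of a predefined stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "for"}

def preprocess(text: str) -> list[str]:
    """Lower-case the text, drop stop words, and stem the remaining words."""
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in text.lower().split()
            if t not in STOP_WORDS]

# preprocess("The documents are easily stored") -> ['document', 'easili', 'store']
```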

Step 2: Generate bi-grams from each token stream and calculate the term frequency-inverse document frequency matrix
We compute a scoring scheme by applying an n-gram model that estimates word-to-word transition probabilities over the documents. In an n-gram model, the i-th word is assumed to depend only on the preceding n−1 words, so the transition probability is P(w_i | w_{i−1}, …, w_{i−n+1}); here we use bi-grams (n = 2). After the pre-processing step, we generate bi-gram phrases and construct the TF-IDF (term frequency-inverse document frequency) matrix, a numerical metric whose main purpose is to record the significance of a word with respect to a document in a document corpus. The TF-IDF matrix A is an n-by-m rectangular matrix composed of m vectors [A1, A2, …, Am], where the vector Aj represents the n-gram phrases contained in document j.
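A compact sketch of this step follows; the paper does not spell out its exact IDF variant, so the standard tf × log(N/df) weighting is assumed, all names are our own, and for convenience the matrix is built with one row per document rather than one column:

```python
# Sketch of Step 2: bigram phrases per document, then a TF-IDF matrix
# (rows = documents, columns = bigram phrases).
import math
from collections import Counter

def bigrams(tokens: list[str]) -> list[str]:
    """Adjacent word pairs, e.g. ['a', 'b', 'c'] -> ['a b', 'b c']."""
    return [f"{tokens[i]} {tokens[i + 1]}" for i in range(len(tokens) - 1)]

def tfidf_matrix(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    counts = [Counter(bigrams(d)) for d in docs]                # phrase frequency PF
    vocab = sorted({p for c in counts for p in c})
    df = {p: sum(1 for c in counts if p in c) for p in vocab}   # document frequency DF
    n = len(docs)
    # Counter returns 0 for absent phrases, which gives the tf-idf = 0 case.
    matrix = [[c[p] * math.log(n / df[p]) for p in vocab] for c in counts]
    return vocab, matrix
```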

Step 3: Apply SVD on the TF-IDF matrix calculated in Step 2
Apply SVD on the TF-IDF matrix, which generates three matrices U, S and V.
• Take a parameter k and set all but the k highest singular values of the matrix S to 0. This generates a new matrix S'.
• Calculate the final matrix by multiplying the three matrices U, S' and V.

Step 4: Calculate cosine similarity
The main task in this step is to calculate the cosine similarity of each document pair. The degree of similarity between two inputs is a function of the angle between their vector representations in the term vector space.

Searching
In this stage, the query terms are searched against the prepared index, and each document that contains occurrences of the query terms is retrieved. Depending on the request, retrieval can be completed even for partially matched documents. The set of all documents and their tokens is viewed as a set of vectors in a vector space, and each term has its own weight (diagonal value). Using the formula given below, the system finds the similarity between any two documents i and j:

CosineSimilarity: S(i,j) = \cos(i,j) = \frac{\sum_{k} A_{ki} A_{kj}}{\sqrt{\sum_{k} A_{ki}^{2}}\,\sqrt{\sum_{k} A_{kj}^{2}}}
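Steps 3 and 4 and the formula above admit a short NumPy sketch (illustrative only; k = 10 matches the value fixed later in Algorithm 4.1.4):

```python
# Sketch of Steps 3-4: rank-k reduction of the TF-IDF matrix via SVD,
# then cosine similarity between document vectors.
import numpy as np

def svd_reduce(A: np.ndarray, k: int = 10) -> np.ndarray:
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s[k:] = 0.0                     # keep only the k largest singular values (S')
    return U @ np.diag(s) @ Vt      # final matrix U * S' * V^T

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```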


IV. PROPOSED ALGORITHM

4.1 Algorithm: Near_Duplicate_Detection
Input: Dataset files, query document
Output: Ranking on the basis of cosine similarity

4.1.1 Algorithm: Pre-processing
Input: Dataset files, query document
Output: List of pre-processed words

Pre-processing(input_files, query_document):
  stop_words ← set of pre-defined stop words
  word_list ← list of words from input_files and the query document
  for each word in word_list
    if word not present in stop_words
      stem_word ← porter_stemmer(word)
      processed_words.append(stem_word)
  end do
  return processed_words

4.1.2 Algorithm: Construct_bigrams
Input: List of pre-processed words
Output: List of bigrams

Construct_bigrams(pre_processed_words):
  input_list ← list of pre-processed words
  bigram_list ← [] (empty list)
  for index i from 1 to length(input_list) − 1

    bigram_list.append((input_list_i, input_list_i+1))
  end do
  return bigram_list

4.1.3 Algorithm: Construct_TFIDF_matrix
Input: List of bigrams
Output: Term frequency-inverse document frequency matrix

Construct_TFIDF_matrix(bigram_list):
  PF ← frequency of each bigram phrase in each document
  DF ← count of documents containing each bigram phrase
  for each bigram phrase i in document j
    if phrase i is present in document j
      tf-idf_ij ← PF_ij × log(|D| / DF_i)    (|D| = total number of documents)
    else
      tf-idf_ij ← 0
  end do
  return tf-idf matrix

4.1.4 Algorithm: Apply_SVD
Input: Term frequency-inverse document frequency matrix
Output: Matrix after applying SVD

Apply_SVD(tfidf_matrix):
  Compute the SVD of tfidf_matrix to generate three matrices U, S and V
    U ← unitary matrix
    S ← diagonal matrix
    V ← unitary matrix
  S' ← set all but the k (here k = 10) highest singular values of S to 0
  final_matrix ← multiply matrices U, S' and V
  return final_matrix

4.1.5 Algorithm: Cosine_similarity
Input: Matrix after applying SVD
Output: Ranking on the basis of cosine similarity

Cosine_similarity(final_matrix):
  similarity ← empty 2-D matrix
  for each pair of documents (i, j)
    similarity_ij ← (A_i · A_j) / (||A_i|| × ||A_j||)
  end do
  k ← index of the query document
  similarity_ranking_list ← empty list
  for each document i in similarity
    similarity_ranking_list.append(similarity_ik)
  end do
  ranking ← sort similarity_ranking_list to obtain the ranking of similarity to the query document
  return ranking
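Read together, Algorithms 4.1.1-4.1.5 compose into a single pipeline. A hypothetical end-to-end sketch, reusing the Python helpers sketched in Section III (preprocess, tfidf_matrix, svd_reduce, cosine):

```python
# Overall Near_Duplicate_Detection flow (sketch): the query document is
# appended to the corpus, vectorized, reduced via SVD, and ranked by
# cosine similarity against every dataset document.
import numpy as np

def rank_near_duplicates(dataset_texts: list[str], query_text: str) -> list[int]:
    docs = [preprocess(t) for t in dataset_texts + [query_text]]
    _, rows = tfidf_matrix(docs)
    reduced = svd_reduce(np.array(rows), k=10)
    query_vec = reduced[-1]                      # last row = query document
    scores = [cosine(row, query_vec) for row in reduced[:-1]]
    # Dataset document indices, most similar first.
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
```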

V. RESULT AND DISCUSSIONS

The goal of this section is to show the actual document positions produced by the proposed approach. For this work we used the n-gram technique for similarity, SVD for value decomposition, and NLP for removing stop words.

[Figure 1: Inverted index of every feature in the same example document, ordered by their position in the document]

In Fig. 1 we submit the phrase "using big pot keep world website" for searching and obtain the matching positions, in order, from the database. First we enter the keyword phrase "using big pot keep world website"; we can then see the document ids and the position of each term. In order to validate our algorithm, we used six different near-duplicate term sequences, namely: using, big, pot, keep, world and website. For each sequence we considered the first search level in the document.
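A toy version of the positional inverted index that Figure 1 depicts (document contents here are hypothetical):

```python
# Each term maps to (document id, position) pairs, so a query such as
# "using big pot keep world website" can be resolved term by term.
from collections import defaultdict

def build_positional_index(docs: dict[int, list[str]]) -> dict[str, list[tuple[int, int]]]:
    index: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].append((doc_id, pos))
    return index

# build_positional_index({7: ["using", "big", "pot"]})["big"] -> [(7, 1)]
```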


[Figure 2: File path with the searched word, document id and position]

Figure 2 shows, for the searched query, the file path, document id and all word positions obtained using the LSI technique. In this figure we examined the recognition abilities on a large data-management system, and it is observed that document duplicacy can be found effectively. Experiments were conducted for different similarity measures, stop-word lists, spell-checking options and weighting schemes; the results of the proposed technique were found to be suitable for duplicacy detection.

[Figure 3: File path with the similar words]

Finally, we find the similarity of the searched query against the database using the cosine-similarity equation above. The n-gram algorithm is executed to generate all possible near-duplicate pairs; a comparison is made whenever two texts share at least two contiguous occurrences of the same words.

LIMITATIONS

The proposed algorithm does not give efficient results when the first occurrence of a record is modified, for example when the first record is summarised. Techniques based on n-grams also have a loophole: even a trivial update of a document produces a mismatch of at least one n-gram. For instance, when every n-th word of an input is updated by some method, the resulting pair of documents shares no common n-grams of length n, and the metrics are consequently unable to identify the duplicacy among the pair.
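A toy demonstration of this failure mode for bigrams (n = 2), using made-up strings:

```python
# Editing every second word leaves the two versions with no bigram in
# common, so bigram-overlap metrics report them as unrelated.
def bigram_set(text: str) -> set[tuple[str, str]]:
    w = text.split()
    return {(w[i], w[i + 1]) for i in range(len(w) - 1)}

original = "the quick brown fox jumps over the lazy dog"
edited = "the fast brown wolf jumps above the sleepy dog"  # every 2nd word changed

print(bigram_set(original) & bigram_set(edited))  # -> set()
```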

VI. CONCLUSION

In this paper we discussed the problem of removing near-duplicates from a corpus. Effective detection of duplicates and near-duplicates has many application areas, which have emerged from the ever-increasing volume of data; it is also necessary to integrate heterogeneous data from a wide variety of sources. The proposed approach identifies duplicates in a large corpus of documents by employing a calculated tokenize factor, which in turn minimizes searching time.

REFERENCES

[1] V. A. Narayana, P. Premchand and A. Govardhan, "A Novel and Efficient Approach For Near Duplicate Page Detection in Web Crawling", IEEE International Advance Computing Conference, 2009.
[2] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., and Raghavan, S., "Searching the web", ACM Transactions on Internet Technology, vol. 1, no. 1, pp. 2-43, 2001.
[3] Brin, S., Davis, J., and Garcia-Molina, H., "Copy detection mechanisms for digital documents", in Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), ACM Press, pp. 398-409, May 1995.
[4] Broder, A., Glassman, S., Manasse, M., and Zweig, G., "Syntactic Clustering of the Web", in 6th International World Wide Web Conference, pp. 393-404, 1997.
[5] Bacchin, M., Ferro, N., Melucci, M., "Experiments to evaluate a statistical stemming algorithm", Proceedings of the International Cross-Language Evaluation Forum Workshop, University of Padua at CLEF 2002, 2002.
[6] Cristina Scheau, Traian Rebedea, Costin Chiru and Stefan Trausan-Matu, "Improving the Relevance of Search Engine Results by Using Semantic Information from Wikipedia", IEEE International Conference, pp. 151-156, 2010.
[7] Standards for HTML Authoring for the World Wide Web, Web Design Group, 1996-2006.
[8] M. S. Charikar, "Similarity estimation techniques from rounding algorithms", in Proceedings of the 34th Annual ACM Symposium on Theory of Computing, Montreal, Quebec, Canada, pp. 380-388, 2002.
[9] Fetterly, D., Manasse, M., Najork, M., "On the evolution of clusters of near-duplicate Web pages", in Proceedings of the First Latin American Web Congress, pp. 37-45, Nov. 2003.
[10] C. Xiao, W. Wang, X. Lin and J. X. Yu, "Efficient Similarity Join for Near Duplicate Detection", Beijing, China, pp. 131-140, 2008.
[11] L. P. S. Gauch and M. Speretta, "Document Similarity Based on Concept Tree Distance", Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, USA, pp. 127-132, 2008.
[12] Gong, C., Huang, Y., Cheng, X., Bai, S., "Detecting Near-Duplicates in Large-Scale Short Text Databases", Lecture Notes in Computer Science, Vol. 5012, Springer Berlin/Heidelberg, pp. 877-883, 2008.
[13] Maosheng Zhong, Yi Hu, Lei Liu and Ruzhan Lu, "A Practical Approach for Relevance Measure of Inter-Sentence", Fifth International Conference on Fuzzy Systems and Knowledge Discovery, pp. 140-144, 2008.
[14] Bar-Yossef, Z., Keidar, I., Schonfeld, U., "Do Not Crawl in the DUST: Different URLs with Similar Text", 16th International World Wide Web Conference, Alberta, Canada, pp. 111-120, 2007.


[15] Jalbert, N., Weimer, W., "Automated Duplicate Detection for Bug Tracking Systems", IEEE International Conference on Dependable Systems and Networks with FTCS and DCC (DSN), pp. 52-61, 24-27 June 2008.
[16] Midhun Mathew, Shine N. Das, Pramod K. Vijayaraghavan, "A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix", International Journal of Computer Applications, pp. 16-21, April 2011.
[17] Krishnamurthy Koduvayur Viswanathan and Tim Finin, "Text Based Similarity Metrics and Delta for Semantic Web Graphs", pp. 17-20, 2010.
[18] Shine N. Das, K. V. Pramod, "Relevancy based Re-ranking of Search Engine Result", Proceedings of the International Conference on Mathematical Computing and Management, Kerala, India, June 2010.
[19] V. A. Narayana, P. Premchand and A. Govardhan, "Effective Detection of Near Duplicate Web Documents in Web Crawling", International Journal of Computational Intelligence Research, Volume 5, Number 1, pp. 83-96, 2009.
[20] Broder, "Identifying and filtering near-duplicate documents", in Combinatorial Pattern Matching, R. Giancarlo and D. Sankoff, Eds., Vol. 1848 of Lecture Notes in Computer Science, pp. 1-10, Springer, Berlin, Germany, 2000.
[21] C. Li, B. Wang, and X. Yang, "VGRAM: improving performance of approximate queries on string collections using variable-length grams", in Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07), Vienna, Austria, September 2007.
[22] Yang, H., Callan, J., Shulman, S., "Next Steps in Near-Duplicate Detection for eRulemaking", Proceedings of the 2006 International Conference on Digital Government Research, Vol. 151, pp. 239-248, 2006.
[23] Yerra, R., and Yiu-Kai Ng, "A Sentence-Based Copy Detection Approach for Web Documents", Lecture Notes in Computer Science, Vol. 3613, Springer Berlin/Heidelberg, pp. 557-570, 2005.
[24] O. Alonso, D. Fetterly, and M. Manasse, "Duplicate news story detection revisited", in Proceedings of the Asia Information Retrieval Societies Conference (AIRS), pp. 203-214, 2013.
[25] Y. Bernstein, J. Zobel, "Accurate discovery of co-derivative documents via duplicate text detection", Information Systems, pp. 595-609, 2006.
[26] Gaudence Uwamahoro and Zhang Zuping, "Efficient Algorithm for Near Duplicate Documents Detection", ISSN (Print): 1694-0814, ISSN (Online): 1694-0784.


