Cross-Language Plagiarism Detection Steps
INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 9, ISSUE 01, JANUARY 2020 ISSN 2277-8616

Marat Rakhmatullayev, Jasurbek Atadjanov, Gandjaeva Lola, Matyakubova Yulduzxon, Allabergenova Komila

Marat Rakhmatullayev, Doctor, TUITni kengaytmasi, Uzbekistan, PH-998946883133
Jasurbek Atadjanov, PhD, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Uzbekistan, PH-998998702297
Gandjaeva Lola Atanazarovna, PhD, Urgench State University (UrSU), Uzbekistan, PH-998975115511
Matyakubova Yulduzxon Amanbayevna, PhD, Urgench State University (UrSU), Uzbekistan, PH-998999684907
Allabergenova Komila Sabirjanovna, Teacher at the Department of Biology, Urgench State University (UrSU), Uzbekistan, PH-998934692241

Abstract: This paper describes the steps of cross-language plagiarism detection between a test document and indexed documents using the CLAD (Cross-Language Analog Detector) method. The main difference between this method and existing ones is that it detects plagiarism across multiple languages, not only two. Terms are translated with a dictionary-based machine-translation method. CLAD's working process consists of a document indexing phase and a detection phase. In this paper, we describe both phases and the architecture of the jComporator information system, which was developed on the basis of this method.

Index Terms: cross-language plagiarism detection, plagiarism detection steps, indexing documents.

1 INTRODUCTION
Plagiarism, the use of another person's work without proper acknowledgment of the original source, is an infamous problem in academic society. There are several methods, systems, and services that help to detect plagiarism by machine [1]. The development of machine translation has, in turn, posed the problem of cross-language plagiarism detection (CLPD) [2-4]. The main problem for CLPD is that, when text is translated, it is important to determine exactly which translation of a word is used in the document. For example, the Russian word "человек" can be translated into English as "person", "human", "individual", or "man". If we translate this word as "person" during the plagiarism detection process but the document uses "man", the comparison can give an incorrect result. If we check all synonym versions of the word, we run into performance issues.

In this paper, we describe the CLAD (Cross-Language Analog Detector) method, which computes a similarity score between documents written in different natural languages or in the same one. CLAD uses the "bag of words analysis" model to determine the similarity of two documents [5-7]. In this method, the plagiarism detection process consists of converting the document into plain text, parsing the text, analyzing words morphologically (stemming), analyzing words lexically (detecting and removing stop words), normalizing synonym forms, translating words (dictionary-based machine-translation method), and comparing the bags of words. The main difference between this method and existing ones is that it can check plagiarism across more than two natural languages. To avoid compiling a dictionary for each pair of languages, one main language is selected. If two documents are in different languages, their bags of words are first translated into the main language and then compared.

The jComporator system was developed using the CLAD method, and its detection quality is directly related to its synonym and dictionary databases. The database structure of the jComporator system is designed using the NoSQL mechanism [8-10]. For document indexing, the Apache Lucene system, developed by the Apache Software Foundation, was used [11]. To store the rest of the information (dictionary, synonyms, reports, etc.), MySQL was used. MySQL is an open-source relational database management system (RDBMS) [12].

2 RELATED WORKS
This section provides an overview of related works that deal with the detection of cross-language plagiarism. The work by Vera Danilova (2013) presented methods of cross-language plagiarism detection and described the process of comparing documents written in different natural languages [13]. These methods only consider determining plagiarism between two languages. In addition, synonym forms of words are not considered in these algorithms. The paper by Zaid Alaa, Sabrina Tiun, and Mohammedhasan Abdulameer (2016) described a cross-language method for documents in Arabic and English. The paper also showed a comparison of documents that takes the synonymy of words into account [2].

In an interesting paper [14], Daniele Anzelmi and colleagues report the SCAM (Standard Copy Analysis Mechanism) algorithm, a relative measure that detects overlap by comparing the set of words common to the test document and a registered document. To compare documents while taking the synonym forms of words into account, this algorithm suggests checking each synonym form. In this case, the total number of operations is calculated using the following formula:

$$ s = \sum_{i=1}^{l} c_i \qquad (1) $$

Here, l is the number of words in the document, c_i is the number of synonym forms of the i-th word, and s is the total number of operations. The total number of comparison operations will be even greater if we use algorithms of the shingling class [15]:

$$ s = \prod_{i=1}^{l} c_i \qquad (2) $$

The paper [11] introduced a cross-language plagiarism detection system for English translations of Spanish documents. Their system comprised three stages: translation detection, internet search, and report generation. There are several systems that detect document plagiarism by using web search engines, such as AntiPlagiarism.NET, Advego Plagiatus, Unplag, Grammarly, and Copyscape. There are also systems and services that work on their own databases, such as Unicheck, Turnitin, PlagTracker, Антиплагиат, and PlagScan [16-19].

3 IMPLEMENTATION
As discussed above, document plagiarism detection consists of two phases: (1) document indexing and (2) similarity checking. In this paper, we describe the plagiarism detection process for Uzbek, English, and Russian documents. To describe the method, English is chosen as the main language. If a document is in Uzbek or Russian, its terms are translated into English during the indexing process.

4 DOCUMENT INDEXING PROCESS
In this phase, we describe the process of inserting a document into the database.

4.1 Document Normalization
The document normalization phase consists of three steps: (1) content analysis, (2) tokenization, and (3) stop word removal. The main aim of this phase is to prepare the original document's dataset for similarity comparisons with other texts. Content analysis consists of retrieving plain text (words, themes, or concepts) from digital files in different formats. In this step, we can use the Apache Tika toolkit. The Apache Tika™ toolkit supports extracting metadata and text from more than a thousand different file types (such as PPT, XLS, and PDF) [20]. After this step, a document in any file format is converted into plain text. Table 1 gives the result of this step.

TABLE 1. RESULT OF CONTENT ANALYZING STEP
Source text in HTML format (Input): <p>In mathematics
Plain text (Output):

TABLE 2. RESULT OF TOKENIZATION
Plain text (Input): In mathematics and computer science, an algorithm is a finite sequence of well-defined, computer-implementable instructions, typically to solve a class of problems or to perform a computation. Algorithms are unambiguous specifications for performing calculation, data processing, automated reasoning, and other tasks.
Words collection (Output): "in", "mathematics", "and", "computer", "science", "an", "algorithm", "is", "a", "finite", "sequence", "of", "well", "defined", "computer", "implementable", "instructions", "typically", "to", "solve", "a", "class", "of", "problems", "or", "to", "perform", "a", "computation", "algorithms", "are", "unambiguous", "specifications", "for", "performing", "calculation", "data", "processing", "automated", "reasoning", "and", "other", "tasks"

It is known that every natural language has stop words, which are used inside a sentence to relate words to each other. There is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. The next step consists of removing stop words from the collection of words. The list of English stop words used in this study is a default English stop word list, a well-known list used by many researchers, including [26]. Table 3 shows the text after the stop word removal step.

TABLE 3. REMOVING STOP WORDS
Words collection (Input): "in", "mathematics", "and", "computer", "science", "an", "algorithm", "is", "a", "finite", "sequence", "of", "well", "defined", "computer", "implementable", "instructions", "typically", "to", "solve", "a", "class", "of",
Words collection (Output): "mathematics", "computer", "science", "algorithm", "finite", "sequence", "well", "defined", "computer", "implementable", "instructions", "typically", "solve", "class", "problems", "perform", "computation", "algorithms",
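The tokenization and stop word removal steps illustrated in Tables 2 and 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the jComporator implementation: the function names `tokenize` and `remove_stop_words`, the regular expression, and the short `STOP_WORDS` set are hypothetical choices made for illustration, and the stop word list is not the list cited as [26].

```python
import re

# Illustrative stop-word list (an assumption for this sketch,
# not the default English list the paper cites as [26]).
STOP_WORDS = frozenset({"a", "an", "and", "are", "for", "in", "is", "of", "or", "to"})

# One or more Unicode letters; digits, punctuation and hyphens act as separators,
# so "well-defined" yields the two words "well" and "defined".
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into a list of words (Table 2 step)."""
    return WORD_RE.findall(text.lower())


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop every word found in the stop-word list, keeping order (Table 3 step)."""
    return [w for w in words if w not in stop_words]


# First sentence of the sample text used in Table 2.
sentence = ("In mathematics and computer science, an algorithm is a finite "
            "sequence of well-defined, computer-implementable instructions, "
            "typically to solve a class of problems or to perform a computation.")

tokens = tokenize(sentence)
content_words = remove_stop_words(tokens)

print(len(tokens), tokens[:6])
print(len(content_words), content_words)
```

For this first sentence the sketch produces 29 tokens and keeps 17 of them after stop word removal, from "mathematics" through "computation". Because the pattern matches Unicode letters, the same tokenizer also splits Russian text, but words written with an apostrophe, as in some Uzbek spellings, would be split at the apostrophe and would need a language-specific rule.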