Detection of Translator Stylometry using Pair-wise Comparative Classification and Network Motif Mining
Heba Zaki Mohamed Abdallah El-Fiqi
M.Sc. (Computer Sci.), Cairo University, Egypt
B.Sc. (Computer Sci.), Zagazig University, Egypt
Scientia Manu et Mente
A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the School of Engineering and Information Technology, University of New South Wales, Australian Defence Force Academy
© Copyright 2013 by Heba El Fiqi
Abstract
Stylometry is the study of the unique linguistic styles and writing behaviours of individuals. The identification of translator stylometry has applications in fields such as intellectual property, education, and forensic linguistics. Despite the proliferation of research on the wider field of authorship attribution using computational linguistics techniques, the translator stylometry problem is more challenging, and the machine learning literature on the topic is scarce. Some authors even claimed that detecting who translated a piece of text is a problem with no solution; a claim we challenge in this thesis.
In this thesis, we evaluated the use of existing lexical measures for the translator stylometry problem. It was found that vocabulary richness could not identify translator stylometry. This encouraged us to look for non-traditional representations to discover new features that unfold translator stylometry. Network motifs are small subgraphs that aim at capturing the local structure of a real network. We designed an approach that transforms the text into a network and then identifies the distinctive patterns of a translator by employing network motif mining.
During our investigations, we redefined the problem of translator stylometry identification as a new type of classification problem that we call the Comparative Classification Problem (CCP). In the pair-wise CCP (PWCCP), data are collected on two subjects. The classification problem is to decide, given a piece of evidence, which of the two subjects is responsible for it. The key difference between PWCCP and traditional binary problems is that hidden patterns can only be unmasked by comparing the instances as pairs. A modified C4.5 decision tree classifier, which we call PWC4.5, is then proposed for PWCCP.
A comparison between the two cases of detecting the translator using traditional classification and PWCCP demonstrated a remarkable ability of PWCCP to discriminate between translators.
The contributions of the thesis are: (1) providing an empirical study to evaluate the use of stylistic-based features for the problem of translator stylometry identification; (2) introducing network motif mining as an effective approach to detect translator stylometry; and (3) proposing a modified C4.5 methodology for pair-wise comparative classification.
Keywords
Stylometry Analysis; Translator Stylometry Identification; Parallel Translations; Computational Linguistics; Social Network Analysis; Network Motifs; Machine Learning; Pattern Recognition; Classification Algorithms; Paired Classification; C4.5; Decision Trees.
Acknowledgements
All praises are due to ALLAH the almighty God, the Most Beneficent and the Most Merciful. I thank Him for bestowing His Blessings upon me to accomplish this task.
Afterwards, I wish to thank, first and foremost, my supervisor, Professor Hussein Abbass. His wisdom, knowledge, endless ideas, and commitment to the highest standards inspired and motivated me throughout my candidature. I cannot find words to express my gratitude to him. Thank you for your continued encouragement, guidance, and support. Thank you for believing in me. I consider it an honour to work with you.
It gives me great pleasure to acknowledge the support and help of my co-supervisor, Professor Eleni Petraki. She is one of the kindest people I have ever met. She always supported me and praised every single step that I was making toward the thesis’s completion. Her useful discussions, suggestions, and ideas were a vital component of this research.
Immeasurable thanks go to Ahmed, my wonderful husband, who always believed in me when I doubted myself. Without his love, constant patience, and sacrifices, I would not have had the mindset to go through this challenging journey, especially with my little angel Nada. Ahmed always looked after her during the critical periods of my research. He was the father for both of us during these times. That reminds me to thank Nada for being part of my life. Her smiles were the treasure bank of happiness that kept me going.
I would especially like to thank my parents, Zaki and Fatma, for their unconditional love, prayers and support. I would like to extend my thanks to my brothers, sisters, and in-laws: Mohamed, Ehab, Shaimaa, Asmaa, Alaa, Nashwa, Abeer, Sahar, and Dina for their continued support and encouragement even though they were physically thousands of miles away.
I would like to express my gratitude to my colleagues in the Computational Intelligence Group: Amr Ghoneim, Shir Li Wang, Murad Hossain, Ayman Ghoneim, George Leu, Shen Ren, Bing Wang, Kun Wang, and Bin Zhang.
My sincere thanks to Eman Samir, Mai Shouman, Noha Hamza, Hafsa Ismail, Sara Khalifa, Mayada Tharwat, Mona Ezzeldeen, Sondoss El-Sawah, Yasmin Abdelraouf, Wafaa Shalaby, Arwa Hadeed, Amira Hadeed, Irman Hermadi, Ibrahim Radwan, AbdelMonaem Fouad, Saber Elsayed, and Mohamed Abdallah for their moral support and encouragement, which made my stay in Australia highly enjoyable.
Last but not least, I am also thankful to UNSW Canberra for providing the university postgraduate research scholarship support during my PhD candidature. I am also grateful for the assistance provided to me by the school’s administration, the research office, and IT staff.
Originality Statement
‘I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by colleagues, with whom I have worked at UNSW or elsewhere, during my candidature, is fully acknowledged.
I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’
Heba El Fiqi
Copyright Statement
‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
I also authorise University Microfilms to use the 350-word abstract of my thesis in Dissertation Abstracts International. I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’
Heba El Fiqi
Authenticity Statement
‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’
Heba El Fiqi
Contents
Abstract ii
Keywords iv
Acknowledgements vi
Declaration viii
Table of Contents xiv
List of Figures xx
List of Tables xxiv
List of Acronyms xxviii
List of Publications xxx
1 Introduction 2
1.1 Overview ...... 2
1.2 Research Motivation ...... 4
1.2.1 Author, Translator, and the Intellectual Property ...... 4
1.2.2 The Linguistic Challenge: Fidelity and Transparency (from Theory to Practice) ...... 5
1.2.3 The Computational Linguistic Challenge ...... 6
1.3 Research Question and Hypothesis ...... 8
1.4 Original Contributions ...... 11
1.5 Organization of the Thesis ...... 12
2 Background 14
2.1 Introduction ...... 14
2.2 Stylometry Analysis ...... 15
2.2.1 Authors Stylometric Analysis (Problem Definitions) ...... 31
2.2.2 Translators Stylometric Analysis (Problem Definitions) ...... 40
2.2.3 Methods of Stylometric Analysis ...... 46
2.3 Translator Identification: A Data Mining Perspective ...... 53
2.3.1 Classification for Translator Identification ...... 53
2.3.2 Stylometric Features for Translator Stylometry Identification ...... 54
2.3.3 Social Network Analysis ...... 57
2.3.4 Analysing Local and Global Features of Networks ...... 60
2.4 Chapter Summary ...... 62
3 Data and Evaluation of Existing Methods 64
3.1 Overview ...... 64
3.2 Why Arabic to English Translations? ...... 65
3.3 Why the “Holy Qur’an”? Is the Proposed Approach Restricted to the Holy Qur’an? ...... 66
3.4 How the Dataset is Structured ...... 70
3.5 Evaluating Existing Methods ...... 72
3.5.1 Is There Such a Thing as a Translator’s Style? ...... 72
3.5.2 Vocabulary Richness Measures as Translator Stylometry Features ...... 84
3.6 Chapter Summary ...... 87
4 Identifying Translator Stylometry Using Network Motifs 88
4.1 Overview ...... 88
4.2 Methodology ...... 89
4.2.1 Data Pre-processing ...... 90
4.2.2 Network Formation ...... 90
4.2.3 Features Identification ...... 90
4.2.4 Motif Extraction ...... 91
4.2.5 Randomization ...... 94
4.2.6 Significance Test ...... 95
4.2.7 Global Network Features ...... 95
4.3 Classifiers ...... 95
4.4 Experiment I ...... 97
4.4.1 Results and Analysis of Experiment I ...... 99
4.5 Experiment II ...... 103
4.5.1 Results and Discussion of Experiment II ...... 103
4.6 Different Representations of Network Motifs for Detecting Translator Stylometry ...... 108
4.6.1 Method III ...... 108
4.6.2 Experiment III ...... 108
4.6.3 Results and Discussion of Experiment III ...... 109
4.7 Chapter Summary ...... 110
5 Translator Identification as a Pair-Wise Comparative Classification Problem 114
5.1 Overview ...... 114
5.2 From Classification to Comparative Classification ...... 115
5.2.1 Inner vs Outer Classification ...... 119
5.2.2 Univariate vs. Multivariate Decision Trees ...... 120
5.2.3 C4.5 ...... 121
5.2.4 Relational Learning ...... 124
5.2.5 Classification vs Comparative Classification ...... 125
5.3 PWC4.5 Decision Tree Algorithm ...... 128
5.4 Experiment ...... 132
5.4.1 Artificial Data ...... 132
5.4.2 Translator Stylometry Identification Problem ...... 149
5.5 Chapter Summary ...... 156
6 Conclusions and Future Research 160
6.1 Summary of Results ...... 160
6.2 Future Research ...... 164
Appendix A Dataset Description 166
Appendix B Most Frequent Words 172
Appendix C 5D Decision Trees Analysis 176
Bibliography 192
List of Figures
1.1 The Thesis Scope ...... 9
2.1 Road Map to Research on Stylometry Analysis. Numerical Tags Correspond to Entries in Table 2.1 ...... 16
3.1 In the Holy Qur’an 78(6-7), Color Coding Represents Variations of Lexical Uses ...... 67
3.2 In the Holy Qur’an 23(14), Color Coding Represents Variations of Lexical Uses ...... 68
3.3 Calculating Thresholds for F Index ...... 74
3.4 Vocabulary Richness Measures ...... 78
3.5 Comparison between Most Frequent Words Index for Translators Asad and Pickthall ...... 82
3.6 Comparison between Favorite Words Index for Translators Asad and Pickthall ...... 83
4.1 Network Example ...... 92
4.2 All Possible 3-Node Connected Subgraphs ...... 93
4.3 Network of Chapter 80 by “Yousif Ali” ...... 94
4.4 Comparison between the Average of the 13 Motifs for the Arabic, First English Translation, and Second Translation ...... 99
5.1 Outer Classification ...... 119
5.2 Inner Classification ...... 120
5.3 Traditional Decision Tree ...... 122
5.4 ILP Classifier ...... 125
5.5 PWC4.5 ...... 129
5.6 Example 1: Pair-Wise Relationship Based on Variable V1 ...... 130
5.7 C4.5 Decision Tree of Example 1 ...... 130
5.8 PWC4.5 Decision Tree of Example 1 ...... 130
5.9 Example 2: Pair-Wise Relationship Based on Variables V1 and V2 ...... 131
5.10 C4.5 Decision Tree of Example 2 ...... 131
5.11 PWC4.5 Decision Tree of Example 2 ...... 131
5.12 Average Accuracy for C4.5 and PWC4.5 for 2D Dataset ...... 138
5.13 Average Accuracy for C4.5 and PWC4.5 for 5D Dataset ...... 138
5.14 Pair-Wise Relationship of Noise-Free 2D(Exp1) ...... 139
5.15 Decision Tree for Noise-Free 2D(Exp1) ...... 139
5.16 Pair-Wise Relationship of 2D(Exp1) with Noise Level of 1% ...... 140
5.17 Decision Tree for 2D(Exp1) with Noise Level of 1% ...... 140
5.18 Pair-Wise Relationship of 2D(Exp1) with Noise Level of 2.5% ...... 141
5.19 Decision Tree for 2D(Exp1) with Noise Level of 2.5% ...... 141
5.20 Pair-Wise Relationship of 2D(Exp1) with Noise Level of 5% ...... 142
5.21 Decision Tree for 2D(Exp1) with Noise Level of 5% ...... 142
5.22 Pair-Wise Relationship of 2D(Exp1) with Noise Level of 10% ...... 143
5.23 Decision Tree for 2D(Exp1) with Noise Level of 10% ...... 143
5.24 Pair-Wise Relationship of 2D(Exp1) with Noise Level of 15% ...... 144
5.25 Decision Tree for 2D(Exp1) with Noise Level of 15% ...... 144
5.26 Pair-Wise Relationship of 2D(Exp1) with Noise Level of 20% ...... 145
5.27 Decision Tree for 2D(Exp1) with Noise Level of 20% ...... 145
5.28 Pair-Wise Relationship of 2D(Exp1) with Noise Level of 25% ...... 146
5.29 Decision Tree for 2D(Exp1) with Noise Level of 25% ...... 146
5.30 PWC4.5 Decision Tree for 1st Run (Exp1), Asad–Daryabadi ...... 156
5.31 PWC4.5 Decision Tree for 1st Run (Exp1), Asad–Maududi ...... 157
5.32 PWC4.5 Decision Tree for 1st Run (Exp1), Asad–Pickthall ...... 157
5.33 PWC4.5 Decision Tree for 1st Run (Exp1), Asad–Raza ...... 157
5.34 PWC4.5 Decision Tree for 1st Run (Exp1), Asad–Sarwar ...... 158
5.35 PWC4.5 Decision Tree for 1st Run (Exp1), Asad–Yousif Ali ...... 158
5.36 PWC4.5 Decision Tree for 1st Run (Exp1), Daryabadi–Maududi ...... 158
5.37 PWC4.5 Decision Tree for 1st Run (Exp1), Raza–Daryabadi ...... 159
5.38 PWC4.5 Decision Tree for 1st Run (Exp1), Daryabadi–Sarwar ...... 159
C.1 C4.5 Decision Tree for Noise-Free 5D(Exp1) ...... 176
C.2 PWC4.5 Decision Tree for Noise-Free 5D(Exp1) ...... 177
C.3 C4.5 Decision Tree for 5D(Exp1) with Noise Level of 1% ...... 178
C.4 PWC4.5 Decision Tree for 5D(Exp1) with Noise Level of 1% ...... 179
C.5 C4.5 Decision Tree for 5D(Exp1) with Noise Level of 2.5% ...... 180
C.6 PWC4.5 Decision Tree for 5D(Exp1) with Noise Level of 2.5% ...... 181
C.7 C4.5 Decision Tree for 5D(Exp1) with Noise Level of 5% ...... 182
C.8 PWC4.5 Decision Tree for 5D(Exp1) with Noise Level of 5% ...... 183
C.9 C4.5 Decision Tree for 5D(Exp1) with Noise Level of 10% ...... 184
C.10 PWC4.5 Decision Tree for 5D(Exp1) with Noise Level of 10% ...... 185
C.11 C4.5 Decision Tree for 5D(Exp1) with Noise Level of 15% ...... 186
C.12 PWC4.5 Decision Tree for 5D(Exp1) with Noise Level of 15% ...... 187
C.13 C4.5 Decision Tree for 5D(Exp1) with Noise Level of 20% ...... 188
C.14 PWC4.5 Decision Tree for 5D(Exp1) with Noise Level of 20% ...... 189
C.15 C4.5 Decision Tree for 5D(Exp1) with Noise Level of 25% ...... 190
C.16 PWC4.5 Decision Tree for 5D(Exp1) with Noise Level of 25% ...... 190
List of Tables
3.1 Translators’ Demography ...... 71
3.2 Number of Words in the Dataset for Each Translator ...... 71
3.3 Optional caption for list of figures ...... 75
3.4 Most Frequent Words Index for the Same Translator ...... 77
3.5 Vocabulary Richness Measures ...... 79
3.6 Most Frequent Words Index for the Same Part ...... 79
3.7 Favorite Words Index for the Same Translator ...... 80
3.8 Favorite Words Index for the Same Part ...... 81
3.9 Classification Results for Vocabulary Richness Measures as Translator Stylometry Features ...... 86
4.1 Experiment I Parameters ...... 98
4.2 Paired T-Test between Frequencies of Motifs for the Two Translators ...... 100
4.3 Correlation between Frequencies of Motifs ...... 100
4.4 Accuracy of the Different Classifiers for Experiment I(a) ...... 101
4.5 Accuracy of the Different Classifiers for Experiment I(b) ...... 101
4.6 Accuracy of the Different Classifiers for Experiment I(c) ...... 102
4.7 Classification Results for Network Global Features as Translator Stylometry Features ...... 105
4.8 Classification Results for Network Motifs of Size Three as Translator Stylometry Features ...... 106
4.9 Classification Results for Network Motifs of Size Four as Translator Stylometry Features ...... 107
4.10 Classification Results for Using Motifs of Size Three and Size Four with Ranking as Translator Stylometry Features ...... 112
5.1 Summary of the Results of One-Tail Paired T-Tests between the Accuracy of C4.5 and the Accuracy of PWC4.5 on 2D Artificial Data (Alpha=0.05, Degrees of Freedom=9) ...... 136
5.2 Summary of the Results of One-Tail Paired T-Tests between the Accuracy of C4.5 and the Accuracy of PWC4.5 on 5D Artificial Data (Alpha=0.05, Degrees of Freedom=9) ...... 137
5.3 Accuracy by Class of Exp1(2D) for All Noise Levels ...... 147
5.4 Accuracy by Class of Exp1(5D) for All Noise Levels ...... 148
5.5 Classification Accuracy of C4.5 and PWC4.5 for Translators Asad–Daryabadi ...... 151
5.6 Classification Accuracy of C4.5 and PWC4.5 for Translators Asad–Raza ...... 152
5.7 T-Test: Paired Two-Sample for Means of C4.5 and PWC4.5 for the Classification Problem of Identifying Translators Asad–Daryabadi ...... 153
5.8 T-Test: Paired Two-Sample for Means of C4.5 and PWC4.5 for the Classification Problem of Identifying Translators Asad–Raza ...... 153
5.9 Summary of the Results of One-Tail Paired T-Tests between the Accuracy of C4.5 and the Accuracy of PWC4.5 (Alpha=0.05, Degrees of Freedom=9) ...... 155
List of Acronyms
AAAC  Ad-hoc Authorship Attribution Competition
ADA  AdaboostM1
BMR  Bayesian Multinomial Regression
BNC  British National Corpus
C4.5  C4.5 Decision Tree Algorithm
CCP  Comparative Classification Problem
CL-CNG  Cross-Language Character n-Grams
CW  Content Words
FLR  Fuzzy Lattice Reasoning Classifier
FT  Functional Tree
FURIA  Fuzzy Unordered Rule Induction Algorithm
FW  Function Words
ICLE  International Corpus of Learner English
ILP  Inductive Logic Programming
IR  Information Retrieval
JRIP  Rule-based Learner
KNN  k-Nearest Neighbours
LR  Linear Regression
M5D  M5 Decision Tree
M5R  M5 Regression Tree
MFW  Most Frequent Words
MVA  Multivariate Analysis
NB  Naïve Bayes
NCG  Nearest Shrunken Centroids
POS  Part of Speech
PPM  Prediction by Partial Matching Compression Algorithm
PWCCP  Pair-Wise Comparative Classification Problem
RDA  Regularized Discriminant Analysis
REPT  REP Tree Decision Tree
SMO  Support Vector Machines using Sequential Minimal Optimization
SNA  Social Network Analysis
SVM  Support Vector Machine
TiMBL  Tilburg Memory-Based Learner
VR  Vocabulary Richness
List of Publications
Journal Publications
• Heba El Fiqi, Eleni Petraki, and Hussein Abbass, “Network Motifs for Translator Stylometry Identification”, manuscript submitted to a peer-reviewed journal.
• Heba El Fiqi, Eleni Petraki, and Hussein Abbass, “Pairwise Comparative Classification Problems for Computational Linguistics”, manuscript submitted to a peer-reviewed journal.
Conference Publications
• Heba El Fiqi, Eleni Petraki, and Hussein Abbass, 2011, “A Computational Linguistic Approach for the Identification of Translator Stylometry Using Arabic-English Text”, IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 2039–2045, Taipei, Taiwan, 27–30 June 2011.
Chapter 1
Introduction
1.1 Overview
Translation is a challenging task, as it involves understanding the meaning and function of the original author’s language. Successful translation requires that the translator communicate the same mental picture as the original author of the text.
The art of translation is a complex process. Good translation does not stop at the level of mapping words; rather, it extends to mapping meaning, mental pictures, imagination, and feelings. During the translation process, the translator tries to maintain the spirit of the original work. Nevertheless, the translator also has to make many personal decisions, including the choice of words, discourse markers, modal verb selection, length of sentences, frames, and his/her own understanding of the original text. Such decisions constitute the translator’s own distinctive style, and using these distinctive markers to identify the translator is the aim of this translator stylometry study. The tension between remaining faithful to the original work and making such personal decisions is known as the “loyalty” dilemma, and there is extensive literature on the importance of maintaining the spirit of the original work.
In 1995, Venuti discussed translator invisibility [185]. He echoes the aim of a good translation that was originally introduced by Shapiro: “A good translation is like a pane of glass. You only notice that it’s there when there are little imperfections: scratches, bubbles. Ideally, there shouldn’t be any. It should never call attention to itself.” [185]. This concept ignores the effect of the translator’s own identity on the translation process. Publishers and readers are satisfied with the translator’s invisibility, which leads translation to be considered a derivative rather than an innovative process. Baker described the implication of that view for translator stylometry as “... the translator cannot have, indeed should not have, a style of his or her own, the translator’s task being simply to reproduce as closely as possible the style of the original.” [19]
Translator invisibility is a tricky concept that has been criticized in the linguistics literature [61, 143]. Baker and Xiumei independently point out the difficulty translators face in excluding their personal views and attitudes when translating a text [19, 197]. Baker suggested the existence of translator fingerprints; she pioneered the research in this area and tried to identify possible signatures of translators in their translations [19]. Although Baker’s study demonstrated the existence of translator stylometry, it was limited in terms of computational linguistics analysis. Baker used translations from different languages and of different texts: the first translator translated from Portuguese and Spanish to English, while the second translated from Arabic to English. Furthermore, these translations were not of the same original texts. Such an analysis leaves open questions about the translators’ differences: they could be attributed to translating from different source languages, or to the fact that the translations came from different original texts.
Translator stylometry is an under-researched area in the field of computational linguistics. It refers to the identification of stylistic differences that distinguish the writing styles of different translators. In fact, the translator’s style has been treated as noise affecting the original text [62]; Hedegaard and Simonsen considered the translator’s effect on the text as noise that hampers identifying the author of the text, rather than considering the translator’s intellectual contribution to the work. To the best of our knowledge, very limited research exists in this field. Mikhailov and Villikka’s study is one of the few that suggested that translator stylometry cannot be detected using computational linguistics [121].
Heba El Fiqi November 6, 2013 CHAPTER 1. INTRODUCTION 4
Again, this claim was supported by Rybicki [154], who questioned translator stylometry identification using clustering analysis.
1.2 Research Motivation
Detecting the translator of a text is an important problem that can serve a range of functions. In the legal domain, it can be used to resolve intellectual property cases [44]. In education, it can be used to detect plagiarism in translation classes, or to address differences between experts and learners [33].
1.2.1 Author, Translator, and the Intellectual Property
In the beginning of the 1990s, Lenita Esteves, a Brazilian translator, was invited by a Brazilian publishing house to translate “The Lord of the Rings”, the famous fantasy novel. That agreement was made before the release of the first film in 2001, when the book became a bestseller. Esteves sued the publishing house, claiming her share of the profit from book sales. She then discovered that all the subtitles in the Brazilian version of the film had been taken from her translation, including the names and some lines of poems. So she sued the distributor of the film as well; the distributor offered her an out-of-court settlement, which she accepted and was paid. The publishing house, on the other hand, rejected her claim because, according to market practice, translators are not paid for copyright but for the task of translation, meaning they are paid once regardless of how many copies of the book are sold. The first judgment ended with the claim being accepted, and the publisher had to pay the translator 5% of the price of each book sold, but the publishing house appealed that decision. We could not obtain any information about the final decision, but the publishing house took a protective step by announcing future translations of the book, to stop the translator from gaining any further profit in case she won the claim. The translator wrote an article about her story [44], in which she argues for her intellectual property rights; she also argued about the translated names that she used to introduce the novel’s characters to Brazilian readers. If a new translation uses them, she claims that this is plagiarism; if the names are changed, then readers need to be re-educated about the new characters’ names.
What is important in this story is how market practice ignored the intellectual property of the translator. In fact, it suggests that translators want their own voice and identity in the field. This offers a strong justification and support for pursuing the topic of translator stylometry, and poses the question: can we prove that the translator has a signature in the translated text? If so, how can we define the stylometric characteristics that support such a claim?
1.2.2 The Linguistic Challenge: Fidelity and Transparency (from Theory to Practice)
Venuti described the practice of translators in society, and in their translations themselves, with the term invisibility. He claimed that this is what readers and publishers expect from the translator. Shapiro asked translators to confine themselves to transparency [185]. Although Venuti called on translators to be more visible in terms of claiming intellectual property, he still argued that high-quality translations are associated with fidelity, with loyalty to the original text, and again with being invisible in the text. Both fidelity and transparency have shaped translation theories over the last decades, but both are too ideal to be attained in practice. Moving from theory to practice, we see how translators are highly affected by their beliefs, backgrounds, understanding, and cultural boundaries. Their identities affect their translations [60].
The literature in many fields has discussed the existence of translator styles. For example, in 2000, Baker discussed the existence of translator style, saying: “it is as impossible to produce a stretch of language in a totally impersonal way as it is to handle an object without leaving one’s fingerprints on it” [19]. She suggested studying translator styles using forensic stylistics (unconscious linguistic habits) rather than literary stylistics (conscious linguistic choices).
Xiumei used relevance theory to explain the translator’s style [197]. The findings of Xiumei’s research demonstrated that, while the translator tries to balance the original author’s communicative intentions against the target reader’s cognitive environment, s/he is still influenced by his/her own preferences and abilities; the outcome of all of this constitutes his/her style.
In 2010, Winters discussed how a translator’s attitude influences his/her translation [194]. Winters, who used two German translations of the novel “The Beautiful and Damned” (1922) by F. Scott Fitzgerald, showed that different translators’ views affect the macro level of the novel. Furthermore, Winters discussed how this extends to influence the readers’ attitude.
Scholars have used different linguistic approaches to detect translator styles. While Winters used loan words, code-switches [191] and speech-act report verbs [192], Kamenická explored how explicitations contributed to translators’ styles [84]. Wang and Li looked for translators’ fingerprints in two parallel Chinese translations of Ulysses using keyword lists. They identified different preferences in choosing keywords by different translators. They also found differences at the syntactic level by analysing decisions on clause position within sentences [187]. This research confirmed the existence of stylistic features identifying different translators.
In translation studies within linguistics, researchers typically rely on a very small sample of translators and texts, mostly in the order of two translators and a few pieces of text. This manual process, while constrained in sample size, relies on the researchers’ solid linguistic expertise.
This poses the question of whether this manual, subjective process can be replaced with a computational, objective one.
1.2.3 The Computational Linguistic Challenge
The sample of papers reviewed above and others [194, 197, 9] show that linguistics offers evidence for the existence of stylometric differences between translators in a way that affects the translated texts. However, the area of automatic identification and extraction of translator stylometric features has not seen a comparable body of research.
We found very limited attempts to employ computational linguistics in translator identification. The first was by Baker, as discussed earlier [19]. Later, in 2001, Mikhailov and Villikka [121] tried to find “stylistic fingerprints” of a translator by extracting three lexical features: “vocabulary richness”, “most frequent words” and “favourite words”. Their experiment was conducted on Russian fiction texts together with their Finnish translations. The lexical features they used failed to find stylistic fingerprints of the different translators. Their conclusion was summed up in their title: that it is not possible to differentiate between translators. While this conclusion is inconsistent with traditional linguistic studies, another study, by Burrows in 2002, using delta analysis on most frequent words, supported these findings by reporting inconclusive results in translator identification [29]. These negative results seem to have been sufficient to turn researchers away from this line of research for almost 10 years. Recently, in 2011, Rybicki revisited the question of translator invisibility using cluster analysis, principal component analysis, and bootstrap consensus tree graphs based on Burrows’ Delta in three related studies [153, 63, 154]. Rybicki aimed to cluster translations into groups based on this method, expecting that the clusters would correspond to the translators [154]. Unfortunately, his approach clustered the translations by their original authors rather than by their translators. He emphasised the shadowy existence of translators, and supported Venuti’s vision of translators receiving minimal recognition for their work: not only in fame, fortune and law, but also in stylistically based statistics [154].
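To make these three lexical features concrete, the sketch below computes simple versions of them on a toy string. The exact definitions used by Mikhailov and Villikka differ from those here (vocabulary richness has several competing formulas, and “favourite words” are usually defined relative to a reference corpus); the plain type-token ratio, a raw frequency ranking, and a tiny hand-picked stop-word list are assumptions made purely for illustration.

```python
from collections import Counter

def lexical_profile(text, top_n=10):
    """Illustrative sketch of three lexical stylometry features:
    vocabulary richness (here: type-token ratio), most frequent
    words, and 'favourite' words (here: frequent non-function
    words -- a simplification, not the definition from [121])."""
    tokens = [w.lower() for w in text.split() if w.isalpha()]
    counts = Counter(tokens)
    richness = len(counts) / len(tokens)  # distinct types / total tokens
    most_frequent = [w for w, _ in counts.most_common(top_n)]
    # crude function-word list, for illustration only
    stop = {"the", "a", "an", "of", "and", "to", "in", "on", "is", "that"}
    favourites = [w for w in most_frequent if w not in stop]
    return richness, most_frequent, favourites

richness, mfw, fav = lexical_profile(
    "the cat sat on the mat and the dog sat on the log")
```

In practice, such profiles would be computed per translator over parallel translations of the same source text and then compared; as Chapter 3 shows, measures of this kind turn out to be too weak to separate translators reliably.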
The limited number of research studies that we found, and their findings, underline the need for further studies employing computational linguistics for translator stylometry identification. Figure 1.1 illustrates how research interest is distributed across the area of stylometry analysis. The listed studies are only samples of the existing literature on the stylometry analysis subproblems; more details are discussed in the background chapter (Chapter 2). As we can see, authorship attribution has gained the most interest amongst researchers, while the sub-area of translator identification has gained the least. Furthermore, social network analysis has not been investigated for the purpose of stylometry analysis.
While linguistic studies have highlighted the existence of translator differences, computational research has not explored this issue in depth. In this thesis we readdress the question of the existence of translator stylometry in light of the development of authorship attribution techniques in recent years. We also examine the features that can be used to distinguish different translators.
1.3 Research Question and Hypothesis
If we compare author attribution to translator attribution, we find that the former is expected to carry more signatures or discriminatory factors representing the choices made by the authors. Authors have many more degrees of freedom with which they can build their own identity; translators have fewer. A translator’s objective is to transparently transmit the mental picture contained in the original text into the target language. This transparency requirement limits translators’ linguistic choices compared to an author, who has far more freedom to draw his or her own mental picture. Being constrained by the original text is a non-trivial limitation, and this alone makes translator attribution a more difficult problem than author attribution. Nevertheless, we conjecture that translators leave their own touch: signatures that can be used to detect who translated what.
The scope of this thesis is the study of translator stylometry identification using a computational linguistics approach. It aims at answering the following research question:
[Figure 1.1: The Thesis Scope. The figure maps stylometry analysis problems (author identification, writer verification, plagiarism detection, translator identification, and author/translator profiling) against analysis approaches (manual linguistic analysis, statistical, machine learning, and the social network analysis approach taken in this thesis), with representative studies for each combination.]
Main Research Question: “Can a computational linguistics framework meet the challenges of the translator stylometry identification problem?”
This research question can be broken down into several sub-questions to investigate and identify possible features and methods, as follows:
• The first sub-question we attempt to answer is: “Which stylistic features can unfold a translator’s stylometry?”. In order to identify an appropriate feature set for translator stylometry identification, we investigate the use of network motifs, a novel feature in the area of stylometric analysis that has not previously been used for stylometry identification.
• The promising results that we found when researching the first question encouraged us to evaluate network motifs against other features for the problem of translator stylometry identification. This raised the second sub-question: “What is the performance of the network motifs approach compared to other approaches?”. To answer it, we needed to evaluate the different features on the same dataset. We first investigated the performance of existing methods, then evaluated the use of global network features as stylometry identification features, and further explored the performance of motifs of both size three and size four on the same dataset. The accuracy obtained in these experiments was lower than the accuracy we found while answering the first sub-question; we investigated the cause of that drop, and the details of that investigation are discussed in Chapter 4. That problem raised another issue: the effect of using different representations of the frequencies of network motifs in handling changes in text size.
• Researching the last element of the second sub-question revealed the existence of a hidden pattern that can only be uncovered by comparing paired instances representing the parallel translations under examination. This investigation led us to define a new type of classification problem that we call the Comparative Classification Problem (CCP), where a data record is a block of instances. In CCP, given a single data record with n instances for n classes, the problem is to map each instance to a unique class. The interdependency in the data poses challenges for techniques relying on the independent and identically distributed assumption. In the Pair-Wise CCP (PWCCP), two different measurements, each belonging to one of two classes, are grouped together. The key difference between PWCCP and traditional binary problems is that hidden patterns can only be unmasked by comparing the instances as pairs. This raised the third sub-question of this thesis: “What is an appropriate classification algorithm that can handle the Pair-Wise Comparative Classification Problem (PWCCP)?”
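The network motif idea referred to in these sub-questions can be illustrated with a minimal sketch: the text is transformed into a word-adjacency network, and the connected three-node subgraphs (open triads versus triangles) are counted as candidate stylometric features. The text-to-network construction and the motif set shown here are simplifying assumptions for illustration; the transformation and motif-mining procedure actually used in the thesis may differ.

```python
from collections import defaultdict
from itertools import combinations
import re

def text_to_network(text):
    """Build an undirected word-adjacency network: each distinct word
    is a node, and consecutive words are linked by an edge. One common
    text-to-network transformation, used here only as an illustration."""
    words = re.findall(r"[a-z']+", text.lower())
    adj = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:                  # skip self-loops from repeated words
            adj[a].add(b)
            adj[b].add(a)
    return adj

def count_size3_motifs(adj):
    """Count the two connected undirected subgraphs on three nodes:
    open triads (paths) and closed triads (triangles)."""
    open_triads = triangles = 0
    for node, neigh in adj.items():
        for u, v in combinations(neigh, 2):
            if v in adj[u]:
                triangles += 1      # seen once from each of its 3 vertices
            else:
                open_triads += 1    # path u - node - v, seen only at its centre
    return {"open": open_triads, "triangle": triangles // 3}

net = text_to_network("a b c a")    # edges a-b, b-c, c-a: one triangle
print(count_size3_motifs(net))      # {'open': 0, 'triangle': 1}
```

The relative frequencies of such motif types, computed per translation, would then serve as the feature vector for classification.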
1.4 Original Contributions
The main contribution of this thesis is supporting our claim that a computational linguistics framework can meet the challenges of the translator stylometry identification problem. In general, the contributions of the thesis can be summarised as follows:
1. Providing an empirical study supporting the use of computational linguistics, and specifically stylistic features, for identifying translator stylometry. This contradicts previous studies, which were unable to provide evidence of translator stylometry.
2. Demonstrating the effectiveness of network motifs in detecting translator stylometry.
3. Introducing a new model that can handle the Pair-Wise Comparative Classification Problem, which assists the identification of translator stylometry by comparing parallel translations.
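The pair-wise comparison at the heart of the third contribution can be sketched as follows: each PWCCP record holds one instance per class, and classification operates on the signed differences between the paired instances rather than on each instance in isolation. The feature values and weights below are hypothetical, and the weighted-vote rule is a simplified stand-in for, not a description of, the PWC4.5 decision tree proposed in this thesis.

```python
def pairwise_features(instance_a, instance_b):
    """Turn a PWCCP record (two instances, one per class) into a single
    comparative feature vector of signed differences. This captures the
    key idea that the pattern emerges only when the paired instances
    are compared, not when each is classified independently."""
    return [a - b for a, b in zip(instance_a, instance_b)]

def classify_pair(instance_a, instance_b, weights):
    """Toy comparative classifier: a weighted vote over the signed
    feature differences decides which instance gets class A and which
    gets class B. The weights are hypothetical, not learned."""
    diffs = pairwise_features(instance_a, instance_b)
    score = sum(w * d for w, d in zip(weights, diffs))
    return ("A", "B") if score >= 0 else ("B", "A")

# Two parallel translations, each described by two hypothetical
# motif-frequency features; the output maps texts to translators.
print(classify_pair([0.30, 0.10], [0.25, 0.40], weights=[1.0, -0.5]))
# -> ('A', 'B')
```

Note that swapping the two instances swaps the assignment, which is exactly the constraint a pair-wise classifier must respect and a traditional binary classifier does not.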
1.5 Organization of the Thesis
The remainder of the thesis is organized in five chapters as follows:
• Chapter 2 Background: This chapter provides a literature review of the different aspects of the stylometry analysis problem and of how the sub-problem of translator stylometry identification is addressed in the literature. It also discusses how the problem of translator stylometry identification can be viewed from a data mining perspective.
• Chapter 3 Data and Evaluation of Existing Methods: The choice and design of the dataset used in this study are discussed in this chapter. We then evaluate the performance of existing computational linguistics methods for the translator stylometry problem on the defined dataset.
• Chapter 4 Identifying Translator Stylometry Using Network Motifs: This chapter aims at answering the first and second sub-questions of the thesis through two main experiments. The first experiment investigates the possibility of using computationally derived features for the problem of translator stylometry identification, including the effects of data normalization, sample size, and class imbalance. The second experiment answers the second sub-question and further investigates different representations of the data in response to the variation in text size.
• Chapter 5 Translator Identification Problem Redefined as a Pair-Wise Comparative Classification Problem: The aim of this chapter is to address the third sub-question of the thesis. The translator identification problem is redefined as a Pair-Wise Comparative Classification Problem, and a new model based on the C4.5 decision tree algorithm is proposed to address this specific classification problem.
• Chapter 6 Conclusion and Future Work: This chapter summarises the contributions of the research, and discusses the future directions that stem from this work.
Chapter 2
Background
2.1 Introduction
Translator stylometry is a small but growing area of research in computational linguistics. Despite the proliferation of research in the wider field of stylometry analysis for authors, the translator stylometry identification problem is a challenging one, and there is insufficient literature on the topic. Stylometry, the study of literary style, focuses on the way of writing rather than the content of the writing. Over the last decades, authorship attribution and profiling have gained most of the attention in the area of stylometry analysis. The importance and applications of stylometry analysis are explored in detail in the next section.
These problems benefited from the development of computational linguistics and machine learning techniques, whereas translator stylometry analysis remained limited to literary style analysis. In order to identify an appropriate approach for translator stylometry identification, we need not only to understand the literary style analysis that has been used for identifying differences in translators’ writing styles, but also to extend our knowledge to the wider area of stylometry analysis, which involves both authors and translators. Stylometry analysis for authors and for translators is related: in both cases we analyse writing styles by identifying measurable features in the writing. However, it is important to note that the translator identification problem carries additional difficulties, because the translator has limited choices while translating compared with the freedom an author has while writing.
In this chapter, we explore the literature of the stylometry analysis area. We start with the analysis of author styles before translator styles for two reasons. The first is that a significant proportion of the work in stylometry analysis started from author identification. The second is to see which types of features and approaches employed in the authorship analysis literature can be applied to translator stylometry analysis. For both author and translator stylometry analysis, we identify the different types of problems and discuss the methods and approaches that have been used for each purpose. Figure 2.1, in conjunction with Table 2.1, provides a road map to the literature surveyed in this chapter.
2.2 Stylometry Analysis
Stylometry is the study of the unique linguistic styles and writing behaviours of individuals. Kestemont defined it as the quantitative study of (literary) style, nowadays often by means of computation [91]. Stylometry can be thought of as a measure of style. So, what is a style?
“Style is the variable element of human behaviour” [115]. Typical human activities may look similar: how people dress, eat, or drive is generally invariant, yet varies slightly from one person to another. The procedure, along with the outcome, of these kinds of activities is usually much the same for everybody, but the way an individual goes about the course of action to reach the outcome varies noticeably from person to person.
Style in written language is generated by the repeated choices that the writer tends to make. These repeated choices are hypothesised to reflect the writer’s unconscious behaviour, or preference for some writing patterns over others, to represent the same meaning.
[Figure 2.1: Road Map to Research on Stylometry Analysis. Numerical tags correspond to entries in Table 2.1. The figure organises the literature by problem and approach: author profiling (1.1 statistical, 1.2 machine learning), author identification (2.1 manual linguistic analysis, 2.2 statistical, 2.3 machine learning), writer verification (3.1 manual linguistic analysis, 3.2 statistical, 3.3 machine learning), plagiarism detection (4.1 machine learning), translator profiling (5.1 manual linguistic analysis, 5.2 statistical), and translator identification (6.1 statistical, 6.2 machine learning).]
[Table 2.1: Summary of Literature Survey on Stylometry Analysis. References refer to the numerical tags assigned in Figure 2.1; columns are Ref, Year, Authors, Method, Features, and Corpus. The table surveys studies from Witter (1711) to Heydel and Rybicki (2012) across author profiling, author identification, writer verification, plagiarism detection, translator profiling, and translator identification. The translator identification entries (tags 6.1–6.2) comprise Baker (2000), Mikhailov and Villikka (2001), Burrows (2002), Rybicki (2011, 2012), and Heydel and Rybicki (2012), all relying on lexical features (vocabulary richness, most frequent words) analysed with Burrows’s Delta, PCA, and clustering.]

Abbreviations used in Table 2.1: AAAC: Ad-hoc Authorship Attribution Competition; ADA: AdaBoostM1; BMR: Bayesian Multinomial Regression; BNC: British National Corpus; C4.5: C4.5 decision tree algorithm; CL-CNG: cross-language character n-grams; CW: content words; FW: function words; ICLE: International Corpus of Learner English; IR: information retrieval; JRIP: rule-based learner; KNN: k-nearest neighbours; LR: linear regression; M5D: M5 decision tree; M5R: M5 regression tree; MFW: most frequent words; MVA: multivariate analysis; NB: Naïve Bayes; NSC: nearest shrunken centroids; POS: part of speech; PPM: Prediction by Partial Matching compression algorithm; RDA: regularized discriminant analysis; REP-T: REP-Tree decision tree; SMO: support vector machine trained with sequential minimal optimization; SVM: support vector machine; TiMBL: Tilburg Memory-Based Learner; VR: vocabulary richness.
thing. Group and individual variation in written language can be observed by examining style. Linguistic group variation can be observed, for example, through sociolinguistics and discourse studies, which examine the use of language in various contexts or the effect of social factors such as age or gender on the use of language [69, 188]. For example, a sociolinguistic study by Argamon et al. showed that males tend to use determiners (a, the, that, these) and quantifiers (one, two, more, some) more than females, whereas females use pronouns (I, you, she, her, their, myself, yourself, herself) more frequently than males [13]. Additional examples of sociolinguistics include the examination of linguistic characteristics of teenagers versus elderly people, and the observation of the use of English by Chinese Australians and Italian Australians. Linguistic variation is affected by different factors such as age, culture, race, geography, social class, gender, education level, and specialisation. Despite the similarities in the speech or writing of specific groups of language users, there can be individual characteristics which contribute to individual distinctive styles. With respect to writing, individual variation is created by the writer's decision to pick one particular form out of the assortment of all possible forms. These variations can be within the norm, which are different correct ways of expressing the same thing, or deviations from a norm, which may be mistakes or idiosyncratic behaviour of the writer. An example by McMenamin [115] describing grammatically correct variation within the norm: if the norm is "I am going now", a variation within the norm can be "I'm going now", and a deviation from the norm "I be goin' now".
Another example describes a socially appropriate variation of the norm: for "I'm afraid you're too late", a variation is "Sorry, the shop is closed", while a deviation may be "Get the hell out of here!". As style constitutes distinctiveness, identifying a writer's distinctive markers is the key to identifying their style, and analysis of variation is the first step towards identifying style markers.
In the following section, we discuss the different problems addressed using stylometric analysis: how stylometry analysis is used to profile an author or translator, the identification of authors and translators based on their writing style, the verification of the author of a text as used in forensic linguistics, and finally the use of stylometry analysis for plagiarism detection.
2.2.1 Authors' Stylometric Analysis (Problem Definitions)
The development of computational tools and the growing interest in forensic analysis, humanities scholarship, and electronic commerce have led to growing interest in the authorship attribution problem. This general problem was subdivided into more specific sub-problems. The first is Author Profiling, which involves making inferences about the gender, age group, or origin of an author on the basis of their writing style [168, 97]. Another form of the problem is Author Identification, which involves identifying who wrote a piece of text from a set of candidate authors by analysing their writing styles; attributing authorship of an anonymous text comprises the identification of stylistic similarities to previously known texts belonging to the candidate authors. Author Verification is another sub-problem, where the objective is to determine whether a text was written by the claimed author. The fourth sub-problem is Plagiarism Detection, which attempts to identify plagiarism by analysing the similarity between two pieces of text.
2.2.1.1 Author profiling
Writer profiling is a stylometry analysis sub-problem concerned with extracting as much information as possible about an unknown author by analysing his/her writing. This information may include gender, personality, cultural background, etc.
Researchers addressed author profiling and translator profiling in different ways. For author profiling, the focus is on observing sociolinguistic behaviour. Researchers analyse how a particular group of people uses the language differently from other groups. These observations can be collected by grouping people based on gender [13], age [158], native language [97], or personality [137]. Then, these observations can be applied to a similar unknown text to extract information about its author.
The importance of the author profiling problem is growing with recent developments in forensic linguistics, security applications, and commercial marketing. An example of such importance in the area of forensic linguistics is giving the police the chance to identify the characteristics of the person behind the crime under investigation.
An example in the area of marketing is the analysis of weblogs and product review websites. Author profiling can help in identifying the characteristics of groups of customers who do not like particular products. The company concerned can then use these identified characteristics to develop a marketing strategy that matches the needs of the unsatisfied customers.
Gender-based stylistics
Extracting the gender of the author of a text has been studied using different corpora in the literature. Corney et al. addressed the problem of author gender profiling of email text documents based on the gender-preferential language used by the email writer [38]. Koppel et al. found that male authors use determiners much more than female authors, while female authors tend to use pronouns and negation more than male authors [94]. This research was followed by another experiment by Argamon et al. to further investigate the use of pronouns and certain types of noun modifiers that vary significantly between documents belonging to male authors and female authors in order to determine author gender [13].
In 2007, Argamon et al. analysed weblog data to explore what types of features can be used to determine the gender and age of authors [14].
They found that Articles and Prepositions are used significantly more by male bloggers, while Personal Pronouns, Conjunctions, and Auxiliary Verbs are used significantly more by female bloggers [14].
Age stylistic features
Age-specific characteristics were also studied by Argamon et al. in the same 2007 research mentioned earlier [14]. They found that the usage of words associated with Family, Religion, Politics, Business, and Internet increases with age, while the usage of words associated with Conversation, AtHome, Fun, Romance, Music, School, and Swearing decreases significantly with age. Another observation of that experiment was that the use of Personal Pronouns, Conjunctions, and Auxiliary Verbs decreases significantly with age, while the usage of Articles and Prepositions increases significantly with age. More studies that have found age-linked differences include research conducted by Schler et al. [158] and by Argamon et al. in 2009 [15].
Another research attempt, introduced by Tam and Martell, used the NPS Chat Corpus to determine whether the writer of a text is a teenager or belongs to another age group. This research demonstrated that n-grams are useful for sensing age, in addition to demonstrating the challenge of distinguishing between consecutive groups such as teens and 20s [174]. In 2011, Rosenthal et al. used lexical stylistic features, content, and online behaviour to predict the age of bloggers of the virtual community LiveJournal using logistic regression and support vector machines [149].
Native language
To identify the native language of an author, researchers have used the International Corpus of Learner English (ICLE), which contains essays written by intermediate to advanced learners of English. Koppel et al. used a set of function words and character n-grams as features, in addition to 185 error types, including misspellings and syntactic errors such as repeated letters, letter substitution, letter inversion, and conflated words [97]. These features were used to identify five native languages of the writers: Bulgarian, Czech, Russian, French, and Spanish. Later, Tsur and Rappoport formed the hypothesis that the choice of words people make when writing in a foreign language is strongly influenced by the phonology of their native language [179]. Thus, they used only character n-grams to identify the native language of a text's author. The 65.6% accuracy achieved by their approach was enough to validate their hypothesis against a 20% baseline, but it was less than the 80.2% accuracy achieved by Koppel et al. using a different combination of features on the same dataset [97].
Further support for the use of syntactic errors as a clue to an author's native language was produced by Wong and Dras in 2009 [196]. They structured their approach on the contrastive analysis hypothesis, whereby the common errors that a learner of a language makes can be explained by observing differences between the learner's native language and the learned language.
In 2012, Tofighi et al. used online news texts as a corpus to identify the author's native language among four languages: English, Persian, Turkish, and German [177]. They used a combination of lexical, syntactic, structural, and content-specific features for that purpose and demonstrated that this combination was able to identify the author's native language with accuracy ranging from 70% to 80% using SVM.
Personality
Human personality is usually described in terms of the well-known Big Five personality dimensions: Openness to experience (Imagination, Creativity vs. Conventionality), Conscientiousness (Need for achievement, Organization vs. Impulsiveness), Extraversion (Sociability, Assertiveness vs. Quietness), Agreeableness (Kindness vs. Unfriendliness), and Neuroticism (Calmness, Emotional stability vs. Self-consciousness, Anxiety) [113].
Pennebaker and King identified modest but reliable effects of personality on word choice by analysing numerous text samples from students [136]. The correlations ranged between 0.10 and 0.16. Neuroticism correlated positively with the use of negative-feeling words and negatively with positive-feeling words. Examples of positive-feeling words include happy, love, beautiful, nice, exciting, and win; negative-feeling words include ugly, hurt, anxiety, fear, anger, sadness, and depression. Extraversion correlated positively with positive-feeling words as well as with words associated with social processes. Agreeableness was positively linked to positive-feeling words and negatively to negative-feeling words. Furthermore, Neuroticism was associated with more frequent use of the first person singular. A further study by Pennebaker et al. in 2003 aimed to identify the Neuroticism level of text writers [137]. They used particles (pronouns, articles, prepositions, conjunctives, and auxiliary words) and function words for that purpose. They found that more frequent use of first person pronouns is associated with a high degree of self-involvement.
Argamon et al. used four lexical feature sets to identify Neuroticism and Extraversion in 2005 [12]: a standard function word list, conjunctive phrases, modality indicators, and appraisal adjectives and modifiers [12]. They found that appraisal lexis (word choices that reflect appreciation, affect, or judgement) is effective in identifying the Neuroticism level, while function words are more appropriate for identifying Extraversion.
In 2006, Mairesse and Walker used machine learning algorithms to model the Big Five personality traits [110]. Their findings showed that Extraversion, Neuroticism, and Conscientiousness are easier to model than Openness and Agreeableness. Moreover, analysing the text using the constructed recognition models outperformed models based on self-reports [110].
Further research on analysing language to predict personality traits was conducted in the same year by Oberlander and Nowson. They used word bigrams and trigrams as features to analyse weblogs in order to predict four psychometric aspects: Neuroticism, Agreeableness, Extraversion, and Conscientiousness [131].
Later, Estival et al. extracted character, lexical, and structural features from email messages in English [45] and Arabic [46] in order to profile five demographic characteristics of authors: gender, age, geographic origin, level of education, and native language, as well as the Big Five personality traits. In 2009, Argamon et al. used content-based and style-based features to profile gender, age, native language, and neuroticism in English blog posts and student essays [15].
All of these studies demonstrate how the use of language varies significantly with these different factors. Their findings also underline the value of computational linguistics in identifying the stylistic features of texts.
2.2.1.2 Author Identification
The authorship attribution problem is not a recent research area. Biblical authorship disputes trace back to 1711 and H. B. Witter [132]. In 1785, the authorship of Shakespeare's plays was disputed: Wilmot raised the claim that Bacon was the real author of the plays [132]. Another study of Shakespeare's plays was conducted by Mendenhall in 1887 using word-length distributions; it was later criticised because the observed differences were attributable to the differences between poetry and prose rather than to author style [162]. Later on, multiple researchers targeted the authorship attribution dispute over Shakespeare's plays: Matthews and Merriam in 1993 [114] and 1994 [118], Merriam in 1996 [119], Binongo and Smith in 1999 [21], and, recently, Zhao and Zobel in 2007 [202].
Another famous authorship dispute concerns the Federalist papers, one of the most studied problems in this research area [23, 127, 151, 82]. These articles appeared in New York newspapers under the pseudonym "Publius". Twelve of the 85 essays were claimed by both Hamilton and Madison. These twelve disputed papers have attracted many authorship attribution studies over the last decades. The first such research, to my knowledge, was introduced in 1897 [23]. It was followed by many studies, but the most popular was the one conducted by Frederick Mosteller and David Wallace in 1964 [127], which Rudman [151] described as the "most statistically sophisticated non-traditional study ever carried out". That research has been cited hundreds of times by researchers interested in its statistical techniques, authorship attribution, and the Federalist political beliefs themselves. Since then, the Federalist papers have been used by researchers wanting to test a new method or hypothesis in authorship attribution. Recent examples include Hedegaard and Simonsen in 2011 [62], who used the Federalist papers to prove the validity of their authorship methods, and Jockers and Witten in 2010, who used them as the corpus for a comparative analysis of five authorship attribution methods [82].
In 2004, Patrick Juola invited authorship attribution scholars to participate in an authorship attribution competition as part of the 2004 Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities (ALLC/ACH 2004). The purpose of this competition was to conduct a benchmark study of the new methods that researchers had recently introduced, and it attracted 12 participating teams [83]. The dataset also attracted researchers after the competition who were interested in testing their methods against others. One example of recent research using this dataset is Layton et al. in 2011 [99], who generated authors' profiles and then attributed texts to authors based on the best-matching profile. Another example is Luyckx [107], who used part of this data in 2011 to identify the effect of dataset size on the attribution problem.
More examples of the authorship attribution problem will be discussed later in this chapter alongside the discussion of stylometric features and methods.
2.2.1.3 Writer Verification
McMenamin highlights the existence of individual writer variation: "No two individuals use and perceive language in exactly the same way, so there will always be at least small differences in the grammar each person has internalized to speak, write, and respond to other speakers and writers" [115].
Given examples of the writing of a single author, the authorship verification task aims to determine whether given texts were or were not written by this author [182]. From a machine learning point of view, it can be seen as a categorization problem or a one-class classification problem. Some of the studies in this area that used authorship attribution techniques are those of Halteren [182, 183], Koppel et al. [95], and Luyckx and Daelemans [106]. Most of the research related to writer verification is conducted from a forensic linguistics perspective.
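The one-class view of verification can be illustrated with a minimal sketch: build a profile of the claimed author from known texts and accept a questioned text only if it is sufficiently close to that profile. Everything here — the ten-word function-word list, the cosine similarity, and the 0.9 threshold — is an illustrative assumption of ours, not a method from the cited studies:

```python
import math
from collections import Counter

# A tiny illustrative function-word list; real systems use hundreds.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "that", "is", "it", "for"]

def fw_vector(text):
    """Relative frequencies of the function-word list in a text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def verify(known_texts, questioned_text, threshold=0.9):
    """Accept the claimed author if the questioned text's function-word
    profile is close to the centroid of the known texts' profiles."""
    vectors = [fw_vector(t) for t in known_texts]
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
    return cosine(centroid, fw_vector(questioned_text)) >= threshold
```

The design point is that verification never compares candidate authors against each other: the decision rests entirely on distance from a single author's profile, which is what makes it a one-class problem.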
Although authorship attribution is not a new research area, forensic linguistics is considered new. The first appearance of the phrase "Forensic Linguistics" was in 1968, when Jan Svartvik used it in an analysis of statements by Timothy John Evans [132]. After that, the growth of forensic linguistics was slow until the beginning of the 1990s, when a new wave in the development of the field came with a 1989 murder trial at the Old Bailey in which expert linguistic evidence was used in court for the first time [132]. The International Association of Forensic Linguists was founded in 1992, followed by the International Journal of Speech, Language and the Law in 1994.
Forensic linguistics is defined as the scientific study of language as applied to forensic purposes and contexts [115]. It does not focus on handwriting recognition, or on the physical source or age of, say, a suicide note. Rather, forensic linguistics is concerned with author identification and stylometric analysis [41, 167, 135, 95, 134], author profiling [182, 94], discourse analysis [78], and forensic phonetics [81, 32], and it contributes significantly to crime investigations.
One of the most prominent examples of the importance of forensic linguistics is the case of "Jenny Nicholl", a nineteen-year-old girl who went missing on 30 July 2005. After her disappearance, text messages were sent from her mobile phone to her family and friends. These messages were suspected of not having been sent by her, so they were analysed linguistically to determine whether she was their author. Some differences from her known writing were identified, such as the use of ME/MESELF rather than MY/MYSELF, which is how she used to write them. Other pairs of words encountered in the analysis include (IM / I AM), (OFF / OF), (CU / CYA), (FONE / PHONE), and (SHIT / SHITE). David Hodgson, her ex-lover, was convicted of Nicholl's murder in February 2008 with the aid of this linguistic evidence [39].
2.2.1.4 Plagiarism Detection
Another area related to the field of translator stylometry is plagiarism detection. There are many types of plagiarism: it can start at the level of copying an idea and extend to text operations such as exact copying or translation from one language to another. It may also occur at the level of the sentence, through merging, splitting, or paraphrasing of the text, while plagiarism at the word level may include addition, deletion, or substitution.
Plagiarism detection is divided into two main types: extrinsic analysis and intrinsic analysis. In the extrinsic case, we need to diagnose plagiarism by discovering near matches of some text in a database of texts. The extrinsic detection problem has been addressed by different researchers in recent years [198, 20, 42, 157, 43]. In intrinsic detection, we analyse the suspicious document in isolation to show that various areas of a single-author document could not have been authored by the same writer. Intrinsic plagiarism can be detected through stylistic inconsistencies within the explored text. Notably, this is similar to the problem of authorship verification, except that in the authorship verification problem the tested corpus is not only a single document; intrinsic plagiarism detection can therefore be seen as a generalization of the authorship verification problem [8]. Research targeting intrinsic detection includes [169, 171, 172, 120]. For further information, we refer the reader to the plagiarism detection surveys by Stamatatos and Koppel [169], Osman et al. [133], and Ali et al. [6].
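The extrinsic setting — finding near matches of a suspicious text in a database — can be illustrated with a simple word n-gram overlap measure. The Jaccard threshold of 0.2 and the trigram size below are arbitrary illustrative choices, not values from the cited surveys:

```python
def ngram_set(text, n=3):
    """The set of word n-grams occurring in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def find_near_matches(suspicious, database, n=3, threshold=0.2):
    """Return database documents whose word n-gram overlap with the
    suspicious text meets the threshold."""
    s = ngram_set(suspicious, n)
    return [doc for doc in database if jaccard(s, ngram_set(doc, n)) >= threshold]
```

Verbatim copying drives the n-gram overlap towards 1.0, while paraphrased or translated plagiarism breaks exact n-gram matches — which is why the cross-language approaches in the table above need translation or character-level models rather than direct word n-gram comparison.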
2.2.2 Translators Stylometric Analysis (Problem Definitions)
2.2.2.1 Translator profiling
The translator profiling problem has been dealt with differently from author profiling. The focus of the research conducted in the literature is on how gender [100, 156], proficiency level [33], and social [101] and cultural backgrounds [101] affect translations. Most of the research in this area analysed two different parallel translations of the same original text by different translators to address how the translators' identities might have affected the choices they made throughout the process of translating the text.
Translators' backgrounds have been shown to affect their style, and most translation analysis studies in the literature targeted this area of research. Researchers were interested in analysing how two translations of the same text by two translators (which we will refer to as parallel translations) differed in delivering meaning and mental pictures based on the translators' identities, including their cultural backgrounds and social and political views.
Scholars used different linguistic approaches to detect translator styles. Xiumei used relevance theory to explain the translator's style [197]. The findings of Xiumei's research demonstrated that, while the translator tries to balance the original author's communicative intentions against the target reader's cognitive environment, s/he is still influenced by his/her own preferences and abilities; the outcome of all of this constitutes his/her style [197].
Rybicki used Burrows's Delta to investigate character idiolects in two English translations of Henryk Sienkiewicz's Trilogy in terms of major characters, old friends, nationality, characters in love, and the idiolects of female characters. That study found that the characters' idiolects were preserved in the translations: Burrows's Delta was able to capture similar distances between characters in both the original text and the translations [152].
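Burrows's Delta itself is straightforward to compute: each most-frequent-word rate is standardised into a z-score against a reference corpus, and Delta is the mean absolute z-score difference between two texts. The sketch below, with made-up frequencies, follows that standard definition rather than any specific implementation from the cited studies:

```python
import statistics

def burrows_delta(freqs_a, freqs_b, corpus_freqs):
    """Mean absolute difference of z-scored MFW rates between two texts.

    freqs_a, freqs_b: word -> relative frequency in each text.
    corpus_freqs: word -> list of relative frequencies across the
    reference corpus, from which the mean and standard deviation come.
    """
    diffs = []
    for word, samples in corpus_freqs.items():
        mu = statistics.mean(samples)
        sigma = statistics.stdev(samples)
        z_a = (freqs_a.get(word, 0.0) - mu) / sigma
        z_b = (freqs_b.get(word, 0.0) - mu) / sigma
        diffs.append(abs(z_a - z_b))
    return sum(diffs) / len(diffs)

# Made-up MFW rates for a tiny reference corpus of three texts.
corpus = {"the": [0.05, 0.07, 0.06], "of": [0.03, 0.05, 0.04]}
```

The z-scoring is the crucial step: it prevents very common words such as "the" from dominating the distance simply because their raw frequencies are larger.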
Kamenická explored how explicitation contributes to translators' style in two parallel English to Czech translations of "Small World" by David Lodge and "Falconer" by John Cheever in 2008 [84]. Explicitation happens when the translator transfers a message that was implicit (but understandable from the context) in the original text to the reader explicitly in the target language. Implicitation, on the other hand, occurs when the translator uses the target language to conceal details that were mentioned explicitly in the source language. Kamenická's findings conclude that the two translators used experiential and interpersonal explicitation and implicitation in textual segments differently.
Regarding the identification of a translator's gender, as in author profiling, some research studies have addressed this problem. In 2007, Leonardi conducted a contrastive analysis of an Italian to English translation corpus to address the question of how gender and ideology affect translation [100]. The same question was addressed again in 2011 by Sabet and Rabeie [156], who studied the effect of the translators' gender ideology on their translations using two Persian translations of the English novel "Wuthering Heights" by Emily Brontë, one by a male translator and the other by a female translator.
In 2009, Castagnoli investigated the possibility of a relationship between the occurrence of specific phenomena and translator competence [33]. For that investigation, she used a corpus of student translations (from English to Italian and from French to Italian), which provided multiple parallel translations of the same original texts at different levels of translation competency.
Winters conducted multiple studies on how a translator's attitude influences his/her translation [191, 192, 193, 194]. In all of these studies, Winters used two German translations of the novel "The Beautiful and Damned" (1922) by F. Scott Fitzgerald. In 2004, Winters used loan words and code-switches to differentiate between translators' styles [191]. The analysis showed that one of the translators tended to transfer English words from the source text into the translation where possible, while the other tended to Germanize the words to carry the source-text culture towards the target-language reader. Later, in 2007, Winters used speech-act report verbs [192] to investigate their usefulness as potential elements of a translator's individual style. Although the original text repeated some words, one translator transferred that repetition to the translation, while the other avoided it and used different words to reflect different situations. In a 2009 study, Winters conducted a quantitative analysis of the translators' use of modal particles. That research showed that, despite overall similarities in the use of modal particles, there was a significant difference in the translators' choice and use of individual modal particles [193]. In 2010, Winters' study showed that the translators' different views affect the macro level of the novel, in which the main messages delivered by the translations differ: the focus of one translator was to provide a character study, while the other focused on societal issues. Furthermore, Winters discussed how this may extend to influence readers' attitudes as well [194].
Li et al. [101] tried to capture differences in the translation styles of two English translations of the classic Chinese novel "Hongloumeng". They calculated type/token ratios, sentence length, and vocabulary. The analysis in that study aimed at differentiating between the styles with respect to the social, political, and ideological contexts of the translations. They also explored the effect of the translators' native language on their translation styles, as one of the translators was a native Chinese speaker and the other a British scholar. They found that the two translators used two different translation strategies, and attributed these variations to the translators' social, political, and ideological preferences, as well as the primary purposes of their translations.
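The two measures reported by Li et al. — type/token ratio and sentence length — are simple to compute. The following is a generic sketch with naive tokenisation and sentence splitting, not their actual procedure:

```python
import re

def type_token_ratio(text):
    """Distinct word forms (types) divided by total words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_sentence_length(text):
    """Average number of words per sentence, splitting on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)
```

A caveat worth noting is that the raw type/token ratio falls as text length grows, so comparisons between translations are only meaningful over samples of equal size (or using a length-normalised variant).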
Wang and Li looked for translators' fingerprints in two parallel Chinese translations of Ulysses using keyword lists. They identified different preferences in the choice of keywords by the different translators. They also found differences at the syntactic level by analysing decisions about clause positions within sentences [187].
Additionally, their findings affirmed the hypothesis made in their study that a writer's preferences in linguistic expression are demonstrated in free writing.
All the above studies can be used as evidence for the existence of translators' fingerprints in their translations, and they provide substantial background for this research. However, despite the variety of linguistic approaches that these studies introduced, none of them employed data mining or machine learning, not even the quantitative studies.
2.2.2.2 Translator Identification
Previously, in the translator profiling section, we discussed the literature related to the analysis of variation in translations by different translators. The studies mentioned revealed how translators as individuals use linguistic features differently to deliver the same original text. Their identities are reflected in the choices they make while translating, and analysing their translations demonstrates the variation in these choices, which constitutes their own translation styles.
In 2000, Baker discussed the existence of translator style: “it is as impossible to produce a stretch of language in a totally impersonal way as it is to handle an object without leaving one’s fingerprints on it” [19]. Baker suggested studying translator styles using forensic stylistics rather than literary stylistics [19]. According to Baker’s description, literary stylistics is generated by the choices that translators make consciously. Forensic stylistics, on the other hand, reflects unconscious linguistic habits, linguistic preferences that translators themselves do not realise they have.
The identification of the translator of a piece of text did not attract researchers’ attention, partly because some research produced conflicting results. Furthermore, a research study in 2011 considered attributing translated texts to their original authors rather than their translators, treating the translator’s contribution to the text as noise [62]. Their study investigated the use of semantic features for authorship attribution of translated texts. They based their study on the expectation that the most significant effect of the translator is seen on the lexical and syntactic levels, while the strongest influence of the author is on the semantic level. In other words, there was an expectation that translations and originals share the same semantic content.
In 2001, Mikhailov and Villikka questioned the existence of translators’ stylistic fingerprints [121]. That research is based on a parallel corpus of Russian fiction texts and their translations into Finnish. They used vocabulary richness, word frequencies, and favourite words. Their analysis shows that the language of different translations of the same text performed by different people is closer than that of different translations by the same translator. They concluded that, despite the existence of some translator preference patterns, the existing authorship-attribution techniques they evaluated failed to identify translators’ styles. In their words, “it appeared as if translators did not have a language of their own” [121]. Their conclusion was summed up in their title: “Is there such a thing as a translator’s style?”.
In 2002, Burrows proposed Delta analysis for authorship attribution. In his first trial, he worked on translations as well: Burrows examined fifteen translations of Juvenal’s tenth satire alongside a body of English Restoration poetry. With the Delta distance, the output is a table of authors ranked from the most likely to the least likely. Interestingly, Dryden ranked 9th out of 25 on his own translation. While Johnson’s style was correctly identified by Delta, Vaughan and Shadwell appeared significantly down the rankings on their own translations.
Recently, in a number of studies in 2011 and 2012, Rybicki and others [153, 154, 63] investigated the problem of translator stylometry attribution by employing a well-known technique for authorship attribution called Burrows’s Delta [29], which is based on the z-scores of word frequencies. Burrows’s Delta has been used successfully for authorship attribution in multiple studies [72, 73, 55, 11, 163]; a brief discussion of it is provided in section 2.2.3.2. They submit the calculated z-scores to cluster analysis to produce tree diagrams for a given set of parameters, such as the number of MFWs studied, pronoun deletion, and the culling rate. Based on the culling rate, a decision is made to include a specific word in the analysis. The results produced by a great variety of parameter combinations are then used as input to a bootstrap procedure. Based on the generated tree, they analysed how the translations were grouped within the same branches.
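The mechanics of Burrows’s Delta can be sketched as follows. This is a minimal illustration rather than Rybicki’s pipeline: tokenisation is naive whitespace splitting, the culling and bootstrap steps are omitted, and the function name and sample texts are invented for the example.

```python
from collections import Counter
import statistics

def burrows_delta(disputed, candidates, n_mfw=30):
    """Burrows's Delta, roughly: z-score the relative frequencies of the
    most frequent words (MFWs) across all texts, then score each candidate
    by the mean absolute z-score difference from the disputed text."""
    all_tokens = [t.lower().split() for t in [disputed] + candidates]
    rates = []
    for toks in all_tokens:
        counts = Counter(toks)
        rates.append({w: counts[w] / len(toks) for w in counts})
    corpus = Counter(w for toks in all_tokens for w in toks)
    mfw = [w for w, _ in corpus.most_common(n_mfw)]
    scores = []
    for cand in rates[1:]:
        diffs = []
        for w in mfw:
            col = [r.get(w, 0.0) for r in rates]
            sd = statistics.pstdev(col)
            if sd == 0:   # word used identically everywhere: no signal
                continue
            mu = statistics.mean(col)
            z_disp = (rates[0].get(w, 0.0) - mu) / sd
            z_cand = (cand.get(w, 0.0) - mu) / sd
            diffs.append(abs(z_disp - z_cand))
        scores.append(statistics.mean(diffs))
    return scores  # lower score = stylistically closer candidate
```

A candidate whose most-frequent-word profile tracks the disputed text receives a lower score, which is why the method leans so heavily on function words.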
In the first study, Rybicki employed this method to investigate the contribution of the translator Jeremiah Curtin’s wife, Alma Cardell, to his translations. Rybicki discussed the literature showing that the Memoirs of Jeremiah Curtin (1940) is proven to be the work of his wife. In Rybicki’s investigation, that memoir was clustered in a different branch together with some other suspected literary works. In the second study, Heydel and Rybicki employed the same method to investigate whether it can differentiate the collaboration between translators on a single literary work. The case that they investigated was Virginia Woolf’s Night and Day, which consists of 36 chapters. The first translator, Anna Kołyszko, died after translating the first 26 chapters; another translator, Heydel, then translated the remaining chapters. The method succeeded in clustering the translations according to their translators. Heydel and Rybicki highlighted that, despite the success of these investigations, the detected translator signature may be lost if investigated in the context of a larger corpus.

In 2012, in another trial of translator stylometry attribution, Rybicki conducted a research study under the title “The Great Mystery of the (Almost) Invisible Translator: Stylometry in Translation” [154]. The title reveals the challenge of identifying the translator of a piece of text. Rybicki’s approach failed to attribute texts to their translators using machine learning techniques and the most frequent words. He concluded that, except for a few highly adaptive translations, the investigated method failed to identify the translator of the text but identified the author instead. Rybicki found that in most cases the translations were grouped based on the original author rather than the translator. For that study, he used a corpus of translations in multiple languages: Polish, English, French, and Italian.
He tested each language group of the corpus separately. Rybicki emphasises that his study supports Venuti’s observation on the translator’s invisibility, and concluded that multivariate analysis of the most frequent words condemns translators to stylometric invisibility in the case of a large corpus of translations [154].
2.2.3 Methods of Stylometric Analysis
The most important function of language is communication and understanding the messages produced during communication, and it has become the subject of different fields. Linguistics is the scientific study of language, and it explores the language system at different levels: phonetics, phonology, morphology, lexicon, syntax, and semantics. Observing variation across these different levels shows the linguistic variations that may occur between different groups as well as between individuals. These variations may appear in pronunciation as a variation in accent, in word choice as a variation in lexicon, word spelling, or morphology, in punctuation choices, or in preferences among grammar rules. McMenamin [115] gave an interesting example of written variation that included 23 different forms of the same word “strawberry” on road signs in a single state. The analysis of variation is the first step in the identification of style markers, since style is about distinctiveness.
Linguistic variation is observed between groups and between individuals. For example, sociolinguistics studies the language of social groups such as teenagers or cultural groups. Group variation is affected by factors such as age, culture, race, geography, social class, education level, and specialisation. Despite the existence of similarities in writing produced by specific groups of language users, there are still individual decisions made by the writer which can contribute to the study of individual distinctive markers. These distinctive markers can be used to identify stylometry. In the next subsection, we discuss which types of features have been used for stylometry analysis. In the subsection after that, the different approaches and methods are discussed in detail.
2.2.3.1 Features
Researchers used many stylistic features for attributing authorship. These features can be categorized into five main groups: lexical, character, syntactic, semantic, and application-specific features.
Lexical features include all the features that are associated with analysing the sentence as a sequence of tokens or words, such as: word length [117], sentence length [125], word frequencies, word n-grams, vocabulary richness, and typographical errors.
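Several of these lexical measures are simple to compute. The sketch below is a naive illustration; the tokenisation regex and the function name are our own choices for the example, not a standard taken from the cited studies.

```python
import re

def lexical_profile(text):
    """A few of the lexical features named above, computed naively.
    Type/token ratio is the classic vocabulary-richness measure."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "type_token_ratio": len(set(words)) / len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "mean_sentence_length": len(words) / len(sentences),
    }
```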
Character features constitute a stylistic variation when the text is analysed as a sequence of characters. Examples include counting character types (letters, digits, or punctuation) and counting character n-grams of either fixed or variable length. Using compression methods for authorship attribution analyses the text as a sequence of characters as well.
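Fixed-length character n-gram counting, for instance, reduces to a one-line frequency profile. The function below is a minimal hypothetical sketch of the idea:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Frequency profile of fixed-length character n-grams,
    treating the text as a raw character sequence."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```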
Syntactic features are used when the analysis is done on the syntactic level, where similar syntactic patterns can be captured. This group includes frequencies of part-of-speech (POS) tags and chunks, sentence and phrase structure, frequencies of rewriting rules [18], and function words. These features are extracted using Natural Language Processing (NLP) tools such as tokenizers, sentence splitters, POS taggers, text chunkers, and partial and full parsers [168].
Semantic features need deeper analysis to capture. Semantic dependency graphs were used by Gamon [54] for author identification. McCarthy et al. [138] used the synonyms and hypernyms of words to extract semantic measures. Defining a set of functional features that associate certain words or phrases with semantic information was another approach, introduced by Argamon [16].
Application-specific features are important if the attribution is related to a specific application, such as attributing the author of an email message or a writer in online forums. In such cases, features associated with text organization and layout are important, like font colour count, font size count, the use of indentation, and paragraph length [1, 2]. Furthermore, some structural features are important, such as greetings and farewells in messages and the types of signatures being used.
2.2.3.2 Approaches
1. Manual linguistic analysis
Early authorship dispute cases, like biblical authorship disputes and the plays of Shakespeare, were first analysed by human linguistic experts. That changed in 1964, when Mosteller and Wallace employed Bayesian statistical analysis for the problem of authorship attribution of “The Federalist Papers”. Despite the shift of authorship attribution studies from human to computational models of analysis, the forensic linguistics area still prefers human expert-based methods [132]. Olsson argues this is an appropriate method because: first, most of the existing computational methods are based on the availability of a large amount of data, while forensic linguists face the challenge of very limited samples of data in real-case scenarios, so the linguistic expert needs to identify the possible stylometry markers according to the available text and its type. Secondly, forensic linguists do not see the automation of authorship attribution as an important forensic aim; on the contrary, they may find it a dangerous practice that may mislead them. Thirdly, in court, linguists need to deliver and explain their opinion; their judgement as experts is part of the process [132]. A number of interesting forensic cases where linguists analysed writers’ styles, whether for authorship attribution, verification, or profiling, are presented in several forensic linguistics books [132, 77, 39].
As for translator stylometry analysis, a number of studies were conducted by linguists. The translator’s attitude toward a novel’s characters, and how this extends to affect readers, was studied by Winters [194]. Another example is a study by Xiumei, which examined translator style in terms of Relevance Theory [197]. Another linguistic study was introduced by Sabet
and Rabeie to explore the effect of gender ideology on translation style [156].
2. Statistical approach
The univariate analysis approach is the simplest one: the analysis is based on a single feature or attribute, such as average word length, vocabulary size, occurrences of vowel-initial words, or 2-3 letter words. One of the well-known techniques in forensic linguistics is the CUSUM (cumulative sum) technique, introduced by Morton and Michaelson in 1990 [126]. In this method, the cumulative sum of the deviations of the measured variable is calculated and plotted in a graph to compare different authors’ styles. Although this method was used by forensic experts and was accepted in court, Holmes [68] criticized it as unreliable with regard to its stability when evaluated on multiple topics.
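The core CUSUM calculation is just the running sum of deviations of a per-unit measurement (for example, sentence length per sentence) from the series mean; the plotting step is omitted here, and the function name is our own for this sketch.

```python
import statistics

def cusum(values):
    """Cumulative sum of deviations from the series mean, the quantity
    plotted in a CUSUM chart. The final value is always zero, since the
    deviations from the mean sum to zero."""
    mu = statistics.mean(values)
    out, running = [], 0.0
    for v in values:
        running += v - mu
        out.append(running)
    return out
```

Two texts by the same writer are expected to produce CUSUM curves of similar shape for habit-based measures; Holmes’s criticism concerns how fragile that expectation is across topics.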
Dealing with one feature at a time is a limitation of univariate methods that cannot be ignored; this has led to the need for multivariate analysis. Principal component analysis (PCA) is a good example of the multivariate approach. PCA was first used in this research area by Burrows in 1987, with a set of the 50 highest-frequency words, for the analysis of “The Federalist Papers” [31, 28]. It showed a high level of accuracy in the authorship attribution field over the years [67, 29, 22, 3, 74]. Burrows’s Delta was interpreted by Argamon [11] as being equivalent to an approximate probabilistic ranking based on a multidimensional Laplacian distribution over frequently appearing words. In 2004, Hoover suggested some modifications to the way Delta is calculated and also introduced alternative transformations of Delta that improved the performance of the method. Hoover’s variation was introduced under the name Delta Prime [72].
In a work by Rybicki and Eder [155] investigating the performance of Burrows’s Delta for authorship attribution in different languages, they examined a number of corpora in English, Polish, French, Latin, Hungarian, German, and Italian. Among these languages, English and German were the easiest
languages for attribution by the Delta method, while Polish and Latin showed poor results. Rybicki and Eder suggested the degree of inflection as a possible factor affecting authorship attribution in these languages. In linguistics, inflection refers to the alteration of word forms to show their grammatical role in the sentence [147]. For example, in English, the use of the suffix -s to distinguish singular and plural forms of the same lexeme is an instance of inflection. Another example is the -o ending of the Latin ‘amo’ (‘I love’), which represents first person, singular, present tense, active, and indicative [40]. In the typological classification of languages, languages are called inflecting, synthetic, or fusional if they have some degree of inflection. Latin, Greek, and Arabic are highly inflected languages, while English is a weakly inflected language.
Rybicki and Eder explained the effect of the degree of inflection on the authorship attribution method they investigated as follows: “The relatively poorer results for Latin and Polish—both highly inflected in comparison with English and German—suggests the degree of inflection as a possible factor. This would make sense in that the top strata of word frequency lists for languages with low inflection contain more uniform words, especially function words; as a result, the most frequent words in languages such as English are relatively more frequent than the most frequent words in agglutinative languages such as Latin” [155].
3. Machine learning approach
The third group includes approaches that use machine learning methods to construct classifiers. These include: Support Vector Machines (SVM) [167, 3, 184, 41, 203], neural networks [180, 203, 170, 176, 134, 178], and decision trees [203]. Their scalability allows them to handle more features smoothly, and they are less susceptible to noisy data [96, 166, 109].
The interest in machine learning algorithms for this research area led to
multiple comparative studies between these methods. One such study was introduced by Zheng in 2006 [203], comparing decision trees, back-propagation neural networks, and support vector machines using four groups of stylistic features: lexical, syntactic, structural, and content-specific features. The support vector machine achieved higher accuracy than both the decision tree and the neural network in Zheng’s study. Pavelec et al. conducted a comparison between a compression algorithm called PPM and a Support Vector Machine classifier [134]. Results using the same testing protocol show that both strategies produce very similar results, but with different confusion matrices.
In 2010, Tsimboukakis and Tambouratzis [178] conducted a comparative study of neural networks and support vector machines. In their study, the proposed neural network approach (multilayer perceptron, MLP-based) achieved higher accuracy and needed a smaller set of parameters. Another example is the comparative study conducted by Jockers and Witten in 2010 [82]. They evaluated five classification methods: Delta [11, 29], k-nearest neighbours, support vector machines, nearest shrunken centroids, and regularized discriminant analysis. This study suggested that nearest shrunken centroids and regularized discriminant analysis outperformed the other classification methods.
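To illustrate the simplest of these methods, a plain nearest-centroid classifier (the unshrunken version of the nearest shrunken centroids idea mentioned above) can be sketched in a few lines. In practice the feature vectors would be word-frequency profiles of the training documents; the function name and toy data are our own.

```python
def nearest_centroid(train, labels, query):
    """Assign `query` to the class whose mean feature vector (centroid)
    is closest in Euclidean distance. A bare-bones sketch, without the
    shrinkage step used by nearest shrunken centroids."""
    centroids = {}
    for lab in set(labels):
        rows = [x for x, l in zip(train, labels) if l == lab]
        centroids[lab] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    return min(centroids, key=lambda lab: dist(centroids[lab], query))
```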
Cluster analysis was examined by Hoover on different data sets with different features [70, 71, 72, 73, 74].
Hoover compared the raw frequencies of the n most frequent words simultaneously, using cluster analysis to determine the similarity of two pieces of text and merging the most similar pair or group at each step. This process is repeated until all the texts are grouped into a single cluster. Hoover found that the best clustering was achieved over a range of the 500 to 800 most frequent words, when the novels were divided into sections of 5,000 words [71]. Hoover also found that using cluster analysis to compare texts based on the frequencies of frequent word sequences (frequent collocations) outperformed the frequencies of
the most frequent words for authorship stylistic analysis [70]. In another study investigating the use of cluster analysis with Burrows’s Delta, Hoover found that the best accuracy was achieved when more than 500 of the most frequent words were used [73]. If 70% of the occurrences of a personal pronoun or word fall within the same piece of text, the analysis eliminates that pronoun or word; this culling resulted in a significant increase in accuracy [73]. An investigation of other variations of Burrows’s Delta using cluster analysis was introduced in another study by Hoover, but no significant improvement over the original Delta method was achieved [72]. In 2009, Hoover examined a case study of a real-life authorship problem for a novel called “Female Life Among the Mormons” using cluster analysis, Delta analysis, t-testing, and PCA, and based on his investigation he concluded that Mrs Ferris is not the real author of this novel [74].
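The agglomerative procedure underlying these cluster analyses can be sketched as a single-linkage merge loop over a precomputed distance matrix (for stylometry, a matrix of Delta or word-frequency distances between texts). This is a generic textbook version, not Hoover’s exact software; the function name is our own.

```python
def single_linkage(dist):
    """Agglomerative clustering over a symmetric distance matrix:
    repeatedly merge the closest pair of clusters, where the distance
    between clusters is the minimum pairwise member distance.
    Returns the sequence of merges (cluster, cluster, distance)."""
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

Reading the merge sequence bottom-up yields the tree diagrams (dendrograms) whose branch groupings the studies above interpret.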
4. Social Network Analysis
Since everything is connected (individuals, information, activities, locations), a sensible strategy for making sense of such a mass of connections is to analyse them as networks. Social network analysis (SNA) is the mapping and measuring of relationships and flows between the studied entities.
To the best of our knowledge, social network analysis has not been employed by researchers for the purpose of stylometry analysis. However, it has been combined with authorship identification in a number of forensic email investigation studies to explore suspected collaborators, suspicious activities, and behavioural profiles [173, 58, 108]. None of these researchers analysed the actual contents of the emails using social network analysis for the purpose of stylometry analysis. This thesis proposes that the development of this method may offer an additional approach to the field.
2.3 Translator Identification: A Data Mining Perspective
The term data mining is used interchangeably with knowledge discovery. Frawley et al. defined data mining as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [50].
The key reason for the strong interest in data mining is the huge amount of available data caused by the information explosion, which necessitates converting this data into useful information and knowledge. The importance of data mining can be summed up in the observation that “we are data rich but information poor” [59].
Fayyad et al. defined the high-level goals of data mining as predictive, descriptive, or a combination of the two [48]. The particular objectives associated with prediction and description are targeted by different data mining tasks. The principal data mining tasks are: classification, regression, clustering, summarization, dependency modelling, and link analysis.
2.3.1 Classification for Translator Identification
Translator identification using stylometry analysis can be seen as a data mining task that investigates interesting patterns that can be used to distinguish between two translators’ styles.
Among the different data mining tasks, classification and clustering are the most relevant to identifying translator stylometry, since in both cases the problem can be seen as grouping documents into appropriate classes, which are the translators. Clustering was used in 2012 for the purpose of attributing translator style [154]. However, classification is more appropriate for stylometry identification or attribution problems. Clustering is an unsupervised learning approach in which the groups are unlabelled: stylistic markers are used by the clustering technique to group documents based on similarities, and there are no predefined classes. The clustering technique may therefore group translations based on an interesting pattern or similarity other than the translators’ styles. Rybicki’s research is a good example of that [154]: the objective was translator stylometry attribution and Rybicki used clustering techniques for that purpose, but documents were grouped based on their original authors rather than their translators.
The translator identification problem, which may also be referred to as the “translator attribution problem”, is the task of identifying the translator of a given text [175]. Generally, it can be considered a typical classification problem, in which a set of documents with known translators is used for training, and the aim is to automatically determine the translator of an anonymous text. Classification techniques have been used to support decision-making processes in many application areas [35].
The first challenge in this classification problem is identifying the appropriate features for translator identification. This question arises from the variation of features used across different research studies. Consequently, the main concern of the computer-assisted translator identification problem is to define features that are able to capture the writing style of a translator. For that purpose, we need to explore which types of features are suitable for the translator identification problem from a data mining perspective.
2.3.2 Stylometric Features for Translator Stylometry Identification
Stylometric features are grouped into five categories, as mentioned earlier in this chapter: lexical, character, syntactic, semantic, and application-specific features. Among these categories, there is a need to identify which types of features may be suitable for translator stylometry identification.
In a study examining authorship attribution of translated texts, Hedegaard and Simonsen used the semantic level for author attribution [62]. They based their research on the perception that the author’s fingerprint on the lexical and syntactic levels is defaced by the translation process, and that traditional lexical markers are highly affected by the translator’s fingerprints. They found that semantic features outperformed the baseline of random-chance accuracy. Although a classifier based on function words, which is a lexical measure, was influenced by the translator rather than the author, they found that combining semantic features with the traditional lexical approach produced better results than semantic features alone.
This perception, that the translator’s fingerprint is strongest on the lexical level, followed by its influence on the syntactic level, is also demonstrated by the choices made in the translator stylometry analyses discussed previously.
Baker conducted a comparative study in 2000 looking for “translator’s fingerprints” [19]. Baker analysed the text on the lexical level using three features: type/token ratio, mean sentence length, and the frequency of different forms of the reporting verb “say”. The comparative study of the two translators, Peter Clark and Peter Bush, showed differences in the evaluated features [19]. However, we need to highlight that these translations were not of the same source text, nor even from the same source language. This is a limitation of such a comparative study: it is not clear whether the reported differences are caused by variation in translation styles, by variation in the source texts, or by variation between the source languages.
As discussed previously in section 2.2.2.2, Mikhailov and Villikka evaluated three lexical features to examine the existence of translators’ styles: vocabulary richness, word frequency, and favourite words [121]. They used Russian-to-Finnish translations of different texts performed by the same translator and, in one case, translations of the same text performed by different translators. They found that comparing translations to originals by the same translator may give similar ratios for the number of words in the original to the number of words in the translation, the number of sentences in the original to the number of sentences in the translation, and the number of paragraphs in the original to the number of paragraphs in the translation. However, they attributed those similarities to the translator’s attitude to the structure of the original rather than to the translator’s individual writing style. Mikhailov and Villikka concluded that the language of different translations of the same text performed by different people is closer than that of different translations by the same translator [121].
In 2011, Wang and Li focused on investigating translators’ fingerprints based on English-to-Chinese translations of Ulysses, exploring both lexical and syntactic features [187]. On the lexical level, verbal keywords and emotional-particle keywords were compared across the two translations. On the syntactic level, the frequency and percentage of post-positioned adverbial clauses in the translated texts were also compared. Wang and Li discussed the variations that they found between the two translations [187]. This study was conducted on only one pair of translations, by the two translators Xiao Qian and Jin. They found differences on the syntactic level in the frequency and percentage of post-positioned adverbial clauses between the two translations. The study also included some original writings of Xiao in Chinese to investigate whether there are similarities between his way of composing text when writing and when translating. That investigation, on the lexical level, showed that Xiao has some lexical idiosyncrasies that exist in both his own writing and his translations.
Wang and Li did not mention in this study that the Ulysses translation by Qian was in fact produced by both Xiao Qian and his wife Wen Jieruo [201, 186]. This shared contribution by two translators may have significantly affected the results of this translator stylometry study.
Rybicki’s study on translator stylometry relied on the lexical level [154]. Rybicki used Burrows’s Delta to measure the similarities and distances between pieces of text. For that study, Polish, English, French, and Italian translations by different translators were evaluated. Using clustering, the translations in most cases were grouped based on their original author rather than their translators.
As highlighted in the literature, there is a limited number of research studies using computational linguistics for translator stylometry identification, and they revealed contradictory findings. Some of these studies had limitations in terms of the chosen corpus, like Baker’s study and Wang and Li’s study, as highlighted earlier. Although recent developments in machine learning and data mining techniques have helped in similar stylometry identification problems, like authorship attribution, apart from Rybicki’s study [154] none of these translator stylometry analyses employed data mining techniques to evaluate the features they studied. Despite the high expectations of employing lexical features for translator identification, existing lexical features failed to identify translator stylometry, as shown in the studies of Mikhailov and Villikka and of Rybicki. Thus, there is a need to explore other methods for the translator stylometry identification problem. The support for a translator stylometry signature demonstrated in our previous discussion of the literary analysis of translator styles encouraged us to explore another possible approach to identify translator stylometry. This new approach will be based on employing social network analysis and data mining techniques.
2.3.3 Social Network Analysis
Newman [129] defined a network as “a collection of points joined together in pairs by lines”; these points are referred to as vertices or nodes, and the lines are referred to as edges. Networks are everywhere; almost any system can be represented as a network. The traditional definition of a system is a group of components interacting together for a purpose; this is a definition whereby a network representation is natural (components are nodes and interactions are represented by links). Many tools exist in the literature for analysing networks [189], varying from mathematical and computational to statistical tools.
Social network analysis has gained researchers’ interest because of its ability to represent relationships among social entities in a way that enables further analysis of the relationship patterns and their implications. Many research studies have benefited from using social network analysis, such as studies in occupational mobility, community, group problem solving, social support, the world political and economic system, and markets [189].
Social network analysis is the mapping and measuring of relationships and flows between people, groups, organizations, computers, or other information/knowledge-processing entities. It measures and represents the regularities in the patterns of relations among entities.

From a social network analysis perspective, a network has two types of components: actors and relations. Actors, also known as nodes, points, or vertices, represent the entities; these entities may be individuals, organizations, or events. Relations, also known as lines, arcs, edges, or ties, represent the connections between pairs of actors. Social relations vary along three dimensions: direction, strength, and content. Relations are either directed (asymmetric) or undirected (symmetric). Strength represents the intensity or frequency of interaction; in its simplest form it is binary, representing the existence or absence of the relation, while in other cases discrete or continuous numerical values represent the weight of the relation. The third dimension, content, represents a specific substantive connection among actors. Relations are multiplex when actors are connected by more than one type of tie (e.g., friendship and business).
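The three relation dimensions above can be illustrated with a minimal sketch (ours, not from the thesis): a directed, weighted network stored as a plain dict-of-dicts, where `add_relation` is a hypothetical helper name.

```python
# A minimal sketch (not from the thesis) of the three relation dimensions:
# direction, strength, and existence, using a dict-of-dicts representation.

def add_relation(net, src, dst, weight=1, directed=True):
    """Record a relation from actor `src` to actor `dst`.

    - direction: directed=True stores one arc; False stores both directions.
    - strength: `weight` (1 for simple existence, or a frequency count).
    """
    net.setdefault(src, {})[dst] = weight
    net.setdefault(dst, {})  # ensure the target actor exists as a node
    if not directed:
        net[dst][src] = weight
    return net

network = {}
add_relation(network, "A", "B", weight=3)                  # A -> B, strength 3
add_relation(network, "B", "C", weight=1, directed=False)  # symmetric tie

print(sorted(network))      # all actors
print(network["A"]["B"])    # strength of the A -> B tie
print("A" in network["B"])  # the directed tie has no reverse arc
```

The same structure extends to multiplex relations by storing, per pair, a mapping from tie type (e.g., "friendship", "business") to weight.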
Important observations regarding social network analysis definitions that can be linked to stylometry analysis for the translator identification problem are:
• “. . . Social network analysis is based on an assumption of the importance of relationships among interacting units” [189].
• “. . . The unit of analysis in network analysis is not the individual, but an entity consisting of a collection of individuals and the linkages among them” [189].
• The main goal of social network analysis is detecting and interpreting patterns of relations among actors [130].
• “Actors and their actions are viewed as interdependent rather than independent, autonomous units” [189].
The question that arises here is: “Why has social network analysis been chosen for investigation rather than a non-network explanation?” The answer relies on the following distinction: a non-network explanation targets one or all of the features of the individual entities, the associations of these features, and the convenience of using one or more features to predict another [189], whereas social network analysis refers to the set of entities and the relations among them. In social network analysis, relations come first, and the features of the entities come later.
Traditional stylometric features discussed in the literature are instances of the non-network explanation discussed here, and we know that they failed to identify translator stylometry. That raises another question: can we benefit from social network analysis in the stylometry analysis field? As mentioned earlier in this section, networks can represent most of the complex structures in our life. What about texts?
Network Text Analysis, introduced by Popping in 2000 [140], is a method for encoding the relationships between words in a text and constructing a network of the linked words. The technique is based on the assumption, introduced by Sowa in 1984 [164], that language and knowledge can be modelled as networks of words and the relations between them. Text was also modelled as a network for centring resonance analysis by Corman et al. [37] and Willis and Miertschin [190]. Foster et al. analysed word adjacency networks in a study that investigated the effect of edge direction on the network’s structure; in their study, the edges point from each word to any word that immediately follows it in a selected text [49]. Another example of a word adjacency network is the study by Grabska-Gradzinska et al. of literary and scientific texts written in English and Polish [57].
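As a hedged illustration of this construction (our sketch, not code from any of the cited studies), the following builds a weighted word-adjacency network in which each edge points from a word to the word that immediately follows it, with the edge weight counting how often that pair occurs:

```python
from collections import defaultdict

# Illustrative word-adjacency network: edge w1 -> w2 whenever w2 immediately
# follows w1 in the text; the weight counts co-occurrences of the pair.

def adjacency_network(text):
    words = text.lower().split()
    net = defaultdict(lambda: defaultdict(int))
    for w1, w2 in zip(words, words[1:]):
        net[w1][w2] += 1
    return net

net = adjacency_network(
    "have we not made the earth an expanse and the mountains as pegs")
print(dict(net["the"]))  # 'the' is followed once by 'earth', once by 'mountains'
```

Real analyses would first lemmatize or otherwise normalize the tokens; the plain `split()` tokenizer here is a simplifying assumption.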
Analysing texts as networks goes well beyond traditional text analysis tools. We are not interested in the usage of part-of-speech tags, or in finding the most frequent words. Instead, we can analyse the actual relations, the process that aligns the words in a specific way, rather than the terms themselves.
2.3.4 Analysing Local and Global Features of Networks
To analyse and compare two networks, we can use their global statistical features, such as shortest path length, global centrality, and the clustering coefficient, or their structural design principles, such as network motifs.
2.3.4.1 Global Features
To evaluate global network features, we chose some of the common measures: degree average, density, clustering coefficient, transitivity, modularity, betweenness, characteristic path length, and diameter.
• Degree average
The degree of a node is the number of links attached to it, which is also the number of nodes adjacent to it. In directed networks, the in-degree is the number of inward links and the out-degree is the number of outward links. The degree of a node is a measure of the “activity” of the actor it represents, and is the basis for one of the centrality measures [189]. The degree average of the network is the average of the nodes’ weighted degrees, where the weighted degree of a node i is calculated using the following equation:

k_i^w = Σ_{j∈N} w_ij    (2.1)

where N is the set of all nodes in the network, (i, j) is a link between nodes i and j with i, j ∈ N, and w_ij is the connection weight associated with the link (i, j).
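A small sketch of this calculation (ours; `weighted_degree` and `degree_average` are illustrative names), assuming the weights are stored as a dict-of-dicts:

```python
# Sketch of Eq. (2.1): the weighted degree of node i sums the connection
# weights on its links; the network-level degree average is the mean of this
# quantity over all nodes.

def weighted_degree(weights, i):
    return sum(weights[i].values())

def degree_average(weights):
    return sum(weighted_degree(weights, i) for i in weights) / len(weights)

w = {"a": {"b": 2, "c": 1}, "b": {"a": 2}, "c": {"a": 1}}
print(weighted_degree(w, "a"))  # 2 + 1 = 3
print(degree_average(w))        # (3 + 2 + 1) / 3 = 2.0
```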
• Density
The density of a network is the proportion of possible links that are actually present: the ratio of the number of links present to the maximum possible number of links in the network [189]. Connection weights are ignored in this calculation. This measure is used to evaluate the cohesiveness of subgroups.
• Diameter
The diameter of a connected graph is the length of the largest geodesic between any pair of nodes [189]. The diameter represents the maximum eccentricity. The diameter of a graph is important because it quantifies how far apart the farthest two nodes in the graph are.
• Assortativity
The assortativity coefficient is a correlation coefficient between the degrees of all nodes on two opposite ends of a link. A positive assortativity coefficient indicates that nodes tend to link to other nodes with the same or similar degree.
• Clustering coefficient
The clustering coefficient is the fraction of triangles around a node. It is equivalent to the fraction of a node’s neighbours that are neighbours of each other. The clustering coefficient measures the degree to which nodes in a graph tend to cluster together.
• Transitivity
The transitivity is the ratio of “triangles to triplets” in the network (an alternative version of the clustering coefficient).
• Modularity
Modularity is a statistic that quantifies how well the network can be divided into an optimal community structure: a subdivision of the network into non-overlapping groups of nodes that maximizes the number of within-group edges and minimizes the number of between-group edges.
• Betweenness
Node betweenness centrality is the fraction of all shortest paths in the network that contain a given node. Nodes with high values of betweenness centrality participate in a large number of shortest paths.
• Characteristic path length
The characteristic path length is the average shortest path length in the network.
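Several of the global measures above can be sketched with the standard library alone. The following illustrative versions (ours, assuming an undirected, unweighted graph stored as {node: set(neighbours)}) compute density, diameter, and the per-node clustering coefficient:

```python
from collections import deque

# Stdlib-only sketches of three global measures for an undirected,
# unweighted graph stored as {node: set(neighbours)}.

def density(g):
    n = len(g)
    links = sum(len(nbrs) for nbrs in g.values()) // 2  # each link stored twice
    return links / (n * (n - 1) / 2)

def shortest_lengths(g, src):          # breadth-first search distances
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter(g):                       # longest geodesic; connected graph assumed
    return max(max(shortest_lengths(g, s).values()) for s in g)

def clustering(g, u):                  # fraction of neighbour pairs that are linked
    nbrs = list(g[u])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in g[nbrs[i]])
    return 2 * links / (k * (k - 1))

g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(density(g))          # 4 links out of 6 possible
print(diameter(g))         # longest geodesic: a -> c -> d
print(clustering(g, "a"))  # neighbours b and c are linked
```

The characteristic path length follows the same pattern: average `shortest_lengths` over all node pairs instead of taking the maximum.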
2.3.4.2 Local Features
Local measures attempt to capture the global features of the network using local constructs. A widely used local measure is the network motif. Network motifs, initially introduced by Milo et al. [122], are patterns (represented by small sub-graphs) that are repeated in a real network more often than in randomly generated networks. They are used to uncover network structural design principles [7]. Motifs usually consist of three, four, or five nodes. Network motifs have been used successfully by researchers in biology [165, 142], game theory [56], evolutionary algorithms [103], electronic circuits [79], and software [181].
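Only the counting step of motif analysis is sketched below (our illustration, not Milo et al.’s tool): a census of the two connected three-node patterns in an undirected graph. A full motif analysis would then compare these counts against an ensemble of randomised networks.

```python
from itertools import combinations

# Census of undirected 3-node subgraph patterns: every connected triple is
# either a "path" (exactly 2 links) or a "triangle" (3 links).

def three_node_census(g):  # g: {node: set(neighbours)}, undirected
    counts = {"path": 0, "triangle": 0}
    for a, b, c in combinations(sorted(g), 3):
        links = (b in g[a]) + (c in g[a]) + (c in g[b])
        if links == 3:
            counts["triangle"] += 1
        elif links == 2:
            counts["path"] += 1
    return counts

g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(three_node_census(g))  # one triangle (a,b,c); two paths through c
```

For directed networks the census grows to 13 connected three-node patterns, which is why dedicated motif-mining tools enumerate subgraph isomorphism classes rather than hand-coded cases.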
2.4 Chapter Summary
This chapter has provided a brief summary of the stylometry analysis problem. It discussed the extensive literature on writer stylometric analysis and its subtopics: writer profiling, writer attribution, writer verification, and plagiarism detection. It also discussed the varied methodologies employed in this topic. The different nominated features and approaches used to solve this problem were also presented.
The challenge of identifying translator stylometry was discussed in both literary studies and computational linguistic studies. The failure to attribute translations to their translators leads to the need to identify stylometric features that can capture translators’ individual styles. For that purpose, we explored how to redefine the problem of translator stylometry identification as a classification problem, and how to employ social network analysis to identify hidden stylometric features.
Research in translator stylometry analysis is limited, with the problems identified above. This study addresses the gap in the literature by combining a mixed methodology that employs features extracted from social network analysis as attributes for machine learning classification. The aim of this study is to introduce a computational framework that is able to identify translator stylometry.
Chapter 3
Data and Evaluation of Existing Methods
3.1 Overview
In this study, we follow Baker’s definition of “translator styles”: “a study of a translator’s style must focus on the manner of expression that is typical of a translator, rather than simply instances of open intervention. It must attempt to capture the translator’s characteristic use of language, his or her individual profile of linguistic habits, compared to other translators” [19]. Her definition of style as a matter of patterning of linguistic behaviour is what we targeted in this research, and it guided us through the design and choice of our data corpus.
Based on Baker’s definition, we find that the best way to identify translator stylometry is to compare translations of the same original text by different translators, from the same source language into the same target language. This minimizes the variations that may be caused by factors other than the translators’ individual styles, such as variations in the source text types or contents, or variations caused by the characteristics of different languages.
For that purpose, we use parallel Arabic-to-English translations of the “Holy Qur’an” in this study. In this chapter, we clarify the reasons for our choice of the source and target languages of translation, the nature of the source text, and the structure of our dataset. After that, we evaluate, on our data corpus, some of the existing methods that have been used in the literature for the translator stylometry identification problem.
3.2 Why Arabic to English translations?
This study focuses on translations from Arabic to English. Arabic is the third most widely used official language in the world after English and French; it is the official language of 26 countries 1, with approximately 280 million native speakers in the world [141].
Second-language learners face difficulties when they learn a language derived from a different language family. For example, learning German for a native English speaker is not as difficult as learning Arabic. German and English belong to the same branch and subgroup of the language families’ taxonomy: both belong to the Western branch of the Germanic subgroup of the Indo-European family [88], while Arabic belongs to the Semitic subgroup of the Afro-Asiatic family [88]. Translation may pose similar difficulties between languages that belong to different language families; the space of choices while mapping increases in this case.
The importance of the Arabic language can be understood through its socio-political role, but it extends to its religious role. Approximately 1.57 billion Muslims of all ages live in the world today [148], reading the Qur’an on a daily basis as a part of their religious activities; this explains why millions of Muslims seek to learn Arabic, the language of “The Holy Qur’an”, the main religious text of Islam.
1As in World Atlas website http://www.worldatlas.com/ on 28th of January 2011
3.3 Why the “Holy Qur’an”? Is the proposed approach restricted to Holy Qur’an?
From the above figures, less than 17.83% of Muslims are native Arabic speakers, and all the others need interpretations and translations of the meaning of “The Holy Qur’an”. Hence, there is a need to translate the Qur’an’s meanings accurately, to reflect and respect the original meaning. Despite the existence of many translations of this text, there is considerable dispute about the loss of meaning in translations of the Qur’an due to the uniqueness of its textual characteristics. Therefore, we chose to use the translation of the meanings of “The Holy Qur’an” as our corpus for this study. Another important consideration in the choice of this text was the expectation that, given its religious significance, there would be minimal difference between the translations; thus, it poses a tough translator stylometry challenge.
Additionally, the availability of many translations of the same text provides a good source for evaluating the challenge of increasing the number of translators while trying to detect their translation stylometry. The study introduced in this thesis is not limited to the Holy Qur’an. Given the strength of this Holy book in its use of the Arabic language, which is in fact considered the most powerful use of formal Arabic, we believe it is the type of translation that challenges most translators, and we expected the translators to be as transparent as possible. A methodology able to handle such a challenge in a parallel translation can be expected to achieve higher accuracy when applied to different types of datasets. We obtained our corpus data from the tanzil.net 2 website, which offers translations of the meanings of the Holy Qur’an in different languages. We chose seven translations for this study, by: Ahmed Raza Khan, Muhammad Asad, Abdul Majid Daryabadi, Abul Ala Maududi, Mohammed Marmaduke William Pickthall, Muhammad Sarwar, and Abdullah Yusuf Ali. Table 3.1 provides brief information about these translators.
2Tanzil is a quranic project launched in early 2007 to produce a highly verified unicode Quran text to be used in quranic websites and applications. www.tanzil.net
In the following example, Figure 3.1 shows two verses from chapter 78 of the Holy Qur’an, “An-Naba’”. We then provide the parallel English translations of these two verses by the seven translators. We can see the variation in the translations of the two sentences, which causes variation in the mental pictures delivered to the reader.
Figure 3.1: In the Holy Qur’an 78(6 7) Color Coding Represents Variations of Lexical Uses
Translation of “Ahmed Raza Khan” “Did We not make the earth a bed? (6) And the mountains as pegs? (7)"
Translation of “Muhammad Asad” “HAVE WE NOT made the earth a resting-place [for you],(6) and the mountains [its] pegs?(7)"
Translation of “Abdul Majid Daryabadi” “Have We not made the earth an expanse. (6) And the mountains as stakes?(7)"
Translation of “Abul Ala Maududi” “Have We not spread the earth like a bed,(6) and fixed the mountains like pegs,(7)"
Translation of “Mohammed Marmaduke William Pickthall” “Have We not made the earth an expanse,(6) And the high hills bul- warks?(7)"
Translation of “Muhammad Sarwar” “Have We not made the earth as a place to rest (6) and the mountains as pegs (to anchor the earth)? (7)"
Translation of “Abdullah Yusuf Ali” “Have We not made the earth as a wide expanse, (6) And the mountains as pegs? (7)"
Another example is verse 14 of chapter 23 of the Holy Qur’an, “Al-Mu’minun”, which is shown in Figure 3.2; the translations of this verse by the seven translators follow. The availability of different parallel translations of the same text provides us with the data required to compare translators’ individual styles. These two examples support our choice of dataset in terms of the availability of a number of parallel translations, as well as the expectation that the translators were trying to minimize their individual reflection in the text because of its religious nature.
Figure 3.2: In the Holy Qur’an 23(14) Color Coding Represents Variations of Lexical Uses
Translation of “Ahmed Raza Khan” “We then turned the drop of fluid into a clot of blood, then the clot into a small lump of flesh, then the lump into bones, then covered the bones with flesh; then developed it in a different mould; therefore Most Auspicious is Allah, the Best Creator.(14)”
Translation of “Muhammad Asad” “... and then We create out of the drop of sperm a germ-cell, and then We create out of the germ-cell an embryonic lump, and then We create within the embryonic lump bones, and then We clothe the bones with flesh - and then We bring [all] this into being as a new creation: hallowed, therefore, is God, the best of artisans!(14)”
Translation of “Abdul Majid Daryabadi” “Thereafter We created the sperm a clot; then We created the clot a lump of flesh; then We created the lump of flesh bones; then We clothed the bones with flesh: thereafter We brought him forth as another creature. Blest then be Allah, the Best of creators!(14)”
Translation of “Abul Ala Maududi” “then We made this drop into a clot, then We made the clot into a lump, then We made the lump into bones, then We clothed the bones with flesh, and then We caused it to grow into another creation. Thus Most Blessed is Allah, the Best of all those that create.(14)”
Translation of “Mohammed Marmaduke William Pickthall” “Then fashioned We the drop a clot, then fashioned We the clot a little lump, then fashioned We the little lump bones, then clothed the bones with flesh, and then produced it as another creation. So blessed be Allah, the Best of creators! (14)”
Translation of “Muhammad Sarwar” “The living germ, then, was turned into a shapeless lump of flesh from which bones were formed. The bones, then, were covered with flesh. At this stage, We caused it to become another creature. All blessings belong to God, the best Creator.(14)”
Translation of “Abdullah Yusuf Ali” “Then We made the sperm into a clot of congealed blood; then of that clot We made a (foetus) lump; then we made out of that lump bones and clothed the bones with flesh; then we developed out of it another creature. So blessed be Allah, the best to create! (14)”
3.4 How the dataset is structured?
The Holy Qur’an is divided into 114 surah (pl. suwar), also known by some as chapters, although they are not equal in length; the length of a surah varies from three ayat (verses) to 286 ayat. We will refer to them as chapters and verses in this study. Some Islamic scholars divided the Holy Qur’an into 30 parts (Juz’) that are roughly equal in length, for easier citation and memorization during the month of Ramadan.
In this study, we are going to use parallel translations of the meanings of
Table 3.1: Translators’ Demography
Translator                            | Life interval      | Birth place       | Nationality              | First language | Second language(s)                          | Translation appeared in
Ahmed Raza Khan                       | 1856-1921          | Bareilly, India   | Indian                   | Urdu           | N/A                                         | 1912
Muhammad Asad                         | 1900-1992          | Lvov, Poland      | Pakistani (orig. Polish) | Polish         | Arabic, Hebrew, French, German, and English | 1980
Abdul Majid Daryabadi                 | 1892-1977          | India             | Indian                   | Urdu           | English                                     | 1957
Abul Ala Maududi                      | 1903-1979          | Aurangabad, India | Indian                   | Urdu           | Arabic, English                             | N/A
Mohammed Marmaduke William Pickthall  | 1875-1936          | Harrow, London    | British                  | English        | Arabic, Turkish, and Urdu                   | 1930
Muhammad Sarwar                       | 1938- (still alive)| Quetta, Pakistan  | Pakistani                | N/A            | N/A                                         | 1981
Abdullah Yusuf Ali                    | 1872-1953          | Bombay, India     | Indian                   | Urdu           | English, Arabic                             | 1934
the last six parts of the Holy Qur’an. These six parts comprise 74 chapters. We limited our dataset corpus to seven translations of six parts due to computational limitations. Additionally, we consider this amount of data sufficient to address our problem of translator stylometry identification in terms of data size and number of classes. The size of each part for each translator is shown in Table 3.2.
Table 3.2: Number of Words in the Dataset for Each Translator
Translator Name | Part25 | Part26 | Part27 | Part28 | Part29 | Part30
Khan            |  7427  |  4476  |  5974  |  5600  |  6182  |  5690
Asad            |  7326  |  7136  |  7261  |  7025  |  7499  |  6619
Daryabadi       |  6659  |  4105  |  5403  |  5100  |  5492  |  4942
Maududi         |  7310  |  4370  |  6255  |  5588  |  6291  |  5356
Pickthall       |  5340  |  4759  |  5384  |  5188  |  5477  |  4759
Sarwar          |  6831  |  4034  |  5654  |  5181  |  5784  |  5332
Yousif Ali      |  5950  |  5795  |  6019  |  5665  |  6265  |  5633
3.5 Evaluating Existing Methods
3.5.1 Is there such a thing as a translator’s style?
In 2001, Mikhailov and Villikka [121] published a paper claiming that translators do not have a language and style of their own. In their research, they used vocabulary richness, most frequent words, and favourite words, as these are typical authorship attribution features, to support their claim. In response to this paper, and as a first step, we re-applied the same method and features that they used to our dataset.
3.5.1.1 Method I
1. Vocabulary Richness
Vocabulary richness can be evaluated using different methods. Mikhailov and Villikka [121] used three measures of vocabulary richness from a multivariate approach originally introduced by Holmes in 1991 [66], then modified by Holmes and Forsyth in 1995 [65] for analysing the Federalist Papers. These three measures are:
(a) The R index is a measure suggested by Honoré (1979). This measure targets hapax legomena, i.e., words that are used only once in the text. The higher the number of words used only once in the text, the higher the R value:

R = 100 log N / (1 − V1/V)    (3.1)

where N is the text length in words, V is the number of different words, and V1 is the number of words used exactly once.
(b) The K index is a measure proposed by Yule (1944). The measure increases monotonically as the number of high-frequency words in the text increases:

K = 10^4 (Σ_{i=1}^{∞} i^2 V_i − N) / N^2    (3.2)

where V_i (i = 1, 2, . . .) is the number of words used exactly i times in the text.
(c) The W index was originally proposed by Brunet (1978), who claimed that this measure is not affected by text length and is author-specific. The W index increases as the number of different words increases:

W = N^(V^(−a))    (3.3)

where a is a constant ranging from 0.165 to 0.172. As we could not find the methodology for choosing the value of a, we used the value 0.172, which Mikhailov and Villikka [121] used in their research.
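The three indices can be sketched as follows (our illustration, not the thesis’s code; the natural logarithm is assumed in Honoré’s R, and a = 0.172 as stated above):

```python
import math
from collections import Counter

# Sketch of Eqs. (3.1)-(3.3): Honore's R, Yule's K, and Brunet's W,
# computed from a token list. Natural log is an assumption for R.

def richness(tokens, a=0.172):
    N = len(tokens)
    freqs = Counter(tokens)
    V = len(freqs)                                   # number of different words
    V1 = sum(1 for f in freqs.values() if f == 1)    # hapax legomena
    spectrum = Counter(freqs.values())               # V_i: words used exactly i times
    R = 100 * math.log(N) / (1 - V1 / V)
    K = 1e4 * (sum(i * i * Vi for i, Vi in spectrum.items()) - N) / N**2
    W = N ** (V ** -a)
    return R, K, W

tokens = "the cat sat on the mat and the dog sat too".split()
R, K, W = richness(tokens)
print(round(R, 2), round(K, 2), round(W, 2))
```

Note that R diverges when every word is a hapax (V1 = V), which is one reason these measures need texts of a minimum size, as discussed for Experiment I below.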
2. Most Frequent Words
The F Index measures the closeness of the most frequent words, reflecting a correlation between two pieces of text. The two texts are compared by selecting the 40 most frequent words from their word lists. The F Index is then calculated by adding three points for each word with a close relative frequency, two points for each word with a different relative frequency, and one point for each word with a quite different relative frequency; one point is deducted for each word absent from the other list. We applied this method to lemmatized word lists for each text in our dataset.

To calculate the F Index, we needed to define thresholds for close, different, and quite different relative frequency, which were not defined in Mikhailov and Villikka’s research [121]. To do so, we divided the range between the minimum and maximum frequency difference into three equal parts: the first section represents the low-difference area, the middle section the medium-difference area, and the last section the high-difference area. If the difference between the frequencies of the same word in the two texts falls in the low-difference area, the words are relatively close to each other, and the F Index is incremented by three. If the difference falls in the medium-difference area, the words are considered different, and the F Index is incremented by two. Otherwise, the F Index is incremented by one. This process of calculation is illustrated in Figure 3.3.
Figure 3.3: Calculating Thresholds for F Index
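One possible reading of this scoring scheme can be sketched as follows (our reconstruction, not Mikhailov and Villikka’s code; the handling of the degenerate case where all differences are equal is our assumption):

```python
from collections import Counter

# Sketch of the F Index: compare the top-40 relative frequencies of two
# texts; the range of frequency differences is split into three equal parts
# (close: +3, different: +2, quite different: +1; absent words: -1 each).

def f_index(tokens_a, tokens_b, top=40):
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    rel_a = {w: c / len(tokens_a) for w, c in fa.most_common(top)}
    rel_b = {w: c / len(tokens_b) for w, c in fb.most_common(top)}
    shared = set(rel_a) & set(rel_b)
    diffs = {w: abs(rel_a[w] - rel_b[w]) for w in shared}
    score = 0
    if diffs:
        lo, hi = min(diffs.values()), max(diffs.values())
        step = (hi - lo) / 3 or 1.0        # assumption: any step if all equal
        for d in diffs.values():
            if d <= lo + step:
                score += 3                  # close relative frequency
            elif d <= lo + 2 * step:
                score += 2                  # different
            else:
                score += 1                  # quite different
    score -= len(set(rel_a) ^ set(rel_b))   # words absent from the other list
    return score

a = ("the cat sat on the mat " * 5).split()
b = ("the dog sat on the rug " * 5).split()
print(f_index(a, b))  # 3 shared close words (+9) minus 4 unshared (-4) = 5
```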
3. Favorite Words
To identify the favourite words: firstly, the relative frequency of each word is calculated over the whole corpus, where relative frequency is the number of occurrences of a word divided by the total number of words in the text. The output of this step is a table of words and their corresponding relative frequencies. Secondly, for each tested text, we calculate the relative frequency of each word in that text, producing a similar output. We then compare the outputs of these two steps, which yields a filtered list containing the favourite words of that translator in that translation. The criterion for filtering is that the word’s frequency in the text is much higher than in the corpus; we denote this factor by alpha. Alpha is not defined by a specific value, and therefore we tested it with different values, as described later.
We then re-applied the F Index method, described in the most frequent words subsection, to compare the two filtered lists obtained for the two texts under comparison. While the list size in the most frequent words method is predefined (the top 40 most frequent words), for the FW Index the size changes with alpha. To define a threshold for the condition “where word freq in a text is much higher than in the corpus” [121], let Fc(w1) represent the frequency of word w1 in the corpus and Fi(w1) the frequency of w1 in text i. If Fi(w1)/Fc(w1) is greater than alpha, then its frequency is much higher than in the corpus.
To define an appropriate value for alpha, we used two parts translated by two different translators for this test: part 25 translated by Ahmed Raza Khan and Pickthall, and part 27 translated by Sarwar and Yusuf Ali. Table 3.3, with its subtables (a) and (b), shows that choosing alpha as twice or three times the relative frequency yields an acceptable number of words in the list and an acceptable FW Index. We chose alpha as 3, as it complies better than 2 with the condition “where word freq in a text is much higher than in the corpus” described in Mikhailov and Villikka [121].
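The filtering criterion can be sketched as follows (our illustration; `favourite_words` is a hypothetical name, and the tested text’s words are assumed to appear in the corpus):

```python
from collections import Counter

# Sketch of the favourite-words filter: a word is "favourite" in a text when
# its relative frequency there exceeds alpha times its relative frequency in
# the whole corpus (alpha = 3 is the value chosen in the text above).

def favourite_words(text_tokens, corpus_tokens, alpha=3):
    fc = Counter(corpus_tokens)
    ft = Counter(text_tokens)
    n_c, n_t = len(corpus_tokens), len(text_tokens)
    return {w for w, c in ft.items()
            if w in fc and (c / n_t) / (fc[w] / n_c) > alpha}

corpus = ("the cat sat " * 20 + "pegs mountains earth " * 2).split()
text = "pegs pegs mountains the cat sat".split()
print(favourite_words(text, corpus))  # words over-represented in the text
```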
Table 3.3: Effect of the choice of alpha on the number of words in the coincidence lists of the FW Index for two tested texts
(a) Number of words in the coincidences list of “Part 25”
Alpha | Khan (7427 words) | Pickthall (5340 words) | FW Index
4     |  337              |  328                   |  44
3     |  396              |  408                   |  70
2     |  531              |  510                   | 128
1.5   |  615              |  613                   | 174

(b) Number of words in the coincidences list of “Part 27”
Alpha | Sarwar (5654 words) | Yusuf Ali (6019 words) | FW Index
4     |  347                |  467                   |  75
3     |  448                |  538                   | 103
2     |  573                |  694                   | 167
1.5   |  691                |  802                   | 227
3.5.1.2 Experiment I

In this experiment, we evaluate Mikhailov and Villikka’s approach [121] using our dataset. We found that “chapters” would be too small to use with these measures, as some of the measures are limited to texts of 1000 words or more. So, we worked at the level of parts of the Holy Qur’an; more details about the possible divisions of the Holy Qur’an were given earlier in the data section. We therefore used seven translations of six parts of the Holy Qur’an in this experiment.
For vocabulary richness, the R Index, K Index, and W Index were calculated for each text. The results were then used to compare and analyse the similarities and differences between translations by the same translator and translations of the same text. The objective of this experiment is to identify whether vocabulary richness is consistent for the same translator across different translations, or is affected only by the original text.
For the most frequent and favourite words, we cannot evaluate a single text at a time, as the proposed measures quantify similarity between two texts. Therefore, for the most frequent words, we calculated the measure for all possible combinations in the existing dataset. First, we calculated it for all pairs of translations by the same translator. For example, for translator Asad, we calculated the most frequent words measure, F Index, for (part25, part26), then (part25, part27), then (part25, part28), etc. After measuring these for all translators, we evaluated the F Index for different translators of the same original text. For example, for Part25, we calculated the F Index for (Asad, Daryabadi), (Asad, Maududi), (Asad, Pickthall), etc. All of these results were then used to analyse whether the most frequent words measure is affected more by the translator’s style or by the original text. The same procedure was used to evaluate the favourite words measure.
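The pairing design above can be sketched as follows (our illustration with a reduced set of translators and parts; any pairwise measure, such as the F Index, would then be evaluated on each pair):

```python
from itertools import combinations

# Enumerate the two families of comparisons: same-translator pairs (one
# translator, all pairs of parts) and same-text pairs (one part, all pairs
# of translators).

translators = ["Asad", "Pickthall", "Khan"]   # subset for illustration
parts = ["Part25", "Part26", "Part27"]

same_translator = [(t, p1, p2) for t in translators
                   for p1, p2 in combinations(parts, 2)]
same_text = [(p, t1, t2) for p in parts
             for t1, t2 in combinations(translators, 2)]

print(len(same_translator))  # 3 translators x 3 part-pairs = 9
print(len(same_text))        # 3 parts x 3 translator-pairs = 9
```

With the full dataset (7 translators, 6 parts), this gives 7 × C(6,2) = 105 same-translator pairs and 6 × C(7,2) = 126 same-text pairs.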
3.5.1.3 Results and Discussion of Experiment I
For vocabulary richness, all three measures are highly affected by the original text, as seen in Figures 3.4(a), 3.4(b), and 3.4(c). The R Index did not reflect an individual translator’s style; it is affected by the original text. Despite all original texts coming from the same language, the K Index and W Index also did not reflect individual translator styles. However, Asad had a lower K Index for all translations, and Khan had the highest W Index values for all translations. This implies that both the K Index and the W Index can show individual styles in some special cases, which requires further analysis to find the limitations of such cases. Detailed results for the R Index, K Index, and W Index are shown in Table 3.5.
For the most frequent words, Table 3.6 shows the F Index for translations of the same text; these values reflect how close the 40 most frequent words are in each pair of translations. Table 3.4 shows the F Index for pairs of translations by the same translator.
The average F Index for translations of the same text is 80.19 with a standard deviation of 10.01, while the average F Index for translations by the same translator is 86.94 with a standard deviation of 9.85.

Table 3.4: Most Frequent Words Index for the Same Translator

Translator | 25-26 | 25-27 | 25-28 | 25-29 | 25-30 | 26-27 | 26-28 | 26-29 | 26-30 | 27-28 | 27-29 | 27-30 | 28-29 | 28-30 | 29-30
Asad       |  100  |   85  |   89  |  105  |   85  |   90  |  100  |  100  |   80  |   84  |   85  |   75  |   89  |   73  |   85
Pickthall  |   90  |   95  |   79  |  100  |   75  |   90  |  100  |   95  |   70  |   79  |   99  |   75  |   84  |   64  |   95
Yousif Ali |   85  |   85  |   74  |   80  |   80  |   89  |  100  |   75  |   70  |   84  |   85  |   80  |   69  |   64  |   90
Khan       |  100  |   85  |   74  |  105  |   85  |   85  |   90  |  100  |   85  |   64  |   90  |   75  |   74  |   74  |   90
Daryabadi  |  100  |  100  |   94  |  100  |   85  |   95  |  105  |  100  |   85  |   89  |   90  |   75  |   94  |   80  |  100
Maududi    |   95  |   85  |   79  |   95  |   85  |   80  |  100  |   90  |   85  |   79  |   85  |   75  |   84  |   83  |  100
Sarwar     |   90  |   80  |   84  |   85  |   90  |   80  |   95  |   90  |   90  |   79  |   95  |   85  |   89  |   99  |  105

(Column headers give the compared part pairs, e.g., 25-26 denotes Part25 vs. Part26.)
For favourite words, Table 3.8 shows the FW-Index for translations of the same text, where the FW-Index reflects how close the favourite word lists are in a pair-wise comparison of translations, and Table 3.7 shows the FW-Index for translations by the same translator. The results show that the average FW-Index for translations of the same text is 110.93 with a standard deviation of 31.28, while the average FW-Index for translations by the same translator is 71.61 with a standard deviation of 16.70. These tables show that the favourite word list does not reflect a translator signature; it is affected more by the original text than by the translator's individual style.
[Figure 3.4: Vocabulary Richness Measures. Panels (a) R-Index, (b) K-Index, and (c) W-Index each plot the index over the tested texts (Parts 25-30) for the seven translators: Khan, Asad, Daryabadi, Maududi, Pickthall, Sarwar, and Yousif Ali.]

[Table 3.5: Vocabulary Richness Measures]

[Table 3.6: Most Frequent Words Index for the Same Part]

To obtain meaningful information from Tables 3.4 and 3.6, we extracted the values for one translator, Asad, to compare the closeness of his own writings with translations of the same original texts written by others. This comparison is shown in Figure 3.5(a). The first six columns represent the F-Index of the most frequent words for the six other translators, while the last column represents the average F-Index over Asad's own writings. For example, for the text "Part 25", we calculate the average of the F-Index for "Part 25 and Part 26", "Part 25 and Part 27", "Part 25 and Part 28", "Part 25 and Part 29", and "Part 25 and Part 30". The same method is repeated for all other texts. Although Figure 3.5(a) shows that the F-Index of Asad against himself is higher than that of Asad against others, repeating the same analysis on another translator, Pickthall, showed that the F-Index of Pickthall against himself is, on average, comparable to the F-Index of Pickthall against others, as shown in Figure 3.5(b).
We used the same method of analysis as for the most frequent words to extract meaningful information from Tables 3.8 and 3.7, analyzing the results for the translators Asad and Pickthall. Figures 3.6(a) and 3.6(b) show that the FW-Index of translators against themselves is slightly lower than the FW-Index of translators against others. In conclusion, the favourite word list cannot be used to identify translators' individual styles; the translation is affected by the original text rather than by the translator's choices.
Table 3.7: Favorite Words Index for the Same Translator
(column headings give the compared part pairs, e.g. 25-26 = Part 25 vs Part 26)

Translator   25-26 25-27 25-28 25-29 25-30 26-27 26-28 26-29 26-30 27-28 27-29 27-30 28-29 28-30 29-30
Asad            97    80    86    94    79    75    88    64    67    71    91    98    65    66    67
Pickthall       87    76    56    70    58    66    80    71    57    55    82    73    58    60    81
Yousif Ali      54    50    52    61    41    51    64    61    61    54    94    78    46    53    96
Khan            93    74    67    83    61    77    82    92    87    76    94    82    63    62   106
Daryabadi       94    89    71    86    73    84   103    78    83    77    98    91    68    65   100
Maududi         50    55    43    58    37    50    49    55    44    35    73    53    42    43    69
Sarwar          99    81    75    84    69    78    76    80    76    63    95    76    67    55    96
Table 3.8: Favorite Words Index for the Same Part
Part number            Part25  Part26  Part27  Part28  Part29  Part30
Asad-Daryabadi             55      69      88      82      97     124
Asad-Maududi               98     120     139     135     146     158
Asad-Pickthall             67      88     101      98      97     142
Asad-Khan                  64      88     101      98      97     142
Asad-Sarwar                57      82      81      92      93     109
Asad-Yousif Ali            88      83      95     113     125     137
Daryabadi-Maududi          84     117     127     128     165     171
Daryabadi-Pickthall       140     183     182     147     199     224
Daryabadi-Khan             61      74      85      87      80     117
Daryabadi-Sarwar           56      77      90      79     110     131
Daryabadi-Yousif Ali       79     117     109     117     165     157
Maududi-Pickthall          66     112     118     108     125     166
Maududi-Khan               83     109     112     114     119     134
Maududi-Sarwar             82     115     111     122     141     153
Maududi-Yousif Ali        109     115     122     132     152     176
Pickthall-Khan             70      88     107     104     110     136
Pickthall-Sarwar           65      89      85      94     130     145
Pickthall-Yousif Ali       70     108     110     106     135     152
Khan-Sarwar                87     104      92     115     123     127
Khan-Yousif Ali            76      88      97     102     100     120
Sarwar-Yousif Ali          67      91     103     115     123     140
(a) Most Frequent Words for Translator Asad
(b) Most Frequent Words for Translator Pickthall
Figure 3.5: Comparison between Most Frequent Words Index for Translators Asad and Pickthall
(a) Favorite Words Index for Translator Asad
(b) Favorite Words Index for Translator Pickthall
Figure 3.6: Comparison between Favorite Words Index for Translators Asad and Pickthall
In conclusion, changes in the vocabulary richness measures were mostly driven by the original text rather than by differences between the translators. Except for Asad, who has a distinct style, the most frequent words and favourite words measures used in this section were not able to discriminate between different translators.
3.5.2 Vocabulary Richness Measures as Translator Stylometry Features
The main objective of this experiment is to investigate the ability of vocabulary richness measures to discriminate between translators. To evaluate the effectiveness of vocabulary richness as translator stylometry features, we classify texts (as instances) into their translators (as classes) based on vocabulary richness measures (as attributes). Working at the level of parts of the Holy Qur'an as instances gives us only 6 instances (parts) per class (translator). For that reason, we chose to work at the chapter level, which gives us 74 instances per class.
3.5.2.1 Method II
For this experiment, we used five vocabulary richness measures as attributes: N, V, the R-Index, the K-Index, and the W-Index. Their descriptions are given in Section 3.5.1.1.
3.5.2.2 Experiment II
We use two of the most studied classification algorithms in the literature: the decision tree C4.5 and the support vector machine (SVM). We used their implementations in the WEKA data mining software. For C4.5, a decision-tree classification algorithm developed by Quinlan in 1993 [144], we used the pruned weka.classifiers.trees.J48. For SVM, we used weka.classifiers.functions.SMO, which is based on the Sequential Minimal Optimization algorithm for support vector machines [139] [89].
As we have seven translators, we worked with them in pairs using binary classifiers. We used ten-fold cross-validation to evaluate the classifiers. Ten-fold cross-validation is a common statistical method for measuring the predictive performance of a model, in which the data is partitioned into ten non-overlapping sets. Each of these partitions is held out once for testing while the other nine are used for training the model, and the average accuracy over the ten iterations estimates the accuracy of the model. The advantages of this technique include the non-overlapping nature of the training and testing data and the guarantee that each instance in the data appears in the testing set exactly once. Additionally, cross-validation is recommended when the available data for training and testing is limited [195].
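The fold construction and accuracy averaging described above can be sketched as follows. This is a minimal stdlib-only illustration; in our experiments WEKA performs the actual cross-validation, and `evaluate` stands in for any train-and-test routine.

```python
import random

def ten_fold_indices(n_instances, n_folds=10, seed=42):
    """Partition instance indices into n_folds non-overlapping folds."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def cross_validate(n_instances, evaluate, n_folds=10):
    """Hold each fold out once for testing and average the fold accuracies.

    `evaluate(train_idx, test_idx)` is any function that trains on the
    training indices and returns the accuracy on the test indices.
    """
    folds = ten_fold_indices(n_instances, n_folds)
    accuracies = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        accuracies.append(evaluate(train_idx, test_idx))
    return sum(accuracies) / n_folds
```

Each instance lands in exactly one fold, so every instance is tested exactly once over the ten iterations.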
3.5.2.3 Results and Discussion of Experiment II
In this thesis, we evaluate binary classification for each pair of translators, where the training and testing data are balanced (i.e. have equal class distribution) between the two translators. In this case, the baseline for a classifier is the random chance of 50% of labelling the correct class, and anything above 50% is better than random. We therefore chose 2/3, i.e. 66.67%, as a threshold for performing substantially better than the random baseline accuracy of 50%.
Considering a classifier acceptable if it correctly classifies 2/3 of the tested instances, only 6 of the 21 studied cases were acceptable using either C4.5 or SVM with vocabulary richness features, as shown in Table 3.9. All six of these cases are easy to classify because they have Asad in common; as shown in Figure 3.4(b), Asad has a very distinct translation style compared to all the others.
Vocabulary richness measures did not produce acceptable results with either C4.5 or SVM. The overall average accuracy was 55.12% using C4.5 and 54.31% using SVM.
Table 3.9: Classification Results for Vocabulary Richness Measures as Translator Stylometry Features
The Translators' Names   C4.5     SVM
Asad-Daryabadi           76.35%   77.70%
Asad-Maududi             71.62%   74.32%
Asad-Pickthall           81.76%   70.95%
Asad-Raza                76.35%   82.43%
Asad-Sarwar              72.30%   67.57%
Asad-Yousif Ali          68.92%   66.89%
Daryabadi-Maududi        50.00%   50.00%
Daryabadi-Pickthall      47.30%   39.19%
Daryabadi-Raza           47.30%   49.32%
Daryabadi-Sarwar         46.62%   50.68%
Daryabadi-Yousif Ali     51.35%   53.38%
Maududi-Pickthall        43.92%   46.62%
Maududi-Raza             45.27%   45.27%
Maududi-Sarwar           47.30%   45.95%
Maududi-Yousif Ali       47.30%   40.54%
Pickthall-Raza           46.62%   45.95%
Pickthall-Sarwar         46.62%   48.65%
Pickthall-Yousif Ali     45.95%   47.30%
Raza-Sarwar              49.32%   47%
Raza-Yousif Ali          47.97%   47.97%
Sarwar-Yousif Ali        47.30%   43.24%
Average                  55.12%   54.31%
STD                      0.1258   0.1274
3.6 Chapter Summary
In this chapter, we described the choice and design of our dataset and justified our choice of corpus. We then examined our dataset using the same approach evaluated by Mikhailov and Villikka, which included vocabulary richness, most frequent words, and favourite words. We then evaluated vocabulary richness measures as attribution features for translator identification, as they are commonly used in authorship attribution, the closest area of computational linguistics to our problem. Vocabulary richness did not produce acceptable results. Hence, there is a need to find a more appropriate approach to identify translator stylometry.
Chapter 4
Identifying Translator Stylometry Using Network Motifs
4.1 Overview
We started our journey to identify appropriate features for detecting translators with our first experiment, discussed in Chapter 3, in which we evaluated vocabulary richness measures as translator stylometry features. The findings of that experiment encouraged us to evaluate network motif search for the purpose of translator stylometry identification. The first objective of this chapter is therefore to introduce the methodology of using network motif search for translator stylometry identification. The second objective is to evaluate the performance of the network motif approach in comparison with other approaches.
In this chapter, we describe two experiments that we conducted to evaluate the feasibility and performance of network motif search as a source of translator stylometry features. The first is a preliminary experiment, in which we applied network motif search to a subset of the dataset containing 30 samples for two translators. The promising results of the first experiment encouraged us to evaluate the proposed features on the entire dataset for the seven translators. Furthermore, we expanded the social network analysis with a group of features that include global network features and two groups of local network features, where one group includes all possible motifs of size three and the second all motifs of size four.
The accuracy levels obtained in these experiments were much lower than expected. We carefully investigated the problem and identified the propositional nature of the classifier we used as the cause. We overcame this problem with a data transformation that allowed us to keep using a propositional classifier: we changed the data representation fed to the classifier by replacing the discrete motif-frequency values with their rankings. This transformation improved the results dramatically, yielding an average accuracy of 79.02%.
The details of the study are explained in the rest of this chapter. The following section describes the expansion of the dataset used for this analysis. The next section explains the methodology that we used to apply social network analysis techniques to the problem of translator stylometry. Section 4.3 provides brief information on the classifiers used in the experiments conducted in this chapter. Section 4.4 discusses the design and findings of the first experiment. The second experiment's design and results are described in Section 4.5. After that, Section 4.6 describes the data transformation that we performed in order to introduce a different representation of the data. Finally, the last section concludes the chapter with a reflection on the initial problem of translator stylometry and the use of network analysis for identifying it, followed by suggestions for future research in the field.
4.2 Methodology
In this section, we describe how social network analysis can be applied to the problem of translator stylometry identification: how text can be transformed into a network, how network motifs can be extracted, and how other network analysis measures, such as global network features, can be computed.
4.2.1 Data Pre-processing
For the data pre-processing step, the Natural Language Toolkit (NLTK) for the Python programming language was used. First, the text was cleaned of everything other than alphabetic characters, removing punctuation and decimal digits. Then each sentence was tokenized into words, and these words were reduced to their lemmas: "Lemmatization is the process of reducing a word into its base form" [10]. Each lemma was then lowercased. By completing this pre-processing stage, all occurrences of the same word (including its inflections) can be identified and grouped together during the formation of the word adjacency network.
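The steps above (cleaning, tokenization, lemmatization, lowercasing) can be sketched as follows. This is a simplified stdlib-only stand-in for the NLTK pipeline used in the thesis: the toy suffix-stripping `lemmatize` below merely illustrates the role of NLTK's WordNet lemmatizer and is not linguistically accurate.

```python
import re

def lemmatize(word):
    # Toy stand-in for a real lemmatizer: strip a few common English
    # inflection suffixes (illustrative only, not WordNet-quality).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Clean, tokenize, lemmatize, and lowercase one sentence."""
    # Keep alphabetic characters only, dropping punctuation and digits.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", sentence)
    tokens = cleaned.split()  # tokenize on whitespace
    return [lemmatize(token.lower()) for token in tokens]
```

For example, `preprocess("He begetteth not, nor is He begotten; (3)")` strips the verse number and punctuation and lowercases the tokens, so the two occurrences of "He" map to the same node later on.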
4.2.2 Network Formation
To establish the word adjacency network from the dataset, we worked at the Ayah (verse) level. Each word is represented by a node, and each ordered word adjacency is represented by an edge (a link that connects two nodes) going from the first-occurring word to the word that follows it. The frequency of each two-word adjacency is counted and represented by an edge label. The edges here represent an "occurring before" binary relationship.
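The construction can be sketched as follows (a minimal illustration; `verses` is assumed to be a list of token lists, one per Ayah, as produced by the pre-processing step):

```python
from collections import defaultdict

def build_adjacency_network(verses):
    """Build a directed word-adjacency network from tokenized verses.

    Returns (nodes, edges), where edges maps each (word, next_word)
    pair to its frequency, encoding the "occurring before" relation.
    """
    nodes = set()
    edges = defaultdict(int)
    for tokens in verses:  # one token list per Ayah (verse)
        nodes.update(tokens)
        for first, second in zip(tokens, tokens[1:]):
            edges[(first, second)] += 1  # edge label = adjacency count
    return nodes, dict(edges)
```

For instance, feeding it `[["say", "he", "is", "allah"], ["he", "is"]]` yields an edge ("he", "is") with frequency 2, since that adjacency occurs in both verses.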
4.2.3 Features identification
Looking at the linguistics literature on translator stylometry, we found that we should primarily target the lexical level, as this is the level that translators can affect most, compared with the syntactic level and, to a much lesser extent, the semantic level (considering the translator invisibility assumption).
Working at the lexical level, different features can be extracted, such as word n-grams, vocabulary richness, word frequencies, and token-based features [2]. We try to find the linkage between words and how frequently they are used by different translators. We also attempt to extract the frequency of occurrence of patterns of ordered words, known in linguistics as 'lexical chunks', in the text.
4.2.4 Motif extraction
To analyze and compare two networks, we can use their global statistical features, which include shortest path length, global centrality, clustering coefficient, etc., or their structural design principles, such as network motifs. Network motifs, initially introduced by Milo et al. [122], are patterns (represented by small subgraphs) that are repeated in a real network more often than in randomly generated networks.
Motifs are small subgraphs, usually of 3, 4, or 5 nodes. For three connected nodes, there are only 13 distinct possible subgraphs, as shown in Figure 4.2. For four connected nodes, there are 199 distinct possible subgraphs, and 9,364 for five nodes. This study stopped at motifs of size four because increasing the motif size results in an exponential increase in the number of features.
To illustrate how we extract these 13 motifs, we give an example of a network generated from a sample translation by "Yousif Ali" of chapter 112 of the Holy Qur'an. The sample text is "Say: He is Allah, the One and Only; (1) Allah, the Eternal, Absolute; (2) He begetteth not, nor is He begotten; (3) And there is none like unto Him. (4)".
The network formed to represent this sample text is shown in Figure 4.1, and examples of the extracted motifs are shown in Figure 4.2.
For example, motif 7 (M7) represents a relationship between three nodes: a two-way relationship between the left and upper nodes, and a one-way relationship between the right and upper nodes. The first relationship is represented by the ordered appearance of the words "is" and "He" in Aya (3) and, in the other direction, "He" before "is" in Aya (1). The second relationship is represented by the ordered appearance of the words "nor" and "He" in Aya (3).
In motif 8 (M8), the first relationship from motif 7 is unchanged, while there is an additional two-way relationship between the upper and right nodes, represented in Aya (3), where the word "He" appears once before "not" and a second time after it.
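An exhaustive scan for 3-node connected subgraphs can be sketched as below. This is an illustrative stdlib-only census, not the MAVisto/FPF implementation used in the thesis; isomorphism classes are keyed by a canonical 6-bit edge code rather than the M1-M13 numbering, and the brute-force triple enumeration is only practical for small networks.

```python
from itertools import combinations, permutations

def triad_census(nodes, edges):
    """Count connected 3-node directed subgraphs by isomorphism class.

    There are 13 possible connected directed triads; each class is keyed
    by the minimal edge code over all orderings of the three nodes, so
    isomorphic triads share one key.
    """
    edge_set = set(edges)
    census = {}
    for triple in combinations(sorted(nodes), 3):
        # A 3-node subgraph is connected iff at least two of the three
        # node pairs are linked (in either direction).
        linked_pairs = sum(
            1 for u, v in combinations(triple, 2)
            if (u, v) in edge_set or (v, u) in edge_set)
        if linked_pairs < 2:
            continue
        # Canonical form: smallest 6-bit edge code over all 6 node orders.
        canonical = min(
            tuple(int(pair in edge_set) for pair in
                  ((a, b), (b, a), (a, c), (c, a), (b, c), (c, b)))
            for a, b, c in permutations(triple))
        census[canonical] = census.get(canonical, 0) + 1
    return census
```

Counting by canonical code means, for instance, that a path x -> y -> z is grouped with a -> b -> c but not with the 3-cycle, mirroring how each of the 13 motif types is counted separately.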
Figure 4.1: Network Example
We scanned the network for all of these possible subgraphs and counted each of them. As illustrated earlier, we conduct two different experiments in this chapter. For the first experiment, which covered only a subset of our corpus, we used a network motif detection tool called MAVisto [160]. MAVisto scans the network for all possible motif subgraphs and counts them, even if they overlap. It uses the Frequent Pattern Finder (FPF) algorithm, which searches the network for occurrences of patterns of a target size under a given frequency concept [159]. However, for our second experiment, which included the entire dataset, MAVisto showed very slow performance, for two reasons. First, in the second experiment we searched for motifs of both size three and size four, while in the first experiment we searched for motifs of size three only. The second reason relates to sample size: the second experiment covers the entire dataset, and the text of some chapters is significantly larger than that of the chapters included in the first experiment. Details regarding the sample sizes of the texts are available in the appendix. To overcome the slow
Figure 4.2: All Possible 3 Nodes Connected Subgraph
Heba El Fiqi November 6, 2013 CHAPTER 4. IDENTIFYING TRANSLATOR STYLOMETRY USING NETWORK MOTIFS 94 performance of MAVisto, we used another motifs detection tool called Mfinder [86] for our second experiment. Mfinder uses an algorithm that is explained in details in Kashtan et al research in 2004 [85].
Comparing Figures 4.1 and 4.3 shows the variation in network complexity due to changes in network size. While the text represented in the first network is 26 words long, the second network, which represents the translation of chapter 80 by "Yousif Ali", is built from a text of 359 words.
Figure 4.3: Network of Chapter 80 by “Yousif Ali”
4.2.5 Randomization
To randomize the network, a random local rewiring algorithm is used. This algorithm keeps the degrees of the vertices constant. This is done by reshuffling the links: if A is connected to B (A ⇒ B) and C is connected to D (C ⇒ D), the algorithm replaces the old links with a link from A to D and a link from C to B. Before applying the new links, it checks whether these links already exist in the network; if so, the swap is skipped and the algorithm attempts to find other links. This check is necessary to prevent multiple links connecting the same vertices [111]. The process is repeated a number of times well in excess of the total number of edges in the system to generate a randomized network [112]. We generated 500 randomized networks for each sample when conducting our experiments.
4.2.6 Significance test
To calculate the Z-score of a motif, we need the average and standard deviation of its occurrences across all randomized networks. The z-score is the difference between the frequency of the motif in the target network and its mean frequency over the generated randomized networks, divided by the standard deviation of its frequency over those networks.
Since we are testing at the 95% confidence level, the normal z range is from −1.96 to +1.96. If the z-score falls outside this range, the motif is significant.
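The computation defined above can be sketched directly (a stdlib-only illustration; `random_counts` would hold the motif's frequency in each of the 500 randomized networks):

```python
from statistics import mean, stdev

def motif_z_score(real_count, random_counts):
    """Z-score of a motif: how far its frequency in the real network
    lies from its mean frequency over the randomized networks."""
    mu = mean(random_counts)
    sigma = stdev(random_counts)
    return (real_count - mu) / sigma

def is_significant(z, critical=1.96):
    """Two-tailed test at the 95% confidence level."""
    return abs(z) > critical
```

A motif that occurs 30 times in the real network but around 10 times in the randomized ones would thus receive a large positive z-score and be flagged as significant.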
4.2.7 Global Network Features
Among the different global network features, we chose the nine most common ones as classification attributes: average degree, density, clustering coefficient, transitivity, modularity, betweenness, characteristic path length, and diameter. All of these measures were evaluated using the Brain Connectivity Toolbox [150]. Their definitions are described in Section 2.3.4.1 in the background chapter.
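As an illustration, the two simplest of these measures can be computed directly from the node and edge sets (a sketch only; in the thesis all measures are computed with the Brain Connectivity Toolbox, and degree conventions for directed networks vary):

```python
def average_degree(nodes, edges):
    """Average total degree of a directed network: each directed edge
    contributes one out-degree and one in-degree, giving 2|E| / |N|."""
    return 2 * len(edges) / len(nodes)

def density(nodes, edges):
    """Fraction of the n*(n-1) possible directed edges that exist."""
    n = len(nodes)
    return len(edges) / (n * (n - 1))
```

For a three-node network with edges a -> b and b -> c, this gives an average degree of 4/3 and a density of 1/3.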
4.3 Classifiers
In order to evaluate the frequency of network motifs as stylometry features using existing classification algorithms, we apply a number of well-known classifiers. The reason for starting with this many classifiers is to overcome possible limitations of individual classifiers that might mislead our investigations. These classifiers are:
1. FT: The Functional Tree (FT) classifier was introduced by João Gama in 2004 [53]. It uses decision tests based on combinations of attributes. FT is used with its default WEKA parameters: the minimum number of instances at which a node is considered for splitting is 15, and the number of fixed LogitBoost iterations is 15 with no weight trimming.
2. NBTree: This decision tree uses Naive Bayes classifiers at the leaves. This hybrid classifier was first presented by Kohavi in 1996 [93].
3. Random Forest: This classifier generates a number of random decision trees that form a forest used for classification [24]. We used this classifier with no constraints on the maximum depth of the trees or the number of attributes used, while the number of trees is limited to 10.
4. Random Tree: A decision tree in which a random subset of attributes is considered at each node during construction. We used the default parameters for this classifier, which include no backfitting or pruning.
5. FURIA: The Fuzzy Unordered Rule Induction Algorithm (FURIA) is an extension of the well-known RIPPER algorithm [36] that uses fuzzy rather than conventional rules. The algorithm was introduced in 2009 by Hühn and demonstrated promising results on some types of classification problems [64, 76]. The parameters used include setting the minimum total weight of the instances in a rule to two, running the optimization procedure twice, and using an error rate ≥ 0.5 as the stopping criterion.
6. FLR: The Fuzzy Lattice Reasoning (FLR) classifier was initially introduced by Athanasiadis in 2003 [17]. It uses fuzzy lattices to create a reasoning environment, with rules induced in a mathematical lattice data domain. FLR has the advantage that it can be incremental, and it is also able to deal with missing data [80]. The vigilance parameter was set to 0.5 in our experiments.
7. Logistic Classifier: This classifier uses a multinomial logistic regression model with a ridge estimator, based on the model of le Cessie and van Houwelingen [34].
4.4 Experiment I
The purpose of this preliminary experiment is to explore the feasibility of using network motifs as stylometric features. For that reason, we chose to start with the last 30 chapters of the 74 chapters in our dataset. Another reason for choosing these chapters is that they are also the smallest; thus, the pilot study can assess the efficiency of the methods on a limited corpus. We also limited the number of classes to two. We therefore chose two parallel translations, in addition to their original texts: the first by Muhammad Marmaduke Pickthall and the second by Abdullah Yusuf Ali. These two translations are among the few that were assessed by Khaleel Mohammed in 2005 [124] as being among the most accurate.
First, to address the normalization problem, we conducted five different tests using the parameters defined in Table 4.1. In the first test, we considered absolute values: the motif attributes represent f(xi) and f(yi), so we have 13 attributes representing the 13 motifs and one attribute representing the class label, with 40 samples for training and 20 samples for testing under class balance. The second test used the same configuration as the first, but we tried to minimise each translator's bias towards his own writing; thus, the motif attributes represent f(xi)/Σf(xi) for the first translator and f(yi)/Σf(yi) for the second translator. In the third test, the motif attributes represent f(xi)/f(zi) and f(yi)/f(zi); the conjecture behind this is to minimise the bias
Table 4.1: Experiment I Parameters
f(xi): the frequency of each motif in sample i for the first translator.
Σf(xi): the summation of all motif frequencies in sample i for the first translator.
f(yi): the frequency of each motif in sample i for the second translator.
Σf(yi): the summation of all motif frequencies in sample i for the second translator.
f(zi): the frequency of each motif in sample i for the original Arabic text.
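The normalizations used in tests 1 to 4 can be sketched as follows. Function names are illustrative, and the combined normalization of test 4, whose formula is partly unreadable in the source, is rendered under one plausible reading that divides by both the sample total and the (incremented) Arabic counts:

```python
def absolute(f_x):
    # Test 1: raw motif frequencies.
    return list(f_x)

def translator_normalized(f_x):
    # Test 2: divide by the sample total to reduce translator bias.
    total = sum(f_x)
    return [v / total for v in f_x]

def text_normalized(f_x, f_z):
    # Test 3: divide by the Arabic motif counts (incremented by 1 to
    # avoid division by zero) to reduce original-text bias.
    return [v / (z + 1) for v, z in zip(f_x, f_z)]

def doubly_normalized(f_x, f_z):
    # Test 4 (one plausible reading of the formula): normalize against
    # both the sample total and the Arabic counts.
    total = sum(f_x)
    return [v / (total * (z + 1)) for v, z in zip(f_x, f_z)]
```

Here `f_x` is the 13-element motif-frequency vector of a translated sample and `f_z` the vector of the corresponding Arabic text, following the definitions in Table 4.1.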
Second, to address the sample-size problem, we repeated the previous experiments with 10 sample chapters per translator for training and 10 samples for testing, and compared the results with the corresponding results from Experiment I(a). The comparison is between an experiment with 10 samples as the training size and one with 20 samples as the training size. In both cases, we kept the test sample constant for a fair comparison: the test size is 10 sample chapters per translator, a total of 20 chapters covering both translators. We denote this second experiment Experiment I(b).
The third question was about class balance. To address it, we conducted Experiment I(c), in which we repeated the first experiment but randomized the choice of classes for training and testing, and compared the results with those of the balanced experiment. Since our 30 samples yield 60 (30 × 2) instances, and we are considering an imbalanced problem, we randomly chose 40 of the 60 instances for training and used the rest for testing.
4.4.1 Results and analysis of Experiment I
Looking at Figure 4.4, when we calculated the means over the 30 samples for each translator, we found that the occurrence of the motifs differs for each translator: the second translator has the highest average for all motifs.
Figure 4.4: Comparison between the Average of the 13 Motifs for the Arabic Text, the First English Translation, and the Second English Translation
4.4.1.1 Paired T-test
We applied a paired t-test to the motif frequencies of each paired sample text, where a pair represents two translations of the same original text. The two-tailed critical t value for alpha = 0.05 and a sample size of 30 is 2.04523.
Table 4.2 displays the t value calculated for each motif. All of them show that the differences between the frequencies for the two translators are significant.
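The paired t statistic used here can be computed as follows (a stdlib-only sketch; `sample_a` and `sample_b` are assumed to hold one motif's frequencies across the 30 paired texts, one value per translation):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(sample_a, sample_b):
    """Paired t statistic: the mean of the pairwise differences divided
    by its standard error, with n equal to the number of pairs."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```

The resulting |t| is then compared against the two-tailed critical value (2.04523 for alpha = 0.05 with 30 pairs, i.e. 29 degrees of freedom).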
4.4.1.2 Correlation between motifs
We calculated the Pearson correlation coefficient between the motifs for each translator to find how often these motifs appear concurrently, and then compared the two translators to each other. We summarize the important differences in Table 4.3. Since these motifs represent links between words in the text, and the correlation represents how often these motifs occur together, the difference between these correlations indicates a difference in the translators' styles.

Table 4.2: Paired T-Test between Frequencies of Motifs for the Two Translators

Motif   t value
M1      5.21757
M2      5.36426
M3      4.71189
M4      5.48586
M5      4.39642
M6      5.46861
M7      5.28081
M8      4.65985
M9      4.42133
M10     3.93762
M11     4.36377
M12     4.94248
M13     2.75729
Table 4.3: Correlation between Frequencies of Motifs

Motif pair    First Translator    Second Translator
              "Pickthall"         "Yousif Ali"
(M2, M13)     0.768               0.523
(M4, M13)     0.910               0.550
(M6, M13)     0.816               0.495
(M7, M13)     0.911               0.640
(M9, M13)     0.946               0.711
(M11, M13)    0.978               0.751
(M12, M13)    0.975               0.752
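The coefficients in Table 4.3 are plain Pearson correlations over the paired samples; a minimal standard-library sketch (with made-up frequencies) is:

```python
import math
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Per-chapter counts of two motifs for one translator (illustrative):
# a value near 1 means the motifs tend to rise and fall together.
m11 = [10, 14, 9, 22, 17, 12]
m13 = [5, 8, 4, 12, 10, 6]
r = pearson(m11, m13)
```

Comparing `r` across translators for the same motif pair is what reveals the stylistic difference.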
4.4.1.3 Experiment I(a)
For Experiment I(a), as Table 4.4 shows, the best results are obtained by the Functional Tree and Logistic classifiers on test 5, which includes the numbers of nodes and edges together with the motif frequencies as inputs. The accuracy of these classifiers is 70%, the same accuracy achieved by NBTree and FLR on test 3, where the data is normalised against the original-text bias.
Table 4.4: Accuracy of the Different Classifiers for Experiment I(a)
Test    Functional Tree   NBTree   Random Tree   FURIA   Random Forest   FLR    Logistic
Test1   60%               50%      55%           25%     50%             50%    55%
Test2   65%               50%      65%           65%     50%             60%    45%
Test3   65%               70%      45%           60%     55%             70%    60%
Test4   50%               50%      50%           50%     55%             60%    45%
Test5   70%               50%      60%           50%     50%             50%    70%
4.4.1.4 Experiment I(b)
For Experiment I(b), we test with a smaller training sample: 10 chapters per translator. We noticed that the results of the Logistic classifier improved relative to Experiment I(a), while the Functional Tree accuracy decreased. The best accuracy for this experiment remained 70%, as shown in Table 4.5. This level of accuracy is obtained by the Random Tree and FLR classifiers on test 2, where the data is normalized against the translator bias, and by the Logistic classifier on test 5, matching its result in the first experiment.
Table 4.5: Accuracy of the Different Classifiers for Experiment I(b)
Test    Functional Tree   NBTree   Random Tree   FURIA   Random Forest   FLR    Logistic
Test1   50%               50%      60%           55%     50%             50%    65%
Test2   50%               50%      70%           60%     60%             70%    60%
Test3   50%               50%      60%           65%     55%             65%    60%
Test4   50%               50%      60%           45%     50%             55%    60%
Test5   50%               50%      50%           65%     60%             50%    70%
4.4.1.5 Experiment I(c)
In the third experiment, Experiment I(c), we addressed the class-imbalance problem. The 40 training instances comprised 24 for the first translator, "Pickthall", and 16 for the second translator, "Yousif Ali". For testing, we used 20 instances: 6 for "Pickthall" and 14 for "Yousif Ali".
While measures such as the Kappa statistic and the AUC (area under the curve) are more suitable for imbalanced classes, the degree of class imbalance in this thesis is not large; the use of accuracy as a performance measure therefore remains appropriate.
We noticed that most of the obtained accuracies were only 30%, as shown in Table 4.6: those classifiers assigned all instances to the first translator. The only classifier that gave acceptable results was FLR, with 70% accuracy on test 3, where the data is normalized against the original text. In general, the Random Forest classifier and FURIA consistently failed to detect the differences between the translators' styles. On the other hand, the FLR classifier produced acceptable results in all three experiments. Tests 1 and 4 failed to identify the translators' styles; in test 1 the data is not normalized, and in test 4 it is normalized against both the translator bias and the original-text bias.

Table 4.6: Accuracy of the Different Classifiers for Experiment I(c)

Test    Functional Tree   NBTree   Random Tree   FURIA   Random Forest   FLR    Logistic
Test1   45%               30%      25%           60%     30%             30%    40%
Test2   60%               30%      30%           20%     30%             30%    50%
Test3   45%               45%      45%           50%     55%             70%    35%
Test4   30%               30%      40%           30%     45%             40%    40%
Test5   45%               30%      50%           35%     30%             30%    45%
Both test 3 and test 5 produced acceptable classifiers three times over the overall experiments. This indicates the importance of normalizing the data against its original-text bias, and of including the numbers of nodes and edges among the attributes, as they may help the classifier normalize the data implicitly.
Overall in Experiment I, an accuracy of 70% was achieved multiple times while investigating the performance of network motifs for the translator stylometry identification problem. Among the seven classifiers used, the Fuzzy Lattice Reasoning (FLR) classifier performed best. Additionally, normalizing the translated text against its original source outperformed the other normalization methods investigated in this experiment.
4.5 Experiment II
In this experiment, we use the entire dataset to expand our investigation. We also evaluate global network features in addition to the local network features represented by the motifs. Furthermore, in the motif search we consider motifs of size four in addition to the motifs of size three used in the previous experiment.
In this experiment, we have three groups of features: the first comprises 13 attributes, all possible network motifs of size three; the second comprises 199 attributes, all possible network motifs of size four; and the third comprises nine attributes, the selected global network features. All three groups were used as attributes for the same classifiers as in Experiment I, in addition to C4.5 and SVM, which were discussed earlier in Chapter 3. We fed these classifiers 74 instances (chapters) for each of the seven translators, and used ten-fold cross-validation for evaluation.
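The ten-fold protocol can be sketched as follows; this is a generic split, not the thesis code, and the instance count is illustrative (e.g. two translators with 74 chapters each in one pairwise run):

```python
import random

def ten_fold_indices(n_instances, seed=1):
    """Shuffle instance indices and deal them into ten disjoint folds;
    each fold serves once as the test set while the other nine train."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    return [idx[k::10] for k in range(10)]

folds = ten_fold_indices(148)  # e.g. 74 chapters x 2 translators
test_fold = folds[0]           # the remaining nine folds form the training set
```

The reported accuracy is the average over the ten held-out folds.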
4.5.1 Results and Discussion of Experiment II
When we ran the experiment, we expected network motifs to produce good results, based on the findings of Experiment I, where we classified two translators using network motifs of size three. Here, however, we attempt to distinguish seven translators, and the results were contrary to our expectation: both network motifs and global network features failed to identify translator stylometry, as shown in Tables 4.7, 4.8, and 4.9. For the global features, Table 4.7 shows that the best average accuracy was obtained by the FT classifier.
That accuracy was 52.57%, which means global features cannot be used to identify translator stylometry. Although network motifs of size three showed promising results in the previous experiment, they failed here: the best average accuracy was 52.06%, obtained by the Logistic classifier, as shown in Table 4.8. Furthermore, motifs of size four produced similar results, with a maximum average accuracy of 52.43%, also obtained by the Logistic classifier. The detailed results for motifs of size four as stylometry features are displayed in Table 4.9.
We had tried to identify repeated patterns, that is, lexical chunks repeated by one translator more often than by others. Network motifs are likewise used to detect subgraphs that occur more frequently in a network than in random networks. Theoretically, therefore, network motifs should be able to distinguish translators.
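As a concrete illustration of the text-to-network step, a common word-adjacency construction links each word to the word that follows it; this is one plausible scheme, and the thesis's exact construction may differ in details such as tokenisation or edge direction:

```python
def adjacency_network(text):
    """Nodes are the distinct words; a directed edge joins each word to
    its immediate successor in the running text."""
    words = text.lower().split()
    nodes = set(words)
    edges = set(zip(words, words[1:]))
    return nodes, edges

nodes, edges = adjacency_network("the cat sat on the mat")
# 5 distinct words and 5 distinct adjacent pairs; motif mining then counts
# the small subgraphs (size three or four) occurring in such a network.
```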
We found that the network size varied over a wide range between instances, and the number of subgraphs can vary just as widely. In our dataset, the networks ranged from 19 nodes (as in Pickthall: Chapter 109, Sarwar: Chapter 112, Yousif Ali: Chapter 112) to 597 nodes (as in Asad: Chapter 42), and from 64 edges (as in Sarwar: Chapter 112) to 29,851 edges (as in Asad: Chapter 42). Because the number of subgraphs is strongly affected by network size, using the raw motif counts directly misleads the classifiers. The classifiers failed to detect a relation such as M3(A1) > M3(A7), meaning that translator A1 uses the pattern of motif 3 more than translator A7 does; a decision tree classifier can, by contrast, identify a relation such as M3(A1) > 100. Since we were interested in the first type of relation, which identifies a translator's preferred patterns, we needed a way to solve this problem.
Table 4.7: Classification Results for Network Global Features as Translator Stylometry Features

[Accuracies of the nine classifiers (C4.5, SVM, FT, NBTree, Random Forest, Random Tree, FURIA, FLR, Logistic) for each of the 21 translator pairs, with per-classifier averages and standard deviations; the best average accuracy is 52.57%, obtained by FT.]
Table 4.8: Classification Results for Network Motifs of Size Three as Translator Stylometry Features

[Accuracies of the nine classifiers (C4.5, SVM, FT, NBTree, Random Forest, Random Tree, FURIA, FLR, Logistic) for each of the 21 translator pairs, with per-classifier averages and standard deviations; the best average accuracy is 52.06%, obtained by the Logistic classifier.]
Table 4.9: Classification Results for Network Motifs of Size Four as Translator Stylometry Features

[Accuracies of the nine classifiers (C4.5, SVM, FT, NBTree, Random Forest, Random Tree, FURIA, FLR, Logistic) for each of the 21 translator pairs, with per-classifier averages and standard deviations; the best average accuracy is 52.43%, obtained by the Logistic classifier.]
4.6 Different Representations of Network Motifs for Detecting Translator Stylometry
The findings of the previous experiment showed the need for another way of representing the data before classification. The previous representation was a univariate approach, in which each decision was based on a single variable, as in the example M3(A1) > 100. We instead needed a multivariate approach that can relate two variables, as in M3(A1) > M3(A7), which can be expressed as M3(A1) - M3(A7) > 0. This is not a traditional multivariate classifier, since the comparison is not between different attributes but between the same attribute for different translators.
In this section, we introduce another representation of the data that reduces this multivariate comparison to single values usable by the classifiers.
4.6.1 Method III
To express this relationship, we grouped the translated texts by their original sources: the seven translations of chapter 41 form the first group, the seven translations of chapter 42 form the second group, and so on. Then, within each group, we compared each motif (e.g., motif id 1) across all translators and replaced each frequency with the translator's rank. For example, if for a piece of text M3 is 10 for author A1, 20 for author A2, and 30 for author A3, we replace these frequencies with rank 3 for A1, 2 for A2, and 1 for A3. Here, rank 1 for A3 means that A3 has the highest M3 count for this piece of text.
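The ranking step can be sketched as follows; this toy version ignores ties, which would need an explicit convention:

```python
def rank_within_group(freqs):
    """Replace each translator's motif frequency with its rank inside one
    group of translations of the same chapter (rank 1 = highest count)."""
    order = sorted(freqs, key=freqs.get, reverse=True)
    return {translator: rank for rank, translator in enumerate(order, start=1)}

ranks = rank_within_group({"A1": 10, "A2": 20, "A3": 30})
# {"A3": 1, "A2": 2, "A1": 3}: A3 uses this motif most often
```

Because ranks depend only on the ordering within a group, they are insensitive to the network-size effects that misled the classifiers on raw counts.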
4.6.2 Experiment III
We applied the proposed method to both the motifs of size three and the motifs of size four, using the same classification algorithms and dataset as in the previous experiment. We evaluated the attributes in five groups. The first group contains 13 attributes (all 13 possible motifs of size three). The second contains 15 attributes: those of the first group plus the numbers of nodes and edges for each instance. The third contains 199 attributes (all 199 possible motifs of size four). The fourth contains 201 attributes: those of the third group plus the numbers of nodes and edges. The fifth contains 214 attributes: all possible motifs of size three and size four plus the numbers of nodes and edges.
4.6.3 Results and Discussion of Experiment III
The averaged results of the classifiers built from the five groups of attributes were acceptable, as shown in Table 4.10, ranging from 75% to 79.02%. Moreover, some individual classifiers performed very well, reaching 97.97% accuracy, as for the pairs (Asad, Daryabadi) and (Asad, Pickthall). On the other hand, some pairs of translators could not be distinguished by any of the five attribute groups: (Daryabadi, Pickthall), (Maududi, Yousif Ali), and (Maududi, Sarwar). Generally, the SVM outperformed the C4.5 decision tree. Comparing the five attribute groups, the best accuracy was achieved by the fifth group (all motifs of size three and four plus the numbers of nodes and edges) using the SVM classifier, although this accuracy was not much higher than the SVM's on the other feature groups. In the best cases, 16 of the 21 translator pairs yielded acceptable classifiers using the same group of features.
Such an accuracy of 79% is enough to say that translators do have styles of their own. These styles, the result of individual behaviour, can be used to identify them. We were thus able to answer our first question, on the existence of translator style: our results provide evidence that translators can be identified through their individual styles. This research also discussed possible features for distinguishing translator stylometry. The use of network motifs here can be seen as capturing patterns at the lexical level while being slightly affected by the syntactic level, which may direct translator stylometry research towards further features at the lexical and syntactic levels. Finally, this accuracy suggests that network motifs can be used to identify translator stylometry.
4.7 Chapter Summary
Studying "translator stylometry" is a non-trivial task. While research in linguistics and the social sciences has discussed the characteristics of individual translators, this line of research is limited in computational linguistics. Contrary to previous claims that this problem admits no automatic solution, this chapter presented a first attempt to counter that belief. We demonstrated that translators cannot disappear under the skin of the original authors: they have identities of their own, and different translators render the same idea in different ways. Although some existing authorship-attribution techniques could not capture these differences [121, 154], this work shows that social network analysis can differentiate between translators' styles. Detecting network motifs shows that each translator uses certain patterns while writing.
In the first experiment, the proposed method produced a classifier that assigns translated texts to their translators with 70% accuracy. The Fuzzy Lattice Reasoning (FLR) classifier gave the best results among the seven tested classifiers, and normalising the data against its original text and the network size offered promising results. Further analysis is needed to explore this research area.
Although network motifs as stylometry features failed to identify translators in the second experiment, representing the data as ranks, to express how differently the same pattern is used by different translators, produced promising results in the third experiment. Some of the generated classifiers achieved 97.97% accuracy, while the overall average reached 79.02%.
The first contribution of this chapter is providing further evidence for the existence of translator stylometry using computational linguistics and data mining. The second contribution is demonstrating the effectiveness of network motifs in detecting translator stylometry. Both contributions encourage further studies in translator stylometry identification. Future research could extend the use of network motifs with other linguistic features, such as searching for syntactic structures in word-adjacency networks, or looking for network motifs in networks formed at the syntactic level.
Table 4.10: Classification Results for Using Motifs Size Three and Motifs Size Four with Ranking as Translator Stylometry Features

[C4.5 and SVM accuracies for each of the 21 translator pairs over the five attribute groups (motifs of size three; size three plus nodes and edges; size four; size four plus nodes and edges; sizes three and four plus nodes and edges), with averages, standard deviations, and the number of pairs exceeding the 66.67% acceptability threshold (up to 16/21); the best average is 79.02%, and the best individual pairs reach 97.97%.]
Chapter 5
Translator Identification as a Pair-Wise Comparative Classification Problem
5.1 Overview
In this chapter, we present a new type of classification problem that we call the Comparative Classification Problem (CCP). We use the term data record to refer to a block of instances, while acknowledging that the wider literature may distinguish between "records" and "instances". Given a single data record with n instances for n classes, the CCP is to map each instance to a unique class. This problem occurs in a wide range of applications where the independent-and-identically-distributed assumption breaks down; the interdependency in the data poses challenges if the problem is handled as a traditional classification problem.
In the Pair-Wise CCP (PWCCP), two different measurements, each belonging to one of two classes, are grouped together. The classification problem is to decide, given these two measurements, which measurement belongs to which class. The key difference between PWCCP and traditional binary problems is that hidden patterns can only be unmasked by comparing the instances as pairs.
Paired datasets contain repeated measurements of the same attribute, and the changes in the values of the paired instances hold information pertinent to the data mining process.
In this chapter, we introduce a new algorithm, PWC4.5, based on the traditional C4.5 decision tree classifier, that handles PWCCP by dynamically inducing relationships between paired instances. We evaluated PWC4.5 on synthetic datasets to understand the performance of the algorithm.
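PWC4.5 itself is described in Section 5.3. As a minimal illustration of why pairing matters, one simple comparative representation keeps only the direction of each per-attribute difference within a paired record; this is a sketch of the general idea, not the actual algorithm:

```python
def pair_signs(instance_a, instance_b):
    """For one paired record, return the sign (+1, 0, -1) of each
    per-attribute difference between the two instances."""
    return [(a > b) - (a < b) for a, b in zip(instance_a, instance_b)]

# Two translations of the same chapter, each described by three motif
# counts (illustrative values).
signs = pair_signs([120, 45, 9], [98, 45, 14])
# [1, 0, -1]: the first motif favours the first instance, the third the second
```

Such relative features survive even when the absolute values of both instances drift together, which is exactly the situation a traditional per-instance classifier cannot exploit.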
In the next section, we discuss existing classification algorithms and how they take into account the relationships between attributes while ignoring the relationships between instances, using examples to demonstrate the differences. In Section 5.3, PWC4.5 is discussed in detail. The experimental design and results are discussed in Section 5.4.
5.2 From Classification to Comparative Classification
Repeated measurements are observations taken from the same or different subjects on the same phenomenon over time or under different circumstances. Examples of repeated measurements for the same subject in the medical domain include weight loss or the reaction to a drug over time, or a series of blood test results for the same patient in a progressive disease.
Limited research exists on classification problems for paired data. Brenning and Lausen [26] and Adler et al. [4, 5] focused on the medical domain; their research emphasized the need to use repeated measurements per subject when data is limited. They used different resampling-based methods: Brenning and Lausen [26] used an ensemble of k decision tree classifiers, while Adler et al. [4, 5] used k-fold cross-validation to resample the k observations taken from the same subject.
The above literature still treats instances independently, even though the underlying measurements may come from the same subject and a level of dependency is therefore expected. In some applications, as in the forensic sciences, that dependency can be leveraged to improve the detection rate. In this study, we focus on computational linguistics, where we may have linguistic measurements for parallel translations of the same text. Such "paired data" breaks away from the independent-and-identically-distributed assumption that underlies many classifiers.
Let us now define our problem using this example; we will later provide a general formal definition. Assume chapters o1, ..., om, with m the number of chapters, represent the original text in the Arabic language. Let v1, ..., vm be the corresponding translations, with the cardinality of each vector v, |v| = n, representing the number of translators per chapter. Let us also assume that we know the translators, so that the order of elements in v follows the order of the translators c1, ..., cn.
Given additional translations Vm+1, ..., Vm+u, where V is now a set (i.e., its elements are unordered), the problem is to map every element of V to its corresponding translator ci. This can be seen as ordering V into v, or simply as finding a bijection from V onto the set of translators c1, ..., cn. We call this class of problems CCP, and when n = 2 we call it PWCCP.
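In this formulation, a CCP solver searches over the n! candidate bijections for one record; enumerating them is trivial for small n (an illustrative sketch):

```python
from itertools import permutations

def candidate_assignments(instances, translators):
    """Every bijection from the unordered instances onto the translators;
    a CCP classifier scores each candidate and keeps the best."""
    return [dict(zip(perm, translators)) for perm in permutations(instances)]

assignments = candidate_assignments(["v_a", "v_b"], ["c1", "c2"])
# PWCCP (n = 2): exactly two candidate bijections to choose between.
```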
In CCP, the data is clearly not independent: while in the above example each translator translated the text independently, they all translated the same piece of text. Traditional classification trees and relational learning techniques are not designed to solve this problem.
Data mining searches for hidden patterns in a dataset using supervised, unsupervised, or semi-supervised machine learning techniques. Supervised methods generate a model that represents the relationship between input attributes (independent variables) and a target attribute (dependent variable); they include classification models, for a nominal or categorical target attribute, and regression models, for a continuous numeric target attribute. Classification has been widely used to assist decision-making processes in various applications, and among the many classification techniques in the literature, decision tree algorithms are considered the most widely used learning algorithms in practice [123].
Decision trees are inductive inference algorithms that approximate discrete-valued functions as trees. Algorithms in this class assume that a concept is a disjunction of conjunctions of attribute values; they use observed examples to generate rules that generalize to unobserved examples. Decision tree learning normally relies on a heuristic, one-step, non-backtracking search through the hypothesis space. The general idea behind the construction of decision trees is a divide-and-conquer strategy that splits a training dataset into homogeneous subgroups according to attribute-based logical tests, in a greedy top-down approach.
Decision trees consist of nodes and branches. Nodes are of two types: a leaf node, which holds a class label, or a decision node, which holds the attribute selected for splitting the data into subgroups; the outcomes of the attribute-based logical test are represented on the branches leaving the decision node. Decision trees are deterministic algorithms: they use sharp boundaries, with no notion of degree of membership as in some other learning algorithms. Because the divide-and-conquer algorithm partitions the data until every leaf contains cases of a single class, or until further partitioning is impossible, the tree will correctly classify all training cases if there are no conflicting cases. This so-called overfitting is generally thought to lead to a loss of predictive accuracy in most applications [145].
Overfitting can be avoided by a stopping criterion that prevents some sets of training cases from being subdivided, or by removing some of the structure of the decision tree after it has been produced, a process known as pruning. Different pruning strategies are used by decision tree algorithms to overcome overfitting of the training data, adding robustness to the algorithms. This robustness, in addition to the intuitive representation and ease of interpretation of decision trees, has attracted many data mining researchers as well as non-technical experts to use them in a wide range of applications. Not all decision tree algorithms can handle continuous values. Thus, a discretization process is applied to each continuous attribute before such an algorithm is used on a classification problem with continuous attribute(s), in which the continuous attribute values are transformed into intervals. Examples of discretization algorithms include ChiMerge [90], Entropy-MDLC [47], and Heter-Disc [104].
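As a concrete illustration of the interval idea (not of any of the cited algorithms), a minimal equal-width discretizer might look as follows; the function name and binning scheme are illustrative only:

```python
def discretize(values, n_bins):
    """Equal-width discretization: map each continuous value to an
    interval index (a much simpler stand-in for ChiMerge-style methods)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant attribute
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(discretize([0.1, 0.4, 0.5, 0.9], 2))  # → [0, 0, 1, 1]
```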
Many decision tree algorithms are well known in the area of data mining. These include the algorithms introduced by Quinlan: ID3 (Iterative Dichotomiser 3) in 1986 [145] and its successor C4.5 in 1993 [144], which was later improved into the more memory-efficient commercial system C5.0. Other decision tree algorithms include CHAID (Chi-square Automatic Interaction Detection), developed by Kass in 1980 [87]; CART (Classification And Regression Tree), developed by Breiman et al. in 1984 [25]; LMDT, by Brodley and Utgoff in 1995 [27]; SPRINT, by Shafer et al. in 1996 [161]; SLIQ, by Mehta et al. in 1996 [116]; and QUEST (Quick, Unbiased, Efficient, Statistical Tree), by Loh and Shih in 1997 [105].
Different splitting criteria have been used in decision tree algorithms. Splitting criteria can be characterized either by their origin of measure, such as information theory, dependence, and distance, or by their measure structure, such as impurity-based criteria, normalized impurity criteria, and binary criteria. The ID3 algorithm uses an impurity-based criterion called information gain, which is based on the concept of entropy. Another impurity-based criterion is the Gini index, which measures the divergence between the probability distributions of the target attribute's values; the CART algorithm uses the Gini index as its splitting criterion. An impurity-based criterion is biased towards input attributes with many outcomes: although such an attribute would obtain the highest information gain, it may generalize with poor accuracy. Thus, a normalized impurity criterion called gain ratio was introduced by Quinlan in the C4.5 algorithm. Rank Mutual Information (RMI) is a more recent splitting criterion, developed in 2012 by Hu et al. [75]. RMI-based decision trees are designed for a special type of classification problem known as monotonic classification, merging the robustness of Shannon's entropy with rough sets.
5.2.1 Inner vs Outer Classification
Classification algorithms can be divided into two categories: outer algorithms and inner algorithms [92]. The first category (outer algorithms) contains algorithms that use a certain function or approach to approximate the boundaries of each class. Examples of this group include neural networks, decision trees, and classification rules. The second category (inner algorithms) contains algorithms whose objective is to cluster the data into class groups. The boundary of each group is defined by a certain measure, such as a mean or median, and a new instance is then classified into the class of the nearest group [92]. The K-Nearest Neighbours algorithm is an example of an inner classification algorithm. Figures 5.1 and 5.2 show the differences between the outer and inner classification approaches.
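The contrast between the two categories can be sketched on a toy dataset; the data, the threshold, and the nearest-centroid rule below are illustrative assumptions, not taken from [92]:

```python
import math

# Toy training data: two classes in a 2-D feature space.
train = {
    "c1": [(20, 30), (25, 35), (30, 25)],
    "c2": [(70, 80), (75, 70), (80, 85)],
}

def outer_classify(point):
    """Outer style: an explicit decision boundary (here a single
    axis-aligned threshold, as a decision tree stump would learn)."""
    return "c1" if point[0] <= 50 else "c2"

def inner_classify(point):
    """Inner style: assign the class of the nearest group, where each
    group is summarised by its mean (a nearest-centroid rule)."""
    best, best_d = None, float("inf")
    for label, pts in train.items():
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        d = math.hypot(point[0] - cx, point[1] - cy)
        if d < best_d:
            best, best_d = label, d
    return best

print(outer_classify((40, 40)), inner_classify((40, 40)))  # → c1 c1
```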
[Figure 5.1: Outer Classification - classes c1 and c2 plotted over attributes V1 and V2.]
[Figure 5.2: Inner Classification - classes c1 and c2 plotted over attributes V1 and V2.]
5.2.2 Univariate Vs. Multivariate Decision Trees
The splitting functions in decision trees are usually univariate, meaning that the split at each node is performed on the values of a single selected attribute, which leads to axis-aligned splits. However, some decision tree algorithms are multivariate. In a linear multivariate decision tree, each decision node is based on multiple attributes: the tree selects the best linear combination of attributes that divides the input space in two with an arbitrary hyperplane, leading to oblique splits. Examples of linear multivariate decision trees include CART [51] and OC1 (Oblique Classifier 1) [128].
In 1998, Zheng proposed an algorithm that dynamically constructs new binary attributes by performing a search over a path of a tree [204]. Zheng's algorithm constructs logical operations on conditions, such as conjunction, negation, and disjunction of conditions. In 2000, Zheng extended his algorithm to construct nominal and numerical attributes [205], proposing a constructive operation, X-of-N, to be used in decision tree learning algorithms that construct new attributes in the form of X-of-N representations. For a given instance, the value of an at-least-M-of-N representation is true if at least M of its conditions are true of the instance; otherwise it is false. Another constructive induction multivariate decision tree, called linear tree, was proposed by Gama and Brazdil [52] in 1999. Linear tree combines decision tree algorithms with linear discrimination to define decision surfaces both orthogonal and oblique to the axes defined by the attributes of the input space. Gama then introduced the "Functional trees" algorithm, a multivariate tree learning algorithm that combines univariate decision trees with a discriminant function by means of constructive induction [51].
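The at-least-M-of-N construction can be sketched as follows; the condition set and instance layout are hypothetical, and this only illustrates the constructed representation, not Zheng's search procedure:

```python
def at_least_m_of_n(conditions, m):
    """Return a constructed boolean attribute that is true for an
    instance when at least m of the given n conditions hold.
    `conditions` is a list of predicates over an instance."""
    def attribute(instance):
        return sum(1 for c in conditions if c(instance)) >= m
    return attribute

# Hypothetical base conditions over a dict-valued instance.
conds = [
    lambda x: x["age"] > 30,
    lambda x: x["income"] > 50_000,
    lambda x: x["tenure"] > 5,
]
new_attr = at_least_m_of_n(conds, 2)  # "at least 2 of 3"

print(new_attr({"age": 40, "income": 60_000, "tenure": 2}))  # → True
```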
A conceptual diagram demonstrating how a decision tree algorithm works by comparing a split on a variable and its impact on the class is presented in Figure 5.3(a) with an example of a traditional decision tree output in Figure 5.3(b).
5.2.3 C4.5
C4.5 is an extension of the basic ID3 algorithm that overcomes several of its limitations. C4.5 avoids overfitting the data by determining how deeply to grow a decision tree, and its advantages include the ability to handle missing values and numeric attributes. In a study that compared decision trees with other learning algorithms, Tjen-Sien et al. [102] found that C4.5 offers a very good combination of error rate and speed.
The selection criterion in C4.5 is based on either Information Gain or Gain Ratio. Both of these measures are based on another basic measure in information theory called Entropy. The Entropy measure is used to characterize the impurity of a collection of examples.
Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \qquad (5.1)
Given any set of samples S belonging to k classes C: C_i,
[Figure 5.3: Traditional Decision Tree - (a) framework; (b) example output: a split on attribute a1 at threshold x, with leaves C1 (a1 ≤ x) and C2 (a1 > x).]
where freq(C_i, S) is the number of instances in S that belong to C_i, and |S| is the total number of instances in S, entropy can be calculated as:
Info(S) = -\sum_{i=1}^{k} \frac{freq(C_i, S)}{|S|} \log_2 \frac{freq(C_i, S)}{|S|} \qquad (5.2)
For a training set T that can be divided into subsets {T_1, T_2, ..., T_n} based on a test x, the total information content after T is partitioned is:

Info_x(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} Info(T_i) \qquad (5.3)

The quantity
Gain(x) = Info(T) - Info_x(T) \qquad (5.4)

measures the information that is gained by partitioning T according to test x. The selection criterion is to choose the test x that maximizes Gain(x).
The information gain criterion has a deficiency: it has a strong bias in favour of tests with many outcomes. To solve this problem, Split-Info(x) is calculated as

Split\text{-}Info(x) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|} \qquad (5.5)

This represents the potential information generated by dividing the set T into n subsets T_i. The gain ratio measure expresses the proportion of information generated by splitting the samples using test x. Gain ratio is robust: it consistently gives a better choice of test than information gain. Algorithm 1 lists the pseudocode of the C4.5 decision tree.
Gain\text{-}Ratio(x) = \frac{Gain(x)}{Split\text{-}Info(x)} \qquad (5.6)
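Equations 5.2 to 5.6 can be computed directly; the following sketch uses illustrative function names:

```python
from math import log2
from collections import Counter

def info(labels):
    """Entropy of a collection of class labels (Eq. 5.2)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, subsets):
    """Gain ratio of partitioning `labels` into `subsets` (Eqs. 5.3-5.6)."""
    n = len(labels)
    info_x = sum(len(s) / n * info(s) for s in subsets)          # Eq. 5.3
    gain = info(labels) - info_x                                  # Eq. 5.4
    split_info = -sum(len(s) / n * log2(len(s) / n)               # Eq. 5.5
                      for s in subsets if s)
    return gain / split_info if split_info else 0.0               # Eq. 5.6

labels = ["c1"] * 4 + ["c2"] * 4
# A perfect binary split: all c1 in one subset, all c2 in the other.
print(gain_ratio(labels, [labels[:4], labels[4:]]))  # → 1.0
```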
Algorithm 1: Pseudocode of C4.5
1. Grow the initial tree (divide and conquer): Tree(S)
• Test for leaf node:
⋄ if all cases in S belong to the same class, then return a leaf node with this class.
⋄ if the attribute set is empty, then return a leaf node with the majority class.
• Otherwise, create a decision node:
⋄ Search for the best decision attribute based on gain ratio:
- For a nominal attribute a_i with values {v_1, v_2, v_3, ...}: one outcome for each value v_i.
- For a numerical attribute a_i: a threshold z splitting into a_i ≤ z and a_i > z, with z chosen to maximise the test criterion (gain or gain ratio).
⋄ Select this single attribute a_i with outcomes {o_1, o_2, o_3, ...}:
- Partition S into {S_1, S_2, S_3, ...} according to the outcomes.
- Apply recursively to the subsets: Tree(S_1), Tree(S_2), Tree(S_3), ...
2. Prune to avoid overfitting.
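As a minimal sketch of Algorithm 1 for numerical attributes (pruning, nominal attributes, and missing-value handling are omitted; the data layout and function names are assumptions):

```python
from math import log2
from collections import Counter

def entropy(rows):
    n = len(rows)
    return -sum((c / n) * log2(c / n)
                for c in Counter(r[-1] for r in rows).values())

def best_split(rows):
    """Search numeric thresholds on every attribute; return the
    (ratio, attribute, threshold) triple with the highest gain ratio."""
    base, n = entropy(rows), len(rows)
    best = (0.0, None, None)
    for a in range(len(rows[0]) - 1):
        for z in sorted({r[a] for r in rows}):
            left = [r for r in rows if r[a] <= z]
            right = [r for r in rows if r[a] > z]
            if not left or not right:
                continue
            gain = base - sum(len(s) / n * entropy(s) for s in (left, right))
            split = -sum(len(s) / n * log2(len(s) / n) for s in (left, right))
            ratio = gain / split if split else 0.0
            if ratio > best[0]:
                best = (ratio, a, z)
    return best

def grow(rows):
    classes = {r[-1] for r in rows}
    if len(classes) == 1:                      # leaf: one class left
        return classes.pop()
    ratio, a, z = best_split(rows)
    if a is None:                              # no useful split: majority leaf
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    return (a, z,
            grow([r for r in rows if r[a] <= z]),
            grow([r for r in rows if r[a] > z]))

def classify(tree, x):
    while isinstance(tree, tuple):
        a, z, left, right = tree
        tree = left if x[a] <= z else right
    return tree

data = [(1, 5, "c1"), (2, 6, "c1"), (8, 5, "c2"), (9, 6, "c2")]
tree = grow(data)
print(classify(tree, (1.5, 5.5)), classify(tree, (8.5, 5.5)))  # → c1 c2
```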
5.2.4 Relational Learning
Unlike decision trees, which are classified as propositional (variable-free, or feature-value) learning techniques, relational learning looks for relational patterns among the variables.
Inductive logic programming (ILP) uses first-order predicate logic rules to represent the detected relations among the variables. ILP-based classifiers target one class at a time, trying to maximize the coverage of that class [98]. In ILP systems, rule generation goes through two loops: an outer loop and an inner loop. The purpose of the outer loop is to construct a clause that explains some of the positive examples and add this clause to the hypothesis. The covered positive examples are then removed, and the process is repeated until all examples are represented through clauses, at which point the hypothesis is complete. The inner loop generates the individual clauses by searching the space of all possible clauses. It starts with no conditions (the most general clause), then adds conditions until the clause covers only positive examples. The generated clauses are used by the outer loop, which selects the one covering the maximum number of instances in order to maximize coverage [98]. ILP learners have the advantage of utilizing existing background knowledge, expressed through rules, to guide the induction of new hypotheses.
In 1995, Quinlan introduced the FOIL classification algorithm, which learns a set of first-order rules for the purpose of predicting the class label [146]. FOIL uses a general-to-specific search, adding a single literal/condition to the precondition list at each step, and uses FOIL Gain to select the best literal. One of the strengths of the FOIL algorithm is its ability to learn recursive rules.
A conceptual diagram demonstrating how an ILP algorithm works, by inducing a rule over different attributes to maximize the coverage of a single class, is presented in Figure 5.4(a), with an example of an ILP learner's output in Figure 5.4(b).
[Figure 5.4: ILP Classifier - (a) framework; (b) example output: Class(Example, C1) ← a1(Example, x), a2(Example, y).]
5.2.5 Classification vs Comparative Classification
For a general classification problem, a bag of instances that provides the description of the attributes and their domains can be denoted by B(A ∪ C), where A denotes the set of n input attributes A = {a1, ..., ai, ..., an}, and C represents
the set of k values of the class (target) attribute, C = {c1, ..., ci, ..., ck}.
A training set is defined as a collection of instances (also known as records, rows, or tuples) that may contain duplicates. Each instance is an individual, independent example of the concept to be learned. These instances are characterized by a vector of predetermined attribute values as:
\begin{pmatrix} v_1^1 & \cdots & v_n^1 \\ \vdots & & \vdots \\ v_1^m & \cdots & v_n^m \end{pmatrix} \qquad (5.7)

where v_i^j is the value of attribute i in instance j, n is the number of attributes, and m is the number of instances.
A traditional classification problem can then be defined as follows: given a training set S with input attribute set A = {a1, ..., ai, ..., an} and a nominal target attribute C, the goal is to find a model/classifier that maps previously unseen instances to the appropriate class c ∈ C of the target attribute as accurately as possible.
PWCCP is a special type of binary classification problem. The goal of PWCCP is to find a model that maps unseen instances to the appropriate predefined classes in one-to-one correspondence for each pair of instances. Instances are paired in both the training and test datasets, where each pair consists of exactly one instance per class. Thus, if one instance in a pair is misclassified, the other instance will be misclassified as well. Values of the same attribute for the paired instances are collected into a single vector as \vec{v}_i^j = [v_i^j(c_1), ..., v_i^j(c_k)], where j is the instance id, i is the attribute id, and |\vec{v}_i^j| = k.
Instances in PWCCP can be represented as:

\begin{pmatrix} \vec{v}_1^1 & \cdots & \vec{v}_n^1 \\ \vdots & & \vdots \\ \vec{v}_1^m & \cdots & \vec{v}_n^m \end{pmatrix} \qquad (5.8)
Based on this description, the definition of PWCCP becomes: given a training set S with input attribute set A = {a1, ..., ai, ..., an} and a nominal target attribute C, and a test set of instances V = {V^{m+1}, ..., V^{m+u}}, where u is the size of the test set, the goal is to find a bijective mapping from every element in each V^i to the corresponding class c ∈ C as accurately as possible.
To avoid confusion from switching between a vector representation for the training set and a set representation for the test set, we adopt the notation V_i(p_j) to represent the value of attribute i, where p_j is an index for the order of the measurement. For example, in a PWCCP with j = 2, V_1(p_1) denotes the value of the first attribute for the first instance of the pair. We use the notation p_1 → c_1, p_2 → c_2 to denote that the first instance in the pair is classified as class c_1, while the second instance is classified as class c_2. In a paired classification, the only other alternative is p_1 → c_2, p_2 → c_1.
We also use the following functions and notations to establish a relationship between the two instances in a pair:
1. (R(V_1(p_1), {V_1(p_1), V_1(p_2)}) = min) signifies that the value of V_1 in the first instance of the pair is the minimum of the two values of V_1 in that pair.

2. (R(V_1(p_1), {V_1(p_1), V_1(p_2)}) = max) signifies that the value of V_1 in the first instance of the pair is the maximum of the two values of V_1 in that pair.
3. (R(V_1(p_1), {V_1(p_1), V_1(p_2)}) = Eq) signifies that the values of V_1 in both instances of the pair are equal.
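The relation R can be sketched as a small function; the name and return values mirror the notation above:

```python
def R(value, pair):
    """Relation of one attribute value to the two values in its pair:
    'Eq' if both values are equal, otherwise 'min' or 'max'."""
    a, b = pair
    if a == b:
        return "Eq"
    return "min" if value == min(a, b) else "max"

# V1 measured on the two instances of a pair:
print(R(3.0, (3.0, 7.5)))  # → min  (the first instance holds the minimum)
```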
Figure 5.5(a) shows how PWC4.5 targets the relationship between paired instances within one variable at a time, and Figure 5.5(b) shows the output for that framework.
We can now define a synthetic problem to demonstrate the concept of PWCCP.
Assume V_1(p_1) is always the minimum of {V_1(p_1), V_1(p_2)} in a set of data, as shown in Figure 5.6. It is clear that V_1 is the key variable for discriminating between the two classes, while V_2 is not a useful feature. On the one hand, a traditional decision tree such as C4.5 could not identify the existing relationship among paired instances based on V_1: C4.5 classifies the instances on feature values, as shown in Figure 5.7, resulting in poor classification accuracy. On the other hand, PWC4.5 can detect the relationship and produces a decision tree that represents it, as in Figure 5.8. Example 2 extends the previous example to two dimensions. In Example 2, shown in Figure 5.9, there is a relationship based on both variables V_1 and V_2: p_1 is labelled C_1 if V_1(p_1) is the minimum of {V_1(p_1), V_1(p_2)} and V_2(p_1) is the minimum of {V_2(p_1), V_2(p_2)}, or if V_1(p_1) is the maximum of {V_1(p_1), V_1(p_2)} and V_2(p_1) is the maximum of {V_2(p_1), V_2(p_2)}; otherwise p_1 is labelled C_2. Again, C4.5 failed to identify these relationships and put all of the instances into one leaf, as shown in Figure 5.10, while Figure 5.11 shows the prospective decision tree that can represent these relationships.
5.3 PWC4.5 Decision Tree Algorithm
To uncover the hidden patterns or phenomena that may exist between each pair of instances, we need to consider the values of the attributes for the two instances at the same time rather than working with each of them individually. Numerical attributes may hold hidden information that can only be seen by comparing the values of each pair together.
[Figure 5.5: PWC4.5 - (a) framework; (b) example output: a decision node R(a1(p1), {a1(p1), a1(p2)}) whose min branch gives p1 → C2, p2 → C1 and whose max branch gives p1 → C1, p2 → C2.]
[Figure 5.6: Example 1 - pair-wise relationship based on variable V1; classes c1 and c2 plotted over V1 and V2.]
[Figure 5.7: C4.5 decision tree of Example 1 - a single split V1 ≤ 0.95 with leaves C1 (35.0/15.0) and C2 (5.0).]
[Figure 5.8: PWC4.5 decision tree of Example 1 - a decision node R(V1(p1), {V1(p1), V1(p2)}) whose min branch gives p1 → C2, p2 → C1 and whose max branch gives p1 → C1, p2 → C2.]
[Figure 5.9: Example 2 - pair-wise relationship based on variables V1 and V2; classes C1 and C2 plotted over V1 and V2.]
[Figure 5.10: C4.5 decision tree of Example 2 - a single leaf C1 (40.0/20.0).]
[Figure 5.11: PWC4.5 decision tree of Example 2 - root R(V1(p1), {V1(p1), V1(p2)}); each branch then tests R(V2(p1), {V2(p1), V2(p2)}); the (min, min) and (max, max) leaves give p1 → C1, p2 → C2, while the (min, max) and (max, min) leaves give p1 → C2, p2 → C1.]
For example, we may find that a_i(c_1) is usually the minimum of {a_i(c_1), a_i(c_2)} for any pair of instances, and that this relationship occurs regardless of the actual values of attribute a_i. The hidden pattern in this case is a_i(c_1) = min({a_i(c_1), a_i(c_2)}), or in our notation, R(a_i(p_1), {a_i(p_1), a_i(p_2)}) = min, p_1 → c_1, p_2 → c_2.
We need a special classifier that is able to capture the relationship between the values of the same attribute for the two parallel translations: one that sees the two instances as a single pair holding hidden information in their relation to each other. Because we need to capture the hidden pattern in the relationship between the attribute values of the paired instances, we need to evaluate this relationship. Thus, for each numerical attribute, rather than the traditional search for the best split point that C4.5 uses, a new search method is applied. In this method, we consider the new vector that holds the information of the two items forming a single pair. The values of the evaluated attribute for both items are compared to induce the relationship, which is used as a possible outcome of the relationship condition. The gain ratio is then calculated on these new outcomes, and the attribute-based relationship that yields the highest gain ratio is selected as the best split attribute.
PWC4.5 is based on the C4.5 decision tree algorithm. We chose C4.5 as a well-known decision tree algorithm that uses the information gain ratio to nominate preferred attributes when building the classifier; the advantages and limitations of C4.5 were discussed earlier in the background section. The pseudocode of the PWC4.5 algorithm is presented in Algorithm 2.
5.4 Experiment
5.4.1 Artificial Data
In order to evaluate the performance of PWC4.5 in solving the pair-wise comparative classification problem, we generated artificial datasets that contain pair-wise relationships. For that purpose, we generated two types of datasets: the first dataset
Algorithm 2: Pseudocode of the PWC4.5 Algorithm
1. Pairing: transform the instance vectors of S into PWCCP form.
• For each pair of instances p_1, p_2, where p_1 is represented by V_1(p_1), V_2(p_1), ..., V_n(p_1) and p_2 by V_1(p_2), V_2(p_2), ..., V_n(p_2), generate a new vector PS that represents the PWCCP:
PS(p_1, p_2) = \vec{V}_1, \vec{V}_2, ..., \vec{V}_n, where \vec{V}_1 = {V_1(p_1), V_1(p_2)}, \vec{V}_2 = {V_2(p_1), V_2(p_2)}, ..., and \vec{V}_n = {V_n(p_1), V_n(p_2)}.
• For class balancing between the two paired instances, the labels p_1 and p_2 are assigned randomly.
• If C(p_1) = c_1 and C(p_2) = c_2, then \vec{C}(p_1, p_2) = {p_1 → c_1, p_2 → c_2}; else \vec{C}(p_1, p_2) = {p_1 → c_2, p_2 → c_1}.
2. Grow the initial tree (divide and conquer): Tree(PS)
• Test for leaf node:
⋄ if \vec{C}(p_1, p_2) is the same for all cases in PS, then return a leaf node representing this particular \vec{C}(p_1, p_2).
⋄ if the attribute set is empty, then return a leaf node with the majority class.
• Otherwise, create a decision node:
⋄ Search for the decision attribute with the highest gain ratio based on the pair-wise relationship. For each numerical attribute V_i:
1. Create three possible nominal outcomes: "Min", "Eq", and "Max".
2. Create three empty groups of samples for these outcomes: S_Min = φ, S_Eq = φ, and S_Max = φ.
3. Induce the relationship by comparing V_i(p_1) to the values of \vec{V}_i. For each PS_j = (p_1, p_2) ∈ PS:
if min(\vec{V}_i^j) = max(\vec{V}_i^j) then S_Eq ← S_Eq ∪ {PS_j};
else if V_i(p_1) = min(\vec{V}_i^j) then S_Min ← S_Min ∪ {PS_j};
else S_Max ← S_Max ∪ {PS_j}.
4. Calculate the gain ratio based on splitting PS into {S_Min, S_Eq, S_Max}.
⋄ Select the single attribute associated with the best decision; partition PS into {PS_Min, PS_Eq, PS_Max} according to its outcomes {S_Min, S_Eq, S_Max}.
⋄ Apply recursively to the subsets: Tree(PS_Min), Tree(PS_Eq), and Tree(PS_Max).
3. Prune to avoid overfitting.
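Step 2 of Algorithm 2, the pair-wise split evaluation, can be sketched as follows; the pair encoding and label strings are illustrative assumptions:

```python
from math import log2
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def pairwise_gain_ratio(pairs, attr):
    """Score attribute `attr` for PWC4.5. Each pair is
    ((values_p1, values_p2), label), where the label records the
    class assignment of the pair (e.g. 'p1->c1,p2->c2')."""
    groups = {"Min": [], "Eq": [], "Max": []}
    for (p1, p2), label in pairs:
        if p1[attr] == p2[attr]:
            groups["Eq"].append(label)
        elif p1[attr] < p2[attr]:
            groups["Min"].append(label)
        else:
            groups["Max"].append(label)
    labels = [lab for g in groups.values() for lab in g]
    n = len(labels)
    info_x = sum(len(g) / n * info(g) for g in groups.values() if g)
    split = -sum(len(g) / n * log2(len(g) / n) for g in groups.values() if g)
    gain = info(labels) - info_x
    return gain / split if split else 0.0

# Pairs whose class follows the min/max relationship on attribute 0:
pairs = [(((1.0,), (4.0,)), "p1->c1,p2->c2"),
         (((9.0,), (2.0,)), "p1->c2,p2->c1"),
         (((3.0,), (8.0,)), "p1->c1,p2->c2"),
         (((7.0,), (5.0,)), "p1->c2,p2->c1")]
print(pairwise_gain_ratio(pairs, 0))  # → 1.0
```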
contains two attributes (2D), and the second contains five attributes (5D). In each case, we first generated 10 noise-free datasets for each problem, then generated from these noise-free datasets other datasets with different levels of noise.
5.4.1.1 Artificial Data Generation
The data were generated by applying an XOR function over the predicates (R(V_j(p_1), {V_j(p_1), V_j(p_2)}) = min) to decide the label of p_1: p_1 is labelled + if the number of minimum relations is even; otherwise it is labelled −, with p_2 taking the opposite label. For the 2D case, 200 pairs were generated for training and 100 for testing. For the 5D case, we increased these numbers to 500 training pairs and 200 test pairs to cover the larger space.
We experimented with seven levels of noise in addition to the noise-free baseline case; the test sets were always noise-free. For each of the eight cases (seven noisy and one noise-free) in each of the two problems (2D and 5D), 10 datasets were independently sampled using the above XOR relationship. The results were then collected to perform a t-test analysis of the performance of C4.5 and PWC4.5 at each noise level. A one-tailed paired t-test with a confidence level of 95% (alpha = 0.05) was carried out. The proposed hypothesis is that the average accuracy of PWC4.5 is better than that of traditional C4.5, which can be expressed as:

H_0: μ(PWC4.5) ≤ μ(C4.5)
H_1: μ(PWC4.5) > μ(C4.5)
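The paired data generation of Section 5.4.1.1 can be sketched as follows, assuming attribute values drawn uniformly from [0, 10] (the range, seed, and function name are illustrative):

```python
import random

def make_pairs(n_pairs, dims, seed=0):
    """Generate paired instances labelled by the XOR (parity) rule:
    p1 is '+' when the number of attributes on which p1 holds the
    pair minimum is even, otherwise '-' (p2 takes the opposite label)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        p1 = [rng.uniform(0, 10) for _ in range(dims)]
        p2 = [rng.uniform(0, 10) for _ in range(dims)]
        n_min = sum(1 for a, b in zip(p1, p2) if a < b)
        label = "+" if n_min % 2 == 0 else "-"
        pairs.append((p1, p2, label))
    return pairs

train = make_pairs(200, 2)  # 200 training pairs for the 2D case
print(len(train), train[0][2] in {"+", "-"})  # → 200 True
```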
5.4.1.2 Results
For the 2D experiment, PWC4.5 achieved an average accuracy of 100% on noise-free data. This accuracy was not affected by exposing the training data to noise levels of 1%, 2.5%, 5%, 10%, and 15%. At a noise level of 20%, the average accuracy dropped to 98.20%; the minimum average accuracy, 82.90%, was reached at the 25% noise level. Meanwhile, the average accuracy of C4.5 ranged from 50.55% to 55.60% on the same samples. Details of the results are summarised in Table 5.1. For each noise level, a t-test was conducted to compare the accuracy of C4.5 to that of PWC4.5 as described in the experimental design. For all eight noise levels, H_0 is rejected and the alternative hypothesis accepted, concluding that μ(PWC4.5) > μ(C4.5).
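The one-tailed paired t-statistic behind these results can be reproduced in a few lines; the example below uses the ten per-run accuracies recorded at the 25% noise level of the 2D experiment (the table reports the statistic as -5.53 because it subtracts PWC4.5 from C4.5):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """One-tailed paired t-statistic for H1: mean(a) > mean(b)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Accuracies (%) of the 10 runs at the 25% noise level, 2D experiment.
pwc45 = [88.0, 67.0, 88.0, 69.0, 100.0, 100.0, 63.0, 86.0, 100.0, 68.0]
c45 = [58.0, 52.0, 54.5, 60.5, 52.0, 52.0, 52.5, 58.5, 53.0, 56.0]

print(round(paired_t(pwc45, c45), 2))  # → 5.53
```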
Similarly, the same experiment was applied to the 5D dataset, with findings similar to the 2D results: PWC4.5 always outperformed C4.5. The average accuracy of PWC4.5 was 100% on noise-free data and at noise levels of 1%, 2.5%, and 5%. At noise levels of 10%, 15%, and 20%, the average accuracy achieved was 99.20%, 94.25%, and 82.80% respectively, and the lowest average accuracy, 63.65%, occurred at the 25% noise level. As for C4.5, the average accuracy ranged from 48.53% to 51.35%. Again, the null hypothesis was rejected at all noise levels.
Figure 5.12 compares the accuracy obtained by PWC4.5 and C4.5 under different levels of noise. PWC4.5 demonstrated stability at noise levels up to 20%, while C4.5 stayed at around 50%. On the 5D dataset, the accuracy of PWC4.5 was more sensitive to noise from the 15% noise level onward, as shown in Figure 5.13.
To visualize the pairwise relationship in the 2D artificial dataset, we used the first sample (Exp1) of the 10 samples at each noise level and show the two decision trees obtained by C4.5 and PWC4.5. These visualizations are collected in Figures 5.14 to 5.29. Additionally, the true positive, true negative, false positive, and false negative rates for each of these samples are summarised in Tables 5.3 and 5.4.
Noise level | C4.5 Avg | C4.5 STD | PWC4.5 Avg | PWC4.5 STD | t-Stat | P(T ≤ t) one-tail
0%   | 51.35% | 2.07% | 100.00% | 0.00%  | -74.36 | 3.64E-14
1%   | 52.30% | 3.99% | 100.00% | 0.00%  | -37.76 | 1.59E-11
2.5% | 50.55% | 2.01% | 100.00% | 0.00%  | -77.94 | 2.38E-14
5%   | 51.70% | 3.39% | 100.00% | 0.00%  | -45.02 | 3.29E-12
10%  | 53.40% | 3.05% | 100.00% | 0.00%  | -48.26 | 1.76E-12
15%  | 55.60% | 3.55% | 100.00% | 0.00%  | -39.55 | 1.05E-11
20%  | 54.20% | 2.12% | 98.20%  | 5.69%  | -21.44 | 2.46E-09
25%  | 54.90% | 3.16% | 82.90%  | 14.92% | -5.53  | 1.83E-04
At every noise level P < 0.05, so H0 is rejected and PWC4.5 is the better classifier.
Table 5.1: Summary of the Results of One-Tail Paired T-Tests between the Accuracy of C4.5 and the Accuracy of PWC4.5 on 2D Artificial Data, where alpha = 0.05 and degrees of freedom = 9.
Noise level | C4.5 Avg | C4.5 STD | PWC4.5 Avg | PWC4.5 STD | t-Stat  | P(T ≤ t) one-tail
0%   | 51.35% | 2.04% | 100.00% | 0.00%  | -75.47  | 3.19E-14
1%   | 50.50% | 1.07% | 100.00% | 0.00%  | -145.79 | 8.54E-17
2.5% | 50.13% | 0.40% | 100.00% | 0.00%  | -399.00 | 9.93E-21
5%   | 51.20% | 1.86% | 100.00% | 0.00%  | -83.18  | 1.33E-14
10%  | 50.23% | 0.90% | 99.20%  | 1.23%  | -97.29  | 3.25E-15
15%  | 50.30% | 2.54% | 94.25%  | 2.74%  | -60.35  | 2.37E-13
20%  | 50.10% | 2.81% | 82.80%  | 10.89% | -9.69   | 2.32E-06
25%  | 48.53% | 3.65% | 63.65%  | 17.78% | -2.95   | 8.07E-03
At every noise level P < 0.05, so H0 is rejected and PWC4.5 is the better classifier.
Table 5.2: Summary of the Results of One-Tail Paired T-Tests between the Accuracy of C4.5 and the Accuracy of PWC4.5 on 5D Artificial Data, where alpha = 0.05 and degrees of freedom = 9.
[Figure 5.12: Average Accuracy for C4.5 and PWC4.5 for the 2D Dataset - average accuracy plotted against noise levels from 0% to 25%.]
[Figure 5.13: Average Accuracy for C4.5 and PWC4.5 for the 5D Dataset - average accuracy plotted against noise levels from 0% to 25%.]
[Figure 5.14: Pair-Wise Relationship of Noise-Free 2D (Exp1) - classes + and − plotted over V1 and V2.]
[Figure 5.15: Decision Trees for Noise-Free 2D (Exp1) - (a) the PWC4.5 tree splits on R(V1(p1), {V1(p1), V1(p2)}) and then on R(V2(p1), {V2(p1), V2(p2)}), with leaf counts 63, 37, 39, and 61; (b) the C4.5 tree is a single leaf (400.0/200.0).]
[Figure: scatter plot of V2 against V1 (both on a 0–10 scale) showing the + and − instances, with noisy instances marked separately.]

Figure 5.16: Pair-Wise Relationship of 2D (Exp1) with Noise Level of 1%
(a) PWC4.5 decision tree: the root splits on R(V1(p1), {V1(p1),V1(p2)}); each branch then splits on R(V2(p1), {V2(p1),V2(p2)}), giving four leaves of 74, 32/1, 31, and 63 pairs that assign p1 and p2 to + and − according to the min/max pattern.
(b) C4.5 decision tree: a single leaf (400.0/200.0).

Figure 5.17: Decision Tree for 2D (Exp1) with Noise Level of 1%
[Figure: scatter plot of V2 against V1 (both on a 0–10 scale) showing the + and − instances, with noisy instances marked separately.]

Figure 5.18: Pair-Wise Relationship of 2D (Exp1) with Noise Level of 2.5%
(a) PWC4.5 decision tree: the root splits on R(V2(p1), {V2(p1),V2(p2)}); each branch then splits on R(V1(p1), {V1(p1),V1(p2)}), giving four leaves of 66, 32/2, 28/2, and 74 pairs that assign p1 and p2 to + and − according to the min/max pattern.
(b) C4.5 decision tree: a single leaf (400.0/200.0).

Figure 5.19: Decision Tree for 2D (Exp1) with Noise Level of 2.5%
[Figure: scatter plot of V2 against V1 (both on a 0–10 scale) showing the + and − instances, with noisy instances marked separately.]

Figure 5.20: Pair-Wise Relationship of 2D (Exp1) with Noise Level of 5%
(a) PWC4.5 decision tree: the root splits on R(V2(p1), {V2(p1),V2(p2)}); each branch then splits on R(V1(p1), {V1(p1),V1(p2)}), giving four leaves of 66, 37/3, 24, and 73 pairs that assign p1 and p2 to + and − according to the min/max pattern.
(b) C4.5 decision tree: a single leaf (400.0/200.0).

Figure 5.21: Decision Tree for 2D (Exp1) with Noise Level of 5%
[Figure: scatter plot of V2 against V1 (both on a 0–10 scale) showing the + and − instances, with noisy instances marked separately.]

Figure 5.22: Pair-Wise Relationship of 2D (Exp1) with Noise Level of 10%
(a) PWC4.5 decision tree: the root splits on R(V1(p1), {V1(p1),V1(p2)}); each branch then splits on R(V2(p1), {V2(p1),V2(p2)}), giving four leaves of 69, 33/6, 32/3, and 66 pairs that assign p1 and p2 to + and − according to the min/max pattern.
(b) C4.5 decision tree: a single leaf (400.0/200.0).

Figure 5.23: Decision Tree for 2D (Exp1) with Noise Level of 10%
[Figure: scatter plot of V2 against V1 (both on a 0–10 scale) showing the + and − instances, with noisy instances marked separately.]

Figure 5.24: Pair-Wise Relationship of 2D (Exp1) with Noise Level of 15%
(a) PWC4.5 decision tree: the root splits on R(V2(p1), {V2(p1),V2(p2)}); each branch then splits on R(V1(p1), {V1(p1),V1(p2)}), giving four leaves of 74, 32/9, 32/8, and 62 pairs that assign p1 and p2 to + and − according to the min/max pattern.
(b) C4.5 decision tree: a single split on V2 at 9.91, with a leaf +(368/174) for V2 ≤ 9.91 and a leaf (32/6) otherwise.

Figure 5.25: Decision Tree for 2D (Exp1) with Noise Level of 15%
[Figure: scatter plot of V2 against V1 (both on a 0–10 scale) showing the + and − instances, with noisy instances marked separately.]

Figure 5.26: Pair-Wise Relationship of 2D (Exp1) with Noise Level of 20%
(a) PWC4.5 decision tree: the root splits on R(V2(p1), {V2(p1),V2(p2)}); each branch then splits on R(V1(p1), {V1(p1),V1(p2)}), giving four leaves of 72, 27/9, 44/18, and 57 pairs that assign p1 and p2 to + and − according to the min/max pattern.
(b) C4.5 decision tree: a single split on V2 at 9.96, with a leaf +(369/174) for V2 ≤ 9.96 and a leaf (31/5) otherwise.

Figure 5.27: Decision Tree for 2D (Exp1) with Noise Level of 20%
[Figure: scatter plot of V2 against V1 (both on a 0–10 scale) showing the + and − instances, with noisy instances marked separately.]

Figure 5.28: Pair-Wise Relationship of 2D (Exp1) with Noise Level of 25%
(a) PWC4.5 decision tree: the root splits on R(V1(p1), {V1(p1),V1(p2)}); one branch is a leaf of 96/14 pairs (p1 → +, p2 → −), and the other splits on R(V2(p1), {V2(p1),V2(p2)}) into leaves of 62 pairs (p1 → +, p2 → −) and 42/14 pairs (p1 → −, p2 → +).
(b) C4.5 decision tree: a deeper tree with thresholds on V1 (1.76 and 9.84) and V2 (0.59), and leaves (78/24), +(279/114), +(6), and (37/5).

Figure 5.29: Decision Tree for 2D (Exp1) with Noise Level of 25%
Table 5.3: Accuracy by Class of Exp1 (2D) for All Noise Levels (0%, 1%, 2.5%, 5%, 10%, 15%, 20%, 25%). [The per-class TP/TN/FP/FN rates on training and testing data for C4.5 and PWC4.5 could not be recovered from the extraction; on testing data, C4.5 assigns nearly all instances to a single class at every noise level, while PWC4.5 classifies both classes correctly.]
Table 5.4: Accuracy by Class of Exp1 (5D) for All Noise Levels (0%, 1%, 2.5%, 5%, 10%, 15%, 20%, 25%). [The per-class TP/TN/FP/FN rates on training and testing data for C4.5 and PWC4.5 could not be recovered from the extraction.]
5.4.2 Translator Stylometry Identification Problem
5.4.2.1 Using a T-Test to Compare PWC4.5 to C4.5
For this experiment, 212 attributes are used (13 attributes for all possible motifs of size 3, in addition to 199 attributes for all possible motifs of size 4). The dataset contains 74 parallel translations for seven translators, giving 21 possible combinations of translator pairs; we used all 21 combinations. For each pair of translators, we have (74×2) instances, i.e., 74 parallel translation pairs. To evaluate the effectiveness of the algorithm without being affected by how these 74 pairs are split, we decided to generate 10 different sets from these 74 sample pairs. The planned splitting ratio is (2/3) training and (1/3) testing. To generate these 10 different splits, we loop over the nominated pair of translators 10 times:
• For each pair of parallel translations, a random number is generated between 0 and 1.
• If this number falls in the interval [0, 0.667], the pair goes to the training dataset; otherwise, it goes to the testing dataset.
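The splitting procedure above can be sketched as follows; the function name and the fixed seed are illustrative, not taken from the thesis:

```python
import random

def split_pairs(pairs, train_frac=2/3, seed=0):
    """Randomly assign each parallel-translation pair to the training or
    testing set, mirroring the per-pair coin flip described above."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    train, test = [], []
    for pair in pairs:
        # draw a number in [0, 1): at or below the 2/3 cutoff -> training
        (train if rng.random() <= train_frac else test).append(pair)
    return train, test

# e.g. generating one of the ten splits of the 74 parallel-translation pairs
train, test = split_pairs(list(range(74)))
```

Repeating this with ten different seeds yields the ten datasets used in the experiment.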
These 10 generated datasets are evaluated by C4.5 and then by PWC4.5 to compare the accuracy in each case. A one-tail paired t-test with a confidence level of 95% (alpha = 0.05) was carried out to evaluate whether the accuracy of PWC4.5 is significantly better than the accuracy of C4.5. The question that we address in this experiment is “Can classification accuracy for paired data be enhanced by extracting information that represents the relationship between paired instances?”. Thus, the proposed hypothesis is “The average accuracy of PWC4.5 is better than that of traditional C4.5”. This can be expressed as:
H0: Accuracy(PWC4.5) ≤ Accuracy(C4.5)
H1: Accuracy(PWC4.5) > Accuracy(C4.5)
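A minimal sketch of this paired one-tail t-test, using only the standard library; the accuracy values below are illustrative, not the thesis's:

```python
import math

T_CRITICAL = 1.833  # one-tail critical value for alpha = 0.05, df = 9

def paired_one_tail_t(c45_acc, pwc45_acc):
    """t statistic for the paired differences PWC4.5 - C4.5.

    H0 is rejected in favour of H1 when the statistic exceeds T_CRITICAL.
    """
    diffs = [p - c for c, p in zip(c45_acc, pwc45_acc)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Illustrative accuracies over ten generated datasets:
c45 = [0.54, 0.60, 0.52, 0.62, 0.59, 0.55, 0.52, 0.53, 0.55, 0.60]
pwc45 = [0.96, 1.00, 1.00, 1.00, 1.00, 1.00, 0.92, 1.00, 1.00, 0.95]
t_stat = paired_one_tail_t(c45, pwc45)  # well above 1.833 for these values
```

The negative t statistics reported in the tables correspond to taking the differences in the opposite order (C4.5 − PWC4.5); the decision is the same.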
As described in the PWC4.5 pseudocode, the labels p1 and p2 are assigned to the two paired instances randomly, giving equal opportunity to each instance.
Therefore, this randomness may cause some variation in the generated decision tree. To validate this experiment, we run each experiment ten times and take the mean of the accuracies to represent that single experiment’s accuracy.
5.4.2.2 Results and analysis
Table 5.5 shows the accuracy obtained by evaluating the 10 generated datasets using C4.5 and PWC4.5 for the translator pair Asad–Daryabadi. While the average accuracy of C4.5 is 56.24% ± 0.15%, PWC4.5 produced an average accuracy of 98.31% ± 0.08%. After that, a paired one-tail t-test was conducted on the average accuracies of the two algorithms over the 10 generated datasets. Results of this t-test are shown in Table 5.7. For this t-test, tCritical(9) = 1.833 and alpha = 0.05. Since P(T ≤ t) = 7.59E-11 and P < 0.05, the test revealed a statistically significant difference between PWC4.5 and C4.5.
Although the variations between the ten runs of the experiment are not visible for the case of Asad–Daryabadi, another example, for the translator pair Asad–Raza, shows these variations in Table 5.6. For example, the accuracy of Exp1 for Asad–Raza varied from 70.37% to 92.59%, with an average of 82.96% ± 6.58%. Therefore, we used the average to represent the accuracy of PWC4.5. Then, we applied a t-test to compare C4.5 and PWC4.5: P(T ≤ t) = 2.90E-7 and P < 0.05, as illustrated in Table 5.8. Therefore, the null hypothesis was rejected, and the alternative hypothesis H1: Accuracy(PWC4.5) > Accuracy(C4.5) was accepted.
We followed the same steps for each pair of translators. For all of the 21 pairs of translators, the null hypothesis was rejected, and the alternative hypothesis that Accuracy(PWC4.5) > Accuracy(C4.5) was always accepted. Results of the t-tests are summarized in Table 5.9. These results demonstrate that PWC4.5 always outperformed C4.5 in this experiment.
The overall average of the accuracy of PWC4.5 is 78.81% in comparison to
The ten PWC4.5 runs of each experiment produced identical accuracies (STD = 0.00% for every experiment):

Exp:      1       2        3        4        5        6        7       8        9        10      Average
C4.5:     54.17%  60.00%   51.92%   62.50%   58.93%   55.36%   51.92%  52.63%   55.00%   60.00%  56.24% ± 0.15%
PWC4.5:   95.83%  100.00%  100.00%  100.00%  100.00%  100.00%  92.31%  100.00%  100.00%  95.00%  98.31% ± 0.08%

Table 5.5: Classification Accuracy of C4.5 and PWC4.5 for Translators Asad–Daryabadi
Exp:             1       2       3       4       5       6       7       8       9       10      Average
C4.5:            51.85%  57.41%  51.79%  57.89%  56.67%  52.27%  47.06%  53.57%  71.74%  54.55%  55.48% ± 0.43%
PWC4.5 (mean):   82.96%  75.93%  82.50%  77.89%  81.33%  82.27%  80.59%  89.64%  87.39%  82.27%  82.28% ± 0.16%
PWC4.5 (STD):    6.58%   6.11%   1.13%   6.93%   4.77%   8.69%   5.58%   2.64%   5.59%   1.44%

(PWC4.5 means and STDs are taken over the ten runs of each experiment; the individual run accuracies are not reproduced here.)

Table 5.6: Classification Accuracy of C4.5 and PWC4.5 for Translators Asad–Raza
Table 5.7: T-Test: Paired Two-Sample for Means of C4.5 and PWC4.5 for the Classification Problem of Identifying Translators Asad–Daryabadi
                                C4.5       PWC4.5
Mean                            56.24%     98.31%
Variance                        0.15%      0.08%
Observations                    10         10
Pearson Correlation             0.235
Hypothesized Mean Difference    0
df                              9
t Stat                          31.701
P(T ≤ t) one-tail               7.59E-11
t Critical one-tail             1.833
P(T ≤ t) two-tail               1.52E-10
t Critical two-tail             2.262
Table 5.8: T-Test: Paired Two-Sample for Means of C4.5 and PWC4.5 for the Classification Problem of Identifying Translators Asad–Raza
                                C4.5       PWC4.5
Mean                            55.48%     82.28%
Variance                        0.43%      0.16%
Observations                    10         10
Pearson Correlation             0.237
Hypothesized Mean Difference    0
df                              9
t Stat                          12.407
P(T ≤ t) one-tail               2.90E-7
t Critical one-tail             1.833
P(T ≤ t) two-tail               5.79E-7
t Critical two-tail             2.262
average accuracy of 52.12% for C4.5. Four cases out of 21 achieved accuracy above 95% with PWC4.5: the pairs Asad–Daryabadi, Asad–Maududi, Asad–Pickthall, and Asad–Sarwar. This may tell us that the translator Asad has a more distinctive writing style than the other translators, although C4.5 failed to capture this style in these four cases, where its accuracy was 56.24%, 52.99%, 57.42%, and 54.92% respectively. If we consider any classifier that correctly classifies more than 66.67% of the testing data as being able to capture differences between classes, then we can conclude that PWC4.5 is able to distinguish between 15 pairs of translators out of the 21 cases, while C4.5 failed to distinguish between any pair of the 21 studied cases. A number of examples of decision trees generated by PWC4.5 are included in Figures 5.30 to 5.37.
The two experiments that we conducted for PWCCP show that for some types of data, traditional classification algorithms lack the ability to see the hidden relationship between the paired instances. There is a need for an algorithm that can capture such a relationship. The findings of these two experiments demonstrate the ability of PWC4.5 to discover this relationship and to use it to identify the patterns in the dataset that distinguish between different classes. The promising results from this experiment should encourage researchers to investigate this type of classification problem further. Such problems may arise, for example, in identifying diseases or drug reactions in the medical domain, where repeated measurements are usually taken for the same subject over time. While a traditional method would only be able to identify a single measurement that passes a threshold, PWC4.5 is able to use the increase or decrease in this measurement as an indicator.
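The repeated-measurement scenario above can be illustrated with a hypothetical direction-of-change feature; the function and the patient values are ours, not from the thesis:

```python
def change_indicator(baseline, followup):
    """Relational feature for a paired (repeated) measurement: the
    direction of change, which a threshold test on either single
    value cannot express."""
    if followup > baseline:
        return "increase"
    if followup < baseline:
        return "decrease"
    return "no-change"

# Two hypothetical patients with the same follow-up reading of 6.0:
# a single-measurement threshold treats them identically, while the
# paired view separates a rising trend from a falling one.
patient_a = change_indicator(4.0, 6.0)  # "increase"
patient_b = change_indicator(8.0, 6.0)  # "decrease"
```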
All 21 t-tests gave P < 0.05, so the null hypothesis H0 was always rejected:

Translator Pair          C4.5 Mean   PWC4.5 Mean   Hypothesis
Asad–Daryabadi           56.24%      98.31%        H0 rejected
Asad–Maududi             52.99%      98.44%        H0 rejected
Asad–Pickthall           57.42%      95.46%        H0 rejected
Asad–Raza                55.48%      82.28%        H0 rejected
Asad–Sarwar              54.92%      95.04%        H0 rejected
Asad–Yousif Ali          55.63%      93.09%        H0 rejected
Daryabadi–Maududi        50.00%      85.07%        H0 rejected
Daryabadi–Pickthall      50.00%      59.16%        H0 rejected
Daryabadi–Raza           52.94%      80.66%        H0 rejected
Daryabadi–Sarwar         50.00%      76.82%        H0 rejected
Daryabadi–Yousif Ali     51.67%      91.09%        H0 rejected
Maududi–Pickthall        50.00%      79.45%        H0 rejected
Maududi–Raza             50.79%      55.73%        H0 rejected
Maududi–Sarwar           49.81%      63.77%        H0 rejected
Maududi–Yousif Ali       50.00%      56.73%        H0 rejected
Pickthall–Raza           52.46%      81.96%        H0 rejected
Pickthall–Sarwar         49.78%      67.99%        H0 rejected
Pickthall–Yousif Ali     51.34%      92.44%        H0 rejected
Raza–Sarwar              50.66%      60.89%        H0 rejected
Raza–Yousif Ali          49.64%      65.97%        H0 rejected
Sarwar–Yousif Ali        52.72%      74.71%        H0 rejected
Average                  52.12%      78.81%

(The t statistics ranged from -2.80 to -59.29, with P(T ≤ t) one-tail between 2.78E-13 and 1.04E-02; the per-pair variances and t statistics could not be reliably realigned from the extraction and are omitted.)

Table 5.9: Summary of the Results of One-Tail Paired T-Tests between the Accuracy of C4.5 and the Accuracy of PWC4.5, where (Alpha=0.05) and (Degree of Freedom=9)
[Tree: the root tests R(Mid5(p1), {Mid5(p1),Mid5(p2)}) for the size-3 motif Mid5; the min branch is a leaf of 23 pairs (p1 → Daryabadi, p2 → Asad) and the max branch a leaf of 27 pairs (p1 → Asad, p2 → Daryabadi).]

Figure 5.30: PWC4.5 Decision Tree for 1st run (Exp1) Asad–Daryabadi
5.5 Chapter summary
This chapter studied classification problems over paired data while considering the relationships between the paired instances. Traditional classification methods ignore such relationships and learn from the instances individually; thus, information that could be mined from these relationships is lost. In this chapter, we proposed a definition of the Pair-Wise Comparative Classification Problem. We also proposed a new model, PWC4.5, that can address this problem. We conducted two experiments to evaluate PWC4.5 in comparison to traditional C4.5: one using artificial data and one using a real-life problem. Our results demonstrated that considering these hidden relationships can aid the classification algorithm toward better classification of pair-wise comparative datasets. These results encourage further investigation in the area of PWCCP. PWC4.5 can be applied further to different types of paired data, such as those occurring in forensic science and the medical domain.
[Tree: the root tests R(Mid5(p1), {Mid5(p1),Mid5(p2)}) for the size-3 motif Mid5; the min branch is a leaf of 27/1 pairs (p1 → Maududi, p2 → Asad) and the max branch a leaf of 29 pairs (p1 → Asad, p2 → Maududi).]

Figure 5.31: PWC4.5 Decision Tree for 1st run (Exp1) Asad–Maududi
[Tree: the root tests R(Mid2(p1), {Mid2(p1),Mid2(p2)}) for the size-3 motif Mid2; the min branch is a leaf of 22 pairs (p1 → Pickthall, p2 → Asad) and the max branch a leaf of 25 pairs (p1 → Asad, p2 → Pickthall).]

Figure 5.32: PWC4.5 Decision Tree for 1st run (Exp1) Asad–Pickthall
[Tree: the root tests R(Mid1(p1), {Mid1(p1),Mid1(p2)}) for the size-3 motif Mid1; one branch is a leaf of 23/1 pairs (p1 → Asad, p2 → Raza), and the other splits on R(Mid93(p1), {Mid93(p1),Mid93(p2)}) for the size-4 motif Mid93 into min/eq/max leaves of 21 pairs (p1 → Raza, p2 → Asad), 3 pairs (p1 → Asad, p2 → Raza), and 0 pairs.]

Figure 5.33: PWC4.5 Decision Tree for 1st run (Exp1) Asad–Raza
[Tree: the root tests R(Mid5(p1), {Mid5(p1),Mid5(p2)}) for the size-3 motif Mid5; the min branch is a leaf of 22/1 pairs (p1 → Sarwar, p2 → Asad) and the max branch a leaf of 22/1 pairs (p1 → Asad, p2 → Sarwar).]

Figure 5.34: PWC4.5 Decision Tree for 1st run (Exp1) Asad–Sarwar
[Tree: the root tests R(Mid26(p1), {Mid26(p1),Mid26(p2)}) for the size-4 motif Mid26; one branch is a leaf of 22 pairs (p1 → Asad, p2 → YousifAli), and the other splits on R(Mid42(p1), {Mid42(p1),Mid42(p2)}) for the size-4 motif Mid42 into leaves of 21 pairs (p1 → YousifAli, p2 → Asad) and 3/1 pairs (p1 → Asad, p2 → YousifAli).]

Figure 5.35: PWC4.5 Decision Tree for 1st run (Exp1) Asad–YousifAli
[Tree: the root tests R(Mid14(p1), {Mid14(p1),Mid14(p2)}) for the size-4 motif Mid14 with min/eq/max branches; the subtrees split further on the size-3 motifs Mid2 and Mid8 and the size-4 motif Mid53, with leaves of 19, 4, 2, 3, 2, and 3 pairs assigning p1 and p2 to Daryabadi and Maududi.]

Figure 5.36: PWC4.5 Decision Tree for 1st run (Exp1) Daryabadi–Maududi
[Tree: the root tests R(Mid136(p1), {Mid136(p1),Mid136(p2)}) for the size-4 motif Mid136 with min/eq/max branches; the min branch is a leaf of 20 pairs (p1 → Daryabadi, p2 → Raza), the max branch a leaf of 21 pairs (p1 → Raza, p2 → Daryabadi), and the eq branch splits on R(Mid8(p1), {Mid8(p1),Mid8(p2)}) into leaves of 8 pairs (p1 → Daryabadi, p2 → Raza) and 2 pairs (p1 → Raza, p2 → Daryabadi).]

Figure 5.37: PWC4.5 Decision Tree for 1st run (Exp1) Raza–Daryabadi
[Tree: the root tests R(Mid10(p1), {Mid10(p1),Mid10(p2)}) for the size-3 motif Mid10; one branch is a leaf of 20/2 pairs (p1 → Daryabadi, p2 → Sarwar), and the other splits on R(Mid16(p1), {Mid16(p1),Mid16(p2)}) for the size-4 motif Mid16 into leaves of 4/1 pairs (p1 → Daryabadi, p2 → Sarwar) and 19/1 pairs (p1 → Sarwar, p2 → Daryabadi).]

Figure 5.38: PWC4.5 Decision Tree for 1st run (Exp1) Daryabadi–Sarwar
Chapter 6
Conclusions and Future Research
6.1 Summary of Results
This thesis presented a study of the problem of translator stylometry identification using a computational linguistics framework. The problem of translator stylometry identification is understudied in the machine learning literature in comparison with other types of stylometry analysis problems such as authorship attribution, writer verification, and plagiarism detection. Research in linguistics has identified features of translator stylometry, but in the computational linguistics field such research is non-existent.
In this study, we first evaluated the performance of existing stylometry identification methods to identify translators based on their translations. These methods could not differentiate between different translators. Therefore, there was a need to identify other stylometric features of potential usefulness for this specific problem. By studying and analysing the problem of translator stylometry analysis, we found a similarity between the process of identifying writers’ interesting patterns and network motif mining in social network analysis. We studied the possibility of using network motif search for the problem of translator stylometry identification.
Another interesting pattern that we discovered while analysing our dataset is the existence of a relationship that can only be seen if we compare paired parallel translations rather than examining each translation individually.
This interesting pattern cannot be detected using traditional classifiers. Therefore, we studied this type of relationship carefully and proposed a new classifier, PWC4.5, based on the C4.5 decision tree algorithm. After that, we evaluated this new classifier using different datasets.
In Section 1.3 of the thesis, we specified the main research question that this thesis aims to answer: “Can a computational linguistics framework meet the challenges of translators’ stylometry identification problem?”. Furthermore, we divided this question into sub-questions that we identified in order to address the main question of the thesis.
The first sub-question was “Which of the stylistic features to use to unfold a translator’s stylometry?”. To answer this question, we evaluated the existing methods for the problem of translator stylometry identification in Chapter 3 of this thesis, and we found that they failed to discriminate translators based on their stylometry. Then, in Chapter 4, we evaluated network motifs as stylometric features for the same problem, and they produced promising results on a subset of the studied dataset, with accuracy up to 70% across different classifiers. Network formation and the extraction of features such as network motifs and global network features were also detailed in Chapter 4.
The second sub-question investigated in this thesis was “What is the performance of the network motifs approach compared to other approaches?”. This question was addressed in Chapter 4 by conducting a comparison between network motifs and other features on the complete dataset. The task was to evaluate the performance of each feature set in discriminating 21 pairs of translators using 74 parallel translations for each pair of translators. None of the evaluated features produced acceptable results, including network motifs. We investigated the reason for that drop in the accuracy of network motifs, and we found that the variation in text size of the originals and translations affected the network size and the network motif frequencies. Therefore,
we replaced the network motif frequencies by their ranks in order to overcome the text size variation problem. That method enhanced the accuracy significantly for network motifs as translator stylometric features.
The third sub-question resulted from the relation that we identified while addressing the last question, when we found a hidden pattern that can be uncovered by comparing the frequencies of network motifs for each pair of translators. We identified that this relation holds regardless of the change in the text size. Therefore, we defined a new problem that we call the Pair-Wise Comparative Classification Problem (PWCCP). In Chapter 5, we defined the PWCCP and answered the sub-question “What is an appropriate classification algorithm that can handle the Pair-Wise Comparative Classification Problem (PWCCP)?”. The design of a new algorithm that we call PWC4.5 was detailed in Chapter 5, in addition to two experiments to evaluate its performance in addressing PWCCP in comparison to traditional classifiers. We found that PWC4.5 outperformed other traditional classifiers for the two samples of PWCCP that we investigated in Chapter 5, which were an artificial dataset problem and the translator stylometry identification problem.
The contributions of this thesis can be recapped as follows:
1. Evaluating existing methods such as vocabulary richness, most frequent words, and favourite words as stylometric features for the problem of translator stylometry identification showed that these features do not have the discriminative power to identify translators’ signatures. Changes in the values of the different measures were driven by changes in the original text rather than by different translators’ styles.
2. Contrary to previous findings that this problem does not have an automatic solution, Experiment I in Chapter 4 presented a first attempt to counteract this belief. We demonstrated that translators cannot disappear under the skin of the original authors; they have their own identities. Different translators represent the same idea in different ways. Although some existing
authorship attribution methods could not capture these differences [121] and [154], this study shows that we can use a computational linguistics approach to differentiate between translators’ styles.
3. Network motif frequencies showed promising usefulness as stylometric features for the problem of translator stylometry identification using a subset of the corpus. Classifying translations into their translators based on network motif frequencies achieved 70% accuracy in multiple tests in Experiment I.
4. In Experiment II, in which the entire corpus was used to evaluate the performance of network motifs in comparison with other features, network motifs did not achieve acceptable accuracy. An investigation that we conducted into the cause of that change in the results showed that the frequencies of network motifs were affected by variations in the original text size. A short text is represented by a small network; thus, the count of an interesting network motif associated with a specific translator in such a network differs from the count of the same network motif in a network representing a large text. The relationship identified by that investigation was that the frequency of Mx(translator A) was frequently the minimum of the frequencies of Mx(translator A) and Mx(translator B), regardless of the original text size. To exploit this relationship, we replaced the frequencies of network motifs by their ranking, and then we fed this new representation to the same classifiers. That new representation achieved a significant change in the results, and the overall average accuracy achieved was 79.02%.
5. Another contribution of Experiment II is that it demonstrated that global network features are not useful as translator stylometric features. Evaluating the accuracy of classifiers based on global network features showed an average accuracy that ranged from 43.82% to 52.57% using different classifiers.
6. The interesting phenomenon that was identified throughout the analysis of Experiment II guided us to a new definition of the problem of translator stylometry identification as a comparative classification problem, where each
pair of parallel translations is considered as a pair of instances. Traditional classifiers could not handle this type of problem, as the interesting pattern can only be seen if we examine the relationship between paired instances. In Chapter 5, we proposed a new classifier that can handle this type of problem. The proposed classifier is based on the C4.5 decision tree algorithm. The difference between the C4.5 algorithm and the PWC4.5 algorithm is in the handling of numerical attributes. While C4.5 tries to find the split that maximizes the purity of the resulting subsets, the proposed algorithm identifies the potential propositional relationship between instances and splits them based on this identified relationship. The proposed classifier achieved high accuracy in comparison to C4.5, which failed to differentiate between the paired instances, with an average accuracy of 52.12%. In that experiment, which included 21 pairs of translators, the proposed classifier achieved an average accuracy of 78.81%. A t-test was conducted to compare the accuracy of the two classifiers for each of these 21 pairs. The finding from this experiment was that the proposed classifier achieved significantly higher accuracy for each of these 21 pairs.
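To illustrate why the relational split helps where a threshold split does not, here is a toy pair-wise dataset in the spirit of Chapter 5's artificial experiment (the construction is ours, not the thesis's): each pair is labelled purely by which member holds the larger value, so individual values carry no class signal, while the min/max relation within a pair separates the classes perfectly.

```python
import math
import random

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    probs = [labels.count(y) / n for y in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

rng = random.Random(1)
# Each paired instance (p1, p2) gets class "+" when p1 < p2, "-" otherwise.
pairs = [(rng.random(), rng.random()) for _ in range(200)]
labels = ["+" if a < b else "-" for a, b in pairs]

# PWC4.5-style split: test whether p1's value is the min or the max of
# the pair -- the relation R(V(p1), {V(p1), V(p2)}) used in the thesis.
min_side = [y for (a, b), y in zip(pairs, labels) if a < b]
max_side = [y for (a, b), y in zip(pairs, labels) if a >= b]
# Both children are pure, so the relational split removes all class
# uncertainty, while any single-value threshold on p1 or p2 alone sees
# near-identical class mixtures on both sides.
```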
In conclusion, this thesis identified new stylometric features that can be used successfully for the problem of translator stylometry identification. It also proposed a new classifier that is able to handle comparative classification problems.
6.2 Future Research
This section introduces some research directions in which this work may be extended in future investigations in the field of translator stylometry.
One future direction for this study concerns feature identification. That includes investigating the smallest text size to which network motif search can be applied, the performance of network motif search when applied to multiple translators, and the efficiency of network motif search for other types of stylometric analysis.
Another future direction is related to the proposed methodology. In this study, we considered the relationship between instances. Splitting data based on a given nominal attribute is not considered yet, as it requires maintaining a long-distance relationship between the paired instances if they are split. Modifying the algorithm to consider this issue is expected to enhance the classification accuracy.
Additionally, the promising results that we achieved using the proposed classifier could motivate investigating other types of real-life comparative classification problems that can be handled in similar contexts, such as repeated diagnostic measures in the medical domain.
Appendix A
Dataset Description

Table A.1: Summary of Dataset Description. For each chapter of the corpus (Chapters 41–114 of the Qur'an, from Fussilat through An-Nas), the table lists the chapter number, the chapter title, the number of verses (Ayat), the number of words in the original Arabic text, and the number of words in each of the seven English translations: Asad, Daryabadi, Maududi, Pickthall, Sarwar, Raza, and Yousif Ali.
Appendix B
Most Frequent Words
Table B.1: Relative Frequency of the 40 Most Frequent Words for Each Translator over the Entire Corpus. For each of the seven translators (Yousif Ali, Raza, Sarwar, Pickthall, Maududi, Daryabadi, and Asad), the table ranks the 40 most frequent words in their translations together with the relative frequency of each word. The top ranks are dominated by shared function words (e.g. "the", "and", "of", "in", "to", "a", "is", "it", "that", "they", "them"), while lower ranks reveal translator-specific choices such as "allah" versus "god", "ye" versus "you", and archaic forms including "unto", "thou", "hath", "verily", "lo", and Asad's distinctive "sustainer".
Appendix C
5D Decision Trees Analysis