University of Alicante

Final Master's Thesis in English and Spanish for Specific Purposes
Department of Language and Philosophy

MAFIA LANGUAGE ANALYSIS AND DETECTION USING COMPUTATIONAL LINGUISTICS TOOLS

Author:

Elena Morandini

Supervisor:

Prof. Elena Lloret

Regular Session, Academic Year 2020/2021

ABSTRACT

Hate speech and violent language detection with Natural Language Processing (NLP) tools are highly topical. If such technology could be applied to criminal jargon, it would give law enforcement agencies new resources to speed up investigations and evidence gathering against organised crime. This paper proposes the innovative approach of using computational tools and Machine Learning (ML) to detect Mafia language in the electronic surveillance transcriptions used as evidence in Italian courts. Starting from a null hypothesis, this research will try to demonstrate the alternative hypothesis: that the Mafia and no-Mafia language variables can be differentiated. The first part of the investigation will determine the extent to which NLP tools detect Mafia language features, contrasting them with no-Mafia criminal jargon. The second part will describe an experiment to demonstrate how ML tools can recognise the Mafia variable for investigative purposes. In the conclusions, the effectiveness of the proposal as a pioneering method to fight the Mafia using its own language will be discussed, defending the work of linguists and leaving the door open to new perspectives.

Keywords: Computational Linguistics, criminal, mafia, language, Natural Language Processing tools, Machine Learning.

RESUMEN

Natural Language Processing (NLP) is at its peak, allowing the automatic analysis of the millions of data items available in order to detect and block violent language on social networks. The starting point of this study is the possibility of applying this technology to the gathering of judicial evidence in order to speed up investigations against organised crime, specifically against the Sicilian Mafia. This innovative approach envisages using Machine Learning (ML) for the first time to detect Mafia language in electronic surveillance transcriptions used as evidence in Italian courts. Starting from the assumption that the linguistic variables Mafia and no-Mafia are the same, that is, that they cannot be distinguished, the study will attempt to demonstrate the alternative hypothesis: that Mafia and no-Mafia language are independent and can therefore be discriminated. The first part of the work will determine whether NLP tools identify linguistic features exclusive to Mafia language, contrasting them with non-Mafia criminal jargon. The second part will describe a pioneering experiment whose purpose is to demonstrate how an ML tool can recognise the Mafia variable for investigative purposes. The conclusions will assess the effectiveness of this approach as a pioneering method of fighting the Mafia with its own language, vindicating the work of linguists and leaving the door open to new possibilities.

Keywords: Computational Linguistics, Mafia, criminal language, Natural Language Processing, Machine Learning.

Acknowledgements

This paper started with a conversation with Lorenzo and matured during a Master's presentation on Legal English. The experiment with Weka and the ML approach would not exist without my supervisor, Prof. Elena Lloret. Her ideas, constant aid, motivation and support (and patience) have been essential to this study. Avv. Claudio Falleti helped with the no-Mafia dataset: without part of his documentation, no experiment could have been done either. The effort is dedicated to Ettore. He was deprived of time when time was the only thing we had. None of this would even exist without Prosecutor , in beloved memory.

Layout with LaTeX, just to make it more difficult, if possible.

Contents

1 Introduction

2 Discipline of Study
  2.1 Text Mining
  2.2 Corpus Linguistics
  2.3 Natural Language Processing modules
  2.4 Solving ambiguities
    2.4.1 Ambiguities in Mafia language
  2.5 Strategies for addressing Natural Language Processing problems
    2.5.1 Machine Learning
  2.6 Practical applications of Natural Language Processing
    2.6.1 Document Classification

3 State of the Art

4 Problem and purpose statements

5 Methodology
  5.1 Dataset creation
    5.1.1 Mafia dataset
    5.1.2 No-Mafia dataset
  5.2 Dataset cleaning
    5.2.1 Mafia dataset
    5.2.2 No-Mafia dataset
  5.3 Dataset processing
    5.3.1 Mafia dataset
    5.3.2 No-Mafia dataset
    5.3.3 Partial conclusions on datasets content processing

6 Research
  6.1 Content Analysis
    6.1.1 AntConc
    6.1.2 RStudio
    6.1.3 T-lab
    6.1.4 Partial conclusions on the content analysis
  6.2 Machine Learning experiment
    6.2.1 Datasets adaptation
    6.2.2 Weka settings and experiment

7 Results and Discussion

8 Conclusions
Bibliography

List of Tables

2.1 Tokens, stems, lemmas and PoS tags for one Mafia dataset sentence

5.1 Final dataset words distribution

6.1 Keywords frequency with AntConc
6.2 Final dataset distribution
6.3 Final composition of .arff files dataset for Weka

7.1 Weka experiment success rate for detecting Mafia Language
7.2 Weka experiment confusion matrix results
7.3 Success rate for the baseline results (majority class approach)

List of Figures

2.1 Euler Venn Diagram with computational fields and sub-disciplines relationships [Talib et al., 2016]
2.2 Sample Mafia dataset sentence output with Babelfy
2.3 Example of Mafia dataset transcriptions with translations Italian-Sicilian dialect
2.4 Machine Learning Life Cycle
2.5 Euler Venn Diagram Document Classifications and the other text-oriented sub-disciplines of NLP and IE relationships [Talib et al., 2016]

5.1 Datasets creation and content process flowchart
5.2 Judgement retrieved from Archivio Antimafia in Acrobat .pdf file format
5.3 Acrobat .pdf as non-OCR file, dismissed from Mafia dataset
5.4 Reconstruction word by word of supposed "good quality" .pdf file for Mafia dataset
5.5 Cleaning .pdf for dataset with RStudio [RStudio Team, 2020]
5.6 Using Notepad++ and regular expressions for cleaning process
5.7 Mafia language dataset Lemmatisation and Stemming with NLTK
5.8 PoS Tagger with SpaCy in Colab
5.9 Example of a sentence Parsing with SpaCy and UDPipe in Colab
5.10 No-Mafia dataset parser output with SpaCy UDPipe in Colab
5.11 No-Mafia dataset PoS Tagger and dependency parsing analysis with SpaCy in Colab

6.1 Step-by-step content analysis procedure flowchart
6.2 Mafia dataset splitting into sentences with RStudio
6.3 Mafia dataset words frequency experiment with RStudio
6.4 RStudio words cloud experiment with sample of Mafia language dataset
6.5 Mafia dataset topic analysis result for discorso
6.6 No-Mafia dataset topic analysis result for discorso
6.7 Mafia dataset co-word analysis result for the theme discorso
6.8 No-Mafia dataset co-word analysis result for discorso
6.9 Cloud of words for mandamento and capimandamento in Mafia dataset
6.10 Co-word analysis result for mandamento in Mafia dataset
6.11 No-Mafia dataset cloud of words analysis result for discorso
6.12 No-Mafia dataset co-word analysis result for cristiano
6.13 Weka procedure and settings for experiment flowchart
6.14 Problems to import .arff files into Weka
6.15 Example of .arff files structure
6.16 Weka interface: Traindataset.arff uploaded
6.17 Weka interface: test classifiers settings

7.1 Weka final results

Chapter 1

Introduction

Technological advances have undoubtedly made it possible to analyse immense amounts of diverse data in extremely short times, which was unthinkable not so long ago. Human or natural language (NL) is one of these diverse data types. Applied to the linguistic field, Natural Language Processing (NLP) and Machine Learning (ML) allow the detection and analysis of NL in spoken and written form for different purposes. One of the most talked-about and highly topical uses of NLP and ML is the detection of written hate speech on social networks, like Twitter or Facebook, as in the recent study by Rani et al. (2020). Public opinion has forced the communication giants to regulate and censor posts considered controversial or politically incorrect on their digital platforms. The only way to deal with this exorbitant amount of data is automatically, by developing ML algorithms that offer a perfect example of NLP in application. Their job is to read immense volumes of information posted in different languages almost instantly and block social media accounts because of inappropriate language. In order to develop this technology for identifying, detecting and blocking so-called hate speech, NLP and ML have made giant strides. In a relatively short time, increasingly powerful algorithms have been adapted to different areas and might pertain to any field in which the processing of large amounts of diverse data becomes necessary. For example, IBM Watson Health1 has beneficial applications in medicine. Its algorithms can extract relevant information from large amounts of data consisting of clinical notes, discharge summaries, clinical trial protocols and literature data to generate insights for many applications, like identifying pertinent problems, procedures and medications to develop better treatments.

1 https://www.ibm.com/watson-health

Indeed, if powerful tools can process millions of words per minute to find specific elements, a profound reflection on the application of this new field should be possible from the linguistics perspective. For example, in forensic linguistics, NLP tools are used to analyse circumstantial evidence for authorship identification, as in Carole Chaski's [Chaski, 1997] computer software ALIAS2: "Automated Linguistic Identification and Assessment System for law enforcement, investigators, crime laboratories, human resources departments, security teams and linguists". Nevertheless, these innovative perspectives are almost limited to the United States. A pioneering initiative is Roxanne3, a collaboration between the Criminology department of the Università Cattolica del Sacro Cuore di Milano and the European Community "aiming to unmask criminal networks and their members as well as to reveal the true identity of perpetrators by combining the capabilities of speech/language technologies and visual analysis with network analysis" [Università Cattolica del Sacro Cuore and Transcrime Department, 2019]. However, the application of NLP to the legal and investigative fields is scarce, and much remains to be done; specifically, and to enter fully into the scope of this research, in detecting the Mafia language. The plague of the Sicilian Cosa Nostra, which affects and many other countries, could be fought with better tools if its peculiar language were analysed, distinguishing it from common criminal jargon. The term Mafia refers to a system of power exercised through violence and intimidation to control territory, illegal trade and business activities [Torrealta, 2010]. It represents a system of alternative power with hierarchical and top-down management, based on internal rules, which are founded on the use of violence and intimidation [Falcone and Padovani, 1991, pp. 29, 82, 94]. The Mafia is also known as the Honourable Society, hence the term uomini d'onore for its members. It is rooted in , where it has taken the name of Cosa Nostra since the XIX century. The family is at the base of the organisation, consisting of individual "men of honour" associated and coordinated by a , acting in a given territory. If the family is large, several capidecina refer to a single representative. Three or more contiguous families per territory constitute a mandamento, with one head. The mandamento heads, in turn, meet in the provincial commission,

2 https://aliastechnology.com/
3 Set up in 2019. Official Web page available at: https://roxanne-euproject.org/

la Commissione. The chiefs of the province have always exercised hegemony over Cosa Nostra, even though there are several mandamenti across the whole of Sicily [Falcone and Padovani, 1991].

The first maxi-trial (February 1986 - December 1987) against 475 defendants could be held thanks to the investigations of Prosecutor Giovanni Falcone and the Anti-Mafia Pool. The evidence gathered for the trial was based on the statements of , the first informer and turncoat, whose relevant information on Cosa Nostra's inner structure led to the final convictions in 1992. For the first time, the Italian judiciary proved capable of affecting the Mafia organisation, thanks to the aggravating circumstance of Associazione di stampo mafioso or "Mafia-style association"4.

Lamentably, Cosa Nostra's reaction to the Maxi-Trial and the application of the new law triggered an offensive of real Mafia terrorism against the State. Between 1992 and 1993, a series of and massacres took place, such as the bombings that caused the death of Giovanni Falcone together with his wife and escort (Capaci, 23rd May 1992) [Palazzolo and Prestipino, 2017].

The judicial response to these events produced new laws which introduced, among other things, bodies specialised in the investigative and judicial fight against the Mafia, such as the Direzione Investigativa Anti-Mafia (DIA5), and a regime of hard imprisonment for detained men of honour (the so-called article 41 bis of the Italian Prison Law). For this reason, Cosa Nostra's strategy changed, giving life to the so-called inabbissamento [Torrealta, 2010, p. 38], meaning diving, whose "director" was the boss . In a few words, the presence on the territory remains constant, but without resorting to homicides or sensational acts of bloodshed, making it extremely difficult to fight the Mafia. In the meantime, since the 1990s, trials of politicians and institutions at national and local level have been set up and held for their alleged relations with Mafia organisations, revealing Cosa Nostra's ability to infiltrate Government Institutions to exercise its counter-power with impunity [Torrealta, 2010].

If it were indeed possible to discriminate and highlight the Mafia language among thousands of data, other elements useful for investigative and legal evidence purposes could be gathered thanks to NLP and ML tools. Not only could a difference be made in the fight against crime, but irrefutable evidence could be presented in a courtroom. Furthermore, it would

4 Law Rognoni – La Torre (L. 646/1982) Art. 614 bis and Art. 17 [Francolini, 2020, p. 5]
5 https://direzioneinvestigativaantimafia.interno.gov.it/

undoubtedly speed up the investigative and judicial process by reducing the length of investigations, evidence gathering, evidence cataloguing and the timing of the judicial process. For the discrimination of Mafia language from any other criminal jargon, the lack of theoretical reflection is part of the problem. But if it were possible to mark a line between Mafia and no-Mafia language, it would become necessary to offer linguistic elements that allow algorithms to identify, among millions of conversations, those that could be of investigative interest and, eventually, of preventive use in the fight against crime. Considering the context above, this study is based on the hypothesis that the undeniable semantic load of the Mafia language, rooted in its rural origins in Sicily, can be detected and distinguished from the other languages shared by criminals all over Italy thanks to NLP and ML. A sample of specialised written texts to carry out this analysis will be collected from evidence admitted in court proceedings in Italian. With this dataset, ML algorithms will obtain a model capable of detecting and discriminating the Mafia language. To achieve this purpose, the dataset must first be analysed and processed with specific NLP tools. A null hypothesis is set, predicting that the variable Mafia language and the variable no-Mafia language are the same. An alternative hypothesis is set, predicting that the variable Mafia language and the variable no-Mafia language are independent, meaning they are different and can be analysed so as to recognise them. The process to prove the alternative hypothesis consists of four parts. First, a brief overview of NLP and its applications will introduce the discipline of study and the State of the Art, outlining the most closely related existing research. Second, the methodology applied for the analysis will be described, presenting the NLP and ML tools used for this purpose. Once the methodology has been explained, a description will be offered of the experiment carried out to test whether the Mafia language variable and the no-Mafia language variable are independent, thus different and processable for their discrimination. This dataset will be used to train an ML tool called Weka to discriminate de facto between the two variables. Finally, the results will be discussed, and conclusions will be drawn on whether NLP and ML tools can help detect Mafia language for investigative purposes.

Chapter 2

Discipline of Study

Artificial Intelligence (AI) is the branch of computer science that aims to build computer-controlled machines and software that solve several problems at a time and analyse millions of data points in a few seconds, imitating human brain reasoning structures. In other words, it is the science of training machines to perform human tasks [Lee, 2020]. To fulfil this purpose, AI investigates whether computer systems can communicate naturally with humans: within this area falls Natural Language Processing. So far, computers share information with users through friendly interfaces. Still, communication between humans and machines is not accessible when the user is not an IT specialist, because humans communicate with alphabetic codes called words. In contrast, computers receive and provide binary code data: a language based on eight-digit combinations of 0 and 1 assigned to each character. The human-machine interrelation is further complicated because humans convey information not only with words but with other non-verbal strategies, like silence or eye contact. The information conveyed can be taken for granted, implied, hidden behind a metaphor, tinged with sarcasm and irony, or even mentioned only in a previous conversation. Some of these peculiarities of NL, also called ambiguities, must be solved by computational linguists and engineers, especially now that AI is a reality in everyday life. Conversations with Alexa1 or Siri2, scheduling an appointment with the doctor, or a question-and-answer session with a chatbot to solve internet connection problems seem effortless thanks to the enormous development of NLP: a long journey that started with Alan Turing's Test [Turing, 1950].
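As a small illustration of the point about binary codes (a hypothetical snippet, not part of the thesis tooling), the eight-bit code a computer associates with each character of a word can be listed in a couple of lines of Python:

    # Print the 8-bit ASCII/Latin-1 code assigned to each character of a word.
    word = "Mafia"
    for char in word:
        print(char, format(ord(char), "08b"))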

1 https://developer.amazon.com/es-ES/alexa/alexa-skills-kit/start
2 https://www.apple.com/siri/

NLP studies and provides computationally effective strategies to bring the human-machine interrelation closer to NL, which is less rigid than formal languages like Java, Python, R, or PHP [Cremades et al., 2017]. Both Computational Linguistics (CL) and NLP can be considered names for the same discipline, even if some scientists prefer to separate CL as a theoretical discipline focused more on Linguistics, and NLP as its practical application field. In this paper, the term NLP will be used for both disciplines.

This first introduction shows that Computer science is a field in constant progress and evolution. Its complexity can be seen in Figure 2.1, which aims to encompass the relationships between the various disciplines mentioned in this chapter and the rest of the investigation.

Figure 2.1: Euler Venn Diagram with computational fields and sub-disciplines relationships [Talib et al., 2016]

[Venn diagram labels: Statistics; Data Mining; Machine Learning; AI; Databases; Text Mining; Information Extraction; Document Classification; Information Retrieval; Library & Information Science; NLP; Computational Linguistics.]

After this brief introduction on the main Computer Science disciplines dealing with NL and its processing, the following paragraphs of this section will describe the NLP domains and tools used throughout this research to detect Mafia Language.

2.1 Text Mining

Text Mining (TM) is the field of interest within which this study moves, focusing on the analysis of large quantities of written words such as transcripts of electronic surveillance and statements made in court by defendants in criminal cases. The fact is that "more than 80 percent of today's data is composed of unstructured or semi-structured data. The discovery of appropriate patterns and trends to analyse the text documents from massive volume of data is a big issue" [Talib et al., 2016, p. 414]. At this point it is important to remark that unstructured data consists of text as well. TM is a method of extracting meaning from a massive volume of raw text, as the name suggests, obtaining a knowledge model that represents behavioural patterns for better understanding. Once the extraction of meaning and behavioural patterns has been carried out, TM proceeds with the interpretation and evaluation of the data, verifying that the conclusions obtained are valid and satisfactory. TM also deals with the preparation, probing and exploration of data to obtain information that is not visible. Mining techniques make it possible to address prediction, classification and segmentation problems that are very useful for companies and their business and production approaches, for example following patterns of customer satisfaction with a product or service. Together with Data Mining (DM), TM is linked to ML, as both deal with large volumes of data: DM with databases or tabular data, TM with pure text. ML tools are sometimes used to extract data and text [Talib et al., 2016].

2.2 Corpus Linguistics

Corpus Linguistics is another area that collects and classifies a vast amount of text data for NLP and an analytical approach to NL study by applying linguistic theories to computational methods searching for patterns. Corpus Linguistics takes advantage of the statistical approach to texts by using frequency information about words or sentences and combines these statistical methods with functional interpretations [Mitkov, 2003, pp. 439-441]. Thanks to this statistical analysis based on methodological principles of rigour, transparency, and replicability and the use of co-occurrences modules, Corpus Linguistics can identify linguistic

patterns of language in use and, depending on the research question, make assertions about language use in discourse. TM is often applied to large amounts of text called corpora. "A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research" [Navarro-Colorado, 2021]. In order to analyse large quantities of procedural documents with NLP modules, this study has partly used corpus collection procedures without reaching the large amounts of text required to define it as a corpus. For such reason, the compilation has been defined as a dataset, as explained in the Methodology section.
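As a minimal sketch of the frequency-based approach just described (an illustration only; the thesis performs its frequency analyses with AntConc and RStudio, see Chapter 6), word frequencies over a small text can be computed as follows:

    # Count word frequencies in a toy text; the sample sentence is invented,
    # not taken from the thesis dataset.
    from collections import Counter
    import re

    text = "la famiglia decide e la famiglia comanda"
    tokens = re.findall(r"\w+", text.lower())
    frequencies = Counter(tokens)

    for word, count in frequencies.most_common():
        print(word, count)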

2.3 Natural Language Processing modules

The modularity approach of NLP follows a scheme to retrieve the relevant information during each stage of analysis and provide an output with the interpretation of a given text. Each module fulfils a different task and is deterministic, in the sense that the same input will always produce the same output. The different types of module that form the structure of an NLP analysis will be described in the following subsections.

Table 2.1: Tokens, stems, lemmas and PoS tags for one Mafia dataset sentence

Token      Stem       Lemma      PoS Tag
Picciotti  Picciott-  Picciotto  NOUN
vedete     ved-       vedere     VERB
di         d-         di         ADP
trovare    Trov-      trovare    VERB
un         un-        un         DET
incontro   incontr-   incontro   NOUN
.          .          .          PUNCT

Tokenisation

It is the first module. It breaks down the given text into smaller units called tokens, removing the spaces between them. The tokens can be words, numbers or punctuation marks. This first module of the approach to the NL problem is generally considered solved in NLP, thanks to its excellent performance.

8 Stemming and Lemmatisation

Stemming reduces the inflected form of a word to its root or stem by analysing its inflexion. The lemma is the word as listed in a dictionary, and lemmatisation maps each token to its lemma. Tokenisation, stemming and lemmatisation constitute the first layer of grammatical analysis among the NLP modules.
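A minimal sketch of these first modules, assuming NLTK and its Italian resources are available (the thesis applies lemmatisation and stemming with NLTK, Figure 5.7); the sentence is the one from Table 2.1:

    # Tokenise and stem the Table 2.1 sentence with NLTK's Italian resources.
    import nltk
    from nltk.stem.snowball import SnowballStemmer

    nltk.download("punkt", quiet=True)  # tokeniser models

    sentence = "Picciotti vedete di trovare un incontro."
    tokens = nltk.word_tokenize(sentence, language="italian")

    stemmer = SnowballStemmer("italian")
    for token in tokens:
        print(token, "->", stemmer.stem(token))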

Part of Speech (PoS) Tagger

A PoS Tagger performs a morpho-lexical analysis that "assigns a PoS tag or label to each word of an input text. The tagger first obtains the set of possible PoS tags for each word from a lexicon and then disambiguates between them based on the word context" [Clark and Shalom, 2010]. A PoS label is an abbreviation of the grammatical features obtained with the PoS Tagger analysis. In the example in Table 2.1, the PoS labels are NOUN, VERB, ADP (adposition/preposition), PUNCT (punctuation) and DET (determiner). Different tools might offer different tags, including the verb tense or the number and the case3. For a full understanding of each label, every PoS Tagger has its own list available for consultation.
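A hedged example of PoS tagging with spaCy, under the assumption that the Italian model it_core_news_sm has been installed (python -m spacy download it_core_news_sm); the thesis runs a comparable analysis with SpaCy in Colab (Figure 5.8):

    # PoS-tag the Table 2.1 sentence; token.pos_ holds the universal PoS label.
    import spacy

    nlp = spacy.load("it_core_news_sm")
    doc = nlp("Picciotti vedete di trovare un incontro.")

    for token in doc:
        print(f"{token.text:10} {token.lemma_:10} {token.pos_}")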

Chunker

Chunking is an alternative module to parsing and can be a first step in constructing recursive phrase-structure analyses. It is more limited than full parsing, but sufficient for extracting or ignoring information. It is widely used because it is a faster process, offering more robust results than parsing [Mitkov, 2003].

Parsing

It offers a syntactic analysis output and processes the relationships between words. Once the morphological analysis of the words has been established with the PoS Tagger, the Parser output consists of either a dependency or a constituency tree, as in the following example.

3 SpaCy UDPipe PoS labels can be consulted at: https://universaldependencies.org/u/pos/

(S
  (NP La Mafia)
  (VP
    (V è)
    (NP Cosa Nostra.)))

The sequence of modules might vary, but to fulfil a correct analysis, the input texts must pass through the previous modules of tokenisation, stemming, lemmatisation and PoS tagging. Once the category of each token has been identified, the Parser module disambiguates and offers its best output.
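For the dependency-tree output mentioned above, a small sketch with spaCy (same assumption about the Italian model as before; the thesis parses with SpaCy and UDPipe in Colab, Figures 5.9-5.11):

    # Print a dependency parse: each token with its syntactic relation and head.
    import spacy

    nlp = spacy.load("it_core_news_sm")
    doc = nlp("La Mafia è Cosa Nostra.")

    for token in doc:
        print(f"{token.text:8} --{token.dep_}--> {token.head.text}")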

Semantic Parsing

It is a further step beyond syntactic analysis: a deep parsing that generates logical form graphs from utterances using a Word Sense Disambiguation (WSD) statistical assignment system [Bose et al., 2020]. NLP has managed to offer WSD as a solution for semantic ambiguity by giving a token (in this case, a word) its meaning in a set context, thereby resolving its ambiguity. An example is the "word sense database" WordNet4, a semantic-lexical database of English developed by Princeton University. This type of linguistic analysis improves computer-related writing tasks, such as discourse, search engine relevance, anaphora resolution, coherence and inference. All-Italian examples are Babelfy5 and BabelNet6, "an innovative multilingual encyclopedic dictionary, (...) and a semantic network/ontology in a network of semantic relations, made up of about 20 million entries. It (...) follows the WordNet model based on the notion of synset (for synonym set), but extends it to contain multilingual lexicalisations". Babelfy has been tested with the same sentence as the previous modules, offering the output shown in Figure 2.2, where the algorithm correctly recognised the concepts included in the sentence and offered the definitions retrieved from Wikipedia. The module was asked to provide outputs in English, as for vediamo, trovare or incontro. For Picciotti, the output is in Italian, because there is no correspondence on Wikipedia for

4 https://wordnet.princeton.edu/
5 http://babelfy.org/index
6 https://babelnet.org/

10 English 7. The rest of the concepts have been offered with their correspondence in English. One of the applications of Semantic Parsing is information retrieval [Mitkov, 2003].

Figure 2.2: Sample Mafia dataset sentence output with Babelfy

Once valid solutions to NL ambiguities have been offered, algorithms still need to understand the message implicit in NL and execute what they are asked to do. In other words, they need to process an input and deliver an output automatically, learn from it in a constant flow, and teach other algorithms to replicate and improve in the process.

2.4 Solving ambiguities

To fully understand NL and the messages it conveys implicitly, its different types of ambiguity still represent a problem to be solved by NLP. Since Turing, scientists' interest has focused on solving the ambiguity problem in the three components of NL, applying the modular approach: dividing the vast NLP problem into smaller pieces, literally into small parts, and separating written text from speech. These three components of NL to be solved are Syntax, Semantics and Pragmatics. For example, I saw a bat is a short sentence with different meanings inside the same word category, the noun bat being either a flying mammal or a wooden club. At the same time, this lexical ambiguity can cross categories, as for saw: it can be the Past Tense of the verb to see, the Present

7https://it.wikipedia.org/wiki/Picciotto

Tense of the verb to saw, or the tool saw. A computer must choose among the different options within the same or different categories of words, nouns and verbs, as in MT. The next challenge for NLP is structural or syntactic ambiguity, as in the example: Ann ate a salad with spinach from Cardiff for lunch on Tuesday. In this sentence, with spinach can depend on salad or ate, just as from Cardiff might depend on spinach, salad, or ate. For lunch might depend on from Cardiff, spinach, salad, or ate, and on Tuesday can depend on for lunch, Cardiff, spinach, salad, or ate. The sample can have up to forty-two possible parse trees, as crossovers such as on Tuesday to spinach and for lunch to salad are not allowed. Of course, with conjunctions, as in Ann ate a salad with spinach from Cardiff for lunch on Tuesday and Wednesday, the difficulty and the ambiguities increase. Consequently, semantic ambiguity is the most significant issue, because even after NLP has resolved the NL syntax and the meanings of the individual words, there are still two (or more) ways of reading a sentence, as in the above example with bat. For example, in Lucy and Tom are married, they might be married to each other, or they might be married separately. In Tom kissed his wife, and so did Bob, Bob might have kissed Tom's wife, or his own. In these sentences, the morpho-lexical analysis might be easy. Still, the meaning of the sentences is difficult to retrieve, not only for machines but also for humans, as Pragmatics is involved when context is necessary to understand the utterance. Pragmatics has become a significant focus of NLP research, mainly because of its relevance to computer systems designed to engage in purposeful dialogues with human beings [Mitkov, 2003]. Pragmatics originated in philosophical thought in the work of Searle [Searle, 1980] and Grice [Grice, 1975]. Its academic abstraction makes it difficult to adapt to concrete computational applications, such as anaphoric ambiguity, where, for the economy of language, a phrase or word refers to something previously mentioned, offering more than one possible interpretation. For example, in Ann invited Helen for a visit, and she gave her a good lunch, she is Ann and her is Helen. At the same time, in On the train to London, Tom chatted with another passenger. He turned out to be a linguist, He refers to another passenger. Alternatively, in Tom went to the hospital, and they told him to go home and rest, they corresponds to the hospital staff. Pragmatics also involves metonymy, as in The White House made an announcement today, where the White House is the President's staff, or metaphor, as in the proverb It is raining cats and dogs, whose meaning is that it is raining a lot. Finally, there is ellipsis, the omission from a sentence of one or more words that can be assumed from the context of the remaining elements, as in John can play the guitar, and Ann (can play the guitar,) too.
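The lexical ambiguity of bat discussed above can be made concrete with WordNet, the sense inventory mentioned earlier; a minimal sketch assuming NLTK's WordNet corpus has been downloaded:

    # List the WordNet senses (synsets) of the noun "bat": the flying mammal,
    # the wooden club, and the other readings a WSD module must choose between.
    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)

    for synset in wn.synsets("bat", pos=wn.NOUN):
        print(synset.name(), "-", synset.definition())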

2.4.1 Ambiguities in Mafia language

In terms of ambiguities, it is necessary to spend a few lines on the Mafia language.

First, this language belongs to the Sicilian dialect, generally in its Palermitan variant. In transcriptions, for example, the specialist renders the parlance into Italian, often introducing translations in brackets for specific terms such as munzeddu8 for many, or pisantuliddu, as per Figure 2.3. Having said this, the Mafia language presents ambiguities like polysemy that make

Figure 2.3: Example of Mafia dataset transcriptions with translations Italian-Sicilian dialect

both analysis and comprehension difficult from the lexical point of view. For example, the verb astutare means to extinguish. In his famous pizzini, Provenzano9 uses this verb to avoid the word kill or [Camilleri, 2009]. Another example is the word raggiunamentu, which is not a logical and rational discourse, but an actual settling of accounts for an insult. For the Mafia mentality, the most logical reasoning for an insult is the death of the adversary. In this sense, it could almost be ventured that Sicilian, in Mafia parlance, functions as a "technolect": it uses the structure of Italian and the Sicilian dialect but makes a closed and complex use of polysemy to avoid being understood by other speakers.

All the ambiguities described above show that, despite all the efforts, NL, and even more so Mafia language, is difficult to understand. The primary goal of NLP is to make it understandable to machines, solving these ambiguities and making computers speak any human language [Espunya i Prat, 1994].

8 https://it.glosbe.com/scn/it/munzeddu
9 The so-called boss of bosses from , shared his with Totò Riina.

2.5 Strategies for addressing Natural Language Processing problems

NLP has two different strategies for approaching NL: the first is rationalist, based on manually introduced rules that offer precision in their results. The second approach is corpus-based and dedicated to ML, where a corpus is used for training. ML algorithms extract statistically robust rules from previous training and apply them to a new and unknown corpus. The advantage of this approach is its coverage of a vast amount of data. Both strategies are valid and in use. However, for the analysis in this study, the ML strategy will be used to establish whether the alternative hypothesis of the existence of a Mafia language is viable with a non-annotated dataset, a deliberate choice since this study is a first approach to ML tools for Mafia analysis purposes.

2.5.1 Machine Learning

But how can a machine really understand and reproduce NL? The dedicated field of investigation is ML, a sub-discipline of AI and "a subject that studies how to use computers to simulate human learning activities, and to study self-improvement methods of computers that to obtain new knowledge and new skills, identify existing knowledge, and continuously improve the performance and achievement" [Wang et al., 2009]. This learning activity is faster than human learning and is constantly fed back by new inputs, as per Figure 2.4.¹⁰ There are two main ML approaches.¹¹ The first one is the supervised method. It consists of providing computer systems with specific and codified data, models and examples, in order to build a database of information and experiences. When the machine is faced with a problem, it has to draw on the experiences inserted in its system, analyse them and decide which output to offer based on the already codified experiences. This type of learning, supported by statistical analysis, allows the machine to choose the best response to inputs. Supervised learning algorithms are used in many sectors, as in voice identification: they can make inductive hypotheses obtained by scanning a series of specific problems, offering suitable solutions to general problems.

10 Adapted from the LaTeX tutorial at https://texample.net/tikz/examples/tag/diagrams/
11 https://www.intelligenzaartificiale.it/machine-learning/

Figure 2.4: Machine Learning Life Cycle

[Circular diagram: ML at the centre, surrounded by the cyclical stages Data, Training, Evaluation, Test and Reaction.]

This method will be applied in the experiment to detect the Mafia language variable in the present paper.

The second automatic learning method, also called the unsupervised method, foresees that the input is not codified: the machine can draw on certain information without having any example of its use and, therefore, without knowing the expected results of the choices made. The machine must catalogue all the data (self-collected or given as input by the system), organise it and learn its meaning, its use and, above all, the result to which it leads. Learning without supervision gives the machine more freedom to organise the information and learn the best results for the different inputs.

The supervised method is the most common: all data is labelled, and the algorithms learn to predict the output from the input data. It could be compared to a teacher (the training data) that guides a student (the test data) to learn and apply what has been learned. Conversely, in the unsupervised method all data is unlabelled, and the algorithms must learn the patterns of the input data by themselves. It is a self-taught learning method.
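To make the teacher/student analogy concrete, here is a toy supervised-learning sketch with scikit-learn (an illustration only; the actual experiment in Chapter 6 is run with Weka, and the sentences and labels below are invented):

    # Train a classifier on labelled toy sentences, then predict an unseen one.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "il mandamento decide l'incontro",      # invented "Mafia-like" examples
        "i picciotti portano rispetto",
        "passami la roba stasera",              # invented generic-crime examples
        "il carico arriva domani al porto",
    ]
    train_labels = ["mafia", "mafia", "no-mafia", "no-mafia"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)        # the "teacher": labelled data

    # The "student": an unseen sentence to be classified.
    print(model.predict(["trovare un incontro con il mandamento"]))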

The following section offers a brief overview of NLP practical applications.

2.6 Practical applications of Natural Language Processing

In the beginning, language problems had to be solved solely in terms of numbers. Nowadays, the goal of NLP is to write programs that can understand and generate language material as naturally as possible, both written and spoken. Back in the Fifties, the first applications of NLP theories were focused on automatic, or Machine, Translation (MT) and on written language [Cambria, 2014]. Nowadays, NLP applications can be grouped into the following specialities dedicated to oral and written language, as summarised by Lee [2020]:

• Neural Machine Translation (a dramatically improved automatic translation that uses Artificial Neural Networks to predict the likelihood of a sequence of words, offering a better and natural result);

• Information Extraction (IE), dedicated to extracting unstructured or semi-structured information from thousands of machine-readable texts, as mentioned in the introduction;

• Information Retrieval (IR) dedicated to retrieving information from thousands of pages. Any web search engine (Bing, Google) is an IR application;

• Sentiment Analysis (SA), to measure people’s opinion on a product or service or customers’ needs;

• Q&A like chatbots;

• Robot systems, such as technical support robots, customer service robots, language learning tutors, and companion devices like Alexa.

Smart devices interpret NL thanks to innovative algorithms that digest words and transform them into an appropriate computing representation (i.e. a sequence of 0s and 1s) so that the machine can interpret them and return them in a coherent structure understandable by humans.

2.6.1 Document Classification

Document Classification (DC) is one of the many practical applications of NLP, which consists of retrieving, classifying and interpreting raw texts with ML algorithms [Kastrati and Imran, 2019]. The vast amount of text produced and available daily, and not only on the Internet, requires

an automatic classification. Take the news as an example: press articles would be the documents, and each genre would be one of the classes, such as national, foreign press, financial and sports. Words characterise documents, and one way to apply ML to document classification is to treat the presence or absence of each word as an attribute. For news classification, it is essential to consider the specific weight of the words, because it drives the classification. The word football indicates the class of sports, being of great importance when determining the category of the document (press). In DC, "a document can be seen as a bag of words - a set containing all the words in the document, with multiple occurrences of a word appearing several times (technically, a set includes each of its members only once, whereas a bag may have repeated elements). Word frequencies can be adjusted by applying separate dedicated algorithms" [Witten et al., 2011, p. 97]. Document classification has been used in this investigation to determine whether the Mafia variable and the no-Mafia variable can be automatically classified with NLP tools. Concluding this introductory section of the discipline that frames this study, Figure 2.5 helps to identify Document Classification as the core centre of several disciplines, such as keyword search or sentiment analysis. It also requires the use of the NLP modules already mentioned in Section 2.3.
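A small sketch of the bag-of-words representation described above (the two toy documents are invented, not drawn from the thesis dataset): each word becomes an attribute whose value is its count in the document.

    # Turn two toy documents into bag-of-words vectors of word counts.
    from sklearn.feature_extraction.text import CountVectorizer

    documents = [
        "the team won the football match",    # a "sports" document
        "the bank raised the interest rate",  # a "financial" document
    ]

    vectorizer = CountVectorizer()
    bags = vectorizer.fit_transform(documents)

    print(vectorizer.get_feature_names_out())  # the word attributes
    print(bags.toarray())                      # counts per document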

Figure 2.5: Euler Venn Diagram Document Classifications and the other text-oriented sub-disciplines of NLP and IE relationships [Talib et al., 2016]

[Diagram labels: Keyword search; Information Retrieval; Information Extraction; Text Classification; co-reference/entity resolution; Entity Extraction; Sentiment Analysis; Document Classification; Word Sense Disambiguation; Tokenisation; PoS Tagger; Shallow Parsing; Chunker; Parsing; Lemmatisation; Stemming; NLP.]

Chapter 3

State of the Art

When confronted with an investigation, it is necessary to look around and see how other researchers have tackled the problem. Despite intensive searches on the analysis of criminal or Mafia language with NLP, it has not been possible to find a theoretical background or references on the subject. This absence of previous study and insight on the topic evidences the pioneering aspect of this research. The most similar topic in terms of linguistic approach and computational application is so-called hate speech detection. As mentioned in the introduction, hate speech has transcended boundaries to become a real social problem. Automatic detection of flames and offensive or abusive language is the ultimate NLP frontier, where ML tries to offer solutions to analyse, qualify, identify and censor hate speech among the dozens of exabytes of information created weekly on the Web, in an automatic and unsupervised way [Cambria, 2014]. As a consequence of this increasing interest, the volume of academic articles published on the topic is growing daily. The problem with such an abundance of approaches, studies and points of view is that the researcher looking for information feels overwhelmed. Sometimes, the more texts are consulted, the more confusion clouds the researcher's objective. Among so many varied scientific offerings, the number of articles consulted had to be reduced to a minimum, selecting only those most suitable for the investigation. From each text, the information relevant to the case was extracted. The gist of each text has been used as inspiration to set up an innovative experiment that could be replicated only to a small extent with the collected material, when compared to other research. Moreover, a linguistic analysis

of the Mafia language in the legal context has not yet been carried out with NLP tools.

For the present study, one of the selected investigations is "Offensive language detection using multi-level classification", published by Razavi et al. [2010]. The authors offer a review of the different types and categories of offensive language, such as taunts, references to handicaps, squalid language, homophobia, racism, extremism, crude language, disguise, four-letter words, provocative expressions that may cause anger or violence, taboos and unrefined language. This appalling variety of abusive language poses a classification problem due to its subjectivity when it comes to labelling. It is also a complex classification depending on the context of emission. According to Razavi et al. [2010], from the computational point of view, the relevant factors that can be objectively detected by software are the frequency of words and sentences, where the flaming patterns are weighted and placed in different grades; the words contained in sentences with an abusive/extremist load for each grade; the highest grade in context (maximum weight); and the normalised average of the graded/weighted words or phrases. With these highlights, the authors "designed and implement a fuzzy gauge of flame detection, and implement it in a software that could be modified regarding the acceptable tolerance margin, based on training data, manual adjustment, or even instant labelled contexts" [Razavi et al., 2010, p. 20]. The software used for the training and flame language detection, complemented by a dictionary of "insulting" words, is Weka from the University of Waikato [Witten, 2005], one of the few free software packages available on the net for ML training and testing.

Razavi's article offered a hint about a possible tool to detect Mafia language. However, a dataset was still needed to proceed with the experiment. Looking at how other researchers have solved the problem of a missing corpus, the compilation of a hate dataset offered by Chung et al. is of great interest, being "the first large-scale, multilingual, expert-based dataset of hate speech/counter-narrative pairs", which can be useful to analyse hate speech and help ML tools to detect it [Chung et al., 2019, p. 2819].

Along the same lines, several studies offer hate speech datasets, not only in one language (generally English) but also bilingual ones, as in the case of the study carried out by Rani et al. [2020]. The researchers provide a Hindi-English mixed-code dataset consisting of Facebook and Twitter posts and comments. Their experiment goes even further, as they show that their "deep learning model was able to capture the syntax and semantics of the hate speech more accurately even in the case of an unbalanced and unprocessed dataset" [Rani et al., 2020].

Actually, the fact that most investigative material in NLP is English-based creates a problem for those who venture to analyse minority languages, like Italian. For this reason, a Hindi-English dataset is very welcome. While several academic studies provide a deeper understanding of hate speech on the Internet, no study has been done on criminal language and its automatic detection; not only on the Web, where many criminals share information, but also in SMS, WhatsApp or Telegram messages [Ciconte, 2017]. If nothing has been done on the topic, even less seems to have been done to discriminate the different criminal jargons from the Mafia language, which is relegated to its use of rural life metaphors or paremias. In this regard, it is interesting to mention the number of so-called Mafia dictionaries, such as "Dizionario mafioso-italiano italiano-mafioso" [Ceruso, 2010], or "Voi non sapete" [Camilleri, 2009]. As background reading, other books about the Mafia and its language have been used for this study. First, Prosecutor Giovanni Falcone's extended interview with Marcelle Padovani just before his death, compiled in "Cose di Cosa Nostra". As a man of Law, he was the first ever to evidence the importance of the linguistic code for understanding Cosa Nostra's structure. On a similar line, Prosecutor Prestipino wrote "Il Codice Provenzano", underlining the importance of language codes to fight the Mafia, starting from a description of the boss of bosses' strategy of managing the organisation with letters while hiding from justice for 43 years. The proliferation of this kind of text shows that the Mafia's particular use of words fascinates men of Law and writers, and deserves a proper academic approach from linguists. Coming to linguists, University of Trieste Professor Pontrandolfo compiled an interesting overview of legal corpora in different languages, such as English, German, Spanish and Italian1, where Spain is the country that has managed to collect the most. However, none of those consulted has been helpful for extracting relevant information for this study. Most of them deal with the specialised legal language itself, rather than the different languages used by defendants or witnesses in court [Pontrandolfo, 2012]. The closest study retrieved relating Cosa Nostra to "hard" science is a very interesting article that has provided ample food for thought for this study. As evidenced by the investigation carried out by an international group led by Lucia Cavallaro [Cavallaro et al., 2020] from the

1 http://corpora.dslo.unibo.it/bolceng.html

University of Derby, mathematics can help law enforcement agencies make inroads into Mafia clans. The research started from a real unsolved case, namely the committal for trial of the members of a Mafia gang who had put their hands on contracts for methane gas works. The telephone interceptions between the clan's members and the meeting places (discovered thanks to police surveillance) were reproduced in two networks. With this method, the researchers identified the intermediary passing relevant information between bosses and affiliates. The results were admitted to court and used as evidence in a trial. This practical and socially relevant application of mathematics to the Mafia's criminal network offered an insight into the possibility of analysing and identifying Mafia language with NLP tools, to be used as irrefutable evidence in legal proceedings. With all the above investigations in mind as inspiration, the present study claims the necessity of extending language detection with ML to legal fields, specifically focusing on Mafia language. It also offers a pioneering insight into how this language could be detected automatically with the ML tool called Weka. Should a good result be obtained, it would open the door to technologically advanced ML tools being applied in the legal and forensic linguistics fields.

Chapter 4

Problem and purpose statements

The Mafia is a problem for Italy: a threat to its citizens, institutions and economy. According to a study by Eurispes,1 the Italian Research Institute, the Mafia illegally produces over 220 billion euros a year, equivalent to 11% of Italian GDP. The fight against the Mafia needs new and modern strategies. In this sense, cutting-edge technology can help in its pursuit. Prosecutor Giovanni Falcone [Falcone and Padovani, 1991] himself confirmed that language was fundamental to understanding Cosa Nostra and entering into its workings, previously unknown to justice and investigators. If NLP and ML offer tools for studying and analysing NL, why not apply this knowledge to the law enforcement sector? If it is feasible to detect and analyse hate speech in enormous amounts of data with these novel tools, why not use them to detect other languages, such as Italian criminal language and specifically the Mafia language? This study seeks to obtain insight that will help to address the above-mentioned research gaps: in the first place, a lack of insight from linguists on criminal language, especially Mafia language; in the second place, the lack of dedicated legal corpora for analysing this language, extracted in the courtroom or with electronic eavesdropping methods. It also identifies a paucity of studies taking a practical approach, for investigative purposes, to ML tools to be used against organised crime, which should be addressed. Last but not least, there is the predominance of studies focused on English and very little interest in minor languages like Italian. This study aims to refute the null hypothesis that the variable Mafia language is the same as

1 https://eurispes.eu/mediacontent/siciliainformazioni-it-affari-per-220-miliardi-allanno-le-mafie-sono-una-holding-finanziaria/

the variable no-Mafia language. With the proposed methodology and the experiment described, it should be possible to demonstrate the validity of the alternative hypothesis: that Mafia language identification is feasible. Should it be possible, it would open new opportunities for ML applications allowing Italian law enforcement agencies to improve electronic surveillance methods, discriminating among millions of words only those that could be of genuine investigative interest. At the same time, it would improve the gathering and classification of evidence in penal cases where there is a suspicion of activities falling under the Organizzazione di stampo mafioso aggravating circumstance, as per Art. 614 bis and Art. 17, and it could eventually be exported to other languages and other kinds of mafias worldwide. Having described the main objective, all that remains is to take stock of what has been learned during the present study. Firstly, corpus linguistics and its procedures, partly applied to the dataset used for this experiment. During this first approach to the proper experiment, it has been possible to see first-hand how this work is worthy of the most delicate Swiss watchmaker. In fact, in order to compile even a tiny dataset like the one used in this study, several processes of absolute precision were required to clean, file, classify and refine a set of words with a previously inexistent structure, which is necessary for processing by an NLP tool, as legal documents are. Secondly, the text analysis modules, such as tokenisation, lemmatisation, stemming, PoS tagging and parsers. Thirdly, the programming languages needed for the described evaluations, such as Python and R, with their dedicated packages, their installation, the commands and regular expressions needed not only to carry out the proper analyses but also to load the files with the texts to be analysed, and the far-from-easy export of results into .txt or .csv documents. Fourthly, the use of a programme that is complex from the conceptual point of view more than in terms of use, such as T-Lab. The software provides extensive possibilities for TM, which are not always fully comprehensible to a pure linguist. Finally, the use of another complex programme, Weka, which allows the analysis of large quantities of data with several algorithms, most of which require a great deal of statistical background. Concerning NLP tools in linguistics, it is essential to underline that they are here to stay. For this reason, the academic world must restructure the use of modules such as those used here in a

regulated way, within the academic curriculum, and why not from school onwards. The delimitation of studies at all educational and academic levels as purely humanistic, neglecting what NLP offers for the study of NL, prevents these enormously useful tools from finding their natural habitat in linguistics classes. Thanks to the large and supportive community of UA professors, computer scientists, engineers, experts, NLP and module amateurs, and their online forums and contributions available on platforms such as Github2 and Stackexchange3, blogs and video tutorials like Professor Witten's MOOC4 on Weka, the difficulties that arose during the analysis and the experiment have been solved.

2 https://github.com/
3 https://stackexchange.com/
4 https://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

Chapter 5

Methodology

Sicilian Mafia language detection using NLP tools is a topic not yet studied from an academic perspective. The lack of a precedent for this work has been the biggest problem in establishing a research methodology. For this reason, an empirical trial-and-error approach was embraced for this study, partly following the corpus linguistics methodology for text collection on one side and the standard NLP procedures on the other. Modularity has been used to reduce the text into tokens and then into lemmas, to be analysed with a PoS Tagger and Parsing. The frequency of words has been determined, as well as the keywords. Once the linguistic elements contained in the Mafia dataset had been identified, a semantic analysis was carried out from the quantitative point of view to extract specific features of the Mafia language. However, from the beginning, the real burden of the experiment was focused on training an ML tool to detect the Mafia variable as opposed to the no-Mafia variable using different algorithms. The theoretical underpinning of this research is how NLP approaches linguistic problems with supervised learning, starting with text classification, up to Mafia language identification. In the present study, it was impossible to manually compile a large corpus for Mafia and no-Mafia language: first, for lack of material, as the information on the Internet was not as abundant and easy to manipulate as initially planned; secondly, for a question of time, as it is indeed punctilious, tedious and painstaking work. Furthermore, this being a first and pioneering approach to studying criminal language (and specifically Mafia language), the raw, non-annotated corpus linguistics approach was chosen.

Thus the decision to limit the amount of processed text to a dataset consisting of Mafia and no-Mafia texts selected with the same criteria of time, quality and quantity. The timeframe was set to documentation registered in court proceedings between 2003 and 2018. In terms of quality, the texts had to consist of defendants' conversations or declarations, one set for the Mafia and one for the no-Mafia variable, and the conversations had to be of diverse origin: the dialogues had to be different and varied. From the quantitative point of view, the two datasets had to be sufficient in size and balanced. They had to be divided in a proportion of 80% for the training dataset and 20% for the test dataset, and it was paramount that the dialogue parts contained in the training set were unique and not repeated in the test set. The procedure is outlined in the flowchart in Figure 5.1.

Figure 5.1: Datasets creation and content process flowchart
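As a rough illustration of the split step shown in the flowchart, the following Python sketch divides a cleaned, one-sentence-per-line file into an 80% training and a 20% test portion. The file names and the sentence-per-line layout are assumptions for illustration, not the exact scripts used in this study; writing disjoint slices guarantees that no dialogue part is repeated across the two subsets.

```python
# Minimal sketch of the 80/20 split described above (assumed file names,
# one cleaned sentence per line). Shuffling before the cut keeps the subsets
# varied; the two slices are disjoint, so nothing from the training portion
# can reappear in the test portion.
import random

def split_dataset(path, train_path, test_path, train_ratio=0.8, seed=42):
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    random.Random(seed).shuffle(sentences)
    cut = int(len(sentences) * train_ratio)
    with open(train_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences[:cut]))
    with open(test_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences[cut:]))

split_dataset("mafia.txt", "mafia_train.txt", "mafia_test.txt")
split_dataset("no_mafia.txt", "no_mafia_train.txt", "no_mafia_test.txt")
```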

5.1 Dataset creation

The first step of any linguistic analysis is the collection, selection and classification of significant and exemplary texts of the language. From the outset, it was clear that this collection had to be manual, because no corpus or dataset already available on the Internet met the necessary characteristics for the present analysis.

5.1.1 Mafia dataset

The first of these characteristics was its judicial nature, from which the words of the defendants charged with Associazione di stampo mafioso ("Mafia-style organisation") can be extracted, meaning that Article 416 bis of the Italian Penal Code has to have been applied by the judge, as a conditio sine qua non.

Also, careful Internet research on the Mafiosi's names had to be carried out to check that all the conversations introduced into the dataset came from defendants, and not from witnesses or experts; contributions from non-defendants were eliminated from the dataset first. The texts had to be classifiable as verdicts, custodial or plain judgements, interrogations by lawyers or prosecutors, interceptions of telephone conversations admitted as evidence in a trial, requests for precautionary measures, or memorials from the Public Prosecutor's Office. The material for the Mafia dataset used for this analysis was digitally downloaded from the open-access web pages of the Italian Supreme Court1 and from various archives of prominent Italian newspapers and non-profit journalists' associations2. As the fight against the Mafia is an essential issue for Italian civil society, many websites are dedicated to filing, documenting and organising the main Mafia facts. These pages contain valuable sources such as the ones itemised in Figure 5.2.

Figure 5.2: Judgement retrieved from Archivio Antimafia in Acrobat .pdf file format

The material of the Mafia dataset consisted of twenty-six downloadable .pdf files and two .doc files, for a total of 13.818 pages and 413.316 words from defendants' witness statements, depositions or surveillance transcriptions. Twelve of the files were correspondence between Bernardo

1http://www.italgiure.giustizia.it/sncass/
2https://www.archivioantimafia.org/testi.php

Provenzano and his then right-hand man, still at large and wanted worldwide, Matteo Messina Denaro3, used as evidence in court.

5.1.2 No-Mafia dataset

The material for the second dataset was much more difficult to retrieve because of Italian privacy legislation4. For this dataset, the texts had to fulfil the same requirements as the Mafia dataset. It was important that the defendants had been charged with, and eventually convicted of, organised crime offences. The relevant criterion for distinguishing the two datasets was that the Art. 416 bis aggravating circumstance had not been applied by the judges either in the precedents or proven facts, the grounds of law, the verdict or the judgement. Another criterion was that the information retrieved had to be balanced, in quantity, against the 413.316 words of the Mafia dataset, once cleaned and pre-processed as described in Section 5.2. The greatest limitation on gathering the no-Mafia language variable dataset was the above-mentioned Italian legal framework: sensitive information can be managed exclusively by the defendant's legal representatives. In a first attempt to gather no-Mafia documents and proceedings, Professor Portrandolfo was consulted as an expert in legal corpora, without success, as were the Cattolica di Sacro Cuore University and the Transcrime department (Criminology), developers of Roxanne. In a second attempt, several Italian lawyers were consulted but refused to provide the necessary documentation. Fortunately, one of the lawyers consulted5 had at his disposal material helpful for this study in .txt format: documents without names or references, but used and admitted in legal proceedings. As the file contained no personal data, only words that cannot be related to any identifiable person, it was used for the no-Mafia dataset without infringing any regulations. Most importantly, the defendants' data were protected at all times, remaining unknown to the researcher. The total amount consisted of six files, 860 pages and more than 38.674 words from electronic eavesdropping conversations, five of the files coming from legal proceedings and one included as evidence in a pending trial for drug trafficking.

3Considered the new boss of bosses, after the death of Provenzano and Riina in prison.
4https://www.garanteprivacy.it/il-testo-del-regolamento
5https://www.studiolegalefalleti.it/

5.2 Dataset cleaning

While the no-Mafia dataset was still being gathered, the available files for the Mafia dataset were treated first, in order to settle the procedure described above, which would then be applied in the same way to the second set, avoiding the mistakes and delays of the first attempts.

5.2.1 Mafia dataset

Once it had been verified that each of the twenty-three files contained speech from defendants accused of Associazione di stampo mafioso, the pages with relevant information were separated from the rest of the judicial document macro-structure: the title page, the heading, the background of facts or proven facts, the legal grounds or rulings, as well as the already mentioned interventions of lawyers, judges, witnesses or experts. Some .pdf files, such as the above-mentioned letters, were in a format that could not be converted into .txt, .doc or .rtf with Optical Character Recognition (OCR), because the pages had originally been scanned as images; these were therefore dismissed from the dataset, as per the sample in Figure 5.3.

Figure 5.3: Acrobat .pdf with no OCR layer, dismissed from the Mafia dataset

As an aside regarding OCR, several problems were detected when converting even good-quality .pdf files into .txt. Many glyphs (typed letters or characters) were wrongly interpreted by the converter. In a first attempt, a specific tool available online was used to reconstruct the damaged glyphs,

called "Post-Processing Workbench"6. Once a first version with amended glyphs had been obtained, such as "]" for "ll" or "@" for "à", the file had to be checked manually. This first file-cleaning stage involved long and tedious reconstruction work, with the help of the auto-corrector and constant double-checking against the originals, as per the example in Figure 5.4. It is also worth mentioning that most tools are designed for English, so they fail to provide correct outputs for languages like Italian, which is rich in acute and grave accents and abounds in apostrophes.

Figure 5.4: Reconstruction word by word of supposed “good quality” .pdf file for Mafia dataset

To speed up the cleaning and export of the .pdf files into .txt, RStudio7 and the pdftools and tidyverse packages were used, with partial success, as per Figure 5.5. Once the remaining nine editable files had been reconstructed, the Mafia dataset was further cleaned by normalising the names of the persons mentioned and then removing the non-pertinent parts, such as the parenthetical sentences describing the actions taking place, or the incomprensibile (i.e. unintelligible), ride (i.e. laughs), grida (i.e. screams) or omissis (i.e. intentionally omitted) annotations included by the police officers in the transcriptions, as in Figure 5.6. Beyond the pure speech, the remaining text had to be further cleaned of numbers, punctuation and symbols, following Razavi et al.'s recommendations [Razavi et al., 2010, p. 6]. Commas, full stops, question marks and exclamation marks were preserved, as they might be useful to determine the speaker's emphasis in the sentence.

6Successfully used in the Gutenberg Project and available at https://www.pgdp.net/ppwb/
7https://www.rstudio.com/products/rstudio/

Figure 5.5: Cleaning .pdf for dataset with RStudio [RStudio Team, 2020]

Figure 5.6: Using Notepad++ and regular expressions for cleaning process

With Notepad++8 and the use of regular expressions, useless characters and symbols, empty lines, spaces and tabulations were removed, as well as numbers and capital letters.

8https://notepad-plus-plus.org/
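For illustration, the Python sketch below performs a clean-up equivalent to the steps described above (a hedged approximation, not the Notepad++ regular expressions actually used): it drops the parenthetical transcription annotations, removes digits and stray symbols, collapses whitespace, lower-cases the text, and keeps the sentence punctuation that marks the speaker's emphasis. The file names are assumptions carried over from the earlier sketch.

```python
# Rough Python equivalent of the regex clean-up done in Notepad++ (a sketch,
# not the exact expressions used): drop annotations such as (incomprensibile),
# (ride), (grida), (omissis), remove digits and stray symbols, collapse
# whitespace, lower-case, and keep , . ? ! for the speaker's emphasis.
import re

ANNOTATIONS = re.compile(r"\((?:incomprensibile|ride|grida|omissis)[^)]*\)", re.IGNORECASE)

def clean_line(line: str) -> str:
    line = ANNOTATIONS.sub(" ", line)
    line = re.sub(r"\d+", " ", line)                 # numbers
    line = re.sub(r"[^\w\s,.?!']", " ", line)        # symbols; accents and apostrophes kept
    line = re.sub(r"\s+", " ", line)                 # tabs and multiple spaces
    return line.strip().lower()

with open("mafia_raw.txt", encoding="utf-8") as src, \
     open("mafia_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = clean_line(line)
        if cleaned:                                  # drop empty lines
            dst.write(cleaned + "\n")
```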

When all the pre-processing and cleaning had been done, and the text had been double-checked for missing parts, errors or entries overlooked during the clean-up, the total number of words remaining was 413.316.

5.2.2 No-Mafia dataset

Once the Mafia language dataset had been obtained, and following the requirements of the supervised ML method, the no-Mafia variable dataset was treated with the same cleaning protocol described above. Once cleaned and pre-processed, the total number of words left was 18.667. Taking into consideration that any ML tool for TM needs two different datasets, one for training the algorithms and one for testing them, the extreme quantitative imbalance forced a size reduction of the Mafia language variable. This reduction allowed working with balanced datasets, so that biases towards the majority class could be avoided. The final datasets are detailed in Table 5.1.

Table 5.1: Final dataset words distribution

Dataset     Words
Mafia       18.840
No-Mafia    18.667

The difference of 173 words resulted from the choice not to break final sentences before the full stop, in order to respect the meaning of the paragraph. The dataset was transformed into two .txt files in UTF-8 encoding to prevent errors with characters such as accents and apostrophes. Another file was created with an Italian stop-word list of 674 items, including articles, adverbs and auxiliary verbs. Stop-word lists are necessary during pre-processing to keep only the meaningful words, by removing the above-mentioned items with no relevant semantic weight.

5.3 Dataset processing

Once cleaned, both datasets had to be processed to extract the relevant information for the present study. The NLP standard procedure was followed using RStudio, NLTK and SpaCy, as

mentioned in the Discipline of Study paragraph.

5.3.1 Mafia dataset

Tokenisation

The tokenisation of the Mafia language dataset was carried out with SpaCy in a Google Colab9 environment, which allows Python to be used without installing any software. SpaCy, as per its web page, is "a free, open-source library for advanced Natural Language Processing (NLP) in Python (...) designed specifically for production use (...) [to] 'understand' large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning"10. This first approach to the dataset gave the expected result, splitting the text into individual words and punctuation marks.
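A minimal sketch of this tokenisation step is given below. The Italian model it_core_news_sm is an assumption, since the exact model used in Colab is not named in the text.

```python
# Minimal spaCy tokenisation sketch in the spirit of the step above; the
# Italian model it_core_news_sm is an assumption.
# In Colab: !pip install spacy && !python -m spacy download it_core_news_sm
import spacy

nlp = spacy.load("it_core_news_sm")
doc = nlp("Carissimo mio, spero")      # fragment quoted later in Figure 5.9
print([token.text for token in doc])   # each word and punctuation mark is a token
```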

Lemmatisation and Stemming

For these two modules of morpho-lexical analysis, SpaCy was replaced by NLTK, another Python-based tool. The results were not as expected: the module did not succeed in obtaining a correct root or lemma for any word, as per Figure 5.7.

Figure 5.7: Mafia language dataset Lemmatisation and Stemming with NLTK
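As a hedged illustration of this step, the sketch below shows how the NLTK stemming and lemmatisation calls are typically invoked; it is not the exact notebook used here. One plausible explanation for the unusable lemmas is that NLTK's WordNetLemmatizer is English-oriented, whereas the Snowball family does ship an Italian stemmer.

```python
# Sketch of typical NLTK calls for this step (not the exact notebook used).
# SnowballStemmer ships an Italian stemmer; WordNetLemmatizer is English-
# oriented, which is one plausible reason why no correct Italian lemma was
# returned in Figure 5.7.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = SnowballStemmer("italian")
lemmatizer = WordNetLemmatizer()

for word in ["discorso", "mandamenti", "parlare"]:   # words from the dataset
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
```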

PoS Tagger

On the other hand, the results obtained in the morphological analysis with the PoS Tagger using SpaCy improved slightly. The module identified most grammatical categories, as shown in

9https://colab.research.google.com/notebooks/intro.ipynb
10https://spacy.io/usage/spacy-101

Figure 5.8. A few exceptions occurred, such as the plural masculine noun picciutteddi. The truth is that picciutteddi, the diminutive of picciotti, belongs to the Sicilian dialect (in Italian it would be picciottini), so the module's dictionary does not include it. Another possible explanation for its not being recognised as a noun, and for its wrong classification as a VERB, could be its position between full stops, resembling, for example, an Italian imperative ending in -i. Misclassifications of this kind appeared frequently during the analysis, especially with dialectal speech.

Figure 5.8: PoS Tagger with SpaCy in Colab

Parser

For the syntactic analysis with a parsing tool, the Colab environment with SpaCy and UDPipe11 was used, with similarly interesting results; as far as the experiment is concerned, this pre-processing was only meant to retrieve background information about the dataset. UDPipe offers syntactic analysis through a Python prototype: it can perform tagging, lemmatisation and syntactic analysis in more than 50 languages, Italian among them.

Small samples of the Mafia dataset were parsed, with some wrong outputs (marked in red) for Italian words, as per the parse tree in Figure 5.9. In "Carissimo mio, spero", spero is the main-clause verb with the omitted subject io, a singular masculine pronoun; the module wrongly identified it as a noun. The same happened with the verb dico in the otherwise correctly parsed sentence (marked in green), which was labelled as a nominal modifier.

11https://ufal.mff.cuni.cz/udpipe/2

Figure 5.9: Example of sentence parsing with SpaCy and UDPipe in Colab
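A minimal sketch of this PoS tagging and parsing step is given below; it assumes the spacy-udpipe wrapper, one common way of combining SpaCy and UDPipe in Colab (the exact package used is not specified in the text), and it reuses the fragment quoted above.

```python
# Minimal PoS tagging and dependency parsing sketch with SpaCy and UDPipe,
# assuming the spacy-udpipe wrapper.
# In Colab: !pip install spacy-udpipe
import spacy_udpipe

spacy_udpipe.download("it")            # Italian Universal Dependencies model
nlp = spacy_udpipe.load("it")

doc = nlp("Carissimo mio, spero")      # fragment quoted in Figure 5.9
for token in doc:
    # surface form, universal PoS tag, dependency label, syntactic head
    print(f"{token.text:12} {token.pos_:6} {token.dep_:10} {token.head.text}")
```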

5.3.2 No-Mafia dataset

Once the first results of the rough analysis of the Mafia dataset had been obtained, the No-Mafia dataset was submitted to the same pre-processing procedure.

Token, Lemma and Stemming NLP modules

Again, tokenisation, lemmatisation and stemming were applied. In a first attempt with the NLTK modules, the PorterStemmer and the SnowballStemmer for Italian gave poor outputs: the lemmas obtained for the verbs were wrong. After this first unsuccessful attempt, SpaCy was used, and it again produced successful outputs, as it had for the Mafia dataset.

PoS Tagger

Once the first results had been obtained, the PoS Tagger with SpaCy was applied, with results as good as those obtained for the Mafia dataset, as per Figure 5.10.

Figure 5.10: No-Mafia dataset parser output with SpaCy UDPipe in Colab

Parser

As with the Mafia dataset, the SpaCy UDPipe parser was used in a Colab environment on small parts of the No-Mafia variable. The same positive results were obtained, confirming the validity of this tool for subordinate clauses, as per Figure 5.11. In the sample, the subordinate temporal clause introduced by Quando was identified, and the xcomp clause, with subject ellipsis, was also correctly detected, confirming the parser's good results for Italian12.

Figure 5.11: No-Mafia dataset PoS Tagger and dependency parsing analysis with SpaCy in Colab

5.3.3 Partial conclusions on datasets content processing

The analysis of small portions of the Mafia and No-Mafia datasets with free NLP modules provided reasonable reliability for word and sentence structure analysis. In particular, SpaCy UDPipe is better adapted to Italian than the NLTK PorterStemmer and SnowballStemmer packages. However, it was almost impossible to check the efficiency of these tools on a dataset of thousands of words, sentence by sentence, so as to detect every incorrect output. For this particular study it was necessary to choose, among the many tools available on the Web, the ones that support Italian because, as mentioned in the previous chapter, many powerful applications are designed and developed for English analysis only. In comparison, the no-Mafia dataset produced better outputs, one of the main reasons being that its language is almost entirely standard Italian and contains less dialect. Despite these apparent limitations, the application of standard NLP modules to corpus linguistics is a convenient approach. In fact, ML algorithms automatically apply the same

12https://universaldependencies.org/u/dep/all.html/al-u-dep/xcomp

modules used in this analysis in order to learn in an orderly way. However, for the specific purpose of this study, these modules were not strictly necessary either for the pre-processing or for the experiment; they served as a tool for a first approach to a large amount of words. In the next part, a detailed study of word frequency, keywords, topics and co-words is carried out to identify the words with the highest semantic weight in the two datasets. A comparison of their usage is then offered to determine whether there are differences that make discrimination possible from a semantic point of view. Finally, the experiment itself is described, in order to test the validity of the alternative hypothesis.

Chapter 6

Research

The chapter on methodology offered a first approach to the NLP modules. Still, their application did not shed any light on the existence or non-existence of a Mafia language. For example, the dialect in the text might itself be a confounding variable, which would explain the errors found in the Mafia dataset and their near absence in the no-Mafia dataset. It was therefore necessary to investigate the semantic aspect of the words in context; otherwise, any dialectal word might be wrongly recognised as Mafia language by the NLP tools. To do so, two phases of the experiment were proposed: the first consisting of content analysis, and the second dedicated to training an ML tool to evaluate whether the Mafia language can be discriminated between the two datasets. As mentioned in Section 2.1, Text Mining is the field that offers data interpretation and evaluation, verifying that the conclusions obtained are valid and satisfactory while gathering information that is not self-evident.

6.1 Content Analysis

Again, for this part of the content analysis each tool was used on small samples of the dataset, each different from the others and chosen randomly. This choice depended on several factors. Firstly, there is no precise rule defining the amount of text needed for a good analysis. Secondly, the material available was of uneven quality. Indeed, from a first reading made to select and catalogue the material, it was clear that the language a defendant uses before a court is not the same as the language used between peers. In front of a court, depositions are negotiated with the lawyer, so the language is not spontaneous: the defendant is severely hampered by the fact that he may be convicted and sentenced to several years in prison, and he therefore controls his speech to make a good impression or to say only the bare minimum.

Furthermore, it should not be forgotten that while the tools used are powerful, the computing resources (memory, processor, Wi-Fi) available to run them are not: with a large volume of text, the analysis would slow down and sometimes even crash. It is also worth mentioning that although the whole dataset had been cleaned as much as possible with NLP tools (and even manually, as mentioned in the Methodology), there were always noisy outputs with wrong symbols or characters, such as stray accents or "@", that had to be added manually to the stop-word list before each test. For an easy-to-follow structure, the step-by-step procedure is shown in the flowchart in Figure 6.1.

Figure 6.1: Step-by-step content analysis procedure flowchart

6.1.1 AntConc

In this part of the analysis, the first tool used was AntConc [Anthony, 2019], a tool designed specifically for corpus analysis. Thanks to AntConc, it was possible to compute word frequencies for a 19.040-token sample from the Mafia dataset and a 7.503-token sample from the no-Mafia dataset. As mentioned at the beginning of this section, the samples were randomly selected. With regard to size, there is no predefined rule for a good analysis, since "the size of the reference corpus is not very important in making a keyword list" [Tabbert, 2015, p. 61]. The semantically weighted words were identified with the help of the background bibliography already mentioned: Falcone and Ceruso for the Mafia dataset and Ciconte [Ciconte, 2017] for the no-Mafia dataset. The results are shown in Table 6.1. For the Mafia dataset, discorso -i was the first entry, in a social environment where l'omertà, the code of silence, prevails. Until Prosecutor Falcone obtained the linguistic code for understanding the Mafia from Buscetta, in-depth knowledge of the Mafia structures was unthinkable, precisely because of this silence [Falcone and Padovani, 1991, p. 52].

Table 6.1: Keywords frequency with AntConc

Words             Mafia   No-Mafia
discorso -i         433         36
zio                 108         23
mandamento -i       104          2
accordo -i           94          7
cristiano -i         78         14
picciotto -i         58          7
capomandamento       25          0
compare               0         60

According to Buscetta, this code of silence was extended within the organisation's structure to hide its existence and activities from outsiders, and within the organisation itself for self-protection, creating obstacles to the flow of information between the different levels. Cosa Nostra was then so "discreet" in its activities that many politicians and even members of the Security Service insisted that it did not exist [Falcone and Padovani, 1991, p. 50]. Silence is also a question of self-protection among men of honour: it is preferable not to speak at all than to tell a lie, because only the truth may be spoken between mafiosi, and a lie can be a death sentence. In short, ellipsis of information is preferable to verbosity, to avoid saying things one might regret. From this obligation not to lie between members of Cosa Nostra the proverb is born: "the better word is what you don't say"1. Another important consideration is that, according to Falcone, "A member of Cosa Nostra does not consider himself as a vulgar criminal" (1991, p. 59); hence he does not speak like an ordinary criminal either, so as not to be confused with one. With these considerations in mind, the keywords discourse (discorso) and agreement (accordo), together with mandamento and capomandamento, indicated Cosa Nostra's need to reach agreements among peers, accordo being the fourth most frequent word. Conversations of this kind are of exceptional value in an investigative context for understanding the inner workings of Cosa Nostra and its changes over time. In this sense, the use of nouns belonging to the "diplomatic" semantic field showed that, despite the Mafia's structure of independent families grouped locally into mandamenti, diplomatic management is required to maintain good relationships and ensure their prosperity. The detection of the keywords cristiano and picciotto was also interesting. It is important to remark that between uomini d'onore the most common appellation is not signore, which is

1“’A megghiu Parola è chidda ca ’un si dici. ”

an insult for a mafioso, but cristiano: that is, someone of a certain level inside the organisation's structure. On the contrary, a picciotto is low-level workforce or an executive arm: someone who collects protection money (il pizzo) or a killer. The Family's boss is called zio, Italian for uncle2. As an aside, uncle, son and father are common terminology in the semantic field of Family, as Cosa Nostra is conceived as a cluster of families; in the Mafia structure, blood ties are at the base of the hierarchy and part of its success, as it is difficult to betray a member of one's own Family. For the no-Mafia dataset, the same words identified in the Mafia dataset were searched for. The keywords zio, discorso, accordo, cristiano, picciotto and mandamento appeared in both datasets; capomandamento and compare each appeared in only one of them. Each keyword was identified in each text and its concordances were reviewed, with the following conclusions. The word zio3 was mentioned, but it was addressed to a member of another criminal group, called Zio Mino, who did not belong to Cosa Nostra; in any case, uncle is of course a widespread term of address and, by itself and outside a mandamento context, it should not be considered a discriminative term for a Mafia chief. Another word in common, although with a different frequency within the text, perhaps due to its length, was cristiano. According to Falcone, Camilleri and Ceruso, the term refers to a man of honour close to the top of the Cosa Nostra hierarchy. Now, according to Enzo Ciconte [Ciconte, 2017], this borrowing from Mafia language has permeated many criminal groups and mafias, such as the 'Ndrangheta from Calabria or the Camorra from Naples. Reviewing the text in question from the no-Mafia dataset, one of the wiretaps included the conversation of a petty criminal with 'Ndrangheta and Mafia links, which justifies the use of words such as zio, cristiano, mandamento or picciotto in the text. Yet the relevance for the Mafia dataset was marked by the specific and exclusive use of capomandamento and mandamento, with a pending clarification about the position of discorso and compare. As per the Italian dictionary4, compare is a specific term for the male godparent, extremely common in Italy. The bibliography about Italian mafias and criminal associations was relevant in determining that compare is often used in criminal groups to refer to their members, but not in Cosa Nostra.

2Cosa Nostra does not use the word Padrino in their vocabulary.
3https://it.wikipedia.org/wiki/27NdrinaMancuso
4https://www.treccani.it/vocabolario/

The keyword discorso deserves a separate explanation. According to the Italian dictionary already mentioned, it means a communication, speech or piece of writing about a particular, usually serious, subject. In the No-Mafia dataset, discorso was used precisely in this sense. In the Mafia language, however, a discorso is a pact, an agreement, a law that must be respected and that only le commissioni, at the top of the organisation, can negotiate when necessary. Once it had been established that word frequency allowed a qualitative assessment of the importance of one electronic eavesdropping transcription compared to others, other NLP tools were used to validate their effectiveness.
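For readers who prefer a scripted view, the following Python sketch approximates the frequency list that AntConc produces and the hapax legomena discussed in the next subsection; the file names and the stop-word file are the hypothetical ones used in the earlier sketches, not the tool actually employed.

```python
# Rough Python approximation of an AntConc-style word list and of the hapax
# legomena discussed below; stopwords_it.txt stands in for the 674-item
# Italian stop-word list described in Section 5.2.2.
from collections import Counter

with open("stopwords_it.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def keyword_counts(path):
    with open(path, encoding="utf-8") as f:
        tokens = f.read().split()
    return Counter(t for t in tokens if t.isalpha() and t not in stopwords)

mafia = keyword_counts("mafia_clean.txt")
print(mafia.most_common(10))                       # e.g. discorso, zio, mandamento ...
hapax = [w for w, n in mafia.items() if n == 1]    # words occurring exactly once
print(len(hapax), "hapax legomena")
```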

Hapax Legomena

Hapax Legomena are words that occur only once in a text. Several unique words were detected in the Mafia language dataset, because the persons involved were many, each with their own idiolect. In spite of this, a first interesting result was offered by the word molluschi, the Italian plural of mollusc. The relevant characteristic of molluscs is their lack of a spine; because of this condition as a spineless animal, the word is used mockingly to describe someone weak-willed. The only record of this use of the word mollusc in a Mafia context, outside this study and the dataset, was an old interview5 with Luciano Liggio, ex-boss of Corleone, sentenced for Mafia association during the Maxi Trial. Thanks to AntConc, it was possible to identify the use of this same word in letters written by the still wanted new boss of bosses, Matteo Messina Denaro, contained in the Mafia dataset: in a total of 2.030 tokens, the analysis of the letters gave molluschi as a unique word. The word was then carefully searched for in the rest of the dataset, without success. It remains a unique word connecting an old boss to the new one, still a fugitive from justice: a peculiar connection indeed. Turning back to the No-Mafia dataset, its analysis offered insight into the specific semantic field of prostitution. All the background reading about the Mafia included in the bibliography agrees that prostitution is not considered a decent "business" for the honourable society, as Cosa Nostra considers itself.

5Retrieved from an interview by the Italian journalist Enzo Biagi, 1983 available at: https://www.youtube.com/watch?v=BZIqOeDnrGQ

It might be tolerated as a side business by some mobster connected to the Mafia, but not by a real Mafioso6. No word linked to the semantic field of prostitution, directly or indirectly, appeared in the Mafia dataset. The No-Mafia dataset, on the other hand, contained the following terms from the sexual field and the exploitation of women, even if each appeared only once: pompini, puttana, preservativi, topless, troie. The references to the Mafia inside the no-Mafia dataset were accounted for by the Hapax Legomena mafiosi, Totò Riina and Sicilia; once checked in context, they were, again, mentioned casually and not as language in use. In order to further investigate the datasets' contents with NLP tools, RStudio was chosen from among the many freely available on the Internet.

6.1.2 RStudio

RStudio is an open-source integrated environment that eases the application of the R formal language, allowing the use of different packages developed by R users, developers and individuals who freely improve the tool for text analysis. For the present investigation, the packages used were pdftools, tidyverse, quanteda and dplyr7. In particular, RStudio allowed the sample text to be split into sentences and offered a second insight into word frequency, complementing the results obtained with AntConc. RStudio is supposed to make text analysis easier for those who are not IT experts; the truth is that some training is still needed by a linguist who ventures to use this powerful tool without previous knowledge of formal languages. Returning to the analysis of the Mafia dataset, RStudio helped determine some frequencies in the small sample used, as per Figure 6.3. As can be seen from the results, the most frequent words were the two auxiliary verbs to be (essere) and to have (avere), not filtered out by the Italian stop-word list. The names and surnames corresponded to one of Cosa Nostra's foremost exponents in Palermo, the former mayor of the city, Ciancimino, convicted of "mafia-style association" at the beginning of the century.

6Cosa Nostra "values" measure and decency as much as honour, respect for one's word and the truth, silence and loyalty to the Family. In fact, there is a deep contempt for the ostentation of money and for libertinism. Among other things, a man of honour cannot make a living from the sexual exploitation of women, but must provide for his parents, wife and children. [Falcone and Padovani, 1991, pp. 88-90]
7Libraries used included stringr, ggplot2, scales, tidyr, widyr, ggraph, igraph, quanteda, topicmodels and cvTools.

Figure 6.2: Mafia dataset splitting into sentences with RStudio

Donno is the surname of a member of the Carabinieri. The rest of the words belong to specialised Italian legal language, such as magistrati, Stato, dichiarazioni.

A final look at a word cloud offered further insight into the quality of the data processing with RStudio, following the adage that "one image is worth a thousand words", as per Figure 6.4.

This example was one of many that led to the removal of defendants' courtroom statements from the final Mafia dataset intended for Weka training and testing.

First, as mentioned, the names of Ciancimino and his son Massimo, of Carabiniere Donno, of the magistrates and of the Italian Government were the most frequent words, but the rest, in green, did not offer further relevant information for this study of the Mafia language.

Similar experiments were carried out on the no-Mafia dataset, with no findings beyond those obtained with AntConc. In this case, however, no section was eliminated from the already small sample.

After discarding dataset parts consisting of defendants’ courtroom statements for the final

Figure 6.3: Mafia dataset words frequency experiment with RStudio

experiment, a deeper analysis was carried out on the texts derived from the electronic eavesdropping of Mafia members. It is important to remember that all the transcriptions used are included in judicial documents, officially admitted as circumstantial evidence in Mafia or criminal association cases.

6.1.3 T-Lab

The sentiment analysis was carried out with T-Lab8, a multilingual software package made in Italy. This complex but easy-to-use tool applies different algorithms and statistical analyses for an audience not specialised in computer science or in formal languages such as R or Python and their packages. Both data input and output are extremely visual, providing an immediate understanding of complex statistical data when compared with purely numerical results. The keywords detected during the first part of the analysis were confirmed: mandamento and capomandamento as words exclusive to the Mafia dataset, compare as a word exclusive to the no-Mafia dataset, and discorso as a word common to the two datasets. With T-Lab, these words were further analysed with some of the available tests. The first test, on topic analysis, evidenced the specific words found in proximity to discorso, such as autorizzare, already detected by AntConc: meaning that agreements and discourses must be authorised by the heads of the families that sit on La commissione. For example, the name of

8https://www.tlab.it/

Figure 6.4: RStudio words cloud experiment with sample of Mafia language dataset

Benedetto (the head of the Villagrazia-Santa Maria del Gesù mandamento) was the word with the highest probability of being found close to discorso, as per Figure 6.5.

Indeed, in the Mafia dataset discorso was not a form of communication. It was loaded with the meaning of a rule, a law that keeps the organisation's co-existence; a norm that, for Cosa Nostra, had to be negotiated, agreed, discussed and mediated. It is not univocal, but is issued by a limited number of people endowed with decision-making and organisational power, who can modify the discourse according to the situation and the moment. Hence its inclusion in the diplomatic theme of mediation and clarification, together with the Commission.

In the same topic analysis applied to the no-Mafia dataset, the keyword discorso was surrounded by an ample number of generic words such as persona, uomo, accettare, interessare, as per Figure 6.6.

There was a larger number of topic words surrounding the keyword discorso in the no-Mafia dataset, which evidenced the everyday use of discourse as communication. Also, in this common-use context (Figure 6.6), it grouped with a higher number of shared (in blue) and specific (in red) words, giving further evidence that in the no-Mafia dataset discorso was not a keyword.

Thanks to the relatively simple graphical results obtained with T-Lab, it was possible to deepen the semantic analysis of a large text to extract relevant information. The second type of

Figure 6.5: Mafia dataset topic analysis result for discorso

Figure 6.6: No-Mafia dataset topic analysis result for discorso

analysis with T-Lab consisted of a tree map with the co-words for the “diplomacy” theme centred on discorso. Co-occurrences are quantities resulting from counting the number of times two (or more) lexical units are simultaneously present within the same elementary contexts. Each shade of colour represents a cluster of words based on the same topic, semantic field or common theme.
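T-Lab computes these co-occurrence counts internally; purely as an illustration of the quantity being counted, the hedged Python sketch below counts how often two lexical units appear within the same elementary context, approximated here as one cleaned line of the dataset (file name and keyword set carried over from the earlier sketches).

```python
# Illustrative sketch of the co-occurrence count described above (T-Lab does
# this internally): two lexical units co-occur when they appear in the same
# elementary context, approximated here as one line of the cleaned dataset.
from collections import Counter
from itertools import combinations

def co_occurrences(path, keywords):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            present = sorted({w for w in line.split() if w in keywords})
            for pair in combinations(present, 2):
                counts[pair] += 1
    return counts

pairs = co_occurrences("mafia_clean.txt",
                       {"discorso", "mandamento", "zio", "accordo", "cristiano"})
print(pairs.most_common(5))   # e.g. how often discorso and mandamento share a context
```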

Following the same pattern as the topic analysis, it emerged that the keyword discorso appeared next to mandamento and zio, as per Figure 6.7. However, picciotti was far removed from the "diplomatic theme" clusters, confirming their role as minions, distant from any possible negotiation. When compared with the no-Mafia dataset, as per Figure 6.8, words such as zio, parlare, capire and discorso had different co-occurrences. Predominance was given to words such as understand (capire) or compare, which, as mentioned previously, did not belong to the Cosa

Figure 6.7: Mafia dataset co-word analysis result for the theme discorso

Nostra structure, evidencing that this discorso occurred in a different context from that of the Mafia dataset.

Figure 6.8: No-Mafia dataset co-word analysis result for discorso

Once the different semantic weight of discorso in the two datasets had been determined, a further analysis was carried out on the two words unique to the Mafia dataset in the specific sense of the territorial influence of a Mafia family: mandamento and capomandamento. As per Figure 6.9, the relationships of mandamento and capimandamento with famiglia, cristiano, rappresentare (to represent) and noi (us), and with some of the mandamenti that constitute Cosa Nostra in the Palermo area, were evident: peace (pace) was the aim of the meeting (riunire), at least for the mandamenti of Resuttana, Cruillas Uditore or Altarello. Another example was offered by the analysis of co-words, the words semantically closest to the discourse theme.

Figure 6.9: Cloud of words for mandamento and capimandamento in Mafia dataset

In Figure 6.10, the relationships between the mandamenti, their heads (capo) and especially the verb riunire (to meet) were extremely evident.

Figure 6.10: Co-word analysis result for mandamento in Mafia dataset

Since mandamento appeared only twice in the no-Mafia dataset, it was impossible to carry out this same type of analysis. The word discorso was then tested again, to definitively remove it from the keyword lists of the no-Mafia dataset, as per Figure 6.11, where the keyword was lost among other verbs such as parlare, mandare, chiamare: ordinary words that gravitate around a discourse in the most general sense of communication, as they could for any speaker, criminal or not. A final cloud of words was then attempted with cristiano, confirming that the use of this keyword was utterly different from that of the Mafia dataset, where it represented a Cosa Nostra family

Figure 6.11: No-Mafia dataset cloud of words analysis result for discorso

member. In Figure 6.12 there were few related words; one of them was buono, meaning a good person in the sense of a follower of the Catholic religion.

Figure 6.12: No-Mafia dataset co-word analysis result for cristiano

In this last case, there was a clear relationship between cristiano and compare. When analysed in the text, the two words appear in the conversation quoted on page 51, detected by the police, between a petty criminal and un picciotto, who wanted to compare his cronies, or compari, with cristiani, members of Cosa Nostra.

6.1.4 Partial conclusions on the content analysis

All the tests described above revealed specific features of the Mafia language and keywords extracted with NLP tools. The tests focused on the semantic analysis of different types of keywords highlighted in Chapter 5. The results obtained showed that the Mafia language variable was a line of investigation worth pursuing more precisely and in-depth since there are distinctive elements, such as the specific use of words like mandamento and capomandamento, and the specific use of a common term like discorso.

The possible investigative use of NLP tools could allow the extraction of crucial information in the fight against the Mafia, such as, for example, its structure in mandamenti and their geographical districts, possible meetings of the commissions and their participants or, as in the last example, potential connections between criminal organisations. The fluidity with which Cosa Nostra adapts over time makes interceptions of this kind a valuable element for investigations. If automatically detected and analysed with NLP tools, they would greatly facilitate the first investigative phases of collecting, classifying and interpreting the relevant data. Having carried out the preliminary analysis necessary to find elements that could support the alternative hypothesis, i.e. that Mafia language contains distinctive elements compared to no-Mafia language, it is now time to move into the ML framework and empirically test whether the predictions made in this chapter come true.

6.2 Machine Learning experiment

The aim of this experiment was to test empirically whether the alternative hypothesis is valid: whether the Mafia language variable is different from the no-Mafia one, and to what extent the elements analysed in the previous part of this research can be automatically detected with a simple "bag of words" approach, as mentioned in Section 2.6.1, p. 17. The datasets were tested to add value to the above features with an ML tool; for this investigation, the tool chosen was Weka. Weka is both a New Zealand bird and the acronym of "Waikato Environment for Knowledge Analysis", a potent ML tool for Data and Text Mining written in the Java formal language. Weka is used in DM to extract interpretable patterns from thousands of raw data points, and most of the examples of how to use it concern statistics and prediction with numerical variables. Alternatively, as Professor Witten would say, "We have already encountered an important text mining problem: the classification of documents, where each instance represents a document and the class of the instance is the subject of the document" [Witten et al., 2011, p. 387]. This experiment is text-based and each word is an attribute. Weka works with a specific file format called .arff (Attribute-Relation File Format), similar to .csv files. To train and test Weka, it is necessary to create two subsets: one for training and one for testing the accuracy of the different DC algorithms. In the analysed datasets, all the words identified first with AntConc and then by the content analysis

in Section 6.1.2 were not helpful for the experiment, because they appear in any everyday conversation. During training and testing, thanks to its algorithms Weka makes a selection between useless words, keywords and hapax legomena. Useless words can be eliminated a priori: this is not a major problem, as the utility detects and eliminates them using the Italian stop-word file loaded into the programme. Hapax legomena occur so rarely that they are unlikely to be useful for classification, because "Paradoxically, almost half of the words in a document or corpus of documents appear only once. Another problem is that the bag-of-words (or set-of-words) model does not consider word order and contextual effects. There is a strong case for detecting common phrases and treating them as individual units." [Witten et al., 2011, p. 387] For this experiment, the task is a document classification problem in which, given sentences, the algorithm has to determine whether they belong to the Mafia language. Therefore, there are two known, assigned classes for the training documents: the Mafia dataset has been assigned a 1 and the no-Mafia dataset a 0. These were matched with a mixed Mafia/no-Mafia test dataset labelled with a question mark, so that the utility had to associate each sentence correctly with one variable or the other. The following step-by-step paragraphs describe the document classification experiment, drawn along the lines of Witten's tutorial [Witten, 2005].

Figure 6.13: Weka procedure and settings for experiment flowchart

The raw data consisted of the Traindataset.arff file, which allowed Weka to analyse the sentences and create a dictionary of terms from all the sentences classified with the numerical attributes 1 and 0, Mafia and no-Mafia respectively. The numeric representation of each term was obtained using Weka's StringToWordVector attribute filter. The steps of the Weka experiment are outlined in the flowchart in Figure 6.13.

6.2.1 Datasets adaptation

Before transforming the dataset into .arff files for Weka, a selection had to be made. Firstly, because the non-relevance of interrogations during trials had been established in the processing step in Chapter 4; for this reason, it was necessary to include only those parts of the texts where the electronic eavesdropping was transcribed. Secondly, because of the poor quality of the original texts of the Mafia language variable, many conversations were dismissed. The finally selected data, as per Table 6.2, had to be divided into four different subsets, split in a proportion of four to one, as recommended by Waikato University Professor Ian Witten in the Weka MOOC tutorial, lesson 2.4, on Document Classification9. As already mentioned, the small difference in word counts between sets depended on sentence length: each cut was made at a full stop, so that a coherent sentence was included. Stand-alone interjections at the end of the dataset were removed because they were not considered relevant for this research on the Mafia language variable.

Table 6.2: Final dataset distribution

Dataset     Words    Training   Test    Variable
Mafia       18.840   14.851     3.989   1
No-Mafia    18.667   14.791     3.876   0

Once the four final datasets had been obtained, two for training and two for testing, they were transformed into ARFF format (.arff), as per the subsection below.

Arff files creation

Creating an .arff file is an apparently easy task: the file consists of a heading that has to be structured in a specific way, followed by each sentence with its classification as 0 or 1 for the training dataset. Conversely, the test dataset is entirely labelled with "?", as Weka is supposed to learn from the training set how to interpret the "?" and classify each instance as 0 or 1. To convert the .txt files into .arff, Notepad++ with the specific converter "arff.notepadplus" was used10. What started as a supposedly easy task became a new, tiresome and at times nerve-racking trial-and-error exercise that took a whole day of hard work, and deserves a line in this study. Beware of commas and apostrophes in .arff files: they are detected as mistakes by the

9https://www.youtube.com/watch?v=Tggs3Bd3ojQ
10https://waikato.github.io/weka-wiki/formatsandprocessing/arffsyntax/

software, at least for a neophyte, even though both glyphs appear in all the sample files provided by the University for trying out Weka.

Figure 6.14: Problems importing .arff files into Weka

Once all commas and apostrophes had been removed, each sentence line was delimited according to the instructions in the DM manual by Witten, Frank and Hall (2011), as shown in Figure 6.15.

Figure 6.15: Example of .arff files structure

The @relation line contains the name of the dataset. The Text attribute of type string indicates that the data in the file are strings of text. A nominal attribute containing the classification of the document is also required, @attribute class-att, whose value is 1 for yes (Mafia) or 0 for no (no-Mafia).
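As a sketch of an alternative to the manual Notepad++ conversion (an assumption, not the workflow actually used), the following Python script writes the header structure just described and one quoted sentence per line with its class, stripping the commas and apostrophes that caused import errors.

```python
# Sketch: build Traindataset.arff from the two cleaned .txt files, using the
# header structure described above (@relation, a string attribute for the
# text, a nominal class-att of 1/0). Commas and apostrophes are stripped,
# since they were rejected on import into Weka.
def to_arff(files_and_labels, out_path, relation="Traindataset"):
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"@relation {relation}\n\n")
        out.write("@attribute Text string\n")
        out.write("@attribute class-att {0,1}\n\n")
        out.write("@data\n")
        for path, label in files_and_labels:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    text = line.strip().replace(",", " ").replace("'", " ")
                    if text:
                        out.write(f"'{text}',{label}\n")

to_arff([("mafia_train.txt", 1), ("no_mafia_train.txt", 0)], "Traindataset.arff")
# For the test file, the class value would be written as ? instead of 0/1.
```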

Once the two .arff files had been created on the basis of Table 6.2 above, the preliminaries for performing Document Classification were fulfilled, as per Table 6.3.

Table 6.3: Final composition of .arff files dataset for Weka

Variable      Traindataset.arff   Testdataset.arff
Mafia         14.851              3.989
No-Mafia      14.791              3.876
Total words   29.642              7.865

6.2.2 Weka settings and experiment

As mentioned in the preamble of this Chapter, StringToWordVector is the filter that “makes the attribute value in the transformed dataset 1 or 0 for all single-word terms, depending on whether the word appears in the document or not” [Witten et al., 2011, pp. 580-585]. The filter itself has different options, as explained in the mentioned MOOC tutorial:

• outputWordCounts: the output consists of word counts.

• IDFTransform and TFTransform: if both are set to True, term frequencies are transformed into TF × IDF values (a numerical statistic reflecting how important a word is to a document in a dataset).

• stemmer: offers a choice of word-stemming algorithms.

• useStopList: allows a personalised stop-word list to be used. If no file is chosen, Weka treats stop-words as attributes too.

• tokeniser: allows different tokenisers to be selected, including n-gram tokenisers instead of single words.

For a visual step-by-step description of the procedure, each process has been ordered as follows:

– First: once the programme had started, the Traindataset.arff file was loaded in the Preprocess tab, as per Figure 6.16.

– Second: selection of the StringToWordVector option in the filter box, without applying it as a pre-processing step at this stage. This choice depended on the fact that if the Train file contained several vectors and attributes, different in name and number from the

Figure 6.16: Weka interface: Traindataset.arff uploaded

Test file (as happens with words), the programme cannot match them and gives an error. This occurred because the vectors associated with each attribute were too many to handle, even for the software. This side effect is called a "mismatch" of attributes, which must be identical for the two files to be compatible.

– Third: shifting to Classify, the second tab in Figure 6.16, to select the FilteredClassifier option, which allows the analysis of word attributes. Inside FilteredClassifier, J48, a specific Weka tree classifier, was chosen first; the filter to be used together with J48 is StringToWordVector, which was selected as well. Inside StringToWordVector different options can be set. After several attempts, the final setting consisted of TF/IDF set to True, with the intention that the same keywords detected in the previous analysis in Chapter 5 might be identified as relevant by the algorithm too. The TF/IDF option works with DoNotOperateOnPerClassBasis set to True and also with the outputWordCounts option set to True. The stemmer chosen was the SnowballStemmer, which had not worked well with NLTK in Chapter 5 but worked better here together with the researcher's ItalianStopWordList. WordTokenizer, one of the three tokeniser options included in Weka, was

also selected.

– Fourth: the Testdataset.arff file was loaded and Weka was run for the first time. The first result was "?", since this first pass only creates unclassified associations from the classifier being evaluated. According to Witten, this is the expected output of Weka on a first approach with the Test file.

– Fifth: ten-fold cross-validation was applied. This step divides the data into ten parts, holds one portion out and trains the classifier on the remaining nine, and the operation is repeated ten times (folds). Ten-fold cross-validation is the standard method for error-rate prediction and ensures that the results "are the same as would be obtained on independent test sets" [Witten et al., 2011, p. 89], avoiding the need for different researchers to repeat the test with the same data. (A rough scripted analogue of these settings is sketched after Figure 6.17.)

Figure 6.17: Weka interface: test classifiers settings
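Since Weka is driven through its graphical interface, a rough scripted analogue of the configuration above may help readers who prefer code. The scikit-learn sketch below mirrors StringToWordVector (TF/IDF, Italian Snowball stemming, the stop-word list) with TfidfVectorizer and approximates J48 with a CART decision tree, evaluated by ten-fold cross-validation; it is an approximation under assumed file names, not the experiment actually run.

```python
# Rough scikit-learn analogue of the Weka pipeline configured above (a sketch,
# not the experiment actually run): TF/IDF features with Italian stemming and
# stop-word removal, a decision tree comparable to J48, ten-fold CV.
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

stemmer = SnowballStemmer("italian")

with open("stopwords_it.txt", encoding="utf-8") as f:
    stopwords = {w.strip() for w in f if w.strip()}

def analyzer(text):
    # word tokenisation, stop-word removal and Italian stemming, roughly
    # mirroring WordTokenizer + the stop-word list + SnowballStemmer
    return [stemmer.stem(w) for w in text.split() if w not in stopwords]

def read_labelled(path, label):
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    return sentences, [label] * len(sentences)

m_texts, m_labels = read_labelled("mafia_train.txt", 1)
n_texts, n_labels = read_labelled("no_mafia_train.txt", 0)
texts, labels = m_texts + n_texts, m_labels + n_labels

model = make_pipeline(
    TfidfVectorizer(analyzer=analyzer),          # ~ StringToWordVector with TF/IDF
    DecisionTreeClassifier(random_state=0),      # ~ J48 (a CART tree, not C4.5)
)
scores = cross_val_score(model, texts, labels, cv=10)   # ten-fold cross-validation
print(f"mean accuracy: {scores.mean():.2%}")
```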

Once the training with Weka had finished, the results obtained on the two-variable Mafia and no-Mafia dataset could be examined; they are discussed in the next chapter, Chapter 7, "Results and Discussion".

Chapter 7

Results and Discussion

The Mafia and no-Mafia document (and language) classification experiment based on Witten's MOOC, carried out with Weka and described in the previous chapter, produced the output shown in Table 7.1 and Figure 7.1.

Table 7.1: Weka experiment success rate for detecting Mafia Language

Instances                  3.495
Correctly classified       70.21 %
Incorrectly classified     29.78 %

The results reported by Weka show a 70.21% success rate in distinguishing variable 1, corresponding to the Mafia language, from variable 0. This result validates the alternative hypothesis that Mafia language is different from no-Mafia language. At this stage, it could be objected that Weka offers instance-based outputs, whereas the files' data, as per Tables 6.2 and 6.3, are word-based. From the beginning, the importance of keywords was considered the relevant feature of both datasets, which suggested a pure "bag of words" approach; since the present study has no previous references, this option was considered the most appropriate for a first attempt. The other relevant piece of information in Figure 7.1, apart from the results of the Test, is the confusion matrix: an internal evaluation of Weka's performance with the J48 classifier. "In multiclass prediction, the result on a test set is often displayed as a two-dimensional confusion matrix with a row and column for each class. Each matrix element shows the number of test examples for which the actual class is the row and the predicted class is the column. Good results correspond to large numbers down the main diagonal and small, ideally zero, off-diagonal

elements" [Witten et al., 2011, p. 164]. How should the confusion matrix be read? The columns show how the model (the J48 classifier with StringToWordVector and the settings as per Figure 6.17) predicted the instances.

Table 7.2: Weka experiment confusion matrix results

               a=0 (predicted)   b=1 (predicted)
a=0 (actual)         722                627
b=1 (actual)         414              1.732

In the first column, corresponding to the no-Mafia variable 0, the sum 722+414 = 1.136 instances corresponds to what the model predicts as no-Mafia instances. In the second column, corresponding to the Mafia variable 1, the sum 627+1.732 = 2.359 instances corresponds to what the model predicts as Mafia instances. If the columns are predictions, the rows represent reality. Thus, the first row contains all the "a" samples, i.e. no-Mafia variable 0: 722+627 = 1.349 instances. The second row contains all the "b" samples, i.e. Mafia variable 1: 414+1.732 = 2.146 instances. Once the confusion matrix has been read, it has to be interpreted1:

• Top left: 722 instances that the model predicts as "a" and that really are "a", so 722 instances are correctly classified as no-Mafia.

• Bottom left: 414 instances that the model predicts as "a" but that really are "b"; these Mafia instances classified as no-Mafia are confused instances.

• Top right: 627 instances that the model predicts as "b" but that really are "a": another misclassification, of 627 no-Mafia sentences classified as Mafia (variable 1). These instances are confused as well.

• Bottom right: 1.732 instances that the model predicts as "b" and that really are "b", so they are correctly classified Mafia instances.

• The top-left 722 and the bottom-right 1.732 cells of the matrix show the instances that Weka's J48 model classifies correctly.

1To interpret the confusion matrix correctly a relevant contribution in Stackoverflow has been essential: https://stackoverflow.com/questions/15214179/how-to-read-the-classifier-confusion-matrix-in-weka.

• The bottom-left 414 and the top-right 627 cells of the matrix show where the model is confused. From these four cells, the overall accuracy reported in Table 7.1 can be recomputed directly, as shown below.
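As a quick check, the overall accuracy in Table 7.1 follows directly from the four cells of the confusion matrix (thousands separators omitted):

\[
\mathrm{accuracy} \;=\; \frac{722 + 1732}{722 + 627 + 414 + 1732} \;=\; \frac{2454}{3495} \;\approx\; 70.21\%
\]

By the same arithmetic, the model recovers 1732/2146 ≈ 80.7% of the Mafia instances but only 722/1349 ≈ 53.5% of the no-Mafia ones, which is where most of the confusion discussed below lies.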

The confusion matrix results (414 and 627 confused instances) offer an element of analysis that deserves some reflection from the researcher. The relatively high number of misclassified instances is an aspect that was taken into account in the experiment design. It should not be forgotten that the transcripts used are a reflection of real conversations. In these conversations there are repeated phrases, such as pleasantries or greetings, collocations, interjections and onomatopoeias such as eh, ehhh, ohhh, uhum, which were deliberately not eliminated. The reason is that NLP should be applied to "real" NL, not manipulated to ease the task or to obtain better outputs. All these pragmatic elements exist in all conversations and should not be eliminated from an experiment that aims to set a valid and authentic precedent for the analysis of criminal language. For example, a 1 is assigned to every "Ciao, come va?" in the Mafia dataset, and a 0 is assigned to every "Ciao, come va?" in the no-Mafia dataset. It is to be expected that an algorithm will not know how to classify "Ciao, come va?", because it appears in both datasets; in the entire dataset there are a few such "Ciao, come va?" phrases. Another point for reflection is that many of the transcribed conversations do not contain complete sentences, owing to interference in the wiretaps, noise or overlapping voices, because the interlocutors are numerous and interrupt each other. This implies that the sentence structure in the dataset is often incomplete when compared, ideally, with a written text, whose structure is supposed to be more rigid. To a certain extent, it may be that this incompleteness confuses the classification of one variable or the other. To further evaluate Weka's performance, a simple baseline was calculated: "The purpose of baseline assessment is to establish a point from which future measurements and predictions can be calculated"2. In other words, a baseline is the result of a very basic model or solution, as the name implies. It is calculated as a starting point before applying more complex models to a database to obtain more robust results, as in the Weka experiment described in Chapter 6. If the experiment scores better than the baseline, the experiment has been successful. In this case, the baseline assumes that all the instances of the Test dataset are 0, corresponding to the no-Mafia variable; by checking whether this guess is correct or wrong, a percentage is calculated. The baseline result is reported in Table 7.3. The baseline indicates that, if calculated in this way, there is almost no difference between the

2Retrieved from https://www.oxfordreference.com/view/10.1093/oi/authority.20110803095449856

Table 7.3: Success rate for the baseline results (majority class approach)

Variable 0   Variable 1
54.5%        45.5%

two variables, 0 for no-Mafia and 1 for Mafia. Therefore, the result reported by Weka is of great relevance, as it indicates a roughly 70% chance of detecting what is Mafia language and what is not.

Figure 7.1: Weka final results

Finally, the margin of roughly 20% over the baseline result is good enough to open a new line of research for studying criminal language through NLP. New utilities can be developed to improve the application of ML and document classification in the fields of investigation and forensic linguistics, in order to obtain judicially relevant material and to support its analysis, classification and interpretation.

Chapter 8

Conclusions

The lengthy process of proving the alternative hypothesis, which began with criminal and Mafia conversations collected from court proceedings records, went through several steps in an NLP-based prototype experiment. The purpose set out at the beginning has been fulfilled: the alternative hypothesis has been validated, as shown in Chapter 7, with 70% of instances correctly classified. This result has to be interpreted in perspective, as the material used was collected via the Internet and thanks to the help of lawyer Falleti. Better and more comprehensive results could have been obtained with qualitatively richer material, such as that available to law enforcement agencies. However, Italian law prevents the use of such sensitive information, even for research purposes. It would be wonderful to enhance the experiment with qualitatively and quantitatively stronger material, compiling a corpus of criminal language from all over Italy.

This result opens up new perspectives in language analysis with ML tools for forensic linguistics and for legal and criminal proceedings. With proper training and supervision by multidisciplinary teams of IT specialists, legal experts, law enforcement and linguists, ML tools could directly analyse electronic eavesdropping files and give relevance only to those conversations that raise a high percentage of suspicious elements, without the need for transcription and “translation” into Italian. In the selection and reading of thousands of words, ML tools would collect, analyse and classify relevant information faster and better. The automatic analysis would provide more reliable and more authentic results, with better cross-validation to extract meaningful data, and would be more likely to be admitted as evidence in court. Audio could later be automatically transcribed and even “translated” into Italian with downstream MT modules.

Several limitations can be raised against this study, mainly due to its pioneering approach, with hardly any academic support. The first objection might be the choice of the word as a parameter, while Weka works with instances. From the researcher’s perspective, based on the background bibliography on Mafia and no-Mafia literature, the semantic value of keywords was paramount. As no reference was found on Weka document classification based on words only, but only on sentence classification with word attributes, the latter approach was embraced for the experiment. A new line of research could be testing Weka on valued words, considering that the same word, as seen with “discorso”, might be misleading because it is used in both datasets. In this study an unannotated corpus was used, but this does not mean that the performance of Weka, or of any other NLP tool, cannot be improved with an annotated corpus. Using all the knowledge obtained from the content processing and analysis described in Chapter 5, with lemmatisation, PoS tagging and parsing, future studies could implement this information as extra attributes in Weka and test whether the results offered by the algorithms improve with more information on the linguistic features, as sketched below.
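As a purely illustrative sketch of this suggestion, the fragment below shows one way word attributes could be combined with extra linguistic attributes before training a classifier. It is not part of the thesis experiment: the experiment itself was run in Weka, whereas this sketch uses scikit-learn, and all sentences, labels and attribute choices are invented.

```python
# Minimal sketch (assumption): combine bag-of-words features with extra
# linguistic attributes (e.g. PoS-tag counts produced elsewhere), mirroring
# the idea of adding columns to the Weka instances.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy sentences with their class (1 = Mafia, 0 = no-Mafia) and,
# per sentence, pre-computed counts of [nouns, verbs] that lemmatisation and
# PoS tagging would provide.
sentences = ["il discorso del mandamento", "ciao come va", "andiamo al bar"]
labels = [1, 0, 0]
pos_counts = np.array([[2, 0], [0, 1], [1, 1]])

# Word attributes (bag of words) concatenated with the extra attributes.
word_features = CountVectorizer().fit_transform(sentences)
all_features = hstack([word_features, csr_matrix(pos_counts)])

# Any classifier could be plugged in here; Naive Bayes is only a stand-in.
clf = MultinomialNB().fit(all_features, labels)
print(clf.predict(all_features))
```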

Apart from the obvious limitations and suggestions for improvement, this investigation has clearly established that statements made in a courtroom environment are of little relevance for linguistic analysis, due to their lack of spontaneity and their constrained nature. In this sense, the use of electronic surveillance transcripts is much more effective. It has also been established that mandamento and capomandamento were an exclusive feature of the Mafia dataset, whereas the keyword compare was exclusive to the no-Mafia dataset. The presence of certain words from Mafia language, such as picciotti and mandamento, inside the no-Mafia dataset evidences the connections between different criminal associations, which can be demonstrated through linguistic analysis with NLP and ML tools.
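Purely as an illustration of how such keyword exclusivity can be checked, the following minimal Python sketch compares the vocabularies of two sets of sentences; it is not the procedure used in the thesis, and the toy sentences are invented.

```python
# Minimal sketch (assumption): find vocabulary that appears in only one
# of the two datasets, e.g. keywords exclusive to the Mafia transcripts.
def exclusive_keywords(corpus_a, corpus_b):
    vocab_a = {w.lower() for text in corpus_a for w in text.split()}
    vocab_b = {w.lower() for text in corpus_b for w in text.split()}
    return vocab_a - vocab_b, vocab_b - vocab_a

# Hypothetical toy sentences, not the real datasets.
mafia = ["il capomandamento decide", "parliamo del mandamento"]
no_mafia = ["ciao compare", "andiamo dal compare"]
only_mafia, only_no_mafia = exclusive_keywords(mafia, no_mafia)
print(sorted(only_mafia))
print(sorted(only_no_mafia))
```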

It might be said that a 20% margin is not an inviting result; one might have expected, for example, an overwhelming ratio of 90/10. However, as mentioned above, it was clear from the beginning that the linguistic substrate of both languages is Italian, with dialectal variants of Sicilian in the case of the Mafia dataset. The two languages have many attributes in common, but not so many as to be indistinguishable from each other. In fact, there is a 20% margin of attributes, and not only a few words, that differentiates the Mafia variable from the no-Mafia variable. An example of this singularity is the common Italian word discorso, which has a unique semantic value inside the Mafia context. This singularity also answers the possible objection that the 20% could simply represent the Sicilian dialect features. But again, discorso and mandamento are not exclusive to the Sicilian dialect; they are Italian words.

So the 20% difference between the two datasets cannot be represented by the Sicilian dialect vocabulary.

To take the conclusions further, this study also seeks to vindicate the professional role of the linguist beyond the academic sector, and specifically in Forensic Linguistics and Law. In fact, the analysis chapters have been essential both to identify valuable elements of the Mafia dataset and to design the Weka experiment, and the only viable approach to them is the linguistic one. In this respect, it would be interesting to include programming language courses such as Python or R in the humanities curricula. Due to the increasing amount of data, the future is destined to the automation of analysis processes. Linguistics cannot afford to lag behind in the compartmentalisation of knowledge, and linguists must rid themselves of the idea that pure sciences are at odds with Linguistics and Philology. Classification and text analysis with NLP tools are here to stay. Future generations will be better prepared if they are equipped with valuable tools for this purpose and gain a flexibility in text analysis that is impossible to achieve manually.

Another element in favour of this study, and of those that will eventually follow it, is that mafias are globalising and their structures are becoming more fluid and standardised. The effectiveness and immediacy of analysis with NLP utilities could speed up the collection of relevant information that is valid for investigations. Given Cosa Nostra’s fluidity in adapting to the times, electronic surveillance transcriptions of this kind, analysed with NLP tools, would greatly facilitate the first investigative steps of collecting and classifying relevant data, as mentioned in Chapter 6. Similar studies on other mafias, such as the Italian ’Ndrangheta and Camorra already mentioned, or the Japanese, Balkan and Russian organisations, could offer further insights into their structures, connections and activities in order to eradicate them. In this regard, the application of NLP to studies of languages other than English must be applauded. Multilingualism has to be applied in computer science as well.

Prosecutor Falcone was the first man of law to give relevance to language in order to better understand and fight Cosa Nostra. His legacy was to try to defeat the Mafia with its own words, and he was partially successful. Because of this, he was killed on 23rd May 1992, together with his wife and escort. He might have died, but not his legacy.

Bibliography

[Anthony, 2019] Anthony, L. (2019). AntConc.

[Bose et al., 2020] Bose, R., Vashishtha, S., and Allen, J. (2020). Improving Semantic Parsing Using Statistical Word Sense Disambiguation (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 34(10):13757–13758.

[Cambria, 2014] Cambria, E. and White, B. (2014). Jumping NLP Curves: A Review of Natural Language Processing Research. IEEE Computational Intelligence Magazine, 14(May):48–57.

[Camilleri, 2009] Camilleri, A. (2009). Voi non sapete. Gli amici, i nemici, la mafia, il mondo nei pizzini di Bernardo Provenzano. Arnoldo Mondadori Editore Srl, Milano, 2nd edition.

[Cavallaro et al., 2020] Cavallaro, L., Ficara, A., de Meo, P., Fiumara, G., Catanese, S., Bagdasar, O., and Liotta, A. (2020). Disrupting resilient criminal networks through data analysis: The case of Sicilian Mafia. arXiv.

[Ceruso, 2010] Ceruso, V. (2010). Dizionario Mafioso-Italiano, Italiano-Mafioso. Newton Compton Editori Srl, Roma, 1st edition.

[Chaski, 1997] Chaski, C. (1997). Who wrote it. Technical report, National Institute of Justice, Washington, DC.

[Chung et al., 2019] Chung, Y.-L., Kuzmenko, E., Tekiroğlu, S. S., and Guerini, M. (2019). CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2819–2829.

[Ciconte, 2017] Ciconte, E. (2017). Dall’omertà ai social, come cambia la comunicazione della mafia. Edizioni Santa Caterina, Pavia, 1st edition.

[Clark and Shalom, 2010] Clark, A., Fox, C., and Shalom, L. (2010). The Handbook of Computational Linguistics and Natural Language Processing. Blackwell Publishing Ltd, West Sussex, UK, 1st edition.

[Cremades et al., 2017] Cremades, S. Z., Gomez Soriano, J. M., and Navarro-Colorado, B. (2017). Diseño, compilación y anotación de un corpus para la detección de mensajes suicidas en redes sociales. Procesamiento de Lenguaje Natural, 59:65–72.

[Espunya i Prat, 1994] Espunya i Prat, A. (1994). Computational linguistics: a brief introduction. Links & letters, 1(1):9–23.

[Falcone and Padovani, 1991] Falcone, G. and Padovani, M. (1991). Cose di Cosa Nostra. BUR Rizzoli, Milano, 6th edition.

[Francolini, 2020] Francolini, G. (2020). La prova nel procedimento di prevenzione: identità, alterità o somiglianza con il processo penale? Sistema Penale, 1(10):5–50.

[Grice, 1975] Grice, H. (1975). Logic and conversation. In Syntax and Semantics: Speech Acts, volume 3, pages 41–54. Academic Press, New York, NY.

[Kastrati and Imran, 2019] Kastrati, Z. and Imran, A. S. (2019). Performance analysis of machine learning classifiers on improved concept vector space models. Future Generation Computer Systems, 96:552–562.

[Lee, 2020] Lee, R. S. T. (2020). Artificial Intelligence in Daily Life. Springer Singapore.

[Mitkov, 2003] Mitkov, R. (2003). The Oxford Handbook of Computational Linguistics. Oxford University Press, Oxford, UK.

[Navarro-Colorado, 2021] Navarro-Colorado, B. (2021). Compilación de corpus para lenguas de especialidad.

[Palazzolo and Prestipino, 2017] Palazzolo, S. and Prestipino, M. (2017). Il codice Provenzano. Giuseppe Laterza & Figli, Bari-Roma, 6th edition.

[Pontrandolfo, 2012] Pontrandolfo, G. (2012). Legal Corpora: An Overview. 14/2012:121–136.

[Rani et al., 2020] Rani, P., Suryawanshi, S., Goswami, K., Chakravarthi, B. R., Fransen, T., and Mccrae, J. P. (2020). A Comparative Study of Different State-of-the-Art Hate Speech Detection Methods for Hindi-English Code-Mixed Data. Technical report, European Language Resources Association (ELRA).

[Razavi et al., 2010] Razavi, A. H., Inkpen, D., Uritsky, S., and Matwin, S. (2010). Offensive language detection using multi-level classification. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6085 LNAI(May):16–27.

[RStudio Team, 2020] RStudio Team (2020). RStudio: Integrated Development for R. RStudio.

[Searle, 1980] Searle, J. R. (1980). Minds, brains, and programs. In Behavioral and Brain Sciences, volume 3, pages 417–457. BBS.

[Tabbert, 2015] Tabbert, U. (2015). Crime and corpus: The linguistic representation of crime in the press., volume 20. John Benjamins Publishing Company, Amsterdam/Philadelphia.

[Talib et al., 2016] Talib, R., Kashif, M., Ayesha, S., and Fatima, F. (2016). Text Mining: Techniques, Applications and Issues. International Journal of Advanced Computer Science and Applications, 7(11):414–418.

[Torrealta, 2010] Torrealta, M. (2010). La trattativa. RCS Libri SpA, Milano, 1st edition.

[Turing, 1950] Turing, A. M. (1950). Computing Machinery and Intelligence. Mind, 59(236):433–460.

[Università Cattolica del Sacro Cuore and Transcrime Department, 2019] Università Cattolica del Sacro Cuore and Transcrime Department (2019). ROXANNE.

[Wang et al., 2009] Wang, H., Ma, C., and Zhou, L. (2009). A brief review of machine learning and its application. Proceedings - 2009 International Conference on Information Engineering and Computer Science, ICIECS 2009, pages 19–22.

[Witten, 2005] Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition.

[Witten et al., 2011] Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining. Technical report, Elsevier, Burlington, USA.