University of Alicante

Final Master's Thesis in English and Spanish for Specific Purposes
Department of Language and Philosophy

MAFIA LANGUAGE ANALYSIS AND DETECTION USING COMPUTATIONAL LINGUISTICS TOOLS

Author:

Elena Morandini

Supervisor:

Prof. Elena Lloret

Regular Session, Academic Year 2020/2021

ABSTRACT

Hate speech and violent language detection with Natural Language Processing (NLP) tools are highly topical. If such technology could be applied to criminal jargon, it would give law enforcement agencies new resources to speed up investigations and evidence gathering against organised crime. This paper proposes the innovative approach of using computational tools and Machine Learning (ML) to detect Mafia language in the electronic surveillance transcriptions used as evidence in Italian courts. Starting from a null hypothesis, this research will try to demonstrate the alternative hypothesis: that the Mafia and no-Mafia language variables can be differentiated. The first part of the investigation will determine the extent to which NLP tools detect Mafia language features, contrasting them with no-Mafia criminal jargon. The second part will describe an experiment to demonstrate how ML tools can recognise the Mafia variable for investigative purposes. In the conclusions, the effectiveness of the proposal as a pioneering method to fight the Mafia using its own language will be discussed, defending the work of linguists and leaving the door open to new perspectives.

Keywords: Computational Linguistics, criminal, mafia, language, Natural Language Processing tools, Machine Learning.

RESUMEN

Natural Language Processing (NLP) is at its peak, allowing the automatic analysis of the millions of data items available in order to detect and block violent language on social networks. The starting point of this study is the possibility of applying this technology to the gathering of judicial evidence in order to speed up investigations against organised crime, specifically against the Sicilian Mafia. This innovative approach envisages using Machine Learning (ML) for the first time to detect Mafia language in electronic surveillance transcriptions used as evidence in Italian courts. Starting from the assumption that the linguistic variables Mafia and no-Mafia are the same, that is, that they cannot be distinguished, the study will attempt to demonstrate the alternative hypothesis: that Mafia and no-Mafia language are independent and can therefore be discriminated. The first part of the work will determine whether NLP tools identify linguistic features exclusive to Mafia language, contrasting them with non-Mafia criminal jargon. The second part will describe a pioneering experiment whose purpose is to demonstrate how an ML tool can recognise the Mafia variable for investigative purposes. The conclusions will assess the effectiveness of this approach as a pioneering method of fighting the Mafia with its own language, vindicating the work of linguists and leaving the door open to new possibilities.

Keywords: Computational Linguistics, Mafia, criminal language, Natural Language Processing, Machine Learning.

Acknowledgements

This paper started with a conversation with Lorenzo and matured during a Master's presentation on Legal English. The experiment with Weka and the ML approach would not exist without my supervisor, Prof. Elena Lloret. Her ideas, constant aid, motivation and support (and patience) have been essential to this study. Avv. Claudio Falleti helped with the no-Mafia dataset: without part of his documentation, no experiment could have been done either. The effort is dedicated to Ettore. He was deprived of time when time was the only thing we had. None of this would even exist without Prosecutor , in beloved memory.

Layout with LaTeX, just to make it more difficult, if possible.

Contents

1 Introduction

2 Discipline of Study
  2.1 Text Mining
  2.2 Corpus Linguistics
  2.3 Natural Language Processing modules
  2.4 Solving ambiguities
    2.4.1 Ambiguities in Mafia language
  2.5 Strategies for addressing Natural Language Processing problems
    2.5.1 Machine Learning
  2.6 Practical applications of Natural Language Processing
    2.6.1 Document Classification

3 State of the Art

4 Problem and purpose statements

5 Methodology
  5.1 Dataset creation
    5.1.1 Mafia dataset
    5.1.2 No-Mafia dataset
  5.2 Dataset cleaning
    5.2.1 Mafia dataset
    5.2.2 No-Mafia dataset
  5.3 Dataset processing
    5.3.1 Mafia dataset
    5.3.2 No-Mafia dataset
    5.3.3 Partial conclusions on datasets content processing

6 Research
  6.1 Content Analysis
    6.1.1 AntConc
    6.1.2 RStudio
    6.1.3 T-lab
    6.1.4 Partial conclusions on the content analysis
  6.2 Machine Learning experiment
    6.2.1 Datasets adaptation
    6.2.2 Weka settings and experiment

7 Results and Discussion

8 Conclusions
Bibliography

List of Tables

2.1 Tokens, stems, lemmas and PoS tags for one Mafia dataset sentence

5.1 Final dataset words distribution

6.1 Keywords frequency with AntConc
6.2 Final dataset distribution
6.3 Final composition of .arff files dataset for Weka

7.1 Weka experiment success rate for detecting Mafia Language
7.2 Weka experiment confusion matrix results
7.3 Success rate for the baseline results (majority class approach)

List of Figures

2.1 Euler Venn Diagram with computational fields and sub-disciplines relationships [Talib et al., 2016]
2.2 Sample Mafia dataset sentence output with Babelfy
2.3 Example of Mafia dataset transcriptions with translations Italian-Sicilian dialect
2.4 Machine Learning Life Cycle
2.5 Euler Venn Diagram Document Classifications and the other text-oriented sub-disciplines of NLP and IE relationships [Talib et al., 2016]

5.1 Datasets creation and content process flowchart
5.2 Judgement retrieved from Archivio Antimafia in Acrobat .pdf file format
5.3 Acrobat .pdf as non-OCR file, dismissed from Mafia dataset
5.4 Reconstruction word by word of supposed "good quality" .pdf file for Mafia dataset
5.5 Cleaning .pdf for dataset with RStudio [RStudio Team, 2020]
5.6 Using Notepad++ and regular expressions for cleaning process
5.7 Mafia language dataset Lemmatisation and Stemming with NLTK
5.8 PoS Tagger with SpaCy in Colab
5.9 Example of a sentence Parsing with SpaCy and UDPipe in Colab
5.10 No-Mafia dataset parser output with SpaCy UDPipe in Colab
5.11 No-Mafia dataset PoS Tagger and dependency parsing analysis with SpaCy in Colab

6.1 Step-by-step content analysis procedure flowchart
6.2 Mafia dataset splitting into sentences with RStudio
6.3 Mafia dataset words frequency experiment with RStudio
6.4 RStudio words cloud experiment with sample of Mafia language dataset
6.5 Mafia dataset topic analysis result for discorso
6.6 No-Mafia dataset topic analysis result for discorso
6.7 Mafia dataset co-word analysis result for the theme discorso
6.8 No-Mafia dataset co-word analysis result for discorso
6.9 Cloud of words for mandamento and capimandamento in Mafia dataset
6.10 Co-word analysis result for mandamento in Mafia dataset
6.11 No-Mafia dataset cloud of words analysis result for discorso
6.12 No-Mafia dataset co-word analysis result for cristiano
6.13 Weka procedure and settings for experiment flowchart
6.14 Problems to import .arff files into Weka
6.15 Example of .arff files structure
6.16 Weka interface: Traindataset.arff uploaded
6.17 Weka interface: test classifiers settings

7.1 Weka final results

Chapter 1

Introduction

Technological advances have undoubtedly made it possible to analyse immense amounts of diverse data in extremely short times, which was unthinkable not so long ago. Human or natural language (NL) is one of these diverse data types. Applied to the linguistic field, Natural Language Processing (NLP) and Machine Learning (ML) allow the detection and analysis of NL in spoken and written form for different purposes. One of the most talked-about and highly topical uses of NLP and ML is the detection of written hate speech on social networks, like Twitter or Facebook, as in the recent study by Rani et al. (2020). Public opinion has forced the communication giants to regulate and censor posts considered controversial or politically incorrect on their digital platforms. The only way to deal with this exorbitant amount of data is automatically, by developing ML algorithms that offer a perfect example of NLP in application. Their job is to read immense volumes of information posted in different languages almost instantly and block social media accounts because of inappropriate language. In order to develop this technology for identifying, detecting and blocking so-called hate speech, NLP and ML have made giant strides. In a relatively short time, increasingly powerful algorithms have been adapted to different areas and might pertain to any field in which the processing of large amounts of diverse data becomes necessary. For example, IBM Watson Health1 has beneficial applications in medicine. Its algorithms can extract relevant information from large amounts of data consisting of clinical notes, discharge summaries, clinical trial protocols and literature data to generate insights for many applications, like identifying pertinent problems, procedures and medications to develop better treatments.

1 https://www.ibm.com/watson-health

Indeed, if powerful tools can process millions of words per minute to find specific elements, a profound reflection on the application of this new field should be possible from the linguistics perspective. For example, in forensic linguistics, NLP tools are used to analyse circumstantial evidence for authorship identification, as in Carole Chaski's [Chaski, 1997] computer software ALIAS2: "Automated Linguistic Identification and Assessment System for law enforcement, investigators, crime laboratories, human resources departments, security teams and linguists". Nevertheless, these innovative perspectives are almost limited to the United States. A pioneering initiative is Roxanne3, a collaboration between the Criminology department of the Università Cattolica del Sacro Cuore di Milano and the European Community "aiming to unmask criminal networks and their members as well as to reveal the true identity of perpetrators by combining the capabilities of speech/language technologies and visual analysis with network analysis" [Università Cattolica del Sacro Cuore and Transcrime Department, 2019]. However, the application of NLP to the legal and investigative fields is scarce, and much remains to be done; specifically, and to enter fully into the scope of this research, in detecting the Mafia language. The plague of the Sicilian Cosa Nostra, which affects and many other countries, could be fought with better tools if its peculiar language were analysed, distinguishing it from common criminal jargon. The term Mafia refers to a system of power exercised through violence and intimidation to control territory, illegal trade and business activities [Torrealta, 2010]. It represents a system of alternative power with hierarchical and top-down management, based on internal rules, which are founded on the use of violence and intimidation [Falcone and Padovani, 1991, pp. 29, 82, 94]. The Mafia is also known as the Honourable Society, hence the term uomini d'onore for its members. It is rooted in , where it has taken the name of Cosa Nostra since the XIX century. The family is at the base of the organisation, consisting of individual "men of honour" associated and coordinated by a , acting in a given territory. If the family is large, several capidecina refer to a single representative. Three or more contiguous families per territory constitute a mandamento, with one head. The mandamento heads, in turn, meet in the provincial commission,

2 https://aliastechnology.com/
3 Set up in 2019. Official Web page available at: https://roxanne-euproject.org/

la Commissione. The chiefs of the province have always exercised hegemony over Cosa Nostra, even though there are several mandamenti across the whole of Sicily [Falcone and Padovani, 1991].

The first maxi-trial (February 1986 - December 1987) against 475 defendants could be held thanks to the investigations of Prosecutor Giovanni Falcone and the Anti-Mafia Pool. The evidence gathered for the trial was based on the statements of , the first informer and turncoat, whose relevant information on Cosa Nostra's inner structure led to the final convictions in 1992. For the first time, the Italian judiciary proved capable of affecting the Mafia organisation, thanks to the aggravating circumstance of Associazione di stampo mafioso or "Mafia-style association"4.

Lamentably, Cosa Nostra's reaction to the Maxi-Trial and the application of the new law triggered an offensive of real Mafia terrorism against the State. Between 1992 and 1993, a series of and massacres took place, such as the bombings that caused the death of Giovanni Falcone together with his wife and escort (Capaci, 23rd May 1992) [Palazzolo and Prestipino, 2017].

The judicial response to these events produced new laws which introduced, among other things, bodies specialised in the investigative and judicial fight against the Mafia, such as the Direzione Investigativa Anti-Mafia (DIA5), and a regime of hard imprisonment for detained men of honour (the so-called article 41 bis of the Italian Prison Law). For this reason, Cosa Nostra's strategy changed, giving life to the so-called inabbissamento [Torrealta, 2010, p. 38], meaning diving, whose "director" was the boss . In a few words, the presence on the territory remains constant, but without resorting to homicides or sensational acts of bloodshed, making it extremely difficult to fight the Mafia. In the meantime, since the 1990s, trials of politicians and institutions at national and local level have been set up and held for their alleged relations with Mafia organisations, revealing Cosa Nostra's ability to infiltrate Government Institutions to exercise its counter-power with impunity [Torrealta, 2010].

If it were indeed possible to discriminate and highlight the Mafia language among thousands of data, other elements useful for investigative and legal evidence purposes could be gathered thanks to NLP and ML tools. Not only could a difference be made in the fight against crime, but irrefutable evidence could be presented in a courtroom. Furthermore, it would

4 Law Rognoni – La Torre (L. 646/1982) Art. 614 bis and Art. 17 [Francolini, 2020, p. 5]
5 https://direzioneinvestigativaantimafia.interno.gov.it/

undoubtedly speed up the investigative and judicial process by reducing the length of investigations, evidence gathering, evidence cataloguing and the timing of the judicial process. For the discrimination of Mafia language from any other criminal jargon, the lack of theoretical reflection is part of the problem. But if it were possible to mark a line between Mafia and no-Mafia language, it would become necessary to offer linguistic elements that allow algorithms to identify, among millions of conversations, those that could be of investigative interest and, eventually, of preventive use in the fight against crime. Considering the context above, this study is based on the hypothesis that the undeniable semantic load of the Mafia language, rooted in its rural origins in Sicily, can be detected and distinguished from the other languages shared by criminals all over Italy thanks to NLP and ML. A sample of specialised written texts to carry out this analysis will be collected from evidence admitted in court proceedings in Italian. With this dataset, ML algorithms will obtain a model capable of detecting and discriminating the Mafia language. To achieve this purpose, the dataset must first be analysed and processed with specific NLP tools. A null hypothesis is set, predicting that the variable Mafia language and the variable no-Mafia language are the same. An alternative hypothesis is set, predicting that the variable Mafia language and the variable no-Mafia language are independent, meaning they are different and can be analysed so as to recognise them. The process to prove the alternative hypothesis consists of four parts. First, a brief overview of NLP and its applications will introduce the discipline of study and the State of the Art, outlining the most closely related existing research. Second, the methodology applied for the analysis will be described, presenting the NLP and ML tools used for this purpose. Once the methodology has been explained, a description will be offered of the experiment carried out to test whether the Mafia language variable and the no-Mafia language variable are independent, thus different and processable for their discrimination. This dataset will be used to train an ML tool called Weka to discriminate de facto between the two variables. Finally, the results will be discussed, and conclusions will be drawn on whether NLP and ML tools can help detect Mafia language for investigative purposes.

Chapter 2

Discipline of Study

Artificial Intelligence (AI) is the branch of computer science that aims to build computer-controlled machines and software that solve several problems at a time and analyse millions of data points in a few seconds, imitating human brain reasoning structures. In other words, it is the science of training machines to perform human tasks [Lee, 2020]. To fulfil this purpose, AI investigates whether computer systems can communicate naturally with humans: within this area falls Natural Language Processing. So far, computers share information with users through friendly interfaces. Still, communication between humans and machines is not accessible when the user is not an IT specialist, because humans communicate with alphabetic codes called words. In contrast, computers receive and provide binary code data: a language based on eight-digit combinations of 0 and 1 assigned to each character. The human-machine interrelation is further complicated because humans convey information not only with words but with other non-verbal strategies, like silence or eye contact. The information conveyed can be taken for granted, implied, hidden behind a metaphor, tinged with sarcasm and irony, or even mentioned only in a previous conversation. Some of these peculiarities of NL, also called ambiguities, must be solved by computational linguists and engineers, especially now that AI is a reality in everyday life. Conversations with Alexa1 or Siri2, scheduling an appointment with the doctor, or a question-and-answer session with a chatbot to solve internet connection problems seem effortless thanks to the enormous development of NLP: a long journey that started with Alan Turing's Test [Turing, 1950].
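As a small illustration of the point about binary codes (a hypothetical snippet, not part of the thesis tooling), the eight-bit code a computer associates with each character of a word can be listed in a couple of lines of Python:

    # Print the 8-bit ASCII/Latin-1 code assigned to each character of a word.
    word = "Mafia"
    for char in word:
        print(char, format(ord(char), "08b"))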

1 https://developer.amazon.com/es-ES/alexa/alexa-skills-kit/start
2 https://www.apple.com/siri/

NLP studies and provides computationally effective strategies to bring the human-machine interrelation closer to NL, which is less rigid than formal languages like Java, Python, R, or PHP [Cremades et al., 2017]. Both Computational Linguistics (CL) and NLP can be considered names for the same discipline, even if some scientists prefer to separate CL as a theoretical discipline focused more on Linguistics, and NLP as its practical application field. In this paper, the term NLP will be used for both disciplines.

This first introduction shows that Computer science is a field in constant progress and evolution. Its complexity can be seen in Figure 2.1, which aims to encompass the relationships between the various disciplines mentioned in this chapter and the rest of the investigation.

Figure 2.1: Euler Venn Diagram with computational fields and sub-disciplines relationships [Talib et al., 2016]

[Venn diagram labels: Statistics; Data Mining; Machine Learning; AI; Databases; Text Mining; Information Extraction; Document Classification; Information Retrieval; Library & Information Science; NLP; Computational Linguistics.]

After this brief introduction on the main Computer Science disciplines dealing with NL and its processing, the following paragraphs of this section will describe the NLP domains and tools used throughout this research to detect Mafia Language.

2.1 Text Mining

Text Mining (TM) is the field of interest within which this study moves, focusing on the analysis of large quantities of written words such as transcripts of electronic surveillance and statements made in court by defendants in criminal cases. The fact is that "more than 80 percent of today's data is composed of unstructured or semi-structured data. The discovery of appropriate patterns and trends to analyse the text documents from massive volume of data is a big issue" [Talib et al., 2016, p. 414]. At this point it is important to remark that unstructured data consists of text as well. TM is a method of extracting meaning from a massive volume of raw text, as the name suggests, obtaining a knowledge model that represents behavioural patterns for better understanding. Once the extraction of meaning and behavioural patterns has been carried out, TM proceeds with the interpretation and evaluation of the data, verifying that the conclusions obtained are valid and satisfactory. TM also deals with the preparation, probing and exploration of data to obtain information that is not visible. Mining techniques make it possible to address prediction, classification and segmentation problems that are very useful for companies and their business and production approaches, for example following patterns of customer satisfaction with a product or service. Together with Data Mining (DM), TM is linked to ML, as both deal with large volumes of data: DM with databases or tabular data, TM with pure text. ML tools are sometimes used to extract data and text [Talib et al., 2016].

2.2 Corpus Linguistics

Corpus Linguistics is another area that collects and classifies a vast amount of text data for NLP and an analytical approach to NL study by applying linguistic theories to computational methods searching for patterns. Corpus Linguistics takes advantage of the statistical approach to texts by using frequency information about words or sentences and combines these statistical methods with functional interpretations [Mitkov, 2003, pp. 439-441]. Thanks to this statistical analysis based on methodological principles of rigour, transparency, and replicability and the use of co-occurrences modules, Corpus Linguistics can identify linguistic

patterns of language in use and, depending on the research question, make assertions about language use in discourse. TM is often applied to large amounts of text called corpora. "A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research" [Navarro-Colorado, 2021]. In order to analyse large quantities of procedural documents with NLP modules, this study has partly used corpus collection procedures without reaching the large amounts of text required to define it as a corpus. For such reason, the compilation has been defined as a dataset, as explained in the Methodology section.
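As a minimal sketch of the frequency-based approach just described (an illustration only; the thesis performs its frequency analyses with AntConc and RStudio, see Chapter 6), word frequencies over a small text can be computed as follows:

    # Count word frequencies in a toy text; the sample sentence is invented,
    # not taken from the thesis dataset.
    from collections import Counter
    import re

    text = "la famiglia decide e la famiglia comanda"
    tokens = re.findall(r"\w+", text.lower())
    frequencies = Counter(tokens)

    for word, count in frequencies.most_common():
        print(word, count)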

2.3 Natural Language Processing modules

The modularity approach of NLP follows a scheme to retrieve the relevant information during each stage of analysis and provide an output with the interpretation of a given text. Each module fulfils a different task and is deterministic, in the sense that the same input will always produce the same output. The different types of module that form the structure of an NLP analysis will be described in the following subsections.

Table 2.1: Tokens, stems, lemmas and PoS tags for one Mafia dataset sentence

Token      Stem       Lemma      PoS Tag
Picciotti  Picciott-  Picciotto  NOUN
vedete     ved-       vedere     VERB
di         d-         di         ADP
trovare    Trov-      trovare    VERB
un         un-        un         DET
incontro   incontr-   incontro   NOUN
.          .          .          PUNCT

Tokenisation

It is the first module. It breaks down the given text into smaller units called tokens, removing the spaces between them. The tokens can be words, numbers or punctuation marks. This first module of the approach to the NL problem is generally considered solved in NLP, thanks to its excellent performance.

8 Stemming and Lemmatisation

Stemming reduces the inflected form of a word to its root or stem by analysing its inflexion. The lemma is the word as listed in a dictionary, and lemmatisation maps each token to its lemma. Tokenisation, stemming and lemmatisation constitute the first layer of grammatical analysis among the NLP modules.
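A minimal sketch of these first modules, assuming NLTK and its Italian resources are available (the thesis applies lemmatisation and stemming with NLTK, Figure 5.7); the sentence is the one from Table 2.1:

    # Tokenise and stem the Table 2.1 sentence with NLTK's Italian resources.
    import nltk
    from nltk.stem.snowball import SnowballStemmer

    nltk.download("punkt", quiet=True)  # tokeniser models

    sentence = "Picciotti vedete di trovare un incontro."
    tokens = nltk.word_tokenize(sentence, language="italian")

    stemmer = SnowballStemmer("italian")
    for token in tokens:
        print(token, "->", stemmer.stem(token))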

Part of Speech (PoS) Tagger

A PoS Tagger performs a morpho-lexical analysis that "assigns a PoS tag or label to each word of an input text. The tagger first obtains the set of possible PoS tags for each word from a lexicon and then disambiguates between them based on the word context" [Clark and Shalom, 2010]. A PoS label is an abbreviation of the grammatical features obtained with the PoS Tagger analysis. In the example in Table 2.1, the PoS labels are NOUN, VERB, ADP (adposition/preposition), PUNCT (punctuation) and DET (determiner). Different tools might offer different tags, including the verb tense or the number and the case3. For a full understanding of each label, every PoS Tagger has its own list available for consultation.
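A hedged example of PoS tagging with spaCy, under the assumption that the Italian model it_core_news_sm has been installed (python -m spacy download it_core_news_sm); the thesis runs a comparable analysis with SpaCy in Colab (Figure 5.8):

    # PoS-tag the Table 2.1 sentence; token.pos_ holds the universal PoS label.
    import spacy

    nlp = spacy.load("it_core_news_sm")
    doc = nlp("Picciotti vedete di trovare un incontro.")

    for token in doc:
        print(f"{token.text:10} {token.lemma_:10} {token.pos_}")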

Chunker

Chunking is an alternative module to parsing and can be a first step in constructing recursive phrase-structure analyses. It is more limited than full parsing, but sufficient for extracting or ignoring information. It is widely used because it is a faster process, offering more robust results than parsing [Mitkov, 2003].

Parsing

It offers a syntactic analysis output and processes the relationships between words. Once the morphological analysis of the words has been established with the PoS Tagger, the Parser output consists of either a dependency or a constituency tree, as in the following example.

3 SpaCy UDPipe PoS labels can be consulted at: https://universaldependencies.org/u/pos/

(S
  (NP La Mafia)
  (VP
    (V è)
    (NP Cosa Nostra.)))

The sequence of modules might vary, but to fulfil a correct analysis, the input texts must pass through the previous modules of tokenisation, stemming, lemmatisation and PoS tagging. Once the category of each token has been identified, the Parser module disambiguates and offers its best output.
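For the dependency-tree output mentioned above, a small sketch with spaCy (same assumption about the Italian model as before; the thesis parses with SpaCy and UDPipe in Colab, Figures 5.9-5.11):

    # Print a dependency parse: each token with its syntactic relation and head.
    import spacy

    nlp = spacy.load("it_core_news_sm")
    doc = nlp("La Mafia è Cosa Nostra.")

    for token in doc:
        print(f"{token.text:8} --{token.dep_}--> {token.head.text}")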

Semantic Parsing

It is a further step beyond syntactic analysis: a deep parsing that generates logical form graphs from utterances using a Word Sense Disambiguation (WSD) statistical assignment system [Bose et al., 2020]. NLP has managed to offer WSD as a solution for semantic ambiguity by giving a token (in this case, a word) its meaning in a set context, thereby resolving its ambiguity. An example is the "word sense database" WordNet4, a semantic-lexical database of English developed by Princeton University. This type of linguistic analysis improves computer-related writing tasks, such as discourse, search engine relevance, anaphora resolution, coherence and inference. All-Italian examples are Babelfy5 and BabelNet6, "an innovative multilingual encyclopedic dictionary, (...) and a semantic network/ontology in a network of semantic relations, made up of about 20 million entries. It (...) follows the WordNet model based on the notion of synset (for synonym set), but extends it to contain multilingual lexicalisations". Babelfy has been tested with the same sentence as the previous modules, offering the output shown in Figure 2.2, where the algorithm correctly recognised the concepts included in the sentence and offered the definitions retrieved from Wikipedia. The module was asked to provide outputs in English, as for vediamo, trovare or incontro. For Picciotti, the output is in Italian, because there is no correspondence on Wikipedia for

4 https://wordnet.princeton.edu/
5 http://babelfy.org/index
6 https://babelnet.org/

10 English 7. The rest of the concepts have been offered with their correspondence in English. One of the applications of Semantic Parsing is information retrieval [Mitkov, 2003].

Figure 2.2: Sample Mafia dataset sentence output with Babelfy

Once valid solutions to NL ambiguities have been offered, algorithms still need to understand the message implicit in NL and execute what they are asked to do. In other words, they need to process an input and deliver an output automatically, learn from it in a constant flow, and teach other algorithms to replicate and improve in the process.

2.4 Solving ambiguities

To fully understand NL and the messages it conveys implicitly, its different types of ambiguity still represent a problem to be solved by NLP. Since Turing, scientists' interest has focused on solving the ambiguity problem in the three components of NL, applying the modular approach: dividing the vast NLP problem into smaller pieces, literally into small parts, and separating written text from speech. These three components of NL to be solved are Syntax, Semantics and Pragmatics. For example, I saw a bat is a short sentence with different meanings inside the same word category, the noun bat being either a flying mammal or a wooden club. At the same time, this lexical ambiguity can cross categories, as for saw: it can be the Past Tense of the verb to see, the Present

7https://it.wikipedia.org/wiki/Picciotto

Tense of the verb to saw, or the tool saw. A computer must choose among the different options within the same or different categories of words, nouns and verbs, as in MT. The next challenge for NLP is structural or syntactic ambiguity, as in the example: Ann ate a salad with spinach from Cardiff for lunch on Tuesday. In this sentence, with spinach can depend on salad or ate, just as from Cardiff might depend on spinach, salad, or ate. For lunch might depend on from Cardiff, spinach, salad, or ate, and on Tuesday can depend on for lunch, Cardiff, spinach, salad, or ate. The sample can have up to forty-two possible parse trees, as crossovers such as on Tuesday to spinach and for lunch to salad are not allowed. Of course, with conjunctions, as in Ann ate a salad with spinach from Cardiff for lunch on Tuesday and Wednesday, the difficulty and the ambiguities increase. Consequently, semantic ambiguity is the most significant issue, because even after NLP has resolved the NL syntax and the meanings of the individual words, there are still two (or more) ways of reading a sentence, as in the above example with bat. For example, in Lucy and Tom are married, they might be married to each other, or they might be married separately. In Tom kissed his wife, and so did Bob, Bob might have kissed Tom's wife, or his own. In these sentences, the morpho-lexical analysis might be easy. Still, the meaning of the sentences is difficult to retrieve, not only for machines but also for humans, as Pragmatics is involved when context is necessary to understand the utterance. Pragmatics has become a significant focus of NLP research, mainly because of its relevance to computer systems designed to engage in purposeful dialogues with human beings [Mitkov, 2003]. Pragmatics originated in philosophical thought in the work of Searle [Searle, 1980] and Grice [Grice, 1975]. Its academic abstraction makes it difficult to adapt to concrete computational applications, such as anaphoric ambiguity, where, for the economy of language, a phrase or word refers to something previously mentioned, offering more than one possible interpretation. For example, in Ann invited Helen for a visit, and she gave her a good lunch, she is Ann and her is Helen. At the same time, in On the train to London, Tom chatted with another passenger. He turned out to be a linguist, He refers to another passenger. Alternatively, in Tom went to the hospital, and they told him to go home and rest, they corresponds to the hospital staff. Pragmatics also involves metonymy, as in The White House made an announcement today, where the White House is the President's staff, or metaphor, as in the proverb It is raining cats and dogs, whose meaning is that it is raining a lot. Finally, there is ellipsis, the omission from a sentence of one or more words that can be assumed from the context of the remaining elements, as in John can play the guitar, and Ann (can play the guitar,) too.
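The lexical ambiguity of bat discussed above can be made concrete with WordNet, the sense inventory mentioned earlier; a minimal sketch assuming NLTK's WordNet corpus has been downloaded:

    # List the WordNet senses (synsets) of the noun "bat": the flying mammal,
    # the wooden club, and the other readings a WSD module must choose between.
    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)

    for synset in wn.synsets("bat", pos=wn.NOUN):
        print(synset.name(), "-", synset.definition())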

2.4.1 Ambiguities in Mafia language

In terms of ambiguities, it is necessary to spend a few lines on the Mafia language.

First, this language belongs to the Sicilian dialect, generally in its Palermitan variant. In transcriptions, for example, the specialist renders the parlance into Italian, often introducing translations in brackets for specific terms such as munzeddu8 for many, or pisantuliddu, as per Figure 2.3. Having said this, the Mafia language presents ambiguities like polysemy that make

Figure 2.3: Example of Mafia dataset transcriptions with translations Italian-Sicilian dialect

both analysis and comprehension difficult from the lexical point of view. For example, the verb astutare means to extinguish. In his famous pizzini, Provenzano9 uses this verb to avoid the word kill or [Camilleri, 2009]. Another example is the word raggiunamentu, which is not a logical and rational discourse, but an actual settling of accounts for an insult. For the Mafia mentality, the most logical reasoning for an insult is the death of the adversary. In this sense, it could almost be ventured that Sicilian, in Mafia parlance, functions as a "technolect": it uses the structure of Italian and the Sicilian dialect but makes a closed and complex use of polysemy to avoid being understood by other speakers.

All the ambiguities described above show that, despite all the efforts, NL, and even more so Mafia language, is difficult to understand. The primary goal of NLP is to make it understandable to machines, solving these ambiguities and making computers speak any human language [Espunya i Prat, 1994].

8 https://it.glosbe.com/scn/it/munzeddu
9 The so-called boss of bosses from , shared his with Totò Riina.

2.5 Strategies for addressing Natural Language Processing problems

NLP has two different strategies for approaching NL: the first is rationalist, based on manually introduced rules that offer precision in their results. The second approach is corpus-based and dedicated to ML, where a corpus is used for training. ML algorithms extract statistically robust rules from previous training and apply them to a new and unknown corpus. The advantage of this approach is its coverage of a vast amount of data. Both strategies are valid and in use. However, for the analysis in this study, the ML strategy will be used to establish whether the alternative hypothesis of the existence of a Mafia language is viable with a non-annotated dataset, a deliberate choice since this study is a first approach to ML tools for Mafia analysis purposes.

2.5.1 Machine Learning

But how can a machine really understand and reproduce NL? The dedicated field of investigation is ML, a sub-discipline of AI and "a subject that studies how to use computers to simulate human learning activities, and to study self-improvement methods of computers that to obtain new knowledge and new skills, identify existing knowledge, and continuously improve the performance and achievement" [Wang et al., 2009]. This learning activity is faster than human learning and is constantly fed back by new inputs, as per Figure 2.4.¹⁰ There are two main ML approaches.¹¹ The first one is the supervised method. It consists of providing computer systems with specific and codified data, models and examples, in order to build a database of information and experiences. When the machine is faced with a problem, it has to draw on the experiences inserted in its system, analyse them and decide which output to offer based on the already codified experiences. This type of learning, supported by statistical analysis, allows the machine to choose the best response to inputs. Supervised learning algorithms are used in many sectors, as in voice identification: they can make inductive hypotheses obtained by scanning a series of specific problems, offering suitable solutions to general problems.

10 Adapted from the LaTeX tutorial at https://texample.net/tikz/examples/tag/diagrams/
11 https://www.intelligenzaartificiale.it/machine-learning/

Figure 2.4: Machine Learning Life Cycle

[Circular diagram: ML at the centre, surrounded by the cyclical stages Data, Training, Evaluation, Test and Reaction.]

This method will be applied in the experiment to detect the Mafia language variable in the present paper.

The second automatic learning method, also called the unsupervised method, foresees that the input is not codified: the machine can draw on certain information without having any example of its use and, therefore, without knowing the expected results of the choices made. The machine must catalogue all the data (self-collected or given as input by the system), organise it and learn its meaning, its use and, above all, the result to which it leads. Learning without supervision gives the machine more freedom to organise the information and learn the best results for the different inputs.

The supervised method is the most common: all data is labelled, and the algorithms learn to predict the output from the input data. It could be compared to a teacher (the training data) that guides a student (the test data) to learn and apply what has been learned. Conversely, in the unsupervised method all data is unlabelled, and the algorithms must learn the patterns of the input data by themselves. It is a self-taught learning method.
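To make the teacher/student analogy concrete, here is a toy supervised-learning sketch with scikit-learn (an illustration only; the actual experiment in Chapter 6 is run with Weka, and the sentences and labels below are invented):

    # Train a classifier on labelled toy sentences, then predict an unseen one.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "il mandamento decide l'incontro",      # invented "Mafia-like" examples
        "i picciotti portano rispetto",
        "passami la roba stasera",              # invented generic-crime examples
        "il carico arriva domani al porto",
    ]
    train_labels = ["mafia", "mafia", "no-mafia", "no-mafia"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)        # the "teacher": labelled data

    # The "student": an unseen sentence to be classified.
    print(model.predict(["trovare un incontro con il mandamento"]))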

The following section offers a brief overview of NLP practical applications.

2.6 Practical applications of Natural Language Processing

In the beginning, language problems had to be solved solely in terms of numbers. Nowadays, the goal of NLP is to write programs that can understand and generate language material as naturally as possible, both written and spoken. Back in the Fifties, the first applications of NLP theories were focused on automatic, or Machine, Translation (MT) and on written language [Cambria, 2014]. Nowadays, NLP applications can be grouped into the following specialities dedicated to oral and written language, as summarised by Lee [2020]:

• Neural Machine Translation (a dramatically improved automatic translation that uses Artificial Neural Networks to predict the likelihood of a sequence of words, offering a better and natural result);

• Information Extraction (IE), dedicated to extracting unstructured or semi-structured information from thousands of machine-readable texts, as mentioned in the introduction;

• Information Retrieval (IR) dedicated to retrieving information from thousands of pages. Any web search engine (Bing, Google) is an IR application;

• Sentiment Analysis (SA), to measure people’s opinion on a product or service or customers’ needs;

• Q&A like chatbots;

• Robot systems, such as technical support robots, customer service robots, language learning tutors, and companion devices like Alexa.

Smart devices interpret NL thanks to innovative algorithms that digest words and transform them into an appropriate computing representation (i.e. a sequence of 0s and 1s) so that the machine can interpret them and return them in a coherent structure understandable by humans.

2.6.1 Document Classification

Document Classification (DC) is one of the many practical applications of NLP, which consists of retrieving, classifying and interpreting raw texts with ML algorithms [Kastrati and Imran, 2019]. The vast amount of text produced and available daily, and not only on the Internet, requires

an automatic classification. Take the news as an example: press articles would be the documents, and each genre would be one of the classes, such as national, foreign press, financial and sports. Words characterise documents, and one way to apply ML to document classification is to treat the presence or absence of each word as an attribute. For news classification, it is essential to consider the specific weight of the words, because it drives the classification. The word football indicates the class of sports, being of great importance when determining the category of the document (press). In DC, "a document can be seen as a bag of words - a set containing all the words in the document, with multiple occurrences of a word appearing several times (technically, a set includes each of its members only once, whereas a bag may have repeated elements). Word frequencies can be adjusted by applying separate dedicated algorithms" [Witten et al., 2011, p. 97]. Document classification has been used in this investigation to determine whether the Mafia variable and the no-Mafia variable can be automatically classified with NLP tools. Concluding this introductory section of the discipline that frames this study, Figure 2.5 helps to identify Document Classification as the core centre of several disciplines, such as keyword search or sentiment analysis. It also requires the use of the NLP modules already mentioned in Section 2.3.
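A small sketch of the bag-of-words representation described above (the two toy documents are invented, not drawn from the thesis dataset): each word becomes an attribute whose value is its count in the document.

    # Turn two toy documents into bag-of-words vectors of word counts.
    from sklearn.feature_extraction.text import CountVectorizer

    documents = [
        "the team won the football match",    # a "sports" document
        "the bank raised the interest rate",  # a "financial" document
    ]

    vectorizer = CountVectorizer()
    bags = vectorizer.fit_transform(documents)

    print(vectorizer.get_feature_names_out())  # the word attributes
    print(bags.toarray())                      # counts per document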

Figure 2.5: Euler Venn Diagram Document Classifications and the other text-oriented sub-disciplines of NLP and IE relationships [Talib et al., 2016]

[Diagram labels: Keyword search; Information Retrieval; Information Extraction; Text Classification; co-reference/entity resolution; Entity Extraction; Sentiment Analysis; Document Classification; Word Sense Disambiguation; Tokenisation; PoS Tagger; Shallow Parsing; Chunker; Parsing; Lemmatisation; Stemming; NLP.]

Chapter 3

State of the Art

When confronted with an investigation, it is necessary to look around and see how other researchers have tackled the problem. Despite intensive searches on the analysis of criminal or Mafia language with NLP, it has not been possible to find a theoretical background or references on the subject. This absence of previous study and insight on the topic evidences the pioneering aspect of this research. The most similar topic in terms of linguistic approach and computational application is so-called hate speech detection. As mentioned in the introduction, hate speech has transcended boundaries to become a real social problem. Automatic detection of flames and offensive or abusive language is the ultimate NLP frontier, where ML tries to offer solutions to analyse, qualify, identify and censor hate speech among the dozens of exabytes of information created weekly on the Web, in an automatic and unsupervised way [Cambria, 2014]. As a consequence of this increasing interest, the volume of academic articles published on the topic is growing daily. The problem with such an abundance of approaches, studies and points of view is that the researcher looking for information feels overwhelmed. Sometimes, the more texts are consulted, the more confusion clouds the researcher's objective. Among so many varied scientific offerings, the number of articles consulted had to be reduced to a minimum, selecting only those most suitable for the investigation. From each text, the information relevant to the case was extracted. The gist of each text has been used as inspiration to set up an innovative experiment that could be replicated only to a small extent with the collected material, when compared to other research. Moreover, a linguistic analysis

of the Mafia language in the legal context has not yet been carried out with NLP tools.

For the present study, one of the selected investigations is "Offensive language detection using multi-level classification", published by Razavi et al. [2010]. The authors offer a review of the different types and categories of offensive language, such as taunts, references to handicaps, squalid language, homophobia, racism, extremism, crude language, disguise, four-letter words, provocative expressions that may cause anger or violence, taboos and unrefined language. This appalling variety of abusive language poses a classification problem due to its subjectivity when it comes to labelling. It is also a complex classification depending on the context of emission. According to Razavi et al. [2010], from the computational point of view, the relevant factors that can be objectively detected by software are the frequency of words and sentences, where the flaming patterns are weighted and placed in different grades; the words contained in sentences with an abusive/extremist load for each grade; the highest grade in context (maximum weight); and the normalised average of the graded/weighted words or phrases. With these highlights, the authors "designed and implement a fuzzy gauge of flame detection, and implement it in a software that could be modified regarding the acceptable tolerance margin, based on training data, manual adjustment, or even instant labelled contexts" [Razavi et al., 2010, p. 20]. The software used for the training and flame language detection, complemented by a dictionary of "insulting" words, is Weka from the University of Waikato [Witten, 2005], one of the few free software packages available on the net for ML training and testing.

Razavi's article offered a hint about a possible tool to detect Mafia language. However, a dataset was still needed to proceed with the experiment. Looking at how other researchers have solved the problem of a missing corpus, the compilation of a hate dataset offered by Chung et al. is of great interest, being "the first large-scale, multilingual, expert-based dataset of hate speech/counter-narrative pairs", which can be useful to analyse hate speech and help ML tools to detect it [Chung et al., 2019, p. 2819].

Along the same lines, several studies offer hate speech datasets, not only in one language (generally English) but also bilingual ones, as in the case of the study carried out by Rani et al. [2020]. The researchers provide a Hindi-English mixed-code dataset consisting of Facebook and Twitter posts and comments. Their experiment goes even further, as they show that their "deep learning model was able to capture the syntax and semantics of the hate speech more accurately even in the case of an unbalanced and unprocessed dataset" [Rani et al., 2020].

Actually, the fact that most investigative material in NLP is English-based creates a problem for those who venture to analyse minority languages, like Italian. For this reason, a Hindi-English dataset is very welcome. While several academic studies provide a deeper understanding of hate speech on the Internet, no study has been done on criminal language and its automatic detection; not only on the Web, where many criminals share information, but also in SMS, WhatsApp or Telegram messages [Ciconte, 2017]. If nothing has been done on the topic, even less seems to have been done to discriminate the different criminal jargons from the Mafia language, which is relegated to its use of rural life metaphors or paremias. In this regard, it is interesting to mention the number of so-called Mafia dictionaries, such as "Dizionario mafioso-italiano italiano-mafioso" [Ceruso, 2010], or "Voi non sapete" [Camilleri, 2009]. As background reading, other books about the Mafia and its language have been used for this study. First, Prosecutor Giovanni Falcone's extended interview with Marcelle Padovani just before his death, compiled in "Cose di Cosa Nostra". As a man of Law, he was the first ever to evidence the importance of the linguistic code for understanding Cosa Nostra's structure. On a similar line, Prosecutor Prestipino wrote "Il Codice Provenzano", underlining the importance of language codes to fight the Mafia, starting from a description of the boss of bosses' strategy of managing the organisation with letters while hiding from justice for 43 years. The proliferation of this kind of text shows that the Mafia's particular use of words fascinates men of Law and writers, and deserves a proper academic approach from linguists. Coming to linguists, University of Trieste Professor Pontrandolfo compiled an interesting overview of legal corpora in different languages, such as English, German, Spanish and Italian1, where Spain is the country that has managed to collect the most. However, none of those consulted has been helpful for extracting relevant information for this study. Most of them deal with the specialised legal language itself, rather than the different languages used by defendants or witnesses in court [Pontrandolfo, 2012]. The closest study retrieved relating Cosa Nostra to "hard" science is a very interesting article that has provided ample food for thought for this study. As evidenced by the investigation carried out by an international group led by Lucia Cavallaro [Cavallaro et al., 2020] from the

1 http://corpora.dslo.unibo.it/bolceng.html

University of Derby, mathematics can help law enforcement agencies make inroads into Mafia clans. The research started from a real unsolved case, namely the committal for trial of the members of a Mafia gang who had put their hands on contracts for methane gas works. The telephone interceptions between the clan's members and the meeting places (discovered thanks to police surveillance) were reproduced in two networks. With this method, the researchers identified the intermediary passing relevant information between bosses and affiliates. The results were admitted to court and used as evidence in a trial. This practical and socially relevant application of mathematics to the Mafia's criminal network offered an insight into the possibility of analysing and identifying Mafia language with NLP tools, to be used as irrefutable evidence in legal proceedings. With all the above investigations in mind as inspiration, the present study claims the necessity of extending language detection with ML to legal fields, specifically focusing on Mafia language. It also offers a pioneering insight into how this language could be detected automatically with the ML tool called Weka. Should a good result be obtained, it would open the door to technologically advanced ML tools being applied in the legal and forensic linguistics fields.

Chapter 4

Problem and purpose statements

The Mafia is a problem for Italy: a threat to its citizens, institutions and economy. According to a study by Eurispes,1 the Italian Research Institute, the Mafia illegally produces over 220 billion euros a year, equivalent to 11% of Italian GDP. The fight against the Mafia needs new and modern strategies. In this sense, cutting-edge technology can help in its pursuit. Prosecutor Giovanni Falcone [Falcone and Padovani, 1991] himself confirmed that language was fundamental to understanding Cosa Nostra and entering into its workings, previously unknown to justice and investigators. If NLP and ML offer tools for studying and analysing NL, why not apply this knowledge to the law enforcement sector? If it is feasible to detect and analyse hate speech in enormous amounts of data with these novel tools, why not use them to detect other languages, such as Italian criminal language and specifically the Mafia language? This study seeks to obtain insight that will help to address the above-mentioned research gaps: in the first place, a lack of insight from linguists on criminal language, especially Mafia language; in the second place, the lack of dedicated legal corpora for analysing this language, extracted in the courtroom or with electronic eavesdropping methods. It also identifies a paucity of studies taking a practical approach, for investigative purposes, to ML tools to be used against organised crime, which should be addressed. Last but not least, there is the predominance of studies focused on English and very little interest in minor languages like Italian. This study aims to refute the null hypothesis that the variable Mafia language is the same as

1 https://eurispes.eu/mediacontent/siciliainformazioni-it-affari-per-220-miliardi-allanno-le-mafie-sono-una-holding-finanziaria/

the variable no-Mafia language. With the proposed methodology and the experiment described, it should be possible to demonstrate the validity of the alternative hypothesis: that Mafia language identification is feasible. Should it be possible, it would open new opportunities for ML applications allowing Italian law enforcement agencies to improve electronic surveillance methods, discriminating among millions of words only those that could be of genuine investigative interest. At the same time, it would improve the gathering and classification of evidence in penal cases where there is a suspicion of activities falling under the Organizzazione di stampo mafioso aggravating circumstance, as per Art. 614 bis and Art. 17, and it could eventually be exported to other languages and other kinds of mafias worldwide. Having described the main objective, all that remains is to take stock of what has been learned during the present study. Firstly, corpus linguistics and its procedures, partly applied to the dataset used for this experiment. During this first approach to the proper experiment, it has been possible to see first-hand how this work is worthy of the most delicate Swiss watchmaker. In fact, in order to compile even a tiny dataset like the one used in this study, several processes of absolute precision were required to clean, file, classify and refine a set of words with a previously inexistent structure, which is necessary for processing by an NLP tool, as legal documents are. Secondly, the text analysis modules, such as tokenisation, lemmatisation, stemming, PoS tagging and parsers. Thirdly, the programming languages needed for the described evaluations, such as Python and R, with their dedicated packages, their installation, the commands and regular expressions needed not only to carry out the proper analyses but also to load the files with the texts to be analysed, and the far-from-easy export of results into .txt or .csv documents. Fourthly, the use of a programme that is complex from the conceptual point of view more than in terms of use, such as T-Lab. The software provides extensive possibilities for TM, which are not always fully comprehensible to a pure linguist. Finally, the use of another complex programme, Weka, which allows the analysis of large quantities of data with several algorithms, most of which require a great deal of statistical background. Concerning NLP tools in linguistics, it is essential to underline that they are here to stay. For this reason, the academic world must restructure the use of modules such as those used here in a

regulated way, within the academic curriculum, and why not from school onwards. The delimitation of studies at all educational and academic levels as purely humanistic, neglecting what NLP offers for the study of NL, prevents these enormously useful tools from finding their natural habitat in linguistics classes. Thanks to the large and supportive community of UA professors, computer scientists, engineers, experts, NLP and module amateurs, and their online forums and contributions available on platforms such as Github2 and Stackexchange3, blogs and video tutorials like Professor Witten's MOOC4 on Weka, the difficulties that arose during the analysis and the experiment have been solved.

2 https://github.com/
3 https://stackexchange.com/
4 https://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

Chapter 5

Methodology

Sicilian Mafia language detection using NLP tools is a topic not yet studied from an academic perspective. The lack of a precedent for this work has been the biggest problem in establishing a research methodology. For this reason, an empirical trial-and-error approach was embraced for this study, partly following the corpus linguistics methodology for text collection on one side and the standard NLP procedures on the other. Modularity has been used to reduce the text into tokens and then into lemmas, to be analysed with a PoS Tagger and Parsing. The frequency of words has been determined, as well as the keywords. Once the linguistic elements contained in the Mafia dataset had been identified, a semantic analysis was carried out from the quantitative point of view to extract specific features of the Mafia language. However, from the beginning, the real burden of the experiment was focused on training an ML tool to detect the Mafia variable as opposed to the no-Mafia variable using different algorithms. The theoretical underpinning of this research is how NLP approaches linguistic problems with supervised learning, starting with text classification, up to Mafia language identification. In the present study, it was impossible to manually compile a large corpus for Mafia and no-Mafia language: first, for lack of material, as the information on the Internet was not as abundant and easy to manipulate as initially planned; secondly, for a question of time, as it is indeed punctilious, tedious and painstaking work. Furthermore, this being a first and pioneering approach to studying criminal language (and specifically Mafia language), the raw, non-annotated corpus linguistics approach was chosen.

Thus the decision to limit the amount of processed text to a dataset consisting of Mafia and no-Mafia texts selected with the same criteria of time, quality and quantity. The timeframe was set to documentation registered in court proceedings between 2003 and 2018. In terms of quality, the texts had to consist of defendants' conversations or declarations, one set for the Mafia and one for the no-Mafia variable, and the conversations had to be of diverse origin: the dialogues had to be different and varied. From the quantitative point of view, the two datasets had to be sufficient in size and balanced. They had to be divided in a proportion of 80% for the training dataset and 20% for the test dataset, and it was paramount that the dialogue parts contained in the training set were unique and not repeated in the test set. The procedure is outlined in the flowchart in Figure 5.1.

Figure 5.1: Datasets creation and content process flowchart
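As a rough illustration of the split step shown in the flowchart, the following Python sketch divides a cleaned, one-sentence-per-line file into an 80% training and a 20% test portion. The file names and the sentence-per-line layout are assumptions for illustration, not the exact scripts used in this study; writing disjoint slices guarantees that no dialogue part is repeated across the two subsets.

```python
# Minimal sketch of the 80/20 split described above (assumed file names,
# one cleaned sentence per line). Shuffling before the cut keeps the subsets
# varied; the two slices are disjoint, so nothing from the training portion
# can reappear in the test portion.
import random

def split_dataset(path, train_path, test_path, train_ratio=0.8, seed=42):
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    random.Random(seed).shuffle(sentences)
    cut = int(len(sentences) * train_ratio)
    with open(train_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences[:cut]))
    with open(test_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences[cut:]))

split_dataset("mafia.txt", "mafia_train.txt", "mafia_test.txt")
split_dataset("no_mafia.txt", "no_mafia_train.txt", "no_mafia_test.txt")
```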

5.1 Dataset creation

The first step of any linguistic analysis is the collection, selection and classification of significant and exemplary texts of the language. From the outset, it was clear that this collection had to be manual, because no corpus or dataset already available on the Internet met the necessary characteristics for the present analysis.

5.1.1 Mafia dataset

The first of these characteristics was its judicial nature, from which the words of the defendants charged with Associazione di stampo mafioso ("Mafia-style organisation") can be extracted, meaning that Article 416 bis of the Italian Penal Code has to have been applied by the judge, as a conditio sine qua non.

Also, careful Internet research on the Mafiosi's names had to be carried out to check that all the conversations introduced into the dataset came from defendants, and not from witnesses or experts; contributions from non-defendants were eliminated from the dataset first. The texts had to be classifiable as verdicts, custodial or plain judgements, interrogations by lawyers or prosecutors, interceptions of telephone conversations admitted as evidence in a trial, requests for precautionary measures, or memorials from the Public Prosecutor's Office. The material for the Mafia dataset used for this analysis was digitally downloaded from the open-access web pages of the Italian Supreme Court1 and from various archives of prominent Italian newspapers and non-profit journalists' associations2. As the fight against the Mafia is an essential issue for Italian civil society, many websites are dedicated to filing, documenting and organising the main Mafia facts. These pages contain valuable sources such as the ones itemised in Figure 5.2.

Figure 5.2: Judgement retrieved from Archivio Antimafia in Acrobat .pdf file format

The material of the Mafia dataset consisted of twenty-six downloadable .pdf files and two .doc files, for a total of 13.818 pages and 413.316 words from defendants' witness statements, depositions or surveillance transcriptions. Twelve of the files were correspondence between Bernardo

1http://www.italgiure.giustizia.it/sncass/
2https://www.archivioantimafia.org/testi.php

Provenzano and his then right-hand man, still at large and wanted worldwide, Matteo Messina Denaro3, used as evidence in court.

5.1.2 No-Mafia dataset

The material for the second dataset was much more difficult to retrieve because of Italian privacy legislation4. For this dataset, the texts had to fulfil the same requirements as the Mafia dataset. It was important that the defendants had been charged with, and eventually convicted of, organised crime offences. The relevant criterion for distinguishing the two datasets was that the Art. 416 bis aggravating circumstance had not been applied by the judges either in the precedents or proven facts, the grounds of law, the verdict or the judgement. Another criterion was that the information retrieved had to be balanced, in quantity, against the 413.316 words of the Mafia dataset, once cleaned and pre-processed as described in Section 5.2. The greatest limitation on gathering the no-Mafia language variable dataset was the above-mentioned Italian legal framework: sensitive information can be managed exclusively by the defendant's legal representatives. In a first attempt to gather no-Mafia documents and proceedings, Professor Portrandolfo was consulted as an expert in legal corpora, without success, as were the Cattolica di Sacro Cuore University and the Transcrime department (Criminology), developers of Roxanne. In a second attempt, several Italian lawyers were consulted but refused to provide the necessary documentation. Fortunately, one of the lawyers consulted5 had at his disposal material helpful for this study in .txt format: documents without names or references, but used and admitted in legal proceedings. As the file contained no personal data, only words that cannot be related to any identifiable person, it was used for the no-Mafia dataset without infringing any regulations. Most importantly, the defendants' data were protected at all times, remaining unknown to the researcher. The total amount consisted of six files, 860 pages and more than 38.674 words from electronic eavesdropping conversations, five of the files coming from legal proceedings and one included as evidence in a pending trial for drug trafficking.

3Considered the new boss of bosses, after the death of Provenzano and Riina in prison.
4https://www.garanteprivacy.it/il-testo-del-regolamento
5https://www.studiolegalefalleti.it/

5.2 Dataset cleaning

While the no-Mafia dataset was still being gathered, the available files for the Mafia dataset were treated first, in order to settle the procedure described above, which would then be applied in the same way to the second set, avoiding the mistakes and delays of the first attempts.

5.2.1 Mafia dataset

Once it had been verified that each of the twenty-three files contained speech from defendants accused of Associazione di stampo mafioso, the pages with relevant information were separated from the rest of the judicial document macro-structure: the title page, the heading, the background of facts or proven facts, the legal grounds or rulings, as well as the already mentioned interventions of lawyers, judges, witnesses or experts. Some .pdf files, such as the above-mentioned letters, were in a format that could not be converted into .txt, .doc or .rtf with Optical Character Recognition (OCR), because the pages had originally been scanned as images; these were therefore dismissed from the dataset, as per the sample in Figure 5.3.

Figure 5.3: Acrobat .pdf with no OCR layer, dismissed from the Mafia dataset

As an aside regarding OCR, several problems were detected when converting even good-quality .pdf files into .txt. Many glyphs (typed letters or characters) were wrongly interpreted by the converter. In a first attempt, a specific tool available online was used to reconstruct the damaged glyphs,

called "Post-Processing Workbench"6. Once a first version with amended glyphs had been obtained, such as "]" for "ll" or "@" for "à", the file had to be checked manually. This first file-cleaning stage involved long and tedious reconstruction work, with the help of the auto-corrector and constant double-checking against the originals, as per the example in Figure 5.4. It is also worth mentioning that most tools are designed for English, so they fail to provide correct outputs for languages like Italian, which is rich in acute and grave accents and abounds in apostrophes.

Figure 5.4: Reconstruction word by word of supposed “good quality” .pdf file for Mafia dataset

To speed up the cleaning and export of the .pdf files into .txt, RStudio7 and the pdftools and tidyverse packages were used, with partial success, as per Figure 5.5. Once the remaining nine editable files had been reconstructed, the Mafia dataset was further cleaned by normalising the names of the persons mentioned and then removing the non-pertinent parts, such as the parenthetical sentences describing the actions taking place, or the incomprensibile (i.e. unintelligible), ride (i.e. laughs), grida (i.e. screams) or omissis (i.e. intentionally omitted) annotations included by the police officers in the transcriptions, as in Figure 5.6. Beyond the pure speech, the remaining text had to be further cleaned of numbers, punctuation and symbols, following Razavi et al.'s recommendations [Razavi et al., 2010, p. 6]. Commas, full stops, question marks and exclamation marks were preserved, as they might be useful to determine the speaker's emphasis in the sentence.

6Successfully used in the Gutenberg Project and available at https://www.pgdp.net/ppwb/
7https://www.rstudio.com/products/rstudio/

Figure 5.5: Cleaning .pdf for dataset with RStudio [RStudio Team, 2020]

Figure 5.6: Using Notepad++ and regular expressions for cleaning process

With Notepad++8 and the use of regular expressions, useless characters and symbols, empty lines, spaces and tabulations were removed, as well as numbers and capital letters.

8https://notepad-plus-plus.org/
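For illustration, the Python sketch below performs a clean-up equivalent to the steps described above (a hedged approximation, not the Notepad++ regular expressions actually used): it drops the parenthetical transcription annotations, removes digits and stray symbols, collapses whitespace, lower-cases the text, and keeps the sentence punctuation that marks the speaker's emphasis. The file names are assumptions carried over from the earlier sketch.

```python
# Rough Python equivalent of the regex clean-up done in Notepad++ (a sketch,
# not the exact expressions used): drop annotations such as (incomprensibile),
# (ride), (grida), (omissis), remove digits and stray symbols, collapse
# whitespace, lower-case, and keep , . ? ! for the speaker's emphasis.
import re

ANNOTATIONS = re.compile(r"\((?:incomprensibile|ride|grida|omissis)[^)]*\)", re.IGNORECASE)

def clean_line(line: str) -> str:
    line = ANNOTATIONS.sub(" ", line)
    line = re.sub(r"\d+", " ", line)                 # numbers
    line = re.sub(r"[^\w\s,.?!']", " ", line)        # symbols; accents and apostrophes kept
    line = re.sub(r"\s+", " ", line)                 # tabs and multiple spaces
    return line.strip().lower()

with open("mafia_raw.txt", encoding="utf-8") as src, \
     open("mafia_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = clean_line(line)
        if cleaned:                                  # drop empty lines
            dst.write(cleaned + "\n")
```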

When all the pre-processing and cleaning had been done, and the text had been double-checked for missing parts, errors or entries overlooked during the clean-up, the total number of words remaining was 413.316.

5.2.2 No-Mafia dataset

Once the Mafia language dataset had been obtained, and following the requirements of the supervised ML method, the no-Mafia variable dataset was treated with the same cleaning protocol described above. Once cleaned and pre-processed, the total number of words left was 18.667. Taking into consideration that any ML tool for TM needs two different datasets, one for training the algorithms and one for testing them, the extreme quantitative imbalance forced a size reduction of the Mafia language variable. This reduction allowed working with balanced datasets, so that biases towards the majority class could be avoided. The final datasets are detailed in Table 5.1.

Table 5.1: Final dataset words distribution

Dataset     Words
Mafia       18.840
No-Mafia    18.667

The difference of 173 words resulted from the choice not to break final sentences before the full stop, in order to respect the meaning of the paragraph. The dataset was transformed into two .txt files in UTF-8 encoding to prevent errors with characters such as accents and apostrophes. Another file was created with an Italian stop-word list of 674 items, including articles, adverbs and auxiliary verbs. Stop-word lists are necessary during pre-processing to keep only the meaningful words, by removing the above-mentioned items with no relevant semantic weight.

5.3 Dataset processing

Once cleaned, both datasets had to be processed to extract the relevant information for the present study. The NLP standard procedure was followed using RStudio, NLTK and SpaCy, as

mentioned in the Discipline of Study paragraph.

5.3.1 Mafia dataset

Tokenisation

The tokenisation of the Mafia language dataset was carried out with SpaCy in a Google Colab9 environment, which allows Python to be used without installing any software. SpaCy, as per its web page, is "a free, open-source library for advanced Natural Language Processing (NLP) in Python (...) designed specifically for production use (...) [to] 'understand' large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning"10. This first approach to the dataset gave the expected result, splitting the text into individual words and punctuation marks.
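A minimal sketch of this tokenisation step is given below. The Italian model it_core_news_sm is an assumption, since the exact model used in Colab is not named in the text.

```python
# Minimal spaCy tokenisation sketch in the spirit of the step above; the
# Italian model it_core_news_sm is an assumption.
# In Colab: !pip install spacy && !python -m spacy download it_core_news_sm
import spacy

nlp = spacy.load("it_core_news_sm")
doc = nlp("Carissimo mio, spero")      # fragment quoted later in Figure 5.9
print([token.text for token in doc])   # each word and punctuation mark is a token
```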

Lemmatisation and Stemming

For these two modules of morpho-lexical analysis, SpaCy was replaced by NLTK, another Python-based tool. The results were not as expected: the module did not succeed in obtaining a correct root or lemma for any word, as per Figure 5.7.

Figure 5.7: Mafia language dataset Lemmatisation and Stemming with NLTK
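As a hedged illustration of this step, the sketch below shows how the NLTK stemming and lemmatisation calls are typically invoked; it is not the exact notebook used here. One plausible explanation for the unusable lemmas is that NLTK's WordNetLemmatizer is English-oriented, whereas the Snowball family does ship an Italian stemmer.

```python
# Sketch of typical NLTK calls for this step (not the exact notebook used).
# SnowballStemmer ships an Italian stemmer; WordNetLemmatizer is English-
# oriented, which is one plausible reason why no correct Italian lemma was
# returned in Figure 5.7.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = SnowballStemmer("italian")
lemmatizer = WordNetLemmatizer()

for word in ["discorso", "mandamenti", "parlare"]:   # words from the dataset
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
```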

PoS Tagger

On the other hand, the results obtained in the morphological analysis with the PoS Tagger using SpaCy improved slightly. The module identified most grammatical categories, as shown in

9https://colab.research.google.com/notebooks/intro.ipynb
10https://spacy.io/usage/spacy-101

Figure 5.8. A few exceptions occurred, such as the plural masculine noun picciutteddi. The truth is that picciutteddi, the diminutive of picciotti, belongs to the Sicilian dialect (in Italian it would be picciottini), so the module's dictionary does not include it. Another possible explanation for its not being recognised as a noun, and for its wrong classification as a VERB, could be its position between full stops, resembling, for example, an Italian imperative ending in -i. Misclassifications of this kind appeared frequently during the analysis, especially with dialectal speech.

Figure 5.8: PoS Tagger with SpaCy in Colab

Parser

For the syntactic analysis with a parsing tool, the Colab environment with SpaCy and UDPipe11 was used, with similarly interesting results; as far as the experiment is concerned, this pre-processing was only meant to retrieve background information about the dataset. UDPipe offers syntactic analysis through a Python prototype: it can perform tagging, lemmatisation and syntactic analysis in more than 50 languages, Italian among them.

Small samples of the Mafia dataset were parsed, with some wrong outputs (marked in red) for Italian words, as per the parse tree in Figure 5.9. In "Carissimo mio, spero", spero is the main-clause verb with the omitted subject io, a singular masculine pronoun; the module wrongly identified it as a noun. The same happened with the verb dico in the otherwise correctly parsed sentence (marked in green), which was labelled as a nominal modifier.

11https://ufal.mff.cuni.cz/udpipe/2

Figure 5.9: Example of sentence parsing with SpaCy and UDPipe in Colab
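A minimal sketch of this PoS tagging and parsing step is given below; it assumes the spacy-udpipe wrapper, one common way of combining SpaCy and UDPipe in Colab (the exact package used is not specified in the text), and it reuses the fragment quoted above.

```python
# Minimal PoS tagging and dependency parsing sketch with SpaCy and UDPipe,
# assuming the spacy-udpipe wrapper.
# In Colab: !pip install spacy-udpipe
import spacy_udpipe

spacy_udpipe.download("it")            # Italian Universal Dependencies model
nlp = spacy_udpipe.load("it")

doc = nlp("Carissimo mio, spero")      # fragment quoted in Figure 5.9
for token in doc:
    # surface form, universal PoS tag, dependency label, syntactic head
    print(f"{token.text:12} {token.pos_:6} {token.dep_:10} {token.head.text}")
```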

5.3.2 No-Mafia dataset

Once the first results of the rough analysis of the Mafia dataset had been obtained, the No-Mafia dataset was submitted to the same pre-processing procedure.

Token, Lemma and Stemming NLP modules

Again, tokenisation, lemmatisation and stemming were applied. In a first attempt with the NLTK modules, the PorterStemmer and the SnowballStemmer for Italian gave poor outputs: the lemmas obtained for the verbs were wrong. After this first unsuccessful attempt, SpaCy was used, and it again produced successful outputs, as it had for the Mafia dataset.

PoS Tagger

Once the first results had been obtained, the PoS Tagger with SpaCy was applied, with results as good as those obtained for the Mafia dataset, as per Figure 5.10.

Figure 5.10: No-Mafia dataset parser output with SpaCy UDPipe in Colab

Parser

As with the Mafia dataset, the SpaCy UDPipe parser was used in a Colab environment on small parts of the No-Mafia variable. The same positive results were obtained, confirming the validity of this tool for subordinate clauses, as per Figure 5.11. In the sample, the subordinate temporal clause introduced by Quando was identified, and the xcomp clause, with subject ellipsis, was also correctly detected, confirming the parser's good results for Italian12.

Figure 5.11: No-Mafia dataset PoS Tagger and dependency parsing analysis with SpaCy in Colab

5.3.3 Partial conclusions on datasets content processing

The analysis of small portions of the Mafia and No-Mafia datasets with free NLP modules provided reasonable reliability for word and sentence structure analysis. In particular, SpaCy UDPipe is better adapted to Italian than the NLTK PorterStemmer and SnowballStemmer packages. However, it was almost impossible to check the efficiency of these tools on a dataset of thousands of words, sentence by sentence, so as to detect every incorrect output. For this particular study it was necessary to choose, among the many tools available on the Web, the ones that support Italian because, as mentioned in the previous chapter, many powerful applications are designed and developed for English analysis only. In comparison, the no-Mafia dataset produced better outputs, one of the main reasons being that its language is almost entirely standard Italian and contains less dialect. Despite these apparent limitations, the application of standard NLP modules to corpus linguistics is a convenient approach. In fact, ML algorithms automatically apply the same

12https://universaldependencies.org/u/dep/all.html/al-u-dep/xcomp

modules used in this analysis in order to learn in an orderly way. However, for the specific purpose of this study, these modules were not strictly necessary either for the pre-processing or for the experiment; they served as a tool for a first approach to a large amount of words. In the next part, a detailed study of word frequency, keywords, topics and co-words is carried out to identify the words with the highest semantic weight in the two datasets. A comparison of their usage is then offered to determine whether there are differences that make discrimination possible from a semantic point of view. Finally, the experiment itself is described, in order to test the validity of the alternative hypothesis.

Chapter 6

Research

The chapter on methodology offered a first approach to the NLP modules. Still, their application did not shed any light on the existence or non-existence of a Mafia language. For example, the dialect in the text might itself be a confounding variable, which would explain the errors found in the Mafia dataset and their near absence in the no-Mafia dataset. It was therefore necessary to investigate the semantic aspect of the words in context; otherwise, any dialectal word might be wrongly recognised as Mafia language by the NLP tools. To do so, two phases of the experiment were proposed: the first consisting of content analysis, and the second dedicated to training an ML tool to evaluate whether the Mafia language can be discriminated between the two datasets. As mentioned in Section 2.1, Text Mining is the field that offers data interpretation and evaluation, verifying that the conclusions obtained are valid and satisfactory while gathering information that is not self-evident.

6.1 Content Analysis

Again, for this part of the content analysis each tool was used on small samples of the dataset, each different from the others and chosen randomly. This choice depended on several factors. Firstly, there is no precise rule defining the amount of text needed for a good analysis. Secondly, the material available was of uneven quality. Indeed, from a first reading made to select and catalogue the material, it was clear that the language a defendant uses before a court is not the same as the language used between peers. In front of a court, depositions are negotiated with the lawyer, so the language is not spontaneous: the defendant is severely hampered by the fact that he may be convicted and sentenced to several years in prison, and he therefore controls his speech to make a good impression or to say only the bare minimum.

Furthermore, it should not be forgotten that while the tools used are powerful, the computing resources (memory, processor, Wi-Fi) available to run them are not: with a large volume of text, the analysis would slow down and sometimes even crash. It is also worth mentioning that although the whole dataset had been cleaned as much as possible with NLP tools (and even manually, as mentioned in the Methodology), there were always noisy outputs with wrong symbols or characters, such as stray accents or "@", that had to be added manually to the stop-word list before each test. For an easy-to-follow structure, the step-by-step procedure is shown in the flowchart in Figure 6.1.

Figure 6.1: Step-by-step content analysis procedure flowchart

6.1.1 AntConc

In this part of the analysis, the first tool used was AntConc [Anthony, 2019], a tool designed specifically for corpus analysis. Thanks to AntConc, it was possible to compute word frequencies for a 19.040-token sample from the Mafia dataset and a 7.503-token sample from the no-Mafia dataset. As mentioned at the beginning of this section, the samples were randomly selected. With regard to size, there is no predefined rule for a good analysis, since "the size of the reference corpus is not very important in making a keyword list" [Tabbert, 2015, p. 61]. The semantically weighted words were identified with the help of the background bibliography already mentioned: Falcone and Ceruso for the Mafia dataset and Ciconte [Ciconte, 2017] for the no-Mafia dataset. The results are shown in Table 6.1. For the Mafia dataset, discorso -i was the first entry, in a social environment where l'omertà, the code of silence, prevails. Until Prosecutor Falcone obtained the linguistic code for understanding the Mafia from Buscetta, in-depth knowledge of the Mafia structures was unthinkable, precisely because of this silence [Falcone and Padovani, 1991, p. 52].

Table 6.1: Keywords frequency with AntConc

Words             Mafia   No-Mafia
discorso -i         433         36
zio                 108         23
mandamento -i       104          2
accordo -i           94          7
cristiano -i         78         14
picciotto -i         58          7
capomandamento       25          0
compare               0         60

According to Buscetta, this code of silence was extended within the organisation's structure to hide its existence and activities from outsiders, and within the organisation itself for self-protection, creating obstacles to the flow of information between the different levels. Cosa Nostra was then so "discreet" in its activities that many politicians and even members of the Security Service insisted that it did not exist [Falcone and Padovani, 1991, p. 50]. Silence is also a question of self-protection among men of honour: it is preferable not to speak at all than to tell a lie, because only the truth may be spoken between mafiosi, and a lie can be a death sentence. In short, ellipsis of information is preferable to verbosity, to avoid saying things one might regret. From this obligation not to lie between members of Cosa Nostra the proverb is born: "the better word is what you don't say"1. Another important consideration is that, according to Falcone, "A member of Cosa Nostra does not consider himself as a vulgar criminal" (1991, p. 59); hence he does not speak like an ordinary criminal either, so as not to be confused with one. With these considerations in mind, the keywords discourse (discorso) and agreement (accordo), together with mandamento and capomandamento, indicated Cosa Nostra's need to reach agreements among peers, accordo being the fourth most frequent word. Conversations of this kind are of exceptional value in an investigative context for understanding the inner workings of Cosa Nostra and its changes over time. In this sense, the use of nouns belonging to the "diplomatic" semantic field showed that, despite the Mafia's structure of independent families grouped locally into mandamenti, diplomatic management is required to maintain good relationships and ensure their prosperity. The detection of the keywords cristiano and picciotto was also interesting. It is important to remark that between uomini d'onore the most common appellation is not signore, which is

1“’A megghiu Parola è chidda ca ’un si dici. ”

an insult for a mafioso, but cristiano: that is, someone of a certain level inside the organisation's structure. On the contrary, a picciotto is low-level workforce or an executive arm: someone who collects protection money (il pizzo) or a killer. The Family's boss is called zio, Italian for uncle2. As an aside, uncle, son and father are common terminology in the semantic field of Family, as Cosa Nostra is conceived as a cluster of families; in the Mafia structure, blood ties are at the base of the hierarchy and part of its success, as it is difficult to betray a member of one's own Family. For the no-Mafia dataset, the same words identified in the Mafia dataset were searched for. The keywords zio, discorso, accordo, cristiano, picciotto and mandamento appeared in both datasets; capomandamento and compare each appeared in only one of them. Each keyword was identified in each text and its concordances were reviewed, with the following conclusions. The word zio3 was mentioned, but it was addressed to a member of another criminal group, called Zio Mino, who did not belong to Cosa Nostra; in any case, uncle is of course a widespread term of address and, by itself and outside a mandamento context, it should not be considered a discriminative term for a Mafia chief. Another word in common, although with a different frequency within the text, perhaps due to its length, was cristiano. According to Falcone, Camilleri and Ceruso, the term refers to a man of honour close to the top of the Cosa Nostra hierarchy. Now, according to Enzo Ciconte [Ciconte, 2017], this borrowing from Mafia language has permeated many criminal groups and mafias, such as the 'Ndrangheta from Calabria or the Camorra from Naples. Reviewing the text in question from the no-Mafia dataset, one of the wiretaps included the conversation of a petty criminal with 'Ndrangheta and Mafia links, which justifies the use of words such as zio, cristiano, mandamento or picciotto in the text. Yet the relevance for the Mafia dataset was marked by the specific and exclusive use of capomandamento and mandamento, with a pending clarification about the position of discorso and compare. As per the Italian dictionary4, compare is a specific term for the male godparent, extremely common in Italy. The bibliography about Italian mafias and criminal associations was relevant in determining that compare is often used in criminal groups to refer to their members, but not in Cosa Nostra.

2Cosa Nostra does not use the word Padrino in their vocabulary.
3https://it.wikipedia.org/wiki/27NdrinaMancuso
4https://www.treccani.it/vocabolario/

The keyword discorso deserves a separate explanation. According to the Italian dictionary already mentioned, it means a communication, speech or piece of writing about a particular, usually serious, subject. In the No-Mafia dataset, discorso was used precisely in this sense. In the Mafia language, however, a discorso is a pact, an agreement, a law that must be respected and that only le commissioni, at the top of the organisation, can negotiate when necessary. Once it had been established that word frequency allowed a qualitative assessment of the importance of one electronic eavesdropping transcription compared to others, other NLP tools were used to validate their effectiveness.
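For readers who prefer a scripted view, the following Python sketch approximates the frequency list that AntConc produces and the hapax legomena discussed in the next subsection; the file names and the stop-word file are the hypothetical ones used in the earlier sketches, not the tool actually employed.

```python
# Rough Python approximation of an AntConc-style word list and of the hapax
# legomena discussed below; stopwords_it.txt stands in for the 674-item
# Italian stop-word list described in Section 5.2.2.
from collections import Counter

with open("stopwords_it.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def keyword_counts(path):
    with open(path, encoding="utf-8") as f:
        tokens = f.read().split()
    return Counter(t for t in tokens if t.isalpha() and t not in stopwords)

mafia = keyword_counts("mafia_clean.txt")
print(mafia.most_common(10))                       # e.g. discorso, zio, mandamento ...
hapax = [w for w, n in mafia.items() if n == 1]    # words occurring exactly once
print(len(hapax), "hapax legomena")
```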

Hapax Legomena

Hapax Legomena are words that occur only once in a text. Several unique words were detected in the Mafia language dataset, because the persons involved were many, each with their own idiolect. In spite of this, a first interesting result was offered by the word molluschi, the Italian plural of mollusc. The relevant characteristic of molluscs is their lack of a spine; because of this condition as a spineless animal, the word is used mockingly to describe someone weak-willed. The only record of this use of the word mollusc in a Mafia context, outside this study and the dataset, was an old interview5 with Luciano Liggio, ex-boss of Corleone, sentenced for Mafia association during the Maxi Trial. Thanks to AntConc, it was possible to identify the use of this same word in letters written by the still wanted new boss of bosses, Matteo Messina Denaro, contained in the Mafia dataset: in a total of 2.030 tokens, the analysis of the letters gave molluschi as a unique word. The word was then carefully searched for in the rest of the dataset, without success. It remains a unique word connecting an old boss to the new one, still a fugitive from justice: a peculiar connection indeed. Turning back to the No-Mafia dataset, its analysis offered insight into the specific semantic field of prostitution. All the background reading about the Mafia included in the bibliography agrees that prostitution is not considered a decent "business" for the honourable society, as Cosa Nostra considers itself.

5Retrieved from an interview by the Italian journalist Enzo Biagi, 1983 available at: https://www.youtube.com/watch?v=BZIqOeDnrGQ

It might be tolerated as a side business by some mobster connected to the Mafia, but not by a real Mafioso6. No word linked to the semantic field of prostitution, directly or indirectly, appeared in the Mafia dataset. The No-Mafia dataset, on the other hand, contained the following terms from the sexual field and the exploitation of women, even if each appeared only once: pompini, puttana, preservativi, topless, troie. The references to the Mafia inside the no-Mafia dataset were accounted for by the Hapax Legomena mafiosi, Totò Riina and Sicilia; once checked in context, they were, again, mentioned casually and not as language in use. In order to further investigate the datasets' contents with NLP tools, RStudio was chosen from among the many freely available on the Internet.

6.1.2 RStudio

RStudio is an open-source integrated environment that eases the application of the R formal language, allowing the use of different packages developed by R users, developers and individuals who freely improve the tool for text analysis. For the present investigation, the packages used were pdftools, tidyverse, quanteda and dplyr7. In particular, RStudio allowed the sample text to be split into sentences and offered a second insight into word frequency, complementing the results obtained with AntConc. RStudio is supposed to make text analysis easier for those who are not IT experts; the truth is that some training is still needed by a linguist who ventures to use this powerful tool without previous knowledge of formal languages. Returning to the analysis of the Mafia dataset, RStudio helped determine some frequencies in the small sample used, as per Figure 6.3. As can be seen from the results, the most frequent words were the two auxiliary verbs to be (essere) and to have (avere), not filtered out by the Italian stop-word list. The names and surnames corresponded to one of Cosa Nostra's foremost exponents in Palermo, the former mayor of the city, Ciancimino, convicted of "mafia-style association" at the beginning of the century.

6Cosa Nostra "values" measure and decency as much as honour, respect for one's word and the truth, silence and loyalty to the Family. In fact, there is a deep contempt for the ostentation of money and for libertinism. Among other things, a man of honour cannot make a living from the sexual exploitation of women, but must provide for his parents, wife and children. [Falcone and Padovani, 1991, pp. 88-90]
7Libraries used included stringr, ggplot2, scales, tidyr, widyr, ggraph, igraph, quanteda, topicmodels and cvTools.

Figure 6.2: Mafia dataset splitting into sentences with RStudio

Donno is the surname of a member of the Carabinieri. The rest of the words belong to specialised Italian legal language, such as magistrati, Stato, dichiarazioni.

A final look at a word cloud offered further insight into the quality of the data processing with RStudio, following the adage that "one image is worth a thousand words", as per Figure 6.4.

This example was one of many that led to the removal of defendants' courtroom statements from the final Mafia dataset intended for Weka training and testing.

First, as mentioned, the names of Ciancimino and his son Massimo, of Carabiniere Donno, of the magistrates and of the Italian Government were the most frequent words, but the rest, in green, did not offer further relevant information for this study of the Mafia language.

Similar experiments were carried out on the no-Mafia dataset, with no findings beyond those obtained with AntConc. In this case, however, no section was eliminated from the already small sample.

After discarding dataset parts consisting of defendants’ courtroom statements for the final

Figure 6.3: Mafia dataset words frequency experiment with RStudio

experiment, a deeper analysis was carried out on the texts derived from the electronic eavesdropping of Mafia members. It is important to remember that all the transcriptions used are included in judicial documents, officially admitted as circumstantial evidence in Mafia or criminal association cases.

6.1.3 T-Lab

The sentiment analysis was carried out with T-Lab8, a multilingual software package made in Italy. This complex but easy-to-use tool applies different algorithms and statistical analyses for an audience not specialised in computer science or in formal languages such as R or Python and their packages. Both data input and output are extremely visual, providing an immediate understanding of complex statistical data when compared with purely numerical results. The keywords detected during the first part of the analysis were confirmed: mandamento and capomandamento as words exclusive to the Mafia dataset, compare as a word exclusive to the no-Mafia dataset, and discorso as a word common to the two datasets. With T-Lab, these words were further analysed with some of the available tests. The first test, on topic analysis, evidenced the specific words found in proximity to discorso, such as autorizzare, already detected by AntConc: meaning that agreements and discourses must be authorised by the heads of the families that sit on La commissione. For example, the name of

8https://www.tlab.it/

Figure 6.4: RStudio words cloud experiment with sample of Mafia language dataset

Benedetto (the head of the Villagrazia-Santa Maria del Gesù mandamento) was the word with the highest probability of being found close to discorso, as per Figure 6.5.

Indeed, in the Mafia dataset discorso was not a form of communication. It was loaded with the meaning of a rule, a law that keeps the organisation's co-existence; a norm that, for Cosa Nostra, had to be negotiated, agreed, discussed and mediated. It is not univocal, but is issued by a limited number of people endowed with decision-making and organisational power, who can modify the discourse according to the situation and the moment. Hence its inclusion in the diplomatic theme of mediation and clarification, together with the Commission.

In the same topic analysis applied to the no-Mafia dataset, the keyword discorso was surrounded by an ample number of generic words such as persona, uomo, accettare, interessare, as per Figure 6.6.

There was a larger number of topic words surrounding the keyword discorso in the no-Mafia dataset, which evidenced the everyday use of discourse as communication. Also, in this common-use context (Figure 6.6), it grouped with a higher number of shared (in blue) and specific (in red) words, giving further evidence that in the no-Mafia dataset discorso was not a keyword.

Thanks to the relatively simple graphical results obtained with T-Lab, it was possible to deepen the semantic analysis of a large text to extract relevant information. The second type of

Figure 6.5: Mafia dataset topic analysis result for discorso

Figure 6.6: No-Mafia dataset topic analysis result for discorso

analysis with T-Lab consisted of a tree map with the co-words for the “diplomacy” theme centred on discorso. Co-occurrences are quantities resulting from counting the number of times two (or more) lexical units are simultaneously present within the same elementary contexts. Each shade of colour represents a cluster of words based on the same topic, semantic field or common theme.
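T-Lab computes these co-occurrence counts internally; purely as an illustration of the quantity being counted, the hedged Python sketch below counts how often two lexical units appear within the same elementary context, approximated here as one cleaned line of the dataset (file name and keyword set carried over from the earlier sketches).

```python
# Illustrative sketch of the co-occurrence count described above (T-Lab does
# this internally): two lexical units co-occur when they appear in the same
# elementary context, approximated here as one line of the cleaned dataset.
from collections import Counter
from itertools import combinations

def co_occurrences(path, keywords):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            present = sorted({w for w in line.split() if w in keywords})
            for pair in combinations(present, 2):
                counts[pair] += 1
    return counts

pairs = co_occurrences("mafia_clean.txt",
                       {"discorso", "mandamento", "zio", "accordo", "cristiano"})
print(pairs.most_common(5))   # e.g. how often discorso and mandamento share a context
```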

Following the same pattern as the topic analysis, it emerged that the keyword discorso appeared next to mandamento and zio, as per Figure 6.7. However, picciotti was far removed from the "diplomatic theme" clusters, confirming their role as minions, distant from any possible negotiation. When compared with the no-Mafia dataset, as per Figure 6.8, words such as zio, parlare, capire and discorso had different co-occurrences. Predominance was given to words such as understand (capire) or compare, which, as mentioned previously, did not belong to the Cosa

Figure 6.7: Mafia dataset co-word analysis result for the theme discorso

Nostra structure, evidencing that this discorso occurred in a different context from that of the Mafia dataset.

Figure 6.8: No-Mafia dataset co-word analysis result for discorso

Once the different semantic weight of discorso in the two datasets had been determined, a further analysis was carried out on the two words unique to the Mafia dataset in the specific sense of the territorial influence of a Mafia family: mandamento and capomandamento. As per Figure 6.9, the relationships of mandamento and capimandamento with famiglia, cristiano, rappresentare (to represent) and noi (us), and with some of the mandamenti that constitute Cosa Nostra in the Palermo area, were evident: peace (pace) was the aim of the meeting (riunire), at least for the mandamenti of Resuttana, Cruillas Uditore or Altarello. Another example was offered by the analysis of co-words, the words semantically closest to the discourse theme.

Figure 6.9: Cloud of words for mandamento and capimandamento in Mafia dataset

In Figure 6.10, the relationships between the mandamenti, their heads (capo) and especially the verb riunire (to meet) were extremely evident.

Figure 6.10: Co-word analysis result for mandamento in Mafia dataset

Since mandamento appeared only twice in the no-Mafia dataset, it was impossible to carry out this same type of analysis. The word discorso was then tested again, to definitively remove it from the keyword lists of the no-Mafia dataset, as per Figure 6.11, where the keyword was lost among other verbs such as parlare, mandare, chiamare: ordinary words that gravitate around a discourse in the most general sense of communication, as they could for any speaker, criminal or not. A final cloud of words was then attempted with cristiano, confirming that the use of this keyword was utterly different from that of the Mafia dataset, where it represented a Cosa Nostra family

Figure 6.11: No-Mafia dataset cloud of words analysis result for discorso

member. In Figure 6.12 there were few related words; one of them was buono, meaning a good person in the sense of a follower of the Catholic religion.

Figure 6.12: No-Mafia dataset co-word analysis result for cristiano

In this last case, there was a clear relationship between cristiano and compare. When analysed in the text, the two words appear in the conversation quoted on page 51, detected by the police, between a petty criminal and un picciotto, who wanted to compare his cronies, or compari, with cristiani, members of Cosa Nostra.

6.1.4 Partial conclusions on the content analysis

All the tests described above revealed specific features of the Mafia language and keywords extracted with NLP tools. The tests focused on the semantic analysis of different types of keywords highlighted in Chapter 5. The results obtained showed that the Mafia language variable was a line of investigation worth pursuing more precisely and in-depth since there are distinctive elements, such as the specific use of words like mandamento and capomandamento, and the specific use of a common term like discorso.

The possible investigative use of NLP tools could allow the extraction of crucial information in the fight against the Mafia, such as, for example, its structure in mandamenti and their geographical districts, possible meetings of the commissions and their participants or, as in the last example, potential connections between criminal organisations. The fluidity with which Cosa Nostra adapts over time makes interceptions of this kind a valuable element for investigations. If automatically detected and analysed with NLP tools, they would greatly facilitate the first investigative phases of collecting, classifying and interpreting the relevant data. Having carried out the preliminary analysis necessary to find elements that could support the alternative hypothesis, i.e. that Mafia language contains distinctive elements compared to no-Mafia language, it is now time to move into the ML framework and empirically test whether the predictions made in this chapter come true.

6.2 Machine Learning experiment

The aim of this experiment was to test empirically whether the alternative hypothesis is valid: whether the Mafia language variable is different from the no-Mafia one, and to what extent the elements analysed in the previous part of this research can be automatically detected with a simple "bag of words" approach, as mentioned in Section 2.6.1, p. 17. The datasets were tested to add value to the above features with an ML tool; for this investigation, the tool chosen was Weka. Weka is both a New Zealand bird and the acronym of "Waikato Environment for Knowledge Analysis", a potent ML tool for Data and Text Mining written in the Java formal language. Weka is used in DM to extract interpretable patterns from thousands of raw data points, and most of the examples of how to use it concern statistics and prediction with numerical variables. Alternatively, as Professor Witten would say, "We have already encountered an important text mining problem: the classification of documents, where each instance represents a document and the class of the instance is the subject of the document" [Witten et al., 2011, p. 387]. This experiment is text-based and each word is an attribute. Weka works with a specific file format called .arff (Attribute-Relation File Format), similar to .csv files. To train and test Weka, it is necessary to create two subsets: one for training and one for testing the accuracy of the different DC algorithms. In the analysed datasets, all the words identified first with AntConc and then by the content analysis

in Section 6.1.2 were not helpful for the experiment, because they appear in any everyday conversation. During training and testing, thanks to its algorithms Weka makes a selection between useless words, keywords and hapax legomena. Useless words can be eliminated a priori: this is not a major problem, as the utility detects and eliminates them using the Italian stop-word file loaded into the programme. Hapax legomena occur so rarely that they are unlikely to be useful for classification, because "Paradoxically, almost half of the words in a document or corpus of documents appear only once. Another problem is that the bag-of-words (or set-of-words) model does not consider word order and contextual effects. There is a strong case for detecting common phrases and treating them as individual units." [Witten et al., 2011, p. 387] For this experiment, the task is a document classification problem in which, given sentences, the algorithm has to determine whether they belong to the Mafia language. Therefore, there are two known, assigned classes for the training documents: the Mafia dataset has been assigned a 1 and the no-Mafia dataset a 0. These were matched with a mixed Mafia/no-Mafia test dataset labelled with a question mark, so that the utility had to associate each sentence correctly with one variable or the other. The following step-by-step paragraphs describe the document classification experiment, drawn along the lines of Witten's tutorial [Witten, 2005].

Figure 6.13: Weka procedure and settings for experiment flowchart

The raw data consisted of the Traindataset.arff file, which allowed Weka to analyse the sentences and create a dictionary of terms from all the sentences classified with the numerical attributes 1 and 0, Mafia and no-Mafia respectively. The numeric representation of each term was obtained using Weka's StringToWordVector attribute filter. The steps of the Weka experiment are outlined in the flowchart in Figure 6.13.

6.2.1 Datasets adaptation

Before transforming the dataset into .arff files for Weka, a selection had to be made. Firstly, because the non-relevance of interrogations during trials had been established in the processing step in Chapter 4; for this reason, it was necessary to include only those parts of the texts where the electronic eavesdropping was transcribed. Secondly, because of the poor quality of the original texts of the Mafia language variable, many conversations were dismissed. The finally selected data, as per Table 6.2, had to be divided into four different subsets, split in a proportion of four to one, as recommended by Waikato University Professor Ian Witten in the Weka MOOC tutorial, lesson 2.4, on Document Classification9. As already mentioned, the small difference in word counts between sets depended on sentence length: each cut was made at a full stop, so that a coherent sentence was included. Stand-alone interjections at the end of the dataset were removed because they were not considered relevant for this research on the Mafia language variable.

Table 6.2: Final dataset distribution

Dataset     Words    Training   Test    Variable
Mafia       18.840   14.851     3.989   1
No-Mafia    18.667   14.791     3.876   0

Once the four final datasets had been obtained, two for training and two for testing, they were transformed into ARFF format (.arff), as per the subsection below.

Arff files creation

Creating an .arff file is an apparently easy task: the file consists of a heading that has to be structured in a specific way, followed by each sentence with its classification as 0 or 1 for the training dataset. Conversely, the test dataset is entirely labelled with "?", as Weka is supposed to learn from the training set how to interpret the "?" and classify each instance as 0 or 1. To convert the .txt files into .arff, Notepad++ with the specific converter "arff.notepadplus" was used10. What started as a supposedly easy task became a new, tiresome and at times nerve-racking trial-and-error exercise that took a whole day of hard work, and deserves a line in this study. Beware of commas and apostrophes in .arff files: they are detected as mistakes by the

9https://www.youtube.com/watch?v=Tggs3Bd3ojQ
10https://waikato.github.io/weka-wiki/formatsandprocessing/arffsyntax/

software, at least for a neophyte, even though both glyphs appear in all the sample files provided by the University for trying out Weka.

Figure 6.14: Problems importing .arff files into Weka

Once all commas and apostrophes had been removed, each sentence line was delimited according to the instructions in the DM manual by Witten, Frank and Hall (2011), as shown in Figure 6.15.

Figure 6.15: Example of .arff files structure

The @relation line contains the name of the dataset. The Text attribute of type string indicates that the data in the file are strings of text. A nominal attribute containing the classification of the document is also required, @attribute class-att, whose value is 1 for yes (Mafia) or 0 for no (no-Mafia).
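As a sketch of an alternative to the manual Notepad++ conversion (an assumption, not the workflow actually used), the following Python script writes the header structure just described and one quoted sentence per line with its class, stripping the commas and apostrophes that caused import errors.

```python
# Sketch: build Traindataset.arff from the two cleaned .txt files, using the
# header structure described above (@relation, a string attribute for the
# text, a nominal class-att of 1/0). Commas and apostrophes are stripped,
# since they were rejected on import into Weka.
def to_arff(files_and_labels, out_path, relation="Traindataset"):
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"@relation {relation}\n\n")
        out.write("@attribute Text string\n")
        out.write("@attribute class-att {0,1}\n\n")
        out.write("@data\n")
        for path, label in files_and_labels:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    text = line.strip().replace(",", " ").replace("'", " ")
                    if text:
                        out.write(f"'{text}',{label}\n")

to_arff([("mafia_train.txt", 1), ("no_mafia_train.txt", 0)], "Traindataset.arff")
# For the test file, the class value would be written as ? instead of 0/1.
```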

Once the two .arff files had been created on the basis of Table 6.2 above, the preliminaries for performing Document Classification were fulfilled, as per Table 6.3.

Table 6.3: Final composition of .arff files dataset for Weka

Variable      Traindataset.arff   Testdataset.arff
Mafia         14.851              3.989
No-Mafia      14.791              3.876
Total words   29.642              7.865

6.2.2 Weka settings and experiment

As mentioned in the preamble of this Chapter, StringToWordVector is the filter that “makes the attribute value in the transformed dataset 1 or 0 for all single-word terms, depending on whether the word appears in the document or not” [Witten et al., 2011, pp. 580-585]. The filter itself has different options, as explained in the mentioned MOOC tutorial:

• outputWordCounts: the output consists of word counts.

• IDFTransform and TFTransform: if both are set to True, term frequencies are transformed into TF × IDF values (a numerical statistic reflecting how important a word is to a document in a dataset).

• stemmer: offers a choice of word-stemming algorithms.

• useStopList: allows a personalised stop-word list to be used. If no file is chosen, Weka treats stop-words as attributes too.

• tokeniser: allows different tokenisers to be selected, including n-gram tokenisers instead of single words.

For a visual step-by-step description of the procedure, each process has been ordered as follows:

– First: once the programme had started, the Traindataset.arff file was loaded in the Preprocess tab, as per Figure 6.16.

– Second: selection of the StringToWordVector option in the filter box, without applying it as a pre-processing step at this stage. This choice depended on the fact that if the Train file contained several vectors and attributes, different in name and number from the

Figure 6.16: Weka interface: Traindataset.arff uploaded

Test file (as happens with words), the programme cannot match them and gives an error. This occurred because the vectors associated with each attribute were too many to handle, even for the software. This side effect is called a "mismatch" of attributes, which must be identical for the two files to be compatible.

– Third: shifting to Classify, the second tab in Figure 6.16, to select the FilteredClassifier option, which allows the analysis of word attributes. Inside FilteredClassifier, J48, a specific Weka tree classifier, was chosen first; the filter to be used together with J48 is StringToWordVector, which was selected as well. Inside StringToWordVector different options can be set. After several attempts, the final setting consisted of TF/IDF set to True, with the intention that the same keywords detected in the previous analysis in Chapter 5 might be identified as relevant by the algorithm too. The TF/IDF option works with DoNotOperateOnPerClassBasis set to True and also with the outputWordCounts option set to True. The stemmer chosen was the SnowballStemmer, which had not worked well with NLTK in Chapter 5 but worked better here together with the researcher's ItalianStopWordList. WordTokenizer, one of the three tokeniser options included in Weka, was

also selected.

– Fourth: the Testdataset.arff file was loaded and Weka was run for the first time. The first result was "?", since this first pass only creates unclassified associations from the classifier being evaluated. According to Witten, this is the expected output of Weka on a first approach with the Test file.

– Fifth: ten-fold cross-validation was applied. This step divides the data into ten parts, holds one portion out and trains the classifier on the remaining nine, and the operation is repeated ten times (folds). Ten-fold cross-validation is the standard method for error-rate prediction and ensures that the results "are the same as would be obtained on independent test sets" [Witten et al., 2011, p. 89], avoiding the need for different researchers to repeat the test with the same data. (A rough scripted analogue of these settings is sketched after Figure 6.17.)

Figure 6.17: Weka interface: test classifiers settings
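Since Weka is driven through its graphical interface, a rough scripted analogue of the configuration above may help readers who prefer code. The scikit-learn sketch below mirrors StringToWordVector (TF/IDF, Italian Snowball stemming, the stop-word list) with TfidfVectorizer and approximates J48 with a CART decision tree, evaluated by ten-fold cross-validation; it is an approximation under assumed file names, not the experiment actually run.

```python
# Rough scikit-learn analogue of the Weka pipeline configured above (a sketch,
# not the experiment actually run): TF/IDF features with Italian stemming and
# stop-word removal, a decision tree comparable to J48, ten-fold CV.
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

stemmer = SnowballStemmer("italian")

with open("stopwords_it.txt", encoding="utf-8") as f:
    stopwords = {w.strip() for w in f if w.strip()}

def analyzer(text):
    # word tokenisation, stop-word removal and Italian stemming, roughly
    # mirroring WordTokenizer + the stop-word list + SnowballStemmer
    return [stemmer.stem(w) for w in text.split() if w not in stopwords]

def read_labelled(path, label):
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    return sentences, [label] * len(sentences)

m_texts, m_labels = read_labelled("mafia_train.txt", 1)
n_texts, n_labels = read_labelled("no_mafia_train.txt", 0)
texts, labels = m_texts + n_texts, m_labels + n_labels

model = make_pipeline(
    TfidfVectorizer(analyzer=analyzer),          # ~ StringToWordVector with TF/IDF
    DecisionTreeClassifier(random_state=0),      # ~ J48 (a CART tree, not C4.5)
)
scores = cross_val_score(model, texts, labels, cv=10)   # ten-fold cross-validation
print(f"mean accuracy: {scores.mean():.2%}")
```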

Once the training with Weka had finished, the results obtained on the two-variable Mafia and no-Mafia dataset could be examined; they are discussed in the next chapter, Chapter 7, "Results and Discussion".

Chapter 7

Results and Discussion

The Mafia and no-Mafia document (and language) classification experiment based on Witten's MOOC, carried out with Weka and described in the previous chapter, produced the output shown in Table 7.1 and Figure 7.1.

Table 7.1: Weka experiment success rate for detecting Mafia Language

Instances                  3.495
Correctly classified       70.21 %
Incorrectly classified     29.78 %

The results reported by Weka show a 70.21% success rate in distinguishing variable 1, corresponding to the Mafia language, from variable 0. This result validates the alternative hypothesis that Mafia language is different from no-Mafia language. At this stage, it could be objected that Weka offers instance-based outputs, whereas the files' data, as per Tables 6.2 and 6.3, are word-based. From the beginning, the importance of keywords was considered the relevant feature of both datasets, which suggested a pure "bag of words" approach; since the present study has no previous references, this option was considered the most appropriate for a first attempt. The other relevant piece of information in Figure 7.1, apart from the results of the Test, is the confusion matrix: an internal evaluation of Weka's performance with the J48 classifier. "In multiclass prediction, the result on a test set is often displayed as a two-dimensional confusion matrix with a row and column for each class. Each matrix element shows the number of test examples for which the actual class is the row and the predicted class is the column. Good results correspond to large numbers down the main diagonal and small, ideally zero, off-diagonal

elements" [Witten et al., 2011, p. 164]. How should the confusion matrix be read? The columns show how the model (the J48 classifier with StringToWordVector and the settings as per Figure 6.17) predicted the instances.

Table 7.2: Weka experiment confusion matrix results

               a=0 (predicted)   b=1 (predicted)
a=0 (actual)         722                627
b=1 (actual)         414              1.732

In the first column, corresponding to the no-Mafia variable 0, the sum 722+414 = 1.136 instances corresponds to what the model predicts as no-Mafia instances. In the second column, corresponding to the Mafia variable 1, the sum 627+1.732 = 2.359 instances corresponds to what the model predicts as Mafia instances. If the columns are predictions, the rows represent reality. Thus, the first row contains all the "a" samples, i.e. no-Mafia variable 0: 722+627 = 1.349 instances. The second row contains all the "b" samples, i.e. Mafia variable 1: 414+1.732 = 2.146 instances. Once the confusion matrix has been read, it has to be interpreted1:

• Top left: 722 instances that the model predicts as "a" and that really are "a", so 722 instances are correctly classified as no-Mafia.

• Bottom left: 414 instances that the model predicts as "a" but that really are "b"; these Mafia instances classified as no-Mafia are confused instances.

• Top right: 627 instances that the model predicts as "b" but that really are "a": another misclassification, of 627 no-Mafia sentences classified as Mafia (variable 1). These instances are confused as well.

• Bottom right: 1.732 instances that the model predicts as "b" and that really are "b", so they are correctly classified Mafia instances.

• The top-left 722 and the bottom-right 1.732 cells of the matrix show the instances that Weka's J48 model classifies correctly.

1To interpret the confusion matrix correctly a relevant contribution in Stackoverflow has been essential: https://stackoverflow.com/questions/15214179/how-to-read-the-classifier-confusion-matrix-in-weka.

• The bottom-left 414 and the top-right 627 cells of the matrix show where the model is confused. From these four cells, the overall accuracy reported in Table 7.1 can be recomputed directly, as shown below.
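As a quick check, the overall accuracy in Table 7.1 follows directly from the four cells of the confusion matrix (thousands separators omitted):

\[
\mathrm{accuracy} \;=\; \frac{722 + 1732}{722 + 627 + 414 + 1732} \;=\; \frac{2454}{3495} \;\approx\; 70.21\%
\]

By the same arithmetic, the model recovers 1732/2146 ≈ 80.7% of the Mafia instances but only 722/1349 ≈ 53.5% of the no-Mafia ones, which is where most of the confusion discussed below lies.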

The confusion matrix results (414 and 627 confused instances) offer an element of analysis that deserves some reflection from the researcher. The relatively high number of misclassified instances is an aspect that was taken into account in the experiment design. It should not be forgotten that the transcripts used are a reflection of real conversations. In these conversations there are repeated phrases, such as pleasantries or greetings, collocations, interjections and onomatopoeias such as eh, ehhh, ohhh, uhum, which were deliberately not eliminated. The reason is that NLP should be applied to "real" NL, not manipulated to ease the task or to obtain better outputs. All these pragmatic elements exist in all conversations and should not be eliminated from an experiment that aims to set a valid and authentic precedent for the analysis of criminal language. For example, a 1 is assigned to every "Ciao, come va?" in the Mafia dataset, and a 0 is assigned to every "Ciao, come va?" in the no-Mafia dataset. It is to be expected that an algorithm will not know how to classify "Ciao, come va?", because it appears in both datasets; in the entire dataset there are a few such "Ciao, come va?" phrases. Another point for reflection is that many of the transcribed conversations do not contain complete sentences, owing to interference in the wiretaps, noise or overlapping voices, because the interlocutors are numerous and interrupt each other. This implies that the sentence structure in the dataset is often incomplete when compared, ideally, with a written text, whose structure is supposed to be more rigid. To a certain extent, it may be that this incompleteness confuses the classification of one variable or the other. To further evaluate Weka's performance, a simple baseline was calculated: "The purpose of baseline assessment is to establish a point from which future measurements and predictions can be calculated"2. In other words, a baseline is the result of a very basic model or solution, as the name implies. It is calculated as a starting point before applying more complex models to a database to obtain more robust results, as in the Weka experiment described in Chapter 6. If the experiment scores better than the baseline, the experiment has been successful. In this case, the baseline assumes that all the instances of the Test dataset are 0, corresponding to the no-Mafia variable; by checking whether this guess is correct or wrong, a percentage is calculated. The baseline result is reported in Table 7.3. The baseline indicates that, if calculated in this way, there is almost no difference between the

2Retrieved from https://www.oxfordreference.com/view/10.1093/oi/authority.20110803095449856

Table 7.3: Success rate for the baseline results (majority class approach)

Variable 0   Variable 1
54.5%        45.5%

two variables, 0 for no-Mafia and 1 for Mafia. Therefore, the result reported by Weka is of great relevance, as it indicates a roughly 70% chance of detecting what is Mafia language and what is not.

Figure 7.1: Weka final results

Finally, the margin of roughly 20% over the baseline result is good enough to open a new line of research for studying criminal language through NLP. New utilities can be developed to improve the application of ML and document classification in the fields of investigation and forensic linguistics, in order to obtain judicially relevant material and to support its analysis, classification and interpretation.

Chapter 8

Conclusions

The lengthy process of proving the alternative hypothesis, which began with criminal and Mafia conversations collected from court proceedings records, went through several steps in an NLP-based prototype experiment. The purpose set out at the beginning has been fulfilled: the alternative hypothesis has been validated, as shown in Chapter 7, with 70% of instances correctly classified. This result has to be interpreted in perspective, as the material used was collected via the Internet and thanks to the help of lawyer Falleti. Better and more comprehensive results could have been obtained with qualitatively richer material, such as that available to law enforcement agencies. However, Italian law prevents the use of such sensitive information, even for research purposes. It would be wonderful to enhance the experiment with qualitatively and quantitatively stronger material, compiling a corpus of criminal language from all over Italy.

This result opens up new perspectives in language analysis with ML tools for forensic linguistics and for legal and criminal proceedings. With proper training and supervision by multidisciplinary teams of IT specialists, legal experts, law enforcement and linguists, ML tools could directly analyse electronic eavesdropping files and give relevance only to those conversations that raise a high percentage of suspicious elements, without the need for transcription and “translation” into Italian. In the selection and reading of thousands of words, ML tools would collect, analyse and classify relevant information faster and better. The automatic analysis would provide more reliable and more authentic results, with better cross-validation to extract meaningful data, and would be more likely to be admitted as evidence in court. Audio could later be automatically transcribed and even “translated” into Italian with downstream MT modules.

Several limitations can be raised against this study, mainly due to its pioneering approach, with hardly any academic support. The first objection might be the choice of the word as a parameter, while Weka works with instances. From the researcher’s perspective, based on the background bibliography on Mafia and no-Mafia literature, the semantic value of keywords was paramount. As no reference was found on Weka document classification based on words only, but only on sentence classification with word attributes, the latter approach was embraced for the experiment. A new line of research could be testing Weka on valued words, considering that the same word, as seen with “discorso”, might be misleading because it is used in both datasets. In this study an unannotated corpus was used, but this does not mean that the performance of Weka, or of any other NLP tool, cannot be improved with an annotated corpus. Using all the knowledge obtained from the content processing and analysis described in Chapter 5, with lemmatisation, PoS tagging and parsing, future studies could implement this information as extra attributes in Weka and test whether the results offered by the algorithms improve with more information on the linguistic features, as sketched below.
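As a purely illustrative sketch of this suggestion, the fragment below shows one way word attributes could be combined with extra linguistic attributes before training a classifier. It is not part of the thesis experiment: the experiment itself was run in Weka, whereas this sketch uses scikit-learn, and all sentences, labels and attribute choices are invented.

```python
# Minimal sketch (assumption): combine bag-of-words features with extra
# linguistic attributes (e.g. PoS-tag counts produced elsewhere), mirroring
# the idea of adding columns to the Weka instances.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy sentences with their class (1 = Mafia, 0 = no-Mafia) and,
# per sentence, pre-computed counts of [nouns, verbs] that lemmatisation and
# PoS tagging would provide.
sentences = ["il discorso del mandamento", "ciao come va", "andiamo al bar"]
labels = [1, 0, 0]
pos_counts = np.array([[2, 0], [0, 1], [1, 1]])

# Word attributes (bag of words) concatenated with the extra attributes.
word_features = CountVectorizer().fit_transform(sentences)
all_features = hstack([word_features, csr_matrix(pos_counts)])

# Any classifier could be plugged in here; Naive Bayes is only a stand-in.
clf = MultinomialNB().fit(all_features, labels)
print(clf.predict(all_features))
```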

Apart from the obvious limitations and suggestions for improvement, this investigation has clearly established that statements made in a courtroom environment are of little relevance for linguistic analysis, due to their lack of spontaneity and their constrained nature. In this sense, the use of electronic surveillance transcripts is much more effective. It has also been established that mandamento and capomandamento were an exclusive feature of the Mafia dataset, whereas the keyword compare was exclusive to the no-Mafia dataset. The presence of certain words from Mafia language, such as picciotti and mandamento, inside the no-Mafia dataset evidences the connections between different criminal associations, which can be demonstrated through linguistic analysis with NLP and ML tools.
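Purely as an illustration of how such keyword exclusivity can be checked, the following minimal Python sketch compares the vocabularies of two sets of sentences; it is not the procedure used in the thesis, and the toy sentences are invented.

```python
# Minimal sketch (assumption): find vocabulary that appears in only one
# of the two datasets, e.g. keywords exclusive to the Mafia transcripts.
def exclusive_keywords(corpus_a, corpus_b):
    vocab_a = {w.lower() for text in corpus_a for w in text.split()}
    vocab_b = {w.lower() for text in corpus_b for w in text.split()}
    return vocab_a - vocab_b, vocab_b - vocab_a

# Hypothetical toy sentences, not the real datasets.
mafia = ["il capomandamento decide", "parliamo del mandamento"]
no_mafia = ["ciao compare", "andiamo dal compare"]
only_mafia, only_no_mafia = exclusive_keywords(mafia, no_mafia)
print(sorted(only_mafia))
print(sorted(only_no_mafia))
```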

It might be said that a 20% margin is not an inviting result; one might have expected, for example, an overwhelming ratio of 90/10. However, as mentioned above, it was clear from the beginning that the linguistic substrate of both languages is Italian, with dialectal variants of Sicilian in the case of the Mafia dataset. The two languages have many attributes in common, but not so many as to be indistinguishable from each other. In fact, there is a 20% margin of attributes, and not only a few words, that differentiates the Mafia variable from the no-Mafia variable. An example of this singularity is the common Italian word discorso, which has a unique semantic value inside the Mafia context. This singularity also answers the possible objection that the 20% could simply represent the Sicilian dialect features. But again, discorso and mandamento are not exclusive to the Sicilian dialect; they are Italian words.

So the 20% difference between the two datasets cannot be represented by the Sicilian dialect vocabulary.

To take the conclusions further, this study also seeks to vindicate the professional role of the linguist beyond the academic sector, and specifically in Forensic Linguistics and Law. In fact, the analysis chapters have been essential both to identify valuable elements of the Mafia dataset and to design the Weka experiment, and the only viable approach to them is the linguistic one. In this respect, it would be interesting to include programming language courses such as Python or R in the humanities curricula. Due to the increasing amount of data, the future is destined to the automation of analysis processes. Linguistics cannot afford to lag behind in the compartmentalisation of knowledge, and linguists must rid themselves of the idea that pure sciences are at odds with Linguistics and Philology. Classification and text analysis with NLP tools are here to stay. Future generations will be better prepared if they are equipped with valuable tools for this purpose and gain a flexibility in text analysis that is impossible to achieve manually.

Another element in favour of this study, and of those that will eventually follow it, is that mafias are globalising and their structures are becoming more fluid and standardised. The effectiveness and immediacy of analysis with NLP utilities could speed up the collection of relevant information that is valid for investigations. Given Cosa Nostra’s fluidity in adapting to the times, electronic surveillance transcriptions of this kind, analysed with NLP tools, would greatly facilitate the first investigative steps of collecting and classifying relevant data, as mentioned in Chapter 6. Similar studies on other mafias, such as the Italian ’Ndrangheta and Camorra already mentioned, or the Japanese, Balkan and Russian organisations, could offer further insights into their structures, connections and activities in order to eradicate them. In this regard, the application of NLP to studies of languages other than English must be applauded. Multilingualism has to be applied in computer science as well.

Prosecutor Falcone was the first man of law to give relevance to language in order to better understand and fight Cosa Nostra. His legacy was to try to defeat the Mafia with its own words, and he was partially successful. Because of this, he was killed on 23rd May 1992, together with his wife and escort. He might have died, but not his legacy.

Bibliography

[Anthony, 2019] Anthony, L. (2019). AntConc.

[Bose et al., 2020] Bose, R., Vashishtha, S., and Allen, J. (2020). Improving Semantic Parsing Using Statistical Word Sense Disambiguation (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 34(10):13757–13758.

[Cambria, 2014] Cambria, E. and White, B. (2014). Jumping NLP Curves: A Review of Natural Language Processing Research. IEEE Computational Intelligence Magazine, 14(May):48–57.

[Camilleri, 2009] Camilleri, A. (2009). Voi non sapete. Gli amici, i nemici, la mafia, il mondo nei pizzini di Bernardo Provenzano. Arnoldo Mondadori Editore Srl, Milano, 2nd edition.

[Cavallaro et al., 2020] Cavallaro, L., Ficara, A., de Meo, P., Fiumara, G., Catanese, S., Bagdasar, O., and Liotta, A. (2020). Disrupting resilient criminal networks through data analysis: The case of Sicilian Mafia. arXiv.

[Ceruso, 2010] Ceruso, V. (2010). Dizionario Mafioso-Italiano, Italiano-Mafioso. Newton Compton Editori Srl, Roma, 1st edition.

[Chaski, 1997] Chaski, C. (1997). Who wrote it. Technical report, National Institute of Justice, Washington, DC.

[Chung et al., 2019] Chung, Y.-L., Kuzmenko, E., Tekiroğlu, S. S., and Guerini, M. (2019). CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2819–2829.

[Ciconte, 2017] Ciconte, E. (2017). Dall’omertà ai social, come cambia la comunicazione della mafia. Edizioni Santa Caterina, Pavia, 1st edition.

[Clark and Shalom, 2010] Clark, A., Fox, C., and Shalom, L. (2010). The Handbook of Computational Linguistics and Natural Language Processing. Blackwell Publishing Ltd, West Sussex, UK, 1st edition.

[Cremades et al., 2017] Cremades, S. Z., Gomez Soriano, J. M., and Navarro-Colorado, B. (2017). Diseño, compilación y anotación de un corpus para la detección de mensajes suicidas en redes sociales. Procesamiento de Lenguaje Natural, 59:65–72.

[Espunya i Prat, 1994] Espunya i Prat, A. (1994). Computational linguistics: a brief introduction. Links & letters, 1(1):9–23.

[Falcone and Padovani, 1991] Falcone, G. and Padovani, M. (1991). Cose di Cosa Nostra. BUR Rizzoli, Milano, 6th edition.

[Francolini, 2020] Francolini, G. (2020). La prova nel procedimento di prevenzione: identità, alterità o somiglianza con il processo penale? Sistema Penale, 1(10):5–50.

[Grice, 1975] Grice, H. (1975). Logic and conversation. In Syntax and Semantics: Speech Acts, volume 3, pages 41–54. Academic Press, New York, NY.

[Kastrati and Imran, 2019] Kastrati, Z. and Imran, A. S. (2019). Performance analysis of machine learning classifiers on improved concept vector space models. Future Generation Computer Systems, 96:552–562.

[Lee, 2020] Lee, R. S. T. (2020). Artificial Intelligence in Daily Life. Springer Singapore.

[Mitkov, 2003] Mitkov, R. (2003). The Oxford Handbook of Computational Linguistics. Oxford University Press, Oxford, UK.

[Navarro-Colorado, 2021] Navarro-Colorado, B. (2021). Compilación de corpus para lenguas de especialidad.

[Palazzolo and Prestipino, 2017] Palazzolo, S. and Prestipino, M. (2017). Il codice Provenzano. Giuseppe Laterza & Figli, Bari-Roma, 6th edition.

[Pontrandolfo, 2012] Pontrandolfo, G. (2012). Legal Corpora: An Overview. 14/2012:121–136.

[Rani et al., 2020] Rani, P., Suryawanshi, S., Goswami, K., Chakravarthi, B. R., Fransen, T., and Mccrae, J. P. (2020). A Comparative Study of Different State-of-the-Art Hate Speech Detection Methods for Hindi-English Code-Mixed Data. Technical report, European Language Resources Association (ELRA).

[Razavi et al., 2010] Razavi, A. H., Inkpen, D., Uritsky, S., and Matwin, S. (2010). Offensive language detection using multi-level classification. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6085 LNAI(May):16–27.

[RStudio Team, 2020] RStudio Team (2020). RStudio: Integrated Development for R. RStudio.

[Searle, 1980] Searle, J. R. (1980). Minds, brains, and programs. In Behavioral and Brain Sciences, volume 3, pages 417–457. BBS.

[Tabbert, 2015] Tabbert, U. (2015). Crime and corpus: The linguistic representation of crime in the press., volume 20. John Benjamins Publishing Company, Amsterdam/Philadelphia.

[Talib et al., 2016] Talib, R., Kashif, M., Ayesha, S., and Fatima, F. (2016). Text Mining: Techniques, Applications and Issues. International Journal of Advanced Computer Science and Applications, 7(11):414–418.

[Torrealta, 2010] Torrealta, M. (2010). La trattativa. RCS Libri SpA, Milano, 1st edition.

[Turing, 1950] Turing, A. M. (1950). Computing Machinery and Intelligence. Mind, 59(236):433–460.

[Università Cattolica del Sacro Cuore and Transcrime Department, 2019] Università Cattolica del Sacro Cuore and Transcrime Department (2019). ROXANNE.

[Wang et al., 2009] Wang, H., Ma, C., and Zhou, L. (2009). A brief review of machine learning and its application. Proceedings - 2009 International Conference on Information Engineering and Computer Science, ICIECS 2009, pages 19–22.

[Witten, 2005] Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition.

[Witten et al., 2011] Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining. Technical report, Elsevier, Burlington, USA.