JUNE 23, 2020

REPUTATION CHECK WITH NLP AND TOPIC MODELLING

CAPSTONE PROJECT FOR MS IN FINANCE PROGRAM

PUBLIC PROJECT SUMMARY

CEU eTD Collection

STUDENT: BERMET ALIBEKOVA
FACULTY SUPERVISOR: PROF. PETER SZILAGYI

Table of contents:

1. Introduction
2. Background and objectives
3. Methodology
4. Conclusion


1. INTRODUCTION

Due diligence is necessary for a wide array of companies to mitigate risk. The financial sector uses it to screen clients and avoid financial crime, recruitment agencies use it in executive searches, investors use it to mitigate financial loss, and high-risk industries use it to screen counterparties in order to mitigate risk and comply with regulations.

Due diligence is usually very time consuming and requires a lot of manual work, in most cases relying on simple Google searches. Millions of articles and news items are published on the internet every day, which makes it difficult to find the necessary information among large volumes of unstructured data. Moreover, due diligence is a very costly process for many companies. The main aim of this project is therefore to develop an automated adverse media check tool that would help banks identify companies with a bad reputation. The tool will help financial firms find information in the media that has not yet officially reached a court but can help to predict a company’s future performance. We used Topic Modelling, a technique that automatically analyzes texts and identifies the topics present in a collection of texts. To improve the Topic Modelling results for articles about a particular person or company, it was necessary to use lemmatization, a widely used technique that returns the base or dictionary form of a word.

2. BACKGROUND AND OBJECTIVES

Adverse media is any kind of news published in traditional or online media that contains negative information about individuals or companies. For example, if a company would like to take a loan from a bank, the bank can find a lot of information about the company’s credit history and registry records, which may indicate that it is a stable company with high revenues. However, this information is limited: it does not show whether the company is engaged in criminal activities or selling fake cancer medicine, or whether a company manager is under investigation. As banks and financial service institutions are risk averse and want to mitigate all risks, they would like to see a thorough investigation even if there have been no court hearings yet.

The project was done for a fintech start-up that joined CEU InnovationsLab in 2020. The company offers an automated tool that can produce higher-quality reports in a short time, using the least amount of manual labour, well-established due diligence processes and the latest data science tools. An automated internet check tool is one of its projects; it will give its clients the opportunity to extract adverse media information from Google. This includes thousands of articles and blog posts that are accessed, categorized and filtered into relevant categories within a short time.

3. METHODOLOGY

We used a Google API tool that automatically searches for keywords such as “money laundering” or “investigation” and returns the whole article or blog post about the subject company or person. The first part of the project was to prepare a lemmatization script for this process and, building on it, to work on Topic Modelling. Topic Modelling has several advantages: it discovers hidden topic patterns across a given group of articles, the extracted topics can be used to classify or group the articles, and large texts can be summarized by the main topics that the topic model algorithm finds.
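The keyword search step can be illustrated with a small sketch. The summary does not describe the query format or the search API that was used, so the function name `build_queries`, the exact-phrase quoting and the subject name below are assumptions made only for illustration.

```python
def build_queries(subject, keywords):
    """Pair the subject name with each adverse keyword as exact-phrase terms.

    Assumption: one query per keyword, with both parts quoted so that a
    search engine treats them as exact phrases.
    """
    return ['"{}" "{}"'.format(subject, kw) for kw in keywords]


# "Example Corp" is a made-up subject used only for illustration.
queries = build_queries("Example Corp", ["money laundering", "investigation"])
for query in queries:
    print(query)
```

Each resulting string would then be passed to whichever search client the pipeline uses; that call is left out here because it is not specified in the summary.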

Lemmatization is one of the text normalization techniques used in NLP; it returns the dictionary forms of words using their morphological features and a vocabulary. The lemmatization script for the returned articles needed to meet the following requirements: 1) be written with the spaCy library (Python), 2) work for different languages (English, Hungarian, Polish, Russian, French, German), and 3) return a list of lemmas. Since the tool will be applied to articles in multiple languages, a language detection function was created using the “Language Detection” feature of the spaCy library. Based on the detected language of the article, the relevant language model is applied for lemmatization. The texts of the articles were cleaned by lowercasing them and removing unnecessary characters, including numbers and punctuation, as well as stop words.
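The cleaning step described above can be sketched in plain Python. This is a minimal illustration, not the project's script: the stop-word lists are deliberately tiny and hypothetical, the function name `clean_text` is an assumption, and lemmatization itself is left out because it depends on the language model that is loaded for each language.

```python
import re

# Hypothetical, deliberately tiny stop-word lists keyed by language code.
# A real pipeline would use the lists shipped with its NLP library.
STOP_WORDS = {
    "en": {"the", "a", "an", "of", "and", "in", "is", "for", "to"},
    "de": {"der", "die", "das", "und", "in", "ist"},
}


def clean_text(text, lang="en"):
    """Lowercase the text, strip digits and punctuation, drop stop words."""
    text = text.lower()
    # Replace digits and underscores with spaces.
    text = re.sub(r"[\d_]+", " ", text)
    # Replace every remaining character that is neither a word character
    # nor whitespace (i.e. punctuation and symbols) with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    stop_words = STOP_WORDS.get(lang, set())
    return [token for token in text.split() if token not in stop_words]


tokens = clean_text("The bank opened 3 investigations in 2019, for Money-Laundering!")
print(tokens)  # ['bank', 'opened', 'investigations', 'money', 'laundering']
```

Keying the stop-word lists by language code mirrors the idea of choosing the language-specific resources after the language of an article has been detected.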

In order to identify phrases, we used a Python library that finds phrases in texts. Bigrams are two words frequently occurring together in the text and trigrams are three words frequently occurring together. For example, it returns phrases like “European_commission”, “Civil_society”, “Money_Laundering”. To obtain reliable results, we also eliminated words that were used rarely and words that were used too frequently. We created a list of unique words with their frequencies and removed the words used in more than 70 percent of the articles. The least common words were removed if their length was less than or equal to 2.
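The phrase detection and frequency filtering steps can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: phrases are detected by a raw co-occurrence count (the threshold `min_count` is hypothetical, and phrase-detection libraries typically use a more refined score), and every word of length 2 or less is dropped, which is a simplification of the rule described above. The function names and the sample articles are made up for the example.

```python
from collections import Counter


def join_frequent_bigrams(docs, min_count=2):
    """Join adjacent word pairs that occur at least `min_count` times overall."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    joined_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                out.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip both words of the joined pair
            else:
                out.append(doc[i])
                i += 1
        joined_docs.append(out)
    return joined_docs


def filter_vocabulary(docs, max_doc_share=0.7, min_length=3):
    """Drop words found in more than `max_doc_share` of the documents,
    and words shorter than `min_length` characters."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    limit = max_doc_share * len(docs)
    return [
        [w for w in doc if doc_freq[w] <= limit and len(w) >= min_length]
        for doc in docs
    ]


# Made-up tokenized articles, for illustration only.
docs = [
    ["money", "laundering", "case", "eu"],
    ["money", "laundering", "probe", "case"],
    ["civil", "society", "case"],
    ["bank", "fraud", "case"],
]
phrased = join_frequent_bigrams(docs)
filtered = filter_vocabulary(phrased)
print(filtered)
```

In this toy input, “case” appears in all four articles and is removed by the 70 percent rule, “eu” is removed by the length rule, and “money laundering” is joined into a single token because the pair occurs twice.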

As company and person names are important in this tool, we also used an entity recognition function in Python. Python’s spaCy library was the most effective among the libraries that we tested. As a result, we received a list of entities, which were added to the words in the data frame for the analysis.

Topic Modelling, which groups the articles into clusters according to the words they contain and their frequencies, was the second part of the project. The Latent Dirichlet Allocation (LDA) model was used for topic modelling because it is a powerful tool for unsupervised machine learning and was also advised by the client company. First, we created the dictionary and the corpus, which are the main inputs to the LDA topic model. The model was built with different numbers of topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic. The model with the highest coherence score was chosen as the optimal model. As a result, we obtained the topics in the articles and the top words occurring in those topics.
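The two inputs mentioned above, and the model selection rule, can be sketched without any topic modelling library. The sketch assumes that the dictionary maps each distinct token to an integer id and that the corpus stores each article as (token id, count) pairs; the function names are hypothetical, training the LDA model itself is not shown, and the coherence values below are placeholders for illustration, not results from the project.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bag_of_words(doc, token2id):
    """Represent one document as sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


def pick_num_topics(candidates, coherence_of):
    """Score each candidate number of topics and return the best one.

    `coherence_of` is any callable that trains a model with k topics and
    returns its coherence score; it is supplied by the caller.
    """
    scores = {k: coherence_of(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Made-up tokenized articles, for illustration only.
docs = [["fraud", "bank", "fraud"], ["bank", "probe"]]
dictionary = build_dictionary(docs)
corpus = [to_bag_of_words(doc, dictionary) for doc in docs]
print(dictionary)  # {'fraud': 0, 'bank': 1, 'probe': 2}
print(corpus)      # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]

# Placeholder coherence values standing in for real model evaluations.
placeholder_scores = {2: 0.41, 4: 0.52, 6: 0.47}
best_k, scores = pick_num_topics([2, 4, 6], placeholder_scores.get)
print(best_k)  # 4
```

Because every candidate requires training and scoring a separate model, the cost of this search grows with the number of candidates.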

4. CONCLUSION

Due diligence services are costly and usually require considerable time to perform a check on a person or a company. The adverse media check tool was created using the latest Machine Learning and NLP solutions and has great market potential: it is useful for financial firms and can help them mitigate their risks at lower cost and reduce the time spent on due diligence checks. In fact, the tool would be useful not only for financial institutions but also for recruitment agencies and high-risk industries. One limitation of the project was that it was difficult to identify the optimal number of topics for the LDA model, as it is time consuming to check all the variations. We also encountered difficulties with the availability of Python tools for different languages, as the tool works with multiple languages.
