
A statistical analysis of Twitterati's reaction, across geographies and gender, to the Leak
Pruthvi N. Shetty, Revathy Sridharan, Snehil Vishwakarma, Srikanth Kanuri and Vikas Jangam
Date 05-04-2016

Professor: Muhammad Abdul-Mageed

ABSTRACT
Motivation: The Panama Papers are a set of 11.5 million leaked documents that detail financial and attorney-client information for more than 214,000 offshore companies associated with the Panamanian law firm and corporate service provider Mossack Fonseca. The leaked documents contain the identities of the companies' shareholders and directors, as well as some financial transactions. Among other things, they illustrate how wealthy individuals, including public officials, can keep personal financial information private.[1] At the time of publication, the papers identified five then-heads of state or government leaders from Argentina, Iceland, Saudi Arabia, Ukraine, and the United Arab Emirates as well as government officials, close relatives, and close associates of various heads of government of more than forty other countries. The British Virgin Islands was home to half of the companies listed and Hong Kong contained the most affiliated banks, law firms, and middlemen. The names of several national leaders appear in the documents, including presidents Khalifa bin Zayed Al Nahyan of the United Arab Emirates, Petro Poroshenko of Ukraine, King Salman of Saudi Arabia, and the Prime Minister of Iceland, Sigmundur Davíð Gunnlaugsson.

Former heads of state mentioned in the papers include Sudanese president Ahmed al-Mirghani; the Emir of Qatar Hamad bin Khalifa Al Thani; prime ministers Bidzina Ivanishvili of Georgia, Ayad Allawi of Iraq, and Ali Abu al-Ragheb of Jordan; former prime ministers Hamad bin Jassim bin Jaber Al Thani of Qatar, Pavlo Lazarenko of Ukraine, and Ion Sturza of Moldova.[2] The leaked files identified 61 family members and associates of prime ministers, presidents and kings, including the deceased father of British prime minister David Cameron; the brother-in-law of China's paramount leader Xi Jinping; the son of Malaysian prime minister Najib Razak; the children of Pakistani prime minister Nawaz Sharif; the children of Azerbaijani president Ilham Aliyev; Clive Khulubuse Zuma, the nephew of South African president Jacob Zuma; Nurali Aliyev, the grandson of Kazakh president Nursultan Nazarbayev; Mounir Majidi, the personal secretary of Moroccan king Mohammed VI; Kojo Annan, the son of former United Nations Secretary-General Kofi Annan; Mark Thatcher, the son of former British prime minister Margaret Thatcher; and the "favourite contractor" of Mexican president Enrique Peña Nieto.[3]

Naturally, this led to public outrage on an unprecedented global scale, reaching up to 400K tweets in the first 72 hours. We saw this as an opportunity to obtain a high-quality, diverse data source to harvest the sentiment and identify the polarity of Twitter users across geographies and gender. Also, since many of the accused individuals are active Twitter users, this gave us the opportunity to analyze Twitter's reaction to each accused individual.[4]

Fig. 1. Countries affected by the Panama Paper leak [4]

Results: The data we collected reflects a largely negative sentiment among Twitter users. We also see sharp condemnation of the accused individuals from different corners of the world. Based on gender, we see that male users tend to be more critical in their tweets. Likewise, we have seen greater outrage from countries that have been directly affected by the scandal than from other countries.

Keywords: Twitter, Panama Papers, Sentiment, Gender, Geography, Social Media

Project Report - PanamaPaperLeaks

1 INTRODUCTION
In this project, we mined Twitter for tweets with the hashtags "PanamaPapers" and "panamapapersleak". We pulled the data in two categories: with location information and without location information. We used the packages 'tweepy' for Python and 'twitteR' and 'SocialMediaLab' for R. After collecting over 10,000 tweets per category, we analyzed the data in terms of gender and location to study the polarity and sentiment of Twitter users. For this purpose, we employ packages such as 'sentiment', 'TM', 'qdap' and 'ggplot'.

2 CHALLENGES
Sentiment and opinion mining can be useful in several ways. It can help marketers evaluate the success of an ad campaign or new product launch, determine which versions of a product or service are popular and identify which demographics like or dislike particular product features. For example, a review on a website might be broadly positive about a digital camera, but be specifically negative about how heavy it is. Being able to identify this kind of information in a systematic way gives the vendor a much clearer picture of public opinion than surveys or focus groups do, because the data is created by the customer.

There are several challenges in opinion mining. The first is that a word that is considered positive in one situation may be considered negative in another. Take the word "long", for instance. If a customer said a laptop's battery life was long, that would be a positive opinion. If the customer said that the laptop's start-up time was long, however, that would be a negative opinion. These differences mean that an opinion system trained to gather opinions on one type of product or product feature may not perform very well on another.

In this project, we had to carefully examine the tweets collected from users across countries and gender to form a credible corpus of data for further analysis. We also had to refine the training data, adding certain keywords specific to the Panama Papers leak, to accurately identify what a tweet implies.

Fig. 2. Sentiment analysis across six variants.

3 PACKAGES
3.1 twitteR
twitteR is an R package which provides access to the Twitter API. Most functionality of the API is supported, with a bias towards API calls that are more useful in data analysis as opposed to daily interaction. Its package description reads "R Based Twitter Client: Provides an interface to the Twitter web API." It is authored by Jeff Gentry.

3.2 SocialMediaLab
VOSON SocialMediaLab is an R package that provides a suite of tools for collecting and constructing networks from social media data. It provides easy-to-use functions for collecting data across popular platforms (Instagram, Facebook, Twitter, and YouTube) and generating different types of networks for analysis. SocialMediaLab also collects the associated text data from social media platforms (e.g. Tweets, fan page posts and comments, YouTube video comments).

3.3 TM & Sentiment
R provides two packages for working with unstructured text: TM and Sentiment. TM can be installed in the usual way. Unfortunately, Sentiment was archived in 2012 and is therefore more difficult to install. However, it can still be installed from an external repository. The Sentiment package contains two handy functions serving our purposes:
• classify_emotion: This function analyzes some text and classifies it into different types of emotion: anger, disgust, fear, joy, sadness, and surprise. The classification can be performed using two algorithms: one is a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti's emotions lexicon; the other is a simple voter procedure.
• classify_polarity: In contrast to the classification of emotions, the classify_polarity function allows us to classify some text as positive or negative. In this case, the classification can be done using a naive Bayes algorithm trained on Janyce Wiebe's subjectivity lexicon, or by a simple voter algorithm.

3.4 Wordcloud
As the name suggests, this R package is used to build a word cloud. It takes text data as input and builds word clouds. We perform a series of operations on the text data to simplify it. First, we create a corpus. Next, we convert the corpus to a plain text document. Then we remove all punctuation and stopwords. Stopwords are commonly used words in the English language such as I, me, my, etc.
There are a few ways to customize the word cloud.
• scale: This is used to indicate the range of sizes of the words.

Analysis of Panama Paper Leaks in Twitter

• max.words and min.freq: These parameters are used to limit the number of words plotted. max.words will plot the specified number of words and discard the least frequent terms, whereas min.freq will discard all terms whose frequency is below the specified value.
• random.order: By setting this to FALSE, we make it so that the words with the highest frequency are plotted first. If we don't set this, it will plot the words in a random order, and the highest-frequency words may not necessarily appear in the center.
• rot.per: This value determines the fraction of words that are plotted vertically.
• colors: The default value is black. If you want to use different colors based on frequency, you can specify a vector of colors, or use one of the pre-defined color palettes.

Fig. 3. Wordcloud of 's reaction to the Panama Paper Leak.

3.5 TweePy
Tweepy is a Python wrapper package that helps with crawling tweets from Twitter. Before using Tweepy, we need to set up a Twitter account and register an app. We get an Access Token, Token Secret, Consumer Key and Consumer Secret once we register our project. These credentials are used to authenticate the program that crawls data from Twitter. The API class provides access to the entire Twitter RESTful API. Each method can accept various parameters and return responses. For more information about these methods, please refer to the API Reference. When we invoke an API method, most of the time what is returned to us will be a Tweepy model class instance. This will contain the data returned from Twitter, which we can then use inside our application. There are 2 main APIs to crawl data from Twitter:

3.5.1 RESTful API: The RESTful API is used to crawl data from Twitter based on the query text and a few other parameters. With the RESTful API, we cannot crawl for data older than 15 days from the current date. There are also other limitations when using the RESTful API for crawling data.

3.5.2 OAuth Authentication: Tweepy tries to make OAuth as painless as possible. To begin the process, we need to register our client application with Twitter. Create a new application, and once you are done you should have your consumer token and secret. Keep these two handy; you'll need them. The next step is creating an OAuthHandler instance. Into this we pass the consumer token and secret obtained in the previous step:
    auth = tweepy.OAuthHandler(consumer_token, consumer_secret)

3.5.3 Streaming API: The Twitter streaming API is used to download Twitter messages in real time. It is useful for obtaining a high volume of tweets, or for creating a live feed using a site stream or user stream. See the Twitter Streaming API Documentation. The streaming API is quite different from the REST API because the REST API is used to pull data from Twitter, whereas the streaming API pushes messages to a persistent session. This allows the streaming API to download more data in real time than could be done using the REST API.

In Tweepy, an instance of tweepy.Stream establishes a streaming session and routes messages to a StreamListener instance. The on_data method of a stream listener receives all messages and calls functions according to the message type. The default StreamListener can classify most common Twitter messages and routes them to appropriately named methods, but these methods are only stubs.

4 AIMS AND GOALS
We aim to use the tweets collected to extract meaningful patterns about Twitter users' reaction to the Panama Papers leak. Here, we consider two main criteria: geography and gender. We intend to show the distribution of sentiment of Twitter users across countries and the difference between the reactions of the genders to the issues affecting their respective countries. We also want to contrast the Twitter reaction in countries that have been heavily affected with that in countries that have been relatively less affected, for example by comparing the Twitter reactions in Russia and the USA.

5 LITERATURE REVIEW
We decided to examine the difference in opinions between male and female tweeters on Twitter. For this, we needed to identify the gender of the Twitter users. Twitter, however, does not provide gender information in its public feeds; in fact, gender is not even required for creating a Twitter account. To overcome this problem, we developed a methodology to first extract the first names of users and then, using Twitter libraries, we were able to identify the gender for over 30 percent of tweets.

Prominent work has been done on detecting gender on Twitter. Burger et al. built a large multilingual dataset labeled with gender based on various attributes pulled from Twitter user


accounts. Apart from exploiting the metadata, they also performed a large-scale assessment using Amazon Mechanical Turk [6]. Ciot et al. also performed interesting work on inferring gender from tweets in languages other than English [5]. We, however, restricted our work to English, as we could not retrieve a substantial number of tweets in other languages within the duration of our project.

For our work, it is important to be able to take a directed approach in order to perform "sentiment analysis" on tweeters' reaction to the Panama issue. Twitter's restriction on tweet size compounds the problem of detecting sentiment on Twitter, as does the fact that tweeters usually use internet slang in their tweets. However, prominent work by Agarwal et al. [7] shows that it is still possible to accurately detect emotion in tweets.

Paul Ekman, in his prominent work on identifying emotions, talks a great deal about understanding human emotions by classifying them into 6 categories [8] [9]. Based on his work, we initially analyzed tweets and identified various detectable emotions. Based on this information, we established the six baseline emotions we detect in our work.

One of our main motivations for working on this topic was that no work had been done in this area, since the issue is still very fresh, and that Twitterati are increasingly voicing their opinions about it. This was an interesting situation in which to assess gender and sentiment in the tweets. Though the output is expected to be largely negative, we wanted to analyze the finer nuances in the negative emotions.

6 IMPLEMENTATION
6.1 Classification on Sentiment & Analysis
After cleaning tweets to remove retweets, the data is classified using NCR lexicons. We perform the sentiment analysis, classifying tweets using a Bayesian analysis. Sentiments are mapped across six variants:
• Anger
• Happy
• Disgust
• Shock
• Fear
• Sad

We classify the tweets based on 3 polarities:
• Positive
• Negative
• Neutral

Finally, the tweet, emotion, and polarity are combined in a single dataframe. Here, we use the INRIA train data to learn Positive, Negative and Neutral words and test our program with the text corpus generated from the tweets. Before analyzing the corpus, we remove retweets, stop words, numbers, whitespaces and non-ASCII characters from the text corpus.

Fig. 4. Histogram on the polarities of Tweets.

6.2 Classification on Gender
We employ the 'qdap' package and use its 'name2sex' function to identify the gender of the Twitter user based on their first name. 'qdap' (Quantitative Discourse Analysis Package) is an R package designed to assist in quantitative discourse analysis. The package stands as a bridge between qualitative transcripts of dialogue and statistical analysis & visualization. As we know, not all Twitter users provide their real names, and hence we had to filter the data to condense it to a reliable list of tweets with genuine names. The function classifies the names as Female or Male. Using this data, we are able to study the tweeting habits and trends based on gender.

6.3 Classification on Country
We use a separate set of crawled data for classification based on country. For this classification, we pull only those tweets which have country-specific information in the tweet. This is present in the field "tweet.place". The "place" field contains various sub-fields like coordinates, country, place and place-type which give more information about the location. We used these tweets for the country-specific sentiment and emotion analysis. We used a combination of data crawled through R and Python for this, removed retweets, cleaned the tweets by eliminating stop words and then performed an analysis using the above-mentioned implementation for sentiment and emotion. In addition, we implemented a visualization using the 'OpenStreetMap' API v0.6 to provide a color-coded visualization of the world map on the basis of the sentiment analysis of a specific country. We also performed a separate analysis for sentiment classification on the basis of gender and visualized it on a map.
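As a rough illustration of the cleaning, polarity and per-country steps described in this section, the following minimal Python sketch may help. It is not the code used in the project: the word lists are a tiny hypothetical lexicon (not the INRIA or NCR data), the polarity rule is a simple voter rather than the Bayesian classifier, and the sample tweets are invented dictionaries that only mimic the "text" and "place" fields of a tweet.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy lexicon and stop-word list, for illustration only.
POSITIVE = {"good", "honest", "justice", "transparent"}
NEGATIVE = {"corrupt", "fraud", "scandal", "evasion", "shame"}
STOPWORDS = {"i", "me", "my", "the", "a", "an", "is", "of", "to", "and", "in"}


def clean(text):
    """Drop non-ASCII characters, URLs, mentions, numbers, punctuation and stop words."""
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)   # URLs and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)            # numbers and punctuation
    return [word for word in text.split() if word not in STOPWORDS]


def polarity(tokens):
    """Simple voter: the class with more lexicon hits wins, ties are neutral."""
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def polarity_by_country(tweets):
    """Count polarities per country, skipping retweets and tweets without a place."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        text = tweet.get("text", "")
        place = tweet.get("place") or {}
        country = place.get("country")
        if text.startswith("RT ") or not country:
            continue
        counts[country][polarity(clean(text))] += 1
    return {country: dict(c) for country, c in counts.items()}


# Invented sample data shaped like tweets with a "place" field.
sample = [
    {"text": "Corrupt elites and tax evasion! #PanamaPapers", "place": {"country": "Iceland"}},
    {"text": "RT @x: scandal", "place": {"country": "Iceland"}},
    {"text": "Good to see justice at work http://t.co/x", "place": {"country": "India"}},
    {"text": "Reading about #panamapapersleak", "place": None},
]
result = polarity_by_country(sample)
print(result)
```

The per-country counts produced this way are the kind of summary that could then be fed to a color-coded map; swapping the voter for a trained classifier would only change the `polarity` function.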


Fig. 5. World Map Visualization on the polarities of Tweets.

Fig. 6. Trend Analysis of Tweet Sentiments.

7 RESULTS
In this project, we see that overall, the reaction of Twitter users reflects a negative and critical trend on a global scale. In terms of specific geography, countries which are directly affected by the scandal were found to have a very critical reaction, whereas those not affected directly were not as scathing. With respect to gender, we found that males were predominantly more vocal in voicing their opinion about the scandal, accounting for about 65% of the tweets, whereas women were relatively less accusatory, making up the remaining 35% of the tweets. Thus, we found that men tend to tweet more than women on a particular trend over a given period of time, leading by almost twice the number.

7.1 Sentiment Visualization of Male population across the globe

Fig. 7. Sentiment Visualization of Male population across the globe.

7.2 Sentiment Visualization of Female population across the globe

Fig. 8. Sentiment Visualization of Female population across the globe.

8 CONCLUSION
From this study, we have found that there has been a never-before-seen reaction from Twitter users in terms of the sheer frequency of tweets on a global scale. Likewise, the data shows that the affected countries have been condemnatory of the accused. However, we see a slight discrepancy between the overall results and the gender-specific results. This is because less than 30% of the Twitter users' gender is available and the current gender predictors are not as accurate as desirable. This leads to a possible disparity in the


two visualizations. Also, Twitter data is available going back only to the last fifteen days. Hence, data collected during the peak of the scandal gives the most accurate reflection of the analysis, whereas tweets scraped a month later might provide a different view. We can also conclude that more and more countries are witnessing an increased online presence from their citizens, and globally we are moving towards a more connected world. At the same time, we witness the power of social media, which can unite citizens and even force regimes out of power, as we have seen in Iceland.

9 FUTURE WORK
• Moving forward, we hope to expand the purview of the project, not just to Twitter, but also to other social media sites such as Facebook, Pinterest, Instagram and Quora. In this approach, we can leverage the features provided by the 'SocialMediaLab' library, which provides an aggregated platform to connect with multiple social networking sites simultaneously.
• At the same time, we would like to improve the gender prediction by using sources of data that have a clearly defined gender field, such as Facebook. We would like to improve the gender inference from Twitter usernames so that we can make a better analysis of the data we collected.

10 ACKNOWLEDGEMENT
We would like to thank Professor Muhammad Abdul-Mageed for encouraging us to take up this project and for continuously pushing us to achieve more than what we set out for. Thank you for your constant feedback while we were progressing through this project. You have inspired us to do more than just understand the course material.

REFERENCES
[1] Juliette Garside, Holly Watt, and David Pegg. The Panama Papers: how the world's rich and famous hide their money offshore, Apr 2016.
[2] Panama Papers - all articles by Süddeutsche Zeitung.
[3] Richard Bilton. Panama Papers: Mossack Fonseca leak reveals elite's tax havens.
[4] Wikipedia: Panama Papers, May 2016.
[5] Morgane Ciot, Morgan Sonderegger, and Derek Ruths. Gender inference of Twitter users in non-English contexts.
[6] John D. Burger, John Henderson, George Kim, and Guido Zarrella. Discriminating gender on Twitter.
[7] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. Sentiment analysis of Twitter data.
[8] P. Ekman. Cognition & emotion: An argument for basic emotions. Lawrence Erlbaum Associates Limited, 1992.
[9] P. Ekman. Handbook of cognition and emotion: Basic Emotions. Wiley & Sons, 1999.
