<<

Web Scraping versus Twitter API: A Comparison for a Credibility Analysis

Irvin Dongo Yudith Cadinale Ana Aguilera Universidad Católica San Pablo Universidad Católica San Pablo Facultad de Ingeniería, Escuela de Arequipa, Perú Arequipa, Perú Ingeniería Informática, Universidad Univ. Bordeaux, ESTIA INSTITUTE Universidad Simón Bolívar de Valparaíso OF TECHNOLOGY Caracas, Venezuela Chile Bidart, France [email protected] [email protected] [email protected] Fabiola Martínez Yuni Quintero Sergio Barrios Universidad Simón Bolívar Universidad Simón Bolívar Universidad Simón Bolívar Caracas, Venezuela Caracas, Venezuela Caracas, Venezuela [email protected] [email protected] [email protected]

ABSTRACT ACM Reference Format: Twitter is one of the most popular information source available Irvin Dongo, Yudith Cadinale, Ana Aguilera, Fabiola Martínez, Yuni Quin- on the Web. Thus, there exist many studies focused on analyzing tero, and Sergio Barrios. 2020. versus Twitter API: A Com- parison for a Credibility Analysis. In The 22nd International Conference on the credibility of the shared information. Most proposals use ei- Information Integration and Web-based Applications & Services (iiWAS ’20), ther Twitter API or web scraping to extract the to perform November 30-December 2, 2020, Chiang Mai, Thailand. ACM, New York, NY, such analysis. Both extraction techniques have advantages and USA, 11 pages. https://doi.org/10.1145/3428757.3429104 disadvantages. In this work, we present a study to evaluate their performance and behavior. The motivation for this research comes 1 INTRODUCTION from the necessity to know ways to extract online information in order to analyze in real-time the credibility of the content posted Social network platforms, such as Twitter, Facebook, and Instagram on the Web. To do so, we develop a framework which offers both have considerably increased their number of users in last years. alternatives of and implements a previously pro- These platforms share contents, opinions, news, and sometimes fake posed credibility model. Our framework is implemented as a content. In particular, Twitter is a worldwide social network which Chrome extension able to analyze tweets in real-time. Results re- has more than 600 million users and it is one of the most widely used port that both methods produce identical credibility values, when platform during relevant events [13], such as natural disasters [14], a robust normalization process is applied to the text (i.e., tweet). brands and products [25], and presidential elections [6]. Moreover, concerning the time performance, web scraping is faster However, the information shared on Twitter, and in other social than Twitter API, and it is more flexible in terms of obtaining data; networks, is not completely reliable [31]. The information posted however, web scraping is very sensitive to changes. on social networks must be reliable since their content can help people during crisis situations, influence the crowds, and even can be useful as a mean to help companies in decision making. Thus, CCS CONCEPTS there exist many studies focused on analyzing the credibility of the • General and reference → Reliability; • Information systems shared information [1, 3, 9, 13]. → Social networks; Data extraction and integration. In social networks, the credibility study is affected by different factors [9], such as: (i) the veracity of the text, content with mis- KEYWORDS spellings and bad words; (ii) users on the network who generated content; (iii) the quantity of data related to the information to be API, Web Scraping, Twitter, Credibility validated. As more data is available, more features can be extracted and thus, a better analysis can be performed. In the state-of-the-art, two main extraction methods have been used: (i) web scraping, which consists of the website (HTML) to obtain data by Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed using the tags; and (ii) API, which is an interface provided by social for profit or commercial advantage and that copies bear this notice and the full citation media platforms to retrieve specific information. on the first page. Copyrights for components of this work owned by others than ACM This work presents a comparison among the two data extraction must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a techniques. To do so, we extend an application that instantiates our fee. Request permissions from [email protected]. credibility analysis model, both presented in [9]. The credibility iiWAS ’20, November 30-December 2, 2020, Chiang Mai, Thailand model is adaptable to various social networks and it is based on © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-8922-8/20/11...$15.00 several features, such as verified account, number of following and https://doi.org/10.1145/3428757.3429104 followers, to compute three credibility measures: Text, User, and iiWAS ’20, November 30-December 2, 2020, Chiang Mai, Thailand I. Dongo, Y. Cardinale, A. Aguilera, F. Martínez, Y. Quintero, S. Barrios

Social credibilities. The application is extended by updating the extraction include different methods or a combination of them. Next web scraping method and adding the Twitter API option. section describes these data extraction methods. In order to compare both data extraction techniques, an experi- mental evaluation is presented. Three different languages (Spanish, 2.1 Data Extraction Methods English, and French) are used to evaluate their impact on the Text Nowadays, data extraction from web sources is a vital task for most credibility. Moreover, different types of text such as short, long, of business process, researches, studies, and others. The process use of bad words, misspelling and emoticons are analyzed. Social of data extraction consists of obtaining relevant data or credibility is also evaluate by using two types of accounts: common useful for diverse purposes. Three well-known methods have been accounts (less than one thousand of followers) and famous accounts applied for this: web scraping, , and manual extraction. Web (more than nine hundred thousands of followers). Additionally, the scraping and APIs are automated techniques and the most practical execution time to extract the features is also reported. ways of data harvesting [24]. They allow to collect data from various Experiments show that a robust normalization process on the text website pages and repositories, at a high speed and accurately. The obtained by the extraction methods, produces identical credibility data is then saved and stored for further use and analysis. Manual results. Also, the language has no impact on the credibility, nor extraction is more susceptible to human errors and time consuming. does the type of text. Moreover, the number of followers obtained by the extraction methods have a minor difference for famous 2.1.1 Web Scraping, use and limitations. Web can accounts, since the number of followers is constantly growing in be defined as the process of extracting and combining contents of real-time. Additionally, data extraction with web scraping is faster interest from the Web in a systematic way. In such a process, a soft- than with Twitter API, since for the former, only local extraction (in ware agent, a web robot or a script, mimics the browsing interaction the Twitter website) is needed, while for the API, a local extraction between the web servers and the human in a conventional web to obtain the user_id and tweet_id is required to latter perform traversal. Step by step, the robot/script accesses as many a remote API request. as needed, parses their contents to find and extract data of interest This paper is organized as follows. First in Section 2, the topics and structures those contents as desired [12]. Web scrapers are use- related to this work such as data extraction methods and the social ful when retrieving and processing large amounts of data quickly network Twitter, are described. Data extraction techniques are from a specific website. Thus, if the information is displayed ona explained in Section 3. The credibility model used in this work is browser, it can be accessible via a robot/script to extract the data explained in Section 4. In Section 6, we present an experimental and store them in a for future analysis and use [20]. evaluation and a discussion about both extraction methods. Finally, Web scraping is used commonly on website pages that use we conclude in Section 7. markup languages such as HTML or eXtensible HyperText Markup Language (XHTML). In this case, scraping consists on parsing hy- 2 SOCIAL NETWORKS AS INFORMATION pertext tags and retrieving plain text information embedded onto SOURCES: PRELIMINARY them. The web data scraper establishes communication with the Since the creation of social networks, they have become a new target website through the HTTP protocol and extracts the contents channel for communication and socialization [2]. The types of in- of interest. Some matching could be necessary formation that are exchanged on social media platforms range from along with additional logic [12]. The network speed may be a limi- personal contexts, with the exchange of personal conversations tation or disadvantage for web scraping, since it affects when and through messaging, expressions of feelings, photographs, videos, to how the data is displayed. Another problem regarding this method official and institutional contexts. Hence, social networks become is the often changes of the web pages format. Web scraping involves popular sources of real-time information. In 2020, among the most site-specific programming and does not comply with expectable popular social networks in terms of the number of users are Face- changes in the HTML source [12]. When this happens, the whole book with 2,449 million, YouTube with 2,000 million, Instagram script that gives life to the scraper must be updated and adapted to with 2,000 million, Twitter with 340 million, and other large Chi- the new format or layout of the information. Web scraping mainte- nese social networks, such as QQ/Qzone, Sina Weibo, and Baidu nance is critical and could involve time consuming programming Teiba, that together accumulate more than 1,000 million followers1. tasks. Web scraping developers should take in consideration legal Social networks have a considerable influence on the formation and policy issues about the information they are extracting to pre- of opinions due to their expansion and dissemination capacity. vent copyright infringement [12]. There is no difference between However, as well as they circulate correct content, they can be a visiting or scraping a website, in both cases the user is a guest and way to transmit fake news and rumors. For this reason, the misuse thus, he or she is not the owner of the data extracted. of these platforms for harmful activities and the dissemination of 2.1.2 API as a tool, use and limitations. An API is a component disinformation is a relevant issue. of object-oriented programming languages that allows developers Due to the importance of the information, which cannot be to build software for a particular application through a reference underestimated at the same time it is generated, many researchers program library [5]. The API is prescribed by a device’s operating have studied the credibility on social networks, mainly based on system or an application program in which a requester (another metadata provided by themselves [2, 31]. The techniques for data device or a client user) can make requests expecting responses from 1https://wearesocial.com/blog/2020/01/digital-2020-3-8-billion-people-use-social- them. APIs facilitate interaction between different software pro- media grams and the access to their services. It includes the specification Web Scraping versus Twitter API: A Comparison for a Credibility Analysis iiWAS ’20, November 30-December 2, 2020, Chiang Mai,Thailand of data structures, protocols, object classes, and runtimes to com- information posted on Twitter can be spread quickly in contrast to municate the consumer with the resources offered by the API22 [ ]. other social networks. Developers can build new classes or extend existing ones to add Twitter API provides three types of products to extract data from new features or functionalities. A client API is called through an tweets and accounts: Standard, Premium, and Enterprise. Each one endpoint, which is a component that listens when a request is being has a variety of endpoints to extract the information posted on made from the client side of the communication to the server side Twitter through requests. Table 1 shows certain limitations and via HTTP, expecting a response to be returned. differences among the products offered by Twitter. Depending on Concerning social networks and web information sources, many the developer or project needs, the information provided by Stan- of them do not offer an API to access the available information, for dard API may be sufficient, but the negative point of this product several reasons [20], such as: (i) the data that is wanted is small is the access only to the last 7 days. For companies and important or uncommon; (ii) the source does not have the infrastructure or research projects, Twitter offers other products that can be adapted technical ability to create an API; (iii) the data is valuable or pro- to customer needs such as Premium, where a subscription of $99, tected and not intended to be spread widely; and (iv) even when has 100 request per month, 500 tweets per request and a full his- an API does exist, there may be request volume and rate limits; tory access, while for $149, 500 requests per month, 500 tweets per also, the types and format of the data that it provides might be request, and last 30 days of access are provided. A similar scenario insufficient for the purpose. Furthermore, there are limitations on can be observed for Enterprise product, which can be customized the API that include the rejection of access, if the use of the infor- according to the requirements of the users. mation is not enough or properly demonstrated. Data protection Not all APIs offered by social networks are feasible for the extrac- laws related to privacy and most of TOS (Terms of Service) limit tion of their data. However, it has been demonstrated that Twitter also their access [9]. API is simple and easy to use; only knowing the tweet ID or user ID, it is possible to obtain a variety of information. Twitter has 2.1.3 Manual extraction. Other extraction method includes the different products to get their data and its structure is less complex manual or human extraction of features for credibility assessment. than other social networks [15]. Twitter provides only one profile It consists on the perception of credibility of users based on visible and the user could choose to be a private or a public. characteristics of posts. The study presented in [10], describes an experimental design to test whether the Twitter verification mark contributes to perceptions of information and account credibility, 3 RELATED WORK among organizations of news. Authors show also how account In this section, we describe some studies that have based their ambiguity and account congruence with political beliefs determine credibility analysis on Twitter either using web scraping or the API. this relationship. Results of this study suggest that little attention is More related studies to our work are the ones that compare both paid to the verification mark when judging credibility, even when extraction techniques in any context on social networks. We found little other information is provided about the account or the content. few of them, which are presented at the end of this section. Instead, account ambiguity and congruence dominate credibility assessments of news organizations. Other study in this sense is pre- 3.1 Web Scraping sented in [27], which investigates if the account verification affects In a previous study presented in [9], we describe a generic frame- the believe content of tweets. Authors found that - in the context work to calculate the credibility of several social networks. The of unfamiliar accounts — most users can effectively distinguish credibility model proposed in that work, is based on attributes of between authenticity and credibility. The presence or absence of the post and the user account to measure three levels of credibil- an authenticity indicator has no significant effect on willingness to ity: text, user, and social impact. We present an implementation to share a tweet or take action based on its contents. analyze credibility in Twitter in real-time, as Google Chrome ex- Following section describes Twitter as an information source, tension application. The analysis of the texts (i.e., tweets) is further which is used in this study to analyze the tweet credibility. done through filters that detect SPAM, bad words, and misspelling. Attributes such as followers, following, the joined year, and the veri- 2.2 Twitter fied account, are considered to evaluate user and social credibility. Twitter has demonstrated to be one of the most populated micro Due to Twitter API limitations, web scraping was used to extract blogging site in the world, with hundreds of millions of users as an from the web pages of Twitter all these attributes. In this paper, we excellent spread information platform in real-time [2]. Twitter is a extend that implementation, by updating the scraper and including social network in which users share text stories, with a maximum the use of the Twitter API to perform the comparative evaluation of 280 characters, and they can comment or retweet (i.e., by sharing analysis of both extraction methods. the publication of another user). Hoaxy [23] is a web platform for tracking social news sharing. Twitter defines itself as "what is happening in the world and The system collects data from two main sources: news websites the issues that people are talking about." It is currently one of the and social media, by using different technologies (i.e., web scraping, most used social networks by the media due to its flexibility of use web syndication, and, where available, APIs of social networking for both users and researchers. Twitter is defined as a news media platforms) to populate a dataset over the course of several months. channel as much as a social platform, since the relation between The extracted data is focused on Twitter content, considering URLs users does not have to be bi-directional [2]. This platform is self- and the social aspects to track the activity of the user and the tweet. publishing based on the immediacy of users’ messages. Thus, the This dataset is analyzed considering the temporal relation between iiWAS ’20, November 30-December 2, 2020, Chiang Mai, Thailand I. Dongo, Y. Cardinale, A. Aguilera, F. Martínez, Y. Quintero, S. Barrios

Table 1: Twitter Products

Product Price per month Level of usage Tweets per request Frequency Time frame Standard Free - Depends on endpoints 15 request per 15 minutes last 7 days From $149 From 500 request/month 500 10 RPS, 60 RPM Last 30 days Premium From $99 From 100 request / month 500 10 RPS, 30 RPM Full history APIs allow you to Customized with Each customer has a defined - retrieve up to 500 results per response Last 30 days Enterprise predictable pricing rate limit for their endpoints for a given timeframe Customized with Each customer has a defined - Default 120 RPM Full history predictable pricing rate limit for their endpoints the spread of misinformation and fact checking, and the differences component; (ii) a credibility classifier engine; (iii) a user experience in how users share them. component; and (iv) a feature rank algorithm. They use the Twitter A systematic methodology is presented in [18], aimed at min- API to extract the features used by the system. ing language features like people’s opinion, find witness accounts, Considering the topic feature, a technique to enhance the ability derive underlying belief from messages, use sourcing, network prop- of social network users to identify relevant sources of information agation, credibility, and other user and meta features to debunk (relevant, expert, and useful users to follow) for a given topic is rumors. Due to Twitter API does not let them track the propaga- proposed in [7]. This work generates a ranked list of relevant users tion of retweets in as much detail, author use a web scraper to get in combining a basic text search with an analysis of the social struc- full history and to download all tweets automatically. Their sys- ture of the network. The Twitter API is used to execute this search. tem continues monitoring the rumor event and generates dynamic In [8], the authors propose an automatic method for assessing the real-time updates based on any additional information received. credibility of a set of tweets related to “trending” topics. The tweets A web interface framework, implemented as a web plug-in sys- were collected using Twitter Monitor2. The dataset was manually tem is proposed in [26]. The aim is to analyze, in real-time, the tagged as credible or not credible. After, a supervised classifier was credibility of tweets regarding to a specific topic. Only the text of trained to predict credibility levels on Twitter events. Also, ana- each tweet is analyzed to be classified as being “entailment”, “neu- lyzing the topic feature, TwitterBOT [19] assess the credibility of tral”, or “contradiction” with respect the topic. The system shows a tweets. The data collecting is based on gathering real tweets posted list of news information related to the topic, thus users can decide on Twitter on one particular subject (nature environment preser- the veracity of the tweet in question. The Twitter API is used to vation). A manual tagging is done on each tweet in the dataset as collect tweets, web scraping is used to get the URLs referenced in "highly credible", "highly not credible", "neutral", and "controversial". tweets, and the Bing news API is used to find articles and retrieve A supervised learning model is developed using a RandomForest news headlines related to the topic. classifier to execute an automated credibility assessment. Another work related to topics is the one proposed in [21]. Cred- 3.2 Twitter API ibility is assessed calculating the ratio of the positive opinions to A real-time system to calculate the credibility of tweets was develo- all opinions about a topic. A sentiment analysis is performed us- ped in [13]. Authors collect data from Twitter streaming API for a ing a semantic orientation dictionary to identify the opinions as set of predefined keywords in the context of six crisis events ofthe positive and negative. This work considers too user’s knowledge world during 2013. A tweet downloaded from Twitter API contains (expertise) in the information credibility assessment. Authors do a series of fields in addition to the text, such as posting dateand not declare that Twitter API was used to collect the tweets, but followers/following of the user at the time of the tweet. This work their system architecture shows a collector module from Twitter. made a total of 1’300,000 API requests. Another real-time work In [28], a framework for credibility analysis on Twitter data, with is CredFinder [1], which consists on a front-end in the form of a disaster situation awareness is proposed. This framework is able Chrome extension and a web-based back-end. The former collects to calculate topic-level credibility (i.e., emergency situations), in tweets in real-time from a Twitter search or a user-timeline page real-time, by analyzing the text, linked URLs, number of retweets, and the latter analyzes the collected tweets, assesses their credibility, and geographic information extracted from both post text and ex- and computes a credibility score. Using the Twitter streaming API, ternal URLs. Thus, an event with a higher credibility score indicates tweets and their metadata are obtained. In the same context of real- that there are more tweets, more linked URLs, and more retweets time applications, a framework for credibility analysis is described mentioning this event. Data is collected through Twitter API, to in [17]. This framework considers text and user credibility, aimed to get the information of the tweets and Google Maps Geocoding API identify fake users and fake news, based on neural network models. to obtain geolocalization information. The Twitter API is used to retrieve information regarding retweets, In a recent context of diseases, the work presented in [29] iden- favorites, and date of post. Also, the text is analyzed to count the tifies low-credibility sources. Authors collect 570 low-credibility number of words, number of characters, identify stop-words, etc. sources, labeled as: (i) low-credibility; (ii) “Black” or “Red” or “Satire”; A credibility analysis system for assessing information credibil- (iii) “fakenews” or “hyperpartisan”; or (iv) “extremeleft” or “ex- ity on Twitter to prevent the proliferation of fake information is tremeright”. They use the API from the Observatory on SocialMedia proposed in [3]. This work measures and characterizes the content to collect tweets in the Covid-19 context, and the API to extract the and sources of tweets. For that, authors designed an automated 2Twitter monitor was an on-line monitoring system which detected sharp increases classification system with four components: (i) a reputation-based (“bursts”) in the frequency of sets of keywords found in messages. Web Scraping versus Twitter API: A Comparison for a Credibility Analysis iiWAS ’20, November 30-December 2, 2020, Chiang Mai,Thailand

URL, combined with regular expressions to extract any URL-like SPAM Filter: SPAM messages are usually unwanted advertising strings from the tweet text. For retweets, they include the URLs in messages, which usually use hyperbolic language, excess capitaliza- the original tweets using the same method. They also detect bots tion, and accentuation. If the analyzed text has these characteristics, to reduce the credibility of the tweet using BotometerLite [30]. its credibility level may decrease. Def. 4.1 introduces the SPAM The previous works are proposed for English language; however, filter function. the Arabic language has been also studied. An automatically mea- Definition 4.1. SPAM Filter (푖푠푆푃퐴푀). Given a text of a tweet, suring the credibility of Arabic News content published in Twitter is 푡.푡푒푥푡, the SPAM filter is a function, denoted as 푖푠푆푃퐴푀(푡.푡푒푥푡), that presented in [4]. Authors select a set of features for evaluating Twit- returns a measure ∈ [0, 100], indicating the probability of 푡.푡푒푥푡 being ter credibility based on tweet content and author. The extraction a SPAM, defined as: method is the Twitter API. The features used are similarity with ver- 푖푠_푆푃퐴푀 (푡.푡푒푥) = 푑푒푡푒푐푡_푠푝푎푚(푡.푡푒푥푡) ified content, inappropriate words, linking to authoritative/credible, 3 where 푑푒푡푒푐푡_푠푝푎푚 is an algorithm that analyzes 푡.푡푒푥푡 and returns news sources, account verification, TwitterGrader.com degree. The the probability of 푡.푡푒푥푡 being a SPAM. system architecture of this work consists of four main components: text pre-processing, features extraction and computation, credibility Bad Words Filter: It is used to detect those publications that have calculation, and credibility assignment and ranking. a high content of bad words, in which case their credibility should None of these studies perform a comparative evaluation of both be reduced. Def. 4.2 presents how this filter is used to calculate the extraction techniques; however, they are a clear evidence of the proportion of bad words in the text. need of having different options to extract different data todo Definition 4.2. Bad Words Filter (푏푎푑_푤표푟푑푠). Given a text of credibility analysis on Twitter. a tweet, 푡.푡푒푥푡, the bad words filter is a function, denoted as 푏푎푑_푤표푟푑푠(푡.푡푒푥푡), that returns a measure ∈ [0, 100], indicating the 3.3 Studies about Comparison proportion of bad words in 푡.푡푒푥푡, defined as: In other context different to credibility analysis, few researches ( ) = − ( 푒푥푡푟푎푐푡_푏푤 (푡.푡푒푥푡) ) × 푏푎푑_푤표푟푑푠 푡.푡푒푥푡 100 푒푥푡푟푎푐푡_푡푤 (푡.푡푒푥푡) 100 have focused on comparing both extraction techniques. With the where 푒푥푡푟푎푐푡_푏푤 and 푒푥푡푟푎푐푡_푡푤 are algorithms that counts the aim of obtaining an unlimited volume of tweets, authors in [16] number of bad words and total words in 푡.푡푒푥푡, respectively. bypass date ranges limitations of the Twitter API, by using web scraping, with faster results. The comparison shows that the total Misspelling Filter: It is used to detect syntax errors in writing text of retrieved tweets with the Twitter API is always less than with the (see Def. 4.3). Credibility will decrease as the number of misspellings Twitter Scrapy4. Authors establish that even though most works use found. the Twitter streaming API for collecting data, a limitation occurs Definition 4.3. Misspelling Filter (푚푖푠푠푝푒푙푙푖푛푔). Given a text when queries exceed rating intervals and time ranges. Authors of a tweet, 푡.푡푒푥푡, the misspelling filter is a function, denoted as manipulate Twitter’s search query URL with the keywords they 푚푖푠푠푝푒푙푙푖푛푔(푡.푡푒푥푡), that returns a measure ∈ [0, 100], indicating are interested in, adding also the range of dates when the tweets the proportion of misspelled words in 푡.푡푒푥푡, defined as: were made, which is a major limitation when using Twitter API. ( ) = − ( 푒푥푡푟푎푐푡 (푡.푡푒푥푡)_푚푝 ) × 푚푖푠푠푝푒푙푙푖푛푔 푡.푡푒푥푡 100 푒푥푡푟푎푐푡_푡푤 (푡.푡푒푥푡) 100 After the results page appear, the HTML is redirected to a where 푒푥푡푟푎푐푡_푚푝 and 푒푥푡푟푎푐푡_푡푤 are algorithms that counts the web scraping engine, where unprocessed data containing tweets is number of misspelled words and total words in 푡.푡푒푥푡, respectively. complemented in order to strip hypertext tags and objects, known as tag - selectors. Results of this study show that using this proposed Based on these three filters, the text credibility is defined as methodology, more tweets are obtained in less significant seconds. shown by Def. 4.4. User decides the importance of each filter. Following section describes the credibility model proposed in [9]. Definition 4.4. Text Credibility (푇푒푥푡퐶푟푒푑). Given a text of a tweet, 푡.푡푒푥푡, the Text Credibility is a function, denoted as 푇푒푥푡퐶푟푒푑(푡.푡푒푥푡), that returns a measure ∈ [0, 100], defined as: 4 CREDIBILITY ANALYSIS MODEL FOR 푇푒푥푡퐶푟푒푑(푡.푡푒푥푡) = 푤푆푃퐴푀 × 푖푠푆푝푎푚(푡.푡푒푥푡)+ TWITTER 푤퐵푎푑푊 표푟푑푠 × 푏푎푑_푤표푟푑푠(푡.푡푒푥푡)+ In order to calculate the credibility of a tweet with web scraping and 푤푀푖푠푠푝푒푙푙푒푑푊표푟푑푠 × 푚푖푠푠푝푒푙푙푖푛푔(푡.푡푒푥푡) Twitter API, we use the credibility model proposed in a previous where 푤푆푃퐴푀,푤퐵푎푑푊 표푟푑푠 , and 푤푀푖푠푠푝푒푙푙푒푑푊표푟푑푠 represent the work [9]. This model considers text, user, and social credibilities, weights that the user gives to each filter, respectively, such as: regardless of the topic treated. The definitions that conform the 푤푆푃퐴푀 + 푤퐵푎푑푊 표푟푑푠 + 푤푀푖푠푠푝푒푙푙푒푑푊표푟푑푠 = 1. model required in this implementation are resumed as shown below. 4.2 User Credibility 4.1 Text Credibility To calculate this credibility measure on Twitter, we consider whether The text credibility uses syntactic analysis techniques through the account is verified or not (Def. 4.5) and the activity time ofthe SPAM, bad words, and misspelling filters: account since its creation (Def. 4.6). Definition 4.5. Verified Account Weight푉푒푟푖푓 ( _푊 푒푖푔ℎ푡). Given 3 Twittergrader.com is a service to retrieve the grade of any Twitter user via its an account of a tweet, 푡.푢푠푒푟, the weight of a Verified Account is a username 4Twitter Scrapy is an open source and collaborative framework for extracting data function, denoted as 푉푒푟푖푓 _푊 푒푖푔ℎ푡 (푡.푢푠푒푟), that returns a measure from websites. https://scrapy.org/ ∈ {0, 50}, defined as: iiWAS ’20, November 30-December 2, 2020, Chiang Mai, Thailand I. Dongo, Y. Cardinale, A. Aguilera, F. Martínez, Y. Quintero, S. Barrios

( 50 푖푓 푡.푢푠푒푟 ℎ푎푠 푏푒푒푛 푣푒푟푖푓푖푒푑, Definition 4.10. Social Credibility (푆표푐푖푎푙퐶푟푒푑). Given a set of Verif_Weight(t.user) = 0 otherwise. social metadata of a tweet, 푡.푠표푐푖푎푙, the Social Credibility is a function, denoted as 푆표푐푖푎푙퐶푟푒푑(푡.푠표푐푖푎푙), that returns a measure ∈ [0, 100], Account Creation Weight ( _ ). Definition 4.6. 퐶푟푒푎푡푖표푛 푊 푒푖푔ℎ푡 defined as: Given an account of a tweet, 푡.푢푠푒푟, the weight of the Account Creation 푆표푐푖푎푙퐶푟푒푑(푡.푠표푐푖푎푙) = 퐹표푙푙표푤푒푟푠퐼푚푝푎푐푡 (푡.푠표푐푖푎푙푢푠푒푟 )+ is a function, denoted as 퐶푟푒푎푡푖표푛_푊 푒푖푔ℎ푡 (푡.푢푠푒푟), that returns a 퐹퐹푃푟표푝표푟푡푖표푛(푡.푠표푐푖푎푙푢푠푒푟 ) measure ∈ [0, 50], defined as: ( ) = 퐴푐푐표푢푛푡_퐴푔푒 (푡.푢푠푒푟) × 퐶푟푒푎푡푖표푛_푊 푒푖푔ℎ푡 푡.푢푠푒푟 푀푎푥_퐴푐푐표푢푛푡_퐴푔푒 (푡.푢푠푒푟) 50 4.4 Total Credibility Level where: With the three credibility measures the credibility level of a tweet • 퐴푐푐표푢푛푡_퐴푔푒 (푡.푢푠푒푟) = 퐶푢푟푟푒푛푡_푌푒푎푟 − 푌푒푎푟_퐽 표푖푛푒푑 (푡.푢푠푒푟) is calculated, as shown in Def. 4.11. • 푀푎푥_퐴푐푐표푢푛푡_퐴푔푒 (푇 ) = 퐶푢푟푟푒푛푡_푌푒푎푟− 푇 푤푖푡푡푒푟_퐶푟푒푎푡푖표푛- _푌푒푎푟 Definition 4.11. Tweet Credibility Level (푇퐶푟푒푑). Given a tweet, where 푇 푤푖푡푡푒푟_퐶푟푒푎푡푖표푛_푌푒푎푟 is the year in which Twitter was cre- 푡, the Tweet Credibility Level is a function, denoted as 푇퐶푟푒푑(푡), that ated (2006). returns a measure ∈ [0, 100], of its level of credibility, defined as: 푇퐶푟푒푑 (푡) = 푤푒푖푔ℎ푡푡푒푥푡 × 푇푒푥푡퐶푟푒푑 (푡.푡푒푥푡) + 푤푒푖푔ℎ푡푢푠푒푟 While the year account was joined (푌푒푎푟_퐽 표푖푛푒푑) is closer to the × 푈 푠푒푟퐶푟푒푑 (푡.푢푠푒푟) + 푤푒푖푔ℎ푡 × 푆표푐푖푎푙퐶푟푒푑 (푡.푠표푐푖푎푙) Twitter creation date (2006), the account credibility is greater. The 푠표푐푖푎푙 where: maximum points obtained by this criterion is 50, since the other 50 is for the verified account weight (푉푒푟푖푓 _푊 푒푖푔ℎ푡). Thus, the user • 푤푒푖푔ℎ푡푡푒푥푡 , 푤푒푖푔ℎ푢푠푒푟 , and 푤푒푖푔ℎ푡푠표푐푖푎푙 represent the weights credibility is calculated as shown in Def. 4.7. that the user gives to text credibility, user credibility, and social credibility, respectively, such as: 푤푒푖푔ℎ푡푡푒푥푡 + 푤푒푖푔ℎ푡푢푠푒푟 + Definition 4.7. User Credibility (푈푠푒푟퐶푟푒푑). Given an account 푤푒푖푔ℎ푡푠표푐푖푎푙 = 1; of a tweet, 푡.푢푠푒푟, the User Credibility is a function, denoted as 푈푠푒푟- • 푇푒푥푡퐶푟푒푑(푡.푡푒푥푡), 푈푠푒푟퐶푟푒푑(푡.푢푠푒푟), and 푆표푐푖푎푙퐶푟푒푑(푡.푠표- ( ) ∈ [ ] 퐶푟푒푑 푡.푢푠푒푟 , that returns a measure 0, 100 , defined as: 푈 푠푒푟- 푐푖푎푙) represent the credibility measure related to the text, the ( ) = ( ) + ( ) 퐶푟푒푑 푡.푢푠푒푟 푉푒푟푖푓 _푊 푒푖푔ℎ푡 푡.푢푠푒푟 퐶푟푒푎푡푖표푛_푊 푒푖푔ℎ푡 푡.푢푠푒푟 user, and the social impact of 푡, respectively.

4.3 Social Credibility The parameters considered in the filters and the weight that The social credibility measures the impact of a tweet on the author’s each filter represents in the final credibility calculation are values social network based on his/her popularity. This credibility is calcu- provided by users. lated based on the influence of the account considering the number of followers (Def. 4.8) and the proportion of followers and following 5 A FRAMEWORK FOR TWITTER calculated by the ratio between them (Def. 4.9). Both definitions CREDIBILITY ANALYSIS are shown as follows. In a previous work, we propose a credibility model for social net- Definition 4.8. Followers Impact (퐹표푙푙표푤푒푟푠퐼푚푝푎푐푡). Given a works [9] (see a briefly description in Section 4).9 In[ ], we also set of social metadata of an account from a tweet 푡 is published, present an implementation of such as model to perform real-time 푡.푠표푐푖푎푙푢푠푒푟 , the social Followers Impact is a function, denoted as credibility analysis on Twitter, based on web scraping and imple- 퐹표푙푙표푤푒푟푠퐼푚푝푎푐푡 (푡.푠표푐푖푎푙푢푠푒푟 ), that returns a measure ∈ [0, 50], de- mented as a Google Chrome extension. In this work, in order to fined as: make the comparative evaluation of both data extraction meth- ( ) = 푡.푠표푐푖푎푙푢푠푒푟 .푓 표푙푙표푤푒푟푠 × ods, we extend the implementation, by updating the scraper, since 퐹표푙푙표푤푒푟푠퐼푚푝푎푐푡 푡.푠표푐푖푎푙푢푠푒푟 푀퐴푋 _푓 표푙푙표푤푒푟푠 50 the Twitter website changed its HTML tags and structure, and by where 푡.푠표푐푖푎푙푢푠푒푟 .푓 표푙푙표푤푒푟푠 is the number of followers of 푡.푢푠푒푟 and 푀퐴푋_푓 표푙푙표푤푒푟푠 is a user-defined parameter. incorporating the use of the Twitter API, as an alternative data extraction technique. Thus, the current version of our framework The maximum value for 푀퐴푋_푓 표푙푙표푤푒푟푠 is a parameter pro- can be configured to use web scraping or Twitter API. The frame- vided by the system user (for this work we consider MAX_follo- work is divided into two modules: the front-end, where the data wers=2 millions). The influence of accounts that exceed the value extraction is performed in the Google Chrome extension, and the of followers is calculated based on that maximum established value. back-end, which calculates the credibility. This perspective allows Definition 4.9. Followers-Following Proportion (퐹퐹푃푟표푝표푟- improving the credibility process in the future, without modifying 푡푖표푛). Given a set of social metadata of an account from tweet 푡 is the extension for users who already have it installed. published, 푡.푠표푐푖푎푙푢푠푒푟 , the Followers-Following Proportion is a func- Figure 1 shows the overflow of the implementation. Five features tion, denoted as 퐹퐹푃푟표푝표푟푡푖표푛(푡.푠표푐푖푎푙푢푠푒푟 ), that returns a measure are extracted by using either web scraping or Twitter API. Then, ∈ [0, 50], defined as: the text, user, and social credibilities are calculated. The current im- 퐹퐹푃푟표푝표푟푡푖표푛(푡.푠표푐푖푎푙 ) = 푡.푠표푐푖푎푙푢푠푒푟 .푓 표푙푙표푤푒푟푠 ×50 푢푠푒푟 푡.푠표푐푖푎푙푢푠푒푟 .푓 표푙푙표푤푒푟푠+푡.푠표푐푖푎푙푢푠푒푟 .푓 표푙푙표푤푖푛푔 plementation of our framework uses the following libraries, taken 5 where 푡.푠표푐푖푎푙푢푠푒푟 .푓 표푙푙표푤푒푟푠 and 푡.푠표푐푖푎푙푢푠푒푟 .푓 표푙푙표푤푖푛푔 are the num- from a JavaScript development platform (NPM ): Simple Spam Filter, ber of followers and following of 푡.푢푠푒푟, respectively. Bad Words, and Nspell, for SPAM, Bad Words, and Miss Spelling Finally, social credibility is calculated according to Def. 4.8 and filters, respectively. Def. 4.9, each one with a maximum weight of 50. Def. 4.10 shows the calculation of social credibility. 5https://www.npmjs.com/ Web Scraping versus Twitter API: A Comparison for a Credibility Analysis iiWAS ’20, November 30-December 2, 2020, Chiang Mai,Thailand

Data To obtain the user_id and tweet_id as parameters for the Twit- Twitter API Extraction Features ter API, the values have to be scraped from the user profile and main API conection homepage timelines, respectively. Table 2 shows the five attributes HTML Scraper used to instantiate the credibility model. The simplicity of Twitter

API syntax with respect to the web scraping is clearly shown in this table; while for Twitter API, a split of the created_at value Text Account age Verified Followings Followers is only needed, for web scraping, a regex and split operations are required. Moreover, we extended the previous version of our framework, Text Credibility User Credibility Social Credibility by adding dictionaries for an automatic language recognition, dic- tionary-en-us, dictionary-fr, and dictionary-es libraries, also from NPM. Using Twitter API, we extracted the same data as Tweet Credibility Level the web scraping extraction method. However, Twitter manages dates, numbers, emoticons, etc., in a different way than how it is shown on the browser, which can affect the text credibility; thus, a Figure 1: Implementation of the Credibility Model normalization process on the text is required. The normalization process consists of removing and modifying some special characters using JavaScript regular expressions. First, In the old Twitter website, there were several easy-to-catch URLs are removed from the text (tweet), as well as mentions, by HTML elements due to its well-described classes such as Profile- detecting them using the regular expressions shown in Table 3. HeaderCard-joinDate, which could be used to obtain the joined Then, hashtags and punctuation marks are located in the text and date. With the new interface, the scraping process is more difficult are deleted from it. Finally, to remove the emoticons, we use the since the class names do not have user-friendly descriptions. For library emoji-strip also provided by NPM. example, to obtain the joined date, we scrap the Twitter website using the following query selector div[data-testid="UserPro- fileHeader_Items"] to obtain the children elements, from where Table 3: Regular Expressions to Normalize a Text a regex is used to filter the word “Joined” in the following way: textContent.match(/(Joined)/)ˆ , where textContent has the Type Regular Expression text extracted from the scraping. For the other cases such as number URL /(https?://[ˆ\s]+)/g Mention /\B@[a-z0-9_-]+\s/gi of retweets and tweet ID, the same solution was used. Hashtag /#$/ For the implementation of the API extraction method, we re- Punctuation /( |‘|!|@|#|$|%|ˆ|&|*|(|)|{|}|[|]|;|:|"|’|<|,|.|>|?|/|\|||-|_|+|=)/g quested permissions to use the Twitter API. To obtain the API Key, a Twitter account using an institutional mail was opened, then it was registered as a developer account in developer.twitter.com 6 EXPERIMENTAL COMPARISON BETWEEN site. A form was filled to request the API Key, explaining the motives WEB SCRAPING AND TWITTER API of the investigation. Once Twitter granted permissions and its API Key was obtained, the data of interest was successfully extracted: To compare web scraping and Twitter API methods, we perform a (i) using user_id to obtain data from the user; (ii) using tweet_id battery of experiments using the credibility model proposed in our to obtain data from the tweet, with an additional parameter called previous work [9]: tweet_mode = ’extended’ to extract more data (i.e., an indication • Test 1 is intended to analyze the use of special characters to not truncate tweets larger than 120 characters). which can affect the credibility (in Text Credibility). • Since the User Credibility only takes into account the verified account and joined date, and the obtained values are iden- Table 2: Data Extraction Attributes tical for both extraction methods, their credibilities are the

Attributes API Scraper same and does not merit an evaluation. Array.from(document.querySelectorAll • Test 2 is to study the Social Credibility, where two different client.get(’statuses/show’, { id: tweetId, Text (’div[data-testid="tweet"]’))[i]. tweet_mode: ’extended’ }).data.full_text Twitter accounts (famous and common) are considered for a children[1].children[1].children[0].innerText client.get(’users/show’, document.querySelector better analysis. Verified { user_id: userId }).data.verified (’svg[aria-label="Verified account"]’) • Test 3 is to apply the whole credibility model to four tweets let x = document.querySelector client.get(’users/show’, (’div[data-testid= of each Twitter account, evaluated in Test 2. Account age { user_id: userId }). "UserProfileHeader_Items"]’).children[i] • Test 4 is the last experiment to compare the execution time data.created_at.split(’ ’).pop() if x.textContent.match(/^(Joined)/) { x.textContent.split(’ ’)[2]} of the two extraction methods. const followingPath = window.location. client.get(’users/show’, pathname + ’/following’ Followings All tests were applied to three different Twitter account lan- {user_id: userId }).data.friends_count document.querySelector(‘a[href= "${followingPath}"]‘).getAttribute(’title’) guages: Spanish, English, and French. Experiments were under- const followersPath = window.location. taken in a shared VPS on Digital Ocean with 1 GB Memory, 20 GB client.get(’users/show’, pathname + ’/followers’ Followers { user_id: userId }).data.followers_count document.querySelector(‘a[href= Disk, and Ubuntu 14.04.4 x64. The VPS is located in New York City. "${followersPath}"]‘).getAttribute(’title’) The values of the features correspond to the date July 28, 2020. iiWAS ’20, November 30-December 2, 2020, Chiang Mai, Thailand I. Dongo, Y. Cardinale, A. Aguilera, F. Martínez, Y. Quintero, S. Barrios

Table 4: Test1: Text Credibility - Spanish

Extraction Bad Miss Text Type Twitter Tweet ID Info SPAM (%) Method Words (%) Spelling (%) Credibility (%) Web scraping 100 100 16.6666 72.5 Short @yuniquintero XXXXX54127684513792 12 words Twitter API 100 100 16.6666 72.5 Web scraping 100 100 62.9629 95.7692 Long @nanutria XXXXX53653623304192 43 words Twitter API 100 100 62.9629 95.7692 Web scraping 0 100 0 33 SPAM @farmarato XXXXX20036716687360 2 words Twitter API 0 100 0 33 16 words Web scraping 100 94.1176 17.6470 70.8823 Bad Words @yuniquintero XXXXX77135991562240 1 emoticon Twitter API 100 94.1176 100 98.0588 Web scraping 100 100 60 86.8 Misspelling @eldtwiter XXXXX35165294841857 5 words Twitter API 100 100 60 86.8 22 words Web scraping 100 100 94.7368 98.2631 Emoticons @fabi_ad XXXXX45111642243079 2 emoticons Twitter API 100 100 94.7368 98.2631 Text Credibility = 0.34 × SPAM + 0.33 × Bad Words + 0.33 × Miss Spelling

Table 5: Test1: Text Credibility - English

Extraction Bad Miss Text Type Twitter Tweet ID Info SPAM (%) Method Words (%) Spelling (%) Credibility (%) Web scraping 100 100 100 100 Short @elonmusk XXXXX18109651431427 6 words Twitter API 100 100 100 100 Web scraping 100 100 97.2222 99.0833 Long @elonmusk XXXXX75982297051142 37 words Twitter API 100 100 97.2222 99.0833 Web scraping 0 100 86.6666 61.6 SPAM @maheshfantrends XXXXX41467998105601 14 words Twitter API 0 100 86.6666 61.6 Web scraping 0 83.3333 100 60.5 Bad Words @chen_bichan XXXXX51993006546944 6 words Twitter API 0 83.3333 100 60.5 3 words Web scraping 0 66.6666 33.3333 32.9999 Misspelling @gitlost XXXXX31390801793024 14 emoticons Twitter API 0 66.6666 33.3333 32.9999 2 words Web scraping 0 100 100 66 Emoticons @elonmusk XXXXX97906536361987 1 emoticon Twitter API 0 100 100 66 Text Credibility = 0.34 × SPAM + 0.33 × Bad Words + 0.33 × Miss Spelling

Table 6: Test1: Text Credibility - French

Extraction Bad Miss Text Type Twitter Tweet ID Info SPAM (%) Method Words (%) Spelling (%) Credibility (%) Web scraping 100 100 100 100 Short @_rapvibes XXXXX37215263985664 8 words Twitter API 100 100 100 100 Web scraping 0 100 28.8888 42.5333 Long @jml_932 XXXXX21504796459011 43 words Twitter API 0 100 28.8888 42.5333 Web scraping 0 100 56.25 51.56 SPAM @opcrotte XXXXX79367138025473 8 words Twitter API 0 100 56.25 51.56 Web scraping 100 95.2390 90.4761 95.2857 Bad Words @_rapvibes XXXXX31103454228482 21 words Twitter API 100 95.2390 90.4761 95.2857 Web scraping 100 100 91.1764 97.9705 Misspelling @cocopericaud XXXXX11753465520129 33 words Twitter API 100 100 91.1764 97.9705 5 words Web scraping 0 100 100 66 Emoticons @antogriezmann XXXXX91677961048067 2 emoticons Twitter API 0 100 100 66 Text Credibility = 0.34 × SPAM + 0.33 × Bad Words + 0.33 × Miss Spelling

6.1 Test 1: Text Credibility respectively. Moreover, a short or long text does not have impact on In order to study the impact of text retrieval using web scraping and the credibility. For example, the credibility value for long-type text Twitter API, we consider scenarios where different types of tweets is bigger than the one of the short-type text (95.7692% and 72.5%, such as short, long, use of SPAM, bad words and miss spelling, and respectively), in the case of Spanish language, while for English, the emoticons, are presented. credibility value of long-type text is less than the one of short-type Table 4 shows the results obtained for the Spanish tweets. Since a text (95.7692% and 100%, respectively). Bad words and Misspelling normalized process is applied to the text obtained by web scraping filters provided low credibility values for the types related to them, and Twitter API, consisting of removing links, emoticons, and other which proves their functionality. steps (see more details in Section 5), the Text Credibility values are the same for both extraction methods. A similar scenario can be observed for English and French tweets, Table 5 and Table 6, Web Scraping versus Twitter API: A Comparison for a Credibility Analysis iiWAS ’20, November 30-December 2, 2020, Chiang Mai,Thailand

Table 7: Test2: Social Credibility Analysis

Max Extraction Followers FF Social Language Twitter Followers Following Followers Method Impact (%) Proportion (%) Credibility (%) Web scraping 411 266 0.0204 60.7090 30.3647 @yuniquintero 2M Twitter API 411 266 0.0204 60.7090 30.3647 Spanish Web scraping 933,099 277 46.6548 99.9702 73.3126 @presidenciaperu 2M Twitter API 933,100 277 46.6548 99.9702 73.3126 Web scraping 368 851 0.0184 30.1886 15.1035 @chen_bichan 2M Twitter API 368 851 0.0184 30.1886 15.1035 English Web scraping 37,172,852 97 100 100 100 @elonmusk 2M Twitter API 37,172,865 97 100 100 100 Web scraping 571 696 0.0284 43.0670 22.5478 @cocopericaud 2M Twitter API 571 696 0.0284 43.0670 22.5478 French Web scraping 7,015,909 10 100 100 100 @antogriezman 2M Twitter API 7,015,914 10 100 100 100 Social Credibility = 0.50 × Followers Impact + 0.50 × FF Proportion

6.2 Test 2: Social Credibility Table 9: Test 3: Max Followers: 2M - English For this test, we select two Twitter accounts for each language, tak- Total Text User Social Extraction ing into account the number of followers (i.e., common and famous Tweet ID Credibility Credibility Credibility Credibility Method accounts). For Spanish language, we selected @YuniQuintero and (%) (%) (%) (%) @chen_bichan @presidenciaperu; for English, @chen_bichan and @elonmusk; Web scraping 39.9441 95.875 7.1428 15.1197 XXXXX97846390792192 while for French, @cocopericau and @antogriezmann. A value of Twitter API 39.9441 95.875 7.1428 15.1197 Web scraping 41.3466 100 7.1428 15.1197 XXXXX62917602504705 2 million is used for the Max Followers parameter, as we proposed Twitter API 41.3466 100 7.1428 15.1197 Web scraping 40.3266 97 7.1428 15.1197 XXXXX61693427703810 in [9]. Table 7 shows the results obtained for this test. Similarly to Twitter API 40.3266 97 7.1428 15.1197 Web scraping 40.1859 96.5862 7.1428 15.1197 Test 1, the results obtained for web scraping and Twitter API are the XXXXX60908505718784 Twitter API 40.1859 96.5862 7.1428 15.1197 same. However, there is a small difference of followers between web @elonmusk Web scraping 95.6628 97.6428 89.2857 100 scraping and Twitter API, for famous accounts, since this number is XXXXX75982297051142 Twitter API 95.6628 97.6428 89.2857 100 constantly increasing and the tests were performed consecutively Web scraping 84.9042 66 89.2857 100 XXXXX69404874088448 (one after the other). For example, for @elonmusk, the number of Twitter API 84,9042 66 89.2857 100 Web scraping 95.5292 97.25 89.2857 100 XXXXX91044466724866 followers for web scraping was 37’172,852, while for Twitter API Twitter API 95.5292 97.25 89.2857 100 Web scraping 96.4642 100 89.2857 100 was 37’172,865. Note that the web scraping method was performed XXXXX77935580160000 Twitter API 96.4642 100 89.2857 100 before its respective Twitter API test. Total Credibility = 0.34 × Text Credibility + 0.33 × User Credibility + 0.33 × Social Credibility

6.3 Test 3: Tweet Credibility Table 10: Test 3: Max Followers: 2M - French

Total Text User Social Extraction Table 8: Test 3: Max Followers: 2M - Spanish Tweet ID Credibility Credibility Credibility Credibility Method (%) (%) (%) (%) @cocopericaud Total Text User Social Extraction Web scraping 45.6120 94.9230 17.8571 22.5617 Tweet ID Credibility Credibility Credibility Credibility XXXXX72931879002112 Method Twitter API 45.6120 94.9230 17.8571 22.5617 (%) (%) (%) (%) Web scraping 46.648174 97.9705 17.8571 22.5617 @YuniQuintero XXXXX11753465520129 Twitter API 46.648174 97.9705 17.8571 22.5617 Web scraping 43.8218 61.2857 39.2857 30.3647 XXXXX29977786449921 Web scraping 30.2209 49.6551 17.8571 22.5617 Twitter API 56.9846 100 39.2857 30.3647 XXXXX08508903104512 Twitter API 30.2209 49.6551 17.8571 22.5617 Web scraping 41.3969 54.1538 39.2857 30.3647 XXXXX81616200945665 Web scraping 45.4682 94.5 17.8571 22.5617 Twitter API 56.9846 100 39.2857 30.3647 XXXXX67680429191169 Twitter API 45.4682 94.5 17.8571 22.5617 Web scraping 41.9723 55.8461 39.2857 30.3647 XXXXX57961848909826 @antogriezman Twitter API 56.9846 100 39.2857 30.3647 Web scraping 80.9370 64.7307 78.5714 100 Web scraping 41.6846 55 39.2857 30.3647 XXXXX55081208217602 XXXXX72551944265728 Twitter Api 80.9370 64.7307 78.5714 100 Twitter API 34.2046 49.5 39.2857 30.3647 Web scraping 81.3685 66 78.5714 100 @presidenciaperu XXXXX91677961048067 Twitter Api 81.3685 66 78.5714 100 Web scraping 78.2113 75.6842 85.7142 73.3122 XXXXX44070680453122 Web scraping 72.9535 41.25 78.5714 100 Twitter API 83.8895 92.3846 85.7142 73.3122 XXXXX75192480796684 Twitter Api 72.9535 41.25 78.5714 100 Web scraping 78.5426 76.6585 85.7142 73.3122 XXXXX13251207360514 Web scraping 81.3685 66 78.5714 100 Twitter API 84.6087 94.5 85.7142 73.3122 XXXXX04159384465408 Twitter Api 81.3685 66 78.5714 100 Web scraping 78.0637 75.25 85.7142 73.3122 XXXXX92509447139328 Total Credibility = 0.34 × Text Credibility + 0.33 × User Credibility + 0.33 × Social Credibility Twitter API 85.0762 95.875 85.7142 73.3122 Web scraping 78.2911 75.9189 85.7142 73.3122 XXXXX91373758885891 Twitter API 83.8213 92.1842 85.7142 73.3122 Total Credibility = 0.34 × Text Credibility + 0.33 × User Credibility + 0.33 × Social Credibility English, and French, respectively. As previous tests, web scraping and Twitter API extractions methods produce the same credibility By using the six previous Twitter accounts, we select randomly values. Moreover, we can observe that famous accounts have bet- four of the most recent tweets for each account. We calculate the ter credibility than the common ones due to their user and social tweet credibility and we report the text, user, and social credibil- credibilities. The social credibility for famous account is around ities. Table 8, Table 9, and Table 10 show the results for Spanish, 100% avg., while for the other ones is 22% avg. The famous accounts iiWAS ’20, November 30-December 2, 2020, Chiang Mai, Thailand I. Dongo, Y. Cardinale, A. Aguilera, F. Martínez, Y. Quintero, S. Barrios are all verified which has a huge impact on the total credibility perform the comparative evaluation. After finished the experiments, (16.67%). we wanted to validate some of the results, but the Twitter HTML format changed again and we could not do it. This is the main 6.4 Test 4: Performance disadvantage of web scraping, data can be easily accessed but the Finally, we measured the processing time of the two extraction extraction functions have to follow the current HTML tags. methods, from the initial request until all five features for the cred- On the other hand, Twitter API manages dates, numbers, and ibility model are obtained. The average of 10 repetitions for each emoticons in a different way than how they are shown in the tweet is reported in Table 11. Results show that web scraping is browser; however, by applying a normalization process on the 40 times faster than Twitter API since to obtain the user_id and text, we are able to obtain identical texts and thus, the same text tweet_id, web scraping is also used; thus, a Twitter API call con- credibility values. This is one of the most important steps during sists of a local processing (web scraping) and an API request. the extraction of data, since the text is more complex than the oth- Following section discusses about the differences between the ers aspects that matter for credibility (e.g., verified account is just a two techniques, web scraping and Twitter API. Furthermore, the boolean value, account age is just an integer, see Table 2) due to the results obtained of the performed tests are also analyzed. presence of special characters. Test 1 shows the effectiveness of the text normalization process. The text credibility values are identical Table 11: Extraction Time Comparison for both extraction methods. For Test 2, a minor different among the number of followers, for Extraction famous account, can be observed but irrelevant for the final score. Twitter Tweet ID Time (ms) Method This behavior is caused by the rapid and constant growth of the Web scraping 4.18 @equipedefrance XXXXX66822152019968 followers through time. For example, @elonmusk had 37’172,852 for Twitter API 222.81 Web scraping 5.74 web scraping and 37’172,865 for Twitter API. Test 3 confirms the @elonmusk XXXXX71909906702336 Twitter API 203.25 previous tests and the total credibility conformed by the text, user Web scraping 5.76 @presidenciaperu XXXXX66760141852673 and social credibilities, are identical for both extraction methods. Twitter API 199.56 Finally, Test 4 demonstrated that web scraping is faster than Twitter API since the latter is composed by a web scraping process to obtain the user_id and tweet_id, and the API call. 6.5 Discussion Data extraction methods have been wide used in social networks, in special on Twitter. When there are API limitations, some works 7 CONCLUSIONS AND FUTURE WORK come up with alternatives and bypasses the APIs, using web scrap- In this paper, we present a comparison between two data extraction ing to gather the data needed. techniques on the Web, web scraping and API, by using an existing Web scraping is more flexible than API extraction because itcan real-time credibility model applied to Twitter. To do so, we imple- be used on most web pages, not just those that offer APIs11 [ ]. Web mented a framework as an extension of Google Chrome consisting scraping would not be necessary if each website provides an API of a front-end and a back-end, which performs the credibility anal- to share their data in a structured format. However, some websites ysis in real-time and can be configured to use either web scraping o have APIs, but they are restricted by what data is available and Twitter API to gather the needed data to feed the credibility model. how frequently it can be accessed [9]. For web scraping, the speed The credibility model, previously proposed in [9], computes post’s can affect the data extraction when the page is not displayed, but credibility based on Text Credibility, User Credibility, and Social its use is free for some cases where policies allow. Moreover, the Credibility. Results show that a robust normalization process on constant change of the web pages affects this method. With APIs, the text obtained by the extraction methods, produces identical the access to extract data is limited and it is not free; however, its use credibility results. Moreover, the number of followers obtained by is independent of the information displayed on the website. Then, the extraction methods have a minor different for famous accounts we can observe clearly two main differences between web scraping since the number of followers is constantly growing. Additionally, and APIs, the speed of getting the data (network connection) and web scraping is faster than Twitter API since for this latter, the use the access (number of requests). The variety of the products and of web scraping is required to obtain the user_id and tweet_id the information that can be extracted from the API enables to have before to perform the API request. We are currently working on a question whether APIs can be a feasible method of extracting improvement the credibility model, by considering Topic Credibil- data posted on the Web. ity and extending Text, User, and Social Credibility with other data, Regardless of the method used, we assume that the data obtained such as semantic analysis, retweets, and bot detection. should be similar since there must be a consistency between the displayed information (web scraping) for users and the one obtained by the API, for developers. To affirm or deny our assumption, we ACKNOWLEDGEMENTS evaluated, using our proposed model, the tweet credibility in real- We thank Sergio Medina, David Cabeza, Germán Robayo, José time for both methods, web scraping and Twitter API. Acevedo, and Nairelys Hernández who have contributed to the The scraper developed for our previous work [9], did not work implementation and the experiments. This research was supported when we started this study because of changes in the Twitter HTML. by the FONDO NACIONAL DE DESARROLLO CIENTÍFICO, TEC- Then, we developed another one following the new HTML tags, to NOLÓGICO Y DE INNOVACIÓN TECNOLÓGICA - FONDECYT Web Scraping versus Twitter API: A Comparison for a Credibility Analysis iiWAS ’20, November 30-December 2, 2020, Chiang Mai,Thailand as executing entity of CONCYTEC under grant agreement no. 01- Somayajulu (Eds.). Springer Intl. Publishing, Cham, 170–187. 2019-FONDECYT-BM-INC.INV in the project RUTAS: Robots para [16] A. Hernandez-Suarez, G. Sanchez-Perez, K. Toscano-Medina, V. Martinez- Hernandez, V. Sanchez, and H. Perez-Meana. 2018. A Web Scraping Methodology centros Urbanos Turísticos Autónomos y basados en Semántica. for Bypassing Twitter API Restrictions. arXiv:1803.09875 [cs.IR] [17] A. Iftene, D. Gîfu, A. Miron, and M. Dudu. 2020. A Real-Time System for Credibil- REFERENCES ity on Twitter. In 12th Language Resources and Evaluation Conference. 6166–6173. [18] X. Liu, A. Nourbakhsh, Q. Li, R. Fang, and S. Shah. 2015. Real-time rumor [1] M. Alrubaian, M. Al-Qurishi, M. Al-Rakhami, M. Hassan, and A. Alamri. 2016. debunking on twitter. In Intl. Conf. on Information and Knowledge Management. CredFinder: A real-time tweets credibility assessing system. In Intl. Conf. on 1867–1870. Advances in Social Networks Analysis and Mining. 1406–1409. [19] K. Lorek, J. Suehiro-Wiciński, M. Jankowski-Lorek, and A. Gupta. 2015. Auto- [2] M. Alrubaian, M. Al-Qurishi, A. Alamri, M. Al-Rakhami, M. Hassan, and G. mated Credibility Assessment on Twitter. Science 16, 2 (2015). Fortino. 2019. Credibility in Online Social Networks: A Survey. IEEE Access 7 [20] R. Mitchell. 2015. Web Scraping with Python: Collecting Data from the Modern (2019), 2828–2855. Web (1st ed.). O’Reilly Media, Inc. [3] M. Alrubaian, M. Al-Qurishi, M. Hassan, and A. Alamri. 2016. A credibility analy- [21] Yoshimi Namihira, Naomichi Segawa, Yukino Ikegami, Kenta Kawai, Takashi sis system for assessing information on twitter. IEEE Transactions on Dependable Kawabe, and Setsuo Tsuruta. 2013. High precision credibility analysis of infor- and Secure Computing 15, 4 (2016), 661–674. mation on Twitter. In Intl. Conf. on Signal-Image Technology & Internet-Based [4] H. Al-Khalifa and R. Al-Eidan. 2011. An experimental system for measuring the Systems. 909–915. credibility of news content in Twitter. Intl. J. of Web Information Systems 7, 2 [22] D. Salt and A. Sellhorn. 2014. Method, system and product for (2011), 130–151. a client application programming interface (API) in a service oriented architecture. [5] M. Boillot. 2012. Application programming interface (API) for sensory events. US Patent 8,701,128. US Patent 8,312,479. [23] C. Shao, G. Ciampaglia, A. Flammini, and F. Menczer. 2016. Hoaxy: A Platform [6] A. Bovet and H. Makse. 2019. Influence of fake news in Twitter during the 2016 for Tracking Online Misinformation. In 25th Intl. Conf. Companion on World Wide US presidential election. Nature Communications 10 (Jan 2019). Web. 745–750. [7] K. Canini, B. Suh, and P. Pirolli. 2010. Finding Relevant Sources in Twitter Based [24] C. Slamet, R. Andrian, D. Maylawati, Suhendar, W. Darmalaksana, and MA. on Content and Social Structure. In NIPS Workshop. Ramdhani. 2018. Web Scraping and Naïve Bayes Classification for Job Search [8] C. Castillo, M. Mendoza, and B. Poblete. 2011. Information credibility on twitter. Engine. IOP Conf. Series: Materials Science and Engineering 288 (jan 2018), 012038. In Intl. Conf. on WWW. 675–684. [25] JF. Sánchez-Rada and C. Iglesias. 2019. Social context in sentiment analysis: [9] I. Dongo, Y. Cardinale, and A. Aguilera. 2019. Credibility Analysis for Available Formal definition, overview of current trends and framework for comparison. Information Sources on the Web: A Review and a Contribution. In 4th Intl. Conf. Information Fusion 52 (2019), 344 – 356. on System Reliability and Safety (ICSRS). 116–125. [26] S. Tan. 2017. Spot the lie: Detecting untruthful online opinion on Twitter. De- [10] S. Edgerly and E. Vraga. 2019. The Blue Check of Credibility: Does Account Veri- partment of Computing, Imperial College London, Master Thesis. fication Matter When Evaluating News on Twitter? Cyberpsychology, Behavior, [27] T. Vaidya, D. Votipka, M. Mazurek, and M. Sherr. 2019. Does Being Verified Make and Social Networking 22, 4 (2019), 283–287. You More Credible?: Account Verification’s Effect on Tweet Credibility. In Conf. [11] D. Freelon. 2018. Computational Research in the Post-API Age. Political Commu- on Human Factors in Computing Systems. Article 525, 13 pages. nication 35, 4 (Oct 2018), 665–668. [28] J. Yang, M. Yu, H. Qin, M. Lu, and C. Yang. 2019. A twitter data credibility [12] D. Glez-Peña, A. Lourenço, H. López-Fernández, M. Reboiro-Jato, and F. Fdez- framework—Hurricane harvey as a use case. ISPRS Intl. J. of Geo-Information 8, 3 Riverola. 2013. Web scraping technologies in an API world. Briefings in Bioinfor- (2019), 111. matics 15, 5 (04 2013), 788–797. [29] K. Yang, C. Torres-Lugo, and F. Menczer. 2020. Prevalence of low-credibility infor- [13] A. Gupta, P. Kumaraguru, C. Castillo, and P. Meier. 2014. TweetCred: Real-Time mation on twitter during the covid-19 outbreak. arXiv preprint arXiv:2004.14484 Credibility Assessment of Content on Twitter. Springer Intl. Publishing, Cham, (2020). 228–243. [30] K. Yang, O. Varol, C. Davis, E. Ferrara, A. Flammini, and F. Menczer. 2019. Arm- [14] A. Gupta, H. Lamba, P. Kumaraguru, and A. Joshi. 2014. Analyzing and Measuring ing the public with AI to counter social bots. CoRR abs/1901.00912 (2019). the Spread of Fake Content on Twitter during High Impact Events. In Security arXiv:1901.00912 and Privacy Symposium 2014, CSE-IIT-Kanpur. [31] S. Zannettou, M. Sirivianos, J. Blackburn, and N. Kourtellis. 2019. The Web of [15] S. Gupta, S. Sachdeva, P. Dewan, and P. Kumaraguru. 2018. CbI: Improving Cred- False Information: Rumors, Fake News, Hoaxes, Clickbait, and Various Other ibility of User-Generated Content on Facebook. In Big Data Analytics, Anirban Shenanigans. J. Data and Information Quality 11, 3, Article 10 (May 2019), Mondal, Himanshu Gupta, Jaideep Srivastava, P. Krishna Reddy, and D.V.L.N. 37 pages.