Web Scraping versus Twitter API: A Comparison for a Credibility Analysis

Irvin Dongo
Universidad Católica San Pablo, Arequipa, Perú
Univ. Bordeaux, ESTIA Institute of Technology, Bidart, France
[email protected]

Yudith Cardinale
Universidad Católica San Pablo, Arequipa, Perú
Universidad Simón Bolívar, Caracas, Venezuela
[email protected]

Ana Aguilera
Facultad de Ingeniería, Escuela de Ingeniería Informática, Universidad de Valparaíso, Chile
[email protected]

Fabiola Martínez
Universidad Simón Bolívar, Caracas, Venezuela
[email protected]

Yuni Quintero
Universidad Simón Bolívar, Caracas, Venezuela
[email protected]

Sergio Barrios
Universidad Simón Bolívar, Caracas, Venezuela
[email protected]

ABSTRACT

Twitter is one of the most popular information sources available on the Web. Thus, there exist many studies focused on analyzing the credibility of the shared information. Most proposals use either the Twitter API or web scraping to extract the data needed for such analysis. Both extraction techniques have advantages and disadvantages. In this work, we present a study to evaluate their performance and behavior. The motivation for this research comes from the need to know how to extract online information in order to analyze in real-time the credibility of the content posted on the Web. To do so, we develop a framework which offers both alternatives of data extraction and implements a previously proposed credibility model. Our framework is implemented as a Google Chrome extension able to analyze tweets in real-time. Results show that both methods produce identical credibility values when a robust normalization process is applied to the text (i.e., the tweet). Moreover, concerning time performance, web scraping is faster than the Twitter API, and it is more flexible in terms of obtaining data; however, web scraping is very sensitive to website changes.

CCS CONCEPTS

• General and reference → Reliability; • Information systems → Social networks; Data extraction and integration.

KEYWORDS

API, Web Scraping, Twitter, Credibility

ACM Reference Format:
Irvin Dongo, Yudith Cardinale, Ana Aguilera, Fabiola Martínez, Yuni Quintero, and Sergio Barrios. 2020. Web Scraping versus Twitter API: A Comparison for a Credibility Analysis. In The 22nd International Conference on Information Integration and Web-based Applications & Services (iiWAS ’20), November 30-December 2, 2020, Chiang Mai, Thailand. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3428757.3429104

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
iiWAS ’20, November 30-December 2, 2020, Chiang Mai, Thailand
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8922-8/20/11...$15.00
https://doi.org/10.1145/3428757.3429104

1 INTRODUCTION

Social network platforms, such as Twitter, Facebook, and Instagram, have considerably increased their number of users in recent years. These platforms share contents, opinions, news, and sometimes fake content. In particular, Twitter is a worldwide social network which has more than 600 million users, and it is one of the most widely used platforms during relevant events [13], such as natural disasters [14], brand and product advertising [25], and presidential elections [6]. However, the information shared on Twitter, and in other social networks, is not completely reliable [31]. The information posted on social networks must be reliable, since its content can help people during crisis situations, influence crowds, and even serve as a means to help companies in decision making. Thus, there exist many studies focused on analyzing the credibility of the shared information [1, 3, 9, 13].

In social networks, the credibility study is affected by different factors [9], such as: (i) the veracity of the text, content with misspellings and bad words; (ii) users on the network who generated content; (iii) the quantity of data related to the information to be validated. As more data is available, more features can be extracted and thus a better analysis can be performed. In the state-of-the-art, two main extraction methods have been used: (i) web scraping, which consists of parsing the website (HTML) to obtain data by using the tags; and (ii) API, which is an interface provided by social media platforms to retrieve specific information.

This work presents a comparison between the two data extraction techniques. To do so, we extend an application that instantiates our credibility analysis model, both presented in [9]. The credibility model is adaptable to various social networks and it is based on several features, such as verified account, number of following and followers, to compute three credibility measures: Text, User, and Social credibilities. The application is extended by updating the web scraping method and adding the Twitter API option.

In order to compare both data extraction techniques, an experimental evaluation is presented. Three different languages (Spanish, English, and French) are used to evaluate their impact on the Text credibility. Moreover, different types of text, such as short, long, use of bad words, misspelling, and emoticons, are analyzed. Social credibility is also evaluated by using two types of accounts: common accounts (fewer than one thousand followers) and famous accounts (more than nine hundred thousand followers). Additionally, the execution time to extract the features is also reported.

Experiments show that a robust normalization process on the text obtained by the extraction methods produces identical credibility results. Also, the language has no impact on the credibility, nor does the type of text. Moreover, the number of followers obtained by the two extraction methods differs slightly for famous accounts, since the number of followers is constantly growing in real-time. Additionally, data extraction with web scraping is faster than with the Twitter API, since for the former only local extraction (in the Twitter website) is needed, while for the API a local extraction to obtain the user_id and tweet_id is required in order to later perform a remote API request.

This paper is organized as follows. First, in Section 2, the topics related to this work, such as data extraction methods and the social network Twitter, are described. Data extraction techniques are explained in Section 3. The credibility model used in this work is explained in Section 4. In Section 6, we present an experimental evaluation and a discussion about both extraction methods. Finally, we conclude in Section 7.

2 SOCIAL NETWORKS AS INFORMATION SOURCES: PRELIMINARY

Since the creation of social networks, they have become a new channel for communication and socialization [2]. The types of information that are exchanged on social media platforms range from personal contexts, with the exchange of personal conversations through messaging, expressions of feelings, photographs, and videos, to official and institutional contexts. Hence, social networks have become popular sources of real-time information. In 2020, among the most popular social networks in terms of the number of users are Face- [...]

[...] extraction include different methods or a combination of them. The next section describes these data extraction methods.

2.1 Data Extraction Methods

Nowadays, data extraction from web sources is a vital task for most business processes, research, studies, and others. The process of data extraction consists of obtaining relevant data or metadata useful for diverse purposes. Three well-known methods have been applied for this: web scraping, APIs, and manual extraction. Web scraping and APIs are automated techniques and the most practical ways of data harvesting [24]. They allow data to be collected from various website pages and repositories, at high speed and accurately. The data is then saved and stored for further use and analysis. Manual extraction is more time consuming and more susceptible to human errors.

2.1.1 Web Scraping, use and limitations. Web data scraping can be defined as the process of extracting and combining contents of interest from the Web in a systematic way. In such a process, a software agent, a web robot or a script, mimics the browsing interaction between the web servers and the human in a conventional web traversal. Step by step, the robot/script accesses as many websites as needed, parses their contents to find and extract data of interest, and structures those contents as desired [12]. Web scrapers are useful when retrieving and processing large amounts of data quickly from a specific website. Thus, if the information is displayed on a browser, it can be accessed via a robot/script that extracts the data and stores them in a database for future analysis and use [20].

Web scraping is commonly used on website pages that use markup languages such as HTML or eXtensible HyperText Markup Language (XHTML). In this case, scraping consists of parsing hypertext tags and retrieving the plain text information embedded in them. The web data scraper establishes communication with the target website through the HTTP protocol and extracts the contents of interest. Some regular expression matching could be necessary along with additional logic [12]. The network speed may be a limitation or disadvantage for web scraping, since it affects when and how the data is displayed. Another problem regarding this method is the frequent change of web page formats. Web scraping involves site-specific programming and does not adapt to foreseeable changes in the HTML source [12].
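The tag parsing and regular expression matching described above can be illustrated with a minimal sketch. It is not the extraction code of the framework: the HTML fragment, the class names (tweet-text, followers), and the helper names (ClassTextExtractor, scrape) are hypothetical and chosen only for illustration, and the page is a local string rather than a response obtained over HTTP. Only the Python standard library is assumed.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: these class names are invented for illustration
# and do not reflect the real structure of the Twitter website.
SAMPLE_HTML = """
<div class="tweet">
  <span class="tweet-text">Flood warning issued for the river area</span>
  <span class="followers">Followers: 1,234</span>
</div>
"""


class ClassTextExtractor(HTMLParser):
    """Collect the text inside elements whose class attribute is wanted."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.current = None  # class currently being captured, if any
        self.found = {}      # class name -> extracted plain text

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self.current = cls
            self.found[cls] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.found[self.current] += data

    def handle_endtag(self, tag):
        # The sample elements are not nested, so any end tag stops the capture.
        self.current = None


def scrape(html):
    """Parse the tags, then apply a regular expression to the follower text."""
    parser = ClassTextExtractor(["tweet-text", "followers"])
    parser.feed(html)
    parser.close()
    text = parser.found.get("tweet-text", "").strip()
    match = re.search(r"[\d,]+", parser.found.get("followers", ""))
    followers = int(match.group(0).replace(",", "")) if match else None
    return {"text": text, "followers": followers}


result = scrape(SAMPLE_HTML)
print(result)
```

The sketch also shows the sensitivity to format changes noted above: if the hypothetical class tweet-text were renamed in the page, scrape would return an empty text without raising any error, so the scraper would need site-specific maintenance.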
