Design of Imacros-Based Data Crawler and the Behavioral Analysis of Facebook Users

Design of iMacros-based Data Crawler and the Behavioral Analysis of Facebook Users Mudasir Ahmad Wani Nancy Agarwal Suraiya Jabin Syed Zeesahn Hussain Research laboratory Research laboratory Department of Computer Department of Computer Department Computer Department Computer Science Science Science Science Faculty of Natural Science Faculty of Natural Science Faculty of Natural Science Faculty of Natural Science Jamia Millia Islamia (A Jamia Millia Islamia (A Jamia Millia Islamia (A Jamia Millia Islamia (A Central University) Central University) Central University) Central University) New Delhi, India New Delhi, India New Delhi, India New Delhi, India [email protected] [email protected] [email protected] [email protected] Abstract Obtaining the desired dataset is still a prime challenge faced by researchers while analyzing Online Social Network (OSN) sites. Application Programming Interfaces (APIs) provided by OSN service providers for retrieving data impose several unavoidable restrictions which make it difficult to get a desirable dataset. In this paper, we present an iMacros technology-based data crawler called IMcrawler, capable of collecting every piece of information which is accessible through a browser from the Facebook website within the legal framework which permits access to publicly shared user content on OSNs. The proposed crawler addresses most of the challenges allied with web data extraction approaches and most of the APIs provided by OSN service providers. Two broad sections have been extracted from Facebook user profiles, namely, Personal Information and Wall Activities. The present work is the first attempt towards providing the detailed description of crawler design for the Facebook website. Keywords: Online Social Networks, Information Retrieval, Data Extraction, Behavioral Analysis, Privacy and Security. Introduction Online Social Network (OSN) is a web application which focuses on the social life of netizens and provides them an excellent platform to build a network of relationships and share their social life. OSNs such as Facebook, LinkedIn and Twitter are increasingly becoming popular among internet users and turning to be an essential medium for people’ s daily activities. Since their structure significantly reflects the real-life communities and they contain a huge amount of user’ s personal and social information, they are of scientific importance to the researchers and disciplines in different domains including marketing, sociology, politics etc. Marketers study OSNs to design a viral marketing strategy (Staab et al., 2005), sociologists use them to study the human behavior (Christakis et al., 2013) and politicians use them to empower their political campaigns (Vergeer et al., 2013). For conducting any kind of analysis, sufficient amount of data is the preliminary requirement (Wani et al., 2018). OSNs like Facebook contain billions of user profiles and the service providers ensure that their data is protected which makes the process of data collection very challenging for researchers. OSN datasets are not publicly available because of privacy reasons and since the data is enormous, collecting it manually makes the task complex and time-consuming. However, most popular social networking sites like Facebook, Twitter, Flickr, etc. provide methods for retrieving information from the network through their own well defined Application Programming Interfaces (APIs) like Graph-API1, REST-API2 etc., but these APIs are associated with several unavoidable 1 https://developers.facebook.com/docs/graph-api 2 https://dev.twitter.com/rest/public constraints such as data request rate restrictions, selective data access, etc. Web scrapping provides an alternative solution by automatically extracting the information from the web pages in a systematic manner. Although it can solve the problem of data collection to a great extent but writing a scrapper is itself a challenging task. Generally, social media platforms including Facebook have inbuilt bot detection mechanism (Stein et al., 2011) which can recognize an automated activity on their platforms and, therefore data collection by means of software programs may lead to the suspension of the user account which is being used for data extraction. The feature of dynamic loading of content via various web technologies (e.g. JavaScript and Ajax) further complicates the task since this information is not available in the source code of a web page. Moreover, the call to dynamic content on the web page is mostly triggered by user interactions with the page which implies that there should be a mechanism to automate these interactions in order to load this dynamic content into the parent HTML document. Therefore, there is a need for a tool which is capable of circumventing the API constraints and can overcome the data scrapping barriers. The paper focuses on the designing of data crawler, IMcrawler for a Facebook network that addresses the above-discussed challenges and assists the end user in extracting the data in an efficient and convenient manner. Facebook is one of the topmost networking sites and has the most complex privacy policy structure. It’s API can be used to extract data from only those users who are already registered with the application. Unlike Twitter API, the Facebook API has to explicitly ask for permissions from its members to access their data. Users’ privacy settings and privileges granted to the application will decide the data that can be scrapped from their profiles. Facebook has also imposed constraints on the maximum request rate to limit the amount of data being scrapped. The crawler proposed in this paper does not face any of these hurdles and is able to extract every information of any amount which appears on the user profile. It is also capable of interacting with Timeline to load the dynamic content. The complete framework of the data collection is also described with step wise processes followed from crawling the network to obtain the featured dataset in a useful format. The crawler has been designed to extract two broad sections viz profile information (profile features) and wall activities (post features), particularly from the friends of a profile. Profile information consists of details provided by the users about themselves and wall activities consist of actions performed by the users on the Timeline. The collected information is divided into two datasets. The first dataset represents the attributes of a profile and the second dataset holds the post information. The remaining article is structured as follows: Initially, in section 2, the work relevant to data extraction from social networks and analysis on the information revealed by users has been discussed. Section 3 describes the novel contribution of the proposed work. The design of the IMcrawler and it's working have been explained in section 4. Finally, the section 5 concludes the overall work and provides future directions. 2. Related Work Given the massive amount of online user data, its extraction is the key challenge for the researchers. One of the most recent studies (Zhou, D. et al., 2018) presented a detailed survey on the existing data collection methods, mechanisms and architectures along with some open challenges associated with the process of data collection. The authors (Ferrara et al., 2014) in one more study has presented a detailed discussion on web data extraction techniques along with the application domains in which they are applied. In general, APIs and HTML scrapping are the two popular ways to retrieve data from OSNs. Although APIs provide well-organized data but with several restrictions. The HTML scrapping techniques offer an alternative solution that can circumvent the limitations imposed by APIs but at the price of technical complexities. The paper (Maynard et al., 2017) presented a semantic- based framework for collecting the social media data using APIs and analyzing it by the open source tools provided by GATE family3. The authors in (Mislove et al. 2007) have analyzed the structure of several popular OSNs such as YouTube, Flicker, LiveJournal, etc. The study has used the APIs provided by these OSNs and HTML scrapping technique in order to extract the required data. Data extraction schemes significantly depend on the policy of online 3 https://gate.ac.uk/family social networks as in (Alim et al., 2009), the authors have presented the process of extracting personal attributes and the list of top friends from MySpace social network without being its member. A study (Rungsawang et al., 2005) highlights the pros and cons of the existing data crawlers and proposed an algorithm to resolve the issues associated with previous data crawlers. MySpace provides a rich source of data to the nonmembers as well, whereas networks like Facebook, Friendster etc. expose either no content or minimal content to the external users. Several studies have been carried out on the Facebook network exclusively since it is one of the most popular online social networking websites and has the most complex privacy mechanism (Dwyer et al., 2007, Acquisti et al., 2006). Netviz (Rieder, 2013) is a Facebook application designed to help researchers in collecting features of a profile including personal networks, groups, and pages. However, just like any other API, the working of Netvizz application is also limited by the permission and privacy model of the Facebook service. First, it requires logged in Facebook account. Second, it explicitly asks the user for the permission to access their different data and third, the user can further restrict the data availability to the application by their privacy settings. In (Wong et al., 2014), the authors have implemented an HTML-based crawler by using PhantomJS, a headless browser4, to extract the friend's network of Facebook users of a specific region in Macau. In addition, authors have also discussed the technical challenges and their viable solutions while designing the data extractor for OSNs. OSN sites offer a great medium to its users to share varied amount of social as well as personal details with either all the members of the network or a specific community of users (Javed, M.

Design of Imacros-Based Data Crawler and the Behavioral Analysis of Facebook Users

An Overview of the 50 Most Common Web Scraping Tools

Web Data Extraction

Leukemia Medical Application with Security Features

Computer Science) Thesis Title

Download Free Internet Explorer Latest Version 9

Ringer: Web Automation by Demonstration

Acamia: Automating Configuration Across Multiple Interactive

Syndicate Framework 55 Table of Contents Ix

Test Tools and Automation

Optimizing Software Quality Through Automation Testing Ankit Sharma

Internet Explorer

Web Automation by Demonstration