AI & Soc (2015) 30:89–116 DOI 10.1007/s00146-014-0549-4

OPEN FORUM

Social media analytics: a survey of techniques, tools and platforms

Bogdan Batrinca • Philip C. Treleaven

Received: 25 February 2014 / Accepted: 4 July 2014 / Published online: 26 July 2014
© The Author(s) 2014. This article is published with open access at Springerlink.com

Abstract This paper is written for (social science) researchers seeking to analyze the wealth of social media now available. It presents a comprehensive review of software tools for scraping and analyzing social networking media, wikis, really simple syndication feeds, blogs, newsgroups, chat and news feeds. For completeness, it also includes introductions to social media scraping, storage, data cleaning and sentiment analysis. Although principally a review, the paper also provides a methodology and a critique of social media tools. Analyzing social media, in particular Twitter feeds for sentiment analysis, has become a major research and business activity due to the availability of web-based application programming interfaces (APIs) provided by Twitter, Facebook and News services. This has led to an 'explosion' of data services, software tools for scraping and analysis and social media analytics platforms. It is also a research area undergoing rapid change and evolution due to commercial pressures and the potential for using social media data for computational (social science) research. Using a simple taxonomy, this paper provides a review of leading software tools and how to use them to scrape, cleanse and analyze the spectrum of social media. In addition, it discusses the requirement of an experimental computational environment for social media research and presents as an illustration the system architecture of a social media (analytics) platform built by University College London. The principal contribution of this paper is to provide an overview (including code fragments) for scientists seeking to utilize social media scraping and analytics either in their research or business. The data retrieval techniques that are presented in this paper are valid at the time of writing (June 2014), but they are subject to change since social media data scraping APIs are rapidly changing.

Keywords Social media · Scraping · Behavior economics · Sentiment analysis · Opinion mining · NLP · Toolkits · Software platforms

B. Batrinca · P. C. Treleaven (✉)
Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK
e-mail: [email protected]

B. Batrinca
e-mail: [email protected]

1 Introduction

Social media is defined as web-based and mobile-based Internet applications that allow the creation, access and exchange of user-generated content that is ubiquitously accessible (Kaplan and Haenlein 2010). Besides social networking media (e.g., Twitter and Facebook), for convenience, we will also use the term 'social media' to encompass really simple syndication (RSS) feeds, blogs, wikis and news, all typically yielding unstructured text and accessible through the web. Social media is especially important for research into computational social science that investigates questions (Lazer et al. 2009) using quantitative techniques (e.g., computational statistics, machine learning and complexity) and so-called big data for data mining and simulation modeling (Cioffi-Revilla 2010).

This has led to numerous data services, tools and analytics platforms. However, this easy availability of social media data for academic research may change significantly due to commercial pressures. In addition, as discussed in Sect. 2, the tools available to researchers are far from ideal. They either give superficial access to the raw data or (for non-superficial access) require researchers to program analytics in a language such as Java.
1.1 Terminology

We start with definitions of some of the key techniques related to analyzing unstructured textual data:

• Natural language processing (NLP)—a field of computer science, artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages. Specifically, it is the process of a computer extracting meaningful information from natural language input and/or producing natural language output.
• News analytics—the measurement of the various qualitative and quantitative attributes of textual (unstructured) news stories. Some of these attributes are: sentiment, relevance and novelty.
• Opinion mining—opinion mining (sentiment mining, opinion/sentiment extraction) is the area of research that attempts to build automatic systems that determine human opinion from text written in natural language.
• Scraping—collecting online data from social media and other Web sites in the form of unstructured text; also known as site scraping, web harvesting and web data extraction.
• Sentiment analysis—the application of natural language processing, computational linguistics and text analytics to identify and extract subjective information in source materials.
• Text analytics—involves information retrieval (IR), lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics.

1.2 Research challenges

Social media scraping and analytics provides a rich source of academic research challenges for social scientists, computer scientists and funding bodies. Challenges include:

• Scraping—although social media data is accessible through APIs, due to the commercial value of the data, most of the major sources such as Facebook and Google are making it increasingly difficult for academics to obtain comprehensive access to their 'raw' data; very few social data sources provide affordable data offerings to academia and researchers. News services such as Thomson Reuters and Bloomberg typically charge a premium for access to their data. In contrast, Twitter has recently announced the Twitter Data Grants program, where researchers can apply for access to Twitter's public tweets and historical data in order to gain insights from its massive data set (Twitter has more than 500 million tweets a day).
• Data cleansing—cleaning unstructured textual data (e.g., normalizing text), especially high-frequency streamed real-time data, still presents numerous problems and research challenges.
• Holistic data sources—researchers are increasingly bringing together and combining novel data sources: social media data, real-time market & customer data and geospatial data for analysis.
• Data protection—once you have created a 'big data' resource, the data needs to be secured, ownership and IP issues resolved (i.e., storing scraped data is against most publishers' terms of service), and users provided with different levels of access; otherwise, users may attempt to 'suck' all the valuable data from the database.
• Data analytics—sophisticated analysis of social media data for opinion mining (e.g., sentiment analysis) still raises a myriad of challenges due to foreign languages, foreign words, slang, spelling errors and the natural evolution of language.
• Analytics dashboards—many social media platforms require users to write APIs to access feeds or to program analytics models in a programming language, such as Java. While reasonable for computer scientists, these skills are typically beyond most (social science) researchers. Non-programming interfaces are required for giving what might be referred to as 'deep' access to 'raw' data, for example, configuring APIs, merging social media feeds, combining holistic sources and developing analytical models.
• Data visualization—the visual representation of data, whereby abstracted information is presented in some schematic form with the goal of communicating it clearly and effectively through graphical means. Given the magnitude of the data involved, visualization is becoming increasingly important.

1.3 Social media research and applications

Social media data is clearly the largest, richest and most dynamic evidence base of human behavior, bringing new opportunities to understand individuals, groups and society. Innovative scientists and industry professionals are increasingly finding novel ways of automatically collecting, combining and analyzing this wealth of data. Naturally, doing justice to these pioneering social media applications in a few paragraphs is challenging. Three illustrative areas are: business, bioscience and social science.

The early business adopters of social media analysis were typically companies in retail and finance. Retail companies use social media to harness their brand awareness, product/customer service improvement, advertising/marketing strategies, network structure analysis, news propagation and even fraud detection. In finance, social media is used for measuring market sentiment and news data is used for trading. As an illustration, Bollen et al. (2011) measured the sentiment of a random sample of Twitter data, finding that Dow Jones Industrial Average (DJIA) prices are correlated with the Twitter sentiment 2–3 days earlier with 87.6 percent accuracy. Wolfram (2010) used Twitter data to train a Support Vector Regression (SVR) model to predict prices of individual NASDAQ stocks, finding a 'significant advantage' for forecasting prices 15 min in the future.

In the biosciences, social media is being used to collect data on large cohorts for behavioral change initiatives and impact monitoring, such as tackling smoking and obesity or monitoring diseases. An example is Penn State University biologists (Salathé et al. 2012) who have developed innovative systems and techniques to track the spread of infectious diseases, with the help of news Web sites, blogs and social media.

Computational social science applications include: monitoring public responses to announcements, speeches and events, especially political comments and initiatives; insights into community behavior; social media polling of (hard to contact) groups; and early detection of emerging events, as with Twitter. For example, Lerman et al. (2008) use computational linguistics to automatically predict the impact of news on the public perception of political candidates. Yessenov and Misailovic (2009) use movie review comments to study the effect of various approaches to extracting text features on the accuracy of four machine learning methods—Naive Bayes, Decision Trees, Maximum Entropy and K-Means clustering. Lastly, Karabulut (2013) found that Facebook's Gross National Happiness (GNH) exhibits peaks and troughs in line with major public events in the USA.

1.4 Social media overview

For this paper, we group social media tools into:

• Social media data—social media data types (e.g., social network media, wikis, blogs, RSS feeds and news, etc.) and formats (e.g., XML and JSON). This includes data sets and increasingly important real-time data feeds, such as financial data, customer transaction data, telecoms and spatial data.
• Social media programmatic access—data services and tools for sourcing and scraping (textual) data from social networking media, wikis, RSS feeds, news, etc. These can be usefully subdivided into:
  • Data sources, services and tools—where data is accessed by tools which protect the raw data or provide simple analytics. Examples include: Google Trends, SocialMention, SocialPointer and SocialSeek, which provide a stream of information that aggregates various social media feeds.
  • Data feeds via APIs—where data sets and feeds are accessible via programmable HTTP-based APIs and return tagged data using XML or JSON, etc. Examples include Wikipedia, Twitter and Facebook.
• Text cleaning and storage tools—tools for cleaning and storing textual data. Google Refine and DataWrangler are examples for data cleaning.
• Text analysis tools—individual tools or libraries of tools for analyzing social media data once it has been scraped and cleaned. These are mainly natural language processing, analysis and classification tools, which are explained below.
• Transformation tools—simple tools that can transform textual input data into tables, maps, charts (line, pie, scatter, bar, etc.), timelines or even motion (animation over a timeline), such as Google Fusion Tables, Zoho Reports, Tableau Public or IBM's Many Eyes.
• Analysis tools—more advanced analytics tools for analyzing social data, identifying connections and building networks, such as Gephi (open source) or the Excel plug-in NodeXL.
• Social media platforms—environments that provide comprehensive social media data and libraries of tools for analytics. Examples include: Thomson Reuters Machine Readable News, Radian 6 and Lexalytics.
  • Social network media platforms—platforms that provide data mining and analytics on Twitter, Facebook and a wide range of other social network media sources.
  • News platforms—platforms such as Thomson Reuters providing commercial news archives/feeds and associated analytics.

2 Social media methodology and critique

The two major impediments to using social media for academic research are firstly access to comprehensive data sets and secondly tools that allow 'deep' data analytics without the need to be able to program in a language such as Java. The majority of social media resources are commercial, and companies are naturally trying to monetize their data. As discussed, it is important that researchers have access to open-source 'big' (social media) data sets and facilities for experimentation. Otherwise, social media research could become the exclusive domain of major companies, government agencies and a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Recently, there has been a modest response, as Twitter and Gnip are piloting a new program for data access, starting with 5 all-access data grants to select applicants.

2.1 Methodology

Research requirements can be grouped into: data, analytics and facilities.

2.1.1 Data

Researchers need online access to historic and real-time social media data, especially the principal sources, to conduct world-leading research:

• Social network media—access to comprehensive historic data sets and also real-time access to sources, possibly with a (15 min) time delay, as with Thomson Reuters and Bloomberg financial data.
• News data—access to historic data and real-time news data sets, possibly through the concept of 'educational data licenses' (cf. software licenses).
• Public data—access to scraped and archived important public data, available through RSS feeds, blogs or open government data.
• Programmable interfaces—researchers also need access to simple application programming interfaces (APIs) to scrape and store other available data sources that may not be automatically collected.

2.1.2 Analytics

Currently, social media data is typically either available via simple general routines or requires the researcher to program their analytics in a language such as MATLAB, Java or Python. As discussed above, researchers require:

• Analytics dashboards—non-programming interfaces are required for giving what might be termed 'deep' access to 'raw' data.
• Holistic data analysis—tools are required for combining (and conducting analytics across) multiple social media and other data sets.
• Data visualization—researchers also require visualization tools whereby information that has been abstracted can be visualized in some schematic form with the goal of communicating information clearly and effectively through graphical means.

2.1.3 Facilities

Lastly, the sheer volume of social media data being generated argues for national and international facilities to be established to support social media research (cf. Wharton Research Data Services https://wrds-web.wharton.upenn.edu):

• Data storage—the volume of social media data, current and projected, is beyond most individual universities and hence needs to be addressed at a national science foundation level. Storage is required both for principal data sources (e.g., Twitter) and for sources collected by individual projects and archived for future use by other researchers.
• Computational facility—remotely accessible computational facilities are also required for: a) protecting access to the stored data; b) hosting the analytics and visualization tools; and c) providing computational resources such as grids and GPUs required for processing the data at the facility rather than transmitting it across a network.

2.2 Critique

Much needs to be done to support social media research. As discussed, the majority of current social media resources are commercial, expensive and difficult for academics to obtain full access to.

2.2.1 Data

In general, access to important sources of social media data is frequently restricted, and full commercial access is expensive.

• Siloed data—most data sources (e.g., Twitter) have inherently isolated information, making it difficult to combine with other data sources.
• Holistic data—in contrast, researchers are increasingly interested in accessing, storing and combining novel data sources: social media data, real-time financial market & customer data and geospatial data for analysis. This is currently extremely difficult to do even for Computer Science departments.


2.2.2 Analytics

Analytical tools provided by vendors are often tied to a single data set, may be limited in analytical capability, and data charges make them expensive to use.

2.2.3 Facilities

There are an increasing number of powerful commercial platforms, such as those supplied by SAS and Thomson Reuters, but the charges are largely prohibitive for academic research. Either comparable facilities need to be provided by national science foundations or vendors need to be persuaded to introduce the concept of an 'educational license.'

3 Social media data

Clearly, there is a large and increasing number of (commercial) services providing access to social networking media (e.g., Twitter, Facebook and Wikipedia) and news services (e.g., Thomson Reuters Machine Readable News). Equivalent major academic services are scarce. We start by discussing the types of data and formats produced by these services.

3.1 Types of data

Although we focus on social media, as discussed, researchers are continually finding new and innovative sources of data to bring together and analyze. So when considering textual data analysis, we should consider multiple sources (e.g., social networking media, RSS feeds, blogs and news) supplemented by numeric (financial) data, telecoms data, geospatial data and potentially speech and video data. Using multiple data sources is certainly the future of analytics.

Broadly, data subdivides into:

• Historic data sets—previously accumulated and stored social/news, financial and economic data.
• Real-time feeds—live data feeds from streamed social media, news services, financial exchanges, telecoms services, GPS devices and speech.

And into:

• Raw data—unprocessed computer data straight from source that may contain errors or may be unanalyzed.
• Cleaned data—correction or removal of erroneous (dirty) data caused by disparities, keying mistakes, missing bits, outliers, etc.
• Value-added data—data that has been cleaned, analyzed, tagged and augmented with knowledge.

3.2 Text data formats

The four most common formats used to mark up text are: HTML, XML, JSON and CSV.

• HTML—HyperText Markup Language (HTML) is, as is well known, the markup language for web pages and other information that can be viewed in a web browser. HTML consists of HTML elements, which include tags enclosed in angle brackets (e.g., <div>), within the content of the web page.
• XML—Extensible Markup Language (XML) is the markup language for structuring textual data using <tag>…</tag> to define elements.
• JSON—JavaScript Object Notation (JSON) is a text-based open standard designed for human-readable data interchange and is derived from JavaScript.
• CSV—a comma-separated values (CSV) file contains the values in a table as a series of ASCII text lines organized such that each column value is separated by a comma from the next column's value and each row starts a new line.

For completeness, HTML and XML are so-called markup languages (markup and content) that define a set of simple syntactic rules for encoding documents in a format that is both human readable and machine readable. A markup comprises start-tags (e.g., <tag>), content text and end-tags (e.g., </tag>).

Many feeds use JavaScript Object Notation (JSON), the lightweight data-interchange format based on a subset of the JavaScript programming language. JSON is a language-independent text format that uses conventions familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. JSON's basic types are: Number, String, Boolean, Array (an ordered sequence of values, comma-separated and enclosed in square brackets) and Object (an unordered collection of key:value pairs).
The JSON format is illustrated in Fig. 1 for a query on the Twitter API on the string 'UCL,' which returns two 'text' results from the Twitter user 'uclnews.'

Comma-separated values are not a single, well-defined format but rather refer to any text file that: (a) is plain text using a character set such as ASCII, Unicode or EBCDIC; (b) consists of text records (e.g., one record per line); (c) with records divided into fields separated by delimiters (e.g., comma, semicolon and tab); and (d) where every record has the same sequence of fields.


Fig. 1 JSON Example

{
  "page":1,
  "query":"UCL",
  "results":[
    {
      "text":"UCL comes 4th in the QS World University Rankings. Good eh? http://bit.ly/PlUbsG",
      "date":"2012-09-11",
      "twitterUser":"uclnews"
    },
    {
      "text":"@uclcareers Like it!",
      "date":"2012-08-07",
      "twitterUser":"uclnews"
    }
  ],
  "results_per_page":2
}
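Parsing such a response is straightforward in most languages. For example, in Python the standard json module maps it onto nested dictionaries and lists; the following minimal sketch uses the Fig. 1 example (abbreviated to one result) as a literal string:

import json

# The Fig. 1 response, abbreviated to a single result.
raw_response = """
{
  "page": 1,
  "query": "UCL",
  "results": [
    {
      "text": "UCL comes 4th in the QS World University Rankings. Good eh? http://bit.ly/PlUbsG",
      "date": "2012-09-11",
      "twitterUser": "uclnews"
    }
  ],
  "results_per_page": 2
}
"""

# json.loads() turns the JSON text into nested Python dicts and lists.
response = json.loads(raw_response)

print(response["query"])              # -> UCL
for result in response["results"]:    # iterate over the results array
    print(result["date"], result["twitterUser"], result["text"])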

4 Social media providers

Social media data resources broadly subdivide into those providing:

• Freely available databases—repositories that can be freely downloaded, e.g., Wikipedia (http://dumps.wikimedia.org) and the Enron e-mail data set available via http://www.cs.cmu.edu/~enron/.
• Data access via tools—sources that provide controlled access to their social media data via dedicated tools, both to facilitate easy interrogation and also to stop users 'sucking' all the data from the repository. An example is Google's Trends. These further subdivide into:
  • Free sources—repositories that are freely accessible, but the tools protect or may limit access to the 'raw' data in the repository, such as the range of tools provided by Google.
  • Commercial sources—data resellers that charge for access to their social media data. Gnip and DataSift provide commercial access to Twitter data through a partnership, and Thomson Reuters to news data.
• Data access via APIs—social media data repositories providing programmable HTTP-based access to the data via APIs (e.g., Twitter, Facebook and Wikipedia).

4.1 Open-source databases

A major open source of social media is Wikipedia, which offers free copies of all available content to interested users (Wikimedia Foundation 2014). These databases can be used for mirroring, database queries and social media analytics. They include dumps from any Wikimedia Foundation project (http://dumps.wikimedia.org/), English Wikipedia dumps in SQL and XML (http://dumps.wikimedia.org/enwiki/), etc.

Another example of freely available data for research is the World Bank data, i.e., the World Bank Databank (http://databank.worldbank.org/data/databases.aspx), which provides over 40 databases, such as Gender Statistics, Health Nutrition and Population Statistics, Global Economic Prospects, World Development Indicators and Global Development Finance, and many others. Most of the databases can be filtered by country/region, series/topics or time (years and quarters). In addition, tools are provided to allow reports to be customized and displayed in table, chart or map formats.

4.2 Data access via tools

As discussed, most commercial services provide access to social media data via online tools, both to control access to the raw data and increasingly to monetize the data.

4.2.1 Freely accessible sources

Google, with tools such as Trends and Insights, is a good example of this category. Google is the largest 'scraper' in the world, but they do their best to 'discourage' scraping of their own pages. (For an introduction to how to surreptitiously scrape Google—and avoid being 'banned'—see http://google-scraper.squabbel.com.) Google's strategy is to provide a wide range of packages, such as Google Analytics, rather than, from a researcher's viewpoint, the more useful programmable HTTP-based APIs.

Figure 2 illustrates how Google Trends displays a particular search term, in this case 'libor.' Using Google Trends, you can compare up to five topics at a time and also see how often those topics have been mentioned and in which geographic regions the topics have been searched for the most.


Fig. 2 Google Trends

4.2.2 Commercial sources

There is an increasing number of commercial services that scrape social networking media and then provide paid-for access via simple analytics tools. (The more comprehensive platforms with extensive analytics are reviewed in Sect. 8.) In addition, companies such as Twitter are both restricting free access to their data and licensing their data to commercial data resellers, such as Gnip and DataSift.

Gnip is the world's largest provider of social data. Gnip was the first to partner with Twitter to make their social data available and, since then, it was the first to work with Tumblr, Foursquare, WordPress, Disqus, StockTwits and other leading social platforms. Gnip delivers social data to customers in more than 40 countries, and Gnip's customers deliver social media analytics to more than 95 % of the Fortune 500. Real-time data from Gnip can be delivered as a 'Firehose' of every single activity or via PowerTrack, a proprietary filtering tool that allows users to build queries around only the data they need. PowerTrack rules can filter data streams based on keywords, geo boundaries, phrase matches and even the type of content or media in the activity. The company then offers enrichments to these data streams such as Profile Geo (to add significantly more usable geo data for Twitter), URL expansion and language detection to further enhance the value of the data delivered. In addition to real-time data access, the company also offers Historical PowerTrack and Search API access for Twitter, which give customers the ability to pull any Tweet since the first message on March 21, 2006.

Gnip provides access to premium data feeds (Gnip's 'Complete Access' sources are publishers that have an agreement with Gnip to resell their data) and free data feeds (Gnip's 'Managed Public API Access' sources provide access to normalized and consolidated free data from their APIs, although it requires Gnip's paid services for the Data Collectors) via its dashboard (see Fig. 3). Firstly, the user only sees the feeds in the dashboard that were paid for under a sales agreement. To select a feed, the user clicks on a publisher and then chooses a specific feed from that publisher, as shown in Fig. 3. Different types of feeds serve different types of use cases and correspond to different types of queries and API endpoints on the publisher's source API. After selecting the feed, the user is assisted by Gnip to configure it with any required parameters before it begins collecting data. This includes adding at least one rule. Under 'Get Data' → 'Advanced Settings' you can also configure how often your feed queries the source API for data (the 'query rate'). Choose between the publisher's native data format and Gnip's Activity Streams format (XML for Enterprise Data Collector feeds).


Fig. 3 Gnip Dashboard, Publishers and Feeds

4.3 Data feed access via APIs

For researchers, arguably the most useful sources of social media data are those that provide programmable access via APIs, typically using HTTP-based protocols. Given their importance to academics, here we review individually wikis, social networking media, RSS feeds, news, etc.

4.3.1 Wiki media

Wikipedia (and wikis in general) provides academics with large open-source repositories of user-generated (crowd-sourced) content. What is not widely known is that Wikipedia provides HTTP-based APIs that allow programmable access and searching (i.e., scraping) that returns data in a variety of formats including XML. In fact, the API is not unique to Wikipedia but part of MediaWiki's (http://www.mediawiki.org/) open-source toolkit and hence can be used with any MediaWiki-based wikis.

The wiki HTTP-based API works by accepting requests containing one or more input arguments and returning strings, often in XML format, that can be parsed and used by the requesting client. Other formats supported include JSON, WDDX, YAML or PHP serialized. Details can be found at: http://en.wikipedia.org/w/api.php?action=query&list=allcategories&acprop=size&acprefix=hollywood&format=xml.

The HTTP request must contain: a) the requested 'action,' such as a query, edit or delete operation; b) an authentication request; and c) any other supported actions. For example, the above request returns an XML string listing the first 10 Wikipedia categories with the prefix 'hollywood.' Vaswani (2011) provides a detailed description of how to scrape Wikipedia using an Apache/PHP development environment and an HTTP client capable of transmitting GET and PUT requests and handling responses.
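For illustration, the same category query can be issued programmatically. The sketch below is one way to do it in Python with the third-party requests library (an assumption; the paper does not prescribe a client), asking the standard MediaWiki endpoint for JSON instead of XML so the response can be parsed directly:

import requests  # third-party library: pip install requests

# MediaWiki API endpoint for the English Wikipedia.
API_URL = "http://en.wikipedia.org/w/api.php"

# The category query from the text, requesting JSON rather than XML.
params = {
    "action": "query",
    "list": "allcategories",
    "acprefix": "hollywood",  # categories starting with 'hollywood'
    "acprop": "size",         # include category sizes
    "aclimit": 10,            # first 10 results
    "format": "json",
}

response = requests.get(API_URL, params=params).json()

# Each returned entry holds the category name under the '*' key.
for category in response["query"]["allcategories"]:
    print(category["*"])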

4.3.2 Social networking media

As with Wikipedia, popular social networks, such as Facebook, Twitter and Foursquare, make a proportion of their data accessible via APIs.

Although many social networking media sites provide APIs, not all sites (e.g., Bing, LinkedIn and Skype) provide API access for scraping data. While more and more social networks are shifting to publicly available content, many leading networks are restricting free access, even to academics. For example, Foursquare announced in December 2013 that it will no longer allow private check-ins on iOS 7, and has now partnered with Gnip to provide a continuous stream of anonymized check-in data. The data is available in two packages: the full Firehose access level and a filtered version via Gnip's PowerTrack service. Here, we briefly discuss the APIs provided by Twitter and Facebook.

4.3.2.1 Twitter

The default account setting keeps users' Tweets public, although users can protect their Tweets and make them visible only to their approved Twitter followers. However, less than 10 % of all the Twitter accounts are private. Tweets from public accounts (including replies and mentions) are available in JSON format through Twitter's Search API for batch requests of past data and Streaming API for near real-time data.

• Search API—query Twitter for recent Tweets containing specific keywords. It is part of the Twitter REST API v1.1 (it attempts to comply with the design principles of the REST architectural style, which stands for Representational State Transfer) and requires an authorized application (using OAuth, the open standard for authorization) before retrieving any results from the API.
• Streaming API—a real-time stream of Tweets, filtered by user ID, keyword, geographic location or random sampling.

One may retrieve recent Tweets containing particular keywords through Twitter's Search API (part of REST API v1.1) with the following API call: https://api.twitter.com/1.1/search/tweets.json?q=APPLE and real-time data using the Streaming API call: https://stream.twitter.com/1/statuses/sample.json.

Twitter's Streaming API allows data to be accessed via filtering (by keywords, user IDs or location) or by sampling of all updates from a select amount of users. The default access level 'Spritzer' allows sampling of roughly 1 % of all public statuses, with the option to retrieve 10 % of all statuses via the 'Gardenhose' access level (more suitable for data mining and research applications). In social media, streaming APIs are often called a Firehose—a syndication feed that publishes all public activities as they happen in one big stream. Twitter has recently announced the Twitter Data Grants program, where researchers can apply for access to Twitter's public tweets and historical data in order to gain insights from its massive data set (Twitter has more than 500 million tweets a day); research institutions and academics will not get the Firehose access level; instead, they will only get the data set needed for their research project. Researchers can apply for it at the following address: https://engineering.twitter.com/research/data-grants.

Twitter results are stored in a JSON array of objects containing the fields shown in Fig. 4. The JSON array consists of a list of objects matching the supplied filters and the search string, where each object is a Tweet and its structure is clearly specified by the object's fields, e.g., 'created_at' and 'from_user'. The example in Fig. 4 consists of the output of calling Twitter's GET search API via http://search.twitter.com/search.json?q=financial%20times&rpp=1&include_entities=true&result_type=mixed where the parameters specify that the search query is 'financial times,' one result per page, each Tweet should have a node called 'entities' (i.e., metadata about the Tweet) and list 'mixed' result types, i.e., include both popular and real-time results in the response.

4.3.2.2 Facebook

Facebook's privacy issues are more complex than Twitter's, meaning that a lot of status messages are harder to obtain than Tweets, requiring 'open authorization' status from users. Facebook currently stores all data as objects1 and has a series of APIs, ranging from the Graph and Public Feed APIs to the Keyword Insight API. In order to access the properties of an object, its unique ID must be known to make the API call. Facebook's Search API (part of Facebook's Graph API) can be accessed by calling https://graph.facebook.com/search?q=QUERY&type=page. The detailed API query format is shown in Fig. 5. Here, 'QUERY' can be replaced by any search term, and 'page' can be replaced with 'post,' 'user,' 'event,' 'group,' 'place,' 'checkin,' 'location' or 'placetopic.' The results of this search will contain the unique ID for each object. When returning the individual ID for a particular search result, one can use https://graph.facebook.com/ID to obtain further page details such as the number of 'likes.' This kind of information is of interest to companies when it comes to brand awareness and competition monitoring.

The Facebook Graph API search queries require an access token included in the request. Searching for pages and places requires an 'app access token', whereas searching for other types requires a user access token. Replacing 'page' with 'post' in the aforementioned search URL will return all public statuses containing this search term.2 Batch requests can be sent by following the procedure outlined here: https://developers.facebook.com/docs/reference/api/batch/. Information on retrieving real-time updates can be found here: https://developers.facebook.com/docs/reference/api/realtime/. Facebook also returns data in JSON format and so can be retrieved and stored using the same methods as used with data from Twitter, although the fields are different depending on the search type, as illustrated in Fig. 6.

1 An object may be a person, a page, a picture or an event.
2 Details of the information retrieved in status updates can be found here: https://developers.facebook.com/docs/reference/api/status/.
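To make the batch search call quoted above concrete, a minimal Python sketch follows, using the third-party requests library (an assumption; the paper does not prescribe a client). It assumes an OAuth 2 'app-only' bearer token has already been obtained, omits the OAuth handshake itself, and reflects the v1.1 endpoint and response fields as of the time of writing; both have changed since:

import requests  # third-party library: pip install requests

# Assumed: a bearer token already obtained via Twitter's OAuth flow.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder

# The batch Search API endpoint quoted above (Twitter REST API v1.1).
url = "https://api.twitter.com/1.1/search/tweets.json"
headers = {"Authorization": "Bearer " + BEARER_TOKEN}
params = {"q": "APPLE", "result_type": "mixed", "count": 10}

response = requests.get(url, headers=headers, params=params).json()

# In v1.1 responses, the matching Tweets sit in a 'statuses' array.
for tweet in response.get("statuses", []):
    print(tweet["created_at"], tweet["user"]["screen_name"], tweet["text"])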

Fig. 4 Example Output in JSON for Twitter REST API v1

{
  // Results page-specific nodes:
  "completed_in":0.019, // Seconds taken to generate the results page
  "max_id":270492897391034368, // Tweets maximum ID to be displayed up to
  "max_id_str":"270492897391034368", // String version of the max ID
  "next_page":"?page=2&max_id=270492897391034368&q=financial%20times&rpp=1&include_entities=1&result_type=mixed", // Next results page parameters
  "page":1, // Current results page
  "query":"financial+times", // Search query
  "refresh_url":"?since_id=270492897391034368&q=financial%20times&result_type=mixed&include_entities=1", // Current results page parameters
  // Results node consisting of a list of objects, i.e. Tweets:
  "results":[
    {
      // Tweet-specific nodes:
      "created_at":"Sun, 18 Nov 2012 16:51:58 +0000", // Timestamp Tweet was created at
      "entities":{"hashtags":[],"urls":[],"user_mentions":[]}, // Tweet metadata node
      "from_user":"zerohedge", // Tweet author username
      "from_user_id":18856867, // Tweet author user ID
      "from_user_id_str":"18856867", // String representation of the user ID
      "from_user_name":"zerohedge", // Tweet author username
      "geo":null, // Geotags (optional)
      "id":270207733444263936, // Tweet ID
      "id_str":"270207733444263936", // String representation of the Tweet ID
      "iso_language_code":"en", // Tweet language (English)
      "metadata":{"recent_retweets":6,"result_type":"popular"}, // Tweet metadata
      // Tweet author profile image URL (secure and non-secure HTTP):
      "profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/72647502\/tyler_normal.jpg",
      "profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/72647502\/tyler_normal.jpg",
      // Tweet source (whether it was posted from Twitter Web or another interface):
      "source":"<a href=\"http:\/\/www.tweetdeck.com\">TweetDeck<\/a>",
      "text":"Investment Banks to Cut 40,000 More Jobs, Financial Times Says", // Tweet content
      // Recipient details (if any):
      "to_user":null,
      "to_user_id":0,
      "to_user_id_str":"0",
      "to_user_name":null
    }
  ],
  // Other results page-specific nodes:
  "results_per_page":1, // Number of Tweets displayed per results page
  "since_id":0, // Minimum Tweet ID
  "since_id_str":"0" // String representation of the 'since_id' value
}

Fig. 5 Facebook Graph API Search Query Format

GET graph.facebook.com
  /search?
    q={your-query}
    &type={object-type}

4.3.3 RSS feeds

A large number of Web sites already provide access to content via RSS feeds. This is the syndication standard for publishing regular updates to web-based content, based on a type of XML file that resides on an Internet server. For Web sites, RSS feeds can be created manually or automatically (with software).

An RSS Feed Reader reads the RSS feed file, finds what is new, converts it to HTML and displays it. The program fragment in Fig. 7 shows the code for the control and channel statements of an RSS feed. The channel statements define the overall feed or channel; there is one set of channel statements in the RSS file.
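In practice, a few lines of Python can stand in for a simple feed reader. This sketch uses the third-party feedparser library (an assumption; the paper does not prescribe one), and the BBC news feed URL is just an illustrative example:

import feedparser  # third-party library: pip install feedparser

# Download and parse an RSS feed (any feed URL can be substituted).
feed = feedparser.parse("http://feeds.bbci.co.uk/news/rss.xml")

# Channel statements: the overall feed title and link.
print(feed.feed.title)
print(feed.feed.link)

# Item statements: one entry per story, with timestamp, title and link.
for entry in feed.entries:
    print(entry.get("published", "n/a"), "|", entry.title, "|", entry.link)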

4.3.4 Blogs, news groups and chat services

Blog scraping is the process of scanning through a large number of blogs, usually daily, searching for and copying content. This process is conducted through automated software. Figure 8 illustrates example code for blog scraping: it involves getting a Web site's source code via Java's URL class, which can then be parsed via regular expressions to capture the target content.


Fig. 6 Facebook Graph API Search Results for q='Centrica' and type='page'

{
  "id":"96184651725",
  "name":"Centrica",
  "picture":"http:\/\/profile.ak.fbcdn.net\/hprofile-ak-snc4\/71177_96184651725_7616434_s.jpg",
  "link":"http:\/\/www.facebook.com\/centricaplc",
  "likes":427,
  "category":"Energy\/utility",
  "website":"http:\/\/www.centrica.com",
  "username":"centricaplc",
  "about":"We're Centrica, meeting our customers' energy needs now...and in the future. As a leading integrated energy company, we're investing more now than ever in new sources of gas and power. http:\/\/www.centrica.com",
  "location":{
    "street":"Millstream, Maidenhead Road",
    "city":"Windsor",
    "country":"United Kingdom",
    "zip":"SL4 5GD",
    "latitude":51.485694848812,
    "longitude":-0.63927860415725
  },
  "phone":"+44 (0)1753 494000",
  "checkins":228,
  "talking_about_count":5
}
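A matching Python sketch for the Graph API search that produces results like Fig. 6 follows, again using the third-party requests library; the access token is a placeholder and the endpoint behavior reflects the Graph API as described in Sect. 4.3.2.2 (it has changed in later API versions):

import requests  # third-party library: pip install requests

# Assumed: a valid app or user access token (placeholder below).
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"

params = {
    "q": "Centrica",   # the 'QUERY' search term
    "type": "page",    # or 'post', 'user', 'event', 'group', 'place', ...
    "access_token": ACCESS_TOKEN,
}

response = requests.get("https://graph.facebook.com/search", params=params).json()

# Matching objects arrive in a 'data' array, each with its unique ID;
# a follow-up GET to https://graph.facebook.com/<ID> returns the full
# details shown in Fig. 6 (likes, location, etc.).
for page in response.get("data", []):
    print(page["id"], page.get("name"))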

<!-- RSS channel-specific tags: -->
[...]
<copyright>© Copyright The Financial Times Ltd 2012. "FT" and "Financial Times" are trademarks of the Financial Times. See http://www.ft.com/servicestools/help/terms#legal1 for the terms and conditions of reuse.</copyright>
<pubDate>Fri, 26 Oct 2012 09:42:18 GMT</pubDate> <!-- Timestamp RSS was published at. -->
<lastBuildDate>Fri, 26 Oct 2012 09:59:36 GMT</lastBuildDate> <!-- Last built timestamp of the RSS. -->
<webMaster>[email protected] (Client Support)</webMaster> <!-- Web master contact address. -->
<ttl>15</ttl> <!-- Time to live: the number of minutes the feed can stay cached before refreshing it from the source. -->
<category>Newspapers</category> <!-- RSS category. -->
[...]
<!-- RSS feed-specific tags (e.g., below there is a news story): title, description, link, date published, article ID. -->
<item>
  <title>Cynthia Carroll resigns at Anglo American</title>
  <link>http://www.ft.com/cms/s/0/d568891e-1f35-11e2-b2ad-00144feabdc0.html?ftcamp=published_links%2Frss%2Fhome_uk%2Ffeed%2F%2Fproduct</link>
  <description>Cynthia Carroll departs the mining group following speculation for some time that she was under pressure at the strike-hit company</description>
  <pubDate>Fri, 26 Oct 2012 07:33:44 GMT</pubDate>
  <guid isPermaLink="false">d568891e-1f35-11e2-b2ad-00144feabdc0</guid> <!-- article ID -->
</item>
[...]

Fig. 7 Example RSS Feed Control and Channel Statements

4.3.5 News feeds

News feeds are delivered in a variety of textual formats, often as machine-readable XML documents, JSON or CSV files. They include numerical values, tags and other properties that tend to represent underlying news stories. For testing purposes, historical information is often delivered via flat files, while live data for production is processed and delivered through direct data feeds or APIs. Figure 9 shows a snippet of the software calls to retrieve filtered NY Times articles.

Having examined the 'classic' social media data feeds, as an illustration of scraping innovative data sources, we will briefly look at geospatial feeds.


import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

// Use Java's URL, InputStream and DataInputStream classes to read in the content of the supplied URL.
URL url;
InputStream inputStream = null;
DataInputStream dataInputStream;
String line;
scrapedContent = "";
try {
    // Attempt to open the URL (if valid):
    url = new URL("http://blog.wordpress.com/");
    inputStream = url.openStream(); // throws an IOException
    dataInputStream = new DataInputStream(new BufferedInputStream(inputStream));
    // Read the content line by line and store it in the scrapedContent variable:
    while ((line = dataInputStream.readLine()) != null) {
        scrapedContent += line + "\n";
    }
} catch (MalformedURLException exception) {
    exception.printStackTrace();
} catch (IOException exception) {
    exception.printStackTrace();
} finally {
    try {
        inputStream.close();
    } catch (IOException exception) {
    }
}
[...]
// Use regular expressions (RE) to parse the desired content from the scrapedContent.
// RE will attempt to delimit text between some unique tags.

Fig. 8 Example Code for Blog Scraping

nyTimesArticles = GET http://api.nytimes.com/svc/search/v1/article
                      ?query=(field:)keywords(facet:[value])(&params)
                      &api-key=your-API-key
parse_JSON(nyTimesArticles)

Fig. 9 Scraping New York Times Articles
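Expressed as runnable Python, the pseudo-code of Fig. 9 might look roughly as follows. This is a sketch only: it assumes the v1 Article Search endpoint shown in the figure (since superseded by later versions), a placeholder API key, and a 'results' array in the response:

import requests  # third-party library: pip install requests

API_KEY = "your-API-key"  # placeholder, as in Fig. 9

# The v1 Article Search endpoint shown in Fig. 9.
url = "http://api.nytimes.com/svc/search/v1/article"
params = {
    "query": "financial crisis",  # '(field:)keywords' in the figure
    "api-key": API_KEY,
}

ny_times_articles = requests.get(url, params=params).json()

# parse_JSON(nyTimesArticles): walk the returned articles.
for article in ny_times_articles.get("results", []):
    print(article.get("date"), article.get("title"))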

4.3.6 Geospatial feeds

Much of the 'geospatial' social media data comes from mobile devices that generate location- and time-sensitive data. One can differentiate between four types of mobile social media feeds (Kaplan 2012):

• Location and time sensitive—exchange of messages with relevance for one specific location at one specific point in time (e.g., Foursquare).
• Location sensitive only—exchange of messages with relevance for one specific location, which are tagged to a certain place and read later by others (e.g., Yelp and Qype).
• Time sensitive only—transfer of traditional social media applications to mobile devices to increase immediacy (e.g., posting Twitter messages or Facebook status updates).
• Neither location nor time sensitive—transfer of traditional social media applications to mobile devices (e.g., watching a YouTube video or reading a Wikipedia entry).

With increasingly advanced mobile devices, notably smartphones, the content (photos, SMS messages, etc.) has geographical identification added, called 'geotagging.' These geospatial metadata are usually latitude and longitude coordinates, though they can also include altitude, bearing, distance, accuracy data or place names. GeoRSS is an emerging standard to encode geographic location into a web feed, with two primary encodings: GeoRSS Geography Markup Language (GML) and GeoRSS Simple.

Example tools are GeoNetwork Opensource—a free comprehensive cataloging application for geographically referenced information, and FeedBurner—a web feed provider that can also provide geotagged feeds, if the specified feed's settings allow it.

As an illustration, Fig. 10 shows the pseudo-code for analyzing a geospatial feed.


// Attempt to get the web site geotags by scraping the web page source code:
try getIcbmTags()         // attempt to get ICBM tags, such as <meta name="ICBM" content="latitude, longitude">
try getGeoStructureTags() // attempt to get tags such as <meta name="geo.position">, <meta name="geo.region">, <meta name="geo.placename">
// Attempt to get the web site's RSS geotags by scraping the RSS feeds, where the RSS source or each article can have their own geotags.
// Attempt to get Resource Description Framework (RDF) tags, such as <geo:lat>latitude</geo:lat><geo:long>longitude</geo:long><geo:alt>altitude</geo:alt>
try getRdfRssTags()
// Attempt to get RSS article-specific geotags, e.g.: <item><title>[...]</title><geo:lat>latitude</geo:lat><geo:long>longitude</geo:long>[...]</item>
try getIcbmRssTags()

Fig. 10 Pseudo-code for Analyzing a Geospatial Feed
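The first two steps of Fig. 10 can be sketched in Python with regular expressions over the page source. This is a simplification under stated assumptions: the ICBM and geo.* meta-tag conventions shown in the figure, attributes appearing in name-then-content order, and the requests library for fetching; the URL is a placeholder:

import re
import requests  # third-party library: pip install requests

def get_icbm_tags(html):
    # Look for an ICBM geotag, e.g. <meta name="ICBM" content="51.52, -0.13">.
    match = re.search(r'<meta\s+name="ICBM"\s+content="([^"]+)"', html, re.I)
    return match.group(1) if match else None

def get_geo_structure_tags(html):
    # Look for geo.position / geo.region / geo.placename meta tags.
    tags = {}
    for name in ("geo.position", "geo.region", "geo.placename"):
        match = re.search(
            r'<meta\s+name="%s"\s+content="([^"]+)"' % re.escape(name), html, re.I)
        if match:
            tags[name] = match.group(1)
    return tags

html = requests.get("http://example.com/").text  # placeholder page
print(get_icbm_tags(html))
print(get_geo_structure_tags(html))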

5 Text cleaning, tagging and storing

The importance of 'quality versus quantity' of data in social media scraping and analytics cannot be overstated (i.e., garbage in, garbage out). In fact, many details of analytics models are defined by the types and quality of the data. The nature of the data will also influence the database and hardware used.

Naturally, unstructured textual data can be very noisy (i.e., dirty). Hence, data cleaning (or cleansing, scrubbing) is an important area in social media analytics. The process of data cleaning may involve removing typographical errors or validating and correcting values against a known list of entities. Specifically, text may contain misspelled words, quotations, program codes, extra spaces, extra line breaks, special characters, foreign words, etc. So in order to achieve high-quality data, it is necessary to conduct data cleaning as a first step: spell checking, removing duplicates, finding and replacing text, changing the case of text, removing spaces and non-printing characters from text, fixing numbers, number signs and outliers, fixing dates and times, transforming and rearranging columns, rows and table data, etc.

Having reviewed the types and sources of raw data, we now turn to 'cleaning' or 'cleansing' the data to remove incorrect, inconsistent or missing information. Before discussing strategies for data cleaning, it is essential to identify possible data problems (Narang 2009):

• Missing data—when a piece of information existed but was not included for whatever reason in the raw data supplied. Problems occur with: a) numeric data, when a 'blank' or missing value is erroneously substituted by 'zero,' which is then taken (for example) as the current price; and b) textual data, when a missing word (like 'not') may change the whole meaning of a sentence.
• Incorrect data—when a piece of information is incorrectly specified (such as decimal errors in numeric data or the wrong word in textual data) or is incorrectly interpreted (such as a system assuming a currency value is in $ when in fact it is in £, or assuming text is in US English rather than UK English).
• Inconsistent data—when a piece of information is inconsistently specified. For example, with numeric data, this might be using a mixture of formats for dates: 2012/10/14, 14/10/2012 or 10/14/2012. For textual data, it might be as simple as: using the same word in a mixture of cases, mixing English and French in a text message, or placing Latin quotes in an otherwise English text.

5.1 Cleansing data

A traditional approach to text data cleaning is to 'pull' data into a spreadsheet or spreadsheet-like table and then reformat the text. For example, Google Refine3 is a standalone desktop application for data cleaning and transformation to various formats. Transformation expressions are written in the proprietary Google Refine Expression Language (GREL) or JYTHON (an implementation of the Python programming language written in Java). Figure 11 illustrates text cleansing.

5.2 Tagging unstructured data

Since most of the social media data is generated by humans and therefore is unstructured (i.e., it lacks a pre-defined structure or data model), an algorithm is required to transform it into structured data to gain any insight. Therefore, unstructured data needs to be preprocessed, tagged and then parsed in order to quantify/analyze the social media data.

Adding extra information to the data (i.e., tagging the data) can be performed manually or via rules engines, which seek patterns or interpret the data using techniques such as data mining and text analytics. Algorithms exploit the linguistic, auditory and visual structure inherent in all of the forms of human communication. Tagging the unstructured data usually involves tagging the data with metadata or part-of-speech (POS) tagging. Clearly, the unstructured nature of social media data leads to ambiguity and irregularity when it is being processed by a machine in an automatic fashion.

3 More information about Google Refine is found in its documentation wiki: https://github.com/OpenRefine/OpenRefine/wiki.

cleanseText(blogPost) {
    // Remove any links from the blog post:
    blogPost['text'] = handleLinks(blogPost['text'])
    // Remove unwanted ads inserted by Google Ads etc. within the main text body:
    blogPost['text'] = removeAds(blogPost['text'])
    // Normalize contracted forms, e.g. isn't becomes is not (so that negation words are explicitly specified):
    blogPost['text'] = normalizeContractedForms(blogPost['text'])
    // Remove punctuation; different logic rules should be specified for each punctuation mark.
    // You might not want to remove a hyphen surrounded by alphanumeric characters,
    // but you might want to remove a hyphen surrounded by at least one white space.
    blogPost['text'] = handlePunctuation(blogPost['text'])
    // Tokenize the text on white space, i.e. create an array of words from the original text:
    tokenizedText = tokenizeStatusOnWhiteSpace(blogPost['text'])
    // For each word, attempt to normalize it if it doesn't belong to the WordNet lexical database:
    for word in tokenizedText:
        if word not in WordNet dictionary:
            word = normalizeAcronym(word)
    // Further Natural Language Processing, POS Tagging ...
    return tokenizedText
}

Fig. 11 Text Cleansing Pseudo-code
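Rendered as runnable Python, the core of Fig. 11 might look as follows. This is a minimal sketch: link and ad removal are reduced to a single regex, the contractions list is illustrative, and the WordNet/acronym steps are omitted:

import re

def cleanse_text(text):
    # Remove any links from the text:
    text = re.sub(r"https?://\S+", " ", text)
    # Normalize contracted forms so negation words are explicit:
    contractions = {"isn't": "is not", "don't": "do not", "can't": "can not"}
    for contracted, expanded in contractions.items():
        text = text.replace(contracted, expanded)
    # Remove free-standing hyphens but keep ones inside words
    # (e.g. 'part-of-speech' survives, ' - ' does not):
    text = re.sub(r"(?<![0-9A-Za-z])-|-(?![0-9A-Za-z])", " ", text)
    # Remove the remaining punctuation:
    text = re.sub(r"[^\w\s-]", " ", text)
    # Tokenize on white space, i.e. create an array of words:
    return text.lower().split()

print(cleanse_text("This isn't bad - see http://example.com!"))
# -> ['this', 'is', 'not', 'bad', 'see']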

Using a single data set can provide some interesting insights. However, combining more data sets and processing the unstructured data can result in more valuable insights, allowing us to answer questions that were impossible beforehand.

5.3 Storing data

As discussed, the nature of the social media data is highly influential on the design of the database and possibly the supporting hardware. It is also very important to note that each social platform has very specific (and narrow) rules around how their respective data can be stored and used. These can be found in the Terms of Service for each platform.

For completeness, databases comprise:

• Flat file—a flat file is a two-dimensional database (somewhat like a spreadsheet) containing records that have no structured interrelationship and that can be searched sequentially.
• Relational database—a database organized as a set of formally described tables to recognize relations between stored items of information, allowing more complex relationships among the data items. Examples are row-based SQL databases and the column-based kdb+ used in finance.
• noSQL databases—a class of database management system (DBMS) identified by its non-adherence to the widely used relational database management system (RDBMS) model. noSQL/newSQL databases are characterized as being non-relational, distributed, open-source and horizontally scalable.

5.3.1 Apache (noSQL) databases and tools

The growth of ultra-large Web sites such as Facebook and Google has led to the development of noSQL databases as a way of breaking through the speed constraints that relational databases incur. A key driver has been Google's MapReduce, i.e., the software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers (Chandrasekar and Kowsalya 2011). It was developed at Google for indexing Web pages and replaced their original indexing algorithms and heuristics in 2004. The model is inspired by the 'Map' and 'Reduce' functions commonly used in functional programming. MapReduce (conceptually) takes as input a list of records, and the 'Map' computation splits them among the different computers in a cluster. The result of the Map computation is a list of key/value pairs. The corresponding 'Reduce' computation takes each set of values that has the same key and combines them into a single value. A MapReduce program is composed of a 'Map()' procedure for filtering and sorting and a 'Reduce()' procedure for a summary operation (e.g., counting and grouping).

Figure 12 provides a canonical example application of MapReduce: a process to count the appearances of each different word in a set of documents (MapReduce 2011).


void map(String name, String document):
    // name: document name
    // document: document contents
    // Split the input amongst the various computers within the cluster.
    for each word w in document:
        EmitIntermediate(w, "1"); // Output key-value pairs as the map function processes the data in its input file.

void reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    // Take each set of values with the same key and combine them into a single value.
    int sum = 0;
    for each pc in partialCounts:
        sum += ParseInt(pc);
    Emit(word, AsString(sum));

Fig. 12 The Canonical Example Application of MapReduce
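The data flow of Fig. 12 can be simulated on a single machine in a few lines of Python. This sketch mirrors the map, shuffle/group and reduce phases only; it has none of the distribution or fault tolerance of a real MapReduce framework:

from collections import defaultdict

# A toy input: a list of (document name, document contents) records.
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]

# Map phase: emit an intermediate (word, 1) pair for every word.
intermediate = []
for name, contents in documents:
    for word in contents.split():
        intermediate.append((word, 1))

# Shuffle/group phase: collect all values emitted under the same key.
grouped = defaultdict(list)
for word, partial_count in intermediate:
    grouped[word].append(partial_count)

# Reduce phase: combine each key's partial counts into a single value.
for word, partial_counts in sorted(grouped.items()):
    print(word, sum(partial_counts))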

5.3.1.1 Apache open-source software

The research community is increasingly using Apache software for social media analytics. Within the Apache Software Foundation, three levels of software are relevant:

• Cassandra/Hive databases—Apache Cassandra is an open-source (noSQL) distributed DBMS providing a structured 'key-value' store. Key-value stores allow an application to store its data in a schema-less way. Related noSQL database products include: Apache Hive, Apache Pig and MongoDB, a scalable and high-performance open-source database designed to handle document-oriented storage. Since noSQL databases are 'structure-less,' it is necessary to have a companion SQL database to retain and map the structure of the corresponding data.
• Hadoop platform—a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. An application is broken down into numerous small parts (also called fragments or blocks) that can be run on systems with thousands of nodes involving thousands of terabytes of storage.
• Mahout—provides implementations of distributed or otherwise scalable analytics (machine learning) algorithms running on the Hadoop platform. Mahout4 supports four classes of algorithms: a) clustering (e.g., K-Means, Fuzzy C-Means) that groups text into related groups; b) classification (e.g., the Complementary Naive Bayes classifier) that uses supervised learning to classify text; c) frequent itemset mining, which takes a set of item groups and identifies which individual items usually appear together; and d) recommendation mining (e.g., user- and item-based recommenders) that takes users' behavior and from that tries to find items users might like.

6 Social media analytics techniques

As discussed, opinion mining (or sentiment analysis) is an attempt to take advantage of the vast amounts of user-generated text and news content online. One of the primary characteristics of such content is its textual disorder and high diversity. Here, natural language processing, computational linguistics and text analytics are deployed to identify and extract subjective information from source text. The general aim is to determine the attitude of a writer (or speaker) with respect to some topic or the overall contextual polarity of a document.

6.1 Computational science techniques

Automated sentiment analysis of digital texts uses elements from machine learning such as latent semantic analysis, support vector machines, the bag-of-words model and semantic orientation (Turney 2002). In simple terms, the techniques employ three broad areas:

• Computational statistics—refers to computationally intensive statistical methods including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation and principal components analysis.
• Machine learning—a system capable of the autonomous acquisition and integration of knowledge learnt from experience, analytical observation, etc. (Murphy 2012). These sub-symbolic systems further subdivide into:
  • Supervised learning such as Regression Trees, Discriminant Function Analysis, Support Vector Machines.
  • Unsupervised learning such as Self-Organizing Maps (SOM), K-Means.

Machine learning aims to solve the problem of having huge amounts of data with many variables and is commonly used in areas such as pattern recognition (speech, images), financial algorithms (credit scoring, algorithmic trading) (Nuti et al. 2011), energy forecasting (load, price) and biology (tumor detection, drug discovery). Figure 13 illustrates the two learning types of machine learning and their algorithm categories.

4 Apache Mahout project page: http://mahout.apache.org/.

Fig. 13 Machine Learning Overview
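To make the two learning types of Fig. 13 concrete, the following minimal Python sketch (our illustrative addition; the toy data sets and parameters are assumptions) contrasts a supervised SVM with unsupervised K-Means clustering using scikit-learn:

from sklearn.svm import SVC
from sklearn.cluster import KMeans

# Supervised learning: an SVM learns a decision rule from labeled examples.
X_train = [[5.0, 1.0], [4.8, 1.2], [1.0, 4.9], [1.2, 5.1]]
y_train = ["positive", "positive", "negative", "negative"]
svm = SVC(kernel="linear").fit(X_train, y_train)
print(svm.predict([[4.9, 1.1]]))  # -> ['positive']

# Unsupervised learning: K-Means groups unlabeled points into clusters.
X_unlabeled = [[5.0, 1.0], [4.9, 1.3], [1.1, 5.0], [0.9, 4.8]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unlabeled)
print(kmeans.labels_)  # e.g., [1 1 0 0]: two discovered groups, no labels needed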

These techniques are deployed in two ways:

• Data mining—knowledge discovery that extracts hidden patterns from huge quantities of data, using sophisticated differential equations, heuristics, statistical discriminators (e.g., hidden Markov models) and artificial intelligence machine learning techniques (e.g., neural networks, genetic algorithms and support vector machines).
• Simulation modeling—simulation-based analysis that tests hypotheses. Simulation is used to attempt to predict the dynamics of systems so that the validity of the underlying assumption can be tested.

6.1.1 Stream processing

Lastly, we should mention stream processing (Botan et al 2010). Increasingly, analytics applications that consume real-time social media, financial ‘ticker’ and sensor network data need to process high-volume temporal data with low latency. These applications require support for online analysis of rapidly changing data streams. However, traditional database management systems (DBMSs) have no pre-defined notion of time and cannot handle data online in near real time. This has led to the development of Data Stream Management Systems (DSMSs) (Hebrail 2008)—processing in main memory without storing the data on disk—that handle transient data streams on-line and process continuous queries on these data streams. Example commercial systems include: the Oracle CEP engine, StreamBase and Microsoft’s StreamInsight (Chandramouli et al. 2010).
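As a flavor of the continuous-query style of processing (a minimal sketch of our own, not tied to any of the commercial engines above), the following Python fragment maintains word counts over a sliding one-minute window of an incoming message stream, working entirely in main memory rather than storing the stream on disk:

import time
from collections import Counter, deque

WINDOW_SECONDS = 60   # assumed window length for this sketch

window = deque()      # (timestamp, word) pairs currently inside the window
counts = Counter()    # continuously maintained result of the 'query'

def on_message(text, now=None):
    # Called for every arriving message; updates the live counts in memory.
    now = now if now is not None else time.time()
    for word in text.lower().split():
        window.append((now, word))
        counts[word] += 1
    # Evict items that have fallen out of the one-minute window.
    while window and window[0][0] < now - WINDOW_SECONDS:
        _, old_word = window.popleft()
        counts[old_word] -= 1

on_message("market rally continues")
on_message("rally fades")
print(counts.most_common(2))  # e.g., [('rally', 2), ('market', 1)]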
6.2 Sentiment analysis

Sentiment is about mining attitudes, emotions, feelings—it is subjective impressions rather than facts. Generally speaking, sentiment analysis aims to determine the attitude expressed by the text writer or speaker with respect to the topic or the overall contextual polarity of a document (Mejova 2009). Pang and Lee (2008) provide a thorough documentation on the fundamentals and approaches of sentiment classification and extraction, including sentiment polarity, degrees of positivity, subjectivity detection, opinion identification, non-factual information, term presence versus frequency, POS (parts of speech), syntax, negation, topic-oriented features and term-based features beyond term unigrams.

6.2.1 Sentiment classification

Sentiment analysis divides into specific subtasks:

• Sentiment context—to extract opinion, one needs to know the ‘context’ of the text, which can vary significantly from specialist review portals/feeds to general forums where opinions can cover a spectrum of topics (Westerski 2008).
• Sentiment level—text analytics can be conducted at the document, sentence or attribute level.
• Sentiment subjectivity—deciding whether a given text expresses an opinion or is factual (i.e., without expressing a positive/negative opinion).
• Sentiment orientation/polarity—deciding whether an opinion in a text is positive, neutral or negative.
• Sentiment strength—deciding the ‘strength’ of an opinion in a text: weak, mild or strong.

Perhaps the most difficult analysis is identifying sentiment orientation/polarity and strength—positive (wonderful, elegant, amazing, cool), neutral (fine, ok) and negative (horrible, disgusting, poor, flakey, sucks)—due to slang.

A popular approach is to assign orientation/polarity scores (+1, 0, -1) to all words: positive opinion (+1), neutral opinion (0) and negative opinion (-1). The overall orientation/polarity score of the text is the sum of the orientation scores of all ‘opinion’ words found. However, there are various potential problems with this simplistic approach, such as negation (e.g., ‘there is nothing I hate about this product’). One method of estimating the sentiment orientation/polarity of a text is pointwise mutual information (PMI), a measure of association used in information theory and statistics.
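For PMI, $\operatorname{PMI}(x, y) = \log_2 [P(x, y)/(P(x)P(y))]$; following Turney (2002), the semantic orientation of a phrase can be estimated as PMI(phrase, ‘excellent’) - PMI(phrase, ‘poor’). The word-scoring approach itself can be captured in a few lines of Python; the following is a minimal sketch of our own, whose tiny lexicon and single-word negation rule are illustrative assumptions rather than a production design:

LEXICON = {"wonderful": +1, "elegant": +1, "amazing": +1, "cool": +1,
           "fine": 0, "ok": 0,
           "horrible": -1, "disgusting": -1, "poor": -1, "flakey": -1}
NEGATORS = {"not", "nothing", "never", "no"}

def polarity_score(text):
    # Sum the orientation scores of all 'opinion' words found in the text,
    # flipping the score of an opinion word preceded by a negator.
    score, negate = 0, False
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATORS:
            negate = True
        elif word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
            negate = False
    return score

print(polarity_score("The UI is elegant but the battery is horrible"))  # 0
print(polarity_score("Not horrible at all, actually amazing"))          # 2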
6.2.2 Supervised learning methods

There are a number of popular computational statistics and machine learning techniques used for sentiment analysis. For a good introduction, see (Khan et al 2010). Techniques include:

• Naïve Bayes (NB)—a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions (i.e., features are assumed independent of one another within each class).
• Maximum entropy (ME)—the probability distribution that best represents the current state of knowledge is the one with the largest information-theoretical entropy.
• Support vector machines (SVM)—supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.
• Logistic regression (LR) model—a type of regression analysis used for predicting the outcome of a categorical criterion variable (a variable that can take on a limited number of categories) based on one or more predictor variables.
• Latent semantic analysis—an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text (Kobayashi and Takeda 2000).

The bag-of-words model is a simplifying representation commonly used in natural language processing and IR, where a sentence or a document is represented as an unordered collection of words, disregarding grammar and even word order. This model is traditionally applied to sentiment analysis thanks to its simplicity.

6.2.2.1 Naïve Bayes classifier (NBC) As an example of sentiment analysis, we will briefly describe a Naive Bayes classifier (Murphy 2006). The Naive Bayes classifier is general purpose, simple to implement and works well for a range of applications. It classifies data in two steps:

• Training step—using the training samples, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class.
• Analysis/testing step—for any unseen test sample, the method computes the posterior probability of that sample belonging to each class. The method then classifies the test sample according to the largest posterior probability.

Using the Naïve Bayes classifier, the classifier calculates the probability for a text to belong to each of the categories tested against. The category with the highest probability for the given text wins:

$$\operatorname{classify}(word_1, word_2, \ldots, word_n) = \arg\max_{cat} \; P(cat) \prod_{i=1}^{n} P(word_i \mid cat)$$

Figure 14 provides an example of sentiment classification using a Naïve Bayes classifier in Python. There are a number of Naïve Bayes classifier programs available in Java, including the jBNC toolkit (http://jbnc.sourceforge.net), WEKA (www.cs.waikato.ac.nz/ml/weka) and Alchemy API (www.alchemyapi.com/api/demo.html).

We next look at the range of social media tools available, starting with ‘tools’ and ‘toolkits,’ and in the subsequent chapter at ‘comprehensive’ social media platforms. Since there are a large number of social media textual data services, tools and platforms, we will restrict ourselves to examining a few leading examples.

7 Social media analytics tools

The market for opinion mining tools is crowded with (commercial) providers, most of which are skewed toward sentiment analysis of customer feedback about products and services. Fortunately, there is a vast spectrum of tools for textual analysis, ranging from simple open-source tools to libraries, multi-function commercial toolkits and platforms. This section focuses on individual tools and toolkits for scraping, cleaning and analytics; the next chapter looks at what we call social media platforms, which provide both archive data and real-time feeds, as well as sophisticated analytics tools.


trainingSet = []
for (tweet, label) in trainingSetMessage:
    # Normalize case, strip punctuation and tokenize on white space
    tokens = preProcessMessage(tweet)
    # Represent each tweet as a bag-of-words feature set stored with its label
    trainingSet.append((getFeatures(tokens), label))

classifier = NaiveBayesClassifier.train(trainingSet)

# Classify an unseen message using the same pre-processing and feature extraction
predictedLabel = classifier.classify(getFeatures(preProcessMessage(newTweet)))

Fig. 14 Sentiment Classification Example using Python

7.1 Scientific programming tools

Popular scientific analytics libraries and tools have been enhanced to provide support for sourcing, searching and analyzing text. Examples include: R—used for statistical programming, MATLAB—used for numeric scientific programming, and Mathematica—used for symbolic scientific programming (computer algebra).

Data processing and data modeling, e.g., regression analysis, are straightforward using MATLAB, which provides time-series analysis, GUIs and array-based statistics. MATLAB is significantly faster than traditional programming languages for such numeric tasks and can be used for a wide range of applications. Moreover, the exhaustive built-in plotting functions make it a comprehensive analytics toolkit. More computationally powerful algorithms can be developed using it in conjunction with add-on packages (e.g., FastICA in order to perform independent component analysis).

Python can be used for (natural) language detection, title and content extraction, and query matching and, when used in conjunction with a module such as scikit-learn, it can be trained to perform sentiment analysis, e.g., using a Naïve Bayes classifier (see the sketch at the end of this subsection).

Another example, Apache UIMA (Unstructured Information Management Applications), is an open-source project that analyzes ‘big data’ and discovers information that is relevant to the user.
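As an illustration of the scikit-learn route just mentioned (our sketch; the tiny training corpus is an assumption), a bag-of-words Naïve Bayes sentiment classifier can be assembled in a few lines:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["amazing product, works wonderfully",
               "elegant and cool design",
               "horrible quality, poor support",
               "disgusting experience, flakey app"]
train_labels = ["positive", "positive", "negative", "negative"]

# CountVectorizer builds the bag-of-words representation; MultinomialNB is
# the Naive Bayes classifier trained on those word counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["the support was horrible"]))  # -> ['negative']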
7.2 Business toolkits

Business toolkits are commercial suites of tools that allow users to source, search and analyze text for a range of commercial purposes.

SAS Sentiment Analysis Manager, part of the SAS Text Analytics program, can be used for scraping content sources, including mainstream Web sites and social media outlets, as well as internal organizational text sources, and creates reports that describe the expressed feelings of consumers, customers and competitors in real time.

RapidMiner (Hirudkar and Sherekar 2013) is a popular toolkit offering an open-source Community Edition released under the GNU AGPL and an Enterprise Edition offered under a commercial license. RapidMiner provides data mining and machine learning procedures including: data loading and transformation (Extract, Transform, Load, a.k.a. ETL), data preprocessing and visualization, modeling, evaluation, and deployment. RapidMiner is written in Java and uses learning schemes and attribute evaluators from the Weka machine learning environment and statistical modeling schemes from the R project.

Other examples are Lexalytics, which provides a commercial sentiment analysis engine for many OEM and direct customers, and IBM SPSS Statistics, one of the most used programs for statistical analysis in social science.

7.3 Social media monitoring tools

Social media monitoring tools are sentiment analysis tools for tracking and measuring what people are saying (typically) about a company or its products, or any topic across the web’s social media landscape.

In the area of social media monitoring, examples include: Social Mention (http://socialmention.com/), which provides social media alerts similarly to Google Alerts; Amplified Analytics (http://www.amplifiedanalytics.com/), which focuses on product reviews and marketing information; Lithium Social Media Monitoring; and Trackur, an online reputation monitoring tool that tracks what is being said on the Internet.

Google also provides a few useful free tools. Google Trends shows how often a particular search term is entered relative to the total search volume. Another tool built around Google Search is Google Alerts—a content change detection tool that provides notifications automatically. Google also acquired FeedBurner—an RSS feed management tool—in 2007.


7.4 Text analysis tools

Text analysis tools are broad-based tools for natural language processing and text analysis. Examples of companies in the text analysis area include OpenAmplify and Jodange, whose tools automatically filter and aggregate thoughts, feelings and statements from traditional and social media.

There are also a large number of freely available tools produced by academic groups and non-governmental organizations (NGOs) for sourcing, searching and analyzing opinions. Examples include the Stanford NLP group tools and LingPipe, a suite of Java libraries for the linguistic analysis of human language (Teufl et al 2010).

A variety of open-source text analytics tools are available, especially for sentiment analysis. A popular text analysis tool, which is also open source, is the Python NLTK—Natural Language Toolkit (www.nltk.org/), which includes open-source Python modules, linguistic data and documentation for text analytics (a short sketch follows at the end of this subsection). Another is GATE (http://gate.ac.uk/sentiment).

We should also mention the Lexalytics Sentiment Toolkit, which performs automatic sentiment analysis on input documents. It is powerful when used on a large number of documents, but it does not perform data scraping.

Other commercial software for text mining includes: AeroText, Attensity, Clarabridge, IBM LanguageWare, SPSS Text Analytics for Surveys, Language Computer Corporation, STATISTICA Text Miner and WordStat.
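As a flavor of NLTK (a minimal sketch of our own; it assumes the standard tokenizer and tagger models have been fetched once via nltk.download), a few lines suffice to tokenize and part-of-speech tag raw text, typical first steps before sentiment scoring:

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "The camera is amazing, but the battery life is horrible."
tokens = nltk.word_tokenize(text)   # ['The', 'camera', 'is', 'amazing', ...]
tagged = nltk.pos_tag(tokens)       # [('The', 'DT'), ('camera', 'NN'), ...]

# Adjectives (JJ* tags) are often the strongest sentiment cues in
# bag-of-words models.
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]
print(adjectives)                   # ['amazing', 'horrible']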
7.5 Data visualization tools

Data visualization tools provide business intelligence (BI) capabilities and allow different types of users to gain insights from ‘big’ data. Users can perform exploratory analysis through interactive user interfaces available on the majority of devices, with a recent focus on mobile devices (smartphones and tablets). The data visualization tools help users identify patterns, trends and relationships in data which were previously latent. Fast ad hoc visualization of the data can reveal patterns and outliers, and it can be performed on large-scale data sets using frameworks such as Apache Hadoop or Amazon Kinesis. Two notable visualization tools are SAS Visual Analytics and Tableau.

7.6 Case study: SAS Sentiment Analysis and Social Media Analytics

SAS is the leading advanced analytics software for BI, data management and predictive analytics. SAS Sentiment Analysis (SAS Institute 2013) automatically rates and classifies opinions. It also performs data scraping from Web sites, social media and internal file systems. Then, it processes these in a unified format to evaluate relevance with regard to its pre-defined topics. SAS Sentiment Analysis identifies trends and emotional changes. Experts can refine the sentiment models through an interactive workbench. The tool automatically assigns sentiment scores to the input documents as they are retrieved in real time.

SAS Sentiment Analysis combines statistical modeling and linguistics (rule-based natural language processing techniques) in order to output accurate sentiment analysis results. The tool monitors and evaluates sentiment changes over time; it extracts sentiments in real time as the scraped data is being retrieved and generates reports showing patterns and detailed reactions.

The software identifies where (i.e., on what channel) the topic is being discussed and quantifies perceptions in the market as it scrapes and analyzes both internal and external content about your organization (or the concept you are analyzing) and competitors, identifying positive, neutral, negative or ‘no sentiment’ texts in real time.

SAS Sentiment Analysis and SAS Social Media Analytics have a user-friendly interface for developing models; users can upload sentiment analysis models directly to the server in order to minimize manual model deployment. More advanced users can use the interactive workbench to refine their models. The software includes graphics to illustrate instantaneously the text classification (i.e., positive, negative, neutral or unclassified) and point-and-click exploration in order to drill down into the classified text. The tool also provides some workbench functionality through APIs, allowing for automatic/programmatic integration with other modules/projects. Figure 15 illustrates the SAS Social Media Analytics graphical reports, which provide user-friendly sentiment insights.

The SAS software has crawling plugins for the most popular social media sites, including Facebook, Twitter, Bing, LinkedIn, Flickr and Google. It can also be customized to crawl any Web site using the mark-up matcher, which provides a point-and-click interface to indicate what areas need to be extracted from an HTML or XML document. SAS Social Media Analytics gathers online conversations from popular networking sites (e.g., Facebook and Twitter), blogs and review sites (e.g., TripAdvisor and Priceline), and scores the data for influence and sentiment. It provides visualization tools for real-time tracking; it allows users to submit customized queries and returns a geographical visualization with brand-specific commentary from Twitter, as illustrated in Fig. 16.

8 Social media analytics platforms

Here, we examine comprehensive social media platforms that combine social media archives, data feeds, data mining and data analysis tools. Simply put, the platforms are different from tools and toolkits since platforms are more comprehensive and provide both tools and data.

Fig. 15 Graphical Reports with Sentiment Insights

They broadly subdivide into:

• News platforms—platforms such as Thomson Reuters providing news archives/feeds and associated analytics, targeting companies such as financial institutions seeking to monitor market sentiment in news.
• Social network media platforms—platforms that provide data mining and analytics on Twitter, Facebook and a wide range of other social network media sources. Providers typically target companies seeking to monitor sentiment around their brands or products.

8.1 News platforms

The two most prominent business news feed providers are Thomson Reuters and Bloomberg. Computers read news in real time and automatically provide key indicators and meaningful insights. The news items are automatically retrieved, analyzed and interpreted in a few milliseconds. The machine-readable news indicators can potentially improve quantitative strategies, risk management and decision making.

Examples of machine-readable news include: Thomson Reuters Machine Readable News, Bloomberg’s Event-Driven Trading Feed and AlphaFlash (Deutsche Börse’s machine-readable news feed). Thomson Reuters Machine Readable News (Thomson Reuters 2012a, b, c) has Reuters News content dating back to 1987, and comprehensive news from over 50 third parties dating back to 2003, such as PR Newswire, Business Wire and the Regulatory News Service (LSE). The feed offers full text and comprehensive metadata via streaming XML.

Thomson Reuters News Analytics uses Natural Language Processing (NLP) techniques to score news items on tens of thousands of companies and nearly 40 commodities and energy topics. Items are measured across the following dimensions:

• Author sentiment—metrics for how positive, negative or neutral the tone of the item is, specific to each company in the article.
• Relevance—how relevant or substantive the story is for a particular item.


Fig. 16 SAS Visualization of Real-Time Tracking via Twitter

• Volume analysis—how much news is happening on a particular company.
• Uniqueness—how new or repetitive the item is over various time periods.
• Headline analysis—denotes special features such as broker actions, pricing commentary, interviews, exclusives and wrap-ups.

8.2 Social network media platforms

Attensity, Brandwatch, Salesforce Marketing Cloud (previously called Radian6) and Sysomos MAP (Media Analysis Platform) are examples of social media monitoring platforms, which measure demographics, influential topics and sentiments. They include text analytics and sentiment analysis on online consumer conversations and provide user-friendly interfaces for customizing the search query, dashboards, reports and file export features (e.g., to Excel or CSV format). Most of the platforms scrape a range of social network media using a distributed crawler that targets: micro-blogging (Twitter via the full Twitter Firehose), blogs (Blogger, WordPress, etc.), social networks (Facebook and MySpace), forums, news sites, image sites (Flickr) and corporate sites. Some of the platforms provide multi-language support for widely used languages (e.g., English, French, German, Italian and Spanish).

Sentiment analysis platforms use two main methodologies. One involves a statistical or model-based approach wherein the system learns to assess sentiment by analyzing large quantities of pre-scored material. The other utilizes a large dictionary of pre-scored phrases.

RapidMiner5 is a platform which combines data mining and data analysis and which, depending on the requirements, can be open source. It uses the WEKA machine learning library and provides access to data sources such as Excel, Access, Oracle, IBM, MySQL, PostgreSQL and text files.

5 http://rapid-i.com/.

Mozenda provides a point-and-click user interface for extracting specific information from Web sites and allows automation and data export to CSV, TSV or XML files.

DataSift provides access to both real-time and historical social data from the leading social networks and millions of other sources, enabling clients to aggregate, filter, gain insights and discover trends from the billions of public social conversations. Once the data is aggregated and processed (i.e., DataSift can filter and add context, such as enrichments—language processing, geodata and demographics—and categorization—spam detection, intent identification and machine learning), customers can use pre-built integrations with popular BI tools, application and developer tools to deliver the data into their businesses, or use the DataSift APIs to stream real-time data into their applications.

There are a growing number of social media analytics platforms being founded nowadays.
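To indicate what consuming such a streaming social-data API typically looks like, here is a generic Python sketch of our own; the endpoint URL, token and JSON field names are hypothetical placeholders, not the actual API of DataSift or any other provider named above:

import json
import requests

STREAM_URL = "https://api.example-provider.com/v1/stream"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}    # hypothetical token

with requests.get(STREAM_URL, headers=HEADERS, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue                # skip keep-alive blank lines
        post = json.loads(line)     # assume one JSON-encoded post per line
        # Downstream enrichment/analytics would go here, e.g., sentiment scoring.
        print(post.get("author"), post.get("text"))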
Other notable platforms that handle sentiment and semantic analysis of Web and Web 2.0-sourced material include Google Analytics, HP Autonomy IDOL (Intelligent Data Operating Layer), IBM SPSS Modeler, Adobe SocialAnalytics, GraphDive, Keen IO, Mass Relevance, Parse.ly, ViralHeat, Socialbakers, DachisGroup, evolve24, OpenAmplify and AdmantX.

Recently, more and more specialized social analytics platforms have emerged. One of them is iSpot.tv, which launched its own social media analytics platform that matches television ads with mentions on Twitter and Facebook. It provides real-time reports about when and where an ad appears, together with what people are saying about it on social networks (iSpot.tv monitors almost 80 different networks).

Thomson Reuters has recently announced that it is now incorporating Twitter sentiment analysis into the Thomson Reuters Eikon market analysis and trading platform, providing visualizations and charts based on the sentiment data. In the previous year, Bloomberg incorporated tweets related to specific companies in a wider data stream.

8.3 Case study: Thomson Reuters News Analytics

Thomson Reuters News Analytics (TRNA) provides a huge news archive with analytics to read and interpret news, offering meaningful insights. TRNA scores news items on over 25,000 equities and nearly 40 topics (commodities and energy). The platform scrapes and analyzes news data in real time and feeds the data into other programs/projects or quantitative strategies.

TRNA uses an NLP system from Lexalytics, one of the linguistics technology leaders, that can track news sentiment over time and scores text across the various dimensions mentioned in Sect. 8.1. The platform’s text scoring and metadata has more than 80 fields (Thomson Reuters 2010), such as:

• Item type—stage of the story: Alert, Article, Updates or Corrections.
• Item genre—classification of the story, i.e., interview, exclusive and wrap-up.
• Headline—alert or headline text.
• Relevance—varies from 0 to 1.0.
• Prevailing sentiment—can be 1, 0 or -1.
• Positive, neutral, negative—more detailed sentiment indication.
• Broker action—denotes broker actions: upgrade, downgrade, maintain, undefined or whether it is the broker itself.
• Price/market commentary—used to flag items describing pricing/market commentary.
• Topic codes—describes what the story is about, i.e., RCH = Research, RES = Results, RESF = Results Forecast, MRG = Mergers and Acquisitions.

A snippet of the news sentiment analysis is illustrated in Fig. 17.

In 2012, Thomson Reuters extended its machine-readable news offering to include sentiment analysis and scoring for social media. TRNA’s extension is called Thomson Reuters News Analytics (TRNA) for Internet News and Social Media, which aggregates content from over four million social media channels and 50,000 Internet news sites. The content is then analyzed by TRNA in real time, generating a quantifiable output across dimensions such as sentiment, relevance, novelty, volume, category and source ranks. This extension uses the same extensive metadata tagging (across more than 80 fields).

TRNA for Internet News and Social Media is a powerful platform for analyzing, tagging and filtering millions of public and premium sources of Internet content, turning big data into actionable ideas. The platform also provides a way to visually analyze the big data. It can be combined with Panopticon Data Visualization Software in order to reach meaningful conclusions more quickly with visually intuitive displays (Thomson Reuters 2012a, b, c), as illustrated in Fig. 18.

Thomson Reuters also expanded the News Analytics service with MarketPsych Indices (Thomson Reuters 2012a, b, c), which allows for real-time psychological analysis of news and social media. The Thomson Reuters MarketPsych Indices (TRMI) service gains a quantitative view of market psychology as it attempts to identify human emotion and sentiment. It is a complement to TRNA and uses NLP processing created by MarketPsych (http://www.marketpsych.com), a leading company in behavioral psychology in financial markets.

Behavioral economists have extensively investigated whether emotions affect markets in predictable ways, and TRMI attempts to measure the state of ‘emotions’ in real time in order to identify patterns as they emerge. TRMI has two key indicator types:


Fig. 17 Thomson Reuters News Discovery Application with Sentiment Analysis

Fig. 18 Combining TRNA for Internet News and Social Media with Panopticon Data Visualization Software

• Emotional indicators (sentiments)—emotions such as Gloom, Fear, Trust, Uncertainty, Innovation, Anger, Stress, Urgency, Optimism and Joy.
• Buzz metrics—these indicate how much something is being discussed in the news and social media and include macroeconomic themes (e.g., Litigation,


Mergers, Volatility, Financials sector, Airlines sector and Clean Technology sector).

The platform from Thomson Reuters allows the exploitation of news and social media to spot opportunities and capitalize on market inefficiencies (Thomson Reuters 2013).

9 Experimental computational environment for social media

As we have discussed in Sect. 2 (methodology and critique), researchers arguably require a comprehensive experimental computational environment/facility for social media research with the following attributes:

9.1 Data

• Data scraping—the ability through easily programmable APIs to scrape any type of social media (social networking media, RSS feeds, blogs, wikis, news, etc.).
• Data streaming—to access and combine real-time feeds and archived data for analytics.
• Data storage—a major facility for storing principal data sources and for archiving data collected for specific projects.
• Data protection/security—the stored data needs to be protected to stop users attempting to ‘suck it out’ of the facility. Access to certain data sets may need to be restricted and charges may be levied on access (cf. Wharton Research Data Services).
• Programmable interfaces—researchers need access to simple application programming interfaces (APIs) to scrape and store other available data sources that may not be automatically collected.

9.2 Analytics

• Analytics dashboards—non-programming interfaces are required for giving what might be referred to as ‘deep’ access to ‘raw’ data.
• Programmable analytics—programming interfaces are also required so users can deploy advanced data mining and computer simulation models using MATLAB, Java and Python.
• Stream processing—facilities are required to support analytics on streamed real-time data feeds, such as Twitter feeds, news feeds and financial tick data.
• High-performance computing—lastly, the environment needs to support non-programming interfaces to MapReduce/Hadoop, NoSQL databases and Grids of processors.
• Decentralized analytics—if researchers are to combine social media data with highly sensitive/valuable proprietary data held by governments, financial institutions, retailers and other commercial organizations, then the environment needs in the future to support decentralized analytics across distributed data sources in a highly secure way.

Realistically, this is best facilitated at a national or international level.

To provide some insight into the structure of an experimental computational environment for social media (analytics), below we present the system architecture of the UCL SocialSTORM analytics platform developed by Dr. Michal Galas and his colleagues (Galas et al. 2012) at University College London (UCL).

University College London’s social media streaming, storage and analytics platform (SocialSTORM) is a cloud-based ‘central hub’ platform, which facilitates the acquisition of text-based data from online sources such as Twitter, Facebook, RSS media and news. The system includes facilities to upload and run Java-coded simulation models to analyze the aggregated data, which may comprise scraped social data and/or users’ own proprietary data.

9.3 System architecture

Figure 19 shows the architecture of the SocialSTORM platform, and the following section outlines the key components of the overall system. The basic idea is that each external feed has a dedicated connectivity engine (API) and this streams data to the message bus, which handles internal communication, analytics and storage.

• Connectivity engines—the connectivity modules communicate with the external data sources, including Twitter and Facebook’s APIs, financial blogs, and various RSS and news feeds. The platform’s APIs are continually being expanded to incorporate other social media sources as required. Data is fed into SocialSTORM in real time, including a random sample of all public updates from Twitter, providing gigabytes of text-based data every day.
• Messaging bus—the message bus serves as the internal communication layer which accepts the incoming data streams (messages) from the various connectivity engines, parses these (from either JSON or XML format) to an internal representation of data in the platform, distributes the information across all the interested modules and writes the various data to the appropriate tables of the main database.

Fig. 19 SocialSTORM Platform Architecture

• Data warehouse—the database supports terabytes of text-based entries, which are accompanied by various types of metadata to expand the potential avenues of research. Entries are organized by source and accurately time-stamped with the time of publication, as well as being tagged with topics for easy retrieval by simulation models. The platform currently uses HBase, but in future might use Apache Cassandra or Hive.
• Simulation manager—the simulation manager provides an external API for clients to interact with the data for research purposes, including a web-based GUI whereby users can select various filters to apply to the data sets before uploading a Java-coded simulation model to perform the desired analysis on the data. This facilitates client access to the data warehouse and also allows users to upload their own data sets for aggregation with UCL’s social data for a particular simulation. There is also the option to switch between historical mode (which mines data existing at the time the simulation is started) and live mode (which ‘listens’ to incoming data streams and performs analysis in real time).

9.4 Platform components

The platform comprises the following modules, which are illustrated in Fig. 20:

• Back-end services—this provides the core of the platform functionalities. It is a set of services that allow connections to data providers, propagation, processing and aggregation of data feeds, and execution and maintenance of models, as well as their management in a multiuser environment.
• Front-end client APIs—this provides a set of programmatic and graphical interfaces that can be used to interact with the platform to implement and test analytical models. The programmatic access provides model templates to simplify access to some of the functionalities and defines the generic structure of every model in the platform. The graphic user interface allows visual management of analytical models. It enables the user to visualize data in various forms, provides data watch grid capabilities, provides a dynamic visualization of group behavior of data and allows users to observe information on events relevant to the user’s environment.
• Connectivity engine—this functionality provides a means of communication with the outside world, with financial brokers, data providers and others. Each of the outside venues utilized by the platform has a dedicated connector object responsible for control of communication. This is possible due to the fact that each of the outside institutions provides either a dedicated API or uses a standard communication protocol (e.g., the FIX protocol or a JSON/XML-based protocol). The platform provides a generalized interface to allow standardization of a variety of connectors.
• Internal communication layer—the idea behind the use of the internal messaging system in the platform draws from the concept of event-driven programming. Analytical platforms utilize events as a main means of communication between their elements. The elements, in turn, are either producers or consumers of events. This approach significantly simplifies the architecture of such a system while making it scalable and flexible for further extensions.


Fig. 20 Environment System Architecture and Modules

• Aggregation database—this provides fast and robust DBMS functionality for an entry-level aggregation of data, which is then filtered, enriched, restructured and stored in big data facilities. Aggregation facilities enable analytical platforms to store, extract and manipulate large amounts of data. The storage capabilities of the aggregation element not only allow replay of historical data for modeling purposes, but also enable other, more sophisticated tasks related to the functioning of the platform, including model risk analysis, evaluation of the performance of models and many more.
• Client SDK—this is a complete set of APIs (Application Programming Interfaces) that enable development, implementation and testing of new analytical models with use of the developer’s favorite IDE (Integrated Development Environment). The SDK allows connection from the IDE to the server side of the platform to provide all the functionalities the user may need to develop and execute models.
• Shared memory—this provides a buffer-type functionality that speeds up the delivery of temporal/historical data to models and the analytics-related elements of the platform (i.e., the statistical analysis library of methods) and, at the same time, reduces the memory usage requirement. The main idea is to have a central point in the memory (RAM) of the platform that manages and provides temporal/historical data from the current point of time up to a specified number of timestamps back in history. Since the memory is shared, no model has to keep and manage history by itself. Moreover, since the memory is kept in RAM rather than in files or the DBMS, access to it is instant and bounded only by the performance of the hardware and the platform on which the buffers work.
• Model templates—the platform supports two generic types of models: push and pull. The push type registers itself to listen to a specified set of data streams during initialization, and the execution of the model logic is triggered each time a new data feed arrives to the platform. This type is dedicated to very quick, low-latency, high-frequency models, and the speed is achieved at the cost of small shared memory buffers. The pull model template executes and requests data on its own, based on a schedule. Instead of using the memory buffers, it has a direct connection to the big data facilities and hence can request as much historical data as necessary, at the expense of speed.
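To illustrate the push/pull distinction, here is a minimal single-file Python sketch of our own; the class and method names are invented for illustration and do not reflect the actual SocialSTORM SDK, whose models are Java-coded:

class Bus:
    # Minimal stand-in for the platform's internal message bus.
    def __init__(self):
        self.handlers = {}
    def subscribe(self, stream, handler):
        self.handlers.setdefault(stream, []).append(handler)
    def publish(self, stream, message):
        for handler in self.handlers.get(stream, []):
            handler(message)

class PushModel:
    # Registers for streams during initialization; its logic fires on every
    # arriving message (low latency, small shared memory buffers).
    def __init__(self, bus, streams):
        for stream in streams:
            bus.subscribe(stream, self.on_data)
    def on_data(self, message):
        print("push model saw:", message)

class PullModel:
    # Executes on its own schedule and queries stored history directly,
    # trading speed for unlimited access to historical data.
    def __init__(self, query_history, interval_seconds=60):
        self.query_history = query_history
        self.interval = interval_seconds
    def run_once(self):
        history = self.query_history()
        print("pull model fetched", len(history), "records")

bus = Bus()
PushModel(bus, streams=["twitter"])
bus.publish("twitter", "example tweet")          # triggers the push model
PullModel(lambda: ["msg1", "msg2"]).run_once()   # one scheduled pull, simplified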

10 Conclusions

As discussed, the easy availability of APIs provided by Twitter, Facebook and News services has led to an ‘explosion’ of data services and software tools for scraping and sentiment analysis, and social media analytics platforms. This paper surveys some of the social media software tools, and for completeness introduces social media scraping, data cleaning and sentiment analysis.

Perhaps the biggest concern is that companies are increasingly restricting access to their data to monetize their content. It is important that researchers have access to computational environments and especially ‘big’ social media data for experimentation. Otherwise, computational social science could become the exclusive domain of major companies, government agencies and a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Arguably what is required are public-domain computational environments and data facilities for quantitative social science, which can be accessed by researchers via a cloud-based facility.

Acknowledgments The authors would like to acknowledge Michal Galas who led the design and implementation of the UCL SocialSTORM platform with the assistance of Ilya Zheludev, Kacper Chwialkowski and Dan Brown. Dr. Christian Hesse of Deutsche Bank is also acknowledged for collaboration on News Analytics.

Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

References

Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. J Comput Sci 2(3):1–8
Botan I et al (2010) SECRET: a model for analysis of the execution semantics of stream processing systems. Proc VLDB Endow 3(1–2):232–243
Chandramouli B et al (2010) Data stream management systems for computational finance. IEEE Comput 43(12):45–52
Chandrasekar C, Kowsalya N (2011) Implementation of MapReduce algorithm and Nutch distributed file system in Nutch. Int J Comput Appl 1:6–11
Cioffi-Revilla C (2010) Computational social science. Wiley Interdiscip Rev Comput Stat 2(3):259–271
Galas M, Brown D, Treleaven P (2012) A computational social science environment for financial/economic experiments. In: Proceedings of the Computational Social Science Society of the Americas, vol 1, pp 1–13
Hebrail G (2008) Data stream management and mining. In: Fogelman-Soulié F, Perrotta D, Piskorski J, Steinberger R (eds) Mining massive data sets for security. IOS Press, pp 89–102
Hirudkar AM, Sherekar SS (2013) Comparative analysis of data mining tools and techniques for evaluating performance of database system. Int J Comput Sci Appl 6(2):232–237
Kaplan AM (2012) If you love something, let it go mobile: mobile marketing and mobile social media 4x4. Bus Horiz 55(2):129–139
Kaplan AM, Haenlein M (2010) Users of the world, unite! The challenges and opportunities of social media. Bus Horiz 53(1):59–68
Karabulut Y (2013) Can Facebook predict stock market activity? SSRN eLibrary, pp 1–58. http://ssrn.com/abstract=2017099 or http://dx.doi.org/10.2139/ssrn.2017099. Accessed 2 Feb 2014
Khan A, Baharudin B, Lee LH, Khan K (2010) A review of machine learning algorithms for text-documents classification. J Adv Inf Technol 1(1):4–20
Kobayashi M, Takeda K (2000) Information retrieval on the web. ACM Comput Surv 32(2):144–173
Lazer D et al (2009) Computational social science. Science 323:721–723
Lerman K, Gilder A, Dredze M, Pereira F (2008) Reading the markets: forecasting public opinion of political candidates by news analysis. In: Proceedings of the 22nd international conference on computational linguistics 1:473–480
MapReduce (2011) What is MapReduce? http://www.mapreduce.org/what-is-mapreduce.php. Accessed 31 Jan 2014
Mejova Y (2009) Sentiment analysis: an overview, pp 1–34. http://www.academia.edu/291678/Sentiment_Analysis_An_Overview. Accessed 4 Nov 2013
Murphy KP (2006) Naive Bayes classifiers. University of British Columbia, pp 1–8. http://www.ic.unicamp.br/~rocha/teaching/2011s1/mc906/aulas/naivebayes.pdf
Murphy KP (2012) Machine learning: a probabilistic perspective. Chapter 1: Introduction. MIT Press, pp 1–26
Narang RK (2009) Inside the black box. Wiley, Hoboken, New Jersey
Nuti G, Mirghaemi M, Treleaven P, Yingsaeree C (2011) Algorithmic trading. IEEE Comput 44(11):61–69
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
Salathé M et al (2012) Digital epidemiology. PLoS Comput Biol 8(7):1–5
SAS Institute Inc (2013) SAS sentiment analysis factsheet. http://www.sas.com/resources/factsheet/sas-sentiment-analysis-factsheet.pdf. Accessed 6 Dec 2013
Teufl P, Payer U, Lackner G (2010) From NLP (natural language processing) to MLP (machine language processing). In: Kotenko I, Skormin V (eds) Computer network security. Springer, Berlin Heidelberg, pp 256–269
Thomson Reuters (2010) Thomson Reuters news analytics. http://thomsonreuters.com/products/financial-risk/01_255/News_Analytics_-_Product_Brochure-_Oct_2010_1_.pdf. Accessed 1 Oct 2013
Thomson Reuters (2012a) Thomson Reuters machine readable news. http://thomsonreuters.com/products/financial-risk/01_255/TR_MRN_Overview_10Jan2012.pdf. Accessed 5 Dec 2013
Thomson Reuters (2012b) Thomson Reuters MarketPsych Indices. http://thomsonreuters.com/products/financial-risk/01_255/TRMI_flyer_2012.pdf. Accessed 7 Dec 2013
Thomson Reuters (2012c) Thomson Reuters news analytics for internet news and social media. http://thomsonreuters.com/business-unit/financial/eurozone/112408/news_analytics_and_social_media. Accessed 7 Dec 2013
Thomson Reuters (2013) Machine readable news. http://thomsonreuters.com/machine-readable-news/?subsector=thomson-reuters-elektron. Accessed 18 Dec 2013
Turney PD (2002) Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 417–424
Vaswani V (2011) Hook into Wikipedia information using PHP and the MediaWiki API. http://www.ibm.com/developerworks/web/library/x-phpwikipedia/index.html. Accessed 21 Dec 2012
Westerski A (2008) Sentiment analysis: introduction and the state of the art overview. Universidad Politecnica de Madrid, Spain, pp 1–9. http://www.adamwesterski.com/wpcontent/files/docsCursos/sentimentA_doc_TLAW.pdf. Accessed 14 Aug 2013
Wikimedia Foundation (2014) Wikipedia: Database download. http://en.wikipedia.org/wiki/Wikipedia:Database_download. Accessed 18 Apr 2014
Wolfram SMA (2010) Modelling the stock market using Twitter. MSc thesis, School of Informatics, University of Edinburgh, pp 1–74. http://homepages.inf.ed.ac.uk/miles/msc-projects/wolfram.pdf. Accessed 23 Jul 2013
Yessenov K, Misailovic S (2009) Sentiment analysis of movie review comments, pp 1–17. http://people.csail.mit.edu/kuat/courses/6.863/report.pdf. Accessed 16 Aug 2013
