DELIVERABLE SUBMISSION SHEET

To: Susan Fraser (Project Officer)
EUROPEAN COMMISSION
Directorate-General Information Society and Media
EUFO 1165A
L-2920 Luxembourg

From:
Project acronym: PHEME
Project number: 611233
Project manager: Kalina Bontcheva
Project coordinator: The University of Sheffield (USFD)

The following deliverable:
Deliverable title: Market Watch – Final Version
Deliverable number: D9.5.2
Deliverable date: 30 June 2016
Partners responsible: ATOS Spain SA
Status: Public / Restricted / Confidential

is now complete.  It is available for your inspection.  Relevant descriptive documents are attached.

The deliverable is: a document / a Website (URL: ...... ) / software (...... ) / an event / other (...... )

Sent to Project Officer: [email protected]
Sent to functional mail box: [email protected]
On date: 30 June 2016


FP7-ICT Strategic Targeted Research Project PHEME (No. 611233)
Computing Veracity Across Media, Languages, and Social Networks

D9.5.2 Market Watch – Final Version
Tomás Pariente Lobo (ATOS)
Belén Gallego (ATOS)

Abstract
FP7-ICT Strategic Targeted Research Project PHEME (No. 611233)
Deliverable D9.5.2 (WP 9)

This document provides the final version of the Market Watch Analysis, which aims to detect possible business opportunities for the PHEME project results, explores the current targeted market, and identifies potential competitors as well as synergies that can be converted into real business cases. The present analysis takes as its starting point the Initial Market Watch deliverable D9.5.1, expanding substantially on the information presented there and outlining the methodological aspects of the information gathering. The deliverable identifies the potential target markets for PHEME by exploring the competition and existing projects and by looking at predictions from market analysts, as necessary steps towards defining the business plans for the project.

Keyword list: dissemination, exploitation, PHEME project results

Nature: Report
Dissemination: CO
Contractual date of delivery: 30/06/2016
Actual date of delivery: 30/06/2016
Reviewed by: Kalina Bontcheva, Geraldine Wong Sak Hoi
Web links:


CHANGES

Version | Date | Author | Changes
0.1 | 01.04.2016 | Tomás Pariente | ToC, responsibilities and initial content
0.2 | 31.05.2016 | Belén Gallego, Tomás Pariente | First draft
0.3 | 09.06.2016 | Tomás Pariente | Final draft for internal review
1.0 | 28.06.2016 | Tomás Pariente | Final version including updates after internal review


PHEME Consortium

This document is part of the PHEME research project (No. 611233), partially funded by the FP7-ICT Programme.

University of Sheffield
Department of Computer Science
Regent Court, 211 Portobello St.
Sheffield S1 4DP, UK
Tel: +44 114 222 1930, Fax: +44 114 222 1810
Contact person: Kalina Bontcheva
E-mail: [email protected]

Universitaet des Saarlandes
Language Technology Lab
Campus
D-66041 Saarbrücken, Germany
Contact person: Thierry Declerck
E-mail: [email protected]

MODUL University Vienna GMBH
Am Kahlenberg 1, 1190 Wien, Austria
Contact person: Arno Scharl
E-mail: [email protected]

Ontotext AD
Polygraphia Office Center fl. 4, 47A Tsarigradsko Shosse, Sofia 1504, Bulgaria
Contact person: Georgi Georgiev
E-mail: [email protected]

ATOS Spain SA
Calle de Albarracin 25, 28037 Madrid, Spain
Contact person: Tomás Pariente Lobo
E-mail: [email protected]

King’s College London
Strand, WC2R 2LS London, United Kingdom
Contact person: Robert Stewart
E-mail: [email protected]

iHub Ltd.
NGONG Road, Bishop Magua Building, 4th floor, 00200 Nairobi, Kenya
Contact person: Rob Baker
E-mail: [email protected]

SWI swissinfo.ch
Giacomettistrasse 3, 3000 Bern, Switzerland
Contact person: Peter Schibli
E-mail: [email protected]

The University of Warwick
Kirby Corner Road, University House
CV4 8UW Coventry, United Kingdom
Contact person: Rob Procter
E-mail: [email protected]


Executive Summary

This document provides the final version of the Market Watch Analysis, which aims to detect possible business opportunities for the PHEME project results, explores the current targeted market, and identifies potential competitors as well as synergies that can be converted into real business cases. The present analysis takes as its starting point the Initial Market Watch deliverable D9.5.1 (2015), expanding substantially on the information presented there and outlining the methodological aspects of the information gathering. The deliverable identifies the potential target markets for PHEME by exploring the competition and existing projects and by looking at predictions of market analysts, all necessary steps to define the business plans for the project.


Contents

PHEME Consortium
Executive Summary
Contents
1. Relevance to PHEME
  1.1 Purpose of this document
  1.2 Relevance to project objectives
  1.3 Relation to other workpackages
  1.4 Structure of the document
2. PHEME in context
  2.1 What is social media?
  2.2 Social media focus for PHEME
  2.3 How to measure veracity in social media?
  2.4 Focus on Twitter and specific feeds
  2.5 PHEME expected results
3. Market and business potential
  3.1 General overview of the PHEME market
  3.2 Potential PHEME usage scenarios
    3.2.1 Personalized Health
    3.2.2 Digital journalism
    3.2.3 Other potential users and scenarios
  3.3 Market size
4. Market watch
  4.1 Methodological approach to the market watch
  4.2 Research projects on veracity
  4.3 Existing tools and competitors
  4.4 Other resources
5. Conclusion
6. Bibliography and references


1. Relevance to PHEME

1.1 Purpose of this document

The purpose of this document is to provide the consortium partners with an overview of the market in relation to PHEME, and a review of existing tools and initiatives in the field of veracity in social networks. In order to do that, this document covers the following aspects:

• Recap of the main social media and veracity concepts related to PHEME.
• Summary of the main results expected from PHEME as a prerequisite for surveying the market.
• Identification of the market potential for PHEME results.
• Survey of similar tools and competitors.

1.2 Relevance to project objectives

Watching the development of market opportunities and competitors is of great importance for any kind of business, and it is a key aspect if an R&D project is to achieve any kind of success. The market is not static but highly dynamic: new ideas, tools, solutions and companies arise almost daily, especially around a hot topic such as veracity in social media. A project like PHEME must be aware of new tools and developments, both to assess its success and to keep its objectives and methods up to date. This deliverable therefore aims to help the rest of the project check the validity of its research objectives, as well as to assess potential market opportunities.

1.3 Relation to other workpackages

As stated above, following developments in the market, to drive the research in the right direction at the right time, and surveying similar initiatives, companies and tools, to assess the business potential of PHEME, are of paramount importance for the overall project. There is therefore a clear benefit for the rest of the project in following the results of this deliverable closely.

1.4 Structure of the document

The document is organised as follows:

• Section 1 gives a brief introduction, outlines the major purpose of the document and explains its relevance to PHEME.
• Section 2 provides general background on the main concepts related to social media and veracity in the scope of PHEME.
• Section 3 provides a preliminary assessment of the market potential of PHEME.
• Section 4 presents the methodology followed to scan the market and an overview of existing tools, companies and initiatives related to the PHEME objectives.
• Section 5 concludes with consolidated findings.


2. PHEME in context

2.1 What is social media?

As its name suggests, social media is a social instrument of communication, based on web applications and technologies, which allows individuals to create, share, and/or exchange information in various formats (text, video, etc.). The objective of these internet applications is not only to inform, as traditional media does, but also to have users act as content providers, so that information flows in two directions. Applications such as Facebook or Twitter allow users to exchange photos, videos and text, features that are as useful for personal and entertainment purposes as for business. In the latter case we find, for example, marketers who search, track, and analyse conversations on the web about their brand or topics of interest, or businesses that discover that the best way to conduct business online is to speak directly to their customers.

2.2 Social media focus for PHEME

The PHEME project aims to build new methods that will semi-automatically verify online rumours as they spread across media, languages, and social networks. The term pheme was coined to describe internet memes that are enhanced with truthfulness information.

Social media poses three major computational challenges, dubbed by Gartner the 3Vs of big data: volume, velocity, and variety. PHEME is focusing its research on a fourth crucial, but largely unstudied, big data challenge: veracity.

The research being carried out in PHEME is focused on providing a set of methods and tools to discover, acquire, reason with, and visualise veracity intelligence from social networks in multiple languages. Discovering and assessing the veracity of rumours is a challenging objective that requires an interdisciplinary team in order to advance the state of the art. Therefore, tracking new ventures and developments outside the project boundaries is not only recommended but necessary.

2.3 How to measure veracity in social media?

The original idea in PHEME is to classify online rumours into four types: speculation, controversy, misinformation (something untrue is spread unwittingly) and disinformation (false information spread with malicious intent). This original idea has shifted during the project life-span, as some types of rumours are difficult for both people and machines to tell apart.

To do this, PHEME needs to acquire data from social networks in real time. The system also automatically categorizes sources to assess their authority (news outlets, individual journalists, experts, potential eyewitnesses, etc.). In some domains (e.g. the healthcare use case), it also looks at historical resources and background, to help spot where Twitter accounts or Reddit conversations help users understand past events or trends. PHEME searches for sources that corroborate or deny the information and plots how the conversations evolve, using all of this information to assess whether a rumour is likely to be true or false. The results are displayed to users in a set of visual dashboards (the Journalist and PHEME dashboards), enabling them to see easily whether a rumour is taking hold and then dig deeper into the data.
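To make the classification task described above concrete, the following sketch trains a toy four-way rumour-type classifier. It is a minimal illustration under the assumption that a small labelled training set is available; the example texts, labels and model choice are invented for this sketch and do not reflect PHEME's actual pipeline.

```python
# Minimal sketch: four-way rumour-type classification (illustrative only,
# not PHEME's actual models). Assumes a small labelled training set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples: the text of a rumourous post and its type.
train_texts = [
    "Could the new vaccine be linked to these symptoms?",
    "Experts clash over whether the outbreak figures are inflated",
    "Heard the airport is closed, passing it on just in case",
    "They are hiding the real casualty numbers, spread this",
]
train_labels = ["speculation", "controversy", "misinformation", "disinformation"]

# Word n-gram TF-IDF features feeding a simple linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Classify a new, unseen post.
print(model.predict(["Is it true that the station has been evacuated?"]))
```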


2.4 Focus on Twitter and specific feeds

The focus of cross-media research in PHEME is on combining authoritative sources (e.g. newspaper articles, transcriptions of news podcasts, scientific papers); user-generated content (e.g. forums, blogs); and social networks (e.g. Twitter, Reddit). In the case of social networks, PHEME explores several sources, Twitter being the most prominent, analysing not just the shared content but also the graph structure (who is connected to whom), the user profiles, and the information exchange networks (e.g. who commented on which post, which user name is mentioned by which users).
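The graph structures mentioned above lend themselves to standard network analysis. Below is a minimal sketch that builds a user-mention network with the networkx library from made-up tweet records; the record fields ("user", "mentions") are assumptions for illustration, not the PHEME data schema.

```python
# Minimal sketch: building a user-mention network from tweet records.
# Field names ("user", "mentions") are illustrative, not the PHEME schema.
import networkx as nx

tweets = [
    {"user": "alice", "mentions": ["bob", "carol"]},
    {"user": "bob",   "mentions": ["carol"]},
    {"user": "dave",  "mentions": ["alice"]},
]

G = nx.DiGraph()
for t in tweets:
    for target in t["mentions"]:
        # Accumulate a weight so repeated mentions strengthen the edge.
        if G.has_edge(t["user"], target):
            G[t["user"]][target]["weight"] += 1
        else:
            G.add_edge(t["user"], target, weight=1)

# Simple centrality gives a first cut at potentially influential accounts.
print(sorted(nx.in_degree_centrality(G).items(), key=lambda kv: -kv[1]))
```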

New authoritative content and online media are acquired during the project from a variety of sources, including Twitter, political and medical blogs, newspapers, and SWI's own published news and podcasts. The consortium already holds a significant volume of historical data, including authoritative content (e.g. 230,000 news texts, podcasts, and forums published in 9 languages), and it is gathering real-time data for the two use cases. In the healthcare use case an important resource is Reddit, as that network hosts several medical forums whose historical data can be accessed via monthly dumps.
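As a concrete illustration of working with such monthly dumps, the sketch below filters a dump file for comments from medical forums. It assumes the common distribution format of the time (a bz2-compressed file with one JSON object per line); the file name and subreddit list are invented for the example.

```python
# Minimal sketch: extracting comments from medical subreddits out of a
# Reddit monthly dump. Assumes a bz2-compressed, one-JSON-object-per-line
# file (a common dump format); the file name and subreddits are examples.
import bz2
import json

MEDICAL_SUBREDDITS = {"AskDocs", "medicine", "HealthAnxiety"}  # illustrative

def iter_medical_comments(dump_path):
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            comment = json.loads(line)
            if comment.get("subreddit") in MEDICAL_SUBREDDITS:
                yield comment

for c in iter_medical_comments("RC_2016-05.bz2"):  # hypothetical file name
    print(c["subreddit"], c["body"][:80])
```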

2.5 PHEME expected results

PHEME will deliver several results for veracity and rumour detection by covering aspects such as: i) acquisition of data from social networks, ii) automatic classification and NLP algorithms, iii) a rumour annotation process, iv) semantic enrichment of the data, v) two visualization dashboards, and finally vi) several datasets providing labelled and human annotated data.

PHEME deliverable D9.3 (2015) provides a first detailed breakdown of the different exploitable results expected from PHEME. A summary of the results can be seen below:

• PHEME veracity framework
• Datasets:
  o Social media datasets for patient care
  o Social media datasets for digital journalism
  o Annotated veracity corpus for patient care
  o Annotated veracity corpus for digital journalism
• Methods:
  o Multilingual methods for spatio-temporal grounding and user and content geolocation
  o Methods for event detection in social media streams
  o Methods for longitudinal modelling of users, trust, and authority
  o Methods for cross-media linking
  o Methods for detection of mis- and disinformation
  o Controversy detection tools
  o Algorithms for implicit information diffusion networks
• Storage:
  o The large-scale semantic storage tools
  o The large-scale content storage tools
• Tools:
  o The Capture Data Collection Tools for social networks and RSS
  o The PHEME visual analytics dashboard
  o Open source visualization widgets
  o Adaptations to the Ushahidi platform to form the Journalist Dashboard
• Use Cases:
  o Use case prototype in patient care
  o Adaptation of the digital journalism dashboard to the swissinfo.ch infrastructure

This list is being continuously updated by consortium members and the final version will be reported in an upcoming D9.4 deliverable at the end of the project.


3. Market and business potential

3.1 General overview of the PHEME market

Social media technologies emerged some years ago (e.g. Twitter in 2006), but they only gained significant popularity a few years later, around 2009 and 2010, when they earned respect among consumers, brands and institutions. Experts and non-experts have come to the consensus that social media cannot be ignored when considering matters of sentiment or communication with the wider public. Businesses have surrendered to the power of social media and accepted that social media marketing must be part of their marketing strategies.

Social Media has a series of characteristics that differentiate it from traditional media, such as:

• User content: Known as Web 2.0, the current web allows users to participate actively, instead of just accessing web pages to get information. Users can now give their opinion, discuss, or simply share information; in other words, create content.
• Interaction: Anyone can post anything, potentially initiating a conversation in which anyone can take part, as much or as little as they want to.
• Contact: In a personal context, it offers the possibility to keep in touch with old friends and to make new friends based on similar interests and opinions. From a business point of view, this can be seen as a way to contact new clients or keep current clients informed.
• Communication ways: Social media has changed the way we communicate with each other. Writing a letter or even picking up the phone are becoming more and more uncommon; instead, people use text messaging, email or Twitter to interact.
• Sharing facilities: Information, whether true or false, can be shared quickly and easily, and can reach many more people at a time.
• Addictiveness: Statistical research has revealed that more than 95% of Facebook users log into their account every day; for Twitter the figure is 60% and for LinkedIn, 30%.

Social media is one of the ways people connect to one another through computation. Mobile devices, social networks, email, texting and micro-blogging are some of the many ways in which people engage in computer-mediated collective action. As people link, like, follow, friend, reply, retweet, comment, tag, rate, review, edit, update, and text one another, they form collections of connections. These collections contain network structures that can be extracted, analysed and visualized, as can the information generated from these social network connections; doing so can yield insights that are increasingly valued. In recent years, at both national and international levels, businesses and other public and private organizations have shown a greater sensitivity to the opinions of users and society in general on social networks. People talk about the news of the day, celebrities, companies, technology, entertainment, and more. Opinions about companies, brands, products, and competitors, or what citizens think of their politicians, are only a few examples of this social trend.

Next we look at what business analysts say about the technologies related to Social Media and big data.


Figure 1 – Gartner Hype-cycle for emergent technologies 2014

Figure 1 above shows the landscape of emerging technologies in 2014 according to Gartner1. Technologies related to big data, content analytics and data science show an appreciable degree of maturity, but Gartner estimates 5 to 10 years for big data, and 2 to 5 years for the other two, to reach full maturity. According to Gartner, in its "Digital Marketing" section, the opinions people express on social networks are expected to have a great impact on business, especially for brands, products and services. Gartner also observes a new trend in which businesses try more sophisticated ways to reach consumers, who are willing to participate in connections, using social media marketing instead of traditional channels.

Figure 2 shows the evolution of Gartner's predictions for 2015. The main difference regarding technologies used in PHEME is that big data is no longer on the hype cycle; in the previous year it was shown entering the trough of disillusionment. Machine learning appears to be taking its place, entering just past the peak of inflated expectations. This may mean that big-data-related technologies are now in practical use and no longer hype. Gartner's focus this year is on the quest towards the autonomous enterprise, a journey in three phases: Digital Marketing, Digital Business and Autonomous. The digital marketing stage sees the emergence of technologies related to PHEME (social, cloud, data and mobile). This is important, as companies in this phase will focus on new and more sophisticated ways to reach consumers, with greater social connection. Assessing veracity might be key in some of these cases.

1 http://www.gartner.com/newsroom/id/2819918


Figure 2 – Gartner Hype-cycle for emergent technologies 20152

Figure 3 shows the Gartner3 hype cycle for media and entertainment. Social media analytics is now at the "peak of inflated expectations", justified by its broad adoption among those involved with social media. Products for social media monitoring are improving, shifting from simply finding comments to better linguistic analysis of what is being communicated.

It will take between 2 and 5 years for these services to be clearly adopted. Using social media analytics will improve the yield of social media initiatives by indicating what is working and what is not. Meanwhile, the social analytics space is marked by considerable interest from end users, and as such, the adoption of social media analytics in marketing rose significantly in 2015. As we can see above, this increase has driven social analytics further past the "peak of inflated expectations", though penetration rates indicate that there is still room for the market to grow.

Gartner4 also stated in 2014 that the content analysis market is in its "teen" phase, with market penetration still below 20%. Market analysts are aware of this social trend, and some of the most renowned analysts working on content analysis, social networks and technological trends are studying and making predictions about the future. However, this hype cycle has no continuation in 2015 or 2016, although Gartner continues to pay attention to the topic. This might indicate that these technologies are no longer hype but are reaching market maturity sooner than expected.

2 Source: Gartner (August 2015) http://www.gartner.com/newsroom/id/3114217 3 https://www.gartner.com/doc/2797321/hype-cycle-media-entertainment- 4 http://www.gartner.com/newsroom/id/2819918


Figure 3 – Gartner Hype-cycle for media and entertainment 2014

It is interesting that Gartner has started to publish a hype cycle for Digital Marketing, as seen in Figure 4. Here is Gartner's definition: "A digital marketing hub provides marketers and applications with standardized access to audience profile data, content, workflow elements, messaging and common analytic functions for orchestrating and optimizing multichannel campaigns, conversations, experiences, and data collection across online and offline channels, both manually and programmatically. It typically includes a bundle of native marketing applications and capabilities, but it is extensible through published services with which certified partners can integrate."5 We strongly believe that in this particular market, veracity and fact-checking techniques will be needed to generate the trust and automation that Gartner is hinting at.

5 http://digitaltechdiary.com/gartners-2015-hype-cycle-for-digital-marketing/2241/


Figure 4 – Gartner Hype-cycle for Digital Marketing 2015

Figure 5 shows that the ability to perform content and social analysis is becoming a more important dimension of a high-impact and well-rounded business intelligence and analytics program. Content and social analytics are also foundations for other programs, such as user experience, collaboration and compliance. With the desire to have actionable insights from a plethora of structured and unstructured data sources, many different buying centres are investing in content and social analytics tools.

Figure 5 – Gartner Hype-cycle for Content and Social Analytics 2014


Regarding the pure social media landscape relevant to PHEME, the past years have witnessed fast growth of social media offerings on the Internet. Three social networks come out on top and remain the most influential on the social media landscape now, and probably in the near future: Facebook, Twitter and Google+. But recently they have begun to share their importance with a number of mobile applications, such as WhatsApp or WeChat. According to Cavazza, writing about 20166: "If originally, social media were designed for conversation and sharing purpose, they evolved into mainstream information / communication / engagement channels. After years of adding new functionalities and buyout, major social platforms like Facebook or Twitter became 21's century dominant media."

Figure 6 shows the Cavazza social media landscapes for the period 2014-16. Comparing the Cavazza 20147, 20158, and 2016 diagrams, we can see why he makes this statement: in diagrams from earlier years, Cavazza included only the three main social platforms at the centre. However, starting in 2014, he includes the six most popular mobile apps, to show that their usage is increasing very fast; in just a couple of years they have won hundreds of millions of users. This also highlights their importance, as a series of very significant investments are taking place around them (WhatsApp / Facebook, Tango / Alibaba, Viber / Rakuten, etc.).

Figure 6 – Evolution of Social Media Landscape 2014/15/16 (from Cavazza)

Cavazza switched from the circular diagram of previous years to a hexagon-shaped diagram, in order to show the diversity of social media. Figure 7 shows the social media landscape 2016 in more detail. Cavazza believes that Facebook, Twitter and Google will continue to be the dominant players, with multiple services around them, in the coming years. More interestingly, Reddit is listed by Cavazza as one of the main conversation platforms, which accords with PHEME's decision to include this particular platform for our conversation-based algorithms in the scope of the medical use case.

6 http://www.fredcavazza.net/2016/04/23/social-media-landscape-2016/ 7 http://www.fredcavazza.net/2014/05/22/social-media-landscape-2014/ 8 http://www.fredcavazza.net/2015/06/03/social-media-landscape-2015/


Figure 7 - Social Media Landscape 2016 (from Cavazza)

In addition to social networks and the services around them gaining in importance, there is another aspect of this ecosystem: the development of applications that benefit from the huge amount of information generated by these networks, forming relationships with social networks that might be described as symbiotic. As an example, Twitter makes individual tweets and associated meta-information available via a specialized public API, which is precisely what we use in PHEME. In 2011, according to Twitter's Director of Platform, Ryan Sarver, there were about 750,000 registered applications accessing the data streams provided by Twitter. There are no current figures, but Twitter keeps saying that it will continue supporting developers. Nevertheless, Twitter is now undergoing a major shift towards profitability, which may hamper this app ecosystem if the company decides to change its priorities. For the near future, however, this does not seem to be a risk.
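As a simple illustration of this kind of API access, the sketch below runs a keyword search against Twitter's public search API using the tweepy library (as it existed around the time of writing, tweepy 3.x). The credentials are placeholders and the query is only an example; PHEME's actual data collection components are separate tools and are not shown here.

```python
# Minimal sketch: keyword search against Twitter's public API via tweepy 3.x.
# Credentials are placeholders; the query is only an example, not part of
# PHEME's actual data collection components.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Pull a small sample of recent English-language tweets on a rumour topic.
for tweet in tweepy.Cursor(api.search, q="vaccine rumour", lang="en").items(20):
    print(tweet.created_at, tweet.user.screen_name, tweet.text)
```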

The list below shows examples of companies well-positioned in different market niches related to social media:

• Publisher tools: Companies such as SocialFlow help publishers to optimize how they use Twitter, leading to increased user engagement and the production of the right tweet at the right time.


• Curation: Attensity, Mass Relevance or Sulia, or, in Spain, Websays or Sentisis, provide services for large media brands to select, display, and stream the most interesting and relevant tweets for a breaking news story, topic or event.
• Real-time data signals: Hundreds of companies use real-time Twitter data as an input into ranking, ad targeting, or other aspects of enhancing their own core products. Klout is an example of a company which has taken this to the next level by using Twitter data to generate reputation scores for individuals. Similarly, Gnip syndicates Twitter data for licensing by third parties who want to use the real-time corpus for numerous applications (everything from hedge funds to ranking scores). Gnip was acquired in 2014 by Twitter, leaving only a couple of independent Twitter firehose providers, one in Japan and another in Europe (DataSift). In a recent development, Twitter is cooperating closely with Thomson Reuters on sentiment analysis in financial markets.
• Value-added content and vertical experiences: Emerging services such as Formspring, Foursquare, Instagram and Quora have built on top of Twitter, allowing users to share unique and valuable content with their followers, while in exchange the services get broader reach, user acquisition, and traffic. In this regard, Twitter conducted, but ultimately shut down, its own experiment with #Music, a platform to recommend and discover music through other Twitter users.
• Social CRM, enterprise clients, and brand insights: Companies such as CoTweet, Radian6 (now part of the Salesforce ExactTarget marketing cloud), and Crimson Hexagon help brands, enterprises, and media companies tap into the zeitgeist about their brands on Twitter, and manage relationships with their consumers using Twitter as a medium for interaction.

Seth Grimes, in his 2014 Breakthrough Analysis of text analytics9, stated the following: "I reported positive technology and market outlooks in each of the last few years, in 2013 and in 2012. This year is a bit different. While technology development is strong, driven by the continued explosion in online and social text volumes, I feel that the advance of text analytics as a market category has stalled. The question is not business value. The question is data focus and analysis scope. Blame big data." In this report, Grimes goes through a set of interviews with several vendors and start-ups in the text analytics field and focuses on technical aspects such as advances in deep learning techniques, scaling out via parallelized or distributed technologies, usage of data as a service and cloud, in-memory and stream processing, etc. Grimes also comments: "Still, whether within or across market-category boundaries, there are significant text-analytics technology, market, and community developments to report."

More recently, in 2016, Seth Grimes has continued to interview vendors in the field10. NLP techniques appear to be being streamlined by companies so that customers can focus on extracting business insights rather than spending time on purely technical matters.

In this scenario, one thing that must be taken into account is that misinformation spreads on social media in exactly the same way as information, and it is on this aspect that PHEME puts its focus. Social media data is inherently uncertain, and this must be taken into account in order to make the best use of the information extracted from it. PHEME and other projects and initiatives are providing content analytics tools that will help to deliver insights based on Twitter (or other social network) data. This can help to detect what is true and what is not in all the information obtained; once misinformation is separated out, the resulting information has real value. PHEME's most relevant benefit is the ability to monitor social media and respond to circulating misinformation and rumours. This can give a unique competitive advantage, as PHEME will be able to check and show the degree of truthfulness of the information analysed and visualized. PHEME focuses on veracity, called the fourth V of big data:

9 http://breakthroughanalysis.com/2014/04/11/text-analytics-2014/ 10 https://breakthroughanalysis.com/2016/

Figure 8 - 4 V's of Big Data: Volume, Velocity, Variety and Veracity11

Considering that the market has not yet been fully exploited in relation to veracity and fact-checking, PHEME, as a promising content analytics tool that uses data from one of the top social networks (Twitter) together with technologies related to big data, content analytics and data science to identify rumours in near real time, has great potential for success.

3.2 Potential PHEME usage scenarios

PHEME is a research project, but since its very inception there has been a lot of buzz in the media about its expected results. PHEME is therefore trying to take advantage of these expectations to explore potential exploitation avenues in different domains. PHEME's initial target market covers areas such as digital journalism, personalized health, marketing, brand and reputation management, search and knowledge management, smart cities, emergencies, agriculture and food industries, and society and citizens. PHEME could also potentially be used by existing tools and applications related to social networks. Many of the algorithms and methods in PHEME will be delivered as open source, making it possible for other researchers or commercial organizations to use them in multiple environments.

3.2.1 Personalized Health

The main objective of the healthcare use case is to show how intelligence from social networks can potentially be integrated into public health monitoring and pharmacovigilance. The idea is to monitor social networks to help healthcare professionals identify hot topics, so that they can react and possibly prevent adverse situations involving patient care.

11 http://www.datasciencecentral.com/profiles/blogs/data-veracity


In healthcare, we can identify two use cases. In the first, PHEME will provide rumour intelligence for direct use by clinical and public health practitioners. The ability to spot rumours as they appear (e.g. monitored at a national level for different areas of medical care) could lead to daily alerts of problematic cases that are likely to be raised by patients. Timely national media interventions will also be facilitated. For example, cases such as the controversy surrounding the MMR vaccine in the UK, or speculation around stigma, can quickly appear in medical consultation rooms. The earlier that staff can revise clinical advice and practice, the more effective patient-doctor interactions will be. In the second use case, social media analysis will be combined with analysis of the structured data and free text of the electronic patient record (EPR), thus linking social media to, and correlating it with, aggregated patient records. This will enable health care practitioners to (i) examine the veracity of social media health topics in light of clinician-recorded patient encounters and (ii) access information along a social dimension, alongside the usual clinical dimension.

South London and Maudsley NHS Foundation Trust (SLAM) is the largest mental health care provider in Europe, serving a population of 1.1 million. SLAM's Case Register Information System (CRIS) allows searches to be made of the 11 million textual patient notes and letters to physicians, and extracts anonymised data for secondary analysis. In 2011 SLAM assessed that 80% of the data it requires for research and statistical purposes is hidden in these textual records. KCL and USFD have an ongoing collaboration with SLAM around text mining and mental health research. The technology developed in PHEME will enable them to correlate the issues patients discuss with their physicians (as recorded in CRIS) against medical rumours and information broadcast in the press around the same time. The PHEME methods for contradiction detection will also enable automatic detection of patient misconceptions. PHEME is also supported by the UK Health Protection Agency (HPA) and will be of relevance to Public Health England (which absorbed the HPA in April 2013). They have shown interest in PHEME's rumour intelligence technology, to enable them to monitor patient forums and social networks and identify new rumours and scares, so appropriate counter-balancing actions and policies can be undertaken.

3.2.2 Digital journalism

The digital journalism use case seeks to discover rumours and assess their veracity to help journalists report more quickly and accurately. The impact of PHEME in this use case will be the deployment and exploitation of a PHEME journalism dashboard in swissinfo.ch's multilingual newsrooms. The goal is for PHEME to become the first resource used in the newsroom to identify newsworthy rumours, providing primary (automatic) veracity estimation and helping journalists understand what is happening with specific rumours, using metadata and visualizations. It will also support further (manual) verification steps for content and sources along the journalism workflow.

It is worth noting that other news organizations, such as The Guardian or the BBC, have shown interest in PHEME results. The digital journalism use case is therefore of particular interest for PHEME from a future business perspective.

3.2.3 Other potential users and scenarios

Besides healthcare and digital journalism, there are plenty of potential domains where PHEME technology can be applied. These may include business intelligence, market research, campaign and brand reputation management, customer relationship management, knowledge management, and semantic search, among others. Besides the complete PHEME veracity framework, many of the algorithms and methods researched in PHEME can potentially be of use to other organizations and researchers. Therefore the delivery of most of these methods and tools as open source falls within the scope of PHEME.

3.3 Market size

As stated before, earlier predictions from Gartner analysts Rozwell and Sallam were that more than 30% of analytics projects would deliver insights based on structured and unstructured data by 2015, and that 85% of Fortune 500 companies would require support with big data analysis. More recent figures, such as the IDC12 prediction, put worldwide business analytics software market growth at more than 50%, from nearly $122 billion in 2015 to more than $187 billion in 2019, around a 10% compound annual growth rate (CAGR).
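As a quick sanity check, the compound annual growth rate implied by those two figures over the four years from 2015 to 2019 is

\[
\mathrm{CAGR} = \left(\frac{V_{2019}}{V_{2015}}\right)^{1/4} - 1 = \left(\frac{187}{122}\right)^{1/4} - 1 \approx 11.3\%,
\]

slightly above the quoted figure; the exact value depends on which years are taken as the endpoints of the forecast period.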

According to Dan Vesset, group vice president of analytics and information management at IDC13: "Organizations able to take advantage of the new generation of business analytics solutions can leverage digital transformation to adapt to disruptive changes and to create competitive differentiation in their markets. These organizations don't just automate existing processes -- they treat data and information as they would any valued asset by using a focused approach to extracting and developing the value and utility of information."

12 http://www.idc.com/getdoc.jsp?containerId=prUS41306516 13 http://www.idc.com/getdoc.jsp?containerId=IDC_P33195


4. Market watch

In order to achieve a crucial objective of PHEME, namely to have the best possible impact in both the scientific and commercial communities, it is vital to examine the evolution of the market. This section is therefore devoted to analysing and exploring existing initiatives related to PHEME, exploring the competition and identifying potential commercial opportunities. Due to the research nature of the project, the survey targets not only businesses and solutions, but also existing projects with a similar focus to PHEME. This is a task that runs continuously throughout the project life-span.

Besides projects and initiatives specifically targeting veracity in social networks, it is worth closely following existing projects and initiatives related to language technologies (LT) in Europe. Following the results of FP7 Coordination and Support Actions such as MLi14, H2020 CSAs such as LT Observatory15 or CRACKER16, networks of excellence such as META-NET17, and the industry forum LT-Accelerate18 gives the opportunity to check the pulse of language technologies and of the strategic research and innovation agendas for the LT industry. PHEME is also exploring the possibility of using resources from existing repositories and initiatives such as META-SHARE19. The EC is also issuing a set of calls for tender for LT solutions as an enabler for the digital single market, in relation to the Connecting Europe Facility (CEF) programme20.

4.1 Methodological approach to the market watch

Following Gold and Baker (2012), we acknowledged the need for a methodological approach to gathering evidence for our market watch, as recommended by the independent reviewers of the project. The previous version of the market watch (PHEME D9.5.1, 2014) followed a more expert-based approach, complemented by searches on general search engines, in which the experts of the consortium pointed out interesting tools, projects and methods around the technologies and themes tackled in the project. While this approach showed good results, it was based on assumptions and the expertise of a limited team of experts, rather than on a more thorough methodology. Therefore, in order to improve the evidence-gathering approach, we adapted the cited work of Gold and Baker (2012), which focuses on patents, to our needs, applying some of the proposed methods along with a set of other techniques to enlarge the scope of the market watch. This was done in several iterations, so initial ideas were tested and adopted or discarded depending on the success of the tests.

Gold and Baker propose the following steps:

• Identify the key research objectives: State the research questions we want to answer.
• Build the dataset: Get the information needed. For that, there is a need to set the search approach (top down or bottom up), choose the search method(s) (keywords, algorithms, etc.), select the data sources (search engines, patent databases, literature, LinkedIn, etc.) and curate the data retrieved.
• Analyse the data: Get the answers to the research questions. This might involve subjective analysis by experts, or more or less automated assessment of derived metrics.

14 http://mli-project.eu/ 15 http://www.lt-observatory.eu/ 16 http://cracker-project.eu/ 17 http://www.meta-net.eu/ 18 http://www.lt-innovate.eu/ 19 http://www.meta-net.eu/meta-share 20 https://ec.europa.eu/digital-single-market/en/connecting-europe-facility

Following the Gold and Baker approach, Figure 9 shows an overview of the main steps followed to arrive at our current methodology for the market watch.

Figure 9 – Market Watch methodology life-cycle

The steps selected for PHEME, shown in Figure 9, are the following:

• Identify the key research objectives: We would like to be able to give feedback to the project partners about how the market is reacting to our research field, and how others are dealing with similar topics. Therefore, we need to search for and discover tools and methods for assessing veracity in social network data. We would also like to be able to position the results of the project on a potential exploitation path.
• Build the dataset:
  o Search approach: We follow a top-down approach.
  o Search method(s): We started by setting some keywords around veracity, translating them in order to broaden the search, and deciding the themes and materials we wanted to find (mainly patents, bibliography, web sites, tools, companies and related projects).
  o Data source selection: Once we had decided on the type of material, we selected places to search for it. Besides the valuable information provided by the project partners on the subject, the main initial sources were search engines (e.g. Google), patent databases (e.g. Google Patents, Patentscope), proceedings of conferences, social networks (mainly LinkedIn and Twitter), news sites (e.g. Google News), and reports from analysts (e.g. Gartner).


  o Curation of the data retrieved: Periodically we gathered the results from the search and started the curation process, as shown in the sketch after this list. To do so, we provided the consortium with a shared spreadsheet that was periodically updated. The results of previous searches were also revisited periodically to check their validity, as some of the tools or methods were discontinued or changed over time.
• Analyse the data: This step is done to understand whether the gathered data is a good basis for making decisions. The main analysis was done to produce the current document. Moreover, every time a promising new result emerged from the curation process, partners were informed, giving them feedback on their research.
• Evaluation of the method: We introduced a new step in the methodology to evaluate the method itself. As Figure 9 suggests, we followed an iterative approach, refining the method based on the lessons learned from the previous iteration, discarding some paths and reinforcing others.
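The curation step above lends itself to light automation. As a minimal sketch, assuming the shared spreadsheet is exported as a CSV keyed by URL, new search hits could be de-duplicated against it as follows; the file name and column layout are invented for the example.

```python
# Minimal sketch: merging newly found market-watch items into the curated
# list without duplicates. File and column names are illustrative only.
import csv
import os

CURATED = "market_watch.csv"  # hypothetical columns: url, name, category, status
FIELDS = ["url", "name", "category", "status"]

def load_known_urls(path):
    try:
        with open(path, newline="", encoding="utf-8") as f:
            return {row["url"] for row in csv.DictReader(f)}
    except FileNotFoundError:
        return set()

def append_new_items(path, items):
    known = load_known_urls(path)
    fresh = [it for it in items if it["url"] not in known]
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(fresh)
    return fresh

# Example: two hits from a search iteration, possibly already curated.
hits = [
    {"url": "http://twittertrails.com/", "name": "TwitterTrails",
     "category": "tool", "status": "active"},
    {"url": "http://www.emergent.info/", "name": "Emergent",
     "category": "tool", "status": "dormant"},
]
print(len(append_new_items(CURATED, hits)), "new items added")
```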

Figure 10 shows the main steps of the methodology explained above.

Figure 10 – Market Watch methodology steps

As a consequence of the feedback loop presented in Figure 9, we came up with a series of lessons learned:

• Some new results surfaced after applying the methodology, especially related to existing patents, emerging start-ups and news from social networks. However, the main source of information continued to be the experts in the consortium, as we have a good, heterogeneous team aware of the main research done in the field of veracity in social media. They are also active on social media around these topics, and so receive relevant information via their own networks and feeds, which complements the sources used centrally in the project.


• Some dead ends: some promising resources gave poor results. For instance, we retrieved very little from Google Alerts, some of the patent databases, and searches for recruitment criteria on LinkedIn. In this last case, the profiles of people working with the technologies used in the project matched those of data scientists and big data practitioners, which produced a lot of noise but no appropriate matches.
• Difficulty with translations: We tried to translate keywords from English into several languages. It was a good exercise for the languages of project partners (German, Spanish, French, Bulgarian), but unfortunately it did not produce good results. Most of the relevant publications turned out to be in English, or at least translated into English, so the effort of translating and assessing the results did not pay off. It was even worse for other languages we tried, such as Chinese, where we used Google Translate both to translate the keywords and to understand the results. In practice this proved rather difficult: the quality of the translation, although not too bad, made it hard to fully understand the relevance of the results. We therefore abandoned this approach after the first iteration.
• We wanted to avoid incurring costs (e.g. some patent databases or paid studies), so we used only free resources.
• We found some good resources. Among them are Google Patents and Patentscope for patents, LinkedIn for following related projects (e.g. REVEAL) and some LinkedIn groups, some interesting Twitter accounts and Twitter lists, and proceedings of conferences dealing with similar topics (e.g. the Social Media in the Newsroom workshop)21.

The results of the market watch will be explained in detail in the following subsections.

21 http://www.smnews.newslab.ie/accepted-papers/


4.2 Research projects on veracity

Veracity has recently received a lot of attention from the research community. Although these research projects can be seen both as complementary to PHEME and as competitors, for clarity they are treated separately from actual tools, initiatives and companies in this section.

A few examples of relevant research projects on veracity are:

Table 1: Relevant research projects on veracity Name Description Main Focus Main functionalities Timeline It relies on a mix of human and machine processes. Uses Twitter searches, RSS feeds, Emergent.info site is up and tips from users, Google Alerts and other means running with various rumours to identify unconfirmed reports early on in their collected and tracked. It briefly lifecycle. Once it has identified an unconfirmed partnered with Digg to offer more report, it looks for news articles about it by in-depth analysis of selected This research project draws on searching Google News. rumours (http://digg.com/2014/the- Emergent is a data-driven, real- qualitative and quantitative data to test The reporting of rumours and misinformation in truth-behind-the-kurdish-female- time, web-based rumour tracker. It and analyse strategies and best practices the press is identified, and logged in a database. soldier-fighting-isis). is part of a research project at the for debunking misinformation. It Along with the URL, the headline, byline, news 2016 update: the site is online and Tow Center for Digital Journalism provides actionable guidance for outlet, body text and publication date and time working but no news are being at Columbia University that newsrooms on how best to debunk are captured. The headline and body text of the added. The last news uploaded was focuses on how unverified misinformation, while also offering an article are classified. Once entered in the in November 2014, and the blog Emergent information and rumours are overview of relevant psychological database, they track two types of changes to the has been updated in 2015. reported in the media. It aims to factors such as the backfire effect, articles over time : Craig Silverman published a report develop best practices for motivated reasoning, the illusion of truth 1. Updates to the headline/body text: A system on findings from emergent.info in debunking misinformation. and the hostile media effect, among capturing changes automatically, and a human February 2015: others. This helps journalists better checking back at intervals. When updates are http://towcenter.org/wp- http://www.emergent.info/ understand how our brain processes (and made, it’s checked if a change in the truthiness content/uploads/2015/02/LiesDam often rejects) contradictory information. rating is required. 2. Social shares. The number nLies_Silverman_TowCenter.pdf of social shares for the URL on Twitter, Facebook and Google Plus is captured, divided It seems that the project is no hour by hour in order to see which articles longer running. about a rumour generate the most shares, and whether a subsequent debunking or confirmation of the rumour attracts a similar

25 D9.5.2 / Market Watch

amount of social engagement. Once a rumour has been definitively debunked or confirmed, the overall claim state of the rumour is changed to reflect that (from Unverified to either Confirmed True or Confirmed False). This provides a marker against which subsequent changes and shares can be compared. Visualizing rumours: Emergent also provides a visualization of how rumours and the articles about them evolve, the so-called life cycle of a rumour. Each rumour's page on the Emergent site shows sharing statistics, articles and truthiness ratings for that rumour, and clicking on a specific article headline leads to a page with a more detailed sharing breakdown, as well as a listing of the revisions to the article over time.

Name: Twitter Trails
Description: A project of the Social Informatics Lab at Wellesley. Twitter Trails is an interactive, web-based tool that allows users to investigate the origin and propagation characteristics of a rumour and its refutation, if any, on Twitter. Visualizations of burst activity, propagation timeline, retweet and co-retweeted networks help its users trace the spread of a story. It collects relevant tweets and automatically answers several important questions regarding a rumour: its originator, burst characteristics, propagators and main actors according to the audience. In addition, it computes and reports the rumour's level of visibility and, as an example of the power of crowdsourcing, the audience's scepticism towards it, which correlates with the rumour's credibility. http://twittertrails.com/
Main Focus: Twitter Trails is an investigative and exploratory tool to analyse the origin and spread of a story on Twitter, making it easier to investigate a suspicious story. While it does not directly answer the question of a story's validity, it provides information that a critically thinking person can use to examine how a Twitter audience reacts to the spreading of the story. It is a valuable tool for individual use, but especially for amateur and professional journalists investigating recent and breaking stories. Further, its expanding collection of investigated rumours can be used to answer questions regarding the amount and success of misinformation on Twitter.
Main functionality: By inputting a single tweet into the system and selecting keywords relevant to the story being investigated, the user makes the system gather a dataset of tweets through which the story origin can be traced. After retrieving the investigative tweet, Twitter Trails provides a Keyword Selection interface that allows the user to highlight words and phrases from the tweet as keywords, or enter them manually; the system helps the user select appropriate keywords in a variety of ways. Twitter Trails also provides an interface for modifying the inputs that determine the relevant tweets, as many times as the user likes, in order to select the best set of relevant tweets. The Propagation Graph is a novel visualization showing who broke the story on Twitter, highlighting influential and independent content creators; it gives a detailed look into a specific interval of time (when the story broke), while the Timeline visualization gives an overview of the whole story. The Retweet Network and the Co-Retweeted Network are two further visualizations that help answer questions about the main actors who were spreading information. The Tweeted Link Bibliography counts the most cited links, as well as how many users tweeted them, and provides an interface for exploring the tweets containing each link. The tool also shows the scepticism level, which may be important to know in relation to what the PHEME tool will do.
Timeline: The team has already produced a tool that has been evaluated by users online. Around 400 Twitter stories have been tracked and it is up to date; however, the last blog entry was published in November 2015. Some papers have been published too, such as http://cs.wellesley.edu/~pmetaxas/TwitterTrails-investigating-rumor-propagation.pdf

Name: Truthy (now OSoMe)
Description: Truthy, created by Indiana University, is based on the concept of memes that spread in the network. Such memes are detected and followed over time to capture their diffusion patterns. http://truthy.indiana.edu/
Main Focus: Truthy is a more general-purpose system which, despite its name, does not provide explicit assessment of the veracity of the tracked memes. One goal of the project is to study how social network structure, finite attention, popular sentiment, user influence and other factors affect the manner in which information is disseminated. A second goal is to better understand how social media can be abused, for example by malicious social bots, astroturf, orchestrated campaigns and online hoaxes. Truthy can enable a user to come to a certain conclusion on her own. The focus of this research project is understanding how information propagates through complex socio-technical information networks. Leveraging large-scale public data from online social networking platforms, the researchers are able to analyse and model the spread of information, from political discourse to market trends, from news to social movements, and from trending topics to scientific results, in unprecedented detail.
Main functionality: Truthy work to date includes a number of core research themes: 1. Study how individuals' limited attention span affects what information Truthy propagates and what social connections it makes, and how the structure of social networks can help predict which memes are likely to become viral. 2. Explore social science questions via social media data analytics. Examples of research to date include analysis of geographic and temporal patterns in movements such as Occupy Wall Street, societal unrest in Turkey, polarization and cross-ideological communication in online political discourse, partisan asymmetries in online political engagement, the use of social media data to predict election outcomes and forecast key market indicators, and the geographic diffusion of trending topics. 3. Produce images, videos and demos to demonstrate applications of the data mining research, from visualizing meme diffusion patterns to detecting social bots on Twitter.
Timeline: Expanding the platform to make the data derived from their analysis of meme diffusion and from their machine learning algorithms more easily accessible, and thus more useful to social scientists, reporters and the general public. In 2016 the researchers appear to be working more on other projects such as Hoaxy (see description below). OSoMe: The IUNI observatory on social media (May 3, 2016), https://peerj.com/preprints/2008/
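As a side note to the Twitter Trails entry above, the Retweet Network and Co-Retweeted Network it visualizes are straightforward to derive once the relevant tweets have been gathered. The following is a minimal sketch, not TwitterTrails' own code; the tweet dictionaries with "user" and "retweeted_user" fields are an assumed, simplified input format.

from collections import defaultdict
from itertools import combinations

import networkx as nx

# Toy stand-in for the dataset of relevant tweets gathered by keyword.
tweets = [
    {"user": "alice", "retweeted_user": "origin"},
    {"user": "bob", "retweeted_user": "origin"},
    {"user": "carol", "retweeted_user": "alice"},
    {"user": "carol", "retweeted_user": "origin"},
]

# Retweet network: a directed edge from each retweeter to the author
# they retweeted; authors with high in-degree likely broke the story.
retweet_net = nx.DiGraph()
for t in tweets:
    retweet_net.add_edge(t["user"], t["retweeted_user"])

# Co-retweeted network: two authors are linked when the same user
# retweeted both of them, surfacing the main co-occurring actors.
authors_by_retweeter = defaultdict(set)
for t in tweets:
    authors_by_retweeter[t["user"]].add(t["retweeted_user"])

co_retweeted = nx.Graph()
for authors in authors_by_retweeter.values():
    for a, b in combinations(sorted(authors), 2):
        weight = co_retweeted.get_edge_data(a, b, {"weight": 0})["weight"]
        co_retweeted.add_edge(a, b, weight=weight + 1)

print(max(dict(retweet_net.in_degree()).items(), key=lambda x: x[1]))
print(list(co_retweeted.edges(data=True)))

In the co-retweeted graph, heavily weighted edges identify pairs of accounts amplified by the same audience, which is what makes this view useful for spotting the main actors behind a story.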

Name: Reveal
Description: REVEAL is a 3-year EU-funded project that started in October 2013. It is concerned with verification of social media content, from a journalistic and enterprise perspective. http://revealproject.eu/
Main Focus: REVEAL aims to develop tools and services that aid in social media verification, looking at verification from a journalistic and enterprise perspective. In April 2016 the REVEAL project's 'trust model' for partially automating the process of filtering useful information on social media by using trusted sources was presented at the Third Workshop on Social News on the Web in Montreal. The trust model could help journalists become both faster and more efficient when sourcing content on breaking stories, and publish content with more confidence that material sourced from social media is authentic, by maintaining a list of their sources and linking new content to authors. When tracking a news story on social media, content items are associated with authors and can be filtered using predefined lists. For each new content item it becomes immediately clear whether it is in some way related to a source: whether it has been posted by that source, mentions that source or is attributed to it. The model additionally aims to help journalists quickly pick up new eyewitness content. This does not mean trending content from established news organisations or agencies, as such content is no longer breaking; instead, it would be content containing eyewitness images or video that is less than five minutes old since publication and is likely to still be unverified.
Main functionality: REVEAL will employ techniques in social media graph analysis, privacy-preserving data analytics, crowdsourcing through gamification and psychology-based sentiment analysis; it will deploy intelligent methods for focused processing of all types of content, strongly based on event recognition and entity extraction techniques, so that information can be clustered around automatically extracted and dynamically evolving topics. New techniques such as computational stylometry, along with forensic image and video analysis, will be used, as well as context-based content analysis, techniques for discovering the provenance of information, advances in media content delivery through simple and appealing interfaces, crowdsourcing mechanisms and different visualizations based on device capabilities, all within a personal data regulatory framework safeguarding fundamental rights, freedom of expression, cultural diversity, accuracy and credibility of information, privacy, etc.
Timeline: PHEME has already established on-going collaboration with the REVEAL project, to ensure complementarity and reuse of each other's results. REVEAL recently released a corpus of data on Github; a demo was tested with the terror attacks that hit Paris on 13 November 2015. When analysing eyewitness content, the team found that untrusted sources generally share images earlier than trusted sources. They also found that trusted sources are an indication that an image is authentic: trusted sources related to a piece of user-generated content make it more likely to be genuine, which is typically the case 30 minutes after a photo has been published. Consequently, if a journalist is prepared to wait, this can point them in the right direction for conventional means of verification, such as factual cross-checking or contacting the source directly through social media channels. The team also found that for the discovery of newsworthy eyewitness content it helps to filter out old content, so a journalist does not have to check potentially thousands of social media URLs but can focus on the top URLs.

Name: RumorLens
Description: RumorLens was a one-year research project funded by Google and partially supported by the National Science Foundation. Its aim was to build a tool that would aid journalists in finding posts that spread or correct a particular rumour on Twitter, by exploring the size of the audiences that those posts have reached. https://www.si.umich.edu/research/research-projects/rumorlens ; http://rumorlens.org/
Main Focus: The application would aid journalists and the general public in recognizing misleading information by showing the spread and reach of particular rumours and their accompanying corrections on Twitter.
Main functionality: The RumorLens application analyses the spread of rumours on Twitter, allowing users to submit a tweet that characterizes a topic of interest. The application retrieves additional tweets related to the subject and prompts user feedback to classify results as propagating, debunking or unrelated to the original rumour. It then uses a text classifier to garner more widespread results. At the end of the process, RumorLens presents the user with a set of tweets that either support or discredit the rumour. The system also provides accompanying data visualizations so that the user can see how many people were exposed to the original rumour, how many were exposed to the correction, and how many retweeted each.
Timeline: The site http://rumorlens.org/ is no longer functioning. The project likely ended in January 2015.

Name: News Stream 3.0
Description: A three-year research project funded by the German government that launched at the end of 2014. Partners include the Fraunhofer Institute for Intelligent Analysis and Information Systems, Neofonie (a tech service company), DPA (the German press agency) and Deutsche Welle (the German international broadcaster). http://newsstreamproject.org/ Information on the project website is available only in German. Some information in English giving more details about timelines and achievements to date: http://www.cebit.de/en/news/staying-afloat-in-the-flood-of-information.xhtml and http://www.iais.fraunhofer.de/pi_newsstream2015.html?&L=1
Main Focus: The aim of the project is to create a real-time analysis and evaluation tool (or big data infrastructure) that will allow journalists to access, in a few clicks, thousands of pieces of the most relevant and reliable information from the vast trove of data circulating online, including from video platforms, RSS feeds, news streaming services, social networking sites and media archives.
Main functionality: Journalists can use the tool to search for information on a specific subject, and the tool will provide a compact overview of the topic. Users can see what people are saying or discussing about the topic on blogs, Twitter and other social media. They can also receive alerts when the topic comes up for debate in parliament or becomes the subject of a news programme.
Timeline: A prototype was unveiled in February at CeBIT 2016 in Hannover: http://newsstreamproject.org/news-stream-auf-der-cebit-big-data-fuer-journalisten-live-erleben/


Name: Hoaxy
Description: A platform for the collection, detection and analysis of online misinformation and its related fact-checking efforts.
Main Focus: Study of the social dynamics of online news sharing.
Main functionality: Researchers at the Indiana University Network Science Institute (IUNI) and the School of Informatics and Computing's Center for Complex Networks and Systems Research (CNetS) are working on an open platform for the automatic tracking of both online fake news and fact-checking on social media. The goal of the platform, named Hoaxy, is to reconstruct the diffusion networks induced by hoaxes and their corrections as they are shared online and spread from person to person. Hoaxy will allow researchers, journalists and the general public to study the factors that affect the success and mitigation of massive digital misinformation. Some of the researchers from Truthy (a project previously mentioned) are also working on Hoaxy.
Timeline: The platform is still available. The partners published a research paper outlining some of their findings: https://arxiv.org/abs/1603.01511

Name: InVID
Description: InVID (In Video Veritas) is a Horizon 2020 project that will build a platform providing services to detect, authenticate and check the reliability and accuracy of newsworthy video files and video content spread via social media. http://www.invid-project.eu/
Main Focus: The aim of the InVID project is to identify news-relevant videos on social networks, verify the presented content and professionally clear the usage rights with content creators. This will avoid the inadvertent use of fake or manipulated videos by serious media outlets and alleviate the considerable effort involved in ensuring the authenticity of news reports.
Main functionality: InVID will automate verification processes to speed up and facilitate the workflow of professional journalists. To begin, topics that represent breaking news on social media are identified. Videos being uploaded and shared about these topics on social networks will be indexed, temporally fragmented and annotated based on their content. Based on the annotations and metadata, which include information about the user, location and time of recording, it will be possible to carry out an initial ranking of the videos. The subsequent verification process focuses on videos that are likely to be relevant and reliable.
Timeline: InVID started in January 2016 and will run until December 2018. The consortium is currently completing the user requirements (May 2016).

Name: GLOCAL
Description: GLOCAL is a European project under the FP7 programme, which compares the results of Google's image analysis with those of TinEye. http://www.glocal-project.eu/
Main Focus: The aim was to develop tools and applications for exploiting multimedia event information in domains such as news, politics and entertainment, for a variety of sources (content from different users, web repositories, web 2.0 environments and blogs).
Main functionality: GLOCAL developed and combined tools and techniques from different areas such as multimedia indexing, semantic schema matching, recommender systems and search. The main objectives were: (1) a common indexing schema based on models of events designed a priori; (2) a local indexing and tagging methodology, with algorithms used to populate event descriptions with media; (3) global sharing and search algorithms based on the common event-driven understanding of user experience within the same event and across events; (4) a new type of search query, with any combination of these components: standard keyword search; search by example, e.g. using a photo; and search by event-dependent contextual parameters (e.g. a location, a person or a property of a person).
Timeline: GLOCAL started on December 1st, 2009 and finished on November 30th, 2012.

Name: Claimbuster
Description: The quest to automate fact checking (University of Texas, Duke, Google, Stanford). A demo is available at http://idir-server2.uta.edu/claimbuster
Main Focus: By actively monitoring broadcast TV channels, online video streams, news articles and social media, ClaimBuster detects a claim as it appears, in real time. By instantly providing the voter with a rating of a claim's accuracy if it has been fact-checked before, ClaimBuster mitigates repeated false claims. By recommending highly important new factual claims to professional fact-checkers, ClaimBuster will free journalists from the time-consuming task of finding check-worthy claims and help them prioritize their efforts in assessing the veracity of claims.
Main functionality: This pop-up fact-checking platform is innovative, comprehensive and different in crucial ways. 1) For the first time, it will curate a comprehensive repository of existing fact-checks that new apps can search and link to. 2) It will actively monitor a variety of media for factual claims. 3) Upon a match, it will deliver pop-up fact-checks through social media and smart TV apps in addition to browser extensions. 4) It will rank claims by check-worthiness and suggest highly ranked new claims to fact-checkers, for which there exists no other tool. 5) Its collaboration platform will engage the fact-checking community. 6) For the specific domain of congressional voting records, it will provide data analysis tools that go beyond matching existing facts and allow for checking claims that lie in subtle ways, such as by cherry-picking.
Timeline: Project still alive; papers and conference participation in February 2016.

Name: Una Hakika
Description: Una Hakika is a Kenyan project dealing with misinformation and disinformation. It operates in the Tana Delta, one of the least developed areas in all of Kenya, yet mobile phone and internet usage there is still surprisingly high. http://www.unahakika.org/
Main Focus: One of the most interesting points about this project from the technical point of view is its use of WikiRumours, a workflow and technology platform designed to counter the spread of false information through transparency and early mitigation of conflict.
Main functionality: WikiRumours is a web- and mobile-based platform for moderating misinformation and disinformation. The software is free and open source under an MIT license, which means that it can be used for open, commercial or proprietary purposes without mandatory attribution.


In particular, TwitterTrails, ClaimBuster and Emergent were thoroughly examined by PHEME partners both as a source of inspiration and to compare their results with our own outcomes.

4.3 Existing tools and competitors

This section reports on existing tools for "rumour intelligence", veracity assessment and fact checking.

Table 2: Tools and competitors

Name: Storyful
Description: Storyful is an Irish company that discovers, verifies and acquires social media for newsrooms, brands and video producers. http://storyful.com/
Main Focus: With respect to journalists, Storyful focuses on verifying social media, forensically checking thousands of pieces of content using advanced digital techniques and traditional journalistic skills. Storyful's global team of journalists source, date and geo-locate every piece of content they deliver, and provide direct access to the source.
Main functionality: A suite of tools for journalists and brand management. Some of the tools in the suite are: Recommender, which builds and maintains communities of the most influential and connected users on the social web; Heatmap, which monitors the conversation and content that will shape the trends that matter to the user; Alertbot, which spots the source, location and velocity of those trends before they go viral; and Streamdesk, which locates, extracts, verifies and acquires the most engaging content on any topic, event, platform or location.
Timeline: In the market.

Name: Seriously Rapid Source Review (SRSR)
Description: An application that incorporates a number of advanced aggregations, computations and cues that would be helpful for journalists to find and assess sources on Twitter around breaking news events. http://www.nickdiakopoulos.com/2012/01/24/finding-news-sources-in-social-media/
Main Focus: Characterizes sources according to their implicit location, network and past content.
Main functionality: It offers functionalities such as automatically identifying eyewitnesses, automatically identifying user archetypes, visually cueing location, etc.
Timeline: The prototype was not built for real-time events and was tested with pre-collected and processed data due to limitations of the Twitter API. There is plenty more to think about in terms of enhancing the eyewitness classifier, different ways of using network information to spider out in search of sources, and experimenting with how such a tool could be used to cover different kinds of events. It seems that there is no continuity in the work.


Name: Dataminr
Description: Dataminr is a start-up that transforms the Twitter stream into actionable signals, identifying the most relevant information in real time for clients in finance, news and the public sector. Using proprietary algorithms, Dataminr analyses all public tweets and delivers the earliest warning for breaking news, real-world events, off-the-radar content and emerging trends. https://www.dataminr.com/
Main Focus: Online detection of news, brands, reputation, etc.
Main functionality: Verticals for finance, the public sector and journalists. In partnership with Twitter, Dataminr has developed Dataminr for News, which alerts journalists to breaking events and developing stories based upon their topics of interest and region of focus.
Timeline: In the market.

Name: Sarcasm Detector
Description: A sarcasm detector application, based on a trainable model. http://www.thesarcasmdetector.com/about
Main Focus: The detection of sarcasm on Twitter. To detect sarcasm properly, a computer would have to figure out that you meant the opposite of what you just said. It is sometimes hard even for humans to detect sarcasm, and humans have a much better grasp of the English language than computers do.
Main functionality: Getting the data: the Twitter API is used to stream tweets carrying the label #sarcasm (the sarcastic texts) and tweets without that label (the non-sarcastic texts), which are streamed and stored in a database. Pre-processing the data: before extracting features from the text data it is important to clean it up, applying a series of requirements and deleting duplicates. Feature engineering: several features were engineered and tested to help the classification of tweets. Choosing a classifier: cross-validation using a support vector machine (SVM) with a linear kernel and a Euclidean (L2) regularization coefficient of 0.1; the metric used to guide the cross-validation is the F-score.
Timeline: There do not seem to have been any new developments since 2014, but the open-source code and the application are still available: https://github.com/MathieuCliche/Sarcasm_detector
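The classification step just described maps onto standard, off-the-shelf components. Below is a minimal sketch, not the project's own code: the toy tweets stand in for the streamed data, and in scikit-learn the C parameter of LinearSVC plays the role of the L2 regularization coefficient mentioned above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: label 1 for tweets that carried #sarcasm (hashtag stripped
# so the classifier cannot simply learn the label itself), 0 otherwise.
texts = [
    "oh great, another monday",
    "wow, i just love being stuck in traffic",
    "what a surprise, the printer is broken again",
    "the concert last night was amazing",
    "just finished a great book on networks",
    "looking forward to the weekend trip",
]
labels = [1, 1, 1, 0, 0, 0]

# Linear-kernel SVM; C stands in for the regularization coefficient of 0.1.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC(C=0.1))

# F-score-guided cross-validation, as in the original write-up.
print(cross_val_score(model, texts, labels, cv=3, scoring="f1").mean())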


Name: Trooclick
Description: An application that alerts users about glitches (errors) in online news, to help them maintain trust in online content. A company working on news and social media analysis. http://trooclick.com/
Main Focus: Zooming in on the voices of the people speaking on current business events across the web. 1. Deliver an unbiased news experience for readers, by bringing together all points of view regarding a given topic on one page. 2. Create big data from news (by converting text into structured data) and enable clients to use that data to enhance their existing products and services.
Main functionality: A robot scans news outlets and extracts information from databases (e.g. the Securities and Exchange Commission). The result is a human-written digest of the most important glitches spotted online, which is emailed once per week.
Timeline: Initially they focused on business news, but expansion into additional domains started in 2015. In 2016 they offer the following domains: Business, Health, Tech, Science, Earnings, World, USA, UK, Israel, India, Canada.

Name: TruthTeller
Description: This app from the Washington Post fact-checks political speeches against known fact-checking websites (e.g. PolitiFact, FactCheck.org, the Post's own Fact Checker blog). http://truthteller.washingtonpost.com/ (inactive URL)
Main Focus: Fact-checks political speech as it happens.
Main functionality: It uses a database, usually from The Washington Post's Fact Checker blog, PolitiFact or FactCheck.org. New speeches are then analysed to see if they contain any of the misleading claims already in the database. The video clips are submitted to a program called MAVIS by Microsoft, which turns the audio waveforms into a transcript. Once the video clip transcript is ready, an algorithm checks whether the politician made any statement that is already in the fact-check database. If so, the fact-check appears alongside the video when the politician makes that claim.
Timeline: TruthTeller began with the help of a prototype grant from the Knight Foundation, and has since grown within the Washington Post. It was funded until the end of 2013. The project, at least under this name, has been discontinued since 2014 and the website is not active in 2016.

Name: HearSift
Description: The aim was to spot, track and debunk misinformation as it spreads through social media. The tool was being created by Will Knight (MIT Technology Review) and MIT PhD student Soroush Vosoughi, with funding from the Knight Foundation. A beta version was due by the end of 2014.
Main Focus: Tracking and debunking misinformation.
Timeline: It appears that HearSift is no longer accessible at the time of writing. We keep the reference for future review of its status. (Not accessible in 2016.)

Name: TweetCred
Description: TweetCred is a real-time, web-based system to assess the credibility of content on Twitter. The system provides a credibility rating between 1 and 7 for each tweet on the Twitter timeline. It is a Chrome extension. http://twitdigest.iiitd.edu.in/TweetCred/
Main Focus: The credibility of tweets during high-impact events.
Main functionality: Data collection: the Twitter Streaming API, using the Trends API, which returns the top 10 trending topics on Twitter; the Trends API is queried every 3 hours for the current trending topics, and tweets related to these topics are collected. Event selection: ensuring the selection of events with high impact and relevance. Annotation scheme: human annotators in English. Analysis: an SVM ranking algorithm to build a model for the credibility of information in tweets. It is worth mentioning that users can enter a keyword to search for tweets, and the tool retrieves 50 tweets in real time from Twitter along with the credibility score computed by TweetCred.
Timeline: Online and active in 2016.
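TweetCred's SVM ranking step can be illustrated with the standard pairwise transform: train a linear SVM on differences between the feature vectors of unequally rated tweets, then score new tweets with the learned weights. The sketch below is illustrative only; the three toy features and the ratings are stand-ins for TweetCred's real feature set and its 1-7 annotations.

from itertools import combinations

import numpy as np
from sklearn.svm import LinearSVC

# Toy per-tweet features: [has_url, log(followers), swear_word_count].
X = np.array([[1, 8.2, 0], [0, 2.1, 3], [1, 6.5, 1], [0, 3.3, 2]])
y = np.array([6, 2, 5, 3])  # annotated credibility ratings (1-7 scale)

# Pairwise transform: learn on feature differences of unequally rated
# pairs, so the decision function orders tweets by credibility.
diffs, signs = [], []
for i, j in combinations(range(len(X)), 2):
    if y[i] != y[j]:
        diffs.append(X[i] - X[j])
        signs.append(np.sign(y[i] - y[j]))

rank_svm = LinearSVC(C=1.0).fit(np.array(diffs), np.array(signs))

# Higher score = more credible; argsort gives a ranking of the tweets.
scores = X @ rank_svm.coef_.ravel()
print(scores.argsort()[::-1])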

Name: Monitter
Description: Monitter is a real-time visualization of Twitter rumours and trends. http://monitter.com/
Timeline: Monitter was discontinued due to a series of changes in Twitter's API specification (website inactive in 2016). This is the message on their website: "Since Monitter is an unpaid side project, it has been left wanting for the necessary upgrades required to continue operating. We ARE still working on our new version, however we've been unable to get it ready in time for the Twitter API sunset."

Name: Mention.net
Description: Monitoring thousands of sources in 42 languages. https://es.mention.com/
Main Focus: A monitoring tool to track mentions, covering the web and social networks in 42 languages.
Main functionality: The tool offers the following services: real-time alerts; statistics and data reports; group work (share alerts with any user and assign tasks to team members); access from any device; effective answering (retweet, publish or share articles via Twitter, Facebook or Buffer, as the user chooses, depending on the mention received); removal of homonyms and spam noise; identification of the most important mentions; and setting the tone of a mention.
Timeline: Online and active in 2016. Plans from 29 to 799 euros/month.

Name: Needtagger
Description: A tool for finding relevant conversations on Twitter to engage with. http://www.needtagger.com/
Main Focus: It mines the conversations a user pulls from Twitter and other social networks for expressions of commercial intent related to the user's business.
Main functionality: Classifies Twitter, Facebook, LinkedIn and Google+ posts; builds custom stream filters for any requirement using its self-service app; real-time classification of firehose volumes; English-language social posts only; RESTful API.
Timeline: Discontinued from 30 June 2014. They reached more than 7,000 users, but keeping fresh Twitter data (tweets) flowing into the apps at full-firehose levels and processing it in real time required a reliable, high-speed and stable supply of tweets. No affordable options to resolve the issue were found, so they decided to shut the apps down. The Social Signals API is still online for free: https://needtagger.3scale.net/

Name: Topsy
Description: A social search engine. It is a certified Twitter partner and maintains a comprehensive index of tweets, numbering in the hundreds of billions, dating back to Twitter's inception in 2006. http://topsy.com/
Main Focus: Topsy makes products to search, analyse and draw insights from conversations and trends on public social websites including Twitter and Google+.
Main functionality: The service ranks results using a proprietary social influence algorithm that measures social media authors by how much others support what they are saying. The service also provides access to metrics for any term mentioned on Twitter via its free analytics service at analytics.topsy.com, where users can compare up to three terms for content in the past hour, day, week or month.
Timeline: Topsy was acquired by Apple and closed on 15/12/2015.

Name: Tweetreach
Description: TweetReach is an analytics tool which focuses on the impact of the Twitter activity of a user, conversations and network. It can measure the size of the following and the reach of messages, and help users identify influencers who are sharing their content. http://tweetreach.com
Main Focus: Provides analytics based on up to 50 tweets related to the tweet sent out by a given user.
Main functionality: The pro version gives the user, in addition to standard reporting on exposure and "reach", real-time monitoring capability, trends and data charts, searchable tweet archives, data segmentation ability, contributor influence, etc.
Timeline: Still working in 2016.

Name: Tweetchup
Description: A free service to analyse the most fundamental metrics on Twitter. It can be useful when trying to find viral content on Twitter. http://tweetchup.com/
Main Focus: Searches through the latest 1,600 tweets to find the most popular messages.
Main functionality: Analyses user connections on Twitter: how many mentions and retweets an account gets, the users it engages with most, an interactive map with the locations of users mentioned by an account, etc. Analyses any user on Twitter: insights on a user's activity within a specific date range; who a user retweets, replies to or mentions most; the hashtags a user used most; a user's most retweeted and favourited tweets; the days of the week and hours of the day a user tweets most. Analyses keywords and hashtags on Twitter: insights on tweets containing specific keywords or user mentions; stats on users who mentioned the keywords; an interactive map with the locations of users who mentioned the keywords; the hashtags most used within tweets containing the keywords.
Timeline: Still working and free in 2016.

Name: Tweet-Digest
Description: Tweet-Digest is a free-for-all web-based service. It is a real-time Twitter search portal for extracting, tracking and visualizing content from Twitter. http://twitdigest.com
Main Focus: The portal extracts data (tweets and user information) from Twitter and performs various analytical tasks on it, such as spam/phishing detection, credibility assessment, sentiment detection, social network analysis and query expansion, with special emphasis on security aspects.
Timeline: The project was active from 2005 to 2009. There are still papers available online about it, but the website and the YouTube videos are no longer working in 2016.

4.4 Other resources

Besides existing projects and tools, this section lists other interesting resources, ranging from bibliography and patents to relevant web sites and social network pointers.

Table 3: Related resources (bibliography, patents, articles…)

Name: Patent US 20150067849 A1
Description: https://www.google.es/patents/US20150067849
Main Focus: Patent by IBM for neutralizing the propagation of malicious information.
Main functionality: A viral spread of information is tracked in a network comprising interconnected nodes. Malicious information in the viral spread is identified. A topic-specific sub-network of nodes prone to be affected by the malicious information is predicted, and the effect of the malicious information on that sub-network is neutralized by initiating a spread of neutralizing information to its nodes. Other variants and embodiments are broadly contemplated.
Timeline: Publication date: March 2015. Related patents: US20080256233 (Richard Hall, 2009), US20120158630 and US20140237093 (Microsoft, 2012/2013).

Name: Patent US 20150095320 A1
Description: https://www.google.es/patents/US20150095320
Main Focus: Apparatus, systems and methods for scoring the reliability of online information (Trooclick France).
Main functionality: The apparatus, systems and methods dynamically provide the reliability of multimedia documents by applying a series of intrinsic and extrinsic criteria: a reliability score is pre-calculated for at least a set of multimedia documents from at least one pre-selected source and, in response to a request, the multimedia documents from the pre-selected sources are provided with that score, while documents from other sources are provided with a conditionally calculated score.
Timeline: Publication date: September 2013. Initially thought to be used by Trooclick, a tool for social media monitoring.

Name: Patent US 20140304343 A1
Description: https://www.google.es/patents/US20140304343
Main Focus: Social media provocateur detection and mitigation, by Avaya Inc.
Main functionality: A contact center system can receive messages from social media sites or centers. The messages may include derogatory or nefarious content. The system can review messages to identify a message as nefarious and identify the poster as a social media provocateur. The system may then automatically respond to the nefarious content. Further, the system may prevent future nefarious conduct by the identified social media provocateur by executing one or more automated procedures.
Timeline: Publication date: October 2014.

Name: Patent US 20140280610 A1
Description: https://www.google.es/patents/US20140280610
Main Focus: Identification of users for initiating information spreading in a social network, by IBM.
Main functionality: Embodiments of the invention relate to identifying users for initiating information spreading in a social network. In one embodiment, information about one or more users of a social network is collected and one or more features are computed for each user based on the collected information. The features are compared with a statistical model, and a probability is calculated that each user will spread a message received from outside their social network, based on the comparison.
Timeline: Publication date: September 2014.

Name: Patent CN 104361231 A
Description: https://www.google.es/patents/CN104361231A
Main Focus: Method for controlling rumour propagation in complicated networks.
Main functionality: The invention discloses a method for controlling rumour propagation in a complicated network. A new rumour propagation model is established on the basis of an SIR (susceptible, infected and removed) model; the fact that the infection rate decreases as the number of rumour-propagating nodes increases is taken into account, and the infection rate is described by introducing a piecewise function, so that rumour propagation behaviour in the complicated network can be described more accurately. An optimal control variable, calculable by mathematical methods, is then introduced on the basis of the new model, so that as many healthy nodes as possible are converted into immune nodes and the number of rumour-propagating nodes in the network is minimized, fulfilling the aim of controlling the rumour propagation.
Timeline: Publication date: September 2014.
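To make the abstract of CN 104361231 above concrete, the following is a minimal, illustrative simulation of an SIR-style rumour model with a piecewise infection rate and a crude control term; all parameter values are assumptions made for this example, not taken from the patent.

def simulate(days=60, N=10_000, beta_hi=0.5, beta_lo=0.1,
             threshold=0.2, gamma=0.15, control=0.02):
    """Discrete-time SIR rumour model: S (susceptible), I (spreaders),
    R (removed/immune). The infection rate drops once spreaders exceed
    a threshold, and a control term converts S directly into R."""
    S, I, R = N - 10.0, 10.0, 0.0
    history = []
    for _ in range(days):
        # Piecewise infection rate: contagion slows once the rumour
        # is already widespread, as the patent abstract describes.
        beta = beta_hi if I / N < threshold else beta_lo
        new_spreaders = beta * S * I / N
        new_removed = gamma * I
        immunised = control * S  # control variable: direct debunking
        S -= new_spreaders + immunised
        I += new_spreaders - new_removed
        R += new_removed + immunised
        history.append(I)
    return history

spreaders = simulate()
peak_day = max(range(len(spreaders)), key=spreaders.__getitem__)
print(f"spreaders peak on day {peak_day} at {spreaders[peak_day]:.0f} users")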

Name: Patent WO 2014159540 A1
Description: https://www.google.es/patents/WO2014159540A1
Main Focus: Systems and methods for predicting meme virality based on network structure, from Indiana University Research and Technology Corporation.
Main functionality: Systems and methods for predicting the virality of a content item are disclosed. A method includes: receiving a social network structure; identifying communities within that structure, where communities are identified as dense subnetworks; receiving social network content that includes one or more content items; and identifying content items that are predicted to become viral, based on the utilization of the content items between different communities in the social network structure.
Timeline: Publication date: October 2014.

Name: Patent US 9189514 B1
Description: https://www.google.es/patents/US9189514
Main Focus: Optimized fact checking method and system.
Main functionality: An optimized fact checking system analyses and determines the factual accuracy of information and/or characterizes the information by comparing it with source information. The system automatically monitors information, processes it, fact-checks it in an optimized manner and/or provides a status of the information. In some embodiments, the optimized fact checking system generates, aggregates and/or summarizes content.
Timeline: Publication date: November 2015.

Name: Patent US 20140365571 A1
Description: https://www.google.es/patents/US20140365571
Main Focus: Automatically determining the veracity of documents posted in a public forum, by IBM.
Main functionality: An approach is provided to determine the veracity of an online posting. When a posting is received at a web site, a topic for the posting is automatically identified. The approach further identifies actions taken on the posting and the corresponding action originators, the actions being events such as commenting, liking, disliking, re-sharing or re-posting. Veracity information is collected about the action originators and a veracity weighting is assigned to them based on that information. The actions are then analysed using the veracity weighting to form a weighted veracity summary, which is provided to viewers of the online posting.
Timeline: Publication date: December 2014.

Name: Twitter lists to follow
Description: https://twitter.com/IgorBrigadir/lists/verification ; https://twitter.com/RevealEU/lists/verification-of-ugc ; https://twitter.com/stephen_abbott/lists/for-story-challenge
Main Focus: Twitter lists created by other people and projects that are a good source on the theme of verification for social media.
Timeline: Subscribed to and following in 2016.

Name: First Draft News
Description: http://firstdraftnews.com
Main Focus: First Draft News is a daily destination site for journalists who source and report stories from social media. Visitors can browse articles and resources, watch videos, listen to podcasts and download supporting training materials. Visitors who sign up to join First Draft News can add comments, save pages and create bespoke, shareable packs of their favourite items.
Main functionality: The site started in November 2015 and is meant to be a verification resource for journalists working with UGC and social media. Emergent.info and some other big names in online news/verification are among the partner organisations behind the project.
Timeline: Active in 2016.

Name: European Center for Social Media
Description: https://www.linkedin.com/groups/4308091
Main Focus: A LinkedIn group related to the REVEAL project, with a good number of followers, that discusses issues related to PHEME.
Timeline: A good source to see who to follow.


Name: IFCN Poynter
Description: http://www.poynter.org/category/fact-checking/
Main Focus: The International Fact-Checking Network at the Poynter Institute. The IFCN is supported by grants from the Omidyar Network and the National Endowment for Democracy, and provides a source of news about fact-checking, especially for journalists.
Main functionality: The Poynter Institute in the US recently held a "Tech & Check" conference bringing together journalists and computer scientists to talk about automating the verification of online claims. Interesting challenges are summarized at the following links: http://www.poynter.org/2016/fact-checking-2-0-teaching-computers-how-to-spot-lies/404501/ ; http://www.poynter.org/2016/whats-does-the-future-of-automated-fact-checking-look-like/404937/ ; http://reporterslab.org/tech-check-new-ideas-automate-fact-checking/ . "No one has yet figured out how to fully automate fact-checking, but discussions between reporters and computer scientists will help identify what types of tools could be most useful and how it might be possible to make them happen."
Timeline: Active.

Name: Paper: Detecting controversial events from Twitter (Yahoo Labs)
Description: http://www.marcopennacchiotti.com/pro/publications/CIKM_2010.pdf
Main Focus: The main contributions of this paper are: (a) it formalizes the task of "controversial event detection" and introduces three regression machine learning models to address it; (b) it describes a rich feature set for the target task; (c) it reports encouraging experimental results: the models register statistically significant performance increases over all baselines, including relevant previous work.
Main functionality: The models proved to have good discriminative power.
Timeline: The article was written in 2009.

Name: Paper: Enquiring Minds: Early Detection of Rumours in Social Media from Enquiry Posts (Michigan)
Description: http://www-personal.umich.edu/~qmei/pub/www2015-zhao.pdf
Main Focus: Users share information based on different types of needs, including the need to verify controversial information. Such information needs can not only help spread rumours, but also provide the first clue for detecting them. Based on this fact, the authors designed a rumour detection approach. They cluster only those tweets that contain enquiry patterns (the signal tweets), extract the statement that each cluster is making, and use that statement to pull back in the rest of the non-signal tweets that discuss the same statement. They then rank the clusters based on statistical features that compare properties of the signal tweets within a cluster to properties of the whole cluster. Extensive experiments show that the proposed method can detect rumours effectively and efficiently at an early stage. With a small Hadoop cluster, in about half an hour they processed 10% of all the tweets posted on Twitter in one day; one third of them are real rumours, and about 70% of the top-ranked 10 clusters are rumours.
Main functionality: There is still considerable room to improve the effectiveness of the rumour detection method by improving the filtering of enquiry and correction signals, for instance by training a classifier rather than relying on manually selected regular expressions. A method to automatically update the filtering patterns in real time could also be developed, to prevent potential spamming of the detection system. More features for each statement could be explored, and a better ranking algorithm for candidate rumour clusters could be trained. Another direction is to use this method to detect rumours automatically and generate a large data set of rumours, which could benefit many potential analyses, such as finding features that correlate with the truth value of a rumour, or analysing general diffusion patterns or the life cycle of rumours.
Timeline: Article available online from 2015.
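The signal-tweet filtering step described above relies on hand-crafted enquiry patterns. A minimal sketch of that step follows; the regular expressions are illustrative stand-ins, not the paper's exact list.

import re

# Illustrative enquiry/verification patterns; Zhao et al. curate their
# own list manually, which this sketch only approximates.
ENQUIRY_PATTERNS = [
    r"\bis (this|that|it) true\b",
    r"\breally\?",
    r"\bunconfirmed\b",
    r"\b(rumou?r|debunk|hoax)\b",
    r"\bwh(at|y)[?!]",
]
SIGNAL_RE = re.compile("|".join(ENQUIRY_PATTERNS), re.IGNORECASE)

def signal_tweets(tweets):
    """Return tweets containing verification/enquiry language; in the
    paper these seed clusters that later pull in the non-signal tweets
    making the same statement."""
    return [t for t in tweets if SIGNAL_RE.search(t)]

print(signal_tweets([
    "Is this true? Explosion reported downtown",
    "Beautiful sunset tonight",
    "Unconfirmed: airport closed due to smoke",
]))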


Name: iCheck/uClaim
Description: http://compute-cuj.org/cj-2014/cj2014_session2_paper1.pdf
Main Focus: It is only a first step toward a computational (not manual) approach to fact-checking.
Main functionality: Finding interesting claims from data is broadly related to the field of data mining. This framework allows flexible mining tasks to be specified using query templates (e.g., in SQL). The discovered patterns readily translate to claims that are easy for a lay person to understand; in contrast, results from more sophisticated approaches (say, SVM classification) are harder to explain and cannot be directly cited in stories. Companies such as Narrative Science and Automated Insights are able to automatically generate news stories based on data for domains such as business and sports, thereby relieving humans of writing narratives that are largely template-based. The framework does not generate complete stories, but interesting leads and factlets that can be used in full stories. Other companies, such as Elias Sports Bureau, provide data as well as statistics and factlets derived from data, on particular domains and for a fee. Organizations such as factcheck.org and PolitiFact.com rely on their expert editorial staff to check claims; however, manual approaches are costly to scale because of the demand on human expertise and effort. FactMinder assists fact checkers in annotating text, extracting entities, linking sources and collaboratively building a knowledge base. Truth Goggles (truthgoggl.es/demo.html) and Dispute Finder detect claims on the Web that have already been checked or refuted by authoritative sources. However, computational tools for checking claims directly against data are still sorely lacking. Question answering systems such as IBM's Watson and WolframAlpha, and natural language querying systems, allow users to check the correctness of some claims in natural language, but these systems are not yet capable of handling claims that correspond to complex queries.
Timeline: Article from 2013-2014. Work apparently discontinued.

Name: Press release: Social Media: DW and ATC develop verification platform
Description: http://www.dw.com/de/social-media-dw-und-atc-entwickeln-verifizierungsplattform/a-19303925?maca=de-Twitter-sharing
Main Focus: Deutsche Welle and the Athens Technology Centre are creating a verification platform for journalists.
Main functionality: Deutsche Welle and the Athens Technology Centre, two partners of the REVEAL project, are developing a new platform to help journalists verify UGC on social media. The 15-month project is funded by the Innovation Fund of the Google Digital News Initiative.
Timeline: The project began in June 2016. The press release is in German only.

Name: Article: Social Network Algorithms Are Distorting Reality By Boosting Conspiracy Theories
Description: http://www.fastcoexist.com/3059742/social-network-algorithms-are-distorting-reality-by-boosting-conspiracy-theories
Main Focus: Facebook's anti-conservative stance is in the news, but the issue of what news social networks choose to show us is much broader than that.
Main functionality: An interesting article showing curiosities about the spread of misinformation as food for conspiracy theories. It argues that this is no longer a mere curiosity, and that we need to be aware of it and take action: "Some have begun to introduce algorithms that warn readers that a share is likely a hoax, or satire. Google is investigating the possibility of 'truth ranking' for web searches, which is promising. These are great starting points, but a regulation by algorithm has its own set of advantages and pitfalls. The primary concern is that turning companies into arbiters of truth is a slippery slope, particularly where politically rooted conspiracies are concerned."
Timeline: Recent article by Renee DiResta, March 2016.


Name: Article: The Increasing Problem With the Misinformed
Description: https://www.baekdal.com/analysis/the-increasing-problem-with-the-misinformed/
Main Focus: A post analysing how misinformation spreads. It also tackles fact-checking issues, analysing the truthfulness of statements made by US politicians.
Main functionality: Very interesting facts and analysis about how people are not uninformed but misinformed; a very good source for understanding the role of the press and news organizations in this respect.
Timeline: Recent article (post) by Thomas Baekdal, March 7, 2016.

Name: Workshop: Social Media at the Newsroom
Description: http://www.smnews.newslab.ie/accepted-papers/
Main Focus: A set of papers accepted at the workshop.
Main functionality: An interesting workshop at which PHEME partners were present, with one PHEME paper accepted. The workshop debated precisely the themes of interest to the project.
Timeline: Workshop collocated with the ICWSM 2016 conference, May 2016, Cologne, Germany.

Name: Conference workshop: Workshop on Web Multimedia Verification (WeMuV 2015)
Description: http://www.icme2015.ieee-icme.org/workshops.php ; https://sites.google.com/site/wemuv2015/
Main Focus: Several interesting papers on verification.

Name: Article: recent research on rumours
Description: http://firstdraftnews.com/recent-research-reveals-false-rumours-really-do-travel-faster-and-further-than-the-truth/
Main Focus: An overview of recent research on rumours by Craig Silverman for First Draft, including the UWAR/SWI collaboration within PHEME and Hoaxy by Truthy Indiana.

Name: Article explaining the Una Hakika Kenyan project
Description: http://firstdraftnews.com/how-una-hakika-helped-slow-the-spread-of-dangerous-rumors-in-kenya/
Main Focus: Una Hakika is a Kenyan rumour tracking project for collaborative reporting and verification of rumours: http://www.unahakika.org/. It will be interesting to see its underlying software, WikiRumours, whose code they are apparently planning to release soon.

Name: Article: why people are so incredibly gullible
Description: http://www.bbc.com/future/story/20160323-why-are-people-so-incredibly-gullible
Main Focus: The BBC on research suggesting that people tend to fall for every lie that seems credible, and that it is really hard to correct those perceptions, also suggesting that "counter-evidence only strengthens someone's conviction".


Name: Book
Description: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7376425
Main Focus: A recent book on algorithms for computational verification.

Name: Article: Political rumouring on Twitter during the 2012 US presidential election: Rumour diffusion and correction
Description: http://nms.sagepub.com/content/early/2016/03/04/1461444816634054.abstract?rss=1

Name: Paper: Keeping Up with the Tweet-dashians: The Impact of 'Official' Accounts on Online Rumouring
Description: http://faculty.washington.edu/kstarbi/CSCW2016_Tweetdashians_Camera_Ready_final.pdf
Main Focus: A paper examining how official Twitter accounts participate in the spread and correction of rumours in social networks.
Timeline: Paper presented at the CSCW 2016 conference. Andrews, C. et al., 2016. Keeping Up with the Tweet-dashians: The Impact of 'Official' Accounts on Online Rumouring. In CSCW '16: Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pages 452-465.

Name: Paper: Could This Be True? I Think So! Expressed Uncertainty in Online Rumouring
Description: http://faculty.washington.edu/kstarbi/CHI2016_Uncertainty_Round2_FINAL-3.pdf
Main Focus: This paper introduces a scheme of different types of uncertainty expressed in tweets about rumours.
Timeline: Paper presented at the CHI 2016 conference. Starbird, K. et al., 2016. Could This Be True? I Think So! Expressed Uncertainty in Online Rumouring. In Proceedings of the ACM CHI Conference on Human Factors in Computing Systems.

Name: Demo: ClaimBuster fact-checking demo
Description: http://idir-server2.uta.edu/claimbuster
Main Focus: Online demo of ClaimBuster.

Name: Demo: Twitter Trails online demo
Description: http://twittertrails.com/
Main Focus: Online demo of Twitter Trails.

Name: Paper: Classifying Rumour Stance in Crisis-Related Social Media Messages
Description: http://www.aaai.org/ocs/index.php/ICWSM/ICWSM16/paper/view/13075
Main Focus: An interesting paper investigating how rumours spread, using conversational analysis.
Main functionality: The authors claim to have a dataset of 4,300 manually coded tweets. PHEME partners asked for the dataset, but it is not available so far.
Timeline: Zeng, L. et al., 2016. Classifying Rumour Stance in Crisis-Related Social Media Messages. In Proceedings of the International Conference on Web and Social Media (ICWSM), Cologne, Germany.


Name: Handbook: Verification Handbook for online investigative reporting
Description: http://verificationhandbook.com/book2/
Main Focus: A very interesting web resource on verification for journalists.

Name: Report: Lies, damn lies, and viral content
Description: http://towcenter.org/wp-content/uploads/2015/02/LiesDamnLies_Silverman_TowCenter.pdf
Main Focus: An interesting report on how news websites spread and debunk online rumours, unverified claims and misinformation.
Main functionality: A long and thorough report, very interesting and well aligned with the PHEME research.
Timeline: Report by Craig Silverman for the Tow Center for Digital Journalism; a Tow/Knight Report, 2015.


5. Conclusion

This document provides the final, refined version of the Market Watch Analysis, taking as its starting point the initial Market Watch presented in PHEME deliverable D9.5.1. The market watch itself is preceded by a definition of the aims of PHEME and by an overview of the market for social network and text analysis, drawing on predictions from a selection of business analysts.

Since its aim is to detect possible business opportunities for PHEME, the document explores the current targeted market by identifying potential competitors and existing research projects that deal with veracity in social networks. It also looks for synergies that can be converted into real business cases or into improvements in the current line of research. We have presented and followed a methodology to scan the market thoroughly, based on previous methodologies, adapted to our specific needs and refined by applying the lessons learned during its application.

The document has therefore provided an overview of the main competitors and tools, as well as of the research focus and outcomes of projects and initiatives relevant to PHEME. It is worth mentioning that during the project lifespan we have witnessed a growing interest from the public and the market in tools and methods to assess the quality of data found on social networks, and several tools and projects currently address the issue at the core of PHEME. Project partners are aware of (and are in many cases the main source of information about) the initiatives, tools and projects listed in this document. Some of the results inspired or even redirected work being undertaken within the project, which is one of the goals of this market watch. The final exploitation document will take input from this document in order to assess the maturity of the market for PHEME solutions.

Therefore, as a conclusion, PHEME should continue to watch the results of these initiatives with the utmost interest, search for synergies, and eventually reassess its research focus based on feedback concerning what the market is addressing and demanding.


6. Bibliography and references

Gold, E.R., and Baker, A.M., 2012. Evidence-based Policy: Understanding the Technology Landscape. Journal of Law, Information and Science 22(1).

Zubiaga, A., Liakata, M., Procter, R., Wong Sak Hoi, G., Tolmie, P., 2016. Analysing How People Orient to and Spread Rumours in Social Media by Looking at Conversational Threads. http://arxiv.org/pdf/1511.07487.pdf

Kang, Y.-a., Gorg, C., Stasko, J., 2009. Evaluating visual analytics systems for investigative analysis: Deriving design principles from a case study. In IEEE VAST.

Søe, S.O., 2016. The Urge to Detect, the Need to Clarify: Gricean Perspectives on Information, Misinformation, and Disinformation. Thesis. http://static-curis.ku.dk/portal/files/160969791/Ph.d._2016_Obelitz.pdf

Walenz, B., Wu, Y., Song, S., Sonmez, E., Wu, E., Wu, K., Agarwal P.K., Yang, J., Hassan, N., Sultana, A., Zhang, G., Li, C., Yu, C. (The iCheck/uClaim Team), Finding, Monitoring, and Checking Claims Computationally Based on Structured Data. http://compute-cuj.org/cj-2014/cj2014_session2_paper1.pdf

Zeng, L., Starbird, K., Spiro, E.S., 2016. Classifying Rumor Stance in Crisis-Related Social Media Messages. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM16/paper/view/13075

Pariente Lobo, T., Pontual, A.L., Bontcheva, K., Moffat, L., 2015. D9.3. Dissemination and Exploitation Plan (Revised). PHEME Project deliverable

Pariente Lobo, T., Pontual, A.L., 2015. D9.5.1. Market Watch – Initial Version. PHEME Project deliverable
