
DATA SCIENCE-DRIVEN CROWD INTELLIGENCE AND ITS BUSINESS APPLICATIONS

by

Xuan Wei

Copyright © Xuan Wei 2020

A Dissertation Submitted to the Faculty of the

DEPARTMENT OF MANAGEMENT INFORMATION SYSTEMS

In Partial Fulfillment of the Requirements

For the Degree of

DOCTOR OF PHILOSOPHY

In the Graduate College

THE UNIVERSITY OF ARIZONA

2020

ACKNOWLEDGEMENTS

I hold great appreciation for my dissertation committee members, Drs. Daniel Zeng, Wei Chen, Yong Ge, and Jason Pacheco, for their encouragement, inspiration, and guidance. I am especially grateful to my advisor, Dr. Daniel Zeng. The life lessons I learned about passion, professionalism, vision, attitude, teamwork, and many others will have a lasting influence on my future career and life. I am also very grateful to all other faculty members in the MIS department and to my coauthors, especially Dr. Zhu Zhang, for their scholarly communication and intellectual stimulation.

Special thanks to my Ph.D. colleagues and friends in Tucson, especially Zhengchao Yang, Yang Gu, Yuanxia Li, Hao Liu, Zhipeng Chen, Jiayu Yao, Saiying Ge, Zisu Wang, Xinran Wang, Marni Mcdaniel, and many others, for their friendship and all the happy parties. Many thanks to the senior students, especially Dr. Yongcheng Zhan, for their help and support. I also want to thank Dr. Jingyu Liu and Dr. Muhan Zhou for their help in life whenever needed. I also thank the MIS staff members, Cinda Van Winkle and Dawn Bishop, for their logistical assistance.

I also owe my deepest gratitude to my loving wife, Dr. Mingyue Zhang, and my family members for their company, support, and encouragement.

I thank the National Institutes of Health (1R01DA037378) for providing the funding for my work.

DEDICATION

This dissertation is dedicated to my wife and family

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ABSTRACT

1. INTRODUCTION

2. ESSAY I: MINING CROWD STANCE FROM SOCIAL MEDIA: A DEEP LEARNING APPROACH BASED ON INTERACTIVE ATTENTION
   2.1 Introduction
   2.2 Literature Review
      2.2.1 Crowd Opinions in Social Media
      2.2.2 Stance Detection and Related Problems
      2.2.3 Techniques of Stance Detection
   2.3 Method
      2.3.1 Problem Definition
      2.3.2 Psychological Motivation
      2.3.3 Interactive Attention-Based Stance Detection (IASD)
         2.3.3.1 Embedding Layer
         2.3.3.2 Context Encoding Layer
         2.3.3.3 Attention Layer
         2.3.3.4 Prediction Layer
         2.3.3.5 Model Training
   2.4 Empirical Evaluation
      2.4.1 Experimental Design
      2.4.2 Experimental Results
   2.5 Discussion and Conclusions

3. ESSAY II: HOW TO DESIGN NEXT-GENERATION LEARNING FROM CROWDS IN A PRINCIPLED WAY? AN INTERPRETABLE FRAMEWORK COMBINING DEEP LEARNING AND GRAPHICAL MODELS
   3.1 Introduction
   3.2 Related Work
      3.2.1 Truth Inference in Crowdsourcing
      3.2.2 Deep Generative Models and Inferences
   3.3 Hypothesis Development
   3.4 Deep Generative Modeling Framework
      3.4.1 DARF
      3.4.2 DARFC and S-DARFC
      3.4.3 DARFCD
      3.4.4 Multi-Class Extensions
   3.5 Model Inference
   3.6 Empirical Evaluations
      3.6.1 Experimental Design
      3.6.2 Hypothesis Testing
      3.6.3 Comparison with Benchmark Models
      3.6.4 Analysis and Discussion
   3.7 Conclusions

4. ESSAY III: COMBINING CROWD AND MACHINE INTELLIGENCE TO DETECT FALSE NEWS IN SOCIAL MEDIA
   4.1 Introduction
   4.2 Related Work
      4.2.1 Theoretical Foundations
      4.2.2 Definition of False News and Its Classification
      4.2.3 Computational False News Detection in Social Media
      4.2.4 False News Studies Using Crowd Wisdom
      4.2.5 Information Aggregation
   4.3 A Crowd-Powered Framework for False News Detection
      4.3.1 Information Extraction
      4.3.2 Unsupervised Bayesian Result Aggregation Model CLNAM
         4.3.2.1 Challenges in Result Aggregation
         4.3.2.2 Technical Insights
         4.3.2.3 CLNAM Model
         4.3.2.4 Model Inference
         4.3.2.5 Learning Prior Beliefs from Data
   4.4 Empirical Evaluations
      4.4.1 Experimental Design
         4.4.1.1 Datasets and Preprocessing
         4.4.1.2 Baseline Methods and Evaluation Metrics
         4.4.1.3 Experimental Procedure
      4.4.2 False News Detection Performance: CAND vs. the Benchmark Methods
         4.4.2.1 Results in Dataset Weibo
         4.4.2.2 Results in Dataset Twitter
         4.4.2.3 Robustness Check
      4.4.3 Analysis and Discussion
         4.4.3.1 Complementary Strengths of Humans and Machines
         4.4.3.2 Early Detection
         4.4.3.3 Value of Non-Text Features and Multiple Feature-Based Classifiers
         4.4.3.4 Stress Testing the Simulation Parameters
         4.4.3.5 Sensitivity to News Characteristics
         4.4.3.6 Robustness to Intentional Manipulation of Responses and Reports
         4.4.3.7 Debunking Response Detection as a Multi-Class Classification Problem
   4.5 Conclusions

5. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS
   5.1 Contributions
   5.2 Future Research Directions

REFERENCES

LIST OF FIGURES

Figure 1. Dissertation Framework
Figure 2. An Example of Post and Comments with Complicated Target
Figure 3. Architecture of IASD
Figure 4. Performance of IASD and Benchmark Methods for Generic Stance Detection
Figure 5. Performance of IASD and Benchmark Methods for False News Stance Detection
Figure 6. General Pipeline of Microtask Crowdsourcing
Figure 7. Problem Setup
Figure 8. Illustrative Example
Figure 9. Roadmap for Model Development
Figure 10. Graphical Representation of the DARF and DARFC Models
Figure 11. Graphical Representation of the DARFCD Model
Figure 12. Testing H1 (ROC AUC as Metric)
Figure 13. Testing H2 (ROC AUC as Metric)
Figure 14. Testing H3 (ROC AUC as Metric)
Figure 15. Testing H4 (ROC AUC as Metric)
Figure 16. Testing H5 (ROC AUC as Metric)
Figure 17. Comparison with Benchmark Models (ROC AUC as Metric)
Figure 18. Performance of DARFC with Varying K (ROC AUC as Metric)
Figure 19. Visualization of Top 10 Clusters Learned by S-DARFC
Figure 20. Visualization of Top 20 Bird Species Learned by S-DARFC in Dataset Throat
Figure 21. An Example of Human Intelligence in Responses and Reports
Figure 22. The Crowd-Powered Framework for False News Detection (CAND)
Figure 23. Empirical Evidence for the Logistic-Normal Assumption
Figure 24. Graphical Representation of the CLNAM Model
Figure 25. Graphical Representation for Learning Prior Beliefs from Data
Figure 26. Data Preprocessing for Weibo Dataset
Figure 27. Data Preprocessing for Twitter Dataset
Figure 28. Performance of CAND and Benchmark Methods in Weibo Dataset
Figure 29. CAND vs. Benchmark Methods under Different IRs in Weibo Dataset (AUC Score as Metric)
Figure 30. Performance of CAND and the Benchmark Methods in Twitter Dataset (Ref. lines: AUC of CAND-12)
Figure 31. CAND vs. Benchmark Methods under Different IRs in Twitter Dataset (AUC Score as Metric)
Figure 32. Performance of CAND-1 and CAND-12 with Different Training Percentages
Figure 33. Performance of CAND-1 and CAND-12 with Different k in k-Fold Cross Validation
Figure 34. A Stacked Bar Graph Showing the Relative Contributions of the Three Types of Judgments and Prior Beliefs in Weibo Dataset (IR=1:7)
Figure 35. A Stacked Bar Graph Showing the Relative Contributions of the Two Types of Judgments and Prior Beliefs in Twitter Dataset (IR=1:1.5)
Figure 36. Testing Early Detection in Weibo Dataset (IR=1:7)
Figure 37. Performance of CAND with and without Non-Text Content Features and Context Features (Ref. lines: AUC of CAND(Y))
Figure 38. AUC Score of CAND-12 under Different Combinations of Feature-Based Classifiers
Figure 39. Performance of CAND-123 on Different Misreport Rates of True News and Report Rates of False News in Weibo Dataset (IR=1:7)
Figure 40. Sensitivity of CAND to News Length in Weibo Dataset (IR=1:7)
Figure 41. Sensitivity of CAND and the Selected Benchmark Methods to News Type in Weibo Dataset (IR=1:7)
Figure 42. Performance of CAND-12 with Various Types of Malicious Users in Weibo Dataset (IR=1:7)
Figure 43. Performance of CAND in Comparison with CAM and E-CAM (Ref. lines: AUC of CAND)

LIST OF TABLES

Table 1. Sample Posts with Corresponding Stance and Sentiment
Table 2. An Illustration of Attention
Table 3. Estimated Worker Reliability in the Illustrative Example
Table 4. Dataset Statistics
Table 5. Comparison Between Proposed Models and Benchmark Models
Table 6. Significance of the Hypothesis Testing (ROC AUC as Metric)
Table 7. Performance of Proposed Models and Benchmark Models (ROC AUC as Metric)
Table 8. News Features Considered in Weibo and Twitter Datasets
Table 9. Examples of Posts and Responses with Different Labels
Table 10. Summary of Methods
Table 11. Six Types of Simulated Malicious Users
Table 12. Theoretical Debunking Rates and Performance Trends

ABSTRACT

Crowd intelligence has been shown to be a successful practice in various traditional applications such as the prediction market. Over the last decades, the proliferation of the Internet has created a group of even more "intelligent" crowds, which are more massive, versatile, and elastic, and can be accessed immediately at a lower cost. Tapping into such crowds has triggered the emergence of many new business applications and provides unprecedented opportunities for existing businesses to innovate and improve their practices. Despite the promise, many managerial concerns must still be addressed in order to successfully distill the wisdom of crowds and create business value. In a typical crowd intelligence scenario, the major concerns include what kind of crowd intelligence to use, how to extract or collect crowd intelligence, and how to aggregate or evaluate the crowd intelligence. In this dissertation, I use data science-driven approaches that leverage deep learning techniques, Bayesian graphical models, and their combinations to address these concerns. Three essays are included in this dissertation. The first essay designs an interactive attention-based deep learning approach to extract crowd opinions from social media. The second essay explores how to aggregate noisy crowd intelligence in the crowdsourcing scenario by designing a deep generative model. The last essay taps into crowd intelligence to tackle the current false news crisis. This dissertation not only illustrates how we can take advantage of crowd intelligence in various real-world applications but also guides future computational design science research in the Information Systems (IS) field.

1. INTRODUCTION

Crowd intelligence has been shown to be a successful practice in various applications such as the prediction market and the traditional opinion poll (Atanasov et al. 2017; Chen et al. 2017). It refers to a distributed problem-solving model that harnesses the cognitive power of large crowds to carry out challenging computational tasks under a certain organizational structure (Li et al. 2017). Its success benefits greatly from the diversity, independence, and massiveness of crowds (Surowiecki 2005). Over the last decades, the proliferation of the Internet has made these characteristics more prominent and has given rise to even more "intelligent" crowds: they have become cheaper, more massive, versatile, and elastic, and can be accessed immediately. Such Internet-based crowd intelligence has, to some degree, facilitated the emergence of many new business applications that tap into the intelligent pools of crowds, such as Wikipedia, the sharing economy, and crowdsourcing (Atanasov et al. 2017; Bayus 2013; Lee et al. 2018; Lukyanenko et al. 2014; Wang et al. 2017; J. Zhang et al. 2019). For instance, Wikipedia is an online encyclopedia created and maintained through the open collaboration of crowd users, and many organizations rely on crowdsourcing communities to collect innovative ideas for new products and services (Bayus 2013).

While the development of the Internet brings about many benefits and opens up many opportunities, new concerns arise because such online crowds are less organized and less reliable.

Let us consider a typical Internet-based crowd intelligence scenario, as shown in Figure 1. To meet certain business needs, the first concern to address is finding the right crowd strategy, namely, what kind of crowd intelligence to use (denoted as C1). In most traditional scenarios, crowd intelligence is usually solicited through an open call and fits some pre-defined targets (Surowiecki 2005). The Internet has also triggered the emergence of huge amounts of implicit crowd intelligence, which contains significant business value. For example, crowd users' discussions regarding the veracity of news can be exploited to detect false news, even though their initial intention is not debunking (Wei, Z. Zhang, et al. 2019); product providers can collect useful feedback on their products from social media platforms or various forums. To exploit such crowd intelligence, business holders need to find the right crowd strategy. Second, crowd intelligence may be in an unstructured form such as text. Using the same example, crowds may contribute their intelligence regarding false news in the form of responses posted in social media, which are unstructured text. Hence, an important managerial concern is how to extract the crowd intelligence (denoted as C2). Last, individual crowd intelligence is often unreliable. For example, in microtask crowdsourcing, a typical crowd intelligence scenario, workers may be unreliable in many ways. They usually have different levels of competence and dedication (Kazai et al. 2011). Because microtasks are often tedious and repetitive, unintentional mistakes are also inevitable (Karger et al. 2011). Some malicious workers even disobey the task rules provided by the requesters (Gadiraju et al. 2015). Although individuals tend to be unreliable, their accumulated efforts often lead to satisfactory results in many applications (Surowiecki 2005). Hence, a fundamental problem is how to aggregate or evaluate crowd intelligence (denoted as C3).

Addressing these concerns (especially C2 and C3) is technically challenging, as the broad range of applications requires our approach to handle both unstructured data and structured relationships. First, as previously mentioned, crowd intelligence may exist in the form of unstructured data such as text; many real-world applications also involve other types of unstructured data such as images. The recent development of deep learning techniques has achieved superior performance in various machine learning tasks (LeCun et al. 2015). Such deep architectures exhibit salient advantages in learning useful patterns from unstructured and high-dimensional data (Lin et al. 2018). In this dissertation, I propose to exploit these deep learning techniques to handle unstructured data. Despite the promise of deep learning, blindly applying it as a black-box model to all scenarios is not a panacea. Structured relationships among data widely exist in crowd intelligence scenarios (Atarashi et al. 2018; Wang et al. 2017), and overlooking such relationships may result in low performance. When structured relationships exist, I propose to build Bayesian graphical models to model them statistically. In Section 4, I show that statistical relational models perform better than black-box deep learning approaches when aggregating crowd intelligence to detect false news. Last, some crowd intelligence applications require us to simultaneously deal with unstructured data and structured relationships. For example, in microtask crowdsourcing, the labeling tasks often involve unstructured text or images, and structured relationships exist among crowd workers, the final targets (i.e., labels), and many other objects. To address the aforementioned concerns in such crowd intelligence applications, I propose to combine deep learning and Bayesian graphical models, which not only learn flexible representations from unstructured high-dimensional data but also find interpretable structure that can be generalized to new tasks (Johnson et al. 2016).


Figure 1. Dissertation Framework

The overall architecture of my dissertation is presented in Figure 1. I propose three essays to address the above three managerial concerns with the proposed techniques under three scenarios, i.e., crowd stance detection, learning from crowd labeling, and false news detection. The first essay tackles the problem of how to extract crowd intelligence (i.e., C2). With the proliferation of user-generated content in social media, mining crowd stance towards a certain target or event from this unstructured data becomes increasingly valuable to support decision making across a variety of applications. Existing literature mainly focuses on detecting crowd stance towards a simple target that can be described by a short phrase. In this work, we consider a prevalent and important but less researched scenario in which we want to detect the stance towards a complicated target. To this end, we first motivate our solution with the psychological procedure of human decision making. Relying on the attention mechanism, we propose a deep learning approach called IASD that mimics the human decision-making procedure. We apply our approach to the false news stance detection scenario and comprehensively evaluate the effectiveness of our model using datasets collected from two popular social media platforms. The experimental results demonstrate the superior performance of our approach over the benchmark methods. The results also suggest that our approach is robust and flexible enough to detect crowd stance under various scenarios (i.e., binary or multi-class).

The second essay mainly deals with the problem of how to aggregate crowd intelligence (i.e., C3) in the microtask crowdsourcing scenario. Microtask crowdsourcing has emerged as a cost-effective approach to collecting large-scale, high-quality labeled data across a wide range of business applications. To aggregate the crowd efforts and achieve certain cumulative goals, various assumptions (e.g., worker heterogeneity in quality) are considered, and models are developed based on these assumptions. However, most current designs of learning from crowds make simple or constrained assumptions, and the conclusions suffer from low interpretability and generalizability. To provide a set of generalizable practices for the future design of learning from crowds, we first formulate several general hypotheses, including worker heterogeneity in reliability, the usefulness of task features and task clustering structure, and others. To test these hypotheses, we propose a flexible deep graphical framework that enables incremental design and hence allows us to conduct before-and-after evaluations of the underlying assumptions. This deep framework also allows us to make less constrained and hence more useful assumptions by modeling complex non-linear relationships with deep learning. An inference algorithm combining variational message passing and amortized learning is then developed to estimate the parameters. Last, we empirically test these hypotheses using eight real-world tasks, including text and image classification. The results also demonstrate the effectiveness of our framework over state-of-the-art benchmark models. Our work not only serves as a cost-effective approach to aggregating crowd annotations but also provides general practices for the next-generation design of learning from crowds.

The last essay studies how to exploit crowd intelligence to tackle the emerging false news crisis by addressing the concerns of what kind of crowd intelligence to use (i.e., C1) and how to aggregate the crowd intelligence (i.e., C3). The explosive spread of false news in social media has severely affected many areas such as news ecosystems, politics, and economics. Machine intelligence has met with limited success in detecting and curbing false news. Human knowledge and intelligence hold great potential to complement machine-based methods, yet they are largely under-explored in current false news detection research, especially with regard to how to utilize such information efficiently. We observe that the crowd contributes to the challenging task of assessing the veracity of news by posting responses or reporting. We propose to combine these two types of scalable crowd judgments with machine intelligence to tackle the false news crisis. Specifically, we design a novel framework called CAND, which first extracts relevant human and machine judgments from data sources including news features and scalable crowd intelligence. The extracted information is then aggregated by an unsupervised Bayesian aggregation model. Evaluation based on Weibo and Twitter datasets demonstrates the effectiveness of crowd intelligence and the superior performance of the proposed framework in comparison with the benchmark methods. The results also generate many valuable insights, such as the complementary value of human and machine intelligence, the possibility of using human intelligence for early detection, and the robustness of our approach to intentional manipulation. This research contributes significantly to the relevant literature on false news detection and crowd intelligence. In practice, our proposed framework serves as a feasible and effective approach to false news detection.

The rest of this dissertation is organized as follows. The three essays are presented in Section 2, Section 3, and Section 4, respectively. Section 5 concludes the dissertation by summarizing the contributions and pointing out potential future research directions. References are attached at the end of the dissertation.

2. ESSAY I: MINING CROWD STANCE FROM SOCIAL MEDIA: A DEEP LEARNING APPROACH BASED ON INTERACTIVE ATTENTION

2.1 Introduction

The last decades have witnessed the dramatic expansion of social media, in which tremendous user-generated content (UGC) has become a major source of online information in place of authority-created content. There is a fair amount of agreement among practitioners and in the academic literature that online UGC generated by crowds is a promising source of information and influences decision making across a variety of domains. Firms have started paying attention to this UGC and trying to "find the needle in the haystack". For instance, UGC can help identify customer needs (Timoshenko and Hauser 2019), inform purchases (Chen et al. 2011), and support investment decisions (Aggarwal and Singh 2013), among other uses (Leung 2009).

Among all types of UGC, the majority express certain opinions from the crowds, which provide a unique and valuable way for social media stakeholders to learn the crowd's opinions and hence profit from them. Take movie reviews as an example: consumers may share their positive or negative opinions regarding certain movies, and based on these opinions, movie theaters can adjust their schedules accordingly. Another example is the tweets posted by Internet surfers. They may express various emotions towards some products, offering an opportunity for targeted advertising for product providers.

As UGC with an opinion is weighted more heavily in supporting decision making (Yin et al. 2014), both the academic community and industry have started exploring how to better understand the crowd's opinions, and many opinion mining techniques have emerged (Dang et al. 2009; Ravi and Ravi 2015; Saleh et al. 2011; Wang et al. 2014). In general, opinion mining is defined as detecting the sentiment polarity of text. The most common classification output is Positive, Negative, or Neutral; depending on the context, more specific categories, such as thrilled, joy, sad, anger, anxiety, etc., may be used. However, in many practical applications, we are more interested in learning the position of the author towards a specific target or topic rather than the polarity of the whole text. For example, a movie review of "I feel bad today, so I watch this movie to make me happy" would be classified as negative in sentiment. But if we are interested in the reviewer's stance towards this movie, she/he likes the movie. Thus, a more valuable problem is detecting crowd stance (e.g., Favor, Against, Neutral, or None) towards a specific target that we care about¹ (Mohammad et al. 2016). Note that the target may or may not be explicitly mentioned in the text. Stance detection differs from general opinion mining (or sentiment analysis) in two aspects: (1) stance detection is concerned with the attitude/position towards a particular target, while sentiment analysis is not; and (2) the sentiment and the stance (towards a target) within the same text are not necessarily aligned with each other. As the movie review example shows, the stance may favor a particular target even if the sentence is negative in sentiment. Table 1 provides some examples to further illustrate the problem of stance detection and how stance may differ from sentiment (Küçük and Can 2020).

Table 1. Sample Posts with Corresponding Stance and Sentiment

Post: "We live in a sad world when wanting equality makes you a troll..."
   Stance Target: Feminist Movement | Stance: Favor | Sentiment: Negative

Post: "I have my work and my faith ... If that's boring to some people, I can't tell you how much I don't care."
   Stance Target: Atheism | Stance: Against | Sentiment: Neutral

Post: "It's a lovely day and I had a great time with my friends."
   Stance Target: Climate Change | Stance: None | Sentiment: Positive

¹ None means the text is unrelated to the target or that no stance information can be extracted; hence, it is a different stance label from Neutral. Depending on the context, the stance label set can be {Agree, Disagree, Discuss, Unrelated}, {Supporting, Denying, Querying, Commenting}, or some other set.

Existing literature primarily focuses on detecting the stance towards a target that can be described by a short phrase (Somasundaran and Wiebe 2009; Xu et al. 2016; Zarrella and Marsh 2016). For example, the explicit or implicit target of a movie review is a specific movie on a third-party movie review site, and the target of a product review is a specific product on an e-commerce platform. See Table 1 for some other examples. However, it is quite common in real-world scenarios that the target is more complicated than a short phrase, especially when crowd users share a post or respond to others in social media. The target could be a piece of news or an event that is hard to conceptualize into a short phrase. Figure 2 shows an example from Sina Weibo where a user talked about an event regarding the college entrance exam and other users expressed their stances towards the event. In this case, the target consists of at least the following four key components: She County of Anhui Province (i.e., where), official announcement (i.e., who), postponed college entrance examination, and original examination paper (i.e., what). Any single short phrase will miss critical information about this event. Due to the complexity of the target, previous approaches dedicated to the simple-target scenario are clearly inapplicable. In this paper, we aim to fill this research gap by studying how to detect the stance of a post towards a complicated target that is usually described with a sentence or paragraph.

This is challenging because the complicated target necessitates a semantic understanding of both sentences. Referring back to Figure 2, the algorithm needs to understand that an event consists of multiple pieces of information and that crowd users may post opinions towards any aspect of the event. For example, the first comment argues about whether the original examination paper is used, while the third comment doubts the location where this event happened.

Figure 2. An Example of Post and Comments with Complicated Target

Motivated by how humans make judgments in the stance detection scenario, we design a deep learning approach called the Interactive Attention-based Stance Detection (IASD) approach. The overall idea is to mimic the decision-making procedure of humans by first locating the relevant parts using interactive attention and then making a final decision on the stance based on the extracted information. Specifically, we first learn both word-level and sentence-level latent representations for each sentence with the bidirectional Long Short-Term Memory model (Bi-LSTM). By interactively comparing the representations, we use an attention mechanism (Bahdanau et al. 2015; Vaswani et al. 2017) to assign a weight to each word, with the hope of locating the words most useful for identifying stance. After weighting each word, we obtain two condensed representations that capture the most critical information. By concatenating the extracted representations and then feeding them into two fully connected layers, IASD provides an accurate prediction of the stance.

The contributions of our paper are as follows:

1. We fill the research gap by identifying the crowd stance detection problem in which a complicated target is involved.

2. Motivated by the psychological procedure of human decision making, we propose a deep learning approach based on interactive attention.

3. We empirically demonstrate the effectiveness of our approach on two real-world datasets. Results show that our novel and psychologically motivated approach outperforms the baseline models.

The rest of this article is organized as follows. Section 2.2 reviews related literature on crowd opinions in social media and stance detection techniques. Section 2.3 formulates the scientific problem and details the proposed method. Section 2.4 comprehensively evaluates our method in comparison with several benchmark methods on two real-world datasets. Section 2.5 concludes the article by presenting implications and discussing future work.

2.2 Literature Review

Given the proliferation of social media and the broad impact of UGC on decision making, we first review studies on how crowd opinions facilitate decision making. As crowd stance detection is related to several important research problems in natural language processing (NLP) such as sentiment analysis and perspective identification, we introduce these relevant problems and elaborate on their differences. Finally, we review major techniques of stance detection, discuss the advantages and disadvantages of each technique, and identify the research gaps.

2.2.1 Crowd Opinions in Social Media

The literature pertaining to crowd opinions in social media (e.g., product reviews, tweets, wikis) focuses on three major themes: identifying motivating factors of voluntary contribution from the crowd, designing incentive mechanisms to foster sustained participation, and investigating the impact of crowd opinions on stakeholders.

Prior studies agree that social interactions among crowds and the benefits obtained from these interactions are the most influential motivating factors for contributing (Zhang et al. 2013). For example, a natural experiment at Chinese Wikipedia shows that a user's utility from contributing public content depends on the number of recipients, due to potential social benefits (Zhang and Zhu 2011). Similarly, the position of a Wikipedia editor in the global collaboration network influences that editor's future contribution behavior (Zhang and Wang 2012). Further, some studies also explore the differences in the participation behaviors of users with different characteristics, such as classifying them as contributors and lurkers (Phang et al. 2015). Other work, building on top of these studies, has explored the design of external interventions and incentive mechanisms that can be used to stimulate UGC production, such as financial incentives (Cabral and Li 2015), reputation-based mechanisms (Wei, M. Zhang, et al. 2019; M. Zhang et al. 2020), and feedback message framings (Huang et al. 2019).

The reason that both practitioners and scholars have devoted themselves to incentivizing users to contribute in social media is that crowd opinions have a great impact on various applications, such as consumer decision making (Chen et al. 2011), product sales (Zhu and Zhang 2010), and firm competition (Kwark et al. 2014). Take online product reviews as an example: the opinions shared by previous consumers (e.g., liking or disliking the product) provide additional information for potential consumers to mitigate the uncertainty about the quality of a product and its fit to consumers' needs (Kwark et al. 2014). Similarly, the information generated by and accumulated on social media can also be used by businesses to identify the target audience, co-create products, and improve the overall customer experience (Rathore et al. 2016).

2.2.2 Stance Detection and Related Problems

Although there is a fair amount of agreement regarding the importance of crowd opinions, the literature is diversified in terms of how to mine these opinions to better support decision making. Some of the main research problems regarding the automatic extraction of crowd opinions include sentiment analysis (Liu 2010), emotion recognition (Sailunaz et al. 2018), perspective identification (Wong et al. 2016), controversy detection (Al-Ayyoub et al. 2018), argument mining (Lippi and Torroni 2016), and stance detection (Küçük and Can 2020; Mohammad et al. 2016). Next, we review relevant studies to differentiate stance detection from other related problems.

Stance detection is also known as stance classification (Misra et al. 2016), stance identification (Zhang et al. 2017), and stance prediction (Qiu et al. 2015). It aims to identify the stance (or position) of the text author towards a target (e.g., an entity, concept, event, idea, opinion, claim, topic, etc.), which is either explicitly mentioned or implied within the text. This research scenario is a relatively recent member of the aforementioned family of research problems (Mohammad et al. 2016; Sobhani et al. 2017). The set of stance classes is usually {Favor, Against, Neither}. In the following, we introduce how it differs from similar concepts such as sentiment analysis, emotion recognition, perspective identification, controversy detection, and argument mining.

First, sentiment analysis and emotion recognition belong to the more general topic of affective computing, which deals with the automatic analysis of all human affects, including sentiments and emotions, regardless of any target (Küçük and Can 2020). Specifically, sentiment analysis aims to detect the sentiment polarity of a text author, and the output classes are usually from the set {Positive, Negative, Neutral} (Ravi and Ravi 2015). Emotion recognition (a.k.a. emotion detection or emotion extraction) is quite similar to sentiment analysis except that it usually uses more diverse emotion classes such as joy, sadness, anger, disgust, anxiety, surprise, fear, and love (Sailunaz et al. 2018). The major difference between these affective computing techniques (i.e., sentiment analysis and emotion recognition) and stance detection is twofold: 1) the former are concerned with sentiment or emotion without a particular target, which the latter requires; and 2) the sentiment/emotion and the stance within the same text may not be aligned. Using the previous example, the polarity of the post "I feel bad today, so I watch this movie to make me happy" is negative in sentiment, while the stance towards the specific movie is favorable. Second, perspective identification is related to the subjective evaluation of the text author and is commonly defined as the extraction of the point of view of the text author from the content (Wong et al. 2016). One significant difference between perspective identification and stance detection is that no stance target is investigated in the former problem. Instead, it aims to identify the perspective of the text author from several different alternatives. For example, in the context of US elections, the alternatives may be the perspectives of Democrats or Republicans. Third, a controversy is a discussion regarding a specific target that provokes opposing opinions among people (Al-Ayyoub et al. 2018). Controversy detection is thus applied to calculate a controversy score for each unit of content. Since stance detection is sometimes performed on controversial topics such as debates, controversy detection can be used as a prospective preprocessing procedure for a stance detection problem. Last, argument mining is applied to extract the possible argument structure in a given textual content (Lippi and Torroni 2016), which includes three stages: detecting the argumentative sentences in the text, extracting the argument components, and forming the final argument graph by connecting the extracted components. It is closely related to stance detection in that the output of argument mining can be used as an additional input to stance detection so as to improve stance detection accuracy (Sobhani et al. 2015).

2.2.3 Techniques of Stance Detection

Most recent studies of stance detection in social media started from two competitions: the SemEval-2016 shared task on stance detection in English tweets (Mohammad et al. 2016) and the NLPCC-ICCPOL-2016 shared task on stance detection in Chinese microblogs (Xu et al. 2016). In these two competitions, there are five labeled targets in both the training and testing datasets. Since the competitions, many researchers have started developing new techniques for stance detection in social media. The literature can be categorized into feature-based machine learning approaches and deep learning-based approaches.

The first stream of research usually exploits off-the-shelf machine learning algorithms and tries to improve performance by introducing novel features. The Support Vector Machine (SVM) is the most commonly employed feature-based machine learning approach for stance detection. For instance, Wojatzki and Zesch (2016) proposed to use n-gram features, syntactic features, stance-lexicon features, concept features, and target transfer features; these features are fed into a two-level stacked SVM classifier to predict stance. Addawood et al. (2017) developed a unique feature set including lexical, syntactic, Twitter-specific, and argumentation features; again, the features are fed into an SVM to automatically classify stance for controversial topics. Logistic regression is another frequently used classifier for stance detection (Ferreira and Vlachos 2016; Kucher et al. 2018). It is used either as the sole classifier or as part of an ensemble classifier. Other machine learning approaches such as Naïve Bayes (Misra et al. 2016) and decision trees (Simaki et al. 2017) have also been applied to the stance detection problem, either as the main best-scoring approach or as the baseline approach. Some researchers have also attempted to develop ensemble learners that combine SVM, logistic regression, decision trees, naïve Bayes, and many others (Lai et al. 2017). The major limitations of feature-based machine learning methods include: 1) the structure of the model is often too shallow to capture the underlying semantics of sentences; 2) these methods often rely on laborious feature engineering to create hand-crafted and task-specific features; and 3) the algorithm is often not specifically designed to utilize the target information, due to the inflexibility of off-the-shelf machine learning algorithms.

Compared with feature-based methods, many recent studies have begun applying deep learning techniques, primarily Convolutional Neural Networks (CNNs) (Hercig et al. 2017; Zhang et al. 2017), Recurrent Neural Networks (RNNs) (Xu et al. 2018), and their variants (Rajendran et al. 2018), to detect stance. Deep learning models are able to avoid laborious feature engineering by automatically learning text representations. Within this research strand, the LSTM is the most frequently used RNN structure and performs favorably in many stance detection tasks. For example, Rajendran et al. (2018) developed several RNN models to classify the stance of a news article towards its headline. They found that RNN models and their extensions showed significant performance variations on the stance detection task and that the bidirectional LSTM model achieved the best accuracy. Other studies have also tried to combine CNNs and RNNs to further improve performance. For instance, Zhang et al. (2017) constructed a supervised classifier by stacking an LSTM layer over a CNN layer to detect patients' stance towards complementary medicine options in an online health community.

The major advantage of RNNs is the ability to encode contextual information and capture long-term dependencies. However, RNNs assume that all words make equal contributions to the text semantics, and they are unable to pay attention to the most salient parts of a text. To address this issue, researchers have started incorporating the attention mechanism into neural networks to enable their models to pay attention to the most relevant words or phrases. A target-specific attention model was proposed and shown to perform better than neural models without an attention mechanism (Du et al. 2017). It was later extended by using token-level attention with a gated structure (Zhou et al. 2017). To detect topical stance, Dey et al. (2018) proposed a two-phase solution (i.e., subjectivity detection and stance detection) in which an LSTM is used to encode sentences and attention is adopted in both phases. Although the attention mechanism is generally considered useful for improving the performance of stance detection, the existing work mainly focuses on the scenario where the target is predetermined as a short phrase. For example, the majority of the literature utilizes public datasets released in competitions (e.g., SemEval-2016 and NLPCC-ICCPOL-2016) to validate the effectiveness of proposed approaches. In these datasets, the target information is quite limited and often provided in advance. Some examples include atheism, climate change, the feminist movement, and Hillary Clinton. There is still a significant research gap regarding an effective approach to detect crowd stance when the target is much more complicated than a short phrase. For example, to detect false news using crowd users' stance, the target is usually an event or a post (Wei, Z. Zhang, et al. 2019). To fill the gap, we propose a psychologically motivated deep learning approach based on interactive attention. This study provides a unique perspective on how to handle complicated targets in the task of crowd stance detection.

2.3 Method

2.3.1 Problem Definition

Before introducing the technical details, we first formulate the problem of mining crowd stance from social media. The goal is to classify the stance of a social media post or comment towards another post. The definition of this problem is given below.

Definition 1. $P$ is a social media post consisting of $N$ words: $P = \{W_1, W_2, \cdots, W_N\}$. $C$ is a comment on $P$ and consists of $M$ words: $C = \{V_1, V_2, \cdots, V_M\}$. $L$ is a set of labels describing the potential stance of $C$ towards $P$. Depending on the context, $L$ could be {Favor, Against, Neutral, None}, {Supporting, Denying, Querying, Commenting}, or some other set. The proposed problem is to learn a mapping $f$ from the pair $(P, C)$ to $L$:

$$f: (P, C) \mapsto L.$$
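To make the formulation concrete, the following minimal sketch (not the authors' code; the variable names and the example label set are illustrative) shows the input-output structure implied by Definition 1.

```python
# A minimal sketch of the stance detection setup in Definition 1.
# The label set L is context-dependent; the four-class set below is one example.
from typing import List, Tuple

LABELS = ["Favor", "Against", "Neutral", "None"]   # one possible label set L

# A training instance: a tokenized post P (N words), a tokenized comment C (M words),
# and the stance label of C towards P.
Instance = Tuple[List[str], List[str], str]

example: Instance = (
    "the original examination paper would still be used".split(),   # post P
    "where did you find the announcement".split(),                   # comment C
    "Against",                                                       # stance of C towards P
)

# The learning problem is to fit a mapping f: (P, C) -> L from such labeled pairs.
```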

2.3.2 Psychological Motivation

We motivate our technical approach with the psychological procedure of human decision making in the stance detection scenario. Generally, when we face a stance detection task, we first read through the two sentences (i.e., the target sentence and the stance sentence) and then iteratively find the most important and relevant part of each sentence, namely the part to which we should pay more attention. This procedure is interactive because when extracting such information from one sentence, we need to keep the other one in mind. Referring back to the example in Figure 2, after we read each pair of post and comment, we find the part to which we should pay more attention by interactively comparing them. The results are shown in Table 2. In the first case, the major overlap between the post and the comment is the attitude towards whether the original examination paper is used. Hence, when deciding the stance, it is beneficial to focus on this relevant part and ignore unrelated information such as where and who. The second case provides multiple ways for us to identify the stance. The comment doubts not only some specific aspects of the post (i.e., the officials and the use of the original examination paper) but also the source of the news. It even directly points out that this is a rumor. Nevertheless, no matter how we identify the stance, it is helpful to focus more on the most relevant phrases. In the last case, the comment questions the location of this event. Hence, ignoring any information except the location will help decision making. Our analysis suggests that it is easier to identify the stance of the second sentence towards the first sentence if we pay more attention to certain parts of the sentences.

Table 2. An Illustration of Attention

Post (identical in all three cases): #She County of Anhui Province postponed the college entrance examination due to rainstorm# The water level is very high and it is officially announced that "The original examination paper would still be used." Is it reasonable to use the original examination paper? Now the information is easily available unless it is not a nationwide unified exam.

Case 1 Comment: You can tell with your toes that they will not use the original test; otherwise, they will be criticized by people all over the nation.

Case 2 Comment: Announced from the officials "the original examination paper would still be used"? Where did you find the announcement? Don't spread rumors here.

Case 3 Comment: This news was previously about Jiangsu. How can it become Anhui now?

2.3.3 Interactive Attention-Based Stance Detection (IASD)

Motivated by the psychological procedure of human decision making in the stance detection scenario, we design a deep learning approach called Interactive Attention-based Stance Detection (IASD). It takes as input the post $P = \{W_1, W_2, \cdots, W_N\}$ and one of its comments $C = \{V_1, V_2, \cdots, V_M\}$, and yields a predicted stance label as output. The overall architecture of IASD is presented in Figure 3. In the following, we introduce the details of the layers in IASD.

[Figure 3 depicts the IASD architecture as a stack of layers: Input, Embedding Layer, Context Encoding Layer (Bi-LSTM), Attention Layer, Concatenation Layer, Dense Layer, and Softmax Layer.]

Figure 3. Architecture of IASD

2.3.3.1 Embedding Layer

We use word embeddings to represent each word in the input sentences as a dense vector of size $D$. By looking up the embedding matrix $E$, the embedding layer transforms the post $P$ and the comment $C$ into $\{\mathbf{e}_1, \mathbf{e}_2, \cdots, \mathbf{e}_N\}$ and $\{\mathbf{f}_1, \mathbf{f}_2, \cdots, \mathbf{f}_M\}$ respectively, where $\mathbf{e}_n \in \mathbb{R}^{D}$ and $\mathbf{f}_m \in \mathbb{R}^{D}$. The values of the embedding matrix can be initialized randomly and then learned during model training. In practice, we initialize $E$ with pre-trained word vectors obtained from popular word embedding algorithms such as skip-gram (Mikolov et al. 2013) and GloVe (Pennington et al. 2014).

2.3.3.2 Context Encoding Layer

For the post, to encode the contextual information, we feed the word vectors $\{\mathbf{e}_1, \mathbf{e}_2, \cdots, \mathbf{e}_N\}$ into a bi-directional LSTM layer (Xu et al. 2018). The LSTM layer captures the long-term dependencies among words and the bi-directional structure encodes both the left and right contexts of each word. Formally,

$$[\overrightarrow{\mathbf{h}}_1, \overrightarrow{\mathbf{h}}_2, \overrightarrow{\mathbf{h}}_3, \ldots, \overrightarrow{\mathbf{h}}_N] = \overrightarrow{LSTM}(\{\mathbf{e}_1, \mathbf{e}_2, \cdots, \mathbf{e}_N\}),$$
$$[\overleftarrow{\mathbf{h}}_1, \overleftarrow{\mathbf{h}}_2, \overleftarrow{\mathbf{h}}_3, \ldots, \overleftarrow{\mathbf{h}}_N] = \overleftarrow{LSTM}(\{\mathbf{e}_1, \mathbf{e}_2, \cdots, \mathbf{e}_N\}),$$

where $\overrightarrow{\mathbf{h}}_n \in \mathbb{R}^{d}$ and $\overleftarrow{\mathbf{h}}_n \in \mathbb{R}^{d}$ are the outputs for each word in the forward and backward LSTM layers respectively. Note that the last output of each LSTM layer (i.e., $\overrightarrow{\mathbf{h}}_N$ and $\overleftarrow{\mathbf{h}}_1$) is a latent representation that encodes all information of the post. By concatenating the results of the forward and backward LSTM layers, we obtain a word-level representation $[\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3, \ldots, \mathbf{h}_N]$ ($\mathbf{h}_n = [\overrightarrow{\mathbf{h}}_n; \overleftarrow{\mathbf{h}}_n] \in \mathbb{R}^{2d}$) which captures the context, and a sentence-level representation $\mathbf{h} = [\overrightarrow{\mathbf{h}}_N; \overleftarrow{\mathbf{h}}_1] \in \mathbb{R}^{2d}$ which captures all information of the post. We will use both representations in the following attention layer. We apply a similar procedure to the comment and denote the obtained word-level representation as $[\mathbf{g}_1, \mathbf{g}_2, \mathbf{g}_3, \ldots, \mathbf{g}_M]$ and the sentence-level representation as $\mathbf{g}$.

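As an illustration, the sketch below shows how the embedding and context-encoding layers described above could be expressed in TensorFlow/Keras (the library named in Section 2.4.1). The vocabulary size, embedding size $D$, LSTM size $d$, and sequence length $N$ are assumed values rather than the tuned hyperparameters, and loading the pre-trained embedding matrix $E$ is omitted.

```python
# A sketch of the embedding and Bi-LSTM context-encoding layers for the post side,
# under assumed sizes; the comment side is encoded by an analogous pair of layers.
import tensorflow as tf

vocab_size, D, d, N = 20000, 300, 128, 50     # assumed vocabulary, embedding, LSTM, length

post_ids = tf.keras.Input(shape=(N,), dtype="int32", name="post_token_ids")

# Embedding layer: token ids -> dense vectors e_1, ..., e_N (optionally initialized with E).
embedding = tf.keras.layers.Embedding(vocab_size, D, mask_zero=True)
e = embedding(post_ids)                        # shape (batch, N, D)

# Bi-LSTM: return_sequences gives the word-level representations [h_1, ..., h_N];
# return_state exposes the final forward/backward hidden states used to build
# the sentence-level representation h = [h_fwd_N ; h_bwd_1].
bi_lstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(d, return_sequences=True, return_state=True)
)
h_words, fwd_h, fwd_c, bwd_h, bwd_c = bi_lstm(e)            # h_words: (batch, N, 2d)
h_sentence = tf.keras.layers.Concatenate()([fwd_h, bwd_h])  # shape (batch, 2d)
```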
2.3.3.3 Attention Layer

As previously mentioned, the attention layer attempts to locate the words to which the algorithm should pay more attention. We use the comment as an example. To calculate how much weight to put on each word, we compare each word of the comment with the whole post. As both the word representation (i.e., $\mathbf{g}_m$) and the post representation (i.e., $\mathbf{h}$) are vectors of size $2d$, we use the dot product as the score function to evaluate their relevance. Formally, the score $s_m$ between $\mathbf{g}_m$ and $\mathbf{h}$ is defined as:

$$s_m = \mathbf{g}_m \cdot \mathbf{h}.$$

Next, we compute the attention weight $a_m$ for the $m$-th word based on the score via a softmax operation:

$$a_m = \frac{\exp(s_m)}{\sum_{m'=1}^{M} \exp(s_{m'})}.$$

Finally, by weighting each word of the comment with $a_m$, we can condense the representation of the comment into $\tilde{\mathbf{g}} \in \mathbb{R}^{2d}$:

$$\tilde{\mathbf{g}} = \sum_{m=1}^{M} a_m \mathbf{g}_m.$$

This vector is supposed to capture only the information that helps identify the stance. By following a similar procedure, we can obtain a condensed representation of the post, which is denoted as $\tilde{\mathbf{h}} \in \mathbb{R}^{2d}$.
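The attention computation above amounts to dot-product scoring followed by a softmax and a weighted sum. The sketch below (assumed shapes; the tensors are random stand-ins for the Bi-LSTM outputs) illustrates it for the comment side; applying the same function with the roles of post and comment swapped yields the condensed post representation.

```python
# A sketch of the interactive dot-product attention: score each word representation
# against the other sentence's summary vector, softmax-normalize the scores, and
# take the weighted sum to obtain the condensed representation.
import tensorflow as tf

def interactive_attention(word_reps, other_sentence_rep):
    """word_reps: (batch, M, 2d); other_sentence_rep: (batch, 2d) -> (batch, 2d)."""
    scores = tf.einsum("bmd,bd->bm", word_reps, other_sentence_rep)  # s_m = g_m . h
    weights = tf.nn.softmax(scores, axis=-1)                         # a_m
    return tf.einsum("bm,bmd->bd", weights, word_reps)               # sum_m a_m g_m

# Illustrative shapes: batch of 2, comment length M = 7, hidden size 2d = 256.
g_words = tf.random.normal([2, 7, 256])     # stand-in for the comment's word-level reps
h_sentence = tf.random.normal([2, 256])     # stand-in for the post's sentence-level rep
g_tilde = interactive_attention(g_words, h_sentence)   # condensed comment, shape (2, 256)
```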

2.3.3.4 Prediction Layer

The outputs of the attention layer are two vectors (i.e., $\tilde{\mathbf{h}}$ and $\tilde{\mathbf{g}}$) that capture all relevant information. To combine the information from both sentences, we concatenate $\tilde{\mathbf{h}}$ and $\tilde{\mathbf{g}}$ and feed them into a fully connected dense layer to encode their interactions into a latent vector. The learned vector is finally fed into a softmax layer to generate the predicted probability of each label.

2.3.3.5 Model Training

As the number of potential stance labels can be larger than two, we use cross-entropy as the loss function and train our model in an end-to-end manner. The Adam optimizer is used to learn the model parameters due to its computational efficiency and low memory requirements (Kingma and Ba 2015). We also adopt the dropout strategy in the embedding and LSTM layers to reduce model over-fitting (Srivastava et al. 2014).
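Putting the pieces together, a compilation sketch under the assumptions of the previous snippets is shown below. For brevity, the condensed representations are treated as model inputs of size $2d$; the four-class output, dropout rate, hidden size, and learning rate are illustrative placeholders rather than the reported settings.

```python
# A sketch of the prediction layer and training configuration: concatenate the two
# condensed representations, pass them through a dense layer and a softmax output,
# and train with cross-entropy, Adam, and dropout.
import tensorflow as tf

two_d, num_classes = 256, 4                    # illustrative sizes; e.g., 4 stance labels

h_tilde = tf.keras.Input(shape=(two_d,), name="condensed_post")
g_tilde = tf.keras.Input(shape=(two_d,), name="condensed_comment")

x = tf.keras.layers.Concatenate()([h_tilde, g_tilde])
x = tf.keras.layers.Dropout(0.3)(x)            # dropout to reduce over-fitting
x = tf.keras.layers.Dense(128, activation="relu")(x)
probs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

model = tf.keras.Model([h_tilde, g_tilde], probs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),   # Adam optimizer
    loss="sparse_categorical_crossentropy",                   # cross-entropy loss
    metrics=["accuracy"],
)
```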

2.4 Empirical Evaluation

To validate the effectiveness of our proposed approach, we comprehensively evaluate IASD and compare it with benchmark methods using two real-world datasets from the social media platforms Twitter and Weibo. In this section, we first describe the experimental design including datasets, benchmark methods, evaluation metrics, and experimental procedures. Then, we report and discuss the experimental results.

2.4.1 Experimental Design

We use datasets collected from Twitter and Sina Weibo to evaluate IASD and the benchmark methods. These two datasets were released in prior studies on fake news detection in which the authors exploited the stance expressed in comments to help detect false news (Wei, Z. Zhang, et al. 2019). The first dataset, Weibo, consists of 2,186 pieces of false news and 9,455 pieces of true news. To construct a set of post-comment pairs, 12,000 comments were randomly sampled from all responses to the posts. Each post-comment pair was labeled by experts into four categories or two categories. In the first, fourfold scheme, each pair was labeled based on whether the comment is supporting, denying, querying, or commenting on the content of a given post. The numbers of the four types of labels are 119, 1,606, 469, and 9,806 respectively. In the false news detection scenario, this fourfold scheme may be redundant, as the most essential and relevant labeling for recognizing the veracity of a piece of news is whether each comment is debunking or not (Wei, Z. Zhang, et al. 2019). Hence, each post-comment pair was also annotated based on whether the comment is debunking the post or not. As a result, the numbers of debunking and non-debunking pairs are 2,075 and 9,925 respectively. To train the model, we randomly sample 70% of the pairs as the training set, 15% as the validation set, and the remaining 15% as the testing set.

The second dataset was originally collected from Twitter (Ma et al. 2016; Ma et al. 2017), where the ground truth of each news post was confirmed through an expert-oriented fact-checking website. Similar to the Weibo dataset, all responses to each post were collected. To construct a testbed, 8,103 post-comment pairs were randomly sampled and then labeled by experts into {Supporting, Denying, Querying, Commenting} and {Debunking, Not Debunking}. In total, this dataset contains 943 pieces of false news and 1,007 pieces of true news. Among the sampled 8,103 post-comment pairs, the numbers of supporting, denying, querying, and commenting comments are 871, 1,007, 251, and 5,974 respectively. Under the binary labeling scheme, 6,845 of the 8,103 comments do not debunk the post. Last, we adopt the same training/validation/testing split ratio as for the Weibo dataset.

To evaluate the effectiveness and superiority of our IASD approach, we compare it with two widely used benchmark methods from prior studies on stance detection, i.e., Convolutional Neural Networks (CNNs) and the Bidirectional Long Short-Term Memory network (Bi-LSTM). Given a post and one of its comments, we use two CNN (or Bi-LSTM) modules to learn their hidden representations respectively (Kim 2014; Xu et al. 2018). The learned representations are then concatenated and fed into two fully connected layers. The model is finally followed by a softmax layer to generate a predicted probability for each potential label.

Mining crowd stance towards fake news and true news in social media is, in essence, a classification problem. Thus, to measure the performance of these methods, we adopt several widely accepted classification metrics. When the task is binary, we use PR AUC and the F1 score as the major metrics; we also report precision and recall for reference. We use PR AUC rather than ROC AUC because our dataset is imbalanced and the PR curve gives a more accurate picture of an algorithm's performance (Davis and Goadrich 2006). For the same reason, accuracy is not used, as it does not provide adequate information on a classifier's functionality (He and Garcia 2008). The F1 score is preferred over precision and recall because, as the harmonic mean of precision and recall, it requires the algorithm to have decent precision and recall scores. When the problem is a fourfold prediction problem, PR AUC is no longer reported because it only works in the binary setting; instead, we use accuracy as another major metric besides the F1 score.
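For reference, the metrics above can be computed with scikit-learn as in the hedged sketch below; average precision is used here as the PR AUC summary, and the arrays are illustrative placeholders for the test labels and model outputs.

```python
# A sketch of the binary evaluation metrics (PR AUC, F1, precision, recall) and the
# accuracy metric used for the fourfold task; values below are placeholders.
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, accuracy_score)

y_true = np.array([1, 0, 0, 1, 0, 1])                 # illustrative binary labels
y_score = np.array([0.9, 0.2, 0.4, 0.7, 0.1, 0.3])    # predicted probability of class 1
y_pred = (y_score >= 0.5).astype(int)                 # hard predictions

pr_auc = average_precision_score(y_true, y_score)     # PR AUC (average precision)
f1 = f1_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# For the fourfold task, accuracy (and F1) would be computed on the multi-class labels.
accuracy = accuracy_score(y_true, y_pred)
```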

All methods, including our proposed IASD method and the benchmark methods, are implemented in Python 3 using the TensorFlow library. The word vectors in the Weibo and Twitter datasets were initialized with open-source embeddings trained on Weibo data (Li et al. 2018) and Twitter data (Pennington et al. 2014) respectively. For a fair comparison, we conducted a random grid search with 40 trials to tune the hyperparameters, including the dropout rate, the number of LSTM units, the number of units in the latent layer, etc. All methods were evaluated over 10 runs with a different randomization seed in each run.
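The random grid search can be organized as in the following sketch; the search space and the train_and_validate stub are hypothetical placeholders for the actual training routine.

```python
# A sketch of a 40-trial random grid search; the search space and the training stub are
# hypothetical placeholders, not the exact configuration used in the experiments.
import random

SEARCH_SPACE = {
    "dropout_rate": [0.1, 0.2, 0.3, 0.5],
    "lstm_units": [64, 128, 256],
    "latent_units": [32, 64, 128],
}

def train_and_validate(config):
    """Hypothetical stand-in: train a model with `config` and return its validation F1 score."""
    return random.random()  # placeholder value; replace with actual training and evaluation

best_config, best_score = None, -1.0
for trial in range(40):                  # 40 random trials, as described above
    config = {name: random.choice(values) for name, values in SEARCH_SPACE.items()}
    score = train_and_validate(config)
    if score > best_score:
        best_config, best_score = config, score
```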

2.4.2 Experimental Results

In this section, we compare the performance of our proposed IASD method with the benchmark methods in two tasks. The first task is based on the stance label set {Supporting, Denying, Querying, Commenting} and aims at mining the generic crowd stance expressed in comments towards the target post; we name this task generic stance detection. The second task is based on the stance label set {Debunking, Not Debunking} and aims at extracting whether the author of a comment is debunking the target post. As this task is specifically designed for the application of false news detection, we call it false news stance detection (Küçük and Can 2020). By evaluating our method in these two tasks, we can examine whether it can be applied in different scenarios with varied stance labels. This is important because, in real-world scenarios, the number of potential stance labels may vary depending on the context.

Figure 4. Performance of IASD and Benchmark Methods for Generic Stance Detection: (a) Weibo Dataset; (b) Twitter Dataset

We first report the results of the generic stance detection problem in the form of bar charts. For better visualization, the y-axis of the following graphs might not start at 0 or end at 1, although all metrics range from 0 to 1. Whenever an error bar is reported, the result is averaged over 10 runs and the standard error of the mean (SEM) is used. Figure 4 shows the performance of IASD in comparison with the benchmark methods under various metrics in both datasets. In both subfigures, the F1 score of IASD is used to plot the reference lines. Using F1 score as the major metric, we can see that our IASD approach significantly outperforms all benchmark methods, especially in the Weibo dataset (see Figure 4a). In particular, compared with Bi-LSTM, IASD increases the F1 score from 55.19% to 57.20%. Similar results are observed in the Twitter dataset, although the performance gap is less pronounced (see Figure 4b). Overall, the results suggest the superior performance of our proposed method over other deep learning-based stance detection approaches. In addition, the major difference between our IASD method and Bi-LSTM is that we introduce two attention layers to mimic the human decision-making procedure; the performance gap between them shows the effectiveness of this design.

In the false news detection problem, the crowd stance towards a piece of news generally falls into two categories, i.e., debunking or not debunking. For this task, we similarly report the results in the form of bar charts. Figure 5 compares the performance of IASD with the benchmark methods on the Weibo and Twitter datasets respectively. Note that, in this binary task, we use PR AUC as one of the major metrics in place of accuracy. In both subfigures, the PR AUC score of IASD is used to plot the reference lines. Figure 5 clearly shows that our proposed IASD approach performs significantly better than CNN and Bi-LSTM in terms of both F1 score and PR AUC. For example, compared with the best benchmark (i.e., CNN), IASD increases the PR AUC score from 78.93% to 80.64% in the Weibo dataset and from 48.88% to 50.06% in the Twitter dataset. Again, the results suggest the superiority of our proposed approach for mining crowd stance towards fake news in social media.

Figure 5. Performance of IASD and Benchmark Methods for Fake News Stance Detection: (a) Weibo Dataset; (b) Twitter Dataset

2.5 Discussion and Conclusions

With the rapid growth of UGC in social media, mining crowd stance towards a certain target, event, or idea from this unstructured data becomes increasingly valuable for supporting decision making across a variety of domains. Existing literature is limited to detecting crowd stance towards a simple target that can be described by a short phrase. In this work, we consider a prevalent and important but less researched scenario in which we want to detect the stance towards a complicated target. To address this problem, we motivate our solution with the psychological procedure of human decision making. Relying on the attention mechanism, we propose a deep learning approach called IASD which mimics this decision-making procedure. We applied our approach to the false news stance detection scenario and comprehensively evaluated the effectiveness of our model using datasets collected from two widely used social media platforms, i.e., Sina Weibo and Twitter. Compared with the benchmark methods CNN and Bi-LSTM, our approach achieves significantly higher F1 scores and/or PR AUC scores in both datasets. The experiments also demonstrate that our approach is robust and flexible enough to detect stance across various scenarios (i.e., binary or multi-class).

Our study not only contributes to the growing body of literature on stance detection but also sheds light on how to design a deep learning approach. First, unlike existing research that primarily focuses on generic opinion mining with no target or only a simple target, our study provides a pioneering effort in detecting crowd stance towards a complicated target, which could be a product, concept, idea, event, piece of news, or social media post. Second, the design procedure of our approach provides unique insights for future research in terms of how to design a deep learning architecture. We motivate our approach with the psychological procedure of human decision making and design a deep learning architecture that mimics this procedure. The superior performance of our method over the benchmarks demonstrates the effectiveness of such a design.

Our research also has practical implications and actionable insights for many relevant stakeholders. First, for social media platforms that host or heavily rely upon user-generated content, understanding “what other people think” about the platform is crucial for improving platform design. Knowing users' stance is also helpful in motivating users to generate more and higher-quality content, which benefits the sustainable development of platforms. Second, our approach provides significant value to stakeholders in social media advertising. Social media platforms, such as Weibo and Twitter, serve as a critical channel for product advertising. Given limited time and budget, the most important question for product providers is to understand the effectiveness of their advertising. Our proposed approach could help detect crowd stance towards an advertisement and support timely adjustments. Third, our research provides actionable insights for public opinion surveillance from the perspective of public administration by detecting crowd stances towards newly released policies or emergent public events. Last but not least, our approach also helps in specific fields such as false news detection. Our approach can accurately identify crowd users' stance towards certain news; by aggregating these crowd opinions, we can efficiently detect false news and hence curb the false news epidemic (Wei, Z. Zhang, et al. 2019).

This study is not without limitations, some of which may pave the way for future work on mining crowd stance. First, we only use false news stance detection as a case study, although stance detection can be applied in various applications. In the future, we are interested in adapting our approach to other business intelligence problems, such as public opinion surveillance and online review analysis. Second, our approach is motivated by the psychological procedure of human decision making, and we mimic this procedure using the attention mechanism. To further verify the effectiveness of this mechanism, we plan to visualize the attention weights and conduct behavioral experiments to check whether such attention helps humans make decisions. Last, we choose two popular social media platforms (i.e., Sina Weibo and Twitter) as testbeds in our study. Future research can incorporate more social media platforms with varied characteristics.

3. ESSAY II: HOW TO DESIGN NEXT-GENERATION LEARNING FROM CROWDS IN A PRINCIPLED WAY? AN INTERPRETABLE FRAMEWORK COMBINING DEEP LEARNING AND GRAPHICAL MODELS

3.1 Introduction

Microtask crowdsourcing has emerged as a cost-effective approach to collecting large-scale, high-quality labeled data across a wide range of business applications. It refers to a distributed problem-solving model that harnesses the cognitive and computational power of large crowds to complete a huge number of dividable tasks efficiently (J. Luo et al. 2018). These tasks, such as image labeling and sentence translation, are often easy for humans to complete but remain challenging for computers because they rely on humans' cognitive ability (Wang et al. 2017).

Figure 6 shows the general pipeline of microtask crowdsourcing (J. Luo et al. 2018). To complete a set of microtasks, task requesters (e.g., firms or individuals) first post them on web-based crowdsourcing platforms2, such as Amazon Mechanical Turk (AMT) and Figure Eight. Crowd workers on the platforms can request and finish microtasks voluntarily or for a small payment. After collecting the completed microtasks, the platforms aggregate the results and deliver them to the task requesters. The immediate, cheap, and elastic workforce supply in such microtask crowdsourcing systems makes it possible to complete a huge number of dividable microtasks at low cost and with high throughput (Wang et al. 2017).

2 In many cases, the task requesters may hold the crowdsourcing platforms themselves.


Figure 6. General Pipeline of Microtask Crowdsourcing

Firms, ranging from behemoths like Facebook and Google to small start-ups, have started using microtask crowdsourcing to meet their daily business needs. For example, to curb the spread of fake news, Facebook relies on crowd users to prescreen fake news by encouraging them to flag fake news when reading Facebook posts (Kim et al. 2018). Google launched Crowdsource by Google, which allows crowd users to contribute their intelligence by completing tasks such as image label verification and handwriting recognition; the results are used to train the Artificial Intelligence (AI) system and ultimately improve Google's services such as Google Photos, Maps, and Translate3. The emerging self-driving car industry also heavily relies on microtask crowdsourcing, ranging from training data collection to real-time map updates4. Many other companies have, more or less, integrated crowdsourcing into their business pipelines, either directly or through online crowdsourcing platforms (Wang et al. 2017).

3 https://crowdsource.google.com/

4 https://www.wired.com/story/your-next-gig-map-the-streets-for-self-driving-cars/

Despite the promise, significant challenges remain in terms of how to combine crowd efforts. This is difficult because crowd workers may be unreliable in many ways. They usually have different levels of competence and dedication (Kazai et al. 2011). Because the microtasks are often tedious and repetitive, unintentional mistakes are also inevitable (Karger et al. 2011). Some malicious workers even disobey the task rules provided by the requesters (Gadiraju et al. 2015). Hence, a fundamental question is how to aggregate the annotations provided by a set of potentially unreliable crowd workers (i.e., learning from crowds). To this end, various assumptions (e.g., worker heterogeneity in quality) are considered, and models are developed based on these assumptions. However, most current designs of learning from crowds make simple or constrained assumptions. One example is the relationship between the ground truth labels and the tasks. Intuitively, the ground truth labels should be related to the tasks themselves.

A sentence with obvious positive words and not preceded by a negation tends to be positive in sentiment even though the received annotations might be mostly negative if the crowd workers happen to be sloppy (Kazai et al. 2011); an image with clear dog patterns tends to depict a dog even if some malicious workers annotate it as a cat. Yet, most of the current research on annotation aggregation either completely ignores such relationships or, to accommodate the probabilistic framework, poses rigid assumptions on the mapping from task features to ground truth labels. Two representative examples are logistic regression (Bi et al. 2014; Raykar et al. 2010; Yan et al. 2010) and topic models (Rodrigues et al. 2017; Simpson et al. 2015). Although posing rigid assumptions retains model tractability, it sacrifices modeling capability. In addition, conclusions drawn from the existing literature often suffer from low interpretability and generalizability because they are not evaluated in a principled way. Specifically, although the final performance increases over the benchmarks, it is unclear whether the improvement is attributable to the newly added assumptions in the design. This lowers the trustworthiness of such conclusions and prevents them from being generalized to new settings. One salient example is considering task heterogeneity in difficulty: as far as we know, no relevant study compares its model with one that turns off the task difficulty component (Kim and Ghahramani 2012; Wang et al. 2017; Whitehill et al. 2009). In this regard, it is still unclear whether and how much considering task heterogeneity in difficulty contributes to the problem of learning from crowds.

To provide a set of generalizable practices for the future design of learning from crowds, we first formulate several general hypotheses, covering worker heterogeneity in reliability, task heterogeneity in difficulty, the usefulness of task features, task clustering structure, and semi-crowdsourced design. To test these hypotheses, we propose a deep generative framework with structured latent variables that combines deep learning and graphical models (Johnson et al. 2016). The flexibility of the graphical model enables incremental design and hence allows us to conduct before-and-after evaluations of the underlying assumptions. The deep framework allows us to make less constrained, and hence more useful, assumptions by modeling complex non-linear relationships with deep learning. An inference algorithm combining variational message passing (Bishop 2006) and amortized learning (Kingma and Welling 2014) is then developed to estimate the parameters. Last, we empirically test these hypotheses using eight real-world tasks including text and image classification. The results also demonstrate the effectiveness of our framework over state-of-the-art benchmark models.

The rest of this paper is organized as follows. Section 3.2 reviews relevant literature. Section 3.3 develops five general hypotheses for the design of learning from crowds. Section 3.4 details the proposed deep generative modeling framework and Section 3.5 presents the details of model inference. Section 3.6 empirically tests the proposed hypotheses and evaluates the proposed framework by describing the experimental design and presenting the results. We conclude the paper with contributions, implications, limitations, and future research directions in Section 3.7.

3.2 Related Work

In this section, we review several streams of relevant literature. We first survey the literature on answer aggregation in crowdsourcing. Since the proposed framework is a deep generative model that combines deep learning and Bayesian graphical models, we then review the literature on deep generative models and their inference.

3.2.1 Truth Inference in Crowdsourcing

Truth inference, also called annotation aggregation or answer aggregation, is the process of inferring the ground truth labels of instances where each instance receives multiple potentially noisy labels (Sheng and Zhang 2019). Its original application was aggregating diagnosis data from multiple doctors by evaluating the credibility of each doctor (Dawid and Skene 1979). The recent proliferation of AI technologies requires large-scale, high-quality data with ground truth labels more than ever and hence, to some degree, has triggered the development of online crowdsourcing markets such as AMT. Research on truth inference in crowdsourcing has become increasingly popular in recent years.

Based on the type of crowdsourcing task, the literature can be divided into single-label (i.e., binary or multi-class) truth inference (Wang et al. 2017), multi-label truth inference (Li et al. 2019), real-valued truth inference (Raykar et al. 2010), and many others. For example, in multi-label truth inference, each instance is associated with a subset of candidate labels simultaneously (J. Luo et al. 2018); in real-valued crowdsourcing scenarios, workers provide a real-valued label for each task rather than a categorical one. In the following, we focus on single-label crowdsourcing because single-label classification is among the most frequently used tasks in real-world applications. However, the conclusions in this study can be generalized to other tasks; see Section 3.7 for further discussion.

The literature can be further divided into two categories based on whether task features are taken into account. Traditional single-label truth inference algorithms only exploit the noisy labels provided by crowd workers (Yin et al. 2020). Most algorithms consider workers' reliability and adopt a probabilistic framework because the relationships among the ground truth labels, the noisy labels, and worker reliability can be described by a probabilistic procedure. Intuitively, the generation of the observed noisy labels depends on the interaction of the underlying true labels and worker reliability (Moreno et al. 2015). In the seminal work, the DS (Dawid-Skene) model uses a confusion matrix to describe each label source's reliability and estimates the parameters with an expectation-maximization algorithm (Dawid and Skene 1979). Its Bayesian extension, called independent Bayesian Classifier Combination (iBCC), was later proposed (Kim and Ghahramani 2012). The DS model has also been extended in multiple ways, including designing better inference algorithms (Liu et al. 2012) and relying on the spectral method to ensure theoretical guarantees (Zhang et al. 2016). With the flexibility of probabilistic models, various other factors besides worker reliability have been incorporated into the models to further improve performance, including task difficulty (Kim and Ghahramani 2012; Wang et al. 2017; Whitehill et al. 2009) and the grouping of workers (Kajino et al. 2013; Moreno et al. 2015; Simpson et al. 2013).

When task features are taken into account, the truth inference algorithm also learns a predictive classifier that predicts the ground truth labels from the task features and in turn helps estimate the ground truth labels. Intuitively, the ground truth labels should be related to the task features, and hence considering the task features will be helpful in the aggregation. In early work along this research stream, to accommodate the probabilistic framework, researchers posed rigid assumptions on the mapping from task features to ground truth labels. Two representative examples are logistic regression (Bi et al. 2014; Raykar et al. 2010; Yan et al. 2010) and topic models (Rodrigues et al. 2017; Simpson et al. 2015). For example, in the literature using logistic regression, researchers regress the probability of a positive label on the task features (Raykar et al. 2010); in the literature using topic models, the generation of each instance (i.e., sentence) follows the structure of the topic model and hence fits the probabilistic framework (Simpson et al. 2015). Although posing rigid assumptions retains model tractability, it sacrifices modeling capability. In crowdsourcing scenarios, the task features are often high-dimensional data (e.g., images and text), and hence the mapping from features to ground truth labels tends to be complex. Inspired by the successful applications of deep learning in learning complicated non-linear transformations, recent research has started using deep neural networks to model the relationship between task features and ground truth labels (Dizaji and Huang 2018; Rodrigues and Pereira 2018). The resulting models belong to the family of deep generative models, and our proposed approach falls into this category. To further improve performance, the clustering structure of tasks has also been considered in some studies (Yin et al. 2020). As the training of deep learning models is usually data-hungry, some studies introduce an extra set of unlabeled data to facilitate training (Atarashi et al. 2018).

Truth inference is also related to the literature on classifier combination, where each weak classifier outputs a result for the same machine learning task and the results are then aggregated (Tulyakov et al. 2008). In this stream, many ensemble methods (e.g., bagging and boosting) and their variants consider both how to construct the classifiers and how to aggregate the results. We only review the literature on result aggregation where a fixed set of classifiers is given, since the goal in our setting is to combine human and machine judgments. To combine prediction scores, simple rules (e.g., the sum rule, product rule, max rule, and min rule) are frequently used and have shown good performance in many tasks (Mohandes et al. 2018). To aggregate prediction classes, voting-based strategies and their variants are widely used. Majority Voting (MV) selects the candidate class that has a majority (e.g., more than half of the votes in binary classification); it is extended by assigning a different weight to each classifier (Mohandes et al. 2018). Some other studies take a probabilistic perspective by introducing parameters that govern the behaviors of the classifiers and the ground truth. For example, a Bayesian model called independent Bayesian Classifier Combination (iBCC) models the generative process of classifiers' predictions by taking each classifier's reliability into account (Kim and Ghahramani 2012).

3.2.2 Deep Generative Models and Inferences

As mentioned previously, our proposed approach is a deep generative model (DGM) with structured latent variables (Johnson et al. 2016), which combines probabilistic graphical models (PGMs) and deep neural networks (DNNs). PGMs and DNNs are two popular choices for modeling real-world data and the relationships within the data. PGMs can find interpretable structure in the data but often pose rigid assumptions (Johnson et al. 2016). One of the most popular inference algorithms for PGMs is variational inference, which enables efficient inference, especially when the model has a conjugate structure (Bishop 2006; Lin et al. 2018). DNNs are capable of learning flexible representations from complex high-dimensional data but may not allow interpretable or tractable probabilistic structure (Johnson et al. 2016); their most widely used learning algorithm is stochastic gradient descent. Although PGMs and DNNs employ two fundamentally different modeling and learning strategies, recent research on deep generative models tries to combine them and exploit their complementary strengths (Lin et al. 2018). A DGM is a kind of probabilistic graphical model that parameterizes the conditional distributions between random variables with deep neural networks (Lin et al. 2018; Y. Luo et al. 2018). With this marriage between PGMs and DNNs, DGMs are able to learn flexible data representations while maintaining a certain level of model interpretability (Lin et al. 2018). They have been successfully applied in many applications such as image generation (Kingma and Welling 2014).

The major obstacle to applying DGMs widely is model inference. Several studies have proposed general solutions rather than bespoke ones developed for one model at a time (Johnson et al. 2016; Lin et al. 2018). One salient example is the structured variational autoencoder (SVAE), which uses recognition networks to generate conjugate graphical model potentials so that message passing and natural gradients can be easily combined with amortized variational inference (Johnson et al. 2016). This inference method was later simplified and generalized (Lin et al. 2018).

3.3 Hypothesis Development

Before developing hypotheses for the design of crowdsourced answer aggregation algorithms, we first formulate our research scenario. We consider a typical crowd labeling scenario in which task requesters use online crowdsourcing markets (e.g., AMT, CrowdFlower) to collect the ground truth labels of a set of tasks. These tasks may come from various business scenarios. For instance, a product provider may want to label whether a tweet contains the intention to purchase a specific product to enable precise advertising targeting (Z. Zhang et al. 2020); a restaurant manager may want to label the sentiment of the received reviews in order to improve the provided services. In these scenarios, higher accuracy of the obtained labels is often associated with lower economic costs. Like previous literature, we focus on the most frequently used and representative binary labeling task because such tasks are often the building blocks of many other complex tasks (Wang et al. 2017). In Section 3.4.4, we show that our approach can be easily extended to the multi-class setting.

Suppose the task requesters want to collect labels for $N$ instances, where the feature of the $i$-th instance is represented as a vector $\boldsymbol{o}_i \in \mathbb{R}^d$. Limited by the budget, each task is usually assigned to only a few workers (typically three to five) on the crowdsourcing platform. Assume there are $J$ workers in total. Let $y_{i,j} \in \{0, 1, \text{null}\}$ represent the annotation of the $i$-th instance provided by the $j$-th worker, where $y_{i,j} = \text{null}$ means the $i$-th instance is not annotated by the $j$-th worker. As shown in Figure 7, the relationship between the $N$ instances and the $J$ workers can be represented as a bipartite graph. For example, the first instance is annotated by the first three workers with labels 1, 0, and 0, respectively. Given the above input, the goal of this annotation aggregation problem is to infer the ground truth of each instance, denoted by $\boldsymbol{z} \in \{0,1\}^N$.

Figure 7. Problem Setup

Note: Red lines and labels represent label 1, while black ones indicate label 0. “xx” denotes a specific real value.

We use an illustrative example to motivate the technical insights behind our proposed approach and to develop hypotheses that can guide future crowdsourcing practice. Assume we want to label six tweets based on whether their authors have the intention to purchase an iPhone, as shown in Figure 8. The first three tweets are associated with positive labels and the remaining three with negative labels. Seven workers with varying reliability annotate the tweets: two reliable workers who provide labels with high accuracy (100.0% and 83.3% respectively), three normal workers whose annotations are mostly correct (accuracy 66.7%), and two random workers with accuracies of 50.0% and 33.3% respectively. If we use the simple heuristic method Majority Voting (MV), i.e., the label is positive only if at least half of the received annotations are positive, the accuracy of the aggregated results is 66.7%, which would be far from satisfactory in real-world business applications.

Figure 8. Illustrative Example

Note: Each shaded block indicates a positive label, while an empty block indicates a negative label.

Let's consider a counterfactual scenario where we know each worker's reliability in advance. In the first case, we use only the annotations from the reliable workers, or simply ignore the random workers, and then apply the majority voting algorithm; both settings recover the correct ground truth. In another case, weighting each worker's annotations by her/his reliability (i.e., 100.0%, 83.3%, 66.7%, 66.7%, 66.7%, 50.0%, 33.3%) also provides a correct estimation of the ground truth labels. These two counterfactual analyses indicate that knowing worker reliability is potentially helpful in ground truth estimation. In the literature, worker reliability has been used intensively to improve the performance of annotation aggregation. Intuitively, it makes sense to estimate each worker's reliability and assign lower weights to annotations provided by less reliable workers (Wang et al. 2017). Hence, we hypothesize the following:

Hypothesis 1 (H1). Considering worker heterogeneity in quality should contribute to the annotation aggregation problem.
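To make the contrast between plain and reliability-weighted voting concrete, the following is a minimal sketch on a small hypothetical annotation matrix (not the exact data of Figure 8), assuming worker reliabilities are known.

```python
# Majority voting vs. reliability-weighted voting on a toy annotation matrix.
# The matrix and reliabilities below are hypothetical and only illustrate the mechanics.
import numpy as np

# rows = instances, columns = workers; np.nan marks "null" (not annotated by that worker)
Y = np.array([
    [1, 1, 0, 1, 1, 0, np.nan],
    [0, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, np.nan, 0, 0],
])
reliability = np.array([1.0, 0.83, 0.67, 0.67, 0.67, 0.5, 0.33])  # assumed known, for illustration

def majority_vote(Y):
    """Label is 1 if at least half of the received annotations are 1."""
    votes_for_1 = np.sum(Y == 1, axis=1)
    n_annotations = np.sum(~np.isnan(Y), axis=1)
    return (votes_for_1 >= n_annotations / 2).astype(int)

def weighted_vote(Y, w):
    """Weight each worker's vote by her/his (known or estimated) reliability."""
    mask = ~np.isnan(Y)
    score_1 = np.where(mask & (Y == 1), w, 0.0).sum(axis=1)
    score_0 = np.where(mask & (Y == 0), w, 0.0).sum(axis=1)
    return (score_1 >= score_0).astype(int)

print(majority_vote(Y), weighted_vote(Y, reliability))
```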

Some recent studies on truth inference have started to incorporate task features into the algorithm (Dizaji and Huang 2018; Simpson et al. 2015; Yin et al. 2020). The intuition is that the ground truth labels are associated with the task features, so building a predictive classifier can offer valuable information. This idea can be demonstrated with the illustrative example in Figure 8. Using MV, we are able to recover the ground truth of the first two sentences with high confidence, as they both receive 6 positive labels out of 7. Although the estimated label for the third sentence is wrong, we are not confident in this estimate because the number of received negative labels (i.e., 4) is close to the decision boundary (i.e., 3.5). In this case, is it possible to further improve the results beyond the above method (i.e., considering worker reliability)? Note that the first three sentences are all positive and share a similar sentence pattern, i.e., wanting or needing an iPhone. This motivates us to train a predictive classifier that captures this pattern. Given that 3 out of 7 workers already provide positive labels, we are likely to correct the result of MV and regard the third sentence as positive. In this research, we propose the following hypothesis:

Hypothesis 2 (H2). Considering task features should contribute to the annotation aggregation problem.

Another way to utilize the task features is to cluster instances into groups based on some intrinsic characteristics. Although each task is different, tasks might be similar to each other when represented in a latent space and thus form groups. The instances of the same group are supposed to share the same label (Yin et al. 2020). For example, in Figure 8, although these sentences are syntactically different, they can be semantically clustered into several groups. The first three sentences form a group as they all directly express the author's intention to purchase a new iPhone. Assuming such clustering information is available, applying the simple MV algorithm to this cluster will yield a positive prediction. Similarly, the fourth and fifth sentences fall into the same group as they both indicate the authors currently own an iPhone that is doing well, and MV will predict this cluster as negative. As a result, we correctly recover the ground truth labels of all instances even though we apply the simple MV strategy. The analysis also suggests why considering clustering among instances improves the aggregation accuracy: the label noise of a single instance is counteracted when we consider the clustering among instances. Hence, we formulate the following hypothesis regarding task clustering:

Hypothesis 3 (H3). Considering the clustering of tasks should contribute to the annotation aggregation problem.
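As a concrete illustration of the cluster-then-vote idea behind H3, the sketch below clusters hypothetical latent task representations with K-means and applies majority voting within each cluster; in our framework, the clustering is learned jointly with the other components rather than as a preprocessing step.

```python
# Cluster-then-vote aggregation (H3): cluster instances in a latent feature space, then
# apply majority voting per cluster. The feature matrix X and annotation matrix Y are
# hypothetical; in the proposed models the clustering is learned jointly.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                        # hypothetical latent representations of 6 instances
Y = rng.integers(0, 2, size=(6, 7)).astype(float)   # hypothetical 0/1 annotations from 7 workers

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

labels = np.zeros(len(X), dtype=int)
for k in np.unique(clusters):
    members = clusters == k
    # pool all annotations received by the cluster's instances, then take one majority vote
    pooled_positive = (Y[members] == 1).sum()
    pooled_total = Y[members].size
    labels[members] = int(pooled_positive >= pooled_total / 2)
```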

In H2 and H3, both considering task features and considering task clustering rely heavily on learning a good task representation. Considering task features requires us to build a predictive classifier, in which case a good task representation improves the predictive performance. Task clustering assumes that tasks of the same group are similar in the latent representation space. Motivated by the success of semi-supervised learning (Van Engelen and Hoos 2020), we propose to use an extra set of unannotated instances to facilitate representation learning. This is applicable because, in crowdsourcing scenarios, it is easy to obtain a large set of instances without annotations. Following the previous literature on semi-crowdsourced clustering (Y. Luo et al. 2018; Yi et al. 2012), we name this technique semi-crowdsourced learning and formulate the following hypothesis.

Hypothesis 4 (H4). Semi-crowdsourced learning should contribute to the annotation aggregation problem.

Last, task difficulty inherently varies. Some previous studies consider task heterogeneity in difficulty in their proposed models (Bi et al. 2014; Kajino et al. 2012; Wang et al. 2017; Welinder, Branson, Perona, et al. 2010; Whitehill et al. 2009). In general, they empirically show that considering task difficulty contributes to the annotation aggregation problem. The general idea is that our estimation of a worker's reliability should not be impacted much if she/he makes mistakes on difficult tasks (Wang et al. 2017; Whitehill et al. 2009). We argue that this behavior follows logically in any crowdsourcing model that involves both worker reliability and task difficulty. Specifically, consider the relationships among three variables: the probability of a correct annotation, worker reliability, and task difficulty. The probability of a correct annotation should be a function of worker reliability and task difficulty; intuitively, it increases with worker reliability and decreases with task difficulty. When a wrong annotation is observed, the potential reasons are that the worker is unreliable and/or the task is difficult. If we know the task is easy, we are likely to think the worker is unreliable, whereas if we know the task is difficult, we have less evidence to believe the worker is unreliable. Hence, if a worker makes mistakes on a difficult task, our estimation of the worker's reliability should not be impacted much. From the algorithmic perspective, this means we assign lower weights to difficult tasks when estimating worker reliability.

Intuitively, this idea is consistent with common sense. However, it may not contribute to the performance of annotation aggregation. We use Figure 8 to illustrate this point. Among the six sentences, the third one is the most difficult, such that even one reliable worker (i.e., the second one) annotates it wrongly. Based on the aforementioned idea, this sentence should be weighted less when estimating worker reliability; in the extreme case, we simply ignore the third sentence. Table 3 shows the estimated worker reliability when considering all sentences and when considering all sentences except the third one. We then weight each worker's annotations by the estimated reliability and find that considering task difficulty in this way decreases the accuracy from 100% to 83.3%. This is because ignoring hard instances causes 1) workers who made correct annotations on hard instances to appear less reliable and 2) workers who made wrong annotations on hard instances to appear more reliable; see the last two rows of Table 3 for whether each worker annotated the third sentence correctly and how the estimated reliability changes after ignoring it. Such results are clearly not what we expect when considering task difficulty. Our preliminary analysis thus seems to differ from the previous literature. Hence, we formulate the following hypothesis and empirically test it in Section 3.6. Note that we are currently unable to theoretically prove or reject this hypothesis; we leave this as future work.

Hypothesis 5 (H5). Considering task heterogeneity in difficulty should contribute to the annotation aggregation problem.

Although the assumptions behind these hypotheses have more or less been studied in the literature, they have not been comprehensively evaluated in terms of whether and how much they contribute to the annotation aggregation problem. For example, although task difficulty is considered in some previous literature, no study compares its model with one that turns off the task difficulty component. Hence, it is still unclear whether the final performance improvement is attributable to considering task heterogeneity in difficulty. In addition, the abovementioned assumptions have not been integrated within a unified framework. In this work, we propose a deep generative modeling framework in which each hypothesis can be flexibly incorporated, and we use this framework to test the proposed hypotheses.

Table 3. Estimated Worker Reliability in the Illustrative Example

Workers                               Reliable  Reliable  Normal  Normal  Normal  Random  Random  Accuracy of Weighted MV
Reliability based on all sentences    100%      83.3%     66.7%   66.7%   66.7%   50%     33.3%   100%
Reliability without 3rd sentence      100%      100%      60%     60%     80%     60%     40%     83.3%
Correct annotation in 3rd sentence?   Yes       No        Yes     Yes     No      No      No
Change in estimated reliability                 ↑         ↓       ↓       ↑       ↑       ↑

3.4 Deep Generative Modeling Framework

To test the proposed hypotheses, we propose a deep generative model and incorporate the corresponding assumptions incrementally, one by one. Figure 9 shows the roadmap for our model development.

Majority Voting (MV) and independent Bayesian Classifier Combination (iBCC) are two classic benchmarks from the literature (Kim and Ghahramani 2012). MV treats each worker equally and predicts for each instance the label with the most “votes” from workers. iBCC takes a Bayesian approach and models the reliability of each worker (H1). Our proposed models are presented in the two dotted boxes and are named based on their underlying assumptions. For example, the Deep Aggregation model with worker Reliability and task Features (DARF) considers H1 and H2; DARFC extends DARF by taking the task clustering structure into account; Semi-crowdsourced DARFC (S-DARFC) additionally uses semi-crowdsourced learning to facilitate representation learning. To keep the roadmap uncluttered, we group the models without and with semi-crowdsourcing in the top and bottom boxes respectively. Figure 9 also marks the major differences between models in terms of the underlying hypotheses; by comparing two models connected by an arrow, we can test each hypothesis. Note that, although H4 can be tested in multiple ways, we focus on the pair DARFC and S-DARFC. In the following, we introduce the DARF, DARFC, S-DARFC, and DARFCD models sequentially.

[Figure 9 depicts the model development roadmap: MV → iBCC (+worker Reliability, H1) → DARF (+task Feature, H2) → DARFC (+task Clustering, H3) → DARFCD (+task Difficulty, H5); adding Semi-crowdsourcing (H4) yields S-DARF, S-DARFC, and S-DARFCD. Notation: R = worker Reliability, F = task Feature, C = task Clustering, S = Semi-crowdsourcing, D = task Difficulty.]

Figure 9. Roadmap for Model Development

3.4.1 DARF

First, we define the generative process of the proposed DARF model. The model is composed of two components: the annotation component handles the labeling behaviors of the workers and describes how the observed annotations (i.e., $\boldsymbol{y}$) are generated; the feature component defines the generative process of the instance features (i.e., $\boldsymbol{o}$). In the following, we take instance $i$ as an example and introduce the details of each component.

In both components, the generation of the observed data (i.e., $\boldsymbol{o}_i \in \mathbb{R}^D$ or $\{0,1\}^D$, and $\boldsymbol{y}_i \in \{0,1,\text{null}\}^J$) is conditioned on the ground truth $z_i \in \{0,1\}$ of instance $i$. We use $\lambda \in [0,1]$ to represent the probability of an instance having ground truth 1. Conditioning on $\lambda$, the ground truth label of instance $i$ (i.e., $z_i$) follows a Bernoulli distribution:

$z_i \mid \lambda \sim \mathrm{Bernoulli}(\lambda).$ (1)

The feature component is a combination of a variational autoencoder (VAE) and a Gaussian mixture (Bishop 2006; Kingma and Welling 2014). For each data point $\boldsymbol{o}_i$, we have a corresponding latent variable $\boldsymbol{x}_i \in \mathbb{R}^d$, and the decoding term $p(\boldsymbol{o}_i \mid \boldsymbol{x}_i, \gamma)$ follows a multivariate Gaussian distribution (in the case of real-valued data) or a multivariate Bernoulli distribution (in the case of binary data) whose parameters are computed by flexible neural networks parameterized by $\gamma$. For example, when the data is real-valued (i.e., $\boldsymbol{o}_i \in \mathbb{R}^D$), $\boldsymbol{o}_i$ is generated as follows5:

$\boldsymbol{o}_i \sim \mathcal{N}\big(\boldsymbol{\mu}_\gamma(\boldsymbol{x}_i), \mathrm{diag}(\boldsymbol{\sigma}^2_\gamma(\boldsymbol{x}_i))\big).$ (2)

When the data is binary (i.e., $\boldsymbol{o}_i \in \{0,1\}^D$), $\boldsymbol{o}_i \sim \mathrm{Bernoulli}(\boldsymbol{\mu}_\gamma(\boldsymbol{x}_i))$. The generation of the latent variable $\boldsymbol{x}_i$ is a multivariate Gaussian mixture where the number of mixture components equals the number of potential ground truth values (i.e., two). Specifically, depending on the value of $z_i \in \{0,1\}$, $\boldsymbol{x}_i$ follows a different multivariate Gaussian distribution:

$\boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{\mu}_{z_i}, \boldsymbol{\Sigma}_{z_i}).$ (3)

5 For computational efficiency, the multivariate Gaussian distribution is often assumed to have a diagonal covariance matrix.

As previously mentioned, in a crowdsourcing scenario, each worker may be unreliable for various reasons. We adopt the widely used two-coin assumption for modeling each worker's labeling behavior (Kim and Ghahramani 2012; Raykar et al. 2010). Specifically, worker $j$ has a two-by-two confusion matrix $\boldsymbol{M}_j$ with sensitivity $\alpha_j = p(y_{i,j}=1 \mid z_i=1)$ and specificity $\beta_j = p(y_{i,j}=0 \mid z_i=0)$. When the ground truth label is one, worker $j$ provides a correct annotation with probability $\alpha_j$; otherwise, worker $j$ annotates correctly with probability $\beta_j$. Formally,

$y_{i,j} \sim \mathrm{Bernoulli}(\alpha_j)^{z_i}\,\mathrm{Bernoulli}(1-\beta_j)^{1-z_i}.$ (4)

To explicitly write down the likelihood of the annotations $\boldsymbol{y}$, we define $I_{i,j} = \mathbb{I}(y_{i,j} \neq \text{null})$. Hence,

$p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \prod_{i=1}^{N}\prod_{j=1}^{J}\big(\mathrm{Bernoulli}(\alpha_j)^{z_i}\,\mathrm{Bernoulli}(1-\beta_j)^{1-z_i}\big)^{I_{i,j}}.$ (5)

Finally, following previous probabilistic Bayesian modeling literature (Blei et al. 2003; He et al. 2019; Wei, Z. Zhang, et al. 2019), we place conjugate Beta and Normal-Inverse-Wishart (NIW) priors over the Bernoulli and multivariate Gaussian distributions respectively:

$\lambda \sim \mathrm{Beta}(\tau_1, \tau_2),$ (6)

$\alpha_j \sim \mathrm{Beta}(\rho_{\alpha,1}, \rho_{\alpha,2}),$ (7)

$\beta_j \sim \mathrm{Beta}(\rho_{\beta,1}, \rho_{\beta,2}),$ (8)

$\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k \sim \mathrm{NIW}(\boldsymbol{m}, \kappa, \boldsymbol{S}, v).$ (9)

The graphical model representation of DARF is presented in Figure 10a; the priors are omitted for clarity. The middle plate indicates that there are $N$ instances to annotate, the top plate shows the $J$ workers who work on the instances, and the bottom plate represents the two mixture components. Empty circles denote the latent variables we want to infer, while shaded circles represent observable data, i.e., the received annotations and instance features.
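To make the DARF generative process concrete, the following sketch samples a synthetic dataset from Equations (1)-(4); the neural decoder is stood in for by a fixed random linear map, the NIW prior is simplified to fixed draws, and all sizes are illustrative.

```python
# Sampling a synthetic dataset from the DARF generative process (Eqs. 1-4). The neural
# decoder mu_gamma is replaced by a fixed random linear map, the NIW prior is simplified
# to fixed draws, and all sizes/hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, J, d, D = 100, 7, 4, 10           # instances, workers, latent dim, feature dim

lam = rng.beta(1.0, 1.0)             # lambda ~ Beta(tau1, tau2)
alpha = rng.beta(4.0, 1.0, size=J)   # worker sensitivities ~ Beta
beta = rng.beta(4.0, 1.0, size=J)    # worker specificities ~ Beta
mu = rng.normal(0, 3, size=(2, d))   # per-class Gaussian means (stand-in for NIW draws)
W = rng.normal(size=(d, D))          # stand-in "decoder": o_i = x_i W + noise

z = rng.binomial(1, lam, size=N)                        # ground truth labels   (Eq. 1)
x = mu[z] + rng.normal(size=(N, d))                     # latent representations (Eq. 3, Sigma = I)
o = x @ W + 0.1 * rng.normal(size=(N, D))               # observed features      (Eq. 2)

# Annotations: correctness of each worker's label depends on z_i (Eq. 4).
p_correct = np.where(z[:, None] == 1, alpha[None, :], beta[None, :])
correct = rng.binomial(1, p_correct)
y = np.where(correct == 1, z[:, None], 1 - z[:, None])  # observed (noisy) annotations
```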


Figure 10. Graphical Representation of the DARF and DARFC Models: (a) DARF; (b) DARFC

3.4.2 DARFC and S-DARFC

In the DARF model, instances are divided into two groups based on label categories. However, real-world examples are likely to have a finer-grained clustering structure, and it is crucial to take such structure into account, especially for instances that receive few reliable annotations but form clear clusters (Yin et al. 2020).

To exploit the potential clustering structure, the DARFC model groups instances into clusters such that instances of the same group share the same instance properties (i.e., ground truth labels and latent representations). The graphical representation is presented in Figure 10b, with the same notation as in Figure 10a. Unlike the DARF model, where instances are divided into two groups based on the ground truth labels, DARFC assumes both the ground truth labels and the latent instance representations are generated from a mixture over the underlying clusters. Taking instance $i$ as an example, the ground truth label $z_i$ follows a mixture of Bernoulli distributions, and the latent representation $\boldsymbol{x}_i$ follows a mixture of multivariate Gaussian distributions. Specifically, let $K$ denote the number of clusters. For each instance, we have a corresponding latent cluster assignment $c_i \in \{1, 2, \ldots, K\}$. It indicates which latent cluster instance $i$ comes from and is drawn from a categorical distribution over $\{1, 2, \ldots, K\}$ with mixing coefficients $\boldsymbol{\omega}$. Formally, the conditional distribution of $c_i$ follows

$c_i \sim \mathrm{Categorical}(\boldsymbol{\omega}),$ (10)

where $\boldsymbol{\omega}$ satisfies $\boldsymbol{\omega} \in [0,1]^K$ and $\sum_{k=1}^{K}\omega_k = 1$. Conditioned on $c_i$, we draw the ground truth label $z_i$ from the corresponding Bernoulli distribution:

$z_i \sim \mathrm{Bernoulli}(\lambda_{c_i}),$ (11)

and the latent representation $\boldsymbol{x}_i$ follows

$\boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{\mu}_{c_i}, \boldsymbol{\Sigma}_{c_i}).$ (12)

The generative process of $\boldsymbol{o}_i$ and $\boldsymbol{y}_i$ is the same as in the DARF model; we omit the details here. Finally, we impose conjugate priors over $\boldsymbol{\omega}$ and $\boldsymbol{\lambda}$ to complete the DARFC model:

$\boldsymbol{\omega} \sim \mathrm{Dirichlet}(\boldsymbol{\zeta}),$ (13)

$\lambda_k \sim \mathrm{Beta}(\tau_1, \tau_2).$ (14)

Based on H4, the DARFC model can be extended by introducing an extra set of instances without annotations (denoted as $\boldsymbol{o}_i^{(s)}$, $i \in \{1, 2, \ldots, N^{(s)}\}$) to facilitate better hidden representation learning. The generative process of $\boldsymbol{o}_i^{(s)}$ is similar to that of $\boldsymbol{o}_i$:

$c_i^{(s)} \sim \mathrm{Categorical}(\boldsymbol{\omega}),$ (15)

$\boldsymbol{x}_i^{(s)} \sim \mathcal{N}\big(\boldsymbol{\mu}_{c_i^{(s)}}, \boldsymbol{\Sigma}_{c_i^{(s)}}\big),$ (16)

$\boldsymbol{o}_i^{(s)} \sim \mathcal{N}\big(\boldsymbol{\mu}_\gamma(\boldsymbol{x}_i^{(s)}), \mathrm{diag}(\boldsymbol{\sigma}^2_\gamma(\boldsymbol{x}_i^{(s)}))\big).$ (17)

3.4.3 DARFCD

H5 posits that incorporating task difficulty is beneficial for the annotation aggregation problem. We follow previous literature in modeling how task difficulty interacts with worker reliability and annotations (Wang et al. 2017). The observed annotation of instance $i$ provided by worker $j$ (i.e., $y_{i,j}$) is jointly determined by three factors: the ground truth label of instance $i$, the reliability of worker $j$, and the task difficulty. The observed annotation is more likely to be the ground truth label when the worker is more reliable or the task is easier.

Different from DARF and DARFC, the reliability of worker $j$ is modeled by $\alpha_j \in (-\infty, +\infty)$ and $\beta_j \in (-\infty, +\infty)$. A larger $\alpha_j$ (or $\beta_j$) indicates that worker $j$ is more likely to provide a correct annotation when the ground truth label is one (or zero); $\alpha_j = 0$ (or $\beta_j = 0$) means worker $j$ always provides a random guess when the ground truth label is one (or zero). We assume $\alpha_j$ and $\beta_j$ follow normal distributions:

$\alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2),$ (18)

$\beta_j \sim \mathcal{N}(\mu_\beta, \sigma_\beta^2).$ (19)

The difficulty of instance $i$ is modeled by $h_i \in [0, +\infty)$: $h_i = 0$ means instance $i$ is very difficult to label, and a larger $h_i$ indicates lower difficulty. With the reliability of worker $j$ and the difficulty of instance $i$, we can define an instance-specific sensitivity and specificity. The sensitivity of worker $j$ towards instance $i$ is defined as $\mathrm{sigmoid}(\alpha_j h_i)$ and the specificity as $\mathrm{sigmoid}(\beta_j h_i)$. For a very difficult instance (i.e., $h_i = 0$), even a high-quality worker provides a random guess (i.e., $\mathrm{sigmoid}(0) = 0.5$); when an instance is very easy (i.e., $h_i \to +\infty$), even a low-quality worker can label it correctly (i.e., $\mathrm{sigmoid}(+\infty) = 1$). As $h_i \in [0, +\infty)$, we are unable to find a prior distribution that keeps the model conjugate; hence, we treat $h_i$ as a trainable variable and estimate it using gradient descent. With the instance-specific sensitivity and specificity, we generate the annotations as follows:

$y_{i,j} \sim \mathrm{Bernoulli}(\mathrm{sigmoid}(\alpha_j h_i))^{z_i}\,\mathrm{Bernoulli}(1 - \mathrm{sigmoid}(\beta_j h_i))^{1-z_i}.$ (20)

The graphical representation of the DARFCD model is presented in Figure 11.

Figure 11. Graphical Representation of the DARFCD Model
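The instance-specific sensitivity and specificity in Equation (20) can be illustrated with a few lines of code; the parameter values below are arbitrary and only show how both quantities shrink toward 0.5 as a task becomes harder ($h_i \to 0$).

```python
# Instance-specific sensitivity/specificity under DARFCD (Eq. 20); values are illustrative.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

alpha_j, beta_j = 2.0, 1.5             # worker reliability parameters (real-valued in DARFCD)
h = np.array([0.0, 0.5, 2.0, 10.0])    # task "easiness": 0 = hardest, larger = easier

sensitivity = sigmoid(alpha_j * h)     # P(y = 1 | z = 1) for this worker on each task
specificity = sigmoid(beta_j * h)      # P(y = 0 | z = 0)
# h = 0 yields 0.5 (random guessing even for a reliable worker); large h approaches 1.
```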

3.4.4 Multi-Class Extensions

Although we focus on the binary scenario as a case study to test our proposed hypotheses, it is straightforward to extend the previous models to multi-class settings. In this case, the ground truth label $z_i$ may take one of multiple potential values, and we only need to replace all probabilistic distributions related to $z_i$ with their multivariate variants. Specifically, wherever they appear in the generative process, Bernoulli and Beta distributions are replaced with Categorical and Dirichlet distributions respectively. Taking the DARFC model as an example, $z_i$ follows a Categorical distribution and $\lambda_k$ follows a Dirichlet distribution. The two-by-two confusion matrix is also extended to its multivariate version, where the $k$-th row of the matrix gives the probability of providing each label when the ground truth label is $k$ (Kim and Ghahramani 2012). Each row follows a Dirichlet distribution, and the annotations $\boldsymbol{y}$ are generated from a Categorical distribution. When task heterogeneity in difficulty (i.e., H5) is considered, the task-specific confusion matrix of worker $j$ towards instance $i$ is defined as $\mathrm{sigmoid}(\boldsymbol{M}_j h_i)$.
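A minimal sketch of the multi-class annotation model follows: each worker has a confusion matrix with Dirichlet-distributed rows that generate categorical annotations. The sizes and concentration parameters below are illustrative.

```python
# Multi-class extension sketch: each worker's confusion matrix has Dirichlet-distributed
# rows that govern the annotation given the true class. Sizes and concentrations are illustrative.
import numpy as np

rng = np.random.default_rng(1)
L, N, J = 3, 50, 5                              # number of classes, instances, workers

pi = rng.dirichlet(np.ones(L))                  # class prior
z = rng.choice(L, size=N, p=pi)                 # ground truth labels

# Confusion matrices: extra mass on the diagonal so workers are mostly correct.
M = np.stack([rng.dirichlet(np.ones(L) + 5 * np.eye(L)[k], size=J) for k in range(L)], axis=1)
# M[j, k] is worker j's annotation distribution when the true label is k.

y = np.empty((N, J), dtype=int)
for i in range(N):
    for j in range(J):
        y[i, j] = rng.choice(L, p=M[j, z[i]])   # y_{i,j} ~ Categorical(M_j[z_i])
```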

3.5 Model Inference

We now describe how to infer the model parameters of our proposed deep generative modeling framework. The goal of model inference is to compute the posteriors of the latent variables, primarily the ground truth label of each instance (i.e., $\boldsymbol{z}$), conditioning on the observable data $\boldsymbol{y}$ and $\boldsymbol{o}$. However, this conditional density is intractable for exact inference because we would need to compute the marginal density of the observable variables, which requires exponential computation time (Wainwright and Jordan 2008). To address this issue, we take a mean-field variational inference approach for approximate inference. The key idea is to approximate the posteriors with a mean-field variational family, which is flexible enough to capture densities close to the posteriors but simple enough to enable tractable inference (Blei et al. 2017). As the models in Figure 9 are developed incrementally under the same probabilistic framework, we use DARFC as an example to showcase the model inference procedure.

The overall joint distribution of all hidden and observed variables takes the form:

$p(\boldsymbol{y}, \boldsymbol{o}, \boldsymbol{c}, \boldsymbol{x}, \boldsymbol{z}, \boldsymbol{\Theta}; \gamma) = p(\boldsymbol{\lambda})p(\boldsymbol{\omega})p(\boldsymbol{\alpha})p(\boldsymbol{\beta})p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \cdot p(\boldsymbol{c} \mid \boldsymbol{\omega})\,p(\boldsymbol{z} \mid \boldsymbol{\lambda}, \boldsymbol{c})\,p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{c})\,p(\boldsymbol{o} \mid \boldsymbol{x}; \gamma)\,p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\alpha}, \boldsymbol{\beta}).$ (21)

For convenience, we denote the global variables as $\boldsymbol{\Theta} = \{\boldsymbol{\lambda}, \boldsymbol{\omega}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\alpha}, \boldsymbol{\beta}\}$. Our learning objective is to maximize the marginal likelihood of the observed data $\boldsymbol{o}$ and annotations $\boldsymbol{y}$. As mentioned, exact inference is intractable. Thus, we consider a fully factorized mean-field variational family $\mathcal{Q}$ over the latent variables

$q(\boldsymbol{c}, \boldsymbol{x}, \boldsymbol{z}, \boldsymbol{\Theta}) = q(\boldsymbol{c})q(\boldsymbol{z})q(\boldsymbol{x})q(\boldsymbol{\Theta}) = q(\boldsymbol{c})q(\boldsymbol{z})q(\boldsymbol{x})q(\boldsymbol{\lambda})q(\boldsymbol{\omega})q(\boldsymbol{\alpha})q(\boldsymbol{\beta})q(\boldsymbol{\mu}, \boldsymbol{\Sigma}).$ (22)

Each member of the variational family $\mathcal{Q}$ serves as a candidate approximation to the exact posterior. Hence, our aim is to find the member that minimizes the Kullback-Leibler (KL) divergence to the exact posterior. Because the KL divergence is not computable, we instead optimize an alternative objective called the evidence lower bound (ELBO), which is equivalent to the KL divergence up to an added constant (Blei et al. 2017). In our case, the ELBO can be written as:

$\log p(\boldsymbol{o}, \boldsymbol{y}) \geq \mathcal{L}[q(\boldsymbol{c})q(\boldsymbol{z})q(\boldsymbol{x})q(\boldsymbol{\Theta}); \gamma] \triangleq \mathbb{E}_{q(\boldsymbol{c})q(\boldsymbol{z})q(\boldsymbol{x})q(\boldsymbol{\Theta})}\left[\log \frac{p(\boldsymbol{y}, \boldsymbol{o}, \boldsymbol{c}, \boldsymbol{x}, \boldsymbol{z}, \boldsymbol{\Theta}; \gamma)}{q(\boldsymbol{c})q(\boldsymbol{z})q(\boldsymbol{x})q(\boldsymbol{\Theta})}\right].$ (23)

In traditional variational inference for conjugate models, where all the complete conditionals are in the exponential family and the priors are conjugate, the optimal solution for maximizing the ELBO over each variational parameter can be derived in closed form, and applying a coordinate ascent algorithm to each coordinate is guaranteed to climb the ELBO to a local optimum (Blei et al. 2017). However, this is not directly applicable to our model because of the non-conjugate $p(\boldsymbol{o} \mid \boldsymbol{x}; \gamma)$ term. Inspired by structured variational autoencoders (Johnson et al. 2016) and amortized learning (Lin et al. 2018), we introduce a recognition network $r(\boldsymbol{o}_i; \phi)$ to enable fast amortized inference and to allow the inference algorithm to leverage the conjugacy structure. Specifically, we first assume all mean-field distributions are in the exponential family and write each variational distribution in its standard exponential family form:

$q(\theta) = \exp\{\langle \eta_\theta, t(\theta)\rangle - \log Z(\eta_\theta)\}, \quad \theta \in \{\boldsymbol{c}, \boldsymbol{z}, \boldsymbol{x}\} \cup \boldsymbol{\Theta},$ (24)

where $\eta_\theta$, $t(\theta)$, and $\log Z(\eta_\theta)$ are the natural parameter, sufficient statistic, and log partition function respectively6. With this assumption, the ELBO can be written as:

$\mathcal{L}[q(\boldsymbol{c})q(\boldsymbol{z})q(\boldsymbol{x})q(\boldsymbol{\Theta}); \gamma] \triangleq \mathcal{L}[\boldsymbol{\eta}_{\boldsymbol{c}}, \boldsymbol{\eta}_{\boldsymbol{z}}, \boldsymbol{\eta}_{\boldsymbol{x}}, \boldsymbol{\eta}_{\boldsymbol{\Theta}}; \gamma].$ (25)

6 We follow the same notation throughout this paper, writing the natural parameter, sufficient statistic, and log partition function as $\eta_{*}$, $t(\cdot)$, and $\log Z(\eta_{*})$ respectively.

To enable a conjugate structure, we replace the non-conjugate likelihood $p(\boldsymbol{o} \mid \boldsymbol{x}; \gamma)$ in the original ELBO with a conjugate term $\psi(\boldsymbol{x}_i; \boldsymbol{o}_i, \phi)$, which is defined as

$\psi(\boldsymbol{x}_i; \boldsymbol{o}_i, \phi) = \langle r(\boldsymbol{o}_i; \phi), \boldsymbol{t}(\boldsymbol{x}_i)\rangle.$ (26)

In the above formula, the recognition network $r(\boldsymbol{o}_i; \phi)$ is some parameterized class of functions and $\boldsymbol{t}(\boldsymbol{x}_i)$ is the sufficient statistic of $\boldsymbol{x}_i$ (i.e., the sufficient statistic of the multivariate Gaussian distribution). Following the language of the variational autoencoder (Kingma and Welling 2014), $r(\boldsymbol{o}_i; \phi)$ and $p(\boldsymbol{o} \mid \boldsymbol{x}; \gamma)$ can be viewed as an encoder and a decoder, which attempt to learn a hidden representation of the data. Hence, we obtain the following surrogate objective

$\tilde{\mathcal{L}}[\boldsymbol{\eta}_{\boldsymbol{c}}, \boldsymbol{\eta}_{\boldsymbol{z}}, \boldsymbol{\eta}_{\boldsymbol{x}}, \boldsymbol{\eta}_{\boldsymbol{\Theta}}; \phi] \triangleq \mathbb{E}_{q(\boldsymbol{c})q(\boldsymbol{z})q(\boldsymbol{x})q(\boldsymbol{\Theta})}\left[\log \frac{p(\boldsymbol{c}, \boldsymbol{x}, \boldsymbol{z}, \boldsymbol{\Theta})\exp\{\psi(\boldsymbol{x}; \boldsymbol{o}, \phi)\}\,p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{q(\boldsymbol{c})q(\boldsymbol{z})q(\boldsymbol{x})q(\boldsymbol{\Theta})}\right].$ (27)

Because this surrogate objective has the same conjugate form as the variational inference objective for a conjugate observation model, we can derive the local and global optimizers for each variational parameter (i.e., $\boldsymbol{\eta}_{\boldsymbol{c}}$, $\boldsymbol{\eta}_{\boldsymbol{z}}$, $\boldsymbol{\eta}_{\boldsymbol{x}}$, and $\boldsymbol{\eta}_{\boldsymbol{\Theta}}$) in closed form. Under the conjugacy setting, the natural parameters of each variational distribution can be computed from the expected sufficient statistics of the nodes in its Markov blanket, i.e., the neighboring nodes in our case (Wainwright and Jordan 2008). Hence, before proceeding further, we write each distribution in its standard exponential family form and introduce its expected statistics.
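Before writing out these exponential-family forms, we briefly illustrate what the recognition network $r(\boldsymbol{o}_i; \phi)$ may look like in practice. The sketch below is a hypothetical Keras implementation that maps a feature vector to the mean and diagonal precision of a Gaussian potential; the architecture and sizes are illustrative, not the ones used in our experiments.

```python
# A sketch of the recognition network r(o_i; phi): a small MLP mapping features o_i to
# Gaussian potentials (mean and diagonal precision), from which the natural-parameter
# potential for x_i can be formed. Architecture and sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

D, d = 784, 8                                   # feature and latent dimensions (hypothetical)

o_in = layers.Input(shape=(D,))
h = layers.Dense(256, activation="relu")(o_in)
h = layers.Dense(128, activation="relu")(h)
mean = layers.Dense(d)(h)                       # potential mean
log_prec = layers.Dense(d)(h)                   # log of a diagonal precision

recognition_net = Model(o_in, [mean, log_prec])

def gaussian_potential(mean, log_prec):
    """Convert (mean, diagonal precision J) into Gaussian natural parameters (J*m, -J/2)."""
    prec = tf.exp(log_prec)
    return prec * mean, -0.5 * prec
```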

$p(\boldsymbol{\omega})$ follows a Dirichlet distribution. It can be written as:

$p(\boldsymbol{\omega}) = \exp\{\langle \boldsymbol{\eta}^{0}_{\boldsymbol{\omega}}, \boldsymbol{t}(\boldsymbol{\omega})\rangle - \log Z(\boldsymbol{\eta}^{0}_{\boldsymbol{\omega}})\},$ (28)

where $\boldsymbol{\eta}^{0}_{\boldsymbol{\omega}} = [\zeta_1 - 1, \ldots, \zeta_K - 1]^{T}$, $\boldsymbol{t}(\boldsymbol{\omega}) = [\log\omega_1, \ldots, \log\omega_K]^{T}$, $\log Z(\boldsymbol{\eta}^{0}_{\boldsymbol{\omega}}) = \sum_{k=1}^{K}\log\Gamma(\zeta_k) - \log\Gamma\big(\sum_{k=1}^{K}\zeta_k\big)$, and the expected statistic $\mathbb{E}_{q(\boldsymbol{\omega})}\boldsymbol{t}(\boldsymbol{\omega}) = [\psi(\zeta_1), \ldots, \psi(\zeta_K)]^{T} - \psi\big(\sum_{k=1}^{K}\zeta_k\big)$. In the last formula, $\psi(\cdot)$ is the digamma function.

$p(c_i \mid \boldsymbol{\omega})$ follows a Categorical distribution. Without considering the subscript, each $p(c \mid \boldsymbol{\omega})$ can be written as:

$p(c \mid \boldsymbol{\omega}) = \exp\{\langle \boldsymbol{\eta}_{c}, \boldsymbol{t}(c)\rangle - \log Z(\boldsymbol{\eta}_{c})\} = \exp\{\langle \boldsymbol{t}(\boldsymbol{\omega}), \boldsymbol{t}(c)\rangle\},$ (29)

where $\boldsymbol{\eta}_{c} = [\log\omega_1, \ldots, \log\omega_K]^{T}$, $\boldsymbol{t}(c) = [\mathbb{I}(c=1), \ldots, \mathbb{I}(c=K)]^{T}$, $\log Z(\boldsymbol{\eta}_{c}) = 0$, and the expected statistic $\mathbb{E}_{q(c)}\boldsymbol{t}(c) = [\omega_1, \ldots, \omega_K]^{T}$.

$p(\lambda_k)$ follows a Beta distribution. Without considering the subscript, each $p(\lambda)$ can be written as:

$p(\lambda) = \exp\{\langle \boldsymbol{\eta}^{0}_{\lambda}, \boldsymbol{t}(\lambda)\rangle - \log Z(\boldsymbol{\eta}^{0}_{\lambda})\},$ (30)

where $\boldsymbol{\eta}^{0}_{\lambda} = [\tau_1 - 1, \tau_2 - 1]^{T}$, $\boldsymbol{t}(\lambda) = [\log\lambda, \log(1-\lambda)]^{T}$, $\log Z(\boldsymbol{\eta}^{0}_{\lambda}) = \log\Gamma(\tau_1) + \log\Gamma(\tau_2) - \log\Gamma(\tau_1 + \tau_2)$, and the expected statistic $\mathbb{E}_{q(\lambda)}\boldsymbol{t}(\lambda) = [\psi(\tau_1), \psi(\tau_2)]^{T} - \psi(\tau_1 + \tau_2)$.

$p(z_i \mid \boldsymbol{\lambda}, c_i)$ follows a mixture of Bernoulli distributions. Without considering the subscript, each $p(z \mid \boldsymbol{\lambda}, c)$ can be written as:

$p(z \mid \boldsymbol{\lambda}, c) = \exp\{\mathbb{I}(c=1)[\langle \boldsymbol{\eta}_{z}(\lambda_1), \boldsymbol{t}(z)\rangle - \log Z(\boldsymbol{\eta}_{z}(\lambda_1))] + \cdots + \mathbb{I}(c=K)[\langle \boldsymbol{\eta}_{z}(\lambda_K), \boldsymbol{t}(z)\rangle - \log Z(\boldsymbol{\eta}_{z}(\lambda_K))]\}$
$= \exp\{\langle \boldsymbol{t}(c), [\boldsymbol{t}(\lambda_1), \cdots, \boldsymbol{t}(\lambda_K)]^{T}\boldsymbol{t}(z)\rangle\},$ (31)

where $\boldsymbol{\eta}_{z}(\lambda_k) = [\log\lambda_k, \log(1-\lambda_k)]^{T}$, $\boldsymbol{t}(z) = [z, 1-z]^{T}$, and $\log Z(\boldsymbol{\eta}_{z}(\lambda_k)) = 0$. The expected statistic $\mathbb{E}_{q(z)}\boldsymbol{t}(z) = [\lambda, 1-\lambda]^{T}$ if the posterior of $z$ follows $\mathrm{Bernoulli}(\lambda)$.

$p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ follows a Normal-Inverse-Wishart (NIW) distribution. Without considering the subscript, each $p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ can be written as:

$p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \exp\{\langle \boldsymbol{\eta}^{0}_{\boldsymbol{\mu},\boldsymbol{\Sigma}}, \boldsymbol{t}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\rangle - \log Z(\boldsymbol{\eta}^{0}_{\boldsymbol{\mu},\boldsymbol{\Sigma}})\},$ (32)

where

$\boldsymbol{\eta}_{\boldsymbol{\mu},\boldsymbol{\Sigma}} = \begin{bmatrix} \kappa\boldsymbol{m} \\ \mathrm{vec}(\boldsymbol{S} + \kappa\boldsymbol{m}\boldsymbol{m}^{T}) \\ \kappa \\ v + d + 2 \end{bmatrix}, \quad \boldsymbol{t}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \begin{bmatrix} \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} \\ \mathrm{vec}(-\tfrac{1}{2}\boldsymbol{\Sigma}^{-1}) \\ -\tfrac{1}{2}\boldsymbol{\mu}^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} \\ -\tfrac{1}{2}\ln|\boldsymbol{\Sigma}| \end{bmatrix},$

$\ln Z(\boldsymbol{\eta}_{\boldsymbol{\mu},\boldsymbol{\Sigma}}) = \frac{v}{2}\big(d\ln 2 - \ln|\boldsymbol{S}|\big) + \ln\Gamma_d\!\Big(\frac{v}{2}\Big) - \frac{d}{2}\ln\kappa,$

and the expected statistics are

$\mathbb{E}_{q(\boldsymbol{\mu},\boldsymbol{\Sigma})}\boldsymbol{t}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \begin{bmatrix} v\boldsymbol{S}^{-1}\boldsymbol{m} \\ \mathrm{vec}(-\tfrac{1}{2}v\boldsymbol{S}^{-1}) \\ -\tfrac{1}{2}(\kappa^{-1}d + v\boldsymbol{m}^{T}\boldsymbol{S}^{-1}\boldsymbol{m}) \\ \tfrac{1}{2}\big(\psi_d(\tfrac{v}{2}) + d\ln 2 - \ln|\boldsymbol{S}|\big) \end{bmatrix}.$

In the above formulas, $\Gamma_d\big(\frac{v}{2}\big) = \pi^{d(d-1)/4}\prod_{i=1}^{d}\Gamma\big(\frac{v+1-i}{2}\big)$, $\psi_d\big(\frac{v}{2}\big) = \sum_{i=1}^{d}\psi\big(\frac{v+1-i}{2}\big)$, and $\Gamma(\cdot)$ is the Gamma function.

$p(\boldsymbol{x}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, c_i)$ follows a mixture of multivariate Gaussian distributions. Without considering the subscript, each $p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, c)$ can be written as:

$p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, c) = \exp\{\mathbb{I}(c=1)[\langle \boldsymbol{\eta}_{\boldsymbol{x}}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1), \boldsymbol{t}(\boldsymbol{x})\rangle - \log Z(\boldsymbol{\eta}_{\boldsymbol{x}}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1))] + \cdots + \mathbb{I}(c=K)[\langle \boldsymbol{\eta}_{\boldsymbol{x}}(\boldsymbol{\mu}_K,\boldsymbol{\Sigma}_K), \boldsymbol{t}(\boldsymbol{x})\rangle - \log Z(\boldsymbol{\eta}_{\boldsymbol{x}}(\boldsymbol{\mu}_K,\boldsymbol{\Sigma}_K))]\}$
$= \exp\Big\{\Big\langle \boldsymbol{t}(c), [\boldsymbol{t}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1), \ldots, \boldsymbol{t}(\boldsymbol{\mu}_K,\boldsymbol{\Sigma}_K)]^{T}\begin{bmatrix}\boldsymbol{t}(\boldsymbol{x}) \\ \boldsymbol{1}\end{bmatrix}\Big\rangle\Big\},$ (33)

where $\boldsymbol{\eta}_{\boldsymbol{x}}(\boldsymbol{\mu},\boldsymbol{\Sigma}) = \begin{bmatrix}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} \\ \mathrm{vec}(-\tfrac{1}{2}\boldsymbol{\Sigma}^{-1})\end{bmatrix}$, $\boldsymbol{t}(\boldsymbol{x}) = \begin{bmatrix}\boldsymbol{x} \\ \mathrm{vec}(\boldsymbol{x}\boldsymbol{x}^{T})\end{bmatrix}$, and $\log Z(\boldsymbol{\eta}_{\boldsymbol{x}}(\boldsymbol{\mu},\boldsymbol{\Sigma})) = \frac{1}{2}\boldsymbol{\mu}^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \frac{1}{2}\log|\boldsymbol{\Sigma}| = -\boldsymbol{1}^{T}\,\boldsymbol{t}(\boldsymbol{\mu},\boldsymbol{\Sigma})[3\!:\!4]$ (i.e., the negative sum of the last two entries of $\boldsymbol{t}(\boldsymbol{\mu},\boldsymbol{\Sigma})$). The expected sufficient statistic $\mathbb{E}_{q(\boldsymbol{x})}\boldsymbol{t}(\boldsymbol{x}) = \begin{bmatrix}\boldsymbol{\mu} \\ \mathrm{vec}(\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^{T})\end{bmatrix}$ if the posterior of $\boldsymbol{x}$ follows $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

$p(\alpha_j)$ follows a Beta distribution. Without considering the subscript, each $p(\alpha)$ can be written as:

$p(\alpha) = \exp\{\langle \boldsymbol{\eta}^{0}_{\alpha}, \boldsymbol{t}(\alpha)\rangle - \log Z(\boldsymbol{\eta}^{0}_{\alpha})\},$ (34)

where $\boldsymbol{\eta}^{0}_{\alpha} = [\rho_{\alpha,1} - 1, \rho_{\alpha,2} - 1]^{T}$, $\boldsymbol{t}(\alpha) = [\log\alpha, \log(1-\alpha)]^{T}$, $\log Z(\boldsymbol{\eta}^{0}_{\alpha}) = \log\Gamma(\rho_{\alpha,1}) + \log\Gamma(\rho_{\alpha,2}) - \log\Gamma(\rho_{\alpha,1} + \rho_{\alpha,2})$, and the expected statistic $\mathbb{E}_{q(\alpha)}\boldsymbol{t}(\alpha) = [\psi(\rho_{\alpha,1}), \psi(\rho_{\alpha,2})]^{T} - \psi(\rho_{\alpha,1} + \rho_{\alpha,2})$. $p(\beta_j)$ is defined analogously to $p(\alpha_j)$.

Depending on the value of $z_i$, $y_{i,j}$ follows different Bernoulli distributions. Formally,

$y_{i,j} \sim Bernoulli(\alpha_j)^{z_i}\, Bernoulli(1-\beta_j)^{1-z_i}$. (35)

Without considering the subscript, each $p(y \mid z, \alpha, \beta)$ can be written as:

$p(y \mid z, \alpha, \beta) = \exp\{z\langle \boldsymbol{t}(\alpha), \boldsymbol{t}_{\alpha}(y)\rangle + (1-z)\langle \boldsymbol{t}(\beta), \boldsymbol{t}_{\beta}(y)\rangle\} = \exp\{\langle \boldsymbol{t}(z), [\langle \boldsymbol{t}(\alpha), \boldsymbol{t}_{\alpha}(y)\rangle,\; \langle \boldsymbol{t}(\beta), \boldsymbol{t}_{\beta}(y)\rangle]^{T}\rangle\}$, (36)

where $\boldsymbol{t}_{\alpha}(y) = [y, 1-y]^{T}$ and $\boldsymbol{t}_{\beta}(y) = [1-y, y]^{T}$.
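The annotation model in (35) states that worker $j$ labels a positive instance ($z_i = 1$) correctly with probability $\alpha_j$ (sensitivity) and a negative instance correctly with probability $\beta_j$ (specificity). A minimal simulation of this generative assumption is sketched below (our own illustrative code with made-up values; $I_{i,j}$ indicates whether worker $j$ annotated instance $i$):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_annotations(z, alpha, beta, I):
    """Draw y[i, j] ~ Bernoulli(alpha[j])^z[i] * Bernoulli(1 - beta[j])^(1 - z[i])
    for the (i, j) pairs with I[i, j] == 1 (worker j annotated instance i)."""
    # Probability that worker j answers "1" for instance i.
    p_one = np.where(z[:, None] == 1, alpha[None, :], 1.0 - beta[None, :])
    y = rng.binomial(1, p_one)
    return np.where(I == 1, y, -1)   # -1 marks "no annotation received"

# Toy example: 4 instances, 2 workers, every worker annotates every instance.
z = np.array([1, 0, 1, 0])           # latent true labels
alpha = np.array([0.9, 0.7])         # per-worker sensitivity
beta = np.array([0.8, 0.6])          # per-worker specificity
I = np.ones((4, 2), dtype=int)
print(simulate_annotations(z, alpha, beta, I))
```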

By taking the derivative of the surrogate objective $\tilde{\mathcal{L}}[\boldsymbol{\eta_c}, \boldsymbol{\eta_z}, \boldsymbol{\eta_x}, \boldsymbol{\eta_\Theta}; \phi]$ w.r.t. each variational parameter, we obtain the following update formulas for the local variational parameters:

$\boldsymbol{\eta}_{\boldsymbol{x}_i}^{*} = \mathbb{E}_{q(\boldsymbol{\mu},\boldsymbol{\Sigma})}[\boldsymbol{t}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)[:], \ldots, \boldsymbol{t}(\boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K)[:]]\, \mathbb{E}_{q(\boldsymbol{c})}[\boldsymbol{t}(c_i)] + r(\boldsymbol{o}_i; \phi)$, (37)

$\boldsymbol{\eta}_{z_i}^{*} = \mathbb{E}_{q(\boldsymbol{\lambda})}[\boldsymbol{t}(\lambda_1), \ldots, \boldsymbol{t}(\lambda_K)]\, \mathbb{E}_{q(\boldsymbol{c})}[\boldsymbol{t}(c_i)] + \left[\sum_{j=1}^{J} I_{i,j}\langle \mathbb{E}_{q(\boldsymbol{\alpha})}[\boldsymbol{t}(\alpha_j)], \boldsymbol{t}_{\alpha}(y_{i,j})\rangle,\; \sum_{j=1}^{J} I_{i,j}\langle \mathbb{E}_{q(\boldsymbol{\beta})}[\boldsymbol{t}(\beta_j)], \boldsymbol{t}_{\beta}(y_{i,j})\rangle\right]^{T}$, (38)

$\boldsymbol{\eta}_{c_i}^{*} = \mathbb{E}_{q(\boldsymbol{\omega})}[\boldsymbol{t}(\boldsymbol{\omega})] + \mathbb{E}_{q(\boldsymbol{\mu},\boldsymbol{\Sigma})}[\boldsymbol{t}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), \ldots, \boldsymbol{t}(\boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K)]^{T}\, \mathbb{E}_{q(\boldsymbol{x})}[\boldsymbol{t}(\boldsymbol{x}_i), 1]^{T} + \mathbb{E}_{q(\boldsymbol{\lambda})}[\boldsymbol{t}(\lambda_1), \ldots, \boldsymbol{t}(\lambda_K)]^{T}\, \mathbb{E}_{q(\boldsymbol{z})}[\boldsymbol{t}(z_i)]$. (39)

The global variational parameters can be updated as follows:

$\boldsymbol{\eta}_{\boldsymbol{\omega}}^{*} = \boldsymbol{\eta}_{\boldsymbol{\omega}}^{0} + \sum_{i=1}^{N}\mathbb{E}_{q(\boldsymbol{c})}[\boldsymbol{t}(c_i)]$, (40)

$\boldsymbol{\eta}_{\lambda_k}^{*} = \boldsymbol{\eta}_{\lambda_k}^{0} + \sum_{i=1}^{N}\mathbb{E}_{q(\boldsymbol{c})}[\boldsymbol{t}(c_i)[k]]\, \mathbb{E}_{q(\boldsymbol{z})}[\boldsymbol{t}(z_i)]$, (41)

$\boldsymbol{\eta}_{\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k}^{*} = \boldsymbol{\eta}_{\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k}^{0} + \sum_{i=1}^{N}\mathbb{E}_{q(\boldsymbol{c})}[\boldsymbol{t}(c_i)[k]]\, \mathbb{E}_{q(\boldsymbol{x})}[\boldsymbol{t}(\boldsymbol{x}_i), 1]^{T}$, (42)

$\boldsymbol{\eta}_{\alpha_j}^{*} = \boldsymbol{\eta}_{\alpha_j}^{0} + \sum_{i=1}^{N} I_{i,j}\, \mathbb{E}_{q(\boldsymbol{z})}[\boldsymbol{t}(z_i)[1]]\, \boldsymbol{t}_{\alpha}(y_{i,j})$, (43)

$\boldsymbol{\eta}_{\beta_j}^{*} = \boldsymbol{\eta}_{\beta_j}^{0} + \sum_{i=1}^{N} I_{i,j}\, \mathbb{E}_{q(\boldsymbol{z})}[\boldsymbol{t}(z_i)[2]]\, \boldsymbol{t}_{\beta}(y_{i,j})$. (44)
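Equations (40) through (44) all have the familiar conjugate form of "prior natural parameters plus accumulated expected sufficient statistics." As a small illustration (our own sketch with made-up values, not the dissertation's code), the Dirichlet update in (40) simply adds each instance's posterior cluster-membership probabilities to the prior parameters:

```python
import numpy as np

def update_eta_omega(eta_omega_prior, q_c):
    """Eq. (40): eta_omega* = eta_omega^0 + sum_i E_{q(c)}[t(c_i)],
    where each row of q_c holds an instance's posterior cluster probabilities."""
    return eta_omega_prior + q_c.sum(axis=0)

# Toy example: N = 3 instances, K = 2 clusters.
eta_omega_prior = np.array([1.0, 1.0])
q_c = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.5, 0.5]])
print(update_eta_omega(eta_omega_prior, q_c))   # -> [2.6 2.4]
```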

By using a block coordinate ascent algorithm which updates the variational parameters iteratively, we can find a local partial optimizer $(\boldsymbol{\eta}_{c_i}^{*}(\phi), \boldsymbol{\eta}_{\boldsymbol{x}_i}^{*}(\phi), \boldsymbol{\eta}_{z_i}^{*}(\phi), \boldsymbol{\eta}_{\boldsymbol{\Theta}}^{*}(\phi))$ of $\tilde{\mathcal{L}}$ w.r.t. $(\boldsymbol{\eta_c}, \boldsymbol{\eta_z}, \boldsymbol{\eta_x}, \boldsymbol{\eta_\Theta})$ given a fixed $\phi$. Note that $(\boldsymbol{\eta}_{c_i}^{*}(\phi), \boldsymbol{\eta}_{\boldsymbol{x}_i}^{*}(\phi), \boldsymbol{\eta}_{z_i}^{*}(\phi), \boldsymbol{\eta}_{\boldsymbol{\Theta}}^{*}(\phi))$ is a function of $\phi$. By plugging them back into $\mathcal{L}[\boldsymbol{\eta_c}, \boldsymbol{\eta_z}, \boldsymbol{\eta_x}, \boldsymbol{\eta_\Theta}; \gamma]$, the final objective $\mathcal{J}$ is:

$\mathcal{J}(\gamma, \phi) \triangleq \mathcal{L}[\boldsymbol{\eta}_{c_i}^{*}(\phi), \boldsymbol{\eta}_{\boldsymbol{x}_i}^{*}(\phi), \boldsymbol{\eta}_{z_i}^{*}(\phi), \boldsymbol{\eta}_{\boldsymbol{\Theta}}^{*}(\phi); \gamma]$, (45)

which is a function of $\gamma$ and $\phi$. The specific form of the final objective is:

$\mathcal{J}(\gamma, \phi) \triangleq \mathcal{L}[\boldsymbol{\eta}_{c_i}^{*}(\phi), \boldsymbol{\eta}_{\boldsymbol{x}_i}^{*}(\phi), \boldsymbol{\eta}_{z_i}^{*}(\phi), \boldsymbol{\eta}_{\boldsymbol{\Theta}}^{*}(\phi); \gamma]$
$\quad = \mathbb{E}_{q^{*}(\boldsymbol{x})}[\log p(\boldsymbol{o} \mid \boldsymbol{x}; \gamma)] + \mathbb{E}_{q^{*}(\boldsymbol{z})q^{*}(\boldsymbol{\alpha})q^{*}(\boldsymbol{\beta})}[\log p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\alpha}, \boldsymbol{\beta})]$
$\qquad - \mathbb{E}_{q^{*}(\boldsymbol{\Theta})}[KL(q^{*}(\boldsymbol{c})q^{*}(\boldsymbol{z})q^{*}(\boldsymbol{x}) \,\|\, p(\boldsymbol{c} \mid \boldsymbol{\omega})p(\boldsymbol{z} \mid \boldsymbol{\lambda})p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{c}))]$
$\qquad - KL(q^{*}(\boldsymbol{\lambda}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \,\|\, p(\boldsymbol{\lambda}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\alpha}, \boldsymbol{\beta}))$. (46)

The first term $\mathbb{E}_{q^{*}(\boldsymbol{x})}[\log p(\boldsymbol{o} \mid \boldsymbol{x}; \gamma)]$ can be obtained by applying the reparameterization trick (Kingma and Welling 2014). The annotation likelihood term $\mathbb{E}_{q^{*}(\boldsymbol{z})q^{*}(\boldsymbol{\alpha})q^{*}(\boldsymbol{\beta})}[\log p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\alpha}, \boldsymbol{\beta})]$ is:

$\sum_{i=1}^{N}\sum_{j=1}^{J} I_{i,j}\, z_i\, \langle \mathbb{E}_{q^{*}(\boldsymbol{\alpha})}[\boldsymbol{t}(\alpha_j)], \boldsymbol{t}_{\alpha}(y_{i,j})\rangle + \sum_{i=1}^{N}\sum_{j=1}^{J} I_{i,j}\, (1-z_i)\, \langle \mathbb{E}_{q^{*}(\boldsymbol{\beta})}[\boldsymbol{t}(\beta_j)], \boldsymbol{t}_{\beta}(y_{i,j})\rangle$. (47)

The local KL divergence term is:

$\mathbb{E}_{q^{*}(\boldsymbol{\Theta})}[KL(q^{*}(\boldsymbol{c})q^{*}(\boldsymbol{z})q^{*}(\boldsymbol{x}) \,\|\, p(\boldsymbol{c} \mid \boldsymbol{\omega})p(\boldsymbol{z} \mid \boldsymbol{\lambda})p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{c}))]$
$\quad = \sum_{i=1}^{N}\{\mathbb{E}_{q^{*}(\boldsymbol{\omega})}[KL(q^{*}(c_i) \,\|\, p(c_i \mid \boldsymbol{\omega}))] + \mathbb{E}_{q^{*}(\boldsymbol{\lambda})q^{*}(\boldsymbol{c})}[KL(q^{*}(z_i) \,\|\, p(z_i \mid \boldsymbol{\lambda}, c_i))] + \mathbb{E}_{q^{*}(\boldsymbol{\mu},\boldsymbol{\Sigma})q^{*}(\boldsymbol{c})}[KL(q^{*}(\boldsymbol{x}_i) \,\|\, p(\boldsymbol{x}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, c_i))]\}$. (48)

Specifically,

$\mathbb{E}_{q^{*}(\boldsymbol{\omega})}[KL(q^{*}(c_i) \,\|\, p(c_i \mid \boldsymbol{\omega}))] = \langle \boldsymbol{\eta}_{c_i}^{*} - \mathbb{E}_{q^{*}(\boldsymbol{\omega})}[\boldsymbol{t}(\boldsymbol{\omega})], \mathbb{E}_{q^{*}(\boldsymbol{c})}[\boldsymbol{t}(c_i)]\rangle$, (49)

$\mathbb{E}_{q^{*}(\boldsymbol{\lambda})q^{*}(\boldsymbol{c})}[KL(q^{*}(z_i) \,\|\, p(z_i \mid \boldsymbol{\lambda}, c_i))] = \langle \boldsymbol{\eta}_{z_i}^{*} - \mathbb{E}_{q^{*}(\boldsymbol{\lambda})}[\boldsymbol{t}(\lambda_1), \ldots, \boldsymbol{t}(\lambda_K)]\, \mathbb{E}_{q^{*}(\boldsymbol{c})}[\boldsymbol{t}(c_i)], \mathbb{E}_{q^{*}(z_i)}\boldsymbol{t}(z_i)\rangle$, (50)

$\mathbb{E}_{q^{*}(\boldsymbol{\mu},\boldsymbol{\Sigma})q^{*}(\boldsymbol{c})}[KL(q^{*}(\boldsymbol{x}_i) \,\|\, p(\boldsymbol{x}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, c_i))]$
$\quad = \langle \boldsymbol{\eta}_{\boldsymbol{x}_i}^{*} - \mathbb{E}_{q^{*}(\boldsymbol{\mu},\boldsymbol{\Sigma})}[(\boldsymbol{t}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)[:], \ldots, \boldsymbol{t}(\boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K)[:])]\, \mathbb{E}_{q^{*}(\boldsymbol{c})}[\boldsymbol{t}(c_i)], \mathbb{E}_{q^{*}(\boldsymbol{x}_i)}\boldsymbol{t}(\boldsymbol{x}_i)\rangle$
$\qquad - \log Z(\boldsymbol{\eta}_{\boldsymbol{x}_i}^{*}) - \mathbb{E}_{q^{*}(\boldsymbol{\mu},\boldsymbol{\Sigma})}[(\boldsymbol{t}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)[:], \ldots, \boldsymbol{t}(\boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K)[:])]^{\top}\, \mathbb{E}_{q^{*}(\boldsymbol{c})}[\boldsymbol{t}(c_i)]$. (51)

The global KL divergence terms are listed as follows:

$KL(q^{*}(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})) = \langle \boldsymbol{\eta}_{\boldsymbol{\omega}}^{*} - \boldsymbol{\eta}_{\boldsymbol{\omega}}^{0}, \mathbb{E}_{q^{*}(\boldsymbol{\omega})}[\boldsymbol{t}(\boldsymbol{\omega})]\rangle - (\log Z(\boldsymbol{\eta}_{\boldsymbol{\omega}}^{*}) - \log Z(\boldsymbol{\eta}_{\boldsymbol{\omega}}^{0}))$, (52)

$KL(q^{*}(\lambda_k) \,\|\, p(\lambda_k)) = \langle \boldsymbol{\eta}_{\lambda_k}^{*} - \boldsymbol{\eta}_{\lambda_k}^{0}, \mathbb{E}_{q^{*}(\boldsymbol{\lambda})}[\boldsymbol{t}(\lambda_k)]\rangle - (\log Z(\boldsymbol{\eta}_{\lambda_k}^{*}) - \log Z(\boldsymbol{\eta}_{\lambda_k}^{0}))$, (53)

$KL(q^{*}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \,\|\, p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)) = \langle \boldsymbol{\eta}_{\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k}^{*} - \boldsymbol{\eta}_{\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k}^{0}, \mathbb{E}_{q^{*}(\boldsymbol{\mu},\boldsymbol{\Sigma})}[\boldsymbol{t}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)]\rangle - (\log Z(\boldsymbol{\eta}_{\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k}^{*}) - \log Z(\boldsymbol{\eta}_{\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k}^{0}))$, (54)

$KL(q^{*}(\alpha_j) \,\|\, p(\alpha_j)) = \langle \boldsymbol{\eta}_{\alpha_j}^{*} - \boldsymbol{\eta}_{\alpha_j}^{0}, \mathbb{E}_{q^{*}(\boldsymbol{\alpha})}[\boldsymbol{t}(\alpha_j)]\rangle - (\log Z(\boldsymbol{\eta}_{\alpha_j}^{*}) - \log Z(\boldsymbol{\eta}_{\alpha_j}^{0}))$, (55)

$KL(q^{*}(\beta_j) \,\|\, p(\beta_j)) = \langle \boldsymbol{\eta}_{\beta_j}^{*} - \boldsymbol{\eta}_{\beta_j}^{0}, \mathbb{E}_{q^{*}(\boldsymbol{\beta})}[\boldsymbol{t}(\beta_j)]\rangle - (\log Z(\boldsymbol{\eta}_{\beta_j}^{*}) - \log Z(\boldsymbol{\eta}_{\beta_j}^{0}))$. (56)

Previous literature (Johnson et al. 2016) shows that $\mathcal{J}(\gamma, \phi)$ is a lower bound of the optimized variational inference objective, i.e.,

$\mathcal{J}(\gamma, \phi) \leq \max_{\boldsymbol{\eta_c}, \boldsymbol{\eta_z}, \boldsymbol{\eta_x}, \boldsymbol{\eta_\Theta}} \mathcal{L}[\boldsymbol{\eta_c}, \boldsymbol{\eta_z}, \boldsymbol{\eta_x}, \boldsymbol{\eta_\Theta}; \gamma]$. (57)

Hence, by maximizing $\mathcal{J}(\gamma, \phi)$ using gradient-based optimization, we are maximizing a lower bound on the model log evidence $\log p(\boldsymbol{o}, \boldsymbol{y})$. Following previous literature (Kingma and Welling 2014), the gradients of the objective $\mathcal{J}(\gamma, \phi)$ w.r.t. $\gamma$ and $\phi$, i.e., $\nabla_{\gamma}\mathcal{J}(\gamma, \phi)$ and $\nabla_{\phi}\mathcal{J}(\gamma, \phi)$, can be obtained by using the reparameterization trick. Last, we use a gradient descent algorithm to update the deep network parameters $\gamma$ and $\phi$.

The inference procedures of DARF and S-DARFC can be similarly derived. For DARFCD, as the generation of annotations involves a sigmoid function, the model is still non-conjugate even after we introduce a recognition network to deal with the non-conjugate likelihood $p(\boldsymbol{o} \mid \boldsymbol{x}; \gamma)$. To solve this issue, we follow the standard practice in the literature by introducing a variational lower bound of the sigmoid function (Bishop 2006).
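To summarize the overall estimation procedure: an inner block coordinate ascent computes the optimal local and global variational parameters for the current recognition network, and an outer gradient step updates the network parameters γ and φ through the reparameterization trick. The sketch below shows only this control flow in TensorFlow; the inner coordinate ascent and the structured KL terms of (46) are replaced by simple stand-ins (a pass-through function and a standard Gaussian KL), so it should be read as a structural outline under those simplifying assumptions rather than as the dissertation's implementation.

```python
import tensorflow as tf

D, d = 8, 2   # toy feature and latent dimensions
encoder = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                               tf.keras.layers.Dense(2 * d)])   # r(o; phi): mean and log-variance
decoder = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                               tf.keras.layers.Dense(D)])       # p(o | x; gamma)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)

def coordinate_ascent(potentials):
    # Stand-in for the inner block coordinate ascent of Eqs. (37)-(44);
    # here the recognition potentials are simply passed through.
    return potentials

def train_step(o):
    with tf.GradientTape() as tape:
        local_params = coordinate_ascent(encoder(o))
        mu, log_var = tf.split(local_params, 2, axis=-1)
        eps = tf.random.normal(tf.shape(mu))
        x = mu + tf.exp(0.5 * log_var) * eps                     # reparameterization trick
        recon = decoder(x)
        recon_term = -tf.reduce_mean(tf.reduce_sum(tf.square(o - recon), axis=-1))
        # Simplified KL against a standard Gaussian; the actual objective uses the
        # structured KL terms in Eqs. (48)-(56).
        kl_term = 0.5 * tf.reduce_mean(
            tf.reduce_sum(tf.exp(log_var) + tf.square(mu) - 1.0 - log_var, axis=-1))
        loss = -(recon_term - kl_term)                           # negative of a simplified J(gamma, phi)
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

o_batch = tf.random.normal((32, D))
for _ in range(5):
    train_step(o_batch)
```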

3.6 Empirical Evaluations

In this section, we test the proposed hypotheses using our deep generative framework and comprehensively evaluate the proposed models by comparing them with the state-of-the-art benchmarks. First, we describe the experimental design including datasets, benchmark models, evaluation metrics, and experimental procedures. Then, we report and discuss the experimental results.

3.6.1 Experimental Design

We conduct truth inference experiments on eight real-world binary tasks, including text and image classification. Table 4 summarizes the dataset statistics. Rotten Tomatoes (RT) contains

10,662 movie reviews from the Rotten Tomatoes website7 and was labeled for sentiment polarity classification (Pang and Lee 2005). Each review is associated with two potential labels: “fresh” (positive) and “rotten” (negative). 27,746 crowd annotations from 203 AMT workers were collected for 4,999 reviews (Rodrigues et al. 2013). In this study, we use the remaining 5,663 reviews without annotations for testing our semi-crowdsourcing model (i.e., S-DARFC). For each review, we pre-train its textual feature by applying Google’s Bidirectional Encoder Representations from Transformers (BERT) and obtain a vector of size 768 (Devlin et al. 2018).

7 https://www.rottentomatoes.com/
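For reference, a 768-dimensional BERT feature for a piece of text can be extracted with a few lines of code. The sketch below uses the Hugging Face transformers library and takes the [CLS] token embedding; both the library choice and the pooling strategy are our assumptions for illustration, since the dissertation only specifies that BERT (Devlin et al. 2018) is applied:

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

def bert_feature(text):
    """Return a 768-dimensional vector for one piece of text (the [CLS] embedding)."""
    inputs = tokenizer(text, return_tensors="tf", truncation=True, max_length=128)
    outputs = model(inputs)
    return outputs.last_hidden_state[:, 0, :]   # shape: (1, 768)

print(bert_feature("A gripping, beautifully acted film.").shape)
```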

FL (FOREO Luna) and iPhone are two datasets originally collected from Twitter for product adoption intention mining (Z. Zhang et al. 2020). The products for these two datasets are the FOREO Luna facial cleanser and the iPhone 7, respectively. We repurpose these two datasets for a more popular task: purchase intention mining, i.e., identifying whether the author of a tweet shows the intention to purchase a specific product (Ding et al. 2015). To obtain a set of tweets with ground truth for evaluating the methods, two of the authors carefully labeled a small set of randomly sampled tweets. The Cohen’s Kappa scores for FL and iPhone are 0.941 and 0.926 respectively, suggesting excellent inter-annotator reliability. Similar to the literature, we use AMT to collect crowdsourced labels. As a result, 20,335 annotations from 102 workers are obtained for 3,487 tweets in the FL dataset and 21,889 annotations from 133 workers for 4,333 tweets in the iPhone dataset. For both datasets, we prepared an extra set of tweets (3,198 and 4,333, respectively) for testing the semi-crowdsourcing model. Similar to RT, each tweet is pre-trained with BERT and represented as a vector of size 768.
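Inter-annotator reliability scores of this kind can be computed directly from the two authors' label vectors with scikit-learn; a minimal sketch (the labels below are made up for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Toy example: binary purchase-intention labels from two annotators on ten tweets.
annotator_1 = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
annotator_2 = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
print(cohen_kappa_score(annotator_1, annotator_2))   # values near 1 indicate strong agreement
```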

Table 4. Dataset Statistics

Task | Dataset | # Instances (# with Ann.) | # Ann. | Label Cardinality | # Workers | # Instances with GT | # Pos. / # Neg.
Text Classification | RT | 10,662 (4,999) | 27,746 | 5.55 | 203 | 10,662 | 1
Text Classification | FL | 10,801 (3,487) | 20,335 | 5.83 | 102 | 903 | 0.36
Text Classification | iPhone | 7,531 (4,333) | 21,889 | 5.05 | 133 | 1,372 | 0.77
Image Classification | Bill | 6,033 (3,016) | 15,080 | 5 | 356 | 6,033 | 0.80
Image Classification | Forehead | 6,033 (3,016) | 15,080 | 5 | 329 | 6,033 | 0.80
Image Classification | Head | 6,033 (3,016) | 15,080 | 5 | 340 | 6,033 | 0.80
Image Classification | Shape | 6,033 (3,016) | 15,080 | 5 | 360 | 6,033 | 0.80
Image Classification | Throat | 6,033 (3,016) | 15,080 | 5 | 316 | 6,033 | 0.80
Note: Ann. means annotations; GT means ground truth; Pos. means positive; and Neg. means negative. Label cardinality refers to the average number of received annotations per instance.

The image classification datasets are extracted from a larger public dataset Caltech-UCSD

Birds 200, which contains several tasks to label binary local characteristics for 6,033 bird images

(Welinder, Branson, Mita, et al. 2010; Yin et al. 2020). We use five of them in our experiment:

Bill, Forehead, Head, Shape, and Throat. The labels are whether the bill shape is all-purpose or not, whether the forehead is black or not, whether the head pattern is plain or not, whether the shape is perching-like or not, and whether the throat is black or not, respectively. All images receive five crowdsourced labels. To use these datasets for evaluating the semi-crowdsourcing model, we keep annotations for only half of the images (i.e., 3,016) and treat the remaining 3,017 images as instances without annotations. It should be noted that, because of this change, our experimental results are different from the results reported in the literature (Yin et al. 2020).

Last, for each image, we use 287 local attributes as object features (Yin et al. 2020).

For comparison with our proposed models (see Figure 9), we select several state-of-the-art benchmark models from recent literature. Majority Voting (MV) is the most frequently used benchmark, which predicts the label as positive if at least half of the received annotations are positive (Wang et al. 2017). Although easy to implement, MV completely ignores any heterogeneity in workers’ quality and may perform poorly when workers’ reliability is low. As a representative model using the probabilistic approach, iBCC models each worker’s reliability with a confusion matrix and estimates the model parameters with the Expectation-Maximization

(EM) algorithm (Kim and Ghahramani 2012). Recent literature has started considering task features to further improve the performance of annotation aggregation. BCCwords models the generation of sentences using a topic model (Simpson et al. 2015). However, this only works when the task is in the form of sentences. CrowdDeepAE (CDAE) and Deep Clustering Aggregation Model (DCAM) are two deep generative models that utilize deep learning to model the complex non-linear relationship between instance features and task labels. Compared with CDAE, DCAM also considers the clustering structure among tasks. The comparison between the proposed models and the benchmark models is shown in Table 5.

Table 5. Comparison Between Proposed Models and Benchmark Models

Group | Model | Worker Reliability (H1) | Task Features (H2) | Task Clustering (H3) | Semi-Crowdsourcing (H4) | Task Difficulty (H5)
Benchmarks | MV | | | | |
Benchmarks | iBCC | ✓ | | | |
Benchmarks | BCCwords | ✓ | ✓ | | |
Benchmarks | CDAE | ✓ | ✓ | | |
Benchmarks | DCAM | ✓ | ✓ | ✓ | |
Proposed Models | DARF | ✓ | ✓ | | |
Proposed Models | DARFC | ✓ | ✓ | ✓ | |
Proposed Models | S-DARFC | ✓ | ✓ | ✓ | ✓ |
Proposed Models | DARFCD | ✓ | ✓ | ✓ | | ✓
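As a concrete reference point for the simplest benchmark in Table 5, a majority-voting aggregator over a (possibly sparse) instance-by-worker annotation matrix can be sketched as follows (our own minimal implementation of the rule described above, not the code used in the experiments):

```python
import numpy as np

def majority_vote(y, I):
    """Predict 1 for instance i if at least half of its received annotations are positive.
    y[i, j] is worker j's binary label for instance i; I[i, j] == 1 marks a received annotation."""
    positives = (y * I).sum(axis=1)
    received = I.sum(axis=1)
    return (positives >= 0.5 * received).astype(int)

# Toy example: 3 instances, 3 workers; the third instance received only two annotations.
y = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
I = np.array([[1, 1, 1],
              [1, 1, 1],
              [1, 1, 0]])
print(majority_vote(y, I))   # -> [1 0 1]
```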

To compare the performance of the proposed models and the benchmark models, we adopt the widely accepted evaluation metrics for binary classification, including ROC AUC (area under the Receiver Operating Characteristic curve) and F1 score. We choose ROC AUC and F1 score as the metrics because our datasets are slightly skewed (see Table 4) and hence accuracy does not provide adequate information on a classifier’s functionality (He and Garcia 2008).

Limited by space, we mainly show the results with metric ROC AUC. The results with metric F1 score lead to the same conclusion in our experiments.

All models are implemented in Python 3 using the library TensorFlow, which enables automatic differentiation and GPU acceleration. We impose weak priors on all Bayesian parameters.

For models that hold a clustering structure (i.e., DARFC, S-DARFC, and DARFCD), we choose

K = 200. In Section 3.6.4, we show that our models are not sensitive to the selection of K as long as K is moderately large. The dimension of the latent data representation (i.e., d) is set to 40. The encoder r(o_i; ϕ) and the decoder p(o|x; γ) are instantiated using a fully connected structure. For convenience, we denote (D − d)/3 as Δ. The numbers of hidden nodes are {D, D − Δ, d + Δ, d} for the encoder, and the order is reversed for the decoder. Each latent layer is followed by a batch normalization layer to increase the stability of the networks (Ioffe and Szegedy 2015). To enable fast convergence, we pre-train an autoencoder and use the learned weights to initialize the deep network parameters γ and ϕ. We run the gradient descent algorithm with a learning rate of 0.001 until the relative improvement of J(γ, ϕ) falls below a sufficiently small threshold (e.g., 0.001) or the algorithm reaches the maximum number of iterations (e.g., 2,000). For benchmark models

BCCwords and DCAM, we use the code released by the authors8. For MV, iBCC, and CDAE, we implement the models based on the descriptions of their corresponding papers. In our datasets, each instance receives more than five annotations on average, whereas this value may be lower in real-world scenarios. To examine the robustness of each model to label cardinality and make our approach practically applicable, we randomly sample annotations with the sampling rate varying from 0.6 to 1. Last, all methods were evaluated for 10 runs with a different randomization seed in each run.
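Under the configuration described above (input dimension D, latent dimension d = 40, Δ = (D − d)/3, batch normalization after each hidden layer), the encoder and decoder can be sketched in Keras as below. This is a minimal reconstruction from the stated layer sizes; the activation functions and the linear output layers are our assumptions, as they are not specified in the text:

```python
import tensorflow as tf

def build_encoder_decoder(D, d=40):
    """Fully connected encoder r(o; phi) and decoder p(o | x; gamma) with layer widths
    {D, D - delta, d + delta, d} for the encoder and the reverse order for the decoder."""
    delta = (D - d) // 3
    widths = [D, D - delta, d + delta, d]

    def stack(sizes, name):
        net = tf.keras.Sequential(name=name)
        for h in sizes[:-1]:
            net.add(tf.keras.layers.Dense(h, activation="relu"))
            net.add(tf.keras.layers.BatchNormalization())
        net.add(tf.keras.layers.Dense(sizes[-1]))   # assumed linear output layer
        return net

    return stack(widths, "encoder"), stack(list(reversed(widths)), "decoder")

# Example with BERT features of size 768 and latent dimension 40.
encoder, decoder = build_encoder_decoder(D=768)
x = encoder(tf.random.normal((4, 768)))
print(x.shape, decoder(x).shape)   # (4, 40) (4, 768)
```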

3.6.2 Hypothesis Testing

In this section, we report the results of hypothesis testing. For convenience, in each hypothesis testing, we denote the model with the assumption we want to test as “testing model” and the model without that assumption as “baseline model”. For example, when testing H1, the baseline model is MV and the testing model is iBCC.

8 For DCAM, we use the code released by the authors, except that we change the number of epochs from 500 to 2,000 because the algorithm is far from convergence in our text classification tasks at 500 epochs, and hence performs poorly.

Table 6. Significance of the Hypothesis Testing (ROC AUC as Metric)

Sample Sample Dataset H1 H2 H3 H4 H5 Dataset H1 H2 H3 H4 H5 Rate Rate 0.6 *** ** ** ^ 0.6 ^ ** ** + 0.7 *** ** ** ^ 0.7 ** ** + RT 0.8 *** ** ** ^ Forehead 0.8 ** ** ^ 0.9 *** ** ** ^ ^ 0.9 ** ** ^ 1 *** ** 1 ** ** ^ 0.6 *** ** ** 0.6 *** ** ** ** 0.7 *** ** ** 0.7 *** ** ** ** FL 0.8 *** ** ** Head 0.8 ^ ** ** *** 0.9 *** ** ** ^ 0.9 ** ** ** *** 1 *** ** ** ** 1 *** ** ** *** 0.6 *** ** ** ^ 0.6 *** ** ** ** 0.7 *** ** ^ 0.7 *** ** ** ** iPhone 0.8 *** ** ** ^ Shape 0.8 *** ** ** ** 0.9 *** ** ** ^ 0.9 *** ** ** ** 1 *** ** ^ 1 *** ** ** 0.6 *** ** ** *** 0.6 ** ** ** + 0.7 *** ** ** *** 0.7 + ** ** ^ Bill 0.8 *** ** *** Throat 0.8 ^ ** ** ^ 0.9 *** ** *** 0.9 ^ ** ** + 1 *** ** ** *** 1 Note: Empty cells mean the direction is not supported, namely, the testing model performs worse than the baseline model. When the direction is supported, ^p<+∞; +p<0.1; *p<0.05; **p<0.01; ***p<0.001.

For each hypothesis, we conducted paired t-tests (using the results of 10 runs) with varying annotation sample rates across multiple metrics in all datasets. Limited by space, we only present the results using ROC AUC as the metric. Table 6 reports the results of the paired t-tests. For easier interpretation, we also use line charts to compare the performance of the testing model and the baseline model. Figures 12 through 16 report the results of Hypotheses H1 through H5, respectively. For each line chart, the testing model is plotted using an orange line and the baseline model using a blue line. The scores are averaged over 10 runs, and the standard error of the mean (SEM) is used to draw the error bars. For better visualization, the y axis might not start at 0 or end at 1, although all metrics range from 0 to 1. To avoid interpretation biases that arise from these truncated graphs, we make sure the maximum of the y axis is larger than the minimum by at least 0.1. Detailed metric scores of all models in all datasets are presented in Table 7.
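The significance levels in Table 6 come from paired t-tests over the ten matched runs of a testing model and its baseline; a minimal sketch with SciPy (the scores below are invented for illustration):

```python
from scipy.stats import ttest_rel

# ROC AUC over 10 matched runs for a baseline model and a testing model.
baseline = [91.10, 91.25, 91.02, 91.30, 91.18, 91.05, 91.22, 91.15, 91.08, 91.27]
testing  = [93.60, 93.75, 93.55, 93.80, 93.68, 93.58, 93.72, 93.66, 93.59, 93.77]

t_stat, p_two_sided = ttest_rel(testing, baseline)
# One-sided p-value for "testing > baseline", reported only when the direction is supported.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(t_stat, p_one_sided)
```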

Table 7. Performance of Proposed Models and Benchmark Models (ROC AUC as Metric)

Dataset | Sample Rate | MV | iBCC | DARF | DARFC | S-DARFC | DARFCD | BCCwords | DCAM | CDAE
RT | 0.6 | 91.12 (0.08) | 93.70 (0.09) | 93.80 (0.14) | 93.95 (0.07) | 93.96 (0.08) | 93.87 (0.07) | 93.72 (0.06) | 93.79 (0.10) | 93.73 (0.09)
RT | 0.7 | 92.56 (0.07) | 94.45 (0.10) | 94.53 (0.12) | 94.62 (0.08) | 94.61 (0.06) | 94.64 (0.09) | 94.43 (0.09) | 94.55 (0.09) | 94.46 (0.09)
RT | 0.8 | 93.51 (0.06) | 94.98 (0.06) | 95.04 (0.08) | 95.12 (0.05) | 95.11 (0.06) | 95.16 (0.07) | 94.95 (0.07) | 95.09 (0.06) | 94.99 (0.06)
RT | 0.9 | 94.22 (0.06) | 95.41 (0.03) | 95.46 (0.05) | 95.52 (0.06) | 95.52 (0.06) | 95.55 (0.06) | 95.37 (0.04) | 95.52 (0.07) | 95.40 (0.04)
RT | 1 | 94.74 (0.00) | 95.78 (0.00) | 95.74 (0.01) | 95.87 (0.03) | 95.86 (0.01) | 95.85 (0.02) | 95.73 (0.00) | 95.87 (0.03) | 95.78 (0.03)
FL | 0.6 | 76.31 (0.24) | 85.91 (0.13) | 89.02 (0.29) | 90.99 (0.12) | 90.85 (0.10) | 90.54 (0.13) | 90.05 (0.16) | 90.97 (0.13) | 80.34 (0.00)
FL | 0.7 | 79.34 (0.19) | 87.62 (0.16) | 90.34 (0.24) | 91.76 (0.18) | 91.63 (0.16) | 91.47 (0.13) | 90.97 (0.16) | 91.76 (0.12) | 88.44 (0.75)
FL | 0.8 | 82.38 (0.18) | 89.02 (0.17) | 91.46 (0.07) | 92.36 (0.16) | 92.26 (0.12) | 92.22 (0.13) | 91.75 (0.12) | 92.37 (0.11) | 91.71 (0.19)
FL | 0.9 | 84.52 (0.18) | 90.07 (0.10) | 92.14 (0.38) | 92.86 (0.12) | 92.78 (0.06) | 92.90 (0.13) | 92.38 (0.09) | 92.88 (0.12) | 92.29 (0.20)
FL | 1 | 86.00 (0.00) | 91.37 (0.00) | 92.87 (0.15) | 93.53 (0.06) | 93.49 (0.09) | 93.73 (0.07) | 93.25 (0.00) | 93.59 (0.05) | 93.06 (0.11)
iPhone | 0.6 | 78.66 (0.27) | 81.67 (0.15) | 82.04 (0.23) | 85.72 (0.16) | 85.84 (0.16) | 85.20 (0.14) | 86.69 (0.17) | 85.44 (0.14) | 77.95 (1.13)
iPhone | 0.7 | 80.11 (0.18) | 83.45 (0.12) | 83.35 (0.36) | 86.80 (0.12) | 86.94 (0.13) | 86.27 (0.11) | 87.66 (0.15) | 86.50 (0.15) | 83.29 (0.17)
iPhone | 0.8 | 81.33 (0.12) | 84.61 (0.12) | 84.74 (0.43) | 87.37 (0.11) | 87.56 (0.09) | 86.92 (0.10) | 88.26 (0.12) | 87.14 (0.11) | 85.07 (0.15)
iPhone | 0.9 | 82.00 (0.07) | 85.51 (0.16) | 85.70 (0.22) | 87.90 (0.07) | 88.03 (0.10) | 87.38 (0.07) | 88.84 (0.08) | 87.67 (0.11) | 85.88 (0.11)
iPhone | 1 | 82.57 (0.00) | 86.80 (0.00) | 86.40 (0.07) | 88.48 (0.05) | 88.64 (0.09) | 87.92 (0.04) | 89.56 (0.00) | 88.40 (0.11) | 86.77 (0.06)
Bill | 0.6 | 84.79 (0.14) | 86.02 (0.14) | 86.38 (0.25) | 91.87 (0.28) | 93.12 (0.13) | 91.60 (0.27) | – | 91.11 (0.15) | 87.35 (0.33)
Bill | 0.7 | 86.21 (0.05) | 87.03 (0.10) | 87.32 (0.18) | 92.05 (0.29) | 93.23 (0.17) | 91.75 (0.23) | – | 91.23 (0.10) | 88.34 (0.32)
Bill | 0.8 | 87.30 (0.08) | 87.88 (0.09) | 87.85 (0.09) | 92.16 (0.26) | 93.36 (0.20) | 91.93 (0.20) | – | 91.39 (0.09) | 89.08 (0.14)
Bill | 0.9 | 88.04 (0.07) | 88.65 (0.07) | 88.59 (0.11) | 92.26 (0.23) | 93.46 (0.20) | 92.00 (0.21) | – | 91.61 (0.08) | 89.31 (0.15)
Bill | 1 | 88.48 (0.00) | 89.24 (0.00) | 89.38 (0.15) | 92.31 (0.28) | 93.55 (0.17) | 92.03 (0.22) | – | 91.80 (0.04) | 89.81 (0.16)
Forehead | 0.6 | 87.52 (0.10) | 87.68 (0.11) | 88.53 (0.16) | 91.60 (0.17) | 91.98 (0.08) | 91.58 (0.12) | – | 91.68 (0.10) | 88.65 (0.19)
Forehead | 0.7 | 88.39 (0.09) | 88.33 (0.10) | 89.40 (0.15) | 91.71 (0.13) | 92.13 (0.09) | 91.75 (0.13) | – | 91.72 (0.09) | 89.39 (0.11)
Forehead | 0.8 | 89.03 (0.09) | 88.91 (0.09) | 89.54 (0.09) | 91.75 (0.11) | 92.08 (0.12) | 91.83 (0.08) | – | 91.85 (0.05) | 89.82 (0.10)
Forehead | 0.9 | 89.60 (0.06) | 89.29 (0.06) | 89.74 (0.13) | 91.81 (0.13) | 92.18 (0.10) | 91.93 (0.10) | – | 91.94 (0.06) | 90.05 (0.08)
Forehead | 1 | 90.02 (0.00) | 89.71 (0.00) | 90.33 (0.09) | 91.86 (0.12) | 92.23 (0.14) | 92.02 (0.12) | – | 92.19 (0.05) | 90.33 (0.05)
Head | 0.6 | 87.43 (0.15) | 88.39 (0.12) | 89.64 (0.25) | 93.93 (0.21) | 94.42 (0.32) | 93.89 (0.21) | – | 93.14 (0.11) | 90.85 (0.31)
Head | 0.7 | 88.92 (0.09) | 89.35 (0.09) | 90.48 (0.14) | 94.26 (0.21) | 94.71 (0.36) | 94.20 (0.19) | – | 93.29 (0.16) | 91.94 (0.18)
Head | 0.8 | 90.15 (0.07) | 90.28 (0.05) | 90.92 (0.13) | 94.47 (0.21) | 94.92 (0.31) | 94.50 (0.16) | – | 93.55 (0.18) | 92.41 (0.16)
Head | 0.9 | 90.97 (0.04) | 91.09 (0.05) | 91.56 (0.14) | 94.60 (0.29) | 95.07 (0.36) | 94.62 (0.19) | – | 93.80 (0.09) | 92.53 (0.12)
Head | 1 | 91.52 (0.00) | 91.75 (0.00) | 92.14 (0.02) | 94.68 (0.25) | 95.18 (0.35) | 94.70 (0.21) | – | 94.02 (0.06) | 92.93 (0.09)
Shape | 0.6 | 91.63 (0.18) | 92.99 (0.21) | 94.36 (0.54) | 96.09 (0.20) | 96.48 (0.22) | 95.97 (0.21) | – | 95.86 (0.13) | 94.80 (0.70)
Shape | 0.7 | 92.76 (0.12) | 93.65 (0.13) | 93.66 (0.08) | 96.17 (0.22) | 96.55 (0.21) | 96.03 (0.22) | – | 95.96 (0.09) | 95.11 (0.36)
Shape | 0.8 | 93.57 (0.10) | 94.19 (0.11) | 94.50 (0.04) | 96.30 (0.19) | 96.67 (0.15) | 96.14 (0.22) | – | 96.10 (0.05) | 95.58 (0.11)
Shape | 0.9 | 94.13 (0.07) | 94.63 (0.07) | 94.89 (0.17) | 96.36 (0.22) | 96.74 (0.15) | 96.21 (0.23) | – | 96.22 (0.06) | 95.66 (0.10)
Shape | 1 | 94.54 (0.00) | 95.08 (0.00) | 95.04 (0.17) | 96.44 (0.18) | 96.84 (0.15) | 96.34 (0.23) | – | 96.39 (0.05) | 96.01 (0.12)
Throat | 0.6 | 88.64 (0.12) | 88.99 (0.06) | 89.44 (0.07) | 92.94 (0.18) | 93.50 (0.11) | 92.80 (0.14) | – | 92.73 (0.10) | 90.32 (0.10)
Throat | 0.7 | 89.61 (0.09) | 89.88 (0.09) | 90.28 (0.07) | 93.09 (0.16) | 93.59 (0.12) | 92.95 (0.12) | – | 92.84 (0.10) | 91.13 (0.12)
Throat | 0.8 | 90.26 (0.07) | 90.45 (0.05) | 90.65 (0.10) | 93.17 (0.20) | 93.58 (0.12) | 93.03 (0.15) | – | 92.92 (0.08) | 91.49 (0.10)
Throat | 0.9 | 90.74 (0.07) | 90.85 (0.05) | 90.89 (0.05) | 93.20 (0.23) | 93.69 (0.14) | 92.98 (0.17) | – | 93.08 (0.05) | 91.71 (0.08)
Throat | 1 | 91.02 (0.00) | 91.19 (0.00) | 91.42 (0.03) | 93.24 (0.20) | 93.72 (0.13) | 93.08 (0.17) | – | 93.27 (0.04) | 91.91 (0.04)
Note: “–” indicates that BCCwords is not applicable because it works only when the task is in the form of text (see footnote 9).

Overall, the results from Table 6 and Figures 12 through 15 lend support to Hypotheses H1 through H4: on most tasks, the models with one more assumption (orange lines in Figures 12 through 15) perform significantly better than the ones without that assumption (blue lines in Figures 12 through 15), namely, S-DARFC > DARFC > DARF > iBCC > MV. It should be noted that it is enough to claim a hypothesis is generally supported as long as it is supported by most datasets. For H1, iBCC performs significantly better than MV in all datasets except Forehead and Throat. This shows that considering worker reliability is helpful in the annotation aggregation problem. H2 and H3 are supported on all datasets. The performance gaps in Figure 13 and Figure 14 between the testing models and the baseline models are attributed to considering the task features and the task clustering structure, respectively. Compared with the results of testing H2 in Figure 13, the performance gap of testing H3 in Figure 14 is significantly larger. This suggests that considering the potential clustering is a valuable way to further exploit the task features. The results of testing H4 (Table 6 and Figure 15) suggest that, although less significant, it is still helpful to use semi-crowdsourcing to facilitate the hidden representation learning, especially for the image classification tasks. This can be interpreted as the images having more transferable patterns in the learned hidden representations. Meanwhile, when testing H1 and H3, the performance gaps between the testing models and the baseline models increase as a function of decreased annotation sample rate. This suggests that considering worker reliability and the task clustering structure makes the aggregation algorithm less susceptible to the number of received annotations. Last, although some hypotheses are not supported in some datasets (e.g., Figure 15b for testing H4 in dataset FL), we observe that the testing models perform at least comparably to the baseline models. This makes our approach practically applicable. In real-world scenarios, we usually do not know in advance which hypothesis is supported in which dataset. Our observation suggests that, as a rule of thumb, blindly considering H1 through H4 is feasible in most cases.

Figure 12. Testing H1 (ROC AUC as Metric). Panels: (a) RT, (b) FL, (c) iPhone, (d) Bill, (e) Forehead, (f) Head, (g) Shape, (h) Throat.

Figure 13. Testing H2 (ROC AUC as Metric). Panels: (a) RT, (b) FL, (c) iPhone, (d) Bill, (e) Forehead, (f) Head, (g) Shape, (h) Throat.

Figure 14. Testing H3 (ROC AUC as Metric). Panels: (a) RT, (b) FL, (c) iPhone, (d) Bill, (e) Forehead, (f) Head, (g) Shape, (h) Throat.

Figure 15. Testing H4 (ROC AUC as Metric). Panels: (a) RT, (b) FL, (c) iPhone, (d) Bill, (e) Forehead, (f) Head, (g) Shape, (h) Throat.

Table 6 and Figure 16 provide the results of testing H5. The results show that we do not have enough evidence to support H5, indicating that considering the task difficulty may not contribute to the annotation aggregation problem. In some datasets (e.g., RT, FL, Forehead, and Head), DARFCD performs comparably to DARFC, whereas in some other datasets (e.g., iPhone, Bill, and Throat), DARFCD is even outperformed by DARFC with significant margins. This result seems contradictory to the conclusion in the literature where task difficulty is exploited in the annotation aggregation problem (Bi et al. 2014; Kajino et al. 2012; Wang et al. 2017; Welinder, Branson, Perona, et al. 2010; Whitehill et al. 2009). However, as far as we know, none of the previous studies compared their models with versions that turn off the task difficulty component. Hence, although their final performance increases over the benchmarks, it is unclear whether the improvement is due to the consideration of task difficulty. Our incremental design and systematic evaluation enable us to evaluate the usefulness of considering task difficulty in a principled way. Currently, our experimental results based on multiple datasets do not provide enough evidence for H5. This inconsistency with the literature also sheds light on how to design learning from crowds in the future. Before claiming the usefulness of an assumption, it is important to systematically evaluate it using a flexible framework that is capable of turning on and off that factor.


Figure 16. Testing H5 (ROC AUC as Metric). Panels: (a) RT, (b) FL, (c) iPhone, (d) Bill, (e) Forehead, (f) Head, (g) Shape, (h) Throat.

3.6.3 Comparison with Benchmark Models

To show the effectiveness of our approach, we compare the model considering all supported hypotheses (i.e., S-DARFC with H1 through H4) with the state-of-the-art benchmark models.

Similar to the hypothesis testing part, we vary the annotation sample rate from 0.6 to 1 and use line charts to compare the performance across metrics ROC AUC, F1 score, and Accuracy. The results using ROC AUC as the metric are reported in Figure 17. Detailed metric scores of the benchmark models in all datasets are also presented in Table 7.

In text classification tasks, S-DARFC performs comparably to (when evaluated by ROC

AUC) or slightly better than (when evaluated by F1 score) DCAM and BCCwords. They are followed by CDAE, iBCC, and MV. In image classification tasks, our proposed model S-

DARFC consistently outperforms all other benchmark models with significant margins9. It is as

9 BCCwords is not included in the comparison because it works only when the task is in the form of text.

expected because our model considers more hypotheses that have been tested to be useful in the previous hypothesis testing part. In addition, we observe that the performance gap between S-

DARFC and the benchmark models tends to increase as each instance receives fewer annotations. This indicates our method is less vulnerable to the number of received annotations.

Figure 17. Comparison with Benchmark Models (ROC AUC as Metric). Panels: (a) RT, (b) FL, (c) iPhone, (d) Bill, (e) Forehead, (f) Head, (g) Shape, (h) Throat.

3.6.4 Analysis and Discussion

We have verified that considering task clustering contributes to the annotation aggregation problem (H3). In our experiment, the number of clusters 퐾 is set to 200. To examine the robustness of our framework to the value of 퐾, we vary it from 2 to 300 and visualize the performance of DARFC. The results using ROC AUC as the metric are reported in Figure 18.

The results show that as 퐾 increases, the performance of DARFC increases and stabilizes once 퐾 reaches a sufficiently large value (e.g., 100). Hence, our approach is robust to the selection of

퐾 as long as 퐾 is large enough. It should be noted that there is no overfitting if we choose a large

퐾 because our variational treatment will automatically determine the optimal number of mixture

components (Bishop 2006; Y. Luo et al. 2018).

Figure 18. Performance of DARFC with Varying K (ROC AUC as Metric). Panels: (a) RT, (b) FL, (c) iPhone, (d) Bill, (e) Forehead, (f) Head, (g) Shape, (h) Throat.

To illustrate the quality of clustering, we visualize the clusters learned by the S-DARFC model and examine whether the instances form meaningful clusters. For a clear view, we use t-

SNE to map the learned hidden representation to two-dimensional space (Maaten and Hinton

2008). First, we visualize the top 10 clusters in Figure 19 based on the number of images within each cluster. The results suggest that, for the image classification tasks, instances within the same cluster are close to each other, and instances belonging to different clusters are far away from each other10. We examine the images within each cluster and find that one cluster tends to contain only a few bird species. For example, in cluster 9, 96.88% of the images belong to “blue winged warbler” (40.63%) or “prothonotary warbler” (56.25%), which have the same throat color. In the datasets FL and iPhone, although less significant, we are still able to find

10 We observe the same pattern on other image classification tasks.

meaningful clusters. For example, all sentences in the first cluster of FL follow a template of the form “FOREO LUNA GO REVIEW #…”, where the remaining parts might be replaced with other content. The clusters are less distinguishable in dataset RT. This explains why the performance gap between DARFC and S-DARFC is quite marginal, as shown in Figure 14.

Because the category of each bird image in the image classification tasks is available, we are able to examine the quality of the learned clusters in another way. Because the results are similar across the image datasets, we use the Throat dataset as an example. We choose the top 20 bird species and visualize the images of these bird species in Figure 20. We find that images belonging to the same species tend to be close to each other in the space of hidden representations. For the text classification tasks, as no external category label is available, we are unable to similarly visualize each passage of text based on category.
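Two-dimensional maps of this kind can be produced from the learned hidden representations with scikit-learn's t-SNE; the sketch below illustrates the procedure with random stand-ins for the representations and cluster assignments (our own code, not the script used to produce Figures 19 and 20):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 40))        # stand-in for the learned 40-dimensional representations
clusters = rng.integers(0, 10, size=500)   # stand-in for S-DARFC cluster assignments

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(hidden)
plt.scatter(embedding[:, 0], embedding[:, 1], c=clusters, s=5, cmap="tab10")
plt.title("t-SNE of learned hidden representations (top 10 clusters)")
plt.savefig("tsne_clusters.png", dpi=150)
```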

Figure 19. Visualization of the Top 10 Clusters Learned by S-DARFC. Panels: (a) RT, (b) FL, (c) iPhone, (d) Throat.

Figure 20. Visualization of Top 20 Bird Species Learned by S-DARFC in Dataset Throat

3.7 Conclusions

Microtask crowdsourcing is a promising approach to obtaining large-scale, high-quality labeled data across a wide range of business applications. Although models with various underlying assumptions have been proposed in the literature to aggregate crowd efforts and achieve certain cumulative goals, the current design of learning from crowds often makes simple or constrained assumptions and lacks interpretability and generalizability in its conclusions. To provide a set of generalizable practices, we first formulate several general hypotheses, including the usefulness of worker reliability, task features, task clustering structure, semi-crowdsourcing design, and task difficulty. To test these hypotheses, we propose a flexible deep graphical framework that enables incremental design. Such a framework allows us to evaluate the underlying assumptions in a principled way by conducting before-and-after evaluations. This deep framework also allows us to make less constrained and hence more useful assumptions by modeling complicated non-linear relationships with deep learning. To estimate the parameters, we develop an inference algorithm combining variational message passing and amortized learning. Last, we empirically test these hypotheses with eight real-world tasks, including text and image classification. The experimental results also indicate the superiority of our framework in comparison with the state-of-the-art benchmark models.

Our research is of both theoretical and practical value. Theoretically, we propose and empirically test several hypotheses for the design of learning from crowds. Specifically, we show that considering worker heterogeneity in quality, task features, task clustering structure, and semi-crowdsourcing design could contribute to the task of annotation aggregation. Different from the existing literature where task difficulty is used, we find no evidence for the usefulness of considering task difficulty. Our research also provides some theoretical implications for future research design. First, we provide a general research pipeline for design science research to obtain generalizable conclusions. We design a flexible framework that enables incremental design and rigorously evaluate our proposed technical insights. Most of the current design science research primarily focuses on performance improvement yet ignores the interpretability and generalizability of such improvement, which may lead to potentially incorrect conclusions.

For example, our conclusion regarding the effectiveness of task difficulty is different from the

existing literature. We argue that our conclusion is more likely to be true compared with the previous literature because we evaluate these hypotheses in a principled way by conducting before-and-after evaluations, whereas the previous studies did not track whether the final performance improvement is attributable to the newly added factors in the design. As such, our research not only contributes to relevant business applications but also delivers additions to the knowledge base, which fits the Information Systems (IS) research framework (Hevner et al.

2004). Second, our research also has methodological implications for relevant studies. It provides an example of how to combine deep learning and graphical models in a unified framework. Both deep learning techniques and graphical models hold unique advantages for modeling real-world problems. Deep learning techniques are able to learn a flexible representation of complex high-dimensional data while graphical models are good at finding a structure that is interpretable and generalizes to new tasks easily (Johnson et al. 2016).

Integrating the two combines their complementary strengths and fits a broader range of applications.

Our research delivers many practical implications and actionable insights for relevant stakeholders. First, our proposed framework serves as a cost-effective approach for aggregating crowd efforts. As our empirical evaluation shows, our framework significantly outperforms the state-of-the-art methods in the literature. Such performance gaps will immediately translate into large differences in cost because businesses usually rely on data collection daily (Wang et al.

2017). Second, although the primary goal of our model is to infer the ground-truth labels, we can also estimate the reliability of participating workers. Such information is of great value and importance, as the managers of crowdsourcing platforms can utilize it to filter out unreliable crowd workers or build performance-based payments as incentive mechanisms (Ho et al. 2015), fostering the creation of a healthier and more sustainable crowdsourcing market (Wang et al. 2017). Last, our tested hypotheses can serve as general and trustworthy practices for the future design of learning from crowds in many other similar scenarios, including single-label (i.e., binary or multi-class) truth inference (Wang et al. 2017), multi-label truth inference (Li et al.

2019), real-valued truth inference (Raykar et al. 2010), and many others.

We conclude our paper by presenting its limitations and future directions. First, we choose the five most frequently considered assumptions from the literature and formulate a hypothesis for each assumption. There are still some other assumptions, such as the grouping of workers (Moreno et al. 2015), that may contribute to the task of learning from crowds. Given the flexibility of our proposed Bayesian framework in incorporating new relational structures, we plan to investigate more assumptions to further improve the performance in the future. Second, as a proof of concept, this study uses the most representative scenario (i.e., binary tasks) as an example to test the proposed hypotheses. Given that microtask crowdsourcing is currently used in a wide range of applications, we plan to extend our study to more scenarios. Third, our results regarding the effectiveness of task difficulty (i.e., H5) seem to provide empirical evidence against the findings in the literature. It would be interesting to theoretically explore this issue. Fourth, limited by computational resources, all experimental results are averaged over 10 runs. In the future, it will be valuable to explore the appropriate sample size for adequate statistical power in our significance tests. Last, the hypotheses are tested within our deep generative framework only, which may somewhat limit their generalizability. In the future, we plan to test the hypotheses across different frameworks.

4. ESSAY III: COMBINING CROWD AND MACHINE INTELLIGENCE TO DETECT

FALSE NEWS IN SOCIAL MEDIA

4.1 Introduction

False news, which refers to any false news article or message that is published and propagated through media and has an assertion in it, is now viewed as one of the largest concerns globally (Shu et al. 2017; Vosoughi et al. 2018). In the US, almost one third of people reported that they had been exposed to fake news11 in the past week, and the rate is as high as 49% in some countries such as Turkey12. The proliferation of social media platforms further facilitates the prevalence of false news across its generation, spread, and consumption (Lazer et al. 2018).

During the 2016 US election, more than one million tweets related to the fake news “Pizzagate” were created in the first month (Shu et al. 2017); the posts regarding the top twenty popular election-related news stories even outnumbered those related to the top twenty mainstream news stories. The prevalence of false news has seriously affected individuals and society. Individuals have shown declining trust in the news media over the past years13. In general, only 44% of people trust news, and social media news in particular is seen as even more unreliable (22%). Such a phenomenon is envisioned as a growing problem for business, as social media now serves as a critical channel for e-commerce activities such as advertising. Our economies are not immune to false news either, because investors frequently rely on news to make investment decisions. For example, a false tweet saying that

Barack Obama was injured in an explosion wiped out $130 billion in stock value in the blink of

11 Fake news is a subset of false news that is intentionally false. See Section 4.2.2 for a clarification of false news and related concepts.
12 https://www.forbes.com/sites/niallmccarthy/2018/06/14/where-exposure-to-fake-news-is-highest-infographic/#7cadae94a4dd
13 http://media.digitalnewsreport.org/wp-content/uploads/2018/06/digital-news-report-2018.pdf

an eye (Vosoughi et al. 2018).

Firms, ranging from behemoths like Facebook and Google to small start-ups, have started taking actions to curb the false news epidemic. For example, as a social media platform that suffered incessant false news problems in recent years, Facebook allows users to flag potential false news and then resorts to third-party organizations to fact-check the flagged news (Kim et al.

2018). Machine learning algorithms are also utilized to help detect false news and accounts.

Recently, Facebook built a “war room” for safeguarding elections around the world by rooting out false news that may be intended to influence voters14. In China, Sina Weibo, one of the most popular microblogging services, also allows users to report false posts so as to maintain a healthy and sustainable community.

The academic community has also devoted much effort to tackle the false news crisis.

Existing research mainly considers two types of data sources: news content and social context

(Shu et al. 2017). Relevant features are extracted from these data sources and then fed into machine learning algorithms. Overall, the current effort to detect false news heavily relies on machine intelligence (e.g., machine learning approaches) to achieve automatic detection. Human intelligence, which holds great potential to complement machine-based methods, is largely under-explored. Actually, machines and human brains have different problem-solving capabilities and may benefit from a symbiotic existence (Kamar 2016). Machine algorithms are able to learn certain false news-related patterns from large-scale data, yet largely depend on the training data. Humans are generally more competent than machines in intelligent tasks such as natural language understanding (Demartini et al. 2017), as they understand the context and

14 https://www.nytimes.com/2018/09/19/technology/facebook-election-war-room.html

nuance (Vaughan 2018). Humans also have up-to-date knowledge, which is critical for detecting ever-changing false news. However, human intelligence is a valuable resource with high costs and constraints (Kamar 2016). Most existing approaches that exploit human intelligence for false news detection (e.g., expert-oriented and crowdsourcing-oriented fact-checking) are not applicable in large-scale scenarios. Given the potential value of under-explored human intelligence, we propose the following questions: what types of scalable human intelligence can be exploited; how useful are they; how can such intelligence be efficiently utilized; and how can human and machine intelligence be combined symbiotically?

Premised on the online disinhibition effect from cyberpsychology literature (Suler 2004) and the wisdom of crowds (Surowiecki 2005), we propose to incorporate two types of scalable crowd intelligence. The online disinhibition effect suggests that online users will express themselves more openly in the less restrained Web environment (Suler 2004). In social media, when users are exposed to a piece of news with which they personally disagree, some of them may honestly share such opinions through actions such as commenting or reporting. Aggregating these crowd opinions can be beneficial to help detect false news, as confirmed in prior literature that has successfully taken advantage of the wisdom of crowds in various online scenarios such as microtask crowdsourcing (Wang et al. 2017), prediction markets (Chen et al. 2017), and social reporting (Oh et al. 2013). We demonstrate our idea with a real example from

Sina Weibo. A Weibo user posted a piece of news about Typhoon Mangkhut, as shown in Figure

21. This news was reported by two crowd users and then verified as false news by the Sina community management center15. In the upper right corner, users can click “report” to flag this

15 https://service.account.weibo.com/

post as false information, a personal attack, or one of many other types. Weibo records how many times a post is reported as false information. The right panel lists some of the responses, where many users debunked the false news via responses (marked with underlines). This example indicates that crowd users may contribute their intelligence (i.e., judgments as to the veracity of news) by posting responses or reporting in social media. We can potentially use such human intelligence to help us recognize false news. In addition, such information is scalable because reporting and commenting often occur voluntarily alongside posting. However, they are not completely reliable, because not every reader will report a false post or negate it by commenting, and readers may also mistakenly report or debunk a piece of true news. In this research, we combine these two types of less reliable yet scalable human intelligence with machine intelligence to help detect false news. Specifically, we propose a novel framework called Crowd-powered fAlse News Detection (CAND), which first extracts machine, human, and hybrid judgments from news features, reports, and received responses, and then aggregates the extracted judgments with an unsupervised Bayesian result aggregation model to obtain the final prediction. By evaluating the proposed framework with two real-world datasets from Weibo and Twitter, this research demonstrates the effectiveness of crowd intelligence in fighting false news and the superior performance of our framework in utilizing human intelligence compared with the benchmark methods. Our analysis also generates many valuable insights, such as the complementary value of human and machine intelligence, the possibility of using human intelligence for early detection, and the robustness of our approach to intentional manipulation.


Figure 21. An Example of Human Intelligence in Responses and Reports

Note: We have blurred all private information. “[bless]” means an emoji and “[loc: Shenzhen]” refers to a hyperlink about the location. In the right panel, the debunking responses are underlined.

The rest of this paper is organized as follows. Section 4.2 reviews relevant literature.

Section 4.3 details the proposed CAND framework. Section 4.4 evaluates the proposed framework by describing the experimental design and showing the empirical results. We conclude the paper by presenting contributions, implications, limitations, and future research directions in Section 4.5.

4.2 Related Work

In this section, we survey several streams of relevant literature. We first review related literature that serves as the theoretical foundations of this work. Second, we summarize existing studies about false news detection in social media. Third, we review false news studies that are

related to crowd intelligence. Since the proposed framework needs to aggregate the extracted judgments, we finally review the literature on information aggregation.

4.2.1 Theoretical Foundations

Our proposed approach builds on the online disinhibition effect from the cyberpsychology literature (Suler 2004) and the wisdom of crowds (Surowiecki 2005). The online disinhibition effect demonstrates that some people feel less restrained and express themselves more openly while online compared with communicating in person (Suler 2004). This effect is triggered by characteristics of the online environment such as anonymity and asynchronous communication, and could be manifested in both positive and negative directions (Lapidot-Lefler and Barak 2012; Suler 2004). In our scenario, we propose to utilize crowd opinions to help detect false news. The online disinhibition effect ensures that at least some users will post their judgments honestly (e.g., by commenting and reporting) when reading a piece of news that is against their prior beliefs; when communicating in person, they may refuse to point out that someone else is wrong.

The wisdom of crowds (a.k.a. collective intelligence or crowd wisdom) refers to the theory that large groups of individuals are often collectively smarter than any single member, even an expert individual, in many tasks such as problem solving, decision making, and predicting

(Surowiecki 2005). Crowdsourcing is the most prominent and successful practice that unleashes the wisdom of crowds16. It is an online and distributed problem-solving and production model that leverages the collective intelligence of online communities to finish specific tasks (Brabham

16 Crowd wisdom and crowdsourcing are sometimes treated as the same (Doan et al. 2011). We distinguish them in that crowd wisdom is a conceptual-level idea, while crowdsourcing is a problem-solving practice that leverages crowd wisdom.

2013). Such a model (and therefore the wisdom of crowds) has been successfully applied in a wide array of applications such as crowdsourcing marketplaces (e.g., Amazon Mechanical Turk), crowdfunding (e.g., Kickstarter), and user-generated content (e.g., Yelp). Information Systems

(IS) researchers have also devoted much effort to studying the wisdom of crowds and its applications (Atanasov et al. 2017; Bayus 2013; Lee et al. 2018; Lukyanenko et al. 2014; Wang et al. 2017). In this research, we tap into the wisdom of crowds and propose to utilize scalable crowd judgments (i.e., responses and reports in social media) in the task of false news detection.

Note that a single user’s intelligence may be unreliable. For example, not all users will debunk the news in their responses; even if they do so, their debunking may be unreliable for various reasons including political affiliation, ambiguous news, intentional or unintentional mistakes, etc.

In our proposed approach, this issue is mitigated by modeling the credibility of humans and aggregating the judgments from many crowd users. The premise of our approach is that based on the wisdom of crowds, false news and true news have different debunking and reporting patterns after aggregating individuals’ opinions.

4.2.2 Definition of False News and Its Classification

There are no universal definitions for false news and related concepts, including fake news, misinformation, disinformation, and rumor. We clarify these concepts based on the most recent literature. First, we need to define the term news. Following the literature (Vosoughi et al. 2018), we purposefully use a broad definition. News refers to any news article or message that is published and propagated through media and has an assertion in it (Sharma et al. 2019; Vosoughi et al.

2018; Zhou and Zafarani 2018). In our research context, we treat each Weibo post or tweet as news. False news refers to news that contains false information. Based on the intention, false news can be divided into fake news (intentionally and verifiably false news) and unintentionally

99 false news (Sharma et al. 2019).

In the literature, most classification schemes are about false information. Although information is a broader concept than news, we focus on the classification of false information in the following because we can easily apply these classification schemes to false news by limiting the context to news.

Based on the intention, false information can be divided into misinformation and . Misinformation refers to the “unintentionally spread false information that can be a result of misrepresentation or misunderstanding stemming from cognitive biases or lack of understanding or attention,” and disinformation refers to the “false information created and spread specifically with the intention to deceive” (Kumar and Shah 2018; Sharma et al. 2019;

Wardle 2017).

Based on the type of knowledge, false information can be divided into opinion-based and fact-based. The most common opinion-based false information is fake reviews on e-commerce websites, where a unique ground truth does not exist, while fact-based false information consists of lies about entities that have a unique ground truth value (Kumar and Shah 2018).

Based on the means employed to falsify information, false information can be divided into seven potentially overlapping types: “satire or parody (i.e., no intention to cause harm but has potential to fool), fabricated content (i.e., completely false), misleading content (i.e., misleading use of information to frame an issue), imposter content (i.e., genuine sources impersonated with false sources), manipulated content (i.e., genuine information or imagery manipulated to deceive), false connection (i.e., headlines, visuals, or captions that do not support the content), and false context (i.e., genuine content shared with false contextual information)” (Sharma et al.

2019; Wardle 2017).

Another concept related to false news is rumor. The definition of rumor from the Oxford

Dictionary is “a currently circulating story or report of uncertain or doubtful truth,” and it’s also defined as the “social phenomena of a news story or claim spreading or diffusing” through social media (Vosoughi et al. 2018). Overall, for rumor, we are not sure about the veracity and the intention.

In this work, we focus on the veracity of news, regardless of the intention and the means of falsifying information, because intention is often unverifiable (Vosoughi et al. 2018).

As the definition in the literature often varies, we may still use the phrase “fake news” or

“rumor” in certain cases, depending on the original context that we are referring to.

4.2.3 Computational False News Detection in Social Media

In the existing literature about false news detection in social media, two major types of information, i.e., news content (e.g., headline and news text) and the social context of news (e.g., users’ information and social engagement with news), are leveraged to detect false news. Based on the data source, approaches for false news detection can be divided into two categories: news content-based and social context-based (Shu et al. 2017).

In news content-based approaches, useful linguistic or visual features are extracted from news content such as news source, headline, body text, image, or video. Linguistic features, including lexicon-level features (e.g., word number and frequency of special words), syntax- level features (e.g., parts-of-speech tagging), and other language features, serve as signals to identify certain writing styles in false news. Visual features (e.g., clarity score and coherence score) are also used for false news detection (Gupta et al. 2013; Jin et al. 2017), since visual cues have been shown to be an important manipulator in fake news (Castillo et al. 2014).

Based on the extracted features, several categories of models are designed to detect false news:

knowledge-based, style-based, and deep learning-based. Knowledge-based approaches use external sources (e.g., knowledge graphs) to fact-check proposed claims in news content. Style-based approaches capture false news-related manipulators from the perspective of writing style, such as deception (Wang 2017) and non-objectivity (Potthast et al. 2018). In addition, deep learning has been recently used for detecting multi-source or multimodal fake news (Karimi et al. 2018; Wang et al. 2018), as deep learning enables automatically extracting latent features from data in place of laborious and time-consuming manual feature extraction.

In social context-based approaches, four major types of features (i.e., user-based, post-based, network-based, and flagging-based) are extracted from the social context. First, user-based features extracted from user profiles are used to measure user characteristics and credibility (Shu et al. 2018), since user profiles include some clues to infer social bots or cyborgs that create or spread false news (Shu et al. 2019). For instance, an unsupervised Bayesian approach is proposed to simultaneously model the truth of news and user credibility (Yang et al. 2019).

Second, post-based features are useful information extracted from relevant social media posts such as posts regarding a news event or responses to these posts. These posts are useful because people often express their emotions or opinions towards news in their social media posts

(Shu et al. 2017). To efficiently extract social stance from the unstructured text data, deep learning-based approaches are widely used. In order to identify rumors, recurrent neural network-based methods are used to learn the hidden representations that capture the social responses in a series of relevant posts over time (Ma et al. 2016). More advanced deep learning techniques, e.g., attention mechanism, are also used in recent literature. For example, a hierarchical neural network combined with social information and attention mechanism shows outstanding performance in rumor detection (Guo et al. 2018). When user responses are rare at

the early stage of news propagation, the neural generative model can be used to generate user responses to news articles to enable early detection of false news (Qian et al. 2018). False news detection can also be jointly trained with stance/opinion classification under a multi-task learning framework, so as to take advantage of the shared task-invariant features (Kochkina et al. 2018;

Ma et al. 2018).

Third, network-based features are often extracted from the interactions between users and news by building specific networks in terms of stance (Jin et al. 2016; Tacchini et al. 2017), co-occurrence (Ruchansky et al. 2017), and so on. For instance, in a hybrid model for fake news detection, Ruchansky et al. (2017) construct a weighted co-occurrence network where an edge denotes the number of articles with which two users have both engaged. After extraction, these three types of features are further fed into false news detection models. Existing models include propagation-based and stance-based models. Propagation-based models exploit propagation methods (e.g., PageRank-like algorithms) to predict the veracity of a news event from the credibility of its relevant social media posts (Jin et al. 2014). In stance-based models, users’ stances from relevant social media posts are utilized to infer the veracity of news articles (Tacchini et al. 2017).

Fourth, news reporting/flagging is mainly used for early identification and containment of false news in the literature (Sharma et al. 2019). For example, the marked temporal point process is used to model the tradeoff between the number of received flags and the exposure of fake news (Kim et al. 2018). Based on the framework, an online algorithm was developed to decide which stories to send for fact-checking and when to do so, so as to efficiently reduce the spread of fake news. Along the same direction, a Bayesian approach is developed to jointly detect fake news and learn about users’ flagging accuracy over time (Tschiatschek et al. 2018).

Currently, human responses and reports are still largely under-explored, especially in terms of how to efficiently utilize such information. Compared with the existing literature in which responses and reports are used, our work differs significantly in how such data sources are used. Human responses are often treated as features of black-box methods in the literature (Guo et al. 2018; Ma et al. 2016), whereas we incorporate more interpretable structures in our Bayesian approach. For example, false news will intuitively receive more debunking responses than true news. In black-box models, responses are fed into the model with the hope that the algorithms will learn such patterns from the data. In our design, however, we explicitly model the part that carries an interpretable structure and leave the other parts to black-box models in task T1. Specifically, we assume that false and true news have different debunking patterns, which are captured by the logistic-normal assumption. For the unstructured data (e.g., the news body and news responses), we use deep learning methods. In the literature, human reports are often used separately to explore the early detection of false news (Kim et al. 2018; Tschiatschek et al. 2018). In this work, we use an interpretable structure by assuming different reporting patterns for false and true news. The learned pattern is then combined with feature-based machine judgments and human intelligence in responses to generate a final prediction. In Section 4.5, we will show that our approach possesses several advantages over the existing literature.

4.2.4 False News Studies Using Crowd Wisdom

As we have mentioned, we leverage crowd wisdom to help fight false news. Hence, in this part, we review false news studies that use crowd wisdom. Currently, there are three major ways to utilize crowd wisdom. First, crowdsourcing-oriented fact-checking resorts to crowdsourcing platforms to have ordinary people fact-check potential false news (Shu et al. 2017). For example, Fiskkit, an online commenting platform, allows users to discuss and annotate the veracity of a news article by rating or tagging it. Although such a fact-checking method is scalable, it cannot identify false news in real time. Second, crowd wisdom is used in computational studies regarding false news detection. The two most frequently explored data sources are crowd responses and reporting/flagging. Relevant literature was introduced in the last subsection regarding post-based and flagging-based features. Last, crowdsourcing marketplaces, primarily Amazon Mechanical Turk (AMT), are widely used in explanatory social science studies regarding false news (Pennycook and Rand 2018, 2019a). For example, to verify whether laypeople's judgments are reliable indicators of news source quality, thousands of people were recruited from AMT and Lucid to participate in two preregistered experiments in which individuals rated familiarity with and trust in 60 news sources (Pennycook and Rand 2019b).

Our work belongs to the second category: we aim to develop a practical computational method to detect false news by combining two types of crowd wisdom (i.e., user responses and reporting/flagging information) with feature-based machine learning methods.

4.2.5 Information Aggregation

In our proposed approach, we attempt to aggregate judgments from humans and machines in order to predict the veracity of news. Information aggregation techniques have been widely studied in many scenarios under different names. One well-documented scenario is classifier combination, where each classifier outputs a result for the same machine learning task and the results are then aggregated (Tulyakov et al. 2008). In this stream, many ensemble methods (e.g., bagging and boosting) and their variants consider both how to construct the classifiers and how to aggregate the results. We only review the literature on result aggregation where a fixed set of classifiers is given, since the goal of our setting is to combine human and machine judgments. In addition, we focus on relevant literature where the goal of the task is classification and the classifier outputs are prediction scores or labels. To combine prediction scores, simple rules (e.g., the sum rule, product rule, max rule, and min rule) are frequently used and have shown good performance in many tasks (Mohandes et al. 2018). To aggregate prediction classes, voting-based strategies and their variants are widely used. Majority Voting (MV) selects the candidate class that has a majority (e.g., more than half of the votes in binary classification). It is extended by assigning different weights to each classifier (Mohandes et al. 2018). Some other studies take a probabilistic perspective to solve the problem by introducing parameters to govern the behaviors of classifiers and the ground truth. For example, a Bayesian model called independent Bayesian Classifier Combination (iBCC) is proposed to model the generative process of classifiers' predictions by taking into account each classifier's reliability (Kim and Ghahramani 2012).

The second major scenario is crowdsourced answer aggregation in microtask crowdsourcing, where a huge number of microtasks are assigned to potentially unreliable crowd workers through web-based crowdsourcing platforms (e.g., AMT) and the results are then aggregated (Brabham 2008; Wang et al. 2017). The most relevant scenario is data labeling, where workers are requested to assign one (single-label) or multiple (multi-label) labels to data instances in each microtask (Moreno et al. 2015; Wei et al. 2017). In fact, when the results to be aggregated are labels, classifier combination and crowdsourced answer aggregation boil down to the same technical problem; however, some application-specific features, e.g., worker grouping in the crowdsourcing setting (Moreno et al. 2015), can be exploited under different scenarios. In the seminal work, the DS (Dawid-Skene) model uses a confusion matrix to model workers' behavior, and an EM algorithm is developed to estimate the parameters (Dawid and Skene 1979). The above-mentioned iBCC model is a Bayesian extension of the DS model. These two models are further extended by considering various related factors such as task easiness (Kim and Ghahramani 2012), instance attributes (Welinder, Branson, Perona, et al. 2010), label correlation (Wei et al. 2017), and worker grouping (Moreno et al. 2015).

Our scenario is different from the aforementioned information aggregation literature because only humans or machines are involved in the literature, whereas we need to combine the extracted judgments from humans and machines in our scenario. Hence, we need to design a specific model for our setting to combine such mixed judgments.

4.3 A Crowd-Powered Framework for False News Detection

In this section, we present our proposed framework for false news detection—Crowd-powered fAlse News Detection (CAND). Figure 22 shows the overall design, which consists of two stages: information extraction and result aggregation. Without loss of generality, we represent each piece of news by its content and context features (Shu et al. 2017). For content features, we consider the news text and whether each piece of news has external links, images, and videos.17 Context features include posting time and author profiles (e.g., whether an author is a verified author and whether an author has an avatar). Author profiles are incorporated because prior work has found that features about the news source (i.e., the news author in our context) play a critical role in identifying false news (Oh et al. 2013; Sharma et al. 2019). For example, news posts from organizational Weibo accounts are less likely to be considered false than those from personal accounts. Table 8 lists all features, their descriptions, and whether they are included in the Weibo and Twitter datasets. Given a piece of news, humans may contribute their intelligence as to the veracity of the news in responses and reports. Based on the content and context features, human responses, and human reports, the information extraction stage extracts machine, human, and hybrid human-machine judgments. To symbiotically aggregate the extracted judgments, we build an effective and interpretable Bayesian result aggregation model called the Credibility-based Logistic-Normal Aggregation Model (CLNAM). In the following, we elaborate on each stage in detail.

17 We do not consider image and video content because: (1) considering it necessitates designing multimodal models for the classifiers in task T1 and hence goes beyond the scope of this work; and (2) it will not influence our evaluation, as it would increase the performance of our approach and the benchmark methods at the same time.

Table 8. News Features Considered in Weibo and Twitter Datasets

Type | Name | Description | Weibo | Twitter
Content Features | Text | Text content of the news | • | •
Content Features | Has_external_link | Whether the news has external links | • | •
Content Features | Has_image | Whether the post has images | • | •
Content Features | Has_video | Whether the post has videos | • | •
Context Features | Post_hour | The hour when the news is posted (dummy variable) | • | •
Context Features | Gender | Gender of the author | • | 
Context Features | Is_verified | Whether the author is a verified user | • | •
Context Features | VIP | VIP level (0 represents no VIP) | • | 
Context Features | Account_type | Whether the account is personal or organizational | • | 
Context Features | Has_avatar | Whether the author has an avatar | • | •
Context Features | Has_bio | Whether the author has a short bio | • | 
Context Features | Log_tenure | Log of # of days since the author joined the platform | • | 
Context Features | Log_num_followings | Log of # of users the author has followed | • | •
Context Features | Log_num_followers | Log of # of the author's followers | • | •
Context Features | Log_num_posts | Log of # of posts the author has posted | • | •
Context Features | Log_num_likes | Log of # of likes the author has received | • | 
Context Features | Log_num_lists | Log of # of lists the author has |  | •

4.3.1 Information Extraction

Suppose there are $N$ pieces of news whose veracity we want to assess (hereafter, the "test set"), where news $n$ is associated with $K^{(n)}$ response(s) and a certain number of received reports. We take news $n$ as an example to illustrate the procedure. For news features, one or more (denoted as $M$) machine learning (ML) algorithms are trained on a dataset that is independent of the test set. The trained $M$ classifiers are then used to predict the veracity of news $n$ in the test set. Let $Y_{n,m}^{(1)} \in (0,1)$ denote the predicted probability generated by classifier $m$. The results are called machine judgments from features. We formulate this task as T1: feature-based false news detection. Similarly, to extract useful knowledge from responses, we train another classifier on a response dataset that is independent of the test set. The classifier then predicts whether each response to news $n$ is a debunking response or not. Let $Y_{n,k}^{(2)} \in (0,1)$ be the predicted probability for response $k$. The output is a hybrid of human and machine intelligence. We formulate this process as task T2: debunking response detection. For the human reports, the number of received reports of news $n$ is readily available without further extraction. We denote it as $Y_n^{(3)} \in \{0, 1, \ldots\}$.

Figure 22. The Crowd-Powered Framework for False News Detection (CAND)

4.3.2 Unsupervised Bayesian Result Aggregation Model CLNAM

After information extraction, three types of extracted information are available for each piece of news in the test set: machine judgments from news features, hybrid judgments in responses, and human judgments in reports. The goal of CLNAM is to aggregate them and obtain a final prediction as to the veracity of each piece of news.

4.3.2.1 Challenges in Result Aggregation

Although information aggregation is intensively studied in the fields of classifier combination (Mohandes et al. 2018) and crowdsourced answer aggregation (Wei et al. 2017; Zhang et al. 2016), the proposed aggregation scenario possesses unique characteristics because of its hybrid human-machine nature, and hence poses new challenges. The first challenge arises from the potentially low credibility of humans and machines. Specifically, when extracting machine judgments from news features, ML algorithms may make mistakes. In responses, users may doubt true news. When a piece of news is false, only a small portion of users will negate the false news in the responses; most of them post responses that are unrelated to its veracity. In addition, even if a user debunks the news, the ML algorithm may fail to recognize it. A similar problem exists for users' reports: false news may receive no report and true news may get reported. All these potential cases make the aggregation problem more sophisticated than in the relevant literature, where only humans or machines are involved. With humans and machines working together, it is harder to pinpoint where the unreliability stems from, which necessitates a more carefully crafted scheme for result aggregation. Second, we need to design a specific model to combine multiple data sources with mixed data types (i.e., continuous and discrete values). Machine judgments from news features and hybrid judgments in responses are continuous values between 0 and 1, whereas the number of received reports is a discrete value. Our scenario is different from those in the existing information aggregation literature, where only humans or machines are involved and the data type is not mixed. Hence, we need to design a specific model for our scenario to combine such mixed data.

4.3.2.2 Technical Insights

We introduce the major technical insights that solve the above challenges. First, we take a probabilistic perspective to model the credibility of humans and machines. As the results of users' debunking behaviors are binary, we use Bernoulli distributions to depict them. Specifically, we assume users will write a debunking response to false or true news with probability $\eta$ or $\bar{\eta}$ respectively. Intuitively, a higher $\eta$ and a lower $\bar{\eta}$ indicate more reliable human intelligence in responses. To model the credibility of human intelligence in reports, we assume the number of received reports follows a Poisson distribution, which is popular for modeling the frequency of an event in a specific time interval. Specifically, for false or true news, the number of received reports follows a Poisson distribution parameterized by $\phi$ or $\bar{\phi}$ respectively. Intuitively, a higher $\phi$ and a lower $\bar{\phi}$ indicate more reliable human intelligence in reports. In the proposed scenario, modeling the behavior of ML algorithms (in both tasks T1 and T2) is equivalent to answering what the distribution of the prediction is given the ground truth 0 or 1. The most widely used solution is the two-coin model, which assumes that the classifier predicts correctly with a certain probability when the ground truth is 0 or 1 (Bragg and Weld 2013; Raykar et al. 2010). However, this incurs information loss, as it treats predicted probabilities of 0.51 and 1 as the same. To maximize the value of the predictions generated by classifiers, we assume that when the ground truth is 0 (or 1), the predicted probability follows a logistic-normal distribution (Atchison and Shen 1980), which means the logit of the predicted probability follows a normal distribution. The intuition is that many ML algorithms, including deep learning models and logistic regression, often end with a logistic function. We defend this choice by presenting empirical evidence in Figure 23, which plots the histogram of the logit of the predicted probability (blue bars) and the fitted normal distribution (orange line) under various tasks and methods. Figure 23 demonstrates that the logistic-normal distribution is able to approximate the histogram quite well, especially for SVM and CNN. Note that we do not theoretically claim that the prediction follows a logistic-normal distribution. However, compared with modeling the prediction as a Bernoulli distribution, the logistic-normal assumption retains more information. This is also empirically verified in Section 4.4.2 by the superior performance of CAND (logistic-normal assumption) in comparison with BAM (Bernoulli assumption).
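The empirical check behind Figure 23 can be reproduced in a few lines: transform a classifier's validation-set predictions to logits and fit a normal distribution to them. This is only an illustrative sketch; the function name and the use of a Kolmogorov-Smirnov statistic as a rough goodness-of-fit summary are our own choices, not part of the model:

    import numpy as np
    from scipy import stats

    def check_logistic_normal(val_probs, eps=1e-6):
        """Fit a normal distribution to the logits of predicted probabilities."""
        p = np.clip(np.asarray(val_probs, dtype=float), eps, 1 - eps)
        logits = np.log(p / (1 - p))                  # logit transform
        mu, sigma = stats.norm.fit(logits)            # maximum-likelihood normal fit
        ks_stat, p_value = stats.kstest(logits, 'norm', args=(mu, sigma))
        return mu, sigma, ks_stat, p_value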

To combine multiple data sources with mixed data types, we propose a generative Bayesian model for the underlying mixed data generation process. Specifically, all three types of judgments from the extraction stage are generated based on the unknown ground truth (i.e., whether the news is false or not). With the generative process, we infer the posterior distribution of the ground truth based on the generated data. The details are elaborated in the following section.

Figure 23. Empirical Evidence for the Logistic-Normal Assumption

Note: From left to right: SVM, CNN, LSTM, and Bi-LSTM in task T1. See details of these methods in Section 4.4.1.2.

4.3.2.3 CLNAM Model

Building on the proposed techniques for modeling the credibility of humans and machines, we define the generative process of the extracted mixed information. Take news $n$ as an example. To generate machine judgments, we first generate the ground truth $z_n$ for news $n$. We use $\gamma \in [0,1]$ to denote the probability of a piece of news being false. Conditioning on $\gamma$, the ground truth $z_n \in \{0,1\}$ follows $\mathrm{Bernoulli}(\gamma)$. Depending on the value of $z_n$, the classifiers in task T1 will behave differently (i.e., generate different probability values). When the news is false, the prediction of classifier $m$ will follow a logistic-normal distribution:

$Y_{n,m}^{(1)} \sim \mathrm{LogisticNormal}\big(\mu_m^{(1)}, (\sigma_m^{(1)})^2\big),$   (58)

where $\mu_m^{(1)}$ and $\sigma_m^{(1)}$ are the mean and standard deviation of the predicted probability's logit, respectively. When the news is true, the prediction will follow another logistic-normal distribution parameterized by $\bar{\mu}_m^{(1)}$ and $\bar{\sigma}_m^{(1)}$. Finally, following previous Bayesian modeling literature (Blei et al. 2003; He et al. 2019), we place conjugate Beta and Normal-Inverse-Gamma (NIG) priors over the Bernoulli and logistic-normal distributions, respectively:

$\gamma \sim \mathrm{Beta}(e_0, f_0),$   (59)

$\mu_m^{(1)}, (\sigma_m^{(1)})^2 \sim \mathrm{NIG}(\omega_0^{(1)}, \nu_0^{(1)}, \alpha_0^{(1)}, \beta_0^{(1)}),$   (60)

$\bar{\mu}_m^{(1)}, (\bar{\sigma}_m^{(1)})^2 \sim \mathrm{NIG}(\bar{\omega}_0^{(1)}, \bar{\nu}_0^{(1)}, \bar{\alpha}_0^{(1)}, \bar{\beta}_0^{(1)}).$   (61)

Compared with the generative process of machine judgments, generating hybrid judgments from responses differs in that: (1) the classifier here attempts to predict whether each response is a debunking response, rather than the veracity of the news, so we introduce a latent variable $d_{n,k}$ to represent whether response $k$ to news $n$ is a debunking response; and (2) only one classifier is trained to predict the debunking probability. Formally, given a piece of false or true news, users will debunk the news with probability $\eta$ or $\bar{\eta}$ respectively, namely,

$d_{n,k} \sim \mathrm{Bernoulli}\big(z_n \eta + (1 - z_n)\bar{\eta}\big).$   (62)

Then, the classifier will generate a prediction depending on the value of $d_{n,k}$. Specifically, when $d_{n,k}$ is 1,

$Y_{n,k}^{(2)} \sim \mathrm{LogisticNormal}\big(\mu^{(2)}, (\sigma^{(2)})^2\big),$   (63)

and when $d_{n,k}$ is 0, the logistic-normal distribution is parameterized by $\bar{\mu}^{(2)}$ and $\bar{\sigma}^{(2)}$. Similarly, we impose conjugate priors over the Bernoulli and logistic-normal distributions:

$\eta \sim \mathrm{Beta}(a_0, b_0), \quad \bar{\eta} \sim \mathrm{Beta}(\bar{a}_0, \bar{b}_0),$   (64)

$\mu^{(2)}, (\sigma^{(2)})^2 \sim \mathrm{NIG}(\omega_0^{(2)}, \nu_0^{(2)}, \alpha_0^{(2)}, \beta_0^{(2)}),$   (65)

$\bar{\mu}^{(2)}, (\bar{\sigma}^{(2)})^2 \sim \mathrm{NIG}(\bar{\omega}_0^{(2)}, \bar{\nu}_0^{(2)}, \bar{\alpha}_0^{(2)}, \bar{\beta}_0^{(2)}).$   (66)

The generation of human judgments in reports is straightforward. After drawing the ground truth $z_n$ from a Bernoulli distribution parameterized by $\gamma$, the model generates the number of reports based on the veracity of news $n$:

$Y_n^{(3)} \sim \mathrm{Poisson}\big(z_n \phi + (1 - z_n)\bar{\phi}\big),$   (67)

which means the average numbers of received reports for false and true news are $\phi$ and $\bar{\phi}$ respectively. Last, we impose conjugate Gamma priors over the Poisson distributions to complete our Bayesian model:

$\phi \sim \mathrm{Gamma}(g_0, h_0), \quad \bar{\phi} \sim \mathrm{Gamma}(\bar{g}_0, \bar{h}_0).$   (68)
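The generative story defined by Equations (58)-(68) can be summarized as forward-sampling code. The sketch below uses fixed, illustrative parameter values (in the full model, the logistic-normal, Bernoulli, and Poisson parameters are themselves drawn from their conjugate priors); it is meant only to make the data-generation logic concrete:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_logistic_normal(mu, sigma, size=None):
        # A logistic-normal draw is the sigmoid of a normal draw.
        return 1.0 / (1.0 + np.exp(-rng.normal(mu, sigma, size)))

    def generate_news(M=3, K=5, gamma=0.2,
                      eta=0.3, eta_bar=0.02,   # debunking rates for false/true news
                      phi=4.0, phi_bar=0.05):  # report rates for false/true news
        z = rng.binomial(1, gamma)                       # ground truth
        # Task T1: M classifiers emit logistic-normal probabilities (Eq. 58).
        Y1 = sample_logistic_normal(1.5 if z else -1.5, 1.0, M)
        # Task T2: each response is (not) debunking, then scored (Eqs. 62-63).
        d = rng.binomial(1, eta if z else eta_bar, K)
        Y2 = np.where(d == 1,
                      sample_logistic_normal(2.0, 1.0, K),
                      sample_logistic_normal(-2.0, 1.0, K))
        # Reports: Poisson with a veracity-dependent rate (Eq. 67).
        Y3 = rng.poisson(phi if z else phi_bar)
        return z, Y1, Y2, Y3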

The graphical representation of CLNAM is presented in Figure 24. The priors are not shown for clarity. At the top, the large outer plate represents the $N$ pieces of news to assess; the inner plate on the left represents the $M$ classifiers that generate probabilities as to the veracity of each piece of news. The bottom and right plates have similar meanings. The shaded circles denote observable variables, while empty circles represent latent variables that we want to infer.

Figure 24. Graphical Representation of the CLNAM Model

4.3.2.4 Model Inference

The goal of model inference is to compute the posterior of the latent variables, primarily the ground truth of each piece of news, i.e., $\mathbf{z}$, conditioning on the observable data $\mathbf{Y}^{(1)}$, $\mathbf{Y}^{(2)}$, and $\mathbf{Y}^{(3)}$ (let $\theta_o$ denote all these observable data). Unfortunately, this conditional density is intractable for exact inference because it necessitates the computation of the marginal density of the observations (i.e., $p(\theta_o)$), which requires exponential time (Wainwright and Jordan 2008). To address this issue, we resort to approximate inference and take a variational inference approach, which is a faster alternative to Markov Chain Monte Carlo (MCMC) and scales to large-scale data more easily (Blei et al. 2017). Specifically, we use coordinate ascent mean-field variational inference (Bishop 2006; Blei et al. 2017) to perform approximate inference under the CLNAM model. The key idea is to approximate the posterior distribution by a more tractable variational family called the mean-field family and hence cast the conditional inference into an optimization problem; a coordinate ascent algorithm is then applied to solve it.

For convenience, we denote all the latent variables as $\theta_l$ and all the hyperparameters (i.e., the parameters of the prior distributions) as $\theta_h$. Specifically,

$\theta_l = \{\mathbf{z}, \gamma, \mathbf{d}, \eta, \bar{\eta}, \boldsymbol{\mu}^{(1)}, \boldsymbol{\sigma}^{(1)}, \bar{\boldsymbol{\mu}}^{(1)}, \bar{\boldsymbol{\sigma}}^{(1)}, \mu^{(2)}, \sigma^{(2)}, \bar{\mu}^{(2)}, \bar{\sigma}^{(2)}, \phi, \bar{\phi}\},$   (69)

$\theta_h = \{e_0, f_0, a_0, b_0, \bar{a}_0, \bar{b}_0, \omega_0^{(1)}, \nu_0^{(1)}, \alpha_0^{(1)}, \beta_0^{(1)}, \bar{\omega}_0^{(1)}, \bar{\nu}_0^{(1)}, \bar{\alpha}_0^{(1)}, \bar{\beta}_0^{(1)}, \omega_0^{(2)}, \nu_0^{(2)}, \alpha_0^{(2)}, \beta_0^{(2)}, \bar{\omega}_0^{(2)}, \bar{\nu}_0^{(2)}, \bar{\alpha}_0^{(2)}, \bar{\beta}_0^{(2)}, g_0, h_0, \bar{g}_0, \bar{h}_0\}.$   (70)

In the proposed inference algorithm, we specify a fully factorized mean-field variational family $\mathcal{Q}$ over the latent variables:

$q(\theta_l) = q(\mathbf{z}|\boldsymbol{\lambda})\, q(\gamma|e, f)\, q(\mathbf{d}|\boldsymbol{\tau})\, q(\eta|a, b)\, q(\bar{\eta}|\bar{a}, \bar{b})\, q(\phi|g, h)\, q(\bar{\phi}|\bar{g}, \bar{h}) \cdot q(\boldsymbol{\mu}^{(1)}, \boldsymbol{\sigma}^{(1)}|\boldsymbol{\omega}^{(1)}, \boldsymbol{\nu}^{(1)}, \boldsymbol{\alpha}^{(1)}, \boldsymbol{\beta}^{(1)})\, q(\bar{\boldsymbol{\mu}}^{(1)}, \bar{\boldsymbol{\sigma}}^{(1)}|\bar{\boldsymbol{\omega}}^{(1)}, \bar{\boldsymbol{\nu}}^{(1)}, \bar{\boldsymbol{\alpha}}^{(1)}, \bar{\boldsymbol{\beta}}^{(1)}) \cdot q(\mu^{(2)}, \sigma^{(2)}|\omega^{(2)}, \nu^{(2)}, \alpha^{(2)}, \beta^{(2)})\, q(\bar{\mu}^{(2)}, \bar{\sigma}^{(2)}|\bar{\omega}^{(2)}, \bar{\nu}^{(2)}, \bar{\alpha}^{(2)}, \bar{\beta}^{(2)}),$   (71)

where each latent variable is governed by its own variational distribution: $\mathbf{z}$ and $\mathbf{d}$ follow Bernoulli distributions; $\gamma$, $\eta$, and $\bar{\eta}$ follow Beta distributions; $\phi$ and $\bar{\phi}$ follow Gamma distributions; and all pairs of $\mu$ and $\sigma$ follow NIG distributions. For convenience, we denote all variational parameters as $\theta_v$.

In the variational family $\mathcal{Q}$, each member serves as a candidate approximation to the exact posterior $p(\theta_l|\theta_o, \theta_h)$. Hence, our aim is to find the member that is closest in Kullback-Leibler (KL) divergence to the exact posterior. Formally, the inference problem is cast as the following optimization problem:

$q^*(\theta_l) = \underset{q(\theta_l) \in \mathcal{Q}}{\arg\min}\; \mathrm{KL}\big(q(\theta_l)\,\|\,p(\theta_l|\theta_o, \theta_h)\big).$   (72)

It is well known that minimizing the KL divergence in the variational approach amounts to maximizing the evidence lower bound (ELBO) $\mathcal{L}(\theta_v)$, which equals the negative KL divergence up to an added constant (Blei et al. 2017). The form of the ELBO is as follows:

$\mathcal{L}(\theta_v) = \mathbb{E}_q[\log p(\theta_o, \theta_l|\theta_h)] - \mathbb{E}_q[\log q(\theta_l)].$   (73)
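The constant in question is the log evidence: for any $q \in \mathcal{Q}$, the standard decomposition

$\log p(\theta_o \mid \theta_h) = \mathcal{L}(\theta_v) + \mathrm{KL}\big(q(\theta_l)\,\|\,p(\theta_l \mid \theta_o, \theta_h)\big)$

holds, and because the left-hand side does not depend on $q$, maximizing $\mathcal{L}(\theta_v)$ is equivalent to minimizing the KL divergence in Equation (72).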

To maximize the ELBO, we develop a coordinate ascent algorithm. Specifically, we iteratively optimize the variational parameters for each latent variable while holding the others fixed.

Because all the complete conditionals (i.e., distribution of each latent variable conditioned on other latent variables and observable variables) are in the exponential family and the priors are conjugate, each coordinate update in the proposed CLNAM model is available in closed form and the coordinate ascent algorithm is guaranteed to climb the ELBO to a local optimum (Blei et al. 2017).

For ease of notation, for all $Y$'s, let $X$'s denote their logits; for example, $X_{n,m}^{(1)} = \mathrm{logit}(Y_{n,m}^{(1)})$. The formulas for updating each coordinate are as follows:

⚫ Formulas for updating the posterior of $\mu_m^{(1)}$ and $\sigma_m^{(1)}$:

(74)

To update the posterior of $\bar{\mu}_m^{(1)}$ and $\bar{\sigma}_m^{(1)}$, we only need to replace $\lambda_n$ with $1 - \lambda_n$ in the above formulas.

⚫ Formulas for updating the posterior of $\mu^{(2)}$ and $\sigma^{(2)}$:

(75)

To update the posterior of $\bar{\mu}^{(2)}$ and $\bar{\sigma}^{(2)}$, we only need to replace $\tau_{n,k}$ with $1 - \tau_{n,k}$ in the above formulas.

⚫ Formulas for updating the posteriors of $\phi$ and $\bar{\phi}$:

(76)

⚫ Formulas for updating the posterior of $d_{n,k}$:

(77)

⚫ Formulas for updating the posterior of $z_n$:

(78)

⚫ Formulas for updating the posterior of $\gamma$:

(79)

⚫ Formulas for updating the posteriors of $\eta$ and $\bar{\eta}$:

(80)

In practice, we iteratively update the variational parameters of each latent variable until the change in the ELBO falls below a small threshold (e.g., 0.0001) or the algorithm reaches a maximum number of iterations (e.g., 200). Once the algorithm converges, the final prediction is obtained by examining $\lambda_n$, the posterior probability of news $n$ being false.
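The outer loop of this procedure is generic and can be sketched as follows; the coordinate updates themselves (Equations (74)-(80)) and the ELBO computation are supplied as callables and are not reproduced here:

    def run_cavi(updates, compute_elbo, params, tol=1e-4, max_iter=200):
        """Generic coordinate-ascent loop: `updates` is a list of functions that
        modify `params` in place using the closed-form formulas, and
        `compute_elbo` returns the current ELBO value."""
        elbo_old = float('-inf')
        for _ in range(max_iter):
            for update in updates:        # one pass = one coordinate-ascent sweep
                update(params)
            elbo = compute_elbo(params)
            if abs(elbo - elbo_old) < tol:
                break
            elbo_old = elbo
        return params                     # e.g., a (hypothetical) params['lambda'] entry
                                          # would hold the posterior probabilities of z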

4.3.2.5 Learning Prior Beliefs from Data

In practice, the performance of Bayesian models is sensitive to prior assumptions; consequently, it is necessary to choose proper prior distributions based on prior knowledge (Liu and Aitkin 2008). We observe that we can take advantage of the two-stage design and learn prior beliefs from the information extraction stage. Specifically, in task T1, the rate of false news in the validation dataset should be similar to that in the test set whose veracity we want to assess. Likewise, the distribution of the number of received reports in the validation dataset of task T1 should be similar to the one in the final test set. In tasks T1 and T2, when we train a classifier in the information extraction stage, we can foresee how the classifier will perform on an unknown test dataset by examining its behavior on the validation dataset that is used to estimate the prediction error for model selection. To leverage such information, we propose another graphical model to depict the underlying generative process of the validation datasets in the information extraction stage.

Figure 25. Graphical Representation for Learning Prior Beliefs from Data

Figure 25 presents the graphical representation. Let $I$ denote the number of news items in the validation dataset of task T1. The ground truth of news $i$ is $\tilde{z}_i$ and follows $\mathrm{Bernoulli}(\gamma)$. Following the same generative assumption as for $Y_{n,m}^{(1)}$, we draw the prediction of news $i$ by classifier $m$ in T1 (denoted as $\tilde{Y}_{i,m}^{(1)} \in (0,1)$) depending on $\tilde{z}_i$. When news $i$ is false (i.e., $\tilde{z}_i$ is 1), the prediction follows a logistic-normal distribution with parameters $\mu_m^{(1)}$ and $\sigma_m^{(1)}$; when news $i$ is true, the corresponding logistic-normal distribution is parameterized by $\bar{\mu}_m^{(1)}$ and $\bar{\sigma}_m^{(1)}$. For news $i$ in the same validation dataset, the number of received reports (denoted as $\tilde{Y}_i^{(3)} \in \{0, 1, \ldots\}$) is generated based on the ground truth $\tilde{z}_i$: $\tilde{Y}_i^{(3)} \sim \mathrm{Poisson}(\tilde{z}_i \phi + (1 - \tilde{z}_i)\bar{\phi})$. In task T2, when training a machine learning classifier for debunking response detection, we also have a validation dataset. Let $L$ denote its size and $\tilde{d}_l$ be whether response $l$ is a debunking response. The prediction generated by the classifier (denoted as $\tilde{Y}_l^{(2)} \in (0,1)$) is drawn according to the ground truth $\tilde{d}_l$. When response $l$ in the validation dataset is a debunking response, the predicted probability follows a logistic-normal distribution with parameters $\mu^{(2)}$ and $\sigma^{(2)}$; the parameters are $\bar{\mu}^{(2)}$ and $\bar{\sigma}^{(2)}$ when it is not.

We observe that the latent variables in Figure 25 can be divided into independent groups: $\{\gamma\}$, $\{\mu_m^{(1)}, \sigma_m^{(1)}\}$, $\{\bar{\mu}_m^{(1)}, \bar{\sigma}_m^{(1)}\}$, $\{\mu^{(2)}, \sigma^{(2)}\}$, $\{\bar{\mu}^{(2)}, \bar{\sigma}^{(2)}\}$, $\{\phi\}$, and $\{\bar{\phi}\}$. Within each group, the posterior can easily be derived in closed form (see the sketch after the formulas below for two representative groups). We treat the learned posteriors as priors of the CLNAM model. In detail, the priors are as follows:

⚫ Priors of $\mu_m^{(1)}$ and $\sigma_m^{(1)}$:

(81)

To get the priors of $\bar{\mu}_m^{(1)}$ and $\bar{\sigma}_m^{(1)}$, we only need to replace $\tilde{z}_i$ with $1 - \tilde{z}_i$ in the above formulas.

⚫ Priors of $\mu^{(2)}$ and $\sigma^{(2)}$:

(82)

To get the priors of $\bar{\mu}^{(2)}$ and $\bar{\sigma}^{(2)}$, we only need to replace $\tilde{d}_l$ with $1 - \tilde{d}_l$ in the above formulas.

⚫ Priors of $\phi$ and $\bar{\phi}$:

(83)

⚫ Prior of $\gamma$:

(84)
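As an illustration of how these closed-form posteriors arise, the updates for $\gamma$ and for $(\phi, \bar{\phi})$ reduce to simple counting over the validation set. The sketch below assumes the shape-rate parameterization of the Gamma distribution and uses hypothetical default hyperparameter values; it covers only these two groups:

    import numpy as np

    def learn_gamma_prior(z_tilde, e0=1.0, f0=1.0):
        """Beta posterior for the false-news rate gamma given validation labels."""
        z = np.asarray(z_tilde)
        return e0 + z.sum(), f0 + (1 - z).sum()              # parameters of Beta(e, f)

    def learn_phi_priors(z_tilde, reports, g0=1.0, h0=1.0):
        """Gamma posteriors (shape, rate) for the report rates of false/true news."""
        z = np.asarray(z_tilde)
        r = np.asarray(reports)
        phi_post = (g0 + r[z == 1].sum(), h0 + (z == 1).sum())      # false news
        phi_bar_post = (g0 + r[z == 0].sum(), h0 + (z == 0).sum())  # true news
        return phi_post, phi_bar_post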

4.4 Empirical Evaluations

In this section, we comprehensively evaluate the proposed CAND framework using two real-world datasets from the social media platforms Weibo and Twitter. First, we describe the experimental design including datasets, benchmark methods, evaluation metrics, and experimental procedure. Then, we report and discuss the experimental results.

4.4.1 Experimental Design

4.4.1.1 Datasets and Preprocessing

The first dataset was collected from Sina Weibo. After scrutinizing the reported posts, Sina Weibo announces all officially fact-checked false news in the Sina community management center under the false news category. We retrieved all the false posts between February 2012 and August 2018. When collecting data, we ignored the deleted Weibo posts—in fact, for more than 25% of the reported records, the original Weibo posts had already been deleted by the authors when we collected the data. To obtain a set of true news, we retrieved posts that were not reported from general threads, following the method in the previous literature (Ma et al. 2016). We then collected relevant information for each post, including post content, comments, number of received reports, user profiles, and context information (e.g., timestamp). Unlike English text, in which a sentence is a sequence of words delimited by spaces, a Chinese sentence is a string of Chinese characters and punctuation without a natural delimiter. Hence, we segmented the sentences into meaningful words using jieba, a popular Chinese word segmentation tool (Peng et al. 2017). In this dataset, the false news and true news were not collected from the same time period. To prevent the model from learning event-specific features instead of the intrinsic characteristics of false news, we use an effective event detection algorithm, i.e., the single-pass clustering method, to discover unique events (Wang et al. 2018). Specifically, we first represent each Weibo post using the tf-idf weighting scheme (Ramos 2003). The algorithm sequentially processes the posts, one at a time, and grows clusters incrementally. A new post is absorbed by the most similar existing cluster if the similarity score (i.e., cosine similarity) exceeds a threshold of 0.3; otherwise, the post becomes the seed of a new cluster. Note that the threshold value does not influence the evaluation procedure much, because both the proposed algorithm and the benchmark methods perform better when a higher threshold is adopted. After clustering, we keep only one post for each event cluster. This is because, when applying our approach on social media platforms, the input is usually a single post rather than several posts that are already clustered into an event. By representing an event with one post rather than all related posts, our approach is directly applicable in real-world scenarios. In total, the dataset consists of 2,186 pieces of false news and 9,455 pieces of true news. Table 9 shows some examples of false and true Weibo posts. In our collected dataset, we record the number of received reports for false news. Unfortunately, the following information is not available: true news that is mistakenly reported and false news that is not reported. To remedy this data unavailability issue, we assume the misreport rate of true news is 0.01, namely, the number of received reports for true news follows a Poisson distribution with a parameter of 0.01. In addition, we assume the report rate of false news is 50%. To achieve this, we randomly choose 50% of the false news and set the number of received reports to 0. Although this is one limitation of our dataset, we conduct stress testing for these two rates. In addition, we verify the performance of our proposed framework with and without the partially simulated data.
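The single-pass clustering step described above can be sketched as follows. For brevity, the sketch compares each incoming post with the seed posts of existing clusters rather than with richer cluster representations; the tf-idf representation and the 0.3 cosine-similarity threshold follow the text, while the function and variable names are our own:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def single_pass_cluster(posts, threshold=0.3):
        """Assign each post to the most similar cluster seed, or start a new cluster."""
        X = TfidfVectorizer().fit_transform(posts)       # tf-idf vectors (sparse)
        seed_rows, labels = [], []
        for i in range(X.shape[0]):
            if seed_rows:
                sims = cosine_similarity(X[i], X[seed_rows]).ravel()
                best = int(sims.argmax())
                if sims[best] >= threshold:
                    labels.append(best)                  # absorbed by an existing cluster
                    continue
            seed_rows.append(i)                          # this post seeds a new cluster
            labels.append(len(seed_rows) - 1)
        return labels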

Table 9. Examples of Posts and Responses with Different Labels

Task | Dataset | Label | Example
T1: feature-based false news detection | Weibo | False news | #Typhoon Mangkhut# On the highway of Shenzhen along the river, a minibus was blown over. Hope everyone in the car is safe [bless] [loc: Shenzhen]
T1: feature-based false news detection | Weibo | True news | My favorite movie is coming. Looking forward...
T1: feature-based false news detection | Twitter | False news | 42 Million Dead In Bloodiest Black Friday Weekend On Record [link]
T1: feature-based false news detection | Twitter | True news | Senate approves $600 million in border security funds [link]
T2: debunking response detection | Weibo | Debunking | Reported, it's false!
T2: debunking response detection | Weibo | Not debunking | It's so scaring.
T2: debunking response detection | Twitter | Debunking | @[user] ig but many people saying its not true
T2: debunking response detection | Twitter | Not debunking | Thanks u i wish your services weren't needed but i am thankful you are here to balance the scales.
Note: All examples are from our datasets. In the content, "[bless]" denotes an emoji, "[loc: Shenzhen]" refers to a hyperlink about the location, [link] is a link to another website, and [user] is the user name of a Twitter user.

We collected another dataset from Twitter based on three reference datasets: Twitter (Ma et al. 2016), Twitter15, and Twitter16 (Ma et al. 2017), where the ground truth of the news is confirmed through an expert-oriented fact-checking website, i.e., Snopes. When combining these tweets, we only kept those with labels "true" and "false" and discarded those with labels "non-rumor" and "unverified." For these tweets, we then retrieved relevant information, including all responses to the tweets, user profiles, and context information. Note that reporting/flagging is not available on the Twitter platform. Hence, we do not take into account human intelligence in reports when evaluating the proposed approach on the Twitter dataset. Unlike the Weibo dataset, no event detection is needed, as the original datasets from the literature are already event-level. In total, the dataset contains 943 pieces of false news and 1,007 pieces of true news. See Table 9 for some examples of false and true tweets. Note that, in this work, we focus on identifying false news (regardless of the intention and the means of falsifying information) rather than other specific types of false news such as fake news or partisan fake news (Pennycook and Rand 2019a). When using these two datasets, we make an implicit assumption that each dataset we collected is a set of false and true news.

In the previous literature, the dataset was frequently constructed as balanced (Ma et al. 2015, 2016; Zhang et al. 2015). However, there are many more true posts than false ones in real-world scenarios. To rigorously evaluate our proposed approach, we focused on unbalanced settings and constructed the testbed with different imbalance ratios (i.e., the ratio between false and true news, denoted as IR). Specifically, for the Weibo dataset, we varied the IR from 1:5 to 1:100; for the Twitter dataset, we set the IR from 1:1 to 1:2.5. Note that, limited by its size, we are unable to use a drastic IR in the Twitter dataset because a drastic IR would make the constructed dataset too small to train the deep learning-based methods. In the following, we use Weibo as an example to illustrate how the data is prepared for tasks T1 and T2. For each IR, we randomly sampled from the dataset and constructed the largest possible set that met the IR. Given that the IR can be as drastic as 1:100, we used stratified sampling in our experiment.

Figure 26 shows the case when the IR is 1:7, where we randomly sampled 1,350 false Weibo posts and 9,455 true Weibo posts. Specifically, we randomly sampled 6,483 posts (60%) as training posts, 2,161 (20%) as validation posts, and the remaining 2,161 (20%) as test posts; the non-test portion is used for training the classifiers in task T1. The test posts are intended to be the set of Weibo postings whose veracity is to be assessed (i.e., the test set) when we use our framework on real-world platforms. To get the dataset for task T2, we randomly sampled 12,000 responses from the responses in the non-test dataset. For training the model in T2, two authors independently labeled each response based on whether it is a debunking response or not. The Cohen's kappa score is 0.827, indicating very good inter-annotator consistency (McHugh 2012). For inconsistent annotations, a consensus was reached by the annotators after discussion. In the annotated dataset, 17.29% of the responses are debunking responses. See Table 9 for positive and negative examples of responses. Because the aim of task T2 is to extract information from responses to the test posts, we adopted a different data split scheme. We randomly sampled 9,000 responses (75%) as the training set and the remaining 3,000 responses (25%) as the validation set, while taking all the responses to the test posts as test responses. Preprocessing of the Twitter dataset is similar and presented in Figure 27.

[Figure 26 flowchart: raw false Weibo posts (35,209) and raw true Weibo posts (28,731) are filtered for deleted posts, segmented, passed through event detection, and partially simulated for reports, yielding 2,186 false and 9,455 true posts; these are sampled at the target IR (e.g., 1,350 false and 9,455 true at 1:7) and split into training (6,483), validation (2,161), and test (2,161) posts, with 12,000 sampled responses (17.29% positive) split into 9,000 training and 3,000 validation responses for task T2 and all 22,646 responses to the test posts used as test responses.]

Figure 26. Data Preprocessing for Weibo Dataset

4.4.1.2 Baseline Methods and Evaluation Metrics

Recall that the major innovation of the proposed CAND framework lies in incorporating crowd judgments and developing the CLNAM result aggregation model. Hence, we evaluate the framework from the following two aspects: (1) whether combining crowd judgments in responses and reports contributes to false news detection; and (2) whether the proposed CLNAM model is effective compared with other end-to-end or aggregation benchmark methods.

To demonstrate the contribution of crowd judgments, we incorporated the data sources into the CAND framework in an incremental manner and named the resulting variants based on the indices of the data sources. For example, CAND-12 integrates machine judgments from news features and hybrid judgments in responses. For methods using content and context features, we designed five benchmark methods based on widely used methods from the false news detection literature: Support Vector Machine (SVM), Convolutional Neural Network (CNN), Kim-CNN, Long Short-Term Memory (LSTM), and bidirectional LSTM (Bi-LSTM) (Kim 2014; Ma et al. 2016; Nguyen et al. 2017; Wang 2017; Wang et al. 2018). These benchmark methods also serve as the base classifiers in task T1. The reason we designed many simple benchmark methods for task T1 is to show the value of these base classifiers in our proposed CAND framework. See Section 4.4.2 for the experimental results and Section 4.4.3.3 for the value of these classifiers. When crowd intelligence is taken into account, unfortunately, no benchmark methods use exactly the same set of inputs (i.e., content and context features, user responses, and user reports). We designed two types of benchmark methods, i.e., end-to-end and aggregation, based on the existing literature. For end-to-end models, we mainly consider models that can handle news text and user responses; see a recent survey for all candidates (Sharma et al. 2019).

The first end-to-end benchmark is a concatenation-based deep learning approach (denoted as Concat). Directly concatenating various types of inputs after extracting hidden representations is widely used in the false news detection literature (Qian et al. 2018; Wang 2017). To learn hidden representations from text, we build two CNN modules for the news text and the responses, respectively—one is used for the news text and the other is shared by all responses. These representations are later concatenated with the other news features and the number of reports. Finally, all these features are fed into a fully connected layer and a softmax layer in sequence to predict the veracity of the news.

The second end-to-end benchmark, HSA, uses a more advanced deep learning structure, i.e., a hierarchical attention network (Guo et al. 2018; Yang et al. 2016). A hierarchical Bi-LSTM model is built for representation learning, and the news features are incorporated into the network via an attention mechanism. As the original structure does not consider the number of reports, we adapt it by concatenating the hidden representation with the number of reports before feeding the hidden representation to the fully connected layer.

[Figure 27 flowchart: false tweets (943) and true tweets (1,007) are sampled at the target IR (e.g., 671 false and 1,007 true at 1:1.5) and split into training (1,007), validation (336), and test (335) posts; 8,103 annotated responses (15.53% positive) are split into 6,077 training and 2,026 validation responses for task T2, with all 5,937 responses to the test posts used as test responses.]

Figure 27. Data Preprocessing for Twitter Dataset

For aggregation models, none is readily available in the literature for our setting because we have mixed data to aggregate, while there is only one type of data in the relevant literature. To rigorously evaluate the CLNAM aggregation model, we implemented two benchmark methods: a trivial Majority Voting (MV) aggregation model and a non-trivial Binary Aggregation Model (BAM). For MV, the vote from each classifier in task T1 is positive if the predicted probability is greater than 0.5; the vote from responses is positive when the debunking response rate is higher than the average, and invalid if no response exists; the vote from the last source is positive when the Weibo post receives at least one report. Given the votes, the Majority Voting aggregation model will predict a piece of news to be false if more than half of the received votes are positive.
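For a single piece of news, the MV voting rules above translate into a few lines of code. The sketch assumes that the debunking-response rate is computed from thresholded task-T2 predictions and that avg_debunk_rate is the dataset-level average; both are our reading of the rules rather than a verbatim reproduction of the implementation:

    import numpy as np

    def majority_vote(clf_probs, debunk_probs, num_reports, avg_debunk_rate):
        """MV-123-style vote for one piece of news; returns True if predicted false."""
        votes = [p > 0.5 for p in clf_probs]             # one vote per T1 classifier
        if len(debunk_probs) > 0:                        # response source (skipped if empty)
            debunk_rate = np.mean(np.asarray(debunk_probs) > 0.5)
            votes.append(debunk_rate > avg_debunk_rate)
        votes.append(num_reports >= 1)                   # report source
        return sum(votes) > len(votes) / 2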

The BAM model is similar to the CLNAM model except that it depicts the credibility of classifiers in tasks T1 and T2 based on the widely used two-coin assumption, where the predictions conditioned on the ground truth are binary and follow Bernoulli distributions (Bragg and Weld 2013; Raykar et al. 2010; Wei et al. 2017). Specifically, each classifier (or worker in the crowdsourced answer aggregation literature) predicts the label correctly with a certain probability when the ground truth is 1 and with another probability when the ground truth is 0.

As the BAM model is obtained by replacing the logistic-normal assumption in the CLNAM model with the two-coin assumption, we only introduce the parts that differ. The first two types of information (i.e., $\mathbf{Y}^{(1)}$ and $\mathbf{Y}^{(2)}$) are transformed into their binary prediction results (i.e., 0 or 1). With binary values, the generative processes change accordingly:

(85)

where arg max(x) equals 1 when x > 0.5 and 0 otherwise. Then, we impose conjugate priors over the parameters that depict the classifiers' credibility:

(86)

Last, the coordinate ascent mean-field variational inference algorithm and prior learning from data can be developed in a similar manner as for the CLNAM model, so we omit the details here. Finally, for both end-to-end and aggregation models, we also incorporated data sources in an incremental manner. Table 10 summarizes the full list of methods for benchmarking.

To evaluate the performance of these methods, we adopt widely accepted evaluation metrics, including PR AUC (area under the precision-recall curve), F1 score, recall, and precision. However, we use PR AUC, F1 score, and recall as the major metrics. We pay the most attention to PR AUC because, when the dataset is highly unbalanced, the PR curve gives an accurate picture of an algorithm's performance (Davis and Goadrich 2006), while a precision-recall pair or an F1 score is only one point on the PR curve. After PR AUC, we focus more on F1 score and recall than on precision and accuracy because (1) in real-world false news detection scenarios, we prefer to recall a larger percentage of false news with a reasonable sacrifice in precision; and (2) accuracy does not provide adequate information on a classifier's functionality in unbalanced datasets (He and Garcia 2008).
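These metrics can be computed directly from the aggregated posteriors $\lambda_n$ with scikit-learn. In the sketch below, average precision is used as the PR AUC summary, and the 0.5 threshold for the label-based metrics is our assumption:

    from sklearn.metrics import (average_precision_score, f1_score,
                                 precision_score, recall_score)

    def evaluate(y_true, lambda_scores, threshold=0.5):
        """PR AUC from continuous scores; F1/recall/precision from thresholded labels."""
        y_pred = [int(s >= threshold) for s in lambda_scores]
        return {
            'pr_auc': average_precision_score(y_true, lambda_scores),
            'f1': f1_score(y_true, y_pred),
            'recall': recall_score(y_true, y_pred),
            'precision': precision_score(y_true, y_pred),
        }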

Table 10. Summary of Methods

Data Sources | Method Type | Benchmark Methods | Proposed Methods
Source 1 | End-to-end | SVM, CNN, Kim-CNN, LSTM, Bi-LSTM | -
Source 1 | Aggregation | MV-1, BAM-1 | CAND-1
Sources 1 & 2 | End-to-end | Concat-12, HSA-12 | -
Sources 1 & 2 | Aggregation | MV-12, BAM-12 | CAND-12
Sources 1, 2, & 3 | End-to-end | Concat-123, HSA-123 | -
Sources 1, 2, & 3 | Aggregation | MV-123, BAM-123 | CAND-123

4.4.1.3 Experimental Procedure

In the information extraction stage, five feature-based benchmark methods (i.e., SVM, CNN, Kim-CNN, LSTM, and Bi-LSTM) were trained as the base classifiers for task T1. In task T2, given a post and one of its responses, we used two Kim-CNN modules to learn their hidden representations respectively. The learned representations were then concatenated and fed into a fully connected layer, followed by a softmax layer. All methods were implemented in Python. The SVM was implemented using Python's scikit-learn library (Pedregosa et al. 2012). HSA was coded using TensorFlow. Other deep learning methods were implemented using Python's Keras library with TensorFlow as the backend. For all deep learning methods, sentences were represented as sequences of vectors using word embeddings, a commonly used language modeling and feature learning technique in text classification (Mikolov et al. 2013; M. Zhang et al. 2019). The embedding vectors were initialized with open-source embeddings trained on Weibo data (Li et al. 2018) or Twitter data (Pennington et al. 2014). For a fair comparison, we conducted a random grid search of 40 trials to tune the hyperparameters for all methods, including the benchmark methods.
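For reference, the task-T2 architecture described above can be sketched in Keras as follows. This is a simplified illustration: a single convolutional kernel size stands in for the multiple kernel sizes of Kim-CNN, and the vocabulary size, sequence length, and layer widths are placeholder values rather than the tuned hyperparameters:

    from tensorflow.keras import layers, Model

    def build_t2_classifier(vocab_size=20000, max_len=100, embed_dim=300):
        """Two convolutional text encoders (post and response), concatenated,
        followed by a fully connected layer and a softmax output."""
        def text_encoder(name):
            inp = layers.Input(shape=(max_len,), name=name)
            x = layers.Embedding(vocab_size, embed_dim)(inp)
            x = layers.Conv1D(100, 3, activation='relu')(x)   # single kernel size for brevity
            x = layers.GlobalMaxPooling1D()(x)
            return inp, x

        post_in, post_vec = text_encoder('post')
        resp_in, resp_vec = text_encoder('response')
        h = layers.Concatenate()([post_vec, resp_vec])
        h = layers.Dense(64, activation='relu')(h)
        out = layers.Dense(2, activation='softmax')(h)        # debunking vs. not debunking
        model = Model([post_in, resp_in], out)
        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
        return model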

The results of tasks T1 and T2, together with the partially simulated number of reports, were fed as the input to the aggregation models (i.e., MV, BAM, and CAND). Regarding CAND, we limited the time of collecting responses to one day and truncated the number of reports to 20 for each piece of news. We defend this choice from the perspective of framework practicability in Section 4.4.3.2. Next, we learned prior beliefs from data and inferred the latent variables. All methods were evaluated for 10 runs with a different randomization seed in each run. Last, the whole procedure was repeated for all IRs in both datasets.

4.4.2 False News Detection Performance: CAND vs. the Benchmark Methods

In this section, we report the experimental results in the form of bar charts and line charts. For better visualization, the y-axis of the following graphs might not start at 0 or end at 1, although all metrics range from 0 to 1. Whenever an error bar is reported, the result is averaged over 10 runs and the standard error of the mean (SEM) is used.

4.4.2.1 Results in Dataset Weibo

Figure 28 compares the performance of CAND with the benchmark methods under certain IRs. First, given that human intelligence (e.g., responses and reports) is still largely under-explored in the current literature, we examine the performance of CAND compared with end-to-end benchmark methods without crowd intelligence (i.e., the feature-based classifiers in task T1). As shown by Figure 28a, our proposed CAND-123 approach performs significantly better than all benchmark methods across all metrics. For example, compared with the best benchmark, Kim-CNN, CAND-123 increases the AUC score from 91.47% (99.9% CI, 89.92%-93.02%) to 97.19% (99.9% CI, 96.37%-98.01%). Hence, CAND-123 is significantly better than Kim-CNN at the 99.9% confidence level in a paired t-test. Even if we evaluate our CAND framework without the partially simulated human judgments in reports, CAND-12 still outperforms all benchmark methods across all major metrics by significant margins. In addition, when no human intelligence is considered, CAND-1, as an ensemble of five feature-based classifiers, still performs better than any individual method in AUC score. Overall, the results suggest the superior performance of our framework in comparison with feature-based false news detection methods and the effectiveness of our framework as an aggregation method.

(a) CAND vs. End-to-end Benchmark Methods without Crowd Intelligence (IR=1:7, Ref. line: AUC of CAND-12)
(b) CAND vs. Aggregation Benchmark Methods (IR=1:7)
(c) CAND vs. End-to-end Benchmark Methods with Crowd Intelligence (IR=1:7, Ref. lines: AUC of CAND)
(d) CAND vs. End-to-end Benchmark Methods with Crowd Intelligence (IR=1:50, Ref. lines: AUC of CAND)

Figure 28. Performance of CAND and Benchmark Methods in Weibo Dataset

Comparison of the aggregation models in Figure 28b demonstrates the effectiveness of incorporating human intelligence and the CLNAM result aggregation model. For CAND, as more data sources are taken into account, the performance across all metrics increases significantly. For example, from CAND-1 to CAND-12 to CAND-123, the F1 score increases from 93.13% (95% CI, 92.17%-94.09%) to 94.93% (95% CI, 94.30%-95.56%) to 97.19% (95% CI, 96.70%-97.68%). However, as MV and BAM do not model (or use only simple assumptions to model) the credibility of humans and machines, they benefit less from human intelligence. The comparison suggests the usefulness of human intelligence and the necessity of modeling the credibility of humans and machines. To evaluate the effectiveness of the proposed CLNAM aggregation model, we compare all aggregation methods by fixing the combination of data sources. Figure 28b shows that across all choices of data sources, especially when human intelligence is considered, CAND has the best performance in AUC, F1 score, and recall. The comparison clearly demonstrates the superior performance of the proposed CLNAM result aggregation model. In addition, the gap between CAND and BAM is attributed to using the logistic-normal assumption in place of the Bernoulli assumption when modeling classifiers' credibility.

Last, we compare CAND with end-to-end benchmark methods that consider crowd intelligence (i.e., Concat and HSA) in Figure 28c (IR=1:7) and Figure 28d (IR=1:50). Figure 28c indicates that when the dataset is relatively balanced (e.g., IR=1:7), HSA performs comparably to CAND and they both perform slightly better than Concat. However, as IR becomes more drastic (e.g., IR=1:50), Concat and HSA suffer from the data imbalance problem and tend to predict with high precision but low recall. As a result, our CAND method significantly outperforms the benchmark methods across all major metrics.

To further explore the robustness of our method and the benchmark methods when the data is unbalanced, we compare their performance by varying the IR. The results using AUC as the metric are reported in Figure 29. As expected, when the dataset becomes more unbalanced, the performance of CAND and all benchmark methods decreases. Meanwhile, the performance gap between CAND and the benchmark methods increases, and our method is more stable than the benchmark methods (i.e., smaller SEM). The results suggest that our method, as an ensemble of machine intelligence and human intelligence, is less affected by the data imbalance issue. This makes our method more applicable and effective, as the actual news in real-world scenarios is often highly unbalanced.

(a) CAND-1 vs. End-to-end Benchmark Methods without Crowd Intelligence
(c) CAND-12 vs. End-to-End Benchmark Methods with Crowd Intelligence
(d) CAND-123 vs. End-to-End Benchmark Methods with Crowd Intelligence
(e) CAND-12 vs. Aggregation Benchmark Methods
(f) CAND-123 vs. Aggregation Benchmark Methods

Figure 29. CAND vs. Benchmark Methods under Different IRs in Weibo Dataset (AUC Score as Metric)

4.4.2.2 Results in Dataset Twitter

Similar to the analysis of the Weibo dataset, we conducted the same set of experiments on the Twitter dataset, including (1) comparing CAND with the benchmark methods across all metrics when IR=1:1.5; and (2) using AUC as the metric and comparing CAND with the benchmark methods while varying the IR from 1:1 to 1:2.5. The graphs are shown in Figure 30 and Figure 31, respectively. Note that CAND-123 is not included because human reports are unavailable in the Twitter dataset. By examining the figures, we draw the same conclusions as in the Weibo dataset.

(a) CAND vs. End-to-end Benchmark Methods without Crowd Intelligence (IR=1:1.5)
(b) CAND vs. End-to-end Benchmark Methods with Crowd Intelligence (IR=1:1.5)
(c) CAND vs. Aggregation Benchmark Methods (IR=1:1.5)

Figure 30. Performance of CAND and the Benchmark Methods in Twitter Dataset (Ref. lines: AUC of CAND-12)

(a) CAND-1 vs. End-to-end Benchmark Methods without Crowd Intelligence
(b) CAND-12 vs. End-to-end Benchmark Methods with Crowd Intelligence
(c) CAND-12 vs. Aggregation Benchmark Methods

Figure 31. CAND vs. Benchmark Methods under Different IRs in Twitter Dataset (AUC Score as Metric)

4.4.2.3 Robustness Check

In the above experiments, we split the dataset into training/validation/test sets to avoid overfitting and fixed the training set percentage as 60%. To show that such specifications will not influence our experimental results, we conducted two robustness checks.

In the first robustness check, we varied the training set percentage from 40% to 90% and evaluated the performance of CAND-1 and CAND-12 on both datasets. The results are visualized in Figure 32. In general, the maximum performance is achieved at about 80%. This is because a small training set impedes effective learning for the feature-based classifiers, especially the data-hungry deep learning approaches, while a large training set leaves only a small test set on which to run our unsupervised Bayesian result aggregation model. Overall, however, the performance is relatively stable with respect to the training set percentage. We also observe that the graphs for the Weibo dataset show smaller performance variance than those for the Twitter dataset. This suggests that the impact of the training set percentage can be further mitigated by using larger datasets.

(a) CAND-1 in Weibo Dataset (IR=1:7)
(b) CAND-12 in Weibo Dataset (IR=1:7)
(c) CAND-1 in Twitter Dataset (IR=1:1.5)
(d) CAND-12 in Twitter Dataset (IR=1:1.5)

Figure 32. Performance of CAND-1 and CAND-12 with Different Training Set Percentages

In the second robustness check, we used k-fold cross validation instead of the training/validation/test split to evaluate the performance. We varied the value of 푘 from 3 to 10 and evaluated the performance of CAND-1 and CAND-12, as shown in Figure 33. The results similarly demonstrate that our evaluation is not sensitive to the selection of 푘. Again, the performance of CAND in the Weibo dataset is more stable than in the Twitter dataset, suggesting that the effect of 푘 becomes more marginal when a larger dataset is used.

(a) CAND-1 in Weibo Dataset (IR=1:7)
(b) CAND-12 in Weibo Dataset (IR=1:7)
(c) CAND-1 in Twitter Dataset (IR=1:1.5)
(d) CAND-12 in Twitter Dataset (IR=1:1.5)

Figure 33. Performance of CAND-1 and CAND-12 with Different k in k-Fold Cross Validation

The results demonstrate that the effects of the training set percentage and of the value of k in k-fold cross-validation are quite marginal; any impact that exists can be further mitigated by using larger datasets.

4.4.3 Analysis and Discussion

4.4.3.1 Complementary Strengths of Humans and Machines

Figure 34 visualizes an example of the relative contributions of the three types of judgments (i.e., machine judgments from features, hybrid judgments in responses, and human judgments in reports) and prior beliefs when calculating the posterior of the ground truth z. The first type is the result of machine intelligence based on news features; the second is a hybrid of human and machine intelligence; and the last comes from human intelligence. This example is from CAND-123 with an imbalance ratio of 1:7. For a given post (fixed x), we obtain a thin bar in which the share of each color represents the relative contributions of the three sources and the prior beliefs. Overall, machine judgments and hybrid judgments contribute the most. However, the graph shows that no single type of judgment always dominates the others; on the contrary, they complement each other in classifying more than half of the posts. This demonstrates the complementary strengths of human and machine intelligence. In addition, the prior beliefs also play an important role in classifying some posts. In Figure 35, we present one example from the Twitter dataset. Although less pronounced, it still shows the complementary value of human and machine intelligence. This graph also suggests that the proposed Bayesian model is highly transparent and interpretable—we know exactly how we obtain the prediction for each post and how much each data source contributes.

Figure 34. A Stacked Bar Graph Showing the Relative Contributions of the Three Types of

Judgments and Prior Beliefs in Weibo Dataset (IR=1:7)

Note: When inferring the ground truth of each piece of news, the odds of correctly predicting the ground truth are updated as exp{푠1 + 푠2 + 푠3 + 푠푝}, where 푠∗ refers to the contributions from the three sources and the prior beliefs, respectively (see the update formula of z in Section 4.3.2.4). The relative contributions are computed based on the values exp{푠∗}. Posts are divided into two groups based on the existence of responses. Within each group, the posts are reordered for better visualization. Note that hybrid judgments in responses still contribute even when no response exists because the contribution is computed based on the exponential and exp{0} = 1.
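To make the computation of the bar proportions in Figures 34 and 35 concrete, the following minimal sketch normalizes the values exp{푠∗} for one post; the numeric contribution values used here are made up for illustration only and are not taken from our data:

import numpy as np

# Hypothetical log-contributions of machine judgments (s1), hybrid judgments (s2),
# human reports (s3), and prior beliefs (sp) for one post; the overall odds update
# is exp{s1 + s2 + s3 + sp}.
s = {"machine": 1.2, "hybrid": 0.8, "reports": 0.0, "prior": 0.3}

weights = {k: np.exp(v) for k, v in s.items()}        # exp{s*}; exp{0} = 1 when a source is absent
total = sum(weights.values())
shares = {k: w / total for k, w in weights.items()}   # relative contributions shown as bar segments

print(shares)  # machine ~0.42, hybrid ~0.28, reports ~0.13, prior ~0.17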

Figure 35. A Stacked Bar Graph Showing the Relative Contributions of the Two Types of

Judgments and Prior Beliefs in Twitter Dataset (IR=1:1.5)

Note: The notations are similar to the ones in Figure 34.

4.4.3.2 Early Detection

In real-world applications, the number of responses and reports usually increases daily.

Also, in the development of false news events, debunking messages often appear at a later stage.

However, in real-world scenarios it is rarely feasible to wait until all responses and reports have been collected.

When fighting false news, we would like to identify the false news as soon as possible so as to stop the propagation process early (Tschiatschek et al. 2018). Hence, a practical question is whether our approach enables early detection of false news. To this end, we set different cutoff times for the responses in CAND-12 and different cutoff values for the number of reports18 in

CAND-123. Figure 36a plots the performance of CAND-12 in the Weibo dataset (IR=1:7) with

18 For reporting, the timestamp is unavailable.

the maximum time ranging from one minute to four weeks. As shown by the figure, responses, as one type of human intelligence, start taking effect even in the first few minutes. The performance continues increasing significantly over the first day and stabilizes thereafter. This is because what matters to the CAND framework is the rate of the debunking responses rather than the number of responses. A moderately long time (e.g., 12 hours to one day) is enough as long as we have collected a representative sample of the responses. Although crowd opinions at a later stage might contain more debunking messages, our experiment suggests that aggregating early human intelligence is enough to enable false news detection. Similarly, we varied the maximum number of reports from 0 to 25 in the Weibo dataset (IR=1:7). Results in Figure 36b show that human reports start playing a role even when each false news post receives at most one report. Overall, our experiments suggest that our proposed framework is capable of detecting false news at an early stage. We conducted a similar analysis for the Twitter dataset and reached the same conclusion.
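The cutoff experiments can be reproduced with a filtering step such as the following sketch; the field names and the helper function are illustrative assumptions rather than the exact implementation:

from datetime import timedelta

def restrict_crowd_signals(news_item, responses, reports, cutoff, max_reports):
    """Keep only responses posted within `cutoff` of the news timestamp and
    at most `max_reports` reports (report timestamps are unavailable)."""
    early_responses = [r for r in responses
                       if r["time"] - news_item["time"] <= cutoff]
    capped_reports = reports[:max_reports]
    return early_responses, capped_reports

# Example usage: a 12-hour response window and at most one report per post.
# early_responses, capped_reports = restrict_crowd_signals(
#     post, post_responses, post_reports, cutoff=timedelta(hours=12), max_reports=1)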

Figure 36. Testing Early Detection in Weibo Dataset (IR=1:7): (a) Performance of CAND-12 with Varying Cutoff Times for Responses; (b) Performance of CAND-123 with Varying Cutoff Values for the Number of Reports

4.4.3.3 Value of Non-Text Features and Multiple Feature-Based Classifiers

Non-Text Content and Context Features: Recall that we represent each piece of news by its content and context features. To investigate whether the non-text content features (e.g., whether the post has external links) and context features (e.g., whether the account is personal or organizational) contribute to the performance of our proposed CAND framework, we examined the performance of CAND with and without these features. Figure 37 reports the comparison. The results show that including these non-text content features and context features significantly improves the performance of

CAND in both datasets.

Figure 37. Performance of CAND with and without Non-Text Content Features and Context Features in (a) the Weibo Dataset (IR=1:7) and (b) the Twitter Dataset (IR=1:1.5) (Ref. lines: AUC of CAND(Y))

Note: Methods marked (Y) or (N) include or exclude the non-text content and context features, respectively.

Multiple Feature-Based Classifiers: In our experiment, we use multiple classifiers for task T1 to extract machine judgments. To make our approach practically applicable, we explore the value of these classifiers and answer how many classifiers to use. To this end, we varied the combination of the five feature-based classifiers (i.e., SVM, CNN, Kim-CNN, LSTM, and Bi-LSTM) and examined the performance of CAND-12. The results are presented in Figure 38. Each group of bars with the same color represents the same number of classifiers. For example, the ten orange bars correspond to the results of CAND-12 with combinations of two out of five classifiers. Note that we do not label each bar; instead, we focus on how CAND-12 performs against the number of classifiers. In general, as more feature-based classifiers are included, CAND-12 delivers better and more stable performance. The marginal effect, however, is decreasing. The results suggest the significant value of including multiple classifiers, and in practice, a moderate number of classifiers (e.g., five) is sufficient. In addition, the performance of the red bars indicates that the inclusion of a weak classifier (i.e., SVM) will not harm the performance. This is because we model the credibility of each classifier in our Bayesian result aggregation model, and the algorithm assigns less weight to weak classifiers during parameter estimation.
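The sweep over classifier combinations amounts to enumerating the subsets of the five feature-based classifiers, as in the following sketch (evaluate_cand12 is an assumed placeholder that trains the given subset and runs the Bayesian aggregation):

from itertools import combinations

classifiers = ["SVM", "CNN", "Kim-CNN", "LSTM", "Bi-LSTM"]

def sweep_classifier_subsets(evaluate_cand12):
    """Evaluate CAND-12 for every non-empty subset of the feature-based
    classifiers; each subset corresponds to one bar in Figure 38."""
    scores = {}
    for size in range(1, len(classifiers) + 1):
        for subset in combinations(classifiers, size):
            scores[subset] = evaluate_cand12(list(subset))
    return scores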

Figure 38. AUC Score of CAND-12 under Different Combinations of Feature-Based Classifiers: (a) Weibo Dataset (IR=1:7); (b) Twitter Dataset (IR=1:1.5)

Note: Red bars are used to draw the reference lines.

4.4.3.4 Stress Testing the Simulation Parameters

In our experiment, we partially simulated the reporting data to remedy the data unavailability issue. We assume the misreport rate of true news is 0.01 and the report rate of

false news is 0.5. In order to verify whether human intelligence in reports can consistently contribute to false news detection, we conducted stress testing for the simulation parameters.

Specifically, we varied the misreport rate from 0 to 0.6 and examined the performance of

CAND-123. The results are reported in Figure 39a. As expected, when more true news posts are misreported, the performance of CAND-123 keeps decreasing. In particular, even when up to

60% of the true news posts are misreported, CAND-123 still performs better than CAND-12, in which no report information is considered. Similarly, we varied the report rate of false news from 0.05 to 0.9 and presented the results in Figure 39b. We observe that the performance of

CAND-123 increases as a function of the report rate. Even when only 5% of the false news posts are reported, CAND-123 still performs better than CAND-12. In summary, although we partially simulate the human reports due to data unavailability, our experiment demonstrates the unique value of crowd reports under various simulated settings, including settings more adversarial than we would expect in practice.
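The partially simulated reports can be generated with a simple Bernoulli scheme such as the sketch below; the function name and the sweep shown in the comments are illustrative, not the exact implementation:

import numpy as np

rng = np.random.default_rng(0)

def simulate_reports(labels, report_rate=0.5, misreport_rate=0.01):
    """Simulate one binary report indicator per post: false news (label 1) is
    reported with probability `report_rate`, true news with `misreport_rate`."""
    labels = np.asarray(labels)
    p = np.where(labels == 1, report_rate, misreport_rate)
    return rng.binomial(1, p)

# Stress test: sweep the misreport rate from 0 to 0.6 while keeping the report rate fixed.
# for mis in np.arange(0.0, 0.61, 0.1):
#     reports = simulate_reports(y, report_rate=0.5, misreport_rate=mis)
#     ... feed `reports` to CAND-123 and record the AUC ...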

Figure 39. Performance of CAND-123 on Different Misreport Rates of True News and Report Rates of False News in Weibo Dataset (IR=1:7): (a) Varying Misreport Rates of True News; (b) Varying Report Rates of False News

Note: CAND-12 (labeled "Ref." in the graph) is included as a reference.

4.4.3.5 Sensitivity to News Characteristics

News Length. In the task of false news detection, we would like our approach to be applicable to news of different lengths, especially short posts containing limited information. In the two datasets we used, the news length in Weibo is more varied (ranging from 1 to 2,484) than that in Twitter (ranging from 1 to 39). We therefore used the Weibo dataset as a testbed to test the sensitivity of our approach to news length. Specifically, we varied the maximum news length from 5 to 2,500 and examined the performance of CAND-1, CAND-12, and CAND-123 (IR=1:7). See the results in Figure 40. As expected, the performance increases as a function of news length because longer news articles usually contain richer information.

CAND-12 and CAND-123 are less affected (lower slope in the graphs) by news length compared with CAND-1. This is because CAND-1 only considers news content and its context features, while CAND-12 and CAND-123 also take into account crowd responses and reports, making them less dependent on news features. The result suggests another advantage of considering crowd intelligence—the false news detection algorithm will be less impacted by news features such as news length.
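Operationally, this sensitivity check only requires truncating each news post before feature extraction, as in the following minimal sketch (the variable names are illustrative):

def truncate_news(tokens, max_len):
    """Keep at most `max_len` tokens of a news post before feature extraction."""
    return tokens[:max_len]

# Sensitivity sweep over the maximum news length used in Figure 40.
# for max_len in [5, 10, 50, 100, 500, 1000, 2500]:
#     truncated = [truncate_news(t, max_len) for t in weibo_token_lists]
#     ... rebuild the features and re-run CAND-1 / CAND-12 / CAND-123 ...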

Figure 40. Sensitivity of CAND to News Length in Weibo Dataset (IR=1:7): (a) CAND-1; (b) CAND-12; (c) CAND-123

News Type. In this paper, we adopted a straightforward binary classification of posts into false and true news. However, there could be more granularity in news types, and examining the sensitivity of our proposed method and the benchmark methods to news type could potentially deliver valuable insights. To this end, we adopted a classification scheme wherein false information is divided into seven potentially overlapping types: satire or parody, fabricated content, misleading content, imposter content, manipulated content, false connection, and false context (Sharma et al.

2019). See Section 4.2.2 for details of the classification scheme. We selected 1,350 Weibo posts and annotated them with the above-mentioned seven types: satire (106), fabricated (586), misleading (282), imposter (145), manipulated (190), false connection (162), and false context

(44). It is worth noting that these types might overlap with each other, and hence it is possible to observe one post falling within multiple categories. Figure 41 reports the AUC scores of CAND and the selected benchmark methods for different types of Weibo posts with the Imbalance Ratio

(IR) being 1:7. When only news content and context features are available, the performance of these methods (i.e., SVM, Kim-CNN, Bi-LSTM, and CAND-1) exhibits similar distribution over the type of news. Overall, false context news is the easiest to identify because the conflict

between the genuine content and the contextual information may provide an evident signal for recognizing false news. Manipulated content is hard to identify because these posts are manipulated to deceive, which may hide obvious language clues and hence fool the classifier.

False connection is also hard to identify because the content of images and videos is not considered in our task. When human intelligence in responses is taken into account, the performance on all types of news increases for all methods (i.e., CAND-12, Concat-12, and HSA-

12) and the performance distribution remains similar. This is consistent with the fact that all types of posts have a high debunking rate. Specifically, the rates are 21.44%, 26.23%, 25.87%, 28.17%,

25.46%, 25.18%, and 27.74%, respectively.

Our work just scratches the surface of this interesting topic and shows that our method and the benchmark methods perform differently on different types of news. We think this direction deserves more exploration to further improve the performance or draw other insightful conclusions.

Figure 41. Sensitivity of CAND and the Selected Benchmark Methods to News Type in Weibo Dataset (IR=1:7)

Note: When evaluating the performance on a certain type of post, only positive instances are available. To make the evaluation feasible, we randomly sample negative instances based on the imbalance ratio.

4.4.3.6 Robustness to Intentional Manipulation of Responses and Reports

In our proposed approach, we utilize crowd intelligence to help detect false news. However, there exist many malicious user accounts (e.g., social bots, cyborg users, and trolls) in social media, which automatically post, share, and even comment (Shu et al. 2017). A critical question is how robust our approach is to such crowd manipulation. In Section 4.4.3.4, we have shown that under various misreport rates of true news and report rates of false news, our method with human intelligence in reports (i.e., CAND-123) consistently performs better than the one without human reports (CAND-12); and the performance gap is significant when these two rates are within a reasonable range. This experiment illustrates that, to some degree, our approach is robust to the intentional manipulation of reports. In the following, we test whether our approach is robust to maliciously manipulated crowd responses.

We assume there exist six types of malicious users (listed in Table 11), who blindly manipulate the responses in a certain way. Here “blindly” means such users will manipulate the responses to one type of news indiscriminately, without conducting news-level manipulation

(i.e., without manipulating news one by one). For example, true-only random users will randomly post responses on all true news with a certain probability. For each type of user listed in Table 11, we varied the percentage of randomly manipulated responses from 0% to 90% in

CAND-12 and compared its performance with that of CAND-1. The results are reported in

Figure 42. Based on the AUC score, we find that CAND-12 always performs better than, or at least comparably to, CAND-1. In Figure 42a, as more responses are manipulated by indifferent users, the performance of CAND-12 decreases and is finally comparable to CAND-1. In Figure

42c and Figure 42d, the existence of false-only malicious users actually improves the

performance of CAND-12; and the improvement increases as a function of the percentage of manipulated responses. In Figure 42b, Figure 42e, and Figure 42f, as more responses are manipulated, the performance of CAND-12 first decreases; then, after reaching the performance of CAND-1, it rebounds. Although these results seem counterintuitive, we next conduct a simple theoretical analysis to demonstrate that they are consistent with our probabilistic framework, which explains why our method is robust to maliciously manipulated crowd responses.

Table 11. Six Types of Simulated Malicious Users

News types to manipulate: both false and true news, false news only, or true news only.
Random user (posts random responses with a certain probability): indifferent random user, false-only random user, and true-only random user, respectively.
Reversal user (reverses the news debunking with a certain probability): indifferent reversal user, false-only reversal user, and true-only reversal user, respectively.
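A minimal sketch of how the six user types in Table 11 can be simulated is given below; the data structures and the assumption that a random response is debunking with probability 1/2 are illustrative choices consistent with the analysis in Table 12, not the exact implementation:

import numpy as np

rng = np.random.default_rng(0)

def manipulate_responses(debunk_flags, labels, user_type, r):
    """Blindly manipulate a fraction r of responses according to one of the six
    simulated malicious user types. `debunk_flags[n]` holds the binary debunking
    labels of the responses to post n, and `labels[n]` is its ground truth
    (1 = false news). `user_type` is a pair such as ("false_only", "reversal")."""
    target, action = user_type
    out = []
    for flags, label in zip(debunk_flags, labels):
        flags = np.array(flags, copy=True)
        affected = (target == "indifferent"
                    or (target == "false_only" and label == 1)
                    or (target == "true_only" and label == 0))
        if affected and len(flags) > 0:
            mask = rng.random(len(flags)) < r
            if action == "random":
                flags[mask] = rng.integers(0, 2, size=mask.sum())  # random response, debunking w.p. 1/2
            else:  # "reversal": flip the debunking label
                flags[mask] = 1 - flags[mask]
        out.append(flags)
    return out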

Figure 42. Performance of CAND-12 with Various Types of Malicious Users in Weibo Dataset (IR=1:7): (a) Indifferent Random User; (b) Indifferent Reversal User; (c) False-Only Random User; (d) False-Only Reversal User; (e) True-Only Random User; (f) True-Only Reversal User

Note: In the x axes, "Ref." refers to CAND-1. Its AUC score is used to draw the horizontal reference lines.

In our proposed Bayesian CAND framework, we consider different debunking rates for false and true news in the generative process. Then, the estimated difference will contribute to the estimation of the ground truth at the model inference stage. Intuitively, a larger difference will lead to better performance. For convenience, we assume the debunking rates of false and true news in the raw dataset are 푝푓 and 푝푡 respectively, and the percentage of the manipulated

responses is 푟. Under each type of simulated malicious user, we compare the theoretical debunking rates of false and true news and predict the theoretical performance trend of CAND-

12 based on the absolute difference between these two debunking rates. The results are presented in Table 12. Comparing them with the experimental results in Figure 42, we found that the actual trends (especially the AUC curves) are largely consistent with the theoretical ones. For example, in the Weibo dataset, 푝푓 = 26.50% and 푝푡 = 4.46%. When there exist true-only reversal users, CAND-12 will achieve its theoretical minimum of 0 when 푟 = 0.24, which is consistent with Figure 42f.

Table 12. Theoretical Debunking Rates and Performance Trends

Indifferent random user: debunking rate of false news 푟/2 + 푝푓 − 푟푝푓; of true news 푟/2 + 푝푡 − 푟푝푡; absolute difference |(푝푓 − 푝푡)(1 − 푟)|; performance decreases as 푟 increases, with minimum 0 achieved at 푟 = 1.
Indifferent reversal user: debunking rate of false news 푟 + 푝푓 − 2푟푝푓; of true news 푟 + 푝푡 − 2푟푝푡; absolute difference |(푝푓 − 푝푡)(1 − 2푟)|; minimum 0 achieved at 푟 = 0.5.
False-only random user: debunking rate of false news 푟/2 + 푝푓 − 푟푝푓; of true news 푝푡; absolute difference |푝푓 − 푝푡 + (1/2 − 푝푓)푟|; performance increases as 푟 increases when 푝푓 < 0.5.
False-only reversal user: debunking rate of false news 푟 + 푝푓 − 2푟푝푓; of true news 푝푡; absolute difference |푝푓 − 푝푡 + (1 − 2푝푓)푟|; performance increases as 푟 increases when 푝푓 < 0.5.
True-only random user: debunking rate of false news 푝푓; of true news 푟/2 + 푝푡 − 푟푝푡; absolute difference |푝푓 − 푝푡 − (1/2 − 푝푡)푟|; minimum 0 achieved at 푟 = (푝푓 − 푝푡)/(0.5 − 푝푡) when 푝푡 < 0.5.
True-only reversal user: debunking rate of false news 푝푓; of true news 푟 + 푝푡 − 2푟푝푡; absolute difference |푝푓 − 푝푡 − (1 − 2푝푡)푟|; minimum 0 achieved at 푟 = (푝푓 − 푝푡)/(1 − 2푝푡) when 푝푡 < 0.5.
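As a quick arithmetic check of the true-only reversal row against Figure 42f, using the Weibo debunking rates quoted above:

# Numeric check of the true-only reversal row in Table 12 (Weibo dataset).
p_f, p_t = 0.2650, 0.0446                      # debunking rates of false and true news
r_min = (p_f - p_t) / (1 - 2 * p_t)            # where |p_f - p_t - (1 - 2 p_t) r| reaches 0
print(round(r_min, 2))                          # 0.24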

In summary, our analysis shows that our probabilistic framework is robust to several types of adversarial attacks and that integrating crowd responses contributes to, or at least does not jeopardize, the false news detection task.

Finally, in the era of deep fakes and adversarial methods, we think it is important and interesting to further explore this issue in the future. In the current analysis, we simulated several types of malicious users, which might not cover all actual malicious behaviors in real-world social media platforms. It would deliver significant value if we could fight false news even when malicious users manipulate the responses or reports of news one by one.

4.4.3.7 Debunking Response Detection as a Multi-Class Classification Problem

In our proposed CAND framework, we treat the task of debunking response detection as a binary classification problem (i.e., debunking or not). However, the literature frequently categorizes the responses into multiple stances or opinions such as supporting, denying, commenting, and querying (Kochkina et al. 2018; Ma et al. 2018; Sharma et al. 2019). Compared with binary labels, multiple labels may offer more information for false news detection. In the following, we explore whether such a multi-class formulation will contribute to our proposed framework.

In this scenario, we are unable to directly extend the proposed CAND framework to the multi-class setting to utilize the multi-class labels because the logistic-normal distribution cannot be naturally generalized to its multivariate form. In fact, its natural generalization would be a softmax-normal distribution, which is not appropriate for our setting. Hence, we developed two alternative models that tap into the multi-class information. As these two models share many generative processes with CLNAM, we only introduce the parts that differ.

Using the same set of notations as CLNAM, in the multi-class scenario, 푑푛,푘, i.e., stance of response 푘 to news⁡푛, takes the value of 1 (supporting), 2 (denying), 3 (commenting), or 4

155 (2) 4 19 (querying); and 풀푛,푘 ∈ [0, ] is a 3-simplex , meaning the predicted probability of 푑푛,푘 equaling each possible value.

The first model is called the Categorical Aggregation Model (CAM). We transform 풀(2)푛,푘 into integers and assume the transformed values follow a categorical distribution. Specifically, given the ground truth of news 푛, the stance of response 푘 follows

푑푛,푘 ~ Categorical(휂) if news 푛 is false, and 푑푛,푘 ~ Categorical(휂̅) if news 푛 is true. (87)

Then, the transformed machine prediction is generated based on 푑푛,푘:

arg max(풀(2)푛,푘) ~ Categorical(흁(2)푑푛,푘), (88) where arg max(풙) equals the index of the largest value in 풙 wherein 풙 is a 3-simplex. Finally, we

impose conjugate priors, i.e., Dirichlet distributions, over 휂, 휂̅, and each 흁푖(2).
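To make the CAM generative story concrete, the following sketch forward-samples stances and transformed machine predictions for one news post; the function signature and parameter shapes are illustrative assumptions, not the exact CLNAM implementation:

import numpy as np

rng = np.random.default_rng(0)
K = 4  # stances: supporting, denying, commenting, querying (0-indexed below)

def sample_cam(z_n, eta, eta_bar, mu, n_responses):
    """Forward-sample the CAM generative story for one news post (a sketch under
    the assumptions stated in the text). z_n: 1 if the news is false; eta / eta_bar:
    stance distributions for false / true news; mu[i]: distribution of the
    transformed machine prediction argmax(Y^(2)) given true stance i."""
    stance_probs = eta if z_n == 1 else eta_bar
    d = rng.choice(K, size=n_responses, p=stance_probs)            # cf. Equation (87)
    argmax_pred = np.array([rng.choice(K, p=mu[i]) for i in d])    # cf. Equation (88)
    return d, argmax_pred

# Dirichlet priors over eta, eta_bar, and each row of mu keep the model conjugate:
# eta = rng.dirichlet(np.ones(K)); eta_bar = rng.dirichlet(np.ones(K))
# mu = rng.dirichlet(np.ones(K), size=K)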

In the CAM model, we did not take advantage of the logistic-normal assumption, which has been shown to be effective in the previous experiments. To simultaneously maintain the benefits of the logistic-normal assumption and the multi-class information, we propose another model called Extended CAM (E-CAM). Specifically, given the ground truth of news 푛, the stance of response 푘 follows the same distribution as in CAM (see Equation 87). Unlike CAM,

where 풀(2)푛,푘 is transformed into four integer values, E-CAM transforms it into a continuous value between 0 and 1, which indicates the debunking probability. To this end, we treat denying responses and querying responses as debunking responses. Then, we pose a logistic-normal distribution on

19 An n-simplex is a vector of n+1 positive continuous values that sum up to 1.

the debunking probability:

풀(2)푛,푘,2 + 풀(2)푛,푘,4 ~ LogisticNormal(휇푠, 휎푠²), where 푠 indicates whether stance 푑푛,푘 is debunking (denying or querying) or not. (89)

In the E-CAM model, we see that, in order to retain the logistic-normal assumption, the multi-class information is only partially exploited.
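For illustration, the E-CAM observation term can be evaluated as in the following sketch, which binarizes the stance and places a Gaussian on the logit of the predicted debunking probability; the parameterization shown is an assumption consistent with the description above, not the exact implementation:

import numpy as np

def logit(p):
    return np.log(p) - np.log(1.0 - p)

def ecam_debunk_loglik(Y2_row, d, mu, sigma):
    """Log-density of the E-CAM observation for one response. Y2_row is the
    4-dimensional predicted stance distribution; denying (2) and querying (4)
    are treated as debunking; mu/sigma are indexed by the binarized stance."""
    debunk_prob = Y2_row[1] + Y2_row[3]              # P(denying) + P(querying), 0-indexed
    is_debunk = int(d in (2, 4))                     # binarized stance of the response
    x = logit(np.clip(debunk_prob, 1e-6, 1 - 1e-6))  # logistic-normal: the logit is Gaussian
    return -0.5 * ((x - mu[is_debunk]) / sigma[is_debunk]) ** 2 \
           - np.log(sigma[is_debunk] * np.sqrt(2 * np.pi))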

As in the binary case, two authors relabeled the responses in both datasets (i.e., Weibo and Twitter) with the labels supporting, denying, commenting, and querying. Following the same experimental procedure as CAND, we ran the CAM and E-CAM methods and compared their performance with CAND. The plots are shown in Figure 43. The results in both datasets indicate

that CAND significantly outperforms CAM. This is because in CAM, 풀(2)푛,푘 is transformed into integer values and assumed to follow a categorical distribution, rather than a natural multivariate generalization of the logistic-normal distribution, which does not exist. Given that the categorical distribution is the multivariate generalization of the Bernoulli distribution and the poor performance of BAM in our experiment (see Section 4.4.2), it is as expected that CAND performs better than CAM. By virtue of the logistic-normal distribution, the extended model E-

CAM performs comparably to CAND. Such results suggest the importance of the logistic-normal assumption in our task. In addition, although the literature has shown the benefit of multi-class formulation (Kochkina et al. 2018; Ma et al. 2018), the performance gain from treating debunking response detection as a multi-class classification problem in our scenario is quite marginal. There are two potential reasons. First, in order to maintain the logistic-normal assumption, E-CAM only partially exploits the multi-class information. Although this assumption cannot be naturally generalized to its multivariate form, it would be interesting to

explore whether there exist other alternatives in the future. Second, our task formulation is different from the one in the literature. Specifically, in the literature, stance classification and false news detection are jointly trained under a multi-task learning framework, which is able to take advantage of the shared task-invariant features (Ma et al. 2018). The results of the stance classification are not directly utilized by the task of false news detection. However, in our proposed approach, we use the results of stance classification as the input for false news detection, rather than training them jointly. As the most essential and relevant labeling for recognizing the veracity of a piece of news is whether each response is debunking or not, multi-class information might not contribute in our current formulation. In the future, we can try to simultaneously take advantage of joint training (what was done in the literature) and of feeding the results of stance detection as the input of false news detection (what was done in this work) to further improve the performance.

Figure 43. Performance of CAND in Comparison with CAM and E-CAM in (a) the Weibo Dataset (IR=1:7) and (b) the Twitter Dataset (IR=1:1.5) (Ref. lines: AUC of CAND)

4.5 Conclusions

Unlike existing literature where crowd intelligence is often treated as features of black-box

methods, we propose to incorporate more interpretable structures in our Bayesian approach. In comparison with the existing literature, our approach has several advantages. First, it achieves better performance. Experimental results show that our approach performs better than end-to-end black-box models, especially when the data is highly imbalanced. Second, our approach is more interpretable and generates more technical insights. To name a few, we distinguish the roles of humans and machines and demonstrate their complementary values; combining human intelligence not only improves performance but also enables early detection; we also show that combining human intelligence is actually robust to several types of intentional manipulation.

Last, given the flexibility of our Bayesian model, we can easily extend it to incorporate more data sources. When interpretable structures exist, we combine them into the Bayesian model; if not, we have the black-box methods in task T1 to handle them.

Our research is of both theoretical and practical value. Theoretically, we first show that our approach, which uses black-box methods to handle the unstructured data and interpretable structures to combine multiple judgments, performs better than the end-to-end benchmark methods. As mentioned above, our approach has several advantages and generates many valuable insights. Second, understanding the value of scalable crowd judgments (e.g., responses and reports) is of great interest to false news detection research. Although responses are used in some literature to detect false news, the value of such scalable crowd judgments is still unclear. Our interpretable framework illustrates the complementary value of human and machine intelligence in the task of false news detection. Such a conclusion could also contribute to the broader literature on hybrid human-machine intelligence and other crowd intelligence applications such as prediction markets (Chen et al. 2017). Third, we propose to model the credibility of both humans and machines in a hybrid human-machine system. Compared with

relevant literature where only humans or machines are involved, the proposed hybrid setting is more complicated. We take a probabilistic perspective and carefully craft a Bayesian scheme for result aggregation. In addition, when modeling machine credibility, we tentatively use the logistic-normal assumption rather than the widely used two-coin assumption (Bragg and Weld

2013; Raykar et al. 2010; Wei et al. 2017). The superior performance empirically verifies this assumption’s efficacy. Our proposed techniques to model credibility have significant implications for future research where potentially unreliable humans and machines are involved.

Our research has many practical implications and actionable insights for relevant stakeholders. For social media platforms, the proposed CAND framework serves as a feasible and effective approach for false news detection. As suggested by the experimental results, the proposed framework that combines crowd judgments significantly improves the detection of false news in comparison with the benchmark methods. This improvement can translate into substantial cost savings, as it usually costs a social media platform millions of dollars to curb the spread of false news20. Given the unique complementary value of crowd judgments, a platform can encourage its users to actively contribute their intelligence even though their judgments may be unreliable. In addition, social media platforms that do not support user flagging may want to consider designing such a function. For social media platform users, the proposed research provides them with valuable insights as to how they can help curb the false news epidemic—posting responses or reporting when they doubt the content of a social media post.

We conclude our paper by presenting its limitations and future directions. First, in the

20 https://www.bbc.com/news/technology-40287399

Weibo dataset, we treat posts that are not reported as true news (Chen et al. 2018; Guo et al.

2018; Ma et al. 2016). It is possible that some of these "true news" posts are actually undetected false news. To mitigate this issue, we manually checked a small set of random samples and found that most of these "true news" posts are indeed true. Hence, we think our current strategy will not influence our evaluation much. Second, Weibo attaches warning tags to the fact-checked false news (see

Figure 21). This will create the “implied-truth” effect wherein users will regard other untagged posts as more objective and verified (Pennycook et al. 2020), hence causing fewer people to debunk or report those untagged posts. In the future, we can further improve the usefulness of crowd intelligence by mitigating this effect. Third, as a proof of concept, we represent each piece of news by its content and context features and propose to use two types of human intelligence

(i.e., responses and reports). Given the flexibility of the proposed Bayesian CLNAM model, it is straightforward to extend it by incorporating more news information.

5. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

5.1 Contributions

Although Internet-based crowd intelligence has triggered the emergence of many new business applications and provides unprecedented opportunities for existing businesses to innovate and improve their practices, significant challenges remain in the process of utilizing crowd intelligence. My research aims to address the following managerial challenges and concerns: 1) what kind of crowd intelligence to use; 2) how to extract or collect crowd intelligence; and 3) how to evaluate or aggregate crowd intelligence, under three relevant business applications, i.e., crowd stance detection, crowd labeling, and false news detection. The conclusions and findings hold great potential to advance the understanding of how to design or improve crowd intelligence applications.

Design science research in information systems should 1) deliver actionable and useful descriptive knowledge to the business scenarios from which the problem is drawn, and 2) contribute salient and rigorous prescriptive knowledge to the IS knowledge base that can be used to guide future research (Gregor and Hevner 2013; Hevner et al. 2004; Rai 2017). My dissertation fits within design science research and is dedicated to developing novel IT artifacts that have significant business value and societal impact (Hevner et al. 2004). The IT artifact of each essay addresses a relevant business problem and contributes computational frameworks, algorithms, design principles, instantiations, or nascent design theories to the IS knowledge base.

The first essay deals with the problem of crowd intelligence extraction, which sets the foundation for many crowd intelligence applications. With the proliferation of UGC in social media, mining crowd stance towards certain target or event from this unstructured data becomes

increasingly valuable to support decision making across a variety of business applications. Yet, there is a dearth of work detecting crowd opinions towards complicated targets. To fill this gap, the first essay proposes an interactive attention-based deep learning approach which is motivated by the psychological procedure of human decision making. The IT artifacts include 1) a deep learning framework that is able to accurately detect crowd opinions; 2) an exemplar of how to motivate technical insights with behavioral psychology; and 3) an instantiation of mimicking the psychological procedure of human decision making with deep learning techniques.

The second essay tackles the problem of how to aggregate crowd intelligence under the microtask crowdsourcing scenario. Microtask crowdsourcing has emerged as a cost-effective approach to collecting large-scale high-quality labeled data across a wide range of business applications. The current design of learning from crowds either makes simple or constrained assumptions, or lacks a principled evaluation such that the conclusions suffer from low interpretability and generalizability. To obtain a set of useful and reusable practices for future design, this study formulates several hypotheses and tests them using a deep generative framework that combines deep learning techniques and graphical models. The IT artifacts include 1) several generalizable design principles that have been empirically tested; 2) a computational framework combining deep learning and graphical models; and 3) a general research pipeline for design science research to obtain generalizable conclusions.

The last essay attempts to address all proposed managerial concerns in tapping into the crowds to tackle the urgent false news crisis. The explosive spread of false news in social media has severely affected many areas such as news ecosystems, politics, and economics. Although human knowledge and intelligence hold great potential to complement machine-based methods,

they are largely under-explored in current false news detection research, especially how to efficiently utilize such information. I observe that the crowd contributes to the challenging task of assessing the veracity of news by posting responses or reporting. This study proposes a

Bayesian graphical framework to combine these two types of scalable crowd judgments with machine intelligence. The IT artifacts of this essay include 1) an effective computational framework for detecting and curbing false news in social media; 2) many valuable insights for future research, such as the complementary value of human and machine intelligence, the possibility of using human intelligence for early detection, and the robustness of our approach to intentional manipulation; and 3) design principles for combining human and machine intelligence in design science research.

5.2 Future Research Directions

As one of the most prominent features of research in the Artificial Intelligence (AI) era, crowd intelligence will continue to attract attention from both industry and research communities (Li et al. 2017). This study is one of the first steps to explore how to exploit crowd intelligence from the perspective of data science and design science. This dissertation opens up many opportunities for future research. An incomplete list of future directions can be found below.

1. Hybrid intelligence. Despite the huge technological advances in the last decades,

humans always play a central role in designing, applying, and evaluating these

techniques. Currently, machine intelligence and human intelligence (even the crowd

intelligence) are still loosely coupled in many real-world applications. A salient example

is the AI system. Despite recent advances, AI is still far from being perfect in many

applications. When left alone without human assistance, AI systems occasionally make

mistakes or even completely fail (Kamar 2016). A meaningful combination of human

and machine intelligence can not only increase human capabilities but also overcome

the limitations of current AI systems. This is challenging because, compared with

machine intelligence, human expertise is a valuable but scarce resource with high costs and

constraints. Crowds, as one type of scalable human resource, hold great potential to

address this issue. In the future, an interesting direction would be considering how to

optimize the access to crowd intelligence and symbiotically integrate it with machine

intelligence.

2. Broader business applications. In this dissertation, I focus on three real-world

applications including learning from crowd labeling, crowd opinion detection, and false

news detection. It sets an exemplar of how to exploit crowd intelligence and address

managerial concerns for other crowd intelligence applications such as healthcare

crowdsourcing (Créquit et al. 2018), prediction market (Chen et al. 2017), collective

social reporting (Oh et al. 2013), and influence maximization (Zhang et al. 2018). For

example, with the development of online forums, crowdsourcing healthcare has the

potential to provide more immediate and accessible healthcare to individuals at low cost

and with high throughput. Some questions with significant value would be how to

design the crowd organization to motivate crowd contributions, combine the provided

suggestions from various healthcare service providers, or evaluate the reliability of each

service provider.

3. Theory-driven design in crowd intelligence applications. In this dissertation, the design

of proposed computational frameworks is mostly data science-driven. In the second

essay, I proposed five hypotheses regarding the design of learning from crowd labeling;

however, most hypotheses are built on top of empirical observations and lack

theoretical justification. Similarly, the hybrid design in the third essay is motivated by real-world observations, rather than some behavioral or psychological theories. As pointed out by some IS researchers, design science research should build on top of salient and rigorous prescriptive knowledge in the IS knowledge base and contribute such knowledge to guide future research (Gregor and Hevner 2013; Hevner et al. 2004;

Rai 2017). In the future, I plan to delve more into the theoretical foundation of my research, delivering more significant value to IS research and relevant stakeholders.

REFERENCES

Addawood, A., Schneider, J., and Bashir, M. 2017. "Stance Classification of Twitter Debates: The Encryption Debate as a Use Case," in Proceedings of the 8th International Conference on Social Media & Society, pp. 1–10.
Aggarwal, R., and Singh, H. 2013. "Differential Influence of Blogs across Different Stages of Decision Making: The Case of Venture Capitalists," MIS Quarterly, JSTOR, pp. 1093–1112.
Al-Ayyoub, M., Rabab'ah, A., Jararweh, Y., Al-Kabi, M. N., and Gupta, B. B. 2018. "Studying the Controversy in Online Crowds' Interactions," Applied Soft Computing (66), Elsevier, pp. 557–563.
Atanasov, P., Rescober, P., Stone, E., Swift, S. A., Servan-Schreiber, E., Tetlock, P., Ungar, L., and Mellers, B. 2017. "Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls," Management Science (63:3), pp. 691–706. (https://doi.org/10.1287/mnsc.2015.2374).
Atarashi, K., Oyama, S., and Kurihara, M. 2018. Semi-Supervised Learning from Crowds Using Deep Generative Models, presented at the Thirty-Second AAAI Conference on Artificial Intelligence.
Aitchison, J., and Shen, S. M. 1980. "Logistic-Normal Distributions: Some Properties and Uses," Biometrika (67:2), pp. 261–272. (https://doi.org/10.1093/biomet/67.2.261).
Bahdanau, D., Cho, K., and Bengio, Y. 2015. "Neural Machine Translation by Jointly Learning to Align and Translate," in 3rd International Conference on Learning Representations, ICLR 2015.

Bayus, B. L. 2013. "Crowdsourcing New Product Ideas over Time: An Analysis of the Dell IdeaStorm Community," Management Science (59:1), pp. 226–244. (https://doi.org/10.1287/mnsc.1120.1599).
Bi, W., Wang, L., Kwok, J. T., and Tu, Z. 2014. "Learning to Predict from Crowdsourced Data," in UAI, pp. 82–91.
Bishop, C. M. 2006. Pattern Recognition and Machine Learning, Information Science and Statistics, New York: Springer.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. 2017. "Variational Inference: A Review for Statisticians," Journal of the American Statistical Association (112:518), pp. 859–877. (https://doi.org/10.1080/01621459.2017.1285773).
Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. "Latent Dirichlet Allocation," Journal of Machine Learning Research (3:Jan), pp. 993–1022.
Brabham, D. C. 2008. "Crowdsourcing as a Model for Problem Solving: An Introduction and Cases," Convergence: The International Journal of Research into New Media Technologies (14:1), pp. 75–90. (https://doi.org/10.1177/1354856507084420).
Brabham, D. C. 2013. Crowdsourcing, MIT Press.
Bragg, J., and Weld, D. S. 2013. "Crowdsourcing Multi-Label Classification for Taxonomy Creation," in 1st AAAI Conference on Human Computation and Crowdsourcing.
Cabral, L., and Li, L. 2015. "A Dollar for Your Thoughts: Feedback-Conditional Rebates on eBay," Management Science (61:9), INFORMS, pp. 2052–2063.

Castillo, C., El-Haddad, M., Pfeffer, J., and Stempeck, M. 2014. "Characterizing the Life Cycle of Online News Stories Using Social Media Reactions," in Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 211–223. (https://doi.org/10.1145/2531602.2531623).
Chen, T., Li, X., Yin, H., and Zhang, J. 2018. "Call Attention to Rumors: Deep Attention Based Recurrent Neural Networks for Early Rumor Detection," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, pp. 40–52. (https://doi.org/10.1007/978-3-030-04503-6_4).
Chen, W., Li, X., and Zeng, D. D. 2017. "Modeling Fixed Odds Betting for Future Event Prediction," Management Information Systems Quarterly (41:2), pp. 645–665. (https://doi.org/10.25300/MISQ/2017/41.2.14).
Chen, Y., Wang, Q., and Xie, J. 2011. "Online Social Interactions: A Natural Experiment on Word of Mouth versus Observational Learning," Journal of Marketing Research (48:2), SAGE Publications Sage CA: Los Angeles, CA, pp. 238–254.
Créquit, P., Mansouri, G., Benchoufi, M., Vivot, A., and Ravaud, P. 2018. "Mapping of Crowdsourcing in Health: Systematic Review," Journal of Medical Internet Research (20:5), JMIR Publications Inc., Toronto, Canada, p. e187.
Dang, Y., Zhang, Y., and Chen, H. 2009. "A Lexicon-Enhanced Method for Sentiment Classification: An Experiment on Online Product Reviews," IEEE Intelligent Systems (25:4), IEEE, pp. 46–53.
Davis, J., and Goadrich, M. 2006. "The Relationship between Precision-Recall and ROC Curves," in Proceedings of the 23rd International Conference on Machine Learning, ACM, pp. 233–240. (https://doi.org/10.1145/1143844.1143874).

Dawid, A. P., and Skene, A. M. 1979. "Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm," Journal of the Royal Statistical Society: Series C (Applied Statistics) (28:1), pp. 20–28. (https://doi.org/10.2307/2346806).
Demartini, G., Difallah, D. E., Gadiraju, U., and Catasta, M. 2017. "An Introduction to Hybrid Human-Machine Information Systems," Foundations and Trends® in Web Science (7:1), pp. 1–87. (https://doi.org/10.1561/1800000025).
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. 2018. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding," ArXiv Preprint ArXiv:1810.04805.
Dey, K., Shrivastava, R., and Kaushik, S. 2018. "Topical Stance Detection for Twitter: A Two-Phase LSTM Model Using Attention," in European Conference on Information Retrieval, Springer, pp. 529–536.
Ding, X., Liu, T., Duan, J., and Nie, J.-Y. 2015. "Mining User Consumption Intention from Social Media Using Domain Adaptive Convolutional Neural Network," in AAAI, pp. 2389–2395.
Dizaji, K. G., and Huang, H. 2018. "Sentiment Analysis via Deep Hybrid Textual-Crowd Learning Model," in Thirty-Second AAAI Conference on Artificial Intelligence, April 25. (https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17270).
Doan, A., Ramakrishnan, R., and Halevy, A. Y. 2011. "Crowdsourcing Systems on the World-Wide Web," Communications of the ACM (54:4), pp. 86–96. (https://doi.org/10.1145/1924421.1924442).
Du, J., Xu, R., He, Y., and Gui, L. 2017. Stance Classification with Target-Specific Neural Attention Networks, International Joint Conferences on Artificial Intelligence.

Ferreira, W., and Vlachos, A. 2016. "Emergent: A Novel Data-Set for Stance Classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1163–1168.
Gadiraju, U., Kawase, R., Dietze, S., and Demartini, G. 2015. Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys, pp. 1631–1640.
Gregor, S., and Hevner, A. R. 2013. "Positioning and Presenting Design Science Research for Maximum Impact," MIS Quarterly, JSTOR, pp. 337–355.
Guo, H., Cao, J., Zhang, Y., Guo, J., and Li, J. 2018. "Rumor Detection with Hierarchical Social Attention Network," in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy: ACM Press, pp. 943–951. (https://doi.org/10.1145/3269206.3271709).
Gupta, A., Lamba, H., Kumaraguru, P., and Joshi, A. 2013. "Faking Sandy: Characterizing and Identifying Fake Images on Twitter during Hurricane Sandy," in Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil: ACM Press, pp. 729–736. (https://doi.org/10.1145/2487788.2488033).
He, H., and Garcia, E. A. 2008. "Learning from Imbalanced Data," IEEE Transactions on Knowledge & Data Engineering (9), pp. 1263–1284. (https://doi.org/10.1109/TKDE.2008.239).
He, J., Fang, X., Liu, H., and Li, X. 2019. "Mobile App Recommendation: An Involvement-Enhanced Approach," Management Information Systems Quarterly (43:3), pp. 827–849. (https://doi.org/10.25300/MISQ/2019/15049).
Hercig, T., Krejzl, P., Hourová, B., Steinberger, J., and Lenc, L. 2017. "Detecting Stance in Czech News Commentaries," in ITAT, pp. 176–180.
Hevner, A. R., March, S. T., Park, J., and Ram, S. 2004. "Design Science in Information Systems Research," MIS Quarterly, JSTOR, pp. 75–105.
Ho, C.-J., Slivkins, A., Suri, S., and Vaughan, J. W. 2015. "Incentivizing High Quality Crowdwork," in Proceedings of the 24th International Conference on World Wide Web, pp. 419–429.
Huang, N., Burtch, G., Gu, B., Hong, Y., Liang, C., Wang, K., Fu, D., and Yang, B. 2019. "Motivating User-Generated Content with Performance Feedback: Evidence from Randomized Field Experiments," Management Science (65:1), INFORMS, pp. 327–345.
Ioffe, S., and Szegedy, C. 2015. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in International Conference on Machine Learning, pp. 448–456.

Jin, Z., Cao, J., Jiang, Y.-G., and Zhang, Y. 2014. "News Credibility Evaluation on Microblog with a Hierarchical Propagation Model," in 2014 IEEE International Conference on Data Mining, IEEE, pp. 230–239. (https://doi.org/10.1109/ICDM.2014.91).
Jin, Z., Cao, J., Zhang, Y., and Luo, J. 2016. "News Verification by Exploiting Conflicting Social Viewpoints in Microblogs," in 13th AAAI Conference on Artificial Intelligence.
Jin, Z., Cao, J., Zhang, Y., Zhou, J., and Tian, Q. 2017. "Novel Visual and Statistical Image Features for Microblogs News Verification," IEEE Transactions on Multimedia (19:3), pp. 598–608. (https://doi.org/10.1109/TMM.2016.2617078).
Johnson, M. J., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. 2016. "Composing Graphical Models with Neural Networks for Structured Representations and Fast Inference," in Advances in Neural Information Processing Systems, pp. 2946–2954.
Kajino, H., Tsuboi, Y., and Kashima, H. 2012. "A Convex Formulation for Learning from Crowds," in Twenty-Sixth AAAI Conference on Artificial Intelligence.
Kajino, H., Tsuboi, Y., and Kashima, H. 2013. "Clustering Crowds," in Twenty-Seventh AAAI Conference on Artificial Intelligence.
Kamar, E. 2016. "Directions in Hybrid Intelligence: Complementing AI Systems with Human Intelligence," in Proceedings of the 25th International Joint Conference on Artificial Intelligence, AAAI Press, pp. 4070–4073.
Karger, D. R., Oh, S., and Shah, D. 2011. "Iterative Learning for Reliable Crowdsourcing Systems," in Advances in Neural Information Processing Systems, pp. 1953–1961.
Karimi, H., Roy, P., Saba-Sadiya, S., and Tang, J. 2018. "Multi-Source Multi-Class Fake News Detection," in Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA: Association for Computational Linguistics, August, pp. 1546–1557.
Kazai, G., Kamps, J., and Milic-Frayling, N. 2011. "Worker Types and Personality Traits in Crowdsourcing Relevance Labels," Proceedings of the 20th ACM International Conference on Information and Knowledge Management - CIKM '11, pp. 1941–1941. (https://doi.org/10.1145/2063576.2063860).
Kim, H.-C., and Ghahramani, Z. 2012. "Bayesian Classifier Combination," in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pp. 619–627.
Kim, J., Tabibian, B., Oh, A., Schölkopf, B., and Gomez-Rodriguez, M. 2018. "Leveraging the Crowd to Detect and Reduce the Spread of Fake News and Misinformation," in Proceedings of the 11th ACM International Conference on Web Search and Data Mining, ACM, pp. 324–332. (https://doi.org/10.1145/3159652.3159734).
Kim, Y. 2014. "Convolutional Neural Networks for Sentence Classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751.
Kingma, D. P., and Ba, J. L. 2015. "Adam: A Method for Stochastic Optimization," International Conference on Learning Representations 2015. (https://doi.org/10.1145/1830483.1830503).

Kingma, D. P., and Welling, M. 2014. "Auto-Encoding Variational Bayes," in International Conference on Learning Representations (ICLR). (http://arxiv.org/abs/1312.6114).
Kochkina, E., Liakata, M., and Zubiaga, A. 2018. "All-in-One: Multi-Task Learning for Rumour Verification," in Proceedings of the 27th International Conference on Computational Linguistics, pp. 3402–3413.
Kucher, K., Paradis, C., and Kerren, A. 2018. "Visual Analysis of Sentiment and Stance in Social Media Texts," in EuroVis (Posters), pp. 49–51.
Küçük, D., and Can, F. 2020. "Stance Detection: A Survey," ACM Computing Surveys (53:1), pp. 1–37. (https://doi.org/10.1145/3369026).
Kumar, S., and Shah, N. 2018. "False Information on Web and Social Media: A Survey," ArXiv:1804.08559 [Cs]. (http://arxiv.org/abs/1804.08559).
Kwark, Y., Chen, J., and Raghunathan, S. 2014. "Online Product Reviews: Implications for Retailers and Competing Manufacturers," Information Systems Research (25:1), INFORMS, pp. 93–110.
Lai, M., Cignarella, A. T., and Hernández Farías, D. I. 2017. "Itacos at Ibereval2017: Detecting Stance in Catalan and Spanish Tweets," in IberEval 2017 (Vol. 1881), CEUR-WS.org, pp. 185–192.
Lapidot-Lefler, N., and Barak, A. 2012. "Effects of Anonymity, Invisibility, and Lack of Eye-Contact on Toxic Online Disinhibition," Computers in Human Behavior (28:2), pp. 434–443. (https://doi.org/10.1016/j.chb.2011.10.014).
Lazer, D. M. J., Baum, M. A., Benkler, Y., Berinsky, A. J., Greenhill, K. M., Menczer, F., Metzger, M. J., Nyhan, B., Pennycook, G., Rothschild, D., Schudson, M., Sloman, S. A., Sunstein, C. R., Thorson, E. A., Watts, D. J., and Zittrain, J. L. 2018. "The Science of Fake News," Science (359:6380), pp. 1094–1096. (https://doi.org/10.1126/science.aao2998).
LeCun, Y., Bengio, Y., and Hinton, G. 2015. "Deep Learning," Nature (521:7553), Nature Publishing Group, pp. 436–444.
Lee, H. C. B., Ba, S., Li, X., and Stallaert, J. 2018. "Salience Bias in Crowdsourcing Contests," Information Systems Research (29:2), pp. 401–418. (https://doi.org/10.1287/isre.2018.0775).
Leung, L. 2009. "User-Generated Content on the Internet: An Examination of Gratifications, Civic Engagement and Psychological Empowerment," New Media & Society (11:8), SAGE Publications Sage UK: London, England, pp. 1327–1347.
Li, S., Jiang, Y., Chawla, N. V., and Zhou, Z. 2019. "Multi-Label Learning from Crowds," IEEE Transactions on Knowledge and Data Engineering (31:7), pp. 1369–1382. (https://doi.org/10.1109/TKDE.2018.2857766).
Li, S., Zhao, Z., Hu, R., Li, W., Liu, T., and Du, X. 2018. "Analogical Reasoning on Chinese Morphological and Semantic Relations," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 138–143. (https://doi.org/10.18653/v1/P18-2023).

Li, W., Wu, W., Wang, H., Cheng, X., Chen, H., Zhou, Z., and Ding, R. 2017. "Crowd Intelligence in AI 2.0 Era," Frontiers of Information Technology & Electronic Engineering (18:1), pp. 15–43. (https://doi.org/10.1631/FITEE.1601859).
Lin, W., Hubacher, N., and Khan, M. E. 2018. "Variational Message Passing with Structured Inference Networks," in ICLR.
Lippi, M., and Torroni, P. 2016. "Argumentation Mining: State of the Art and Emerging Trends," ACM Transactions on Internet Technology (TOIT) (16:2), ACM New York, NY, USA, pp. 1–25.
Liu, B. 2010. "Sentiment Analysis and Subjectivity," Handbook of Natural Language Processing (2:2010), pp. 627–666.
Liu, C. C., and Aitkin, M. 2008. "Bayes Factors: Prior Sensitivity and Model Generalizability," Journal of Mathematical Psychology (52:6), pp. 362–375. (https://doi.org/10.1016/j.jmp.2008.03.002).
Liu, Q., Peng, J., and Ihler, A. T. 2012. "Variational Inference for Crowdsourcing," in Advances in Neural Information Processing Systems, pp. 692–700.
Lukyanenko, R., Parsons, J., and Wiersma, Y. F. 2014. "The IQ of the Crowd: Understanding and Improving Information Quality in Structured User-Generated Content," Information Systems Research (25:4), pp. 669–689. (https://doi.org/10.1287/isre.2014.0537).
Luo, J., Wei, X., Zeng, D., Yin, J., and Zhang, B. 2018. "Towards Better Learning from Crowd Labeling: A Variational Inference Approach," in 28th Workshop on Information Technologies and Systems, pp. 3212–3222.
Luo, Y., Tian, T., Shi, J., Zhu, J., and Zhang, B. 2018. "Semi-Crowdsourced Clustering with Deep Generative Models," in Advances in Neural Information Processing Systems, pp. 3212–3222.
Ma, J., Gao, W., Mitra, P., Kwon, S., Jansen, B. J., Wong, K.-F., and Cha, M. 2016. "Detecting Rumors from Microblogs with Recurrent Neural Networks," in 25th International Joint Conference on Artificial Intelligence, pp. 3818–3824.
Ma, J., Gao, W., Wei, Z., Lu, Y., and Wong, K.-F. 2015. "Detect Rumors Using Time Series of Social Context Information on Microblogging Websites," in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ACM, pp. 1751–1754. (https://doi.org/10.1145/2806416.2806607).
Ma, J., Gao, W., and Wong, K.-F. 2017. "Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada: Association for Computational Linguistics, July, pp. 708–717. (https://doi.org/10.18653/v1/P17-1066).
Ma, J., Gao, W., and Wong, K.-F. 2018. "Detect Rumor and Stance Jointly by Neural Multi-Task Learning," in Companion Proceedings of the The Web Conference 2018, Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee, pp. 585–593. (https://doi.org/10.1145/3184558.3188729).
Maaten, L. van der, and Hinton, G. 2008. "Visualizing Data Using t-SNE," Journal of Machine Learning Research (9:Nov), pp. 2579–2605.

178 McHugh, M. L. 2012. “Interrater Reliability: The Kappa Statistic,” Biochemia Medica:

Biochemia Medica (22:3), pp. 276–282.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. 2013. “Distributed

Representations of Words and Phrases and Their Compositionality,” in Advances in

Neural Information Processing Systems, pp. 3111–3119.

Misra, A., Ecker, B., Handleman, T., Hahn, N., and Walker, M. 2016. “NLDS-UCSC at

SemEval-2016 Task 6: A Semi-Supervised pproach to Detecting Stance in Tweets,”

Proceedings of SemEval, pp. 432–439.

Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., and Cherry, C. 2016. “Semeval-2016 Task

6: Detecting Stance in Tweets,” in Proceedings of the 10th International Workshop on

Semantic Evaluation (SemEval-2016), pp. 31–41.

Mohandes, M., Deriche, M., and liyu, S. O. 2018. “Classifiers Combination Techniques:

Comprehensive Review,” IEEE Access (6), pp. 19626–19639.

(https://doi.org/10.1109/ACCESS.2018.2813079).

Moreno, P. G., Artés-Rodríguez, A., Teh, Y. W., and Perez-Cruz, F. 2015. “Bayesian

Nonparametric Crowdsourcing,” Journal of Machine Learning Research.

Nguyen, T. N., Li, C., and Niederée, C. 2017. “On arly-Stage Debunking Rumors on Twitter:

Leveraging the Wisdom of Weak Learners,” in International Conference on Social

Informatics, Springer, pp. 141–158.

Oh, O., grawal, M., and Rao, H. R. 2013. “Community Intelligence and Social Media Services:

179 A Rumor Theoretic nalysis of Tweets During Social Crises,” Management Information

Systems Quarterly (37:2), pp. 407–426. (https://doi.org/10.25300/MISQ/2013/37.2.05).

Pang, B., and Lee, L. 2005. “Seeing Stars: xploiting Class Relationships for Sentiment

Categorization with Respect to Rating Scales,” in Proceedings of the 43rd Annual

Meeting on Association for Computational Linguistics, pp. 115–124.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, É. 2012. “Scikit-Learn: Machine Learning in Python,” Journal of Machine Learning Research.

Peng, H., Cambria, E., and Hussain, A. 2017. “A Review of Sentiment Analysis Research in Chinese Language,” Cognitive Computation (9:4), pp. 423–435. (https://doi.org/10.1007/s12559-017-9470-8).

Pennington, J., Socher, R., and Manning, C. 2014. “GloVe: Global Vectors for Word Representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar: Association for Computational Linguistics, October, pp. 1532–1543. (https://doi.org/10.3115/v1/D14-1162).

Pennycook, G., Bear, A., Collins, E. T., and Rand, D. G. 2020. “The Implied Truth Effect: Attaching Warnings to a Subset of Fake News Headlines Increases Perceived Accuracy of Headlines Without Warnings,” Management Science, INFORMS. (https://doi.org/10.1287/mnsc.2019.3478).

Pennycook, G., and Rand, D. G. 2018. “Who Falls for Fake News? The Roles of Bullshit Receptivity, Overclaiming, Familiarity, and Analytic Thinking,” Journal of Personality. (https://doi.org/10.1111/jopy.12476).

Pennycook, G., and Rand, D. G. 2019a. “Lazy, Not Biased: Susceptibility to Partisan Fake News Is Better Explained by Lack of Reasoning than by Motivated Reasoning,” Cognition (188), pp. 39–50. (https://doi.org/10.1016/j.cognition.2018.06.011).

Pennycook, G., and Rand, D. G. 2019b. “Fighting Misinformation on Social Media Using Crowdsourced Judgments of News Source Quality,” Proceedings of the National Academy of Sciences (116:7), pp. 2521–2526. (https://doi.org/10.1073/pnas.1806781116).

Phang, C. W., Kankanhalli, A., and Tan, B. C. 2015. “What Motivates Contributors vs. Lurkers? An Investigation of Online Feedback Forums,” Information Systems Research (26:4), INFORMS, pp. 773–792.

Potthast, M., Kiesel, J., Reinartz, K., Bevendorff, J., and Stein, B. 2018. “A Stylometric Inquiry into Hyperpartisan and Fake News,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 231–240. (https://doi.org/10.18653/v1/P18-1022).

Qian, F., Gong, C., Sharma, K., and Liu, Y. 2018. “Neural User Response Generator: Fake News Detection with Collective User Intelligence,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, July, pp. 3834–3840. (https://doi.org/10.24963/ijcai.2018/533).

Qiu, M., Sim, Y., Smith, N. A., and Jiang, J. 2015. “Modeling User Arguments, Interactions, and Attributes for Stance Prediction in Online Debate Forums,” in Proceedings of the 2015 SIAM International Conference on Data Mining, SIAM, pp. 855–863.

Rai, A. 2017. “Editor’s Comments: Diversity of Design Science Research,” MIS Quarterly (41:1), pp. iii–xviii.

Rajendran, G., Chitturi, B., and Poornachandran, P. 2018. “Stance-in-Depth Deep Neural Approach to Stance Classification,” Procedia Computer Science (132), Elsevier, pp. 1646–1653.

Ramos, J. 2003. “Using Tf-Idf to Determine Word Relevance in Document Queries,” in Proceedings of the 1st Instructional Conference on Machine Learning (Vol. 242), pp. 133–142.

Rathore, A. K., Ilavarasan, P. V., and Dwivedi, Y. K. 2016. “Social Media Content and Product Co-Creation: An Emerging Paradigm,” Journal of Enterprise Information Management, Emerald Group Publishing Limited.

Ravi, K., and Ravi, V. 2015. “A Survey on Opinion Mining and Sentiment Analysis: Tasks, Approaches and Applications,” Knowledge-Based Systems (89), Elsevier, pp. 14–46.

Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. 2010. “Learning from Crowds,” Journal of Machine Learning Research (11:Apr), pp. 1297–1322.

Rodrigues, F., Lourenco, M., Ribeiro, B., and Pereira, F. C. 2017. “Learning Supervised Topic Models for Classification and Regression from Crowds,” IEEE Transactions on Pattern Analysis and Machine Intelligence (39:12), IEEE, pp. 2409–2422.

Rodrigues, F., and Pereira, F. C. 2018. “Deep Learning from Crowds,” in Thirty-Second AAAI Conference on Artificial Intelligence, April 25. (https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16102).

Rodrigues, F., Pereira, F., and Ribeiro, B. 2013. “Learning from Multiple Annotators: Distinguishing Good from Random Labelers,” Pattern Recognition Letters (34:12), Elsevier, pp. 1428–1436.

Ruchansky, N., Seo, S., and Liu, Y. 2017. “CSI: A Hybrid Deep Model for Fake News Detection,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM, pp. 797–806. (https://doi.org/10.1145/3132847.3132877).

Sailunaz, K., Dhaliwal, M., Rokne, J., and Alhajj, R. 2018. “Emotion Detection from Text and Speech: A Survey,” Social Network Analysis and Mining (8:1), Springer, p. 28.

Saleh, M. R., Martín-Valdivia, M. T., Montejo-Ráez, A., and Ureña-López, L. A. 2011. “Experiments with SVM to Classify Opinions in Different Domains,” Expert Systems with Applications (38:12), Elsevier, pp. 14799–14804.

Sharma, K., Qian, F., Jiang, H., Ruchansky, N., Zhang, M., and Liu, Y. 2019. “Combating Fake News: A Survey on Identification and Mitigation Techniques,” ACM Transactions on Intelligent Systems and Technology (10:3), pp. 1–42. (https://doi.org/10.1145/3305260).

Sheng, V. S., and Zhang, J. 2019. “Machine Learning with Crowdsourcing: A Brief Summary of the Past Research and Future Directions,” Proceedings of the AAAI Conference on Artificial Intelligence (33), pp. 9837–9843. (https://doi.org/10.1609/aaai.v33i01.33019837).

Shu, K., Sliva, A., Wang, S., Tang, J., and Liu, H. 2017. “Fake News Detection on Social Media: A Data Mining Perspective,” ACM SIGKDD Explorations Newsletter (19:1), pp. 22–36. (https://doi.org/10.1145/3137597.3137600).

Shu, K., Wang, S., and Liu, H. 2018. “Understanding User Profiles on Social Media for Fake News Detection,” in 2018 IEEE Conference on Multimedia Information Processing and Retrieval, pp. 430–435. (https://doi.org/10.1109/MIPR.2018.00092).

Shu, K., Wang, S., and Liu, H. 2019. “Beyond News Contents: The Role of Social Context for Fake News Detection,” in Proceedings of the 12th ACM International Conference on Web Search and Data Mining, ACM, pp. 312–320. (https://doi.org/10.1145/3289600.3290994).

Simaki, V., Paradis, C., and Kerren, A. 2017. “Stance Classification in Texts from Blogs on the 2016 British Referendum,” in International Conference on Speech and Computer, Springer, pp. 700–709.

Simpson, E. D., Venanzi, M., Reece, S., Kohli, P., Guiver, J., Roberts, S. J., and Jennings, N. R. 2015. “Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning,” in Proceedings of the 24th International Conference on World Wide Web - WWW ’15, Florence, Italy: ACM Press, pp. 992–1002. (https://doi.org/10.1145/2736277.2741689).

Simpson, E., Roberts, S., Psorakis, I., and Smith, A. 2013. “Dynamic Bayesian Combination of Multiple Imperfect Classifiers,” in Decision Making and Imperfection, Springer, pp. 1–35.

Sobhani, P., Inkpen, D., and Matwin, S. 2015. “From Argumentation Mining to Stance Classification,” in Proceedings of the 2nd Workshop on Argumentation Mining, pp. 67–77.

Sobhani, P., Inkpen, D., and Zhu, X. 2017. “A Dataset for Multi-Target Stance Detection,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 551–557.

Somasundaran, S., and Wiebe, J. 2009. “Recognizing Stances in Online Debates,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 226–234.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research.

Suler, J. 2004. “The Online Disinhibition Effect,” CyberPsychology & Behavior (7:3), pp. 321–326. (https://doi.org/10.1089/1094931041291295).

Surowiecki, J. 2005. The Wisdom of Crowds, Anchor.

Tacchini, E., Ballarin, G., Della Vedova, M. L., Moret, S., and de Alfaro, L. 2017. “Some Like It Hoax: Automated Fake News Detection in Social Networks,” in 2nd Workshop on Data Science for Social Good, CEUR-WS, pp. 1–15.

Timoshenko, A., and Hauser, J. R. 2019. “Identifying Customer Needs from User-Generated Content,” Marketing Science (38:1), INFORMS, pp. 1–20. (https://doi.org/10.1287/mksc.2018.1123).

Tschiatschek, S., Singla, A., Gomez Rodriguez, M., Merchant, A., and Krause, A. 2018. “Fake News Detection in Social Networks via Crowd Signals,” in Companion Proceedings of the The Web Conference 2018, International World Wide Web Conferences Steering Committee, pp. 517–524. (https://doi.org/10.1145/3184558.3188722).

Tulyakov, S., Jaeger, S., Govindaraju, V., and Doermann, D. 2008. “Review of Classifier Combination Methods,” in Machine Learning in Document Analysis and Recognition (Vol. 90), S. Marinai and H. Fujisawa (eds.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 361–386. (https://doi.org/10.1007/978-3-540-76280-5_14).

Van Engelen, J. E., and Hoos, H. H. 2020. “A Survey on Semi-Supervised Learning,” Machine Learning (109:2), Springer, pp. 373–440.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. 2017. “Attention Is All You Need,” in Advances in Neural Information Processing Systems, pp. 5998–6008.

Vaughan, J. W. 2018. “Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research,” Journal of Machine Learning Research (18:193), pp. 1–46.

Vosoughi, S., Roy, D., and Aral, S. 2018. “The Spread of True and False News Online,” Science (359:6380), pp. 1146–1151. (https://doi.org/10.1126/science.aap9559).

Wainwright, M. J., and Jordan, M. I. 2008. “Graphical Models, Exponential Families, and Variational Inference,” Foundations and Trends® in Machine Learning (1:1–2), pp. 1–305. (https://doi.org/10.1561/2200000001).

Wang, G., Sun, J., Ma, J., Xu, K., and Gu, J. 2014. “Sentiment Classification: The Contribution of Ensemble Learning,” Decision Support Systems (57), Elsevier, pp. 77–93.

Wang, J., Ipeirotis, P. G., and Provost, F. 2017. “Cost-Effective Quality Assurance in Crowd Labeling,” Information Systems Research (28:1), pp. 137–158. (https://doi.org/10.1287/isre.2016.0661).

Wang, W. Y. 2017. “‘Liar, Liar Pants on Fire’: A New Benchmark Dataset for Fake News Detection,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Vol. 2), pp. 422–426.

Wang, Y., Ma, F., Jin, Z., Yuan, Y., Xun, G., Jha, K., Su, L., and Gao, J. 2018. “EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, United Kingdom: ACM Press, pp. 849–857. (https://doi.org/10.1145/3219819.3219903).

Wardle, C. 2017. Fake News. It’s Complicated. (https://firstdraftnews.org/latest/fake-news-complicated/).

Wei, X., Zeng, D. D., and Yin, J. 2017. “Multi-Label Annotation Aggregation in Crowdsourcing,” ArXiv:1706.06120 [Cs]. (http://arxiv.org/abs/1706.06120).

Wei, X., Zhang, M., and Zeng, D. D. 2019. “The Effectiveness of Non-Persistent Social Status as an Incentive Mechanism,” in PACIS, p. 31.

Wei, X., Zhang, Z., Zhang, M., and Zeng, D. D. 2019. “Combining Crowd and Machine Intelligence to Detect False News in Social Media,” Available at SSRN 3355763.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. 2010. “Caltech-UCSD Birds 200,” No. CNS-TR-201, Caltech, January 1. (http://www.vision.caltech.edu/visipedia/CUB-200.html).

Welinder, P., Branson, S., Perona, P., and Belongie, S. J. 2010. “The Multidimensional Wisdom of Crowds,” in Advances in Neural Information Processing Systems, pp. 2424–2432.

Whitehill, J., Wu, T., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. 2009. “Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (eds.), Curran Associates, Inc., pp. 2035–2043. (http://papers.nips.cc/paper/3644-whose-vote-should-count-more-optimal-integration-of-labels-from-labelers-of-unknown-expertise.pdf).

Wojatzki, M., and Zesch, T. 2016. “Ltl.Uni-Due at SemEval-2016 Task 6: Stance Detection in Social Media Using Stacked Classifiers,” in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 428–433.

Wong, F. M. F., Tan, C. W., Sen, S., and Chiang, M. 2016. “Quantifying Political Leaning from Tweets, Retweets, and Retweeters,” IEEE Transactions on Knowledge and Data Engineering (28:8), IEEE, pp. 2158–2172.

Xu, C., Paris, C., Nepal, S., and Sparks, R. 2018. “Cross-Target Stance Classification with Self-Attention Networks,” ArXiv:1805.06593 [Cs]. (http://arxiv.org/abs/1805.06593).

Xu, R., Zhou, Y., Wu, D., Gui, L., Du, J., and Xue, Y. 2016. “Overview of NLPCC Shared Task 4: Stance Detection in Chinese Microblogs,” in Natural Language Understanding and Intelligent Applications, Springer, pp. 907–916.

Yan, Y., Rosales, R., Fung, G., Schmidt, M., Hermosillo, G., Bogoni, L., Moy, L., and Dy, J. 2010. “Modeling Annotator Expertise: Learning When Everybody Knows a Bit of Something,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 932–939.

Yang, S., Shu, K., Wang, S., Gu, R., Wu, F., and Liu, H. 2019. “Unsupervised Fake News Detection on Social Media: A Generative Approach,” in Proceedings of 33rd AAAI Conference on Artificial Intelligence. (https://doi.org/10.1609/aaai.v33i01.33015644).

Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. 2016. “Hierarchical Attention Networks for Document Classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489.

Yi, J., Jin, R., Jain, S., Yang, T., and Jain, A. K. 2012. “Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning,” in Advances in Neural Information Processing Systems, pp. 1772–1780.

Yin, D., Bond, S. D., and Zhang, H. 2014. “Anxious or Angry? Effects of Discrete Emotions on the Perceived Helpfulness of Online Reviews,” MIS Quarterly (38:2), JSTOR, pp. 539–560.

Yin, L., Liu, Y., Zhang, W., and Yu, Y. 2020. “Truth Inference With a Deep Clustering-Based Aggregation Model,” IEEE Access (8), pp. 16662–16675. (https://doi.org/10.1109/ACCESS.2020.2964484).

Zarrella, G., and Marsh, A. 2016. “MITRE at SemEval-2016 Task 6: Transfer Learning for Stance Detection,” Proceedings of SemEval, pp. 470–475.

Zhang, C., Hahn, J., and De, P. 2013. “Research Note—Continued Participation in Online Innovation Communities: Does Community Response Matter Equally for Everyone?,” Information Systems Research (24:4), INFORMS, pp. 1112–1130.

Zhang, J., Zhang, Jilong, and Zhang, M. 2019. “From Free to Paid: Customer Expertise and Customer Satisfaction on Knowledge Payment Platforms,” Decision Support Systems (127), Elsevier, p. 113140.

Zhang, M., Wei, X., and Chen, G. 2018. “Maximizing the Influence in Social Networks via Holistic Probability Maximization,” International Journal of Intelligent Systems (33:10), Wiley Online Library, pp. 2038–2057.

Zhang, M., Wei, X., Guo, X., Chen, G., and Wei, Q. 2019. “Identifying Complements and Substitutes of Products: A Neural Network Framework Based on Product Embedding,” ACM Transactions on Knowledge Discovery from Data (TKDD) (13:3), ACM New York, NY, USA, pp. 1–29.

Zhang, M., Wei, X., and Zeng, D. D. 2020. “A Matter of Reevaluation: Incentivizing Users to Contribute Reviews in Online Platforms,” Decision Support Systems (128), Elsevier, p. 113158.

Zhang, Q., Zhang, S., Dong, J., Xiong, J., and Cheng, X. 2015. “Automatic Detection of Rumor on Social Network,” in Natural Language Processing and Chinese Computing, Springer, pp. 113–122. (https://doi.org/10.1007/978-3-319-25207-0_10).

Zhang, S., Qiu, L., Chen, F., Zhang, W., Yu, Y., and Elhadad, N. 2017. “We Make Choices We Think Are Going to Save Us: Debate and Stance Identification for Online Breast Cancer CAM Discussions,” in Proceedings of the 26th International Conference on World Wide Web Companion, pp. 1073–1081.

Zhang, X. M., and Zhu, F. 2011. “Group Size and Incentives to Contribute: A Natural Experiment at Chinese Wikipedia,” American Economic Review (101:4), pp. 1601–1615. (https://doi.org/10.1257/aer.101.4.1601).

Zhang, X., and Wang, C. 2012. “Network Positions and Contributions to Online Public Goods: The Case of Chinese Wikipedia,” Journal of Management Information Systems (29:2), Taylor & Francis, pp. 11–40.

Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. 2016. “Spectral Methods Meet EM: A Provably Optimal Algorithm for Crowdsourcing,” Journal of Machine Learning Research (17:1), pp. 3537–3580.

Zhang, Z., Wei, X., Zheng, X., Li, Q., and Zeng, D. 2020. Detecting Product Adoption Intention via Multi-View Deep Learning, Working paper.

Zhou, X., and Zafarani, R. 2018. “Fake News: A Survey of Research, Detection Methods, and Opportunities,” ArXiv:1812.00315 [Cs]. (http://arxiv.org/abs/1812.00315).

Zhou, Y., Cristea, A. I., and Shi, L. 2017. “Connecting Targets to Tweets: Semantic Attention-Based Model for Target-Specific Stance Detection,” in International Conference on Web Information Systems Engineering, Springer, pp. 18–32.

Zhu, F., and Zhang, X. M. 2010. “Impact of Online Consumer Reviews on Sales: The Moderating Role of Product and Consumer Characteristics,” Journal of Marketing (74:2), pp. 133–148. (https://doi.org/10.1509/jmkg.74.2.133).
