Handling Uncertainty in Citizen Science Data: Towards an Improved Amateur-Based Large-Scale Classification
Manuel Jiménez*, Isaac Triguero, Robert John

The Automated Scheduling, Optimisation and Planning Research Group, School of Computer Science, University of Nottingham, Nottingham NG8 1BB, United Kingdom

Abstract

Citizen Science, traditionally known as the engagement of amateur participants in research, is showing great potential for the large-scale processing of data. In areas such as astronomy, biology, or the geosciences, where emerging technologies generate huge volumes of data, Citizen Science projects enable image classification at a rate not achievable by experts alone. However, this approach entails the spread of biases and uncertainty in the results, since the participants involved are typically non-experts in the problem and hold variable skills. Consequently, the research community tends not to trust Citizen Science outcomes, claiming a generalised lack of accuracy and validation. We introduce a novel multi-stage approach to handle uncertainty within data labelled by amateurs in Citizen Science projects. Firstly, our method proposes a set of transformations that leverage the uncertainty in amateur classifications. Then, a hybridisation strategy provides the best aggregation of the transformed data for improving the quality of, and the confidence in, the results. As a case study, we consider the Galaxy Zoo, a project pursuing the labelling of galaxy images. A limited set of expert classifications allows us to validate the experiments, confirming that our approach is able to greatly boost accuracy and classify more images with respect to the state of the art.

Keywords: Citizen Science, Classification, Astroinformatics, Galaxy Morphologies, Uncertainty, Data analysis

*Corresponding author at: The Automated Scheduling, Optimisation and Planning Research Group, School of Computer Science, University of Nottingham, Nottingham NG8 1BB, United Kingdom. Email address: [email protected] (Manuel Jiménez)

Preprint submitted to Information Sciences, December 7, 2018.

1. Introduction

Connectivity is promoting the emergence of a great potential amongst virtual communities of people that share a common goal. In some cases, this goal may consist of making a significant contribution towards the solution of a complex scientific problem. Whereas, in the past, the analysis of such problems used to be restricted to a group of experts in the subject, today this is difficult when the processing of large amounts of data is required. In this context, Citizen Science refers to the development of scientific research assisted by amateur volunteers from the general public [15]. As a form of crowdsourcing [13], this practice is re-emerging, engaging the crowd in helping researchers complete highly time-consuming tasks for which no reliable automatic procedures are available yet, for example, the labelling of images [45], the detection of patterns in graphic data [49], or the transcription of handwritten texts [22].

Here we deal with classification problems in Citizen Science, which generally aim at the classification of huge collections of images according to a number of classes. These classes capture the interest of a specific research field, and identifying them correctly is the target of the participants.
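To make the setting concrete, the following sketch shows the kind of per-image vote data a classification project of this sort typically accumulates, and how raw counts can be normalised into per-class vote fractions. It is only an illustration under assumed names: the image identifiers, class labels, and counts are hypothetical and do not come from any particular project.

# Hypothetical raw votes from a Citizen Science classification project:
# each image accumulates an arbitrary number of votes over a fixed set of classes.
raw_votes = {
    "img_0001": {"elliptical": 34, "spiral": 5, "uncertain": 3},
    "img_0002": {"elliptical": 6, "spiral": 7, "uncertain": 9},
}

def vote_fractions(votes):
    """Normalise per-class vote counts into fractions that sum to 1."""
    total = sum(votes.values())
    return {cls: count / total for cls, count in votes.items()}

for image_id, votes in raw_votes.items():
    fractions = vote_fractions(votes)
    top_class = max(fractions, key=fractions.get)
    print(image_id, fractions, "->", top_class)

In this toy example the first image shows a clear consensus, whereas the votes for the second are spread almost evenly across the classes; that disagreement is itself information about the object, and exploiting it rather than discarding it is the concern of this work.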
Projects of this sort involve, for instance, the recognition of structures in cell images [35], of animal species in images taken in the savannah [2], or of types of storms in actual data taken from meteorological satellites [23]. Amongst others, the nascent discipline of astroinformatics [3] has greatly benefited from the analysis of astronomical data in multiple projects, providing data analysis at a scale never reached in the past [31, 37, 6]. Nevertheless, many challenges arise when one seeks to extract the maximum profit from this large-scale analysis, regarding aspects such as the best use of expert classifications [44] or participants' engagement in this type of voluntary scientific contribution [42].

Citizen Science has also attracted the attention of data scientists. Research has focused on the mining of the data using an off-line approach, that is, the study of the results once the project has finished. These works have tested the capabilities of Data Mining (DM) and Machine Learning (ML) techniques, aiming at the replication of amateurs' performance [5, 18] or at the training of ML algorithms for a certain problem [38, 7]. Moreover, ML implementations are also being used for optimising amateurs' endeavours through the course of the project, following an on-line approach. This other framework encompasses the progressive training of new participants as they acquire experience in the problem, or the interaction between an ML classifier and new labelled data as it is generated by project participants [49].

Despite this, Citizen Science still arouses scepticism within the research community [39]. Even though it offers possibilities for research that experts alone could not accomplish, it is not universally accepted as a valid research method [10]. The reasons lie in the quality of the results, which are often questioned because of several drawbacks involving the prevalence of biases and a lack of accuracy and validation [28]. Amateurs participating in Citizen Science projects exhibit a wide range of skills, and it is not guaranteed that they hold any background in science or research. Moreover, there is always some degree of uncertainty in classification problems, which usually bring additional vagueness in the definition of the classes (types of birds, shapes of galaxies, patterns in a graph, etc.). Consequently, depending on the problem and on the participants' expertise with the classification task, the confidence in amateur-labelled data varies, and Citizen Science results thus hold an intrinsic uncertainty.

The study of classification problems with uncertain labels has been developed following several approaches [12, 30]. Nonetheless, when the uncertainty comes from a set of independent judgements on the object being classified, fuzzy logic provides a very suitable framework for a thorough study of this uncertainty [27]. Areas such as multi-criteria decision making and multi-expert decision making address the problem of providing a final decision when a set of independent judgements is available [21, 46]. Several aggregation methods have been widely studied in the specialised literature, aiming to combine a set of experts' individual preferences in such a way that the result reflects the properties contained in all individual contributions [14, 47]. However, this kind of approach has not been extended to settings where a great number of non-expert opinions is available, each carrying widespread uncertainty in its final decision.
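As an illustration of the kind of operator studied in these areas, the sketch below applies an ordered weighted averaging (OWA) aggregation to a handful of independent judgements about a single object, each expressed as a degree of membership in one class. The function, weights, and values are hypothetical examples, not part of the cited works nor of the method proposed here.

import numpy as np

def owa(judgements, weights):
    """Ordered weighted averaging: sort the judgements in descending
    order and take their weighted mean with the given OWA weights."""
    ordered = np.sort(np.asarray(judgements, dtype=float))[::-1]
    weights = np.asarray(weights, dtype=float)
    assert ordered.shape == weights.shape and np.isclose(weights.sum(), 1.0)
    return float(np.dot(ordered, weights))

# Five independent (hypothetical) judgements on how strongly an image belongs to a class.
votes = [1.0, 0.8, 0.7, 0.3, 0.0]

print(owa(votes, [0.2, 0.2, 0.2, 0.2, 0.2]))  # uniform weights give the plain mean, 0.56
print(owa(votes, [0.1, 0.2, 0.4, 0.2, 0.1]))  # weights centred on the middle act like a soft median, 0.60

Operators of this family presuppose a small, fixed panel of reasonably reliable experts; in a Citizen Science project, by contrast, each object may receive votes from hundreds of volunteers of unknown skill, which is precisely the gap noted above.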
Moreover, classification problems covered by Citizen Science projects tend to produce additional vagueness related to the definition of the classes to be identified, as well as a disparity either in the total number of votes received per object or in the confidence of the amateur classifications. To the best of our knowledge, the enhancement of Citizen Science results by means of such methods has not been fully addressed yet.

In our preliminary study [25], we started investigating the potential of, and the issues derived from, the employment of these results through two simple data transformations. In this work, we propose a novel approach that, building on our previous findings, exploits the data produced in Citizen Science projects that deal with classification problems. We present a method to aggregate information about the prevalent uncertainty in this sort of data. We first identify three sources of uncertainty in Citizen Science data, which we address separately through a set of transformations that aim to enrich the original data. Then, we employ a hybridisation strategy that explores the most suitable combination of these individual transformations, providing more confident and accurate classifications. We ultimately pursue a refinement of the results so that they become more trustworthy and maximise the utility and outreach of Citizen Science projects.

We consider as a case study the first edition of the Galaxy Zoo (GZ1) project [32], one of the very first successful implementations of Citizen Science using the Internet. GZ1 finished having classified nearly one million galaxy images with the help of more than 200,000 volunteers. However, those results did not consider at the time a substantial part of the information stored in the original data about the uncertainty in amateur classifications. Making an integrated use of the same original GZ1 dataset, our approach is able to provide more accurate classifications for a greater number of galaxies, improving the state of the art for this problem.

This document is organised as follows. In Section 2, we extend the background on Citizen Science and the management of uncertainty with fuzzy logic. In Section 3, we introduce our approach for the improvement of Citizen Science data. Section 4 presents the set of experiments that test our method, along with a discussion of the results. Finally, in Section 5, we draw some conclusions and outline possible directions for future work.

2. Background

In this section, we further introduce the main materials covered in this work for a better comprehension. In Section 2.1, we first explain in more detail different aspects around the running of Citizen Science projects and review current trends in the specialised literature. After