Ethical issues in research using datasets of illicit origin

Daniel R. Thomas, Sergio Pastrana, Alice Hutchings, Richard Clayton, Alastair R. Beresford
Cambridge Cybercrime Centre, Computer Laboratory, University of Cambridge, United Kingdom
[email protected]

ABSTRACT

We evaluate the use of data obtained by illicit means against a broad set of ethical and legal issues. Our analysis covers both the direct collection, and secondary uses of, data obtained via illicit means such as exploiting a vulnerability, or unauthorized disclosure. We extract ethical principles from existing advice and guidance and analyse how they have been applied within more than 20 recent peer-reviewed papers that deal with illicitly obtained datasets. We find that existing advice and guidance does not address all of the problems that researchers have faced and explain how the papers tackle ethical issues inconsistently, and sometimes not at all. Our analysis reveals not only a lack of application of safeguards but also that legitimate ethical justifications for research are being overlooked. In many cases positive benefits, as well as potential harms, remain entirely unidentified. Few papers record explicit Research Ethics Board (REB) approval for the activity that is described and the justifications given for exemption suggest deficiencies in the REB process.

CCS CONCEPTS

• Social and professional topics → Computing profession; Codes of ethics; Computing / technology policy; • General and reference → Surveys and overviews; • Applied computing → Law; • Networks → Network privacy and anonymity

KEYWORDS

Ethics, law, leaked data, found data, unintentionally public data, data of illicit origin, cybercrime, Menlo Report

ACM Reference format:
Daniel R. Thomas, Sergio Pastrana, Alice Hutchings, Richard Clayton, and Alastair R. Beresford. 2017. Ethical issues in research using datasets of illicit origin. In Proceedings of IMC '17, London, UK, November 1–3, 2017, 18 pages. DOI: 10.1145/3131365.3131389

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
IMC '17, London, UK
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. 978-1-4503-5118-8/17/11...$15.00
DOI: 10.1145/3131365.3131389

1 INTRODUCTION

The scientific method requires empirical evidence to test hypotheses. Consequently, both the gathering, and the use of, data is an essential component of science and supports evidence-based decision making. Computer scientists make significant use of data to support research and inform policy, and this includes data which was obtained through illegal or unethical behaviour.

In this paper we consider the ethical and legal issues surrounding the use of datasets of illicit origin, which we define as data collected as a result of (i) the exploitation of a vulnerability in a computer system; (ii) an unintended disclosure by the data owner; or (iii) an unauthorized leak by someone with access to the data.

The collection, or use, of a dataset of illicit origin to support research can be advantageous. For example, legitimate access to data may not be possible, or the reuse of data of illicit origin is likely to require fewer resources than collecting data again from scratch. In addition, the sharing and reuse of existing datasets aids reproducibility, an important scientific goal. The disadvantage is that ethical and legal questions may arise as a result of the use of such data.

There is evidence that some researchers who use datasets of illicit origin consider ethical and legal issues, particularly through the introduction of ethical consideration sections into papers [83] and the development and use of institutional resources such as Research Ethics Boards (REBs) [26]. Unfortunately, our work shows that neither is common practice, and even where they are tackled, ethical and legal considerations often appear incomplete. It therefore follows that potential harms may have taken place which might otherwise have been mitigated or avoided. Research that lacks sufficient ethical consideration may still be ethical, but it is difficult to assess this.

General guidance, such as that provided by the Menlo Report [28], provides useful advice, but does not address all the issues that arise in using data of illicit origin. Academic discussions have taken place [32], and there are blog posts and other informal articles by academics on the topic [119, 124], but to date there is little in the way of detailed analysis, or a systematisation of knowledge, which explores this problem in depth.

The goal of this paper is to address this gap and provide a detailed evaluation of the use of data of illicit origin in peer-reviewed research, and to support the development of a more nuanced understanding of the issues and problems in this space. We do this by first reviewing previous work to identify the ethical (§2) and legal (§3) issues that can arise. We then analyse over 20 recent peer-reviewed papers which make use of data of illicit origin (§4), and systematize (Table 1) the ethical and legal decisions made against a common set of justifications, safeguards, potential harms and potential benefits (§5).

2 ETHICS

Ethical norms are constantly changing, with research ethics developing over the course of the 20th century and becoming more prominent in our field in the 21st century. Previous work related to the ethical use of data of illicit origin spans a number of topics, including informed consent, human rights, releasing and using shared data, hacking, analysis techniques, ethical review, and Research Ethics Boards (REBs). We consider each of these in turn.

Informed consent: The earliest work on the ethics of computer-monitored data explored informed consent and emphasised the right of withdrawal as well as the importance of data anonymisation [89]. The first difficulty with data of illicit origin is that it is not always possible to meet these requirements. Acquiring consent from users involved in leaked data is challenging, particularly if they are involved in illegal activities [72]. In the case of data obtained from underground marketplaces, covert research without consent is necessary to understand what is traded due to the illegality of the goods bought or sold – consent could affect the results [101]. This is one of the exceptions for informed consent in the ethics statement of the British Society of Criminology, which states that "covert research may be allowed where the ends might be thought to justify the means" [23].

In cases where consent is possible, previous work has concluded that if informed consent has been given on the basis of a promise of confidentiality by the researcher, then researchers should take particular care to ensure that they are willing to keep the promises they make, particularly if doing so might require them to break the law [52].

Where informed consent is impossible to obtain, the Menlo Report recommends that the REB must protect the interests of the individuals [26]. Thus, the REB has a particularly important role to play in research which makes use of data of illicit origin where informed consent is not possible [23].

Human rights: Human rights also provide an important ethical baseline. These include the right to life, the right to be free of arbitrary arrest, the right to a fair trial, a presumption of innocence until proven guilty, a right to not have arbitrary invasions of privacy, and a right not to be arbitrarily deprived of property [112]. Research using data of illicit origin may indirectly deprive people of such rights and so this needs to be considered. For example, in the Philippines in 2016, suspected illegal drug users or dealers were subject to extra-judicial assassination [6] and hence care would need to be taken with data collected from online drug markets to ensure this did not result in such abuse.

Releasing and using shared data: The WECSR workshop in 2012 convened a panel of experts from different domains who agreed that research involving data of illicit origin would need to have a clear benefit to society [32]. They also argued that simply because data is public does not exempt research using such data from obtaining REB approval, since it might contain personally identifiable information. This echoes the Menlo Report's suggestion that the REB must protect the interests of individuals where informed consent is impossible. Sharing of datasets is beneficial for data science, but the purpose and scope for using such data must be stated [19].

Allman and Paxson discussed the ethical issues of releasing data, using data released by others, and the interactions between providers and users of data [4]. A key ethical consideration in this context is privacy protection. It is likely that data of illicit origin was not intended for research purposes or public exposure, and thus it may not be anonymised. In such cases, the raw dataset should not be shared publicly, and research conducted with such data should aim to preserve privacy. Researchers who hold data of illicit origin should only provide details of their source or (as Allman and Paxson suggest) share data with verified researchers under a written acceptable usage policy. None of the papers we discuss later took this approach. Partridge argues that papers in network measurement research should have an ethics section, partly to increase the availability of examples of ethical reasoning [83]. We show in §5 that few papers using data of illicit origin have an ethics section.

Both Allman & Paxson, and Partridge, warn against relying on the anonymisation of data since deanonymisation techniques are often surprisingly powerful. Robust anonymisation of data is difficult, particularly when it has high dimensionality, as the anonymisation is likely to lead to an unacceptable level of data loss [3].

Hacking and intervening: Hacking into computers to extract information is usually unethical [102] and illegal. Moore and Clayton considered ethical dilemmas in take-down research resulting from nine dilemmas they faced during their research. They considered the balance between reducing harm uncovered during measurements and the accuracy of such measurements, the dangers of telling criminals the flaws in their systems, and the importance of ensuring that proposed interventions are likely to work [75]. Dittrich et al. provide two case studies on ethical decision making for remote mitigation of botnets [29]. They discuss the ethical issues involved, including the reasons for and against intervening.

Analysis techniques: The ethics committee of the Association of Internet Researchers (AoIR) has produced guidance for ethical decision making in Internet research in 2002 [33] and 2012 [71]. This cross-disciplinary work provides useful questions to aid researchers in considering the ethics of their research and defines a process for ethical decision making. AoIR aimed to publish case studies of the application of their guidelines but this has not yet happened. This paper considers over 20 papers which might have used the AoIR ethics guidelines but only one of them (§4.3.2) did so. Keegan and Matias developed a multi-party risk-benefit framework for use in analysing ethical considerations for online community research [56]. While this was implicitly intended for research surrounding particular online community platforms, the same principles apply to research that considers the online community of the Internet, and so it may be a helpful technique.

Ethical review: The Menlo Report [28] and its companion [26] are the primary reference on ethical practice in Information Communication Technology Research (ICTR), particularly for USA-based researchers. It includes numerous questions to help researchers consider ethical issues and case studies to illustrate their application. It identifies that ICTR has a greater scale, speed, coupling, decentralisation, distribution, and opacity than traditional human subject research and hence it re-examines the particular ethical principles required to evaluate ICTR. It identifies four ethical principles [26, §B]: Respect for persons: individuals should be treated as autonomous agents, and persons with diminished autonomy should be given additional protection. Beneficence: minimise possible harms, and maximise possible benefits; the researcher should also use safeguards against potential harms. Justice: risks and benefits should be distributed fairly and not on the basis of protected characteristics such as race, or other characteristics that correlate with protected ones. Respect for law and public interest: in general, ethical research conforms to applicable laws in relevant jurisdictions. Research should always be in the public interest. Additionally, research should be open, transparent, reproducible and peer-reviewed.

Research Ethics Boards (REBs): Program committees or journal editors are able to review the ethics of work after it has been conducted but before it is published. REBs, known as Institutional Review Boards (IRBs) in many US institutions or Ethics Committees in some UK institutions, review the ethics of proposed research before it is conducted. Many REBs were originally formed in response to a review of the ethics of medical research following revelations of unethical medical research conducted prior to the 1970s [26, §A.1]. In this context there were clearly human subjects who had rights that needed to be protected. The term "human subject" is now deprecated in ethical review in favour of considering the wide variety of people who might be "participants" in the research, even if they are not aware the research is being conducted. However, some REBs are still structured around serving this original purpose, and thus they lack the expertise to understand ICTR or the process to evaluate research whose risks and challenges differ from a medical trial. Such structures discourage researchers from using REBs as they do not add value and may introduce many months of delay. By contrast, other REBs (such as ours in Cambridge) have ICTR specialists and aim to provide a response in five working days for simple cases. In general, REBs are required because researchers are biased when assessing the ethics of their own research [26]. REBs can help researchers identify additional safeguards or improvements in experimental design that make the work ethical, and can help protect researchers from liability.

2.1 Ethical issues

In order to support analysis of the case studies in §4, we now list the set of ethical issues that require consideration when conducting research with data of illicit origin:

Identification of stakeholders: The primary, secondary, and key stakeholders should be identified to support the analysis of the potential harms and benefits of the research. Primary stakeholders are those directly connected with the data, such as those identified in it; secondary stakeholders are intermediaries in the delivery of benefits or harms, such as service providers; and key stakeholders are those, such as the leaker or the researcher, who are critical to the conduct of the research.

Informed consent: In most of the research we consider it was impossible or impractical to obtain informed consent from the primary, and in some cases secondary, stakeholders. However, research may be designed such that informed consent is not required. Since none of the case studies we have considered obtained informed consent for their use of data of illicit origin, we do not consider it in later analysis.

Identify harms: The potential harms arising from the use of the data of illicit origin should be identified.

Safeguards: Researchers should apply mechanisms to mitigate or reduce the potential for harm.

Justice: The research does not unfairly advantage or disadvantage any particular social or cultural group.

Public interest: The research has been published, is reproducible, and there is a "social acceptability" exceeding the harms [35].

3 LEGAL ISSUES

Legal issues surrounding research with data collected illegally can be complex, particularly as the laws of multiple jurisdictions are likely to be applicable. We are not lawyers and researchers should seek their own legal advice. Countries whose laws may apply to the research being conducted include those where the individuals or systems that these data refer to reside, the countries where the data was stored, the countries where the researchers conducted the research, and possibly also any countries that the data transited during any part of this process. Researchers often travel and so they should consider the impact of committing offences, both in their home jurisdiction and in countries that they visit or that they might be extradited to.

The key legal issues applicable to research with data of illicit origin are as follows:

Computer Misuse: Most jurisdictions now have laws against the misuse or abuse of computers, such as the UK [21], the US [1], and Germany [38, 39, 40, 41]. These can cover generic actions such as the unauthorised use of a computer system (even if there was no technical measure in place to prevent it), and the use of hacking tools, or 'dual use' tools that may be used for malicious purposes.

Copyright: The right to produce copies, including, in some jurisdictions, database rights, and trade secrets may apply to data obtained by researchers. In particular, it may affect the further sharing of data with other researchers as that might constitute the creation of copies. There are exemptions to copyright such as "fair use", which vary with jurisdiction.

Data Privacy: Data may contain personally identifiable information which may mean it needs to be protected and processed in accordance with relevant Data Privacy and Data Protection rules. In several jurisdictions IP addresses may be considered personal data, which complicates their use, particularly where consent has not been obtained. This is the case in Germany [115, p29], though a European Court of Justice ruling has found that personally identifiable data can still be processed without consent for security purposes (e.g. IP addresses in web server logs) [48]. There is an exemption for the use of personal data in Germany "if it is necessary for a research entity in order to conduct a scientific research, the scientific interest to conduct the research project substantially predominates over the interest of the data subject in exclusion of the change of purpose the data was collected for, and if the research cannot be conducted otherwise or can otherwise be conducted under disproportional effort." (German Federal Data Protection Code §28.2.3 [115]).

The General Data Protection Regulation (GDPR) [22] applies from May 2018 to the collection and processing of personal data in the EU and to organisations that offer goods or services to individuals in the EU [50]. It provides specific measures to allow processing of personal data for scientific research in the public interest, subject to appropriate safeguards such as encryption, pseudonymisation, and data minimisation. It mentions that scientific research should increase knowledge and that personal data should not be included in publications. It specifically allows the processing of data collected for other purposes for scientific or historical research (Article 5). It requires (Article 14.5.b) that the interests of data subjects be protected and that information about the data collected, how it is being processed and safeguarded, and who is responsible be made publicly available. It encourages the use of approved codes of conduct surrounding data processing, and it may be helpful for research communities to develop such codes of conduct. Penalties for violating the GDPR include fines of up to EUR 20 million, or 4% of worldwide turnover, whichever is higher.

Terrorism: In some jurisdictions (e.g. the UK) it may be an offence to fail to report terrorist activity [108], including any discovered during a research project. Additionally, possession of terrorist materials may be an offence unless specific exceptions for research are met. REB approval and institutional oversight are likely to be necessary if the research involves terrorist materials, such as discussion of planned attacks or techniques, to ensure the researcher is protected [113].

Indecent images: Possession of indecent images of children is an offence in many jurisdictions including the UK [88], the USA [2], and Germany [37]. In general there are no exemptions for research. Hence, care may need to be taken when scraping or receiving some types of data dump in case they contain such material.

National Security: Data obtained may be protected by national security legislation. Therefore unauthorised use or publication of these data may expose the researchers to legal risks. Even if data is publicly available it may still be classified [36]. This is discussed further in §4.5.2.

Contracts: Researchers may be exposed to civil liability resulting from breach of contract by using certain data if doing so violates terms of service or other contracts that researchers have agreed to.

Occasionally research may be illegal but still ethical. In such cases researchers should be transparent about what they are doing and both they and their institutions should be willing to accept the consequences. REB approval is essential in such cases. In such circumstances researchers should actively engage with lawmakers to improve the law so that the ethical work that they want to do is made legal in future [52].

There are generic defences against legal liability that may apply. Mens rea: in some cases, if the researcher can demonstrate a lack of criminal intent then a criminal prosecution cannot succeed; REB approval may be a useful way to demonstrate this. Not in the public interest to prosecute: this is an especially generic defence, but it is uncertain.

The use of an REB may transfer legal risks to the researcher's institution or ensure that the institution provides legal assistance to the researcher. This alone is a strong incentive for individual researchers to use an REB.

3.1 Related work on law and ICTR

Others have discussed legal issues in ICTR for particular kinds of research in particular jurisdictions.

Soghoian discusses legal risk arising from conducting phishing experiments, with four case studies and a discussion of the copyright, trespass to chattels, trademark, terms of service violation, computer fraud, and anti-phishing issues in the USA [100]. He recommends that all such research should be subject to ethical review as well as legal advice, and that the institution's IT staff and anyone else likely to receive a cease and desist letter related to the project should be consulted. He also notes the importance of not gaining any financial benefit from the work; for example, adverts should not be shown on pages associated with the work.

Ohm et al. examine the legal issues for network measurement research arising from USA Federal Law [81]. They suggest capturing only the required data, scrubbing IP addresses, encrypting data when not analysing it, restricting monitoring to the smallest possible network, and ensuring that measuring tools do not store the full packets to disk.
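The IP-address scrubbing that Ohm et al. suggest can be done in more than one way. The sketch below is our own illustration, not code from any of the cited works: it shows truncation to a network prefix and keyed pseudonymisation, so that raw addresses need never be written to disk (the key name and helper functions are hypothetical).

```python
import hmac
import hashlib
import ipaddress

# Hypothetical per-study secret; store separately from the data and destroy
# it when the study ends.
SECRET_KEY = b"rotate-and-store-separately"

def truncate_ip(addr: str, prefix: int = 24) -> str:
    """Coarsen an IPv4 address to its /prefix network (lossy scrubbing)."""
    net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(net.network_address)

def pseudonymise_ip(addr: str) -> str:
    """Replace an address with a keyed pseudonym: consistent within a study,
    but not recomputable without the key (HMAC-SHA-256, truncated)."""
    return hmac.new(SECRET_KEY, addr.encode(), hashlib.sha256).hexdigest()[:16]

print(truncate_ip("198.51.100.23"))  # -> 198.51.100.0
p1 = pseudonymise_ip("198.51.100.23")
p2 = pseudonymise_ip("198.51.100.23")
assert p1 == p2 and p1 != "198.51.100.23"
```

Truncation is irreversible but coarse; keyed pseudonyms preserve per-host linkage for analysis while the key is held. An unkeyed hash would be a poor substitute here, since the small IPv4 address space can be exhaustively hashed, which is one reason deanonymisation is so often feasible.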

Burstein discusses the legal issues surrounding cyberse- were incorrect [3:]. They noted ethical concerns without giv- curity research, and in particular research using network ing details, and referred the reader to the Menlo report [48]. traces, running malware honeypots, or mitigating attacks by To prevent harm, CAIDA only looked at data targeting their interfering with malicious systems [37]. own darknet. Malécot and Inoue took a similar approach, analysing their network telescope [92]. However, they then 6 CASE STUDIES realised that they knew the IP addresses of the devices as they were the sources of the probes of their network tele- The ethical and legal considerations of conducting and pub- scope. These IP addresses thus once belonged to devices that lishing academic research with data of illicit origin depends could be easily brute forced as they had weak Telnet pass- on the context [346]. Many factors may come into play, such words, in doing so the authors Identify harms. The Safeguards as the type of data involved, the measurement techniques they used were that they kept these IP addresses confidential used or the results of the research. Thus, ethical and legal pending finding an ethically acceptable and practical way of consideration must be conducted separately for each research dealing with the situation.3 activity. This section presents a series of concrete case studies Krenc, Hohlfeld and Feldmann published a non-peer re- where different kinds of data of illicit origin was used. We viewed editorial note where they analysed the scan results [84]. do not attempt to provide a complete survey on each topic. They found numerous technical problems with the data, and Instead, for each topic we consider recent and relevant work, the authors concluded that given that Carna scan made no focussing on those that mention or expose interesting ethi- technical contributions, it had been unethical to conduct. cal issues. 
We selected from papers we were already aware While they did not provide an opinion on whether it is ethi- of, those we found from searches, and from following refer- cal to use these data for research, they did use it for these ences forwards and backwards. Decisions on which papers purposes. to include were necessarily subjective. We consider work by When these scan data were first released, it was not neces- people who self-identify as researchers even if they do not sary to use them to answer research questions as researchers publish in academic venues, although we have focussed on could conduct their own scans legally, ethically, and with publications which have undergone peer review. We highlight better technical validity. For new research that requires com- the ethical (e.g. Identification of stakeholders) and legal (e.g. parison with the Internet of 4234, these Carna scan data Computer misuse) issues defined in §4.3 and §5 in each case, might be of use. However, due to substantial technical prob- and the overall results are summarised in §7. We do not cover lems with these Carna scan data [84] in many cases there research using active measurements such as the controversial will be no good argument for using them. Encore work discussed by others [9:]. Dittrich, Carpenter and Karir use the Menlo report to present a thorough analysis of the ethics of the Carna bot- 6.3 Malware and exploitation net [49], from which they conclude that there is a “lack of Research conducted using hacking tools such as botnets or a common understanding of ethics in the computer security exploit kits is necessary to understand how malicious actors field”. use them and to provide countermeasures. However, the use 6.3.4 AT&T iPad users database brute force. In 4232 re- of such tools can be harmful, and sometimes researchers have searchers from discovered a web service run used them without considering legal or ethical issues. 
In this by AT&T that, when provided with the ICC-ID of a 5G section we cover three particular cases: the use of a botnet to iPad, would return the associated email address. They used scan the IPv6 address space, the exploitation of a discovered this to obtain the email addresses for 336 222 iPad users vulnerability to gather email addresses, and research using and passed this information to Gawker [328] as well as mak- of malware specimens. ing the vulnerability known to third parties. They did not contact AT&T to report the vulnerability. The subsequent 6.3.3 Carna scan. In 4234 an individual claimed FBI investigation [329] resulted in the authors being found to have carried out a complete scan of all IPv6 addresses on guilty of Computer misuse and one of them was sentenced all ports and made the dataset publicly available. However, to 63 months imprisonment [52]. While it was argued that the means they used to do this was a botnet of 642 222 this imprisonment was an overreaction [;:] the research was devices with default passwords [3:]. Creating a botnet of clearly both unethical and illegal. compromised computers to carry out research is illegal since it Finding vulnerabilities in third party systems without is Computer misuse. Is research that uses these data ethical? permission can be ethical, since this helps that party to CMU CERT wrote a blog post explaining how these data Identify harms and to avoid future attacks. However, in could be used and, while they pointed out there were some this case it was unethical, because: i) The authors collected ethical issues, they noted that: “As far as we can tell, this far more personal data than they needed to prove it to data set does not contain legally restricted information, like classified information, personally identifiable information, or 3In 4238 the Mirai botnet was also built by brute forcing Telnet trade secrets.” [;;]. CAIDA found that there were some passwords. 
As a security research community we did not successfully mitigate the risks that the demonstrated [;3]. We were problems with this data since some bot-devices were behind not able to contact Malécot and Inoue to discover if they found a HTTP proxies which meant that the results for port :2 scans solution to ethically using the IP addresses. IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al. be a real vulnerability, ii) The authors shared this data the research works using source code mentioned ethical or and the existence of the vulnerability (including the exploit legal issues, nor whether they have obtained permission from script) with third parties before reporting the problem to their corresponding REB. Calleja et al. shared a dataset with AT&T. Hence, the authors identified risks for iPad users, metrics from the source code, but not the sources themselves, but exploited the vulnerability and, given that they did not as Safeguards that allow for reproducibility without releasing contact AT&T, they failed to implement Safeguards. Indeed, the malware [39]. Additionally, stylometry analysis allows the research work is not in the Public interest, since they did code writers to be identified [38]. By identifying who has not minimise harm or maximise benefit, except to their own written some piece of code, it is possible to group malware notoriety (a lack of Justice). families, detect plagiarism, and attribute attacks. Thus, the release of source code (not just malware code) should be done 6.3.5 Malware source code. Malware source code has been with care, since it can be used to identify the authors. publicly released on multiple occasions. This code has then been used by malicious actors to attack computer systems or networks. Additionally, malware source code can be obtained 6.4 Password dumps from public databases such as vxHeaven or Contagio Dump. 
Password dumps are the archetypal leaked dataset – a list For example, the source code for was leaked in 4233 [67]. of passwords that has been been made public, normally by Since then, many variants of Zeus have been reported by illegal action; criminals regularly compromise databases (e.g. anti-virus vendors. Similarly the source code for Mirai botnet by SQL injection) and then publish the contents online [342]. was released in 4238 [82]. The release of malware source Since people include personal data in passwords, the password code often lowers the barrier to entry for using the malware: alone can be sensitive [326] and the lists may also include after Mirai’s source code was released, myriad Mirai based other personal data such as email addresses or names. botnets began operation. Release of source code is sometimes Research into password dumps is controversial but widely correlated with prosecution of its author, as in the example of practised [326], not least because they provide researchers Agobot in 4226 [65]. Additionally, the release of source code with ground truth data, that otherwise would be difficult might result in changed business models for malware authors to obtain [54]. A variety of dumps have been investigated. such as the Zeus botnet as-a-service [67]. The source code of One of the largest dumps, is the RockYou dump, others malware can also be obtained from its operational settings. include MySpace leaked in 4228, or Facebook leaked in 4232. Stone-Gross et al. [325] identified and obtained access to some These dumps and many others can be found online by using of the C&C servers for the Pushdo/ (used common search engines. mainly for spam delivery) by contacting the hosting providers Weir et al. use lists of compromised and publicly disclosed (i.e. the authors first performed Identification of stakeholders). passwords [343]. 
Stone-Gross et al. obtained sensitive data such as the statistics of infection, target email addresses, and the source code of the malware. Kotov and Massacci collected source code of exploit kits from a public repository as well as underground forums where code was leaked or released [58]. Their analysis showed how exploit kits evade anti-crawling techniques and how malware authors protect against code analysis. However, as the authors state, the fact that the code was leaked biased their analysis, for example due to the removal of obfuscation from the source code. Calleja et al. analysed 151 malware samples dating from 1975 to 2015. They show how the malware landscape has evolved using traditional software metrics on a dataset of malware code from various repositories, including vxHeaven, GitHub, malware-related magazines, and P2P networks. The authors do not share the collected source code, but only provide a dataset containing the metrics obtained from the malware pieces [17].

Weir et al. say that "while publicly available, these lists contain private data; therefore we treat all password lists as confidential" and that "due to the moral and legal issues with distributing real user information, we will only provide the lists to legitimate researchers who agree to abide by accepted ethical standards". In a later paper, the same authors note that it is a "mixed blessing" that research into passwords used for high value targets, such as bank accounts, will not be possible until relevant breaches occur and the data is made public [120].

Das et al. study trends in password reuse across different sites by analysing passwords obtained through an internal survey and several hundred thousand leaked passwords [24]. They state that the ethics of conducting research with data acquired illegally is under debate, but they justify their work by saying that: 1) these datasets were used in several previous studies, 2) they protected users' privacy by only working
with hashed email addresses, 3) they obtained approval from their REB to conduct the survey. Finally, the authors state that their results help system administrators and researchers understand how cross-site password attacks work.

Kelley et al. used two datasets of leaked passwords as well as passwords collected through an online survey [57]. The authors received approval from their REB for this survey, and they discuss the ethics of using leaked databases of passwords. They argue that, given these data were already public, using them for research does not increase harm to users, since no further connection with real identities is sought. Moreover, given that attackers can use these datasets to construct their dictionaries or cracking tools, system administrators can benefit from research and can prepare defences such as improved password policies. This view is also shared by Ur et al., who use three different password dumps to compare real-world cracking techniques with those proposed in the research literature [114].

Research using malware source code has substantial benefits, such as a better understanding of how malware works, so as to facilitate developing new detection and mitigation techniques. Publishing the results and explaining the tools and techniques used to analyse the malware makes this research of Public interest. Both possession and accidental disclosure of malware could be illegal. Researchers could use secure storage, enforce retention policies, and not publicly distribute the malware source code as Safeguards.

4.3 Leaked databases

While password dumps are a specific restricted form of data, in this section we consider more general leaking of entire databases that contain more detailed information such as messages or logs of activity. These databases are available in underground forums or from public repositories, sometimes for free.
In this section, we cover three cases: databases of distributed denial of service (DDoS) providers (booters), the Patreon database (a web site aimed at crowd-funding projects), and the databases of several underground forums.

4.3.1 Booter databases leak. Booters (sometimes called stressers) provide DDoS-as-a-Service. While their operators might sometimes claim that this is legal [46], this activity is almost always illegal (Computer misuse)², and it is also unethical. Several approaches have been used to understand this criminal ecosystem: interviewing operators after contacting them through their websites [46], measuring the attacks that they produce [110], and using their leaked databases and source code. Karami et al. analysed a database dump of the TwBooter service. Their Safeguards to make this research ethical were to not publish personally identifiable data, except when this was already publicly known [54].

Dürmuth et al. proposed OMEN, a cracking algorithm that outperforms state-of-the-art crackers [31]. The authors tested their algorithm on leaked databases from MySpace, Facebook and the website RockYou. The authors justify this by claiming that these datasets have been used in several previous studies, and they have been made public. Moreover, they claimed that these data have been treated carefully and they do not reveal actual information about the passwords.

Bonneau used leaked password databases to investigate the statistical properties of passwords and developed the α-guesswork metric for password strength [13]. He notes that "care was taken to ensure that no activities undertaken for research made any user data public which wasn't previously" and rejects "the appropriateness of ever collecting cleartext user passwords, with or without additional identifying information".
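The α-guesswork metric models an attacker who tries passwords in decreasing frequency order and stops once a fraction α of accounts has been compromised. A rough sketch of the underlying computation (a simplification of the full metric defined in [13]):

```python
from collections import Counter

def alpha_guesswork(passwords, alpha=0.5):
    """Expected number of guesses per account for an attacker who tries
    passwords in decreasing frequency order until a fraction `alpha` of
    accounts has been compromised; accounts that remain unguessed are
    charged the cut-off number of guesses."""
    freq = Counter(passwords)
    total = sum(freq.values())
    probabilities = sorted((count / total for count in freq.values()), reverse=True)
    cumulative = guesses = 0.0
    for rank, p in enumerate(probabilities, start=1):
        cumulative += p
        guesses += p * rank
        if cumulative >= alpha:
            return guesses + (1 - cumulative) * rank
    return guesses
```

Note that only frequency counts are needed, not the plaintext passwords themselves, which is one reason such statistics can be computed and published without re-releasing a dump.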
By this argument, leaking a password database, and making a leaked password database more available than it would otherwise be, are both unethical.

Researchers mostly use two arguments to justify the ethics of conducting research on password dumps. First, since the passwords are public, studying them might not increase harm and could help advance science. Moreover, since attackers may have access to these lists as well, the defences derived from analysis of these passwords may protect people, making these works of Public interest. Second, most authors state that they do not reveal personal information derived from the passwords, and some of them claim that they "treat these lists with the necessary precautions" [31] or that they "treat them as confidential" [121], which should include secure storage of the lists. All the works covered Identify harms and provide Safeguards. Some authors justify the ethics of their research on the basis that previous research was conducted with these dumps.

Karami et al. later analysed database dumps from Asylum and LizardStresser and scraped data from VDOS [55]. For the latter they obtained an REB exemption on the basis that these data did not contain any personally identifiable information and used publicly leaked data. In some jurisdictions (e.g. Germany [115]) IP addresses may be personally identifiable data, and the dumps likely contained email addresses which can be similarly identifiable. Santanna et al. analysed database dumps from 15 distinct booters and used Karami's procedures to justify it ethically [93]. Thomas et al. used database dumps and scraped data from booters to evaluate the coverage of their honeypot-based measurement of DDoS attacks; they argued that using this data was necessary as there was no other ground truth on attacks initiated by booters [110]. All these papers had some Identification of stakeholders, Identify harms and used Safeguards. These dumps can contain details of user accounts including names, email addresses,
password hashes and security questions; details of the backend and frontend servers used for attacks (including compromised hosts); logs of connections to the site including IP addresses and user agent strings; logs of attacks including target IP addresses, ports, domain names and the method used; tickets and messages sent between users and site owners; records of payments; details of pricing plans; and chat logs of site operators. Concretely, Thomas et al. used the attack logs [110]; Santanna et al. [93] and Karami et al. [55] used attack logs, payment records, pricing plans and counts of users. The password hashes could be used for password research, and the ticket databases for qualitative research into the attitudes of booter users and operators [46].

Some uses of leaked password databases are clearly not ethical. leakedsource.com was shut down and its operators arrested as a result of its use of leaked passwords [61]. It made password hashes (or even cracked passwords) available to anyone who was willing to pay for the access. This is in marked contrast to the ethical service haveibeenpwned.com, which never makes passwords available and doesn't expose any personal information without verification of control of the email address for that leaked information [44]. haveibeenpwned.com maximises benefit by enabling users to find out if their data has been leaked and notifying them if their data is leaked in the future, while minimising harm as it does not share private data with anyone or use it for any other purpose.

²The aim is usually illegal: attempting to stop someone's Internet connection from working; and the mechanism is often also illegal: using UDP amplification attacks that make unauthorised use of misconfigured UDP servers or using botnets of compromised machines.
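The Safeguard used in these booter studies, publishing no personally identifiable data, can be approximated by pseudonymising a dump's identifying fields before analysis. A hypothetical sketch (the record layout and key handling are our assumptions):

```python
import hashlib
import hmac

# Hypothetical per-study secret; stored separately and never published.
STUDY_KEY = b"rotate-this-per-study"

def pseudonymise(value, key=STUDY_KEY):
    """Map an identifier (IP address, email, username) to a keyed pseudonym:
    stable within the study, but not reversible without the key."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:16]

def sanitise_attack_record(record):
    """Keep the analytical fields of a booter attack-log row and
    pseudonymise the identifying ones."""
    return {
        "user": pseudonymise(record["user"]),
        "target": pseudonymise(record["target_ip"]),
        "port": record["port"],
        "method": record["method"],
        "duration_s": record["duration_s"],
    }
```

A keyed construction matters here: an unkeyed hash of an IP address can be reversed by exhaustively hashing the whole IPv4 space, so the secret key must be protected or destroyed at the end of the study.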

The conclusions drawn from research using underground forum data are undoubtedly valuable to law enforcement agencies, and crime prevention strategies that arise as a result can provide social benefits (e.g. preventing terrorist attacks, cyberattacks, or child sexual exploitation). However, there are other ethical considerations: the research should consider potential harms to the users of the forums, including prosecution or physical threat, and compare them with the benefits. None of the works mentioned use Safeguards to protect the data, which was originally illegally obtained [116]. Some authors have publicly re-released leaked datasets, even including private information [14, 86]. While the goal is to provide other researchers with datasets to conduct their experiments and for reproducibility, public release means that these datasets are potentially being shared with malicious actors.

4.3.2 Patreon crowd-funding. In October 2015 the Patreon crowd-funding website was hacked and the entire site made available. This included data on projects, private messages, source code, email addresses, and passwords. Poor and Davidson, who were conducting research based on incomplete data obtained by scraping the Patreon website, would have liked to use this data but concluded it would be unethical to do so [85]. These data were publicly available and the researchers hoped to serve the public through their research. However, it would be hard for them to distinguish between public and private data within the dump, and they might see private data unintentionally. Furthermore, using the dump might legitimize criminal activity, violate users' expectations of privacy, and the use of this data would be without their consent. Importantly, they also did not need to use this data to do their research, as scraping the Patreon website would also provide the data they needed, without the risk of accidentally including private data.
This is an example where the authors Identify harms and chose not to use data of illicit origin as a result, since they could not ensure that there would be no additional harm.

4.3.3 Underground forums. Underground forums focus on trading, learning and discussing illegal or criminal topics, such as hacking material, credit cards and drugs. These forums often also cover other non-illegal topics, such as video-games, financial help, politics, and sport.

Several forum databases have been leaked in the past. The w0rm.ws forum database was hacked by an anonymous group under the pseudonym "Peace of Mind", allegedly in response to some prior attack on the forum "Hell" [116]. All the forum content, personal data, as well as exploits and hacking material was made public. The database of carders.cc, a German forum focused on financial information trading (such as credit cards or bank accounts), was hacked and leaked [59].

4.4 Financial data leaks

Leaks of commercially sensitive data such as data on contracts and client relationships can reveal the hidden behaviour of companies and individuals in ways that would be difficult to achieve through external measurement. In this section we describe the leak of data from a law firm that revealed the financial behaviour of companies and individuals. Leaks of peering arrangements between ISPs together with traffic statistics likely present similar ethical issues, particularly if the data demonstrate illegal behaviour such as contravening network neutrality rules or the imposition of censorship.

Panama / Mossack Fonseca papers leak. In 2015 the internal database of the Panamanian law firm Mossack Fonseca was leaked to the German newspaper Süddeutsche Zeitung, which shared it with the International Consortium of Investigative Journalists (ICIJ) [107]. In 2016 the consortium and its partners released numerous reports based on analysis of the leaked data [42].
World leaders, celebrities, and companies were found to be using Mossack Fonseca's services for tax evasion and other criminal purposes. Some of these data were made publicly available, and Europol identified 3 500 criminals among the clients of the firm using this data [84]. There was a substantial Public interest to this release as it identified criminal and unethical activity by numerous individuals and companies. This revelation may result in money laundering and tax evasion becoming more difficult in future due to greater transparency, international cooperation, and fear of disclosure. However, not all the clients of Mossack Fonseca were engaged in illegal or unethical behaviour. Trautman surveys the consequences of the leak, describing many of the media reports and investigations that resulted from it [111].

In 2016, the database of the forum nulled.io was leaked, containing "536 064 user accounts with 800 593 user personal messages, 5 582 purchase records and 12 600 invoices which seem to include donation records" [90]. Motoyama et al. presented one of the first works analysing underground forums using leaked databases; however, they did not discuss ethics [76]. Yip et al. perform social network analysis using a database of three carding forums (Cardersmarket, Darkmarket and Shadowcrew) which included private messages of the participants [123]. This research showed that forums are a preferred way for criminals to communicate. They do not provide any discussion about the ethics of their research; however, they indicate that the marketplace actors are anonymous, so it is not possible to obtain Informed consent.

Murphy states that money laundering and tax evasion are illegal, while tax avoidance is unethical [77].
He says tax is the rightful property of the government, tax evasion is the theft of the government's rightful property, and tax avoidance is a "con trick". Promoting tax competition between countries is seeking to undermine national sovereignty and subvert the democratic (or other political) process, and hence unethical. The Guardian reports that 95% of Mossack Fonseca's work involves selling financial products to avoid taxes [42], and hence almost all of the work they do would be described by Murphy as unethical.

The Panama papers have been of interest to researchers, as well as journalists and law enforcement. Sharife uses the Panama papers in a historical analysis to understand the downfall of Banco Espírito Santo [97]. Walkowski uses the Panama papers to argue for reform in legislation to increase transparency and avoid tax competition [117]. O'Donovan et al. evaluated the impact of the Panama papers on firm values [79].

Analysing data from underground forums can provide valuable insights into how markets for stolen data work [47], how malware configurations are shared [45], new forms of criminal networks that arise in cyberspace [76], common goods and assets being traded [86], the economy of spam campaigns [103], or even provide Indicators of Compromise (IoC) and other useful information for threat intelligence [69, 92]. This makes it of Public interest.

Professors do not agree on the morality of using WikiLeaks material for teaching foreign policy studies, despite the cables being a valuable teaching tool [25]. Barnard "borrowed" classified documents from the "controversial WikiLeaks" to analyse covert relationships between the USA and South Africa during the Cold War [9]. The author claims that there were no ethical dilemmas since all the classified data used was open source and declassified. However, there is no evidence that any of Manning's WikiLeaks dump has been declassified.
values and found it reduced market capitalisation of 5;9 Talarico and Zamparini analysed the smuggling of tobacco in firms implicated in the leak by US$357 billion or 2.9%[9;]. Italy between 4226 and 4236 [327]. They used a confidential Omartian used the Panama papers to investigate investor document from the American Embassy in Italy, obtained response to changes in tax legislation in terms of offshore through WikiLeaks that said that the USA government had entity usage [:4]. He uses the introduction of legislation: blacklisted an Italian harbour because of collusion by harbour European Union Savings Directive (EUSD), Tax Information staff. Berger references several Manning cables to study the Exchange Agreements (TIEAs), the Foreign Account Tax international restrictions on the trade of weapons with North Compliance Act (FATCA), and the Common Reporting Stan- Korea [34]. For example, Berger mentioned that the United dard for information exchange (CRS) as natural experiments, Arab Emirates bought missiles from North Korea, and a evaluating the impact on offshore activity as revealed by the diplomatic cable where USA thanked Iran for its cooperation Panama papers. He finds that they do have a significant in blocking one cargo from North Korea. impact. Researchers have used this data to better understand the None of these papers explicitly discuss the ethics of using diplomatic position of the USA government in several inter- this data; they implicitly argue that they are in the public national conflicts. As Barnard points out, these documents interest. O’Donovan et al. [9;], and Oei and Ring [:2] Identify are controversial [;]. However, none of the studied works harms resulting from the data being released while Omartian discussed the ethics of their research. Some consider Manning provides evidence for tax laws that provide more Justice [:4]. as a traitor while others consider her as a freedom fighter. In McGregor et al. 
McGregor et al. use the successful collaborative investigation into the Panama Papers that the ICIJ conducted as a case study of secure collaboration [74]. They used a survey of the journalists who used ICIJ's systems and IRB-approved interviews of ICIJ's staff, but did not analyse the content of the Panama Papers. They detail many of the safeguards used to protect the data and the investigation.

4.5 Classified materials

In this section we cover two well-known leaks of classified information: Manning's WikiLeaks dump and Snowden's NSA data leak. In both cases, classified documents from the USA government detailing war decisions, espionage or diplomatic activities were publicly leaked.

4.5.1 Manning's WikiLeaks dump. In 2010 Chelsea (then Bradley) Manning leaked 700 000 documents and diplomatic cables from the USA's government systems to WikiLeaks [73]. This information was available to more than three million USA government employees [65]. In the research community, there is no consensus as to whether publishing results based on these documents is ethical, and in general, authors prefer not to confront the question.

4.5.2 Snowden's NSA data leak. In 2013 Edward Snowden, a contractor for the USA's National Security Agency (NSA), leaked large amounts of NSA and GCHQ data to journalists [8]. He was the latest in a long line of NSA leakers who have revealed various aspects of NSA programmes [122]. Landau provides an overview of the data that was revealed by Snowden, covering early leaks [64] and later leaks [63]. She criticises the ethics of some of the leaks since "the specifics on China had little to do with privacy and security of individuals", but is mostly critical of the NSA/GCHQ and the USA and UK governments, and she is mostly positive about Snowden's actions.

Several uses of the Snowden leaks make no mention of the ethical considerations of doing so, but are implicitly critical of the ethics of the activities exposed.
In a newspaper article, Schneier uses documents leaked by Snowden to explain how the NSA unconditionally exploits Tor users' browsers to install implants that exfiltrate data [95]. In a magazine article he argues that the metadata collection that was exposed represents ubiquitous surveillance of everyone [96]. RFC 7624 uses the Snowden leaks to inform a threat model for pervasive surveillance, in order to inform protocol design, such that the activities detailed in the Snowden leaks would be more difficult in future [10]. Lustgarten argued in 2015 that the American Psychological Association had not taken account of the Snowden leaks in its 'Ethics Code' and 'Record Keeping Guidelines', as the leaks showed that the NSA would have access to client data stored on cloud servers. Clinicians were responsible for protecting this client data, and had legal protections against enforced disclosure. This then raises ethical concerns for psychologists, as their clients would not have given informed consent for access to their data by the NSA [67].

It is likely that other parties, such as Russia and China, that have large intelligence agencies would already have had access to at least some of the information Manning leaked. Originally WikiLeaks shared the unredacted documents with carefully selected journalists. However, later journalists published what they believed to be a temporary encryption password, only to discover that a copy of the archive encrypted under that password had been shared on BitTorrent [7]. Hence, the full and unredacted cables were publicly released. The use of WikiLeaks diplomatic cables and documents in the academy is controversial.

In contrast to most classified material, the Vault 7 leak of CIA data [68] contained data that might be expected to be highly compartmentalised Top Secret (source code for weaponised zero-day exploits) but, due to the fact that if it were classified it could not be deployed on enemy systems, it was unclassified.
Additionally, as the USA government cannot own copyrights, the source code for this state-level malware was not even protected by copyright [5]. The lack of legal protections might make it easier for researchers to work with this data.

Others used the fact that the leaks had happened, and their impact, rather than the actual content of the leaks for their research. Preibusch used Snowden's revelations to conduct an experiment on the privacy behaviour of people after a major privacy incident; he found little change in user behaviour [87].

There has been substantial discussion of whether Snowden's actions were legal and ethical and whether the NSA's activities were legal or ethical. Scheuerman argues that Snowden's actions were ethical civil disobedience and serve as an example of correct behaviour [94]. Kadidal describes the impact on civil liberties of mass surveillance based on what Snowden (and others) revealed [53]. Walsh and Miller provide an ethical and policy analysis of intelligence agency activity on the basis of Snowden's revealing what current practice was [118]. Lucas discusses the application of the principle of informed consent to mass surveillance as revealed by Snowden, suggesting that revealing the outline of what kind of activity is being conducted to the public is necessary for the public to consent to it [66].

5 ANALYSIS

In this section we analyse the case studies with respect to the ethical and legal considerations described in Sections 2.1 and 3. We first discuss common justifications made by authors to justify the use (or not) of these data. The main goal is to understand how the authors approached the legal and ethical issues as well as the justifications, safeguards, harms, and benefits. A summary of this analysis is shown in Table 1. The acronyms used for safeguards, harms and benefits in the table are given in brackets below.
Barnett argues that the NSA's activities revealed in the Snowden leaks were unconstitutional under the Fourth Amendment [11]. Inkster, a former British Secret Intelligence Service Director, provides a counter-narrative, claiming that the exposed activities were not illegal, but rather entirely proper; in particular, he states that collecting data on everyone is not mass surveillance if this data is only processed by computer programs and not read by humans [51]. He implicitly argues that the newspapers that published the leaked data acted unethically in doing so, and Snowden's actions are clearly depicted as unethical and illegal.

Since the leaked data was classified, and much of it remains classified despite being publicly available, the use of it for research or in court cases may have unexpected difficulties.

5.1 Justifications

We have observed common justifications made by the researchers regarding ethical issues in the case studies, which are summarized in Table 1. Here, we describe them and provide comments in italics.

Not the first. Previous research using these data was published and peer-reviewed, and so our work must be ethical. This is a poor argument: not all published work is ethical under current norms, and while the work that was published may have been ethical, if your work does something different with these data then that requires its own justification.

Public data. Since these data are publicly available, anything we do with these data is ethical. The ethics of the work must still be considered and in some cases REB review may still be required [32]. Researchers may develop or apply new techniques to public data that, for example, deanonymise these data, and this may cause harm [4].
No additional harm. Any harms that might arise have already occurred, and therefore, as our work produces benefits and no (or negligible) additional harm, it is ethical. For there to be no additional harms, the research should not identify any natural persons, and data may need to be stored and managed securely. In some cases any use of the data of illicit origin is considered additional harm; for example, with images of child abuse, every viewing is considered additional abuse of the victim.

Fight malicious use. These data are already used by malicious actors, and so we also need to use these data to defend against them. If researchers can use the same data to prevent or reduce harm caused by malicious actors, without creating greater harm by doing so, then it may be ethical to do so.

Necessary data. This research cannot be conducted without using this data. This might be a good justification if there is sufficient benefit to the work (Public interest) and there is no additional harm.

The use of leaked classified data can carry particular risks for researchers. For example, in 2015 Barton Gellman gave a talk at Purdue University that included some classified NSA slides. Purdue University had a facility security clearance to perform classified USA government research, and this incident was treated as a classified information spillage. As a result, the video recording of the talk was destroyed [36]. There is thus an additional risk for researchers at institutions with facility security clearances: if they work with leaked classified data then they may find that all their resulting work is destroyed by facility authorities. In 2017 the UK government was considering making it an offence to obtain sensitive information, and if that became law then any researcher using leaked classified data could be imprisoned, possibly for up to 14 years [34]. In the USA, court cases have been thrown out because the evidence that the government had the supplicants under surveillance was classified (even though the supplicants had that evidence) [53].
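Where researchers do retain leaked sensitive material, disciplined storage limits the additional harm a mistake can cause. A minimal sketch using only the standard library (owner-only file permissions plus an integrity digest; a real deployment would layer encryption at rest and access logging on top):

```python
import hashlib
import os
import stat

def store_securely(data: bytes, path: str) -> str:
    """Write sensitive material with owner-only permissions and return a
    digest so later tampering or corruption can be detected."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC,
                 stat.S_IRUSR | stat.S_IWUSR)  # mode 0600
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return hashlib.sha256(data).hexdigest()

def verify(path: str, digest: str) -> bool:
    """Check that the stored file still matches the recorded digest."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == digest
```

Recording digests also supports retention policies: when the agreed retention period ends, the files can be deleted and the digests kept as evidence of what was held.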

5.2 Safeguards

When dealing with leaked data, holders must take care as to how it is processed and stored, so as to avoid further disclosure of sensitive information. Here we analyse the actions taken by the researchers to maintain the confidentiality, integrity and privacy of these data and the stakeholders. Secure storage (SS): the integrity and confidentiality of these data are protected, e.g. by means of encryption and access control, to avoid accidental leakage. Privacy (P): no deanonymisation is attempted and no identities are revealed. Controlled Sharing (CS): researchers publish only partial / anonymised data, provide it under legal agreements that prevent harms, or do not make these data publicly available. This includes approaches such as letting researchers visit the institution holding the data to analyse it, or the holding institution performing analysis on behalf of other researchers, such as by running their code.

5.4 Benefits

In this section, we enumerate academic and social benefits particular to research using data of illicit origin. Reproducibility (R): the data allows the comparison of different algorithms or tools. This is the great benefit of data sharing, but Controlled Sharing is required when these data contain sensitive information. Uniqueness (U): data is either unique (cannot be obtained through other means) or historical (can no longer be obtained by other means), so similar measurements on the same topic are hard or even impossible to attain. This only becomes a benefit if the data is also useful. Defence Mechanisms (DM): data can be used to study the underground economy, new forms of cybercrime or new attack techniques. This allows new defences to be designed, such as anti-malware tools or efficient password policies. Anthropology and Transparency (AT): data contains ground truth on the behaviour of human beings, which other methods could only obtain in a filtered or biased way.
For example, data can reveal real human behaviour when creating passwords, without the reporting bias of surveys or experiments with human participants. Additionally, data can provide transparency through information that aids understanding of government surveillance activities, external relationships, or of company behaviour. The additional transparency into state or corporate actors may have greater benefits than if the data concerned individuals, as it may have additional public benefit by providing checks and balances on power.

5.5 Discussion

We observe a wide variation in the ethical issues mentioned by the authors and their justifications for using these data, even when they are using the same data. This is clear from studying Table 1. Two works stated that they were exempt from REB approval, two received REB approval, and 24 did not mention REBs. The reasons given for exemption were: no human subjects or ethical concerns [110]; no personally identifiable information (despite email addresses and IPs) and public data [55].

5.3 Harms

Conducting research using data containing sensitive information entails risks dependent on the consequences of the data gathering, the research itself, or any further leakage of these data maintained by the researchers. In general only harms to natural persons (rather than corporate persons), or to the environment, are considered during ethical review. Illicit measurement (I): research obtained these data by means of illicit activities such as hacking or paying the offenders, which can lead to researchers being prosecuted. Potential Abuse (PA): research results from these data can be used by malicious actors to cause additional harm, for example by means of designing evasive malware or updating password-cracking policies. De-Anonymization (DA): research on these data can be used to de-anonymise or re-identify people or networks. Also, identification of groups of individuals may raise ethical concerns such as discrimination or violence towards identified groups [35].
Sensitive Information (SI) identifiable information (despite email addresses and IPs) These data contains sensitive and private information, which and public data [77]. Both of these works used Safeguards to can be used to harm natural persons. For example, if the mitigate potential Harms and have clear ethical justifications. user password from one service is leaked, their credentials to These exemptions are all based on the absence of direct other services can be compromised due to password reuse [46]. human subjects. However, in each case they were measuring Researcher Harm (RH) The research can lead to the re- human behaviour and if they had tried to identify individuals searchers being prosecuted by law enforcement, since these they might have been successful. The absence of human data may include illegal material. Researchers could be subjects appears an artificial distinction in these cases, as threatened by criminals, e.g. in underground forums [94], there were human participants. Both of the papers that or by state or industry actors that dislike the work. There received REB approval [79, 46] obtained it not because of may also be a risk of emotional trauma to researchers if their usage of data of illicit origin, but because they also they come across distressing content, such as pornography conducted surveys or other human subject research. Not or violence, during the work. Behavioural Change (BC) having REB approval solely on the basis that data is public The research can change the behaviour of the stakeholders is contrary to the opinions of experts [54], since this data of these data, which may have negative consequences. For may contain private information. example, a market vendor can provide fake information if Explicit ethics sections were included in 34 of the 4: pa- she knows that she is being measured [323]. Alternatively pers. 
However, since we specifically selected papers for this the research may encourage future collection or use of data table because they talked about ethics, this is unlikely to of illicit origin. be representative. Nonetheless, it does show that a high IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al.

Table 1: Summary of the legal/ethical issues and the justifications made by the authors for each paper. Legal issues are defined in §3, ethical issues in §2.1 and justifications in §5.1. The table records, per paper: category; dataset; reference; year; the applicable legal issues (computer misuse, data privacy, indecent images, copyright, terrorism, national security), marked • where applicable even if not discussed; whether the paper has an ethics section; REB approval (∅ means not applicable, E means exempt); safeguards; harms; benefits; and the justifications given (public data, not the first, identify harms, no additional harm, public interest, necessary data, fight malicious use, identification of stakeholders), with ✓ where an issue was discussed or a justification used and ✗ where it was not. G means that the authors decided that the work could not be justified and did not use the dataset. Safeguards: SS Secure Storage, P Privacy, CS Controlled Sharing. Harms: I Illicit measurement, PA Potential Abuse, DA De-Anonymization, SI Sensitive Information, RH Researcher Harm, BC Behavioural Change. Benefits: R Reproducibility, U Uniqueness, DM Defence Mechanisms, AT Anthropology and Transparency.
Rows (dataset, reference, year; safeguards; harms; benefits):
AT&T database [106], '10: harms I, PA, SI, RH.
Malware & exploitation: Pushdo/Cutwail botnet [103], '11 (benefits R, U, DM); 30 exploit kits [58], '13 (DM, AT); Carna scan [18]a, '13; [70], '13 (P, CS; PA); [62]a, '14; [27]b, '14 (REB ∅; RH, BC); 151 malware pieces [17], '16 (CS; R, U, AT).
Password dumps:e MS + 2 others [121], '09 (SS, P, CS; SI, BC; R, DM); MS, RY + 4 others [57], '12 (P; SI; DM); MS, YV, FB + 7 others [24], '14 (P; SI; DM, AT); MS, RY, YV [114], '15 (P; SI; DM); MS, RY, FB [31], '15 (SS, P; SI; DM).
Leaked databases: 6 underground forums [76], '11 (U, DM, AT); 3 carding forums [123], '13 (DM, AT); TwBooter [54], '13 (P; SI); TwBooter + 14 others [93], '13 (P; SI); Asylum, Lizard, Vdos [55], '15 (REB E; P; SI); Patreon [85]c, '16 (G; SI, RH; U, AT); Vdos, CMDBooter [110], '17 (REB E; P, CS; SI, BC; U, AT); 4 underground forums [86], '17 (R, DM, AT).
Classified materials: Manning/Wikileaks [12], '15; [9]a, '16; [105], '17; Snowden NSA leaks [64], '13; [95]a, '13; [10]d, '15; [118], '16.
Financial data: Mossack Fonseca database [82], '16 (DM); [79], '16 (BC).
a These works were not peer reviewed. b This paper analysed the ethics of the Carna scan and its use, but did not use it. c The authors did not use the leaked database. d Here the argument is that the NSA is the malicious actor. e MS: MySpace, RY: RockYou, FB: Facebook, YV: Yahoo Voices.
proportion of papers using data of illicit origin do already have ethics sections. We do not have enough information to show any trend in this behaviour, and because we would expect this behaviour to be field dependent, we would need a large representative sample from each field to be able to show any trend.
Discussion of safeguards, harms and benefits in the papers is highly variable. We included those that were implicitly or explicitly discussed in the papers. However, we are aware that many more would apply even though they were not mentioned explicitly. In general, we observe that researchers appear to be more reluctant to express the potential harms resulting from their work than its benefits.
Privacy preservation is one of the safeguards applied most frequently, since it is easy to do (e.g. by refusing to attempt to de-anonymize a data set or refusing to reveal private information). Only four of the papers discussed controlled sharing (CS). To provide the maximum benefit, papers need to be reproducible, and so while open data is often not appropriate for data of illicit origin, controlled sharing of data with researchers is important. It is possible the authors would be willing to share if they were asked, but mentioning this possibility increases the likelihood it will happen. There is a cost to controlled sharing, as it places a burden on the authors of the work to put a robust legal framework in place, with an ongoing cost of vetting requests to access the data. However, this cost can be delegated, for example using initiatives such as IMPACT [49] in the USA or the Cambridge Cybercrime Centre [20] in the UK.

6 CONCLUSION
It is common to see research using data of illicit origin, such as leaked databases or classified documents. The use of these data provides researchers with the opportunity to test their hypotheses against ground truth data. For example, it improves our understanding of the evolution of malware or of how users re-use passwords. This in turn helps to improve cyber defences or password policies. However, these data were originally collected by illicit means, and the research community must be aware of the ethical considerations of using such data.
By analysing current advice and previous work on ethics, we have drawn out a set of ethical issues that should be considered and reported when publishing results based on data of illicit origin. However, we have shown that there is a lack of consistency in the consideration of ethical issues, which results in cases where insufficient safeguards are used to prevent harm. Few authors consulted their institution's REB, and in most cases they were exempted on the basis that there were no human subjects involved in the research. This narrow focus on whether the research involves "human subjects", rather than a risk-based analysis of the potential harms to human participants, is unhelpful. If research has the potential to harm humans, even in the absence of direct human subjects, REB approval should be sought. REBs may need to adapt to understand the possible effect on humans of research in Information Communication Technology Research (ICTR), and to provide timely and well informed responses.
In any case, papers using data of illicit origin should always have an ethics section, explaining how these data were obtained and how they have been protected, and analysing the harms, benefits, and need for using such data. Conferences and journals should explicitly require such ethics sections and highlight the importance of considering ethics in their Calls for Papers. Researchers using data that has been shared with them under acceptable usage policies should, as Allman and Paxson suggest, cite the acceptable usage policy they are operating under [4]. Data providers should make their acceptable usage policies publicly available so that they can be cited. Additionally, current efforts to advise on ethics, such as the Menlo Report [28], should be updated and extended to provide a more comprehensive coverage of ICTR and legal considerations to guide researchers aiming to use datasets of illicit origin. Ethics is an issue that is receiving greater interest within ICTR and standards are improving. Hence, we are hopeful that in the future better information on current practice, and better guidance, will be available.

ACKNOWLEDGMENTS
Daniel R. Thomas is supported by a grant from ThreatSTOP Inc. All authors are supported by the EPSRC [grant number EP/M020320/1]. The opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect those of any of the funders. Thanks to José Jair Santanna, Alan Blackwell, Justin Schlosberg, Brian Trammell (our shepherd) and to the anonymous reviewers for helpful comments on this paper.

REFERENCES
[1] 1984. 18 U.S. Code § 1030 – Fraud and related activity in connection with computers. https://www.law.cornell.edu/uscode/text/18/1030.
[2] 2003. 18 U.S. Code § 1466A – Obscene visual representations of the sexual abuse of children. (Apr. 30, 2003). https://www.law.cornell.edu/uscode/text/18/1466A.
[3] Charu C. Aggarwal. 2005. On k-anonymity and the curse of dimensionality. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB '05). VLDB Endowment, Trondheim, Norway, 901–909. isbn: 1-59593-154-6.
[4] Mark Allman and Vern Paxson. 2007. Issues and etiquette concerning use of shared measurement data. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement. ACM, 135–140.
[5] Julian Assange. 2017. Vault 7: CIA hacking tools revealed. WikiLeaks. (Mar. 2017). Retrieved Mar. 7, 2017 from https://wikileaks.org/ciav7p1/, https://archive.fo/iK4kO.
[6] Associated Press in Manila. 2016. Kill drug dealers and I'll give you a medal, says Philippines president. theguardian. Retrieved Sept. 28, 2017 from https://www.theguardian.com/world/2016/jun/05/kill-drug-dealers-medal-philippines-president-rodrigo-duterte, https://perma.cc/5KTA-VR5R.

[7] James Ball. 2011. Unredacted US embassy cables available online after WikiLeaks breach. theguardian. (Sept. 2011). Retrieved Feb. 24, 2017 from https://www.theguardian.com/world/2011/sep/01/unredacted-us-embassy-cables-online, https://archive.fo/ozPM7.
[8] James Ball, Julian Borger, and Glenn Greenwald. 2013. Revealed: How US and UK spy agencies defeat Internet privacy and security. theguardian. (Sept. 2013). Retrieved Feb. 23, 2017 from https://www.theguardian.com/world/2013/sep/05/nsa-gchq-encryption-codes-security, https://archive.fo/960QN.
[9] Tjaart Barnard. 2016. A cold relationship: United States foreign policy towards South Africa, 1960–1990. Ph.D. Dissertation. Stellenbosch University.
[10] Richard Barnes, Bruce Schneier, Cullen Jennings, Ted Hardie, Brian Trammell, Christian Huitema, and Daniel Borkmann. 2015. RFC 7624: Confidentiality in the face of pervasive surveillance: A threat model and problem statement. Informational RFC. Internet Architecture Board (IAB), (Aug. 2015), 1–24.
[11] Randy Barnett. 2015. Why the NSA data seizures are unconstitutional. Harvard Journal of Law & Public Policy, 38, 1, 3–20.
[12] Andrea Berger. 2015. North Korea in the global arms market. Whitehall Papers, 84, 1, 12–34. Taylor & Francis.
[13] Joseph Bonneau. 2012. Guessing human-chosen secrets. Tech. rep. 819. University of Cambridge, Computer Laboratory, (May 2012). http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-819.html.
[14] Gwern Branwen et al. 2015. Dark net market archives, 2011–2015. (July 2015). Retrieved Mar. 14, 2017 from http://www.gwern.net/DNM%20archives, https://perma.cc/2Z76-JJ6W.
[15] Aaron J. Burstein. 2008. Conducting cybersecurity research legally and ethically. In Usenix Workshop on Large-Scale Exploits and Emergent Threats (LEET). USENIX, 881–888.
[16] Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss, Fabian Yamaguchi, and Rachel Greenstadt. 2015. De-anonymizing programmers via code stylometry. In 24th USENIX Security Symposium (USENIX Security), Washington, DC.
[17] Alejandro Calleja, Juan Tapiador, and Juan Caballero. 2016. A look into 30 years of malware development from a software metrics perspective. In International Symposium on Research in Attacks, Intrusions, and Defenses (RAID). Springer, 325–345.
[18] 2013. Carna Botnet Scans. CAIDA website. (May 28, 2013). Retrieved Feb. 17, 2017 from https://www.caida.org/research/security/carna/, https://archive.fo/uTygt.
[19] Jonathan Cave. 2016. The ethics of data and of data science: an economist's perspective. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 374, 2083. The Royal Society. issn: 1364-503X. doi: 10.1098/rsta.2016.0117.
[20] Richard Clayton, Julia Powles, and Cambridge University Legal. 2016. Cambridge Cybercrime Centre: Legal framework. Retrieved Sept. 28, 2017 from https://www.cambridgecybercrime.uk/data.html, https://perma.cc/6KM5-K4Q3.
[21] 1990. Computer Misuse Act. http://www.legislation.gov.uk/ukpga/1990/18/contents.
[22] Council of the European Union. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). European Union, (2016). http://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679&from=EN.
[23] British Society of Criminology. 2015. Statement of ethics. Retrieved Sept. 28, 2017 from http://www.britsoccrim.org/ethics/, https://perma.cc/K3MY-UG5U.
[24] Anupam Das, Joseph Bonneau, Matthew Caesar, Nikita Borisov, and XiaoFeng Wang. 2014. The tangled web of password reuse. In Network and Distributed System Security Symposium (NDSS), 23–26. doi: 10.14722/ndss.2014.23357.
[25] Jessica Deahl. 2011. Professors differ on ethics of using WikiLeaks cables. NPR. (Feb. 7, 2011). Retrieved Mar. 29, 2017 from http://www.npr.org/2011/02/07/133334302/professors-differ-on-ethics-of-using-wikileaks-cables, https://archive.fo/Q6782.
[26] David Dittrich, Michael Bailey, and Erin Kenneally. 2013. Applying ethical principles to information and communication technology research: A companion to the Menlo Report. Tech. rep. U.S. Department of Homeland Security, (Oct. 2013). doi: 10.2139/ssrn.2342036.
[27] David Dittrich, Katherine Carpenter, and Manish Karir. 2014. An ethical examination of the Internet Census 2012 dataset: A Menlo report case study. In IEEE International Symposium on Ethics in Science, Technology and Engineering (ETHICS). IEEE, (May 2014). doi: 10.1109/ETHICS.2014.6893416.
[28] David Dittrich and Erin Kenneally. 2012. The Menlo Report: Ethical principles guiding information and communication technology research. Tech. rep. U.S. Department of Homeland Security, (Aug. 2012). doi: 10.2139/ssrn.2445102.
[29] David Dittrich, Felix Leder, and Tillmann Werner. 2010. A case study in ethical decision making regarding remote mitigation of botnets. International Conference on Financial Cryptography and Data Security (FC), LNCS 6054, 216–230. doi: 10.1007/978-3-642-14992-4_20.
[30] John E. Dunn. 2013. AT&T hacker 'weev' sentenced to 41 months for iPad leak. techworld. (Mar. 2013). Retrieved Feb. 17, 2017 from http://www.techworld.com/news/security/att-hacker-weev-sentenced-41-months-for-ipad-leak-3435985/, https://archive.fo/NCASE.
[31] Markus Dürmuth, Fabian Angelstorf, Claude Castelluccia, Daniele Perito, and Abdelberi Chaabane. 2015. OMEN: Faster password guessing using an ordered Markov enumerator. In International Symposium on Engineering Secure Software and Systems. Springer, 119–132.

[32] Serge Egelman, Joseph Bonneau, Sonia Chiasson, David Dittrich, and Stuart Schechter. 2012. It's not stealing if you need it: A panel on the ethics of performing research using public data of illicit origin. In International Conference on Financial Cryptography and Data Security (FC). Vol. LNCS 7398, 124–132. doi: 10.1007/978-3-642-34638-5_11.
[33] Charles Ess and the AoIR Ethics Working Committee. 2002. Ethical decision-making and Internet research. Association of Internet Researchers (AoIR). (Nov. 2002). Retrieved Sept. 28, 2017 from http://aoir.org/reports/ethics.pdf, https://perma.cc/69UY-ETC7.
[34] Rob Evans, Ian Cobain, and Nicola Slawson. 2017. Government accused of 'full-frontal attack' on whistleblowers. theguardian. (Feb. 2017). Retrieved Sept. 28, 2017 from https://www.theguardian.com/uk-news/2017/feb/12/uk-government-accused-full-frontal-attack-prison-whistleblowers-media-journalists, https://archive.fo/ZwdDK.
[35] Luciano Floridi and Mariarosaria Taddeo. 2016. What is data ethics? Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 374, 2083. The Royal Society. issn: 1364-503X. doi: 10.1098/rsta.2016.0360.
[36] Barton Gellman. 2015. I showed leaked NSA slides at Purdue, so feds demanded the video be destroyed. arstechnica. (Oct. 2015). Retrieved Feb. 9, 2017 from https://arstechnica.co.uk/tech-policy/2015/10/i-showed-leaked-nsa-slides-at-purdue-so-feds-demanded-the-video-be-destroyed/, https://archive.fo/Fkc87.
[37] 2013. German criminal code Section 184b: Distribution, acquisition and possession of child pornography. (Oct. 10, 2013). https://www.gesetze-im-internet.de/englisch_stgb/englisch_stgb.html#p1659.
[38] 2013. German criminal code Section 202a: Data espionage. (Oct. 10, 2013). https://www.gesetze-im-internet.de/englisch_stgb/englisch_stgb.html#p1749.
[39] 2013. German criminal code Section 263a: Computer fraud. (Oct. 10, 2013). https://www.gesetze-im-internet.de/englisch_stgb/englisch_stgb.html#p2187.
[40] 2013. German criminal code Section 303a: Data tampering. (Oct. 10, 2013). https://www.gesetze-im-internet.de/englisch_stgb/englisch_stgb.html#p2479.
[41] 2013. German criminal code Section 303b: Computer sabotage. (Oct. 10, 2013). https://www.gesetze-im-internet.de/englisch_stgb/englisch_stgb.html#p2479.
[42] Guardian Reporters. 2016. Panama Papers: a special investigation. theguardian. (Apr. 2016). Retrieved Feb. 23, 2017 from https://www.theguardian.com/news/series/panama-papers, https://archive.fo/g13Sn.
[43] Thorsten Holz. 2005. A short visit to the bot zoo [malicious bots software]. IEEE Security & Privacy, 3, 3, 76–79. IEEE. doi: 10.1109/MSP.2005.58.
[44] Troy Hunt. 2017. Thoughts on the LeakedSource take down. troyhunt.com. (Jan. 2017). Retrieved Mar. 1, 2017 from https://www.troyhunt.com/thoughts-on-the-leakedsource-take-down/, https://archive.fo/xSMBe.
[45] Alice Hutchings and Richard Clayton. 2017. Configuring Zeus: A case study of online crime target selection and knowledge transmission. In APWG Symposium on Electronic Crime Research (eCrime). IEEE, Scottsdale, AZ, USA. doi: 10.1109/ECRIME.2017.7945052.
[46] Alice Hutchings and Richard Clayton. 2016. Exploring the provision of online booter services. Deviant Behaviour, 37, 10, 1163–1178. Taylor & Francis. issn: 0163-9625. doi: 10.1080/01639625.2016.1169829.
[47] Alice Hutchings and Thomas J. Holt. 2015. A crime script analysis of the online stolen data market. British Journal of Criminology, 55, 3, 596–614. issn: 1464-3529. doi: 10.1093/bjc/azu106.
[48] M. Ilešič, A. Prechal, A. Rosas, C. Toader, and E. Jarašiūnas. 2016. Breyer v Germany: Judgement of the court (Second Chamber). InfoCuria – Case-law of the Court of Justice. (Oct. 2016). Retrieved Feb. 21, 2017 from http://curia.europa.eu/juris/document/document.jsf?docid=186448&doclang=EN&cid=1095711, https://archive.fo/kl3m7.
[49] 2017. IMPACT cyber trust. Retrieved May 15, 2017 from https://www.impactcybertrust.org/, https://perma.cc/Y7FG-9QU7.
[50] Information Commissioner's Office (ICO). 2017. Overview of the General Data Protection Regulation (GDPR). (2017). Retrieved Aug. 11, 2017 from https://ico.org.uk/for-organisations/data-protection-reform/overview-of-the-gdpr/, https://perma.cc/LZ9W-UQNR.
[51] Nigel Inkster. 2014. The Snowden revelations: Myths and misapprehensions. Survival, 56, 1, 51–60. issn: 0039-6338. doi: 10.1080/00396338.2014.882151.
[52] Mark Israel. 2004. Strictly confidential? Integrity and the disclosure of criminological and socio-legal research. British Journal of Criminology, 44, 5, 715–740. doi: 10.1093/bjc/azh033.
[53] Shayana Kadidal. 2014. NSA surveillance: The implications for civil liberties. I/S: A Journal of Law and Policy for the Information Society, 10, 2, 433–480. Heinonline.
[54] Mohammad Karami and Damon McCoy. 2013. Rent to pwn: Analyzing commodity booter DDoS services. USENIX ;login:, 38, 6, 20–23. USENIX.
[55] Mohammad Karami, Youngsam Park, and Damon McCoy. 2016. Stress testing the booters: Understanding and undermining the business of DDoS services. In Proceedings of the 25th International Conference on World Wide Web (WWW). ACM, 1033–1043. doi: 10.1145/2872427.2883004.
[56] Brian C. Keegan and J. Nathan Matias. 2015. Actually, it's about ethics in computational social science: A multi-party risk-benefit framework for online community research. In CSCW workshop on human centered data science.
[57] Patrick Gage Kelley, Saranga Komanduri, Michelle L. Mazurek, Richard Shay, Timothy Vidas, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, and Julio Lopez. 2012. Guess again (and again and again): Measuring password strength by simulating password-cracking algorithms. In IEEE Symposium on Security and Privacy (SP). IEEE, San Francisco, CA, USA, 523–537. doi: 10.1109/SP.2012.38.
[58] Vadim Kotov and Fabio Massacci. 2013. Anatomy of exploit kits. In International Symposium on Engineering Secure Software and Systems (ESSoS). Springer, 181–196. doi: 10.1007/978-3-642-36563-8_13.

[59] Brian Krebs. 2010. Fraud bazaar Carders.cc hacked. KrebsonSecurity. (May 2010). Retrieved Apr. 10, 2017 from https://krebsonsecurity.com/2010/05/fraud-bazaar-carders-cc-hacked/, https://perma.cc/P2JT-HJGJ.
[60] Brian Krebs. 2016. Source code for IoT botnet 'Mirai' released. KrebsonSecurity. (Oct. 2016). Retrieved Feb. 23, 2017 from https://krebsonsecurity.com/2016/10/source-code-for-iot-botnet-mirai-released/, https://archive.fo/cNPLI.
[61] Brian Krebs. 2017. Who ran Leakedsource.com? KrebsonSecurity. (Feb. 2017). Retrieved Feb. 17, 2017 from https://krebsonsecurity.com/2017/02/who-ran-leakedsource-com/, https://archive.fo/9ZHIQ.
[62] Thomas Krenc, Oliver Hohlfeld, and Anja Feldmann. 2014. An Internet census taken by an illegal botnet: A qualitative assessment of published measurements. ACM SIGCOMM Computer Communication Review, 44, 3, 103–111. ACM. doi: 10.1145/2656877.2656893.
[63] Susan Landau. 2014. Highlights from making sense of Snowden, Part II: What's significant in the NSA revelations. IEEE Security & Privacy, 12, 1, 62–64. IEEE. doi: 10.1109/MSP.2013.161.
[64] Susan Landau. 2013. Making sense from Snowden: What's significant in the NSA surveillance revelations. IEEE Security & Privacy, 11, 4, 54–63. IEEE. doi: 10.1109/MSP.2013.90.
[65] David Leigh. 2010. US embassy cables leak sparks global diplomatic crisis. theguardian. (Nov. 2010). Retrieved Feb. 23, 2017 from https://www.theguardian.com/world/2010/nov/28/us-embassy-cable-leak-diplomacy-crisis, https://archive.fo/TQz0I.
[66] George R. Lucas. 2014. NSA management directive #424: Secrecy and privacy in the aftermath of Edward Snowden. Ethics & International Affairs, 28, 1, 29–38. doi: 10.1017/S0892679413000488.
[67] Samuel D. Lustgarten. 2015. Emerging ethical threats to client privacy in cloud communication and data storage. Professional Psychology: Research and Practice, 46, 3, 154. American Psychological Association.
[68] Ewen MacAskill. 2017. WikiLeaks publishes 'biggest ever leak of secret CIA documents'. theguardian. (Mar. 2017). Retrieved Mar. 7, 2017 from https://www.theguardian.com/media/2017/mar/07/wikileaks-publishes-biggest-ever-leak-of-secret-cia-documents-hacking-surveillance, https://archive.fo/Ce80x.
[69] Mitch MacDonald, Richard Frank, Joseph Mei, and Bryan Monk. 2015. Identifying digital threats in a hacker web forum. In Advances in Social Networks Analysis and Mining (ASONAM), 2015 IEEE/ACM International Conference on. IEEE, 926–933.
[70] Erwan Le Malécot and Daisuke Inoue. 2014. The Carna Botnet through the lens of a network telescope. Foundations and Practice of Security, LNCS 8352, 426–441. Springer. doi: 10.1007/978-3-319-05302-8_26.
[71] Annette Markham and Elizabeth Buchanan. 2012. Ethical decision-making and Internet research: Recommendations from the AoIR Ethics Working Committee (Version 2.0). AOIR, (Aug. 2012), 1–19. http://www.aoir.org/documents/ethics-guide.
[72] James Martin and Nicolas Christin. 2016. Ethics in cryptomarket research. International Journal of Drug Policy, 35, 84–91. Elsevier. doi: 10.1016/j.drugpo.2016.05.006.
[73] Ciara McCarthy. 2017. Chelsea Manning thanks Barack Obama for 'giving me a chance'. theguardian. (Jan. 2017). Retrieved Feb. 24, 2017 from https://www.theguardian.com/us-news/2017/jan/19/chelsea-manning-thanks-barack-obama-commuted-sentence, https://archive.fo/jFtmJ.
[74] Susan E. McGregor, Elizabeth Anne Watkins, Nasrullah Al-Ameen, Kelly Caine, and Franziska Roesner. 2017. When the weakest link is strong: Secure collaboration in the case of the Panama Papers. In 26th USENIX Security Symposium (USENIX Security). USENIX, Vancouver, BC, Canada, (Aug. 2017), 505–522. isbn: 9781931971409.
[75] Tyler Moore and Richard Clayton. 2012. Ethical dilemmas in take-down research. Financial Cryptography and Data Security (FC 2011), LNCS 7126, 154–168. doi: 10.1007/978-3-642-29889-9_14.
[76] Marti Motoyama, Damon McCoy, Kirill Levchenko, Stefan Savage, and Geoffrey M. Voelker. 2011. An analysis of underground forums. In Proceedings of the ACM SIGCOMM Internet measurement conference (IMC). ACM, 71–80.
[77] Richard Murphy. 2015. The Joy of Tax. Bantam Press. isbn: 978-0593075173.
[78] Arvind Narayanan and Bendert Zevenbergen. 2015. No encore for Encore? Ethical questions for web-based censorship measurement. Technology Science, 1–25. issn: 1556-5068. doi: 10.2139/ssrn.2665148.
[79] James O'Donovan, Hannes Wagner, and Stefan Zeume. 2017. The value of offshore secrets: Evidence from the Panama Papers. https://ssrn.com/abstract=2771095.
[80] Shu-Yi Oei and Diane M. Ring. 2018. Leak-driven law. UCLA Law Review, 65. doi: 10.2139/ssrn.2918550.
[81] Paul Ohm, Douglas C. Sicker, and Dirk Grunwald. 2007. Legal issues surrounding monitoring during network research. Proceedings of the 7th ACM SIGCOMM Internet measurement conference (IMC), 141–148. doi: 10.1145/1298306.1298307.
[82] Jim Omartian. 2016. Tax information exchange and offshore entities: Evidence from the Panama Papers. doi: 10.2139/ssrn.2836635.
[83] Craig Partridge and Mark Allman. 2016. Ethical considerations in network measurement papers. Communications of the ACM, 59, 10, 58–64. doi: 10.1145/2896816.
[84] David Pegg. 2016. Panama Papers: Europol links 3,500 names to suspected criminals. theguardian. (Dec. 2016). Retrieved Feb. 23, 2017 from https://www.theguardian.com/news/2016/dec/01/panama-papers-europol-links-3500-names-to-suspected-criminals, https://archive.fo/NGkLj.
[85] Nathaniel Poor and Roei Davidson. 2016. The ethics of using hacked data: Patreon's data hack and academic data standards. Data and Society. Tech. rep. Council for big data, ethics and society, (Mar. 2016), 1–7. Retrieved Sept. 28, 2017 from http://bdes.datasociety.net/council-output/case-study-the-ethics-of-using-hacked-data-patreons-data-hack-and-academic-data-standards/, https://perma.cc/H8T6-QPYK.

[86] Rebecca Portnoff, Sadia Afroz, Greg Durrett, Jonathan Kummerfeld, Taylor Berg-Kirkpatrick, Damon McCoy, Kirill Levchenko, and Vern Paxson. 2017. Tools for automated analysis of cybercriminal markets. In Proceedings of 26th International World Wide Web conference (WWW). ACM, 657–666. doi: 10.1145/3038912.3052600.
[87] Sören Preibusch. 2015. Privacy behaviors after Snowden. Communications of the ACM, 58, 5, 48–55. ACM. doi: 10.1145/2663341.
[88] 1978. Protection of Children Act. http://www.legislation.gov.uk/ukpga/1978/37/contents.
[89] Ronald E. Rice and Christine L. Borgman. 1983. The use of computer-monitored data in information science and communication research. Journal of the American Society for Information Science and Technology, 34, 4, 247–256. doi: 10.1002/asi.4630340404.
[90] Risk Based Security. 2016. Nulled.IO: Should've expected the unexpected! Risk Based Security. (May 2016). Retrieved Apr. 28, 2017 from https://www.riskbasedsecurity.com/2016/05/nulled-io-shouldve-expected-the-unexpected/.
[91] Bob Rudis and Deral Heiland. 2016. From Carna to Mirai: Recovering from a lost opportunity. DarkReading. (Dec. 8, 2016). Retrieved Feb. 22, 2017 from http://www.darkreading.com/endpoint/from-carna-to-mirai-recovering-from-a-lost-opportunity/a/d-id/1327633, https://archive.fo/2hGOt.
[92] Sagar Samtani, Ryan Chinn, and Hsinchun Chen. 2015. Exploring hacker assets in underground forums. In IEEE International Conference on Intelligence and Security Informatics (ISI). IEEE, 31–36.
[93] José Jair Santanna, Romain Durban, Anna Sperotto, and Aiko Pras. 2015. Inside booters: An analysis on operational databases. Proceedings of the 2015 IFIP/IEEE International Symposium on Integrated Network Management, IM 2015, 432–440. doi: 10.1109/INM.2015.7140320.
[94] William E. Scheuerman. 2014. Whistleblowing as civil disobedience: The case of Edward Snowden. Philosophy & Social Criticism, 40, 7, 609–628. doi: 10.1177/0191453714537263.
[95] Bruce Schneier. 2013. Attacking Tor: how the NSA targets users' online anonymity. The Guardian, (Oct. 2013). Retrieved Apr. 3, 2017 from https://www.theguardian.com/world/2013/oct/04/tor-attacks-nsa-users-online-anonymity, https://archive.fo/gpShR.
[96] Bruce Schneier. 2014. Metadata = Surveillance. IEEE Security and Privacy, 12, 2, 84. IEEE. issn: 1540-7993. doi: 10.1109/MSP.2014.28.
[97] Khadija Sharife. 2016. Out of luck: The fall of Banco Espírito Santo. World Policy Journal, 33, 3, 107–111. World Policy Institute. doi: 10.1215/07402775-3713077.
[98] Tom Simonite. 2012. Jail looms for man who revealed AT&T leaked iPad user e-mails. MIT Technology Review. (Nov. 2012). Retrieved Feb. 17, 2017 from https://www.technologyreview.com/s/507661/jail-looms-for-man-who-revealed-att-leaked-ipad-user-e-mails/, https://archive.fo/s35I8.
[99] Timur Snoke, Deana Shick, and Angela Horneman. 2013. Working with the Internet Census 2012. CERT/CC Blog. (Oct. 2013). Retrieved Feb. 17, 2017 from https://insights.sei.cmu.edu/cert/2013/10/working-with-the-internet-census-2012.html, https://archive.fo/gOO7R.
[100] Christopher Soghoian. 2008. Legal risks for phishing researchers. eCrime Researchers Summit. IEEE. issn: 1556-5068. doi: 10.1109/ECRIME.2008.4696971.
[101] Kyle Soska and Nicolas Christin. 2015. Measuring the longitudinal evolution of the online anonymous marketplace ecosystem. 24th USENIX Security Symposium, August, 33–48.
[102] Eugene H. Spafford. 1992. Are computer hacker break-ins ethical? The Journal of Systems and Software, 17, 1, 41–47. doi: 10.1016/0164-1212(92)90079-Y.
[103] Brett Stone-Gross, Thorsten Holz, Gianluca Stringhini, and Giovanni Vigna. 2011. The underground economy of spam: A botmaster's perspective of coordinating large-scale spam campaigns. Large-scale exploits and emergent threats (LEET).
[104] Alexandra Strigunkova. 2015. "To whom it's not concern" (ethical problems of information leaks research). PasswordsCon. Retrieved Feb. 23, 2017 from https://www.youtube.com/watch?v=XM5MO07fvtM. Video.
[105] Luca Talarico and Luca Zamparini. 2017. Intermodal transport and international flows of illicit substances: Geographical analysis of smuggled goods in Italy. Journal of Transport Geography, 60, 1–10. Elsevier.
[106] Ryan Tate. 2010. Apple's worst security breach: 114,000 iPad owners exposed. Gawker. (June 2010). Retrieved Feb. 17, 2017 from http://gawker.com/5559346/apples-worst-security-breach-114000-ipad-owners-exposed, https://archive.fo/XqVE8.
[107] Ryan Tate. 2010. FBI investigating iPad breach (update). Gawker. (June 2010). Retrieved Feb. 17, 2017 from http://gawker.com/5560542/fbi-investigating-ipad-breach, https://archive.fo/P1A9d.
[108] 2000. Terrorism Act. http://www.legislation.gov.uk/ukpga/2000/11/contents.
[109] The International Consortium of Investigative Journalists (ICIJ). 2016. The Panama papers: Politicians, criminals and the rogue industry that hides their cash. ICIJ website. (Apr. 2016). Retrieved Feb. 23, 2017 from https://panamapapers.icij.org/, https://archive.fo/aC51M.
[110] Daniel R. Thomas, Richard Clayton, and Alastair R. Beresford. 2017. 1000 days of UDP amplification DDoS attacks. In APWG Symposium on Electronic Crime Research (eCrime). IEEE, Scottsdale, AZ, USA. doi: 10.1109/ECRIME.2017.7945057.
[111] Lawrence J. Trautman. 2016. Following the money: Lessons from the Panama Papers. Penn State Law Review, 1–77. https://ssrn.com/abstract=2783503.
[112] United Nations General Assembly. 1948. Universal Declaration of Human Rights. United Nations. http://www.un.org/en/universal-declaration-human-rights/.
[113] Universities UK. 2012. Oversight of security-sensitive research material in UK universities: guidance. (Oct. 2012), 16 pages. Retrieved Sept. 28, 2017 from http://www.universitiesuk.ac.uk/policy-and-analysis/reports/Documents/2012/oversight-of-security-sensitive-research-material.pdf, https://perma.cc/U8N7-KG6X.

[114] Blase Ur et al. 2015. Measuring real-world accuracies and biases in modeling password guessability. In USENIX Security, 463–481.
[115] Liis Vihul, Christian Czosseck, Katharina Ziolkowski, Lauri Aasmann, Ivo A. Ivanov, and Sebastian Brüggemann. 2012. Legal Implications of Countering Botnets. Tech. rep. NATO Cooperative Cyber Defence Centre of Excellence and the European Network and Information Security Agency (ENISA), 1–67. Retrieved Sept. 28, 2017 from https://ccdcoe.org/sites/default/files/multimedia/pdf/VihulCzosseckZiolikowskiAasmannIvanovBr%C3%BCggemann2012_LegalImplicationsOfCounteringBotnets_0.pdf, https://perma.cc/6FH6-UGRU.
[116] Wagas. 2016. Hacking, Trading Forum w0rm.ws Hacked; Exploit Kits, Database Leaked. HackRead. Retrieved Sept. 28, 2017 from https://www.hackread.com/hacking-forum-w0rm-ws-hacked-data-leaked/, https://perma.cc/V3SU-EUWX.
[117] Maciej Walkowski. 2016. The problem of mounting income inequalities in the world vis-a-vis the phenomenon of harmful tax competition. The ICIJ tracking down the greatest financial scandals of the 21st century. Przegląd Politologiczny, 2, 137–154. doi: 10.14746/pp.2016.21.2.11.
[118] Patrick F. Walsh and Seumas Miller. 2016. Rethinking ‘Five Eyes’ security intelligence collection policies and practice post Snowden. Intelligence and National Security, 31, 3, 345–368. Routledge. doi: 10.1080/02684527.2014.998436.
[119] Patrick L. Warren. 2016. Research with leaked data. Society for Institutional & Organizational Economics. (May 2016). Retrieved Apr. 10, 2017 from https://www.sioe.org/news/research-leaked-data, https://archive.fo/F90KP.
[120] Matt Weir, Sudhir Aggarwal, Michael Collins, and Henry Stern. 2010. Testing metrics for password creation policies by attacking large sets of revealed passwords. In Proceedings of the 17th ACM Conference on Computer and Communications Security (CCS). ACM, 162–175.
[121] Matt Weir, Sudhir Aggarwal, Breno De Medeiros, and Bill Glodek. 2009. Password cracking using probabilistic context-free grammars. In 30th IEEE Symposium on Security and Privacy, 391–405.
[122] David Murakami Wood and Steve Wright. 2015. Before and after Snowden. Surveillance & Society, 13, 2, 132–138. issn: 1477-7487.
[123] Michael Yip, Nigel Shadbolt, and Craig Webber. 2013. Why forums? An empirical analysis into the facilitating factors of carding forums. In Proceedings of the 5th Annual ACM Web Science Conference. ACM, 453–462. doi: 10.1145/2464464.2464524.
[124] Ben Zhao. 2015. Is it legal, ethical and publishable to do academic research using data from the Ashley Madison leak? Quora. (Nov. 2015). Retrieved Apr. 10, 2017 from https://www.quora.com/Is-it-legal-ethical-and-publishable-to-do-academic-research-using-data-from-the-Ashley-Madison-leak.