Ethical issues in research using datasets of illicit origin

Daniel R. Thomas Sergio Pastrana Alice Hutchings Richard Clayton Alastair R. Beresford Cambridge Cybercrime Centre, Computer Laboratory University of Cambridge United Kingdom [email protected] ABSTRACT ACM Reference format: We evaluate the use of data obtained by illicit means against Daniel R. Thomas, Sergio Pastrana, Alice Hutchings, Richard Clayton, and Alastair R. Beresford. 4239. Ethical issues in research a broad set of ethical and legal issues. Our analysis covers using datasets of illicit origin. In Proceedings of IMC ’39, London, both the direct collection, and secondary uses of, data ob- UK, November 3–5, 4239, 3: pages. tained via illicit means such as exploiting a vulnerability, or DOI: 32.3367/5353587.53535:; unauthorized disclosure. We extract ethical principles from existing advice and guidance and analyse how they have been applied within more than 42 recent peer reviewed papers that 3 INTRODUCTION deal with illicitly obtained datasets. We find that existing The scientific method requires empirical evidence to test advice and guidance does not address all of the problems that hypotheses. Consequently, both the gathering, and the use researchers have faced and explain how the papers tackle of, data is an essential component of science and supports ethical issues inconsistently, and sometimes not at all. Our evidence-based decision making. Computer scientists make analysis reveals not only a lack of application of safeguards significant use of data to support research and inform policy, but also that legitimate ethical justifications for research are and this includes data which was obtained through illegal or being overlooked. In many cases positive benefits, as well as unethical behaviour. potential harms, remain entirely unidentified. Few papers In this paper we consider the ethical and legal issues sur- record explicit Research Ethics Board (REB) approval for rounding the use of datasets of illicit origin, which we define the activity that is described and the justifications given for as data collected as a result of (i) the exploitation of a vulner- exemption suggest deficiencies in the REB process. ability in a computer system; (ii) an unintended disclosure by the data owner; or (iii) an unauthorized leak by someone CCS CONCEPTS with access to the data. •Social and professional topics → Computing profes- The collection, or use, of a dataset of illicit origin to support sion; Codes of ethics; Computing / technology pol- research can be advantageous. For example, legitimate access icy; •General and reference → Surveys and overviews; to data may not be possible, or the reuse of data of illicit •Applied computing → Law; •Networks → Network origin is likely to require fewer resources than collecting data privacy and anonymity; again from scratch. In addition, the sharing and reuse of existing datasets aids reproducibility, an important scientific goal. The disadvantage is that ethical and legal questions KEYWORDS may arise as a result of the use of such data. Ethics, law, leaked data, found data, unintentionally public There is evidence that some researchers who use datasets data, data of illicit origin, cybercrime, Menlo Report of illicit origin consider ethical and legal issues, particularly through the introduction of ethical consideration sections into papers [:5] and the development and use of institutional Permission to make digital or hard copies of all or part of this work resources such as Research Ethics Boards (REBs) [48]. Un- for personal or classroom use is granted without fee provided that fortunately, our work shows that neither is common practice, copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. and even where they are tackled, ethical and legal consid- Copyrights for components of this work owned by others than the erations often appear incomplete. It therefore follows that author(s) must be honored. Abstracting with credit is permitted. To potential harms may have taken place which might otherwise copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request have been mitigated or avoided. Research that lacks sufficient permissions from [email protected]. ethical consideration may still be ethical, but it is difficult to IMC ’39, London, UK assess this. © 4239 Copyright held by the owner/author(s). Publication rights licensed to ACM. ;9:-3-6725-733:-:/39/33. . . $37.22 General guidance, such as that provided by the Menlo DOI: 32.3367/5353587.53535:; Report [4:], provides useful advice, but does not address all IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al. the issues that arise in using data of illicit origin. Academic presumption of innocence until proven guilty, a right to not discussions have taken place [54], and there are blog posts and have arbitrary invasions of privacy, and a right not to be other informal articles by academics on the topic [33;, 346], arbitrarily deprived of property [334]. Research using data of but to date there is little in the way of detailed analysis, or illicit origin may indirectly deprive people of such rights and a systematisation of knowledge, which explores this problem so this needs to be considered. For example, in Philippines in depth. in 4238, suspected illegal drug users or dealers were subject The goal of this paper is to address this gap and provide a to extra-judicial assassination [8] and hence care would need detailed evaluation of the use of data of illicit origin in peer- to be taken with data collected from online drug markets to reviewed research, and to support the development of a more ensure this did not result in such abuse. nuanced understanding of the issues and problems in this Releasing and using shared data: The WECSR workshop space. We do this by first reviewing previous work to identify in 4234 convened a panel of experts from different domains the ethical (§4) and legal (§5) issues that can arise. We then who agreed that research involving data of illicit origin would analyse over 42 recent peer-reviewed papers which make need to have a clear benefit to society [54]. They also argued use of data of illicit origin (§6), and systematize (Table 3) that simply because data is public does not exempt research the ethical and legal decisions made against a common set using such data from obtaining REB approval since it might of justifications, safeguards, potential harms and potential contain personally identifiable information. This echoes the benefits (§7). Menlo Report’s suggestion that the REB must protect the interests of individuals where informed consent is impossible. Sharing of datasets is beneficial for data science, but the 4 ETHICS purpose and scope for using such data must be stated [3;]. Ethical norms are constantly changing with research ethics Allman and Paxson discussed the ethical issues of releasing developing over the course of the 42th century and becoming data, using data released by others, and the interactions more prominent in our field in the 43st century. Previous between providers and users of data [6]. A key ethical consid- work related to the ethical use of data of illicit origin spans a eration in this context is privacy protection. It is likely that number of topics, including informed consent, human rights, data of illicit origin was not intended for research purposes or releasing and using shared data, hacking, analysis techniques, public exposure, and thus it may not be anonymised. In such ethical review, and Research Ethics Boards (REBs). We cases, the raw dataset should not be shared publicly, and consider each of these in turn. research conducted with such data should aim to preserve Informed consent: The earliest work on the ethics of privacy. Researchers who hold data of illicit origin should computer-monitored data explored informed consent and only provide details of their source or (as Allman and Paxson emphasised the right of withdrawal as well as the importance suggest) share data with verified researchers under a written of data anonymisation [:;]. The first difficulty with data of acceptable usage policy. None of the papers we discuss later illicit origin is that it is not always possible to meet these re- took this approach. Partridge argues that papers in network quirements. Acquiring consent from users involved in leaked measurement research should have an ethics section, partly to data is challenging, particularly if they are involved in illegal increase the availability of examples of ethical reasoning [:5]. activities [94]. In the case of data obtained from underground We show in §7 that few papers using data of illicit origin marketplaces, covert research without consent is necessary to have an ethics section. understand what is traded due to the illegality of the goods Both Allman & Paxson, and Partridge warn against relying bought or sold – consent could affect the results [323]. This on the anonymisation of data since deanonymisation tech- is one of the exceptions for informed consent in the ethics niques are often surprisingly powerful. Robust anonymisation statement of the British Society of Criminology, which states of data is difficult, particularly when it has high dimensional- that “covert research may be allowed where the ends might ity, as the anonymisation is likely to lead to an unacceptable be thought to justify the means” [45]. level of data loss [5]. In cases where consent is possible, previous work has con- Hacking and intervening: Hacking into computers to ex- cluded that if informed consent has been given on the basis tract information is usually unethical [324] and illegal. Moore of a promise of confidentiality by the researcher, then re- and Clayton considered ethical dilemmas in take-down re- searchers should take particular care to ensure that they are search resulting from nine dilemmas they faced during their willing to keep the promises they make, particularly if doing research. They considered the balance between reducing so might require them to break the law [74]. harm uncovered during measurements and the accuracy of Where informed consent is impossible to obtain, the Menlo such measurements, the dangers of telling criminals the flaws Report recommends that the REB must protect the interests in their systems and the importance of ensuring that pro- of the individuals [48]. Thus, the REB has a particularly posed interventions are likely to work [97]. Dittrich et al. important role to play in research which makes use of data provide two case studies on ethical decision making for remote of illicit origin where informed consent is not possible [45]. mitigation of [4;]. They discuss the ethical issues Human rights: Human rights also provide an important involved including the reasons for and against intervening. ethical baseline. These include, the right to life, the right Analysis techniques: The ethics committee of the Associa- to be free of arbitrary arrest, the right to a fair trial, a tion of Internet Researchers’ (AoIR) has produced guidance Ethical issues in research using datasets of illicit origin IMC ’39, November 3–5, 4239, London, UK for ethical decision making in Internet research in 4224 [55] using REBs as they do not add value and may introduce and 4234 [93]. This cross-disciplinary work provides useful many months of delay. By contrast other REBs (such as ours questions to aid researchers in considering the ethics of their in Cambridge) have ICTR specialists and aim to provide a research and defines a process for ethical decision making. response in five working days for simple cases. In general AoIR aimed to publish case studies of the application of their REBs are required because researchers are biased when as- guidelines but this has not yet happened. This paper consid- sessing the ethics of their own research [48]. REBs can help ers over 42 papers which might have used the AoIR ethics researchers identify additional safeguards or improvements guidelines but only one of them (§6.5.4) did so. Keegan in experimental design that make the work ethical, and can and Matias developed a multi-party risk benefit framework help protect researchers from liability. for use in analysing ethical considerations for online com- munity research [78]. While this was implicitly intended for 4.3 Ethical issues research surrounding particular online community platforms, In order to support analysis of case studies in §6, we now the same principles apply to research that considers the on- list the set of ethical issues that require consideration when line community of the Internet, and so it may be a helpful conducting research with data of illicit origin: technique. Identification of stakeholders: The primary, secondary, Ethical review: The Menlo Report [4:] and its compan- and key stakeholders should be identified to support the ion [48] are the primary reference on ethical practice in analysis of the potential harms and benefits of the research. Information Communication Technology Research (ICTR), Primary stakeholders are those directly connected with data, particularly for USA-based researchers. It includes numerous such as those identified in it; secondary stakeholders are questions to help researchers consider ethical issues and case intermediaries in the delivery of benefits or harms, such as studies to illustrate their application. It identifies that ICTR service providers; and key stakeholders are those such as the has a greater scale, speed, coupling, decentralisation, distribu- leaker or the researcher who are critical to the conduct of tion, and opacity than traditional human subject research and the research. hence it re-examines the particular ethical principles required Informed consent: In most of the research we consider it to evaluate ICTR. It identifies four ethical principles [48, was impossible or impractical to obtain informed consent §B]: Respect for persons: individuals should be treated from the primary, and in some cases secondary, stakeholders. as autonomous agents and persons with diminished autonomy However, research may be designed such that informed con- should be given additional protection. Beneficence: min- sent is not required. Since none of the case studies we have imise possible harms, and maximise possible benefits. The considered obtained informed consent for their use of data of researcher should also use safeguards against potential harms. illicit origin, we do not consider it in later analysis. Justice: risks and benefits should be distributed fairly and Identify harms: The potential harms arising from the use not on the basis of protected characteristics such as race, of the data of illicit origin should be identified. or other characteristics that correlated with protected ones. Safeguards: Researchers should apply mechanisms to miti- Respect for law and public interest: in general ethical gate or reduce the potential for harm. research conforms to applicable laws in relevant jurisdictions. Justice: The research does not unfairly advantage or dis- Research should always be in the public interest. Addition- advantage any particular social or cultural group. ally, research should be open, transparent, reproducible and Public interest: The research has been published, is repro- peer-reviewed. ducible, and there is a “social acceptability” exceeding the Research Ethics Boards (REBs): Program committees or harms [57]. journal editors are able to review the ethics of work after it has been conducted but before it is published. REBs, known as Institutional Review Boards (IRBs) in many US 5 LEGAL ISSUES institutions or Ethics Committees in some UK institutions, Legal issues surrounding research with data collected ille- review the ethics of proposed research before it is conducted. gally can be complex, particularly as the laws of multiple Many REBs were originally formed in response to a review jurisdictions are likely to be applicable. We are not lawyers of the ethics of medical research following revelations of un- and researchers should seek their own legal advice. Countries ethical medical research conducted prior to the 3;92s [48, whose laws that may apply to research being conducted in- §A.3]. In this context there were clearly human subjects who clude those where individuals or systems that these data refer had rights that needed to be protected. The term “human to reside, the countries where data was stored, the countries subject” is now deprecated in ethical review in favour of where the researchers conducted the research, and possibly considering the wide variety of people who might be “par- also any countries that data transited during any part of ticipants” in the research, even if they are not aware the this process. Researchers often travel and so they should research is being conducted. However, some REBs are still consider the impact of committing offences, both in their structured around serving this original purpose, and thus home jurisdiction and in countries that they visit or that they lack the expertise to understand ICTR or the process they might be extradited to. to evaluate research whose risks and challenges differ from The key legal issues applicable to research with data of a medical trial. Such structures discourage researchers from illicit origin are as follows: IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al.

Computer Misuse: Most jurisdictions now have laws against terrorist materials, such as discussion of planned attacks or the misuse or abuse of computers such as the UK [43], the techniques, to ensure the researcher is protected [335]. US [3], and Germany [5:, 5;, 62, 63]. These can cover generic Indecent images: Possession of indecent images of children actions such as the unauthorised use of a computer system is an offence in many jurisdictions including the UK [::], (even if there was no technical measure in place to prevent USA [4], and Germany [59]. In general there are no exemp- it), and the use of , or ‘dual use’ tools that may be tions for research. Hence, care may need to be taken when used for malicious purposes. scraping or receiving some types of data dump in case they Copyright: The right to produce copies, including, in some contain such material. jurisdictions, database rights, and trade secrets may apply to National Security: Data obtained may be protected by data obtained by researchers. In particular, it may affect the national security legislation. Therefore unauthorised use further sharing of data with other researchers as that might or publication of these data may expose the researchers to constitute the creation of copies. There are exemptions to legal risks. Even if data is publicly available it may still be copyright such as “fair use”, which vary with jurisdiction. classified [58]. This is discussed further in §6.7.4. Data Privacy: Data may contain personally identifiable Contracts: Researchers may be exposed to civil liability information which may mean it needs to be protected and resulting from breach of contract by using certain data if processed in accordance with relevant Data Privacy and Data doing so violates terms of service or other contracts that Protection rules. In several jurisdictions IP addresses may researchers have agreed to. be considered personal data, which complicates their use, Occasionally research may be illegal but still ethical. In particularly where consent has not been obtained. This is such cases researchers should be transparent about what the case in Germany [337, p4;], though a European Court of they are doing and both they and their institutions should Justice ruling has found that personally identifiable data can be willing to accept the consequences. REB approval is still be processed without consent for security purposes (e.g. essential in such cases. In such circumstances researchers IP addresses in web server logs) [6:]. There is an exemption should actively engage with lawmakers to improve the law for the use of the personal data in Germany “if it is necessary so that the ethical work that they want to do is made legal for a research entity in order to conduct a scientific research, in future [74]. the scientific interest to conduct the research project substan- There are generic defences against legal liability that may tially predominates over the interest of the data subject in apply. Mens rea: In some cases if the researcher can demon- exclusion of the change of purpose the data was collected for, strate lack of criminal intent then a criminal prosecution and if the research cannot be conducted otherwise or can oth- cannot succeed; REB approval may be a useful way to demon- erwise be conducted under disproportional effort.” (German strate this. Not in the public interest to prosecute: This is Federal Data Protection Code §4:.4.5 [337]). an especially generic defence, but it is uncertain. The General Data Protection Regulation (GDPR) [44] The use of an REB may transfer legal risks to the re- applies from May 423: to the collection and processing of searcher’s institution or ensure that the institution provides personal data in the EU and to organisations that offer goods legal assistance to the researcher. This alone is a strong or services to individuals in the EU [72]. It provides specific incentive for individual researchers to use an REB. measures to allow processing of personal data for scientific research in the public interest, subject to appropriate safe- guards such as encryption, pseudonymisation, and data min- 5.3 Related work on law and ICTR imisation. It mentions that scientific research should increase Others have discussed legal issues in ICTR for particular knowledge and that personal data should not be included kinds of research in particular jurisdictions. in publications. It specifically allows the processing of data Soghoian discusses legal risk arising from conducting phish- collected for other purposes for scientific or historical research ing experiments, with four case studies and a discussion of (Article 7). It requires (Article 36.7.b) that the interests of the copyright, trespass to chattel, trademark, terms of ser- data subjects be protected and that information about the vice violation, computer fraud, and anti-phishing issues in data collected, how it is being processed and safeguarded, the USA [322]. He recommends all such research should be and who is responsible be made publicly available. It encour- subject to ethical review as well as legal advice, and that the ages the use of approved codes of conduct surrounding data institution IT staff and anyone else likely to receive a cease processing and it may be helpful for research communities and desist letter related to the project should be consulted. to develop such codes of conduct. Penalties for violating He also notes the importance of not gaining any financial the GDPR include fines of up to EUR 42 million, or 6% of benefit from the work, for example, adverts should not be worldwide turnover, whichever is higher. shown on pages associated with the work. Terrorism: In some jurisdictions (e.g. UK) it may be an Ohm et al. examine the legal issues for network measure- offence to fail to report terrorist activity [32:], including any ment research arising from USA Federal Law [:3]. They discovered during a research project. Additionally, possession suggest capturing only the required data, scrubbing IP ad- of terrorist materials may be an offence unless specific excep- dresses, encrypting data when not analysing it, restricting tions for research are met. REB approval and institutional monitoring to the smallest possible network, and ensuring oversight are likely to be necessary if the research involves that measuring tools do not store the full packets to disk. Ethical issues in research using datasets of illicit origin IMC ’39, November 3–5, 4239, London, UK

Burstein discusses the legal issues surrounding cyberse- were incorrect [3:]. They noted ethical concerns without giv- curity research, and in particular research using network ing details, and referred the reader to the Menlo report [48]. traces, running malware honeypots, or mitigating attacks by To prevent harm, CAIDA only looked at data targeting their interfering with malicious systems [37]. own darknet. Malécot and Inoue took a similar approach, analysing their network telescope [92]. However, they then 6 CASE STUDIES realised that they knew the IP addresses of the devices as they were the sources of the probes of their network tele- The ethical and legal considerations of conducting and pub- scope. These IP addresses thus once belonged to devices that lishing academic research with data of illicit origin depends could be easily brute forced as they had weak Telnet pass- on the context [346]. Many factors may come into play, such words, in doing so the authors Identify harms. The Safeguards as the type of data involved, the measurement techniques they used were that they kept these IP addresses confidential used or the results of the research. Thus, ethical and legal pending finding an ethically acceptable and practical way of consideration must be conducted separately for each research dealing with the situation.3 activity. This section presents a series of concrete case studies Krenc, Hohlfeld and Feldmann published a non-peer re- where different kinds of data of illicit origin was used. We viewed editorial note where they analysed the scan results [84]. do not attempt to provide a complete survey on each topic. They found numerous technical problems with the data, and Instead, for each topic we consider recent and relevant work, the authors concluded that given that Carna scan made no focussing on those that mention or expose interesting ethi- technical contributions, it had been unethical to conduct. cal issues. We selected from papers we were already aware While they did not provide an opinion on whether it is ethi- of, those we found from searches, and from following refer- cal to use these data for research, they did use it for these ences forwards and backwards. Decisions on which papers purposes. to include were necessarily subjective. We consider work by When these scan data were first released, it was not neces- people who self-identify as researchers even if they do not sary to use them to answer research questions as researchers publish in academic venues, although we have focussed on could conduct their own scans legally, ethically, and with publications which have undergone peer review. We highlight better technical validity. For new research that requires com- the ethical (e.g. Identification of stakeholders) and legal (e.g. parison with the Internet of 4234, these Carna scan data Computer misuse) issues defined in §4.3 and §5 in each case, might be of use. However, due to substantial technical prob- and the overall results are summarised in §7. We do not cover lems with these Carna scan data [84] in many cases there research using active measurements such as the controversial will be no good argument for using them. Encore work discussed by others [9:]. Dittrich, Carpenter and Karir use the Menlo report to present a thorough analysis of the ethics of the Carna bot- 6.3 Malware and exploitation net [49], from which they conclude that there is a “lack of Research conducted using hacking tools such as botnets or a common understanding of ethics in the computer security exploit kits is necessary to understand how malicious actors field”. use them and to provide countermeasures. However, the use 6.3.4 AT&T iPad users database brute force. In 4232 re- of such tools can be harmful, and sometimes researchers have searchers from discovered a web service run used them without considering legal or ethical issues. In this by AT&T that, when provided with the ICC-ID of a 5G section we cover three particular cases: the use of a botnet to iPad, would return the associated email address. They used scan the IPv6 address space, the exploitation of a discovered this to obtain the email addresses for 336 222 iPad users vulnerability to gather email addresses, and research using and passed this information to Gawker [328] as well as mak- of malware specimens. ing the vulnerability known to third parties. They did not contact AT&T to report the vulnerability. The subsequent 6.3.3 Carna scan. In 4234 an individual claimed FBI investigation [329] resulted in the authors being found to have carried out a complete scan of all IPv6 addresses on guilty of Computer misuse and one of them was sentenced all ports and made the dataset publicly available. However, to 63 months imprisonment [52]. While it was argued that the means they used to do this was a botnet of 642 222 this imprisonment was an overreaction [;:] the research was devices with default passwords [3:]. Creating a botnet of clearly both unethical and illegal. compromised computers to carry out research is illegal since it Finding vulnerabilities in third party systems without is Computer misuse. Is research that uses these data ethical? permission can be ethical, since this helps that party to CMU CERT wrote a blog post explaining how these data Identify harms and to avoid future attacks. However, in could be used and, while they pointed out there were some this case it was unethical, because: i) The authors collected ethical issues, they noted that: “As far as we can tell, this far more personal data than they needed to prove it to data set does not contain legally restricted information, like classified information, personally identifiable information, or 3In 4238 the Mirai botnet was also built by brute forcing Telnet trade secrets.” [;;]. CAIDA found that there were some passwords. As a security research community we did not successfully mitigate the risks that the demonstrated [;3]. We were problems with this data since some bot-devices were behind not able to contact Malécot and Inoue to discover if they found a HTTP proxies which meant that the results for port :2 scans solution to ethically using the IP addresses. IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al. be a real vulnerability, ii) The authors shared this data the research works using source code mentioned ethical or and the existence of the vulnerability (including the exploit legal issues, nor whether they have obtained permission from script) with third parties before reporting the problem to their corresponding REB. Calleja et al. shared a dataset with AT&T. Hence, the authors identified risks for iPad users, metrics from the source code, but not the sources themselves, but exploited the vulnerability and, given that they did not as Safeguards that allow for reproducibility without releasing contact AT&T, they failed to implement Safeguards. Indeed, the malware [39]. Additionally, stylometry analysis allows the research work is not in the Public interest, since they did code writers to be identified [38]. By identifying who has not minimise harm or maximise benefit, except to their own written some piece of code, it is possible to group malware notoriety (a lack of Justice). families, detect plagiarism, and attribute attacks. Thus, the release of source code (not just malware code) should be done 6.3.5 Malware source code. Malware source code has been with care, since it can be used to identify the authors. publicly released on multiple occasions. This code has then been used by malicious actors to attack computer systems or networks. Additionally, malware source code can be obtained 6.4 Password dumps from public databases such as vxHeaven or Contagio Dump. Password dumps are the archetypal leaked dataset – a list For example, the source code for was leaked in 4233 [67]. of passwords that has been been made public, normally by Since then, many variants of Zeus have been reported by illegal action; criminals regularly compromise databases (e.g. anti-virus vendors. Similarly the source code for Mirai botnet by SQL injection) and then publish the contents online [342]. was released in 4238 [82]. The release of malware source Since people include personal data in passwords, the password code often lowers the barrier to entry for using the malware: alone can be sensitive [326] and the lists may also include after Mirai’s source code was released, myriad Mirai based other personal data such as email addresses or names. botnets began operation. Release of source code is sometimes Research into password dumps is controversial but widely correlated with prosecution of its author, as in the example of practised [326], not least because they provide researchers Agobot in 4226 [65]. Additionally, the release of source code with ground truth data, that otherwise would be difficult might result in changed business models for malware authors to obtain [54]. A variety of dumps have been investigated. such as the Zeus botnet as-a-service [67]. The source code of One of the largest dumps, is the RockYou dump, others malware can also be obtained from its operational settings. include MySpace leaked in 4228, or Facebook leaked in 4232. Stone-Gross et al. [325] identified and obtained access to some These dumps and many others can be found online by using of the C&C servers for the Pushdo/ (used common search engines. mainly for spam delivery) by contacting the hosting providers Weir et al. use lists of compromised and publicly disclosed (i.e. the authors first performed Identification of stakeholders). passwords [343]. They say that “while publicly available, these They obtained sensitive data such as the statistics of infection, lists contain private data; therefore we treat all password target email addresses and the source code of the malware. lists as confidential” and that “due to the moral and legal Kotov and Masacci collected source code of exploit kits issues with distributing real user information, we will only from a public repository as well as underground forums where provide the lists to legitimate researchers who agree to abide code was leaked or released [7:]. Their analysis showed how by accepted ethical standards”. In a later paper by the same exploit kits evade anti-crawling techniques and how malware authors they note that it is a “mixed blessing” that research authors protect against code analysis. However, as the au- into passwords used for high value targets, such as bank thors state, the fact that the code was leaked biased their accounts, will not be possible until relevant breaches occur analysis, for example due to removal of obfuscation from the and the data is made public [342]. source code. Calleja et al. analysed 373 malware samples Das et al. study trends in password reuse across different dating from 3;97 to 4237. They show how the malware sites by analysing passwords obtained through an internal landscape has evolved using traditional software metrics on a survey and several hundred thousand leaked passwords [46]. dataset of malware code from various repositories, including They state that the ethics of conducting research with data vxHeaven, GitHub, -related magazines, and P4P net- acquired illegally is under debate, but they justify their work works. The authors do not share the collected source code, saying that: 3) these datasets were used in several previous but only provide a dataset containing the metrics obtained studies, 4) they protected users privacy by only working from the malware pieces [39]. with hashed email addresses, 5) they obtained approval from Research using malware source code has substantial ben- their REB to conduct the survey. Finally, the authors state efits as better understanding of how malware works, so as that their results help system administrators and researchers to facilitate developing new detection and mitigation tech- understand how cross-site password attacks work. niques. Publishing the results and explaining the tools and Kelley et al. used two datasets of leaked passwords as well techniques used to analyse the malware makes this research as passwords collected through an online survey [79]. The of Public interest. Both possession and accidental disclosure authors received approval from their REB for this survey, and of malware could be illegal. Researchers could use secure they discuss the ethics of using leaked databases of passwords. storage, enforce retention policies, and not publicly distribute They argue that, given these data were already public, using the malware source code as Safeguards. However, none of it for research does not increase harm to users, since no Ethical issues in research using datasets of illicit origin IMC ’39, November 3–5, 4239, London, UK further connection with real identities is sought. Moreover, 6.5 Leaked databases given that attackers can use these datasets to construct While password dumps are a specific restricted form of data, their dictionaries or cracking tools, system administrators in this section we consider more general leaking of entire can benefit from research and can prepare defences such as databases that contain more detailed information such as improved password policies. This view is also shared by Ur messages or logs of activity. These databases are available in et al., who use three different password dumps to compare underground forums or from public repositories, sometimes real-world cracking techniques with those proposed in the for free. In this section, we cover three cases: Databases research literature [336]. of distributed denial of service (DDoS) providers (booters), Durmuth et al. proposed OMEN, a cracking algorithm the Patreon database (a web site aimed at crowd-founding that outperforms state of the art crackers [53]. The authors projects), and the databases of several underground forums. tested their algorithm on leaked databases from MySpace, Facebook and the website RockYou. The authors justify this 6.5.3 Booter databases leak. Booters (sometimes called by claiming that these datasets have been used in several stressers) provide DDoS-as-a-Service. While their operators previous studies, and they have been made public. Moreover, might sometimes claim that this is legal [68], this activity 4 they claimed that these data have been treated carefully and is almost always illegal (Computer misuse), and it is also they do not reveal actual information about the passwords. unethical. Several approaches have been used to understand Bonneau used leaked password databases to investigate this criminal ecosystem: interviewing operators after contact- the statistical properties of passwords and developed the ing them through their websites [68], measuring the attacks α-guesswork metric for password strength [35]. He notes that they produce [332], and using their leaked databases and “care was taken to ensure that no activities undertaken for source code. Karami et al. analysed a database dump of the research made any user data public which wasn’t previously” TwBooter service. Their Safeguards to make this research and rejects “the appropriateness of ever collecting cleartext ethical were to not publish personally identifiable data, ex- user passwords, with or without additional identifying infor- cept when this was already publicly known [76]. Later they mation”. By this argument, leaking a password database and analysed database dumps from Asylum and LizardStresser making a leaked password database more available than it and scraped data from VDOS [77]. For the latter they ob- would otherwise be are both unethical. tained an REB exemption on the basis these data did not Researchers mostly use two arguments to justify the ethics contain any personally identifiable information and used pub- of conducting research on password dumps. First, since the licly leaked data. In some jurisdictions (e.g. Germany [337]) passwords are public, studying them might not increase harm IP addresses may be personally identifiable data and the and could help advance science. Moreover, since attackers dumps likely contained email addresses which can be simi- may have access to these lists as well, the defences derived larly identifiable. Santanna et al. analysed database dumps from analysis of these passwords may protect people, making from 37 distinct booters and used Karami’s procedures to these works of Public interest. Second, most authors state justify it ethically [;5]. Thomas et al. used database dumps that they do not reveal personal information derived from and scraped data from booters to evaluate the coverage of the passwords, and some of them claim that they “treat their honeypot based measurement of DDoS attacks, they these lists with the necessary precautions” [53] or that they argued that using this data was necessary as there was no “treat them as confidential” [343], which should include secure other ground truth on attacks initiated by booters [332]. storage of the lists. All the works covered Identify harms and All these papers had some Identification of stakeholders, provide Safeguards. Some authors justify the ethics of their Identify harms and used Safeguards. These dumps can con- research on the basis that previous research was conducted tain details of user accounts including names, email addresses, with these dumps. password hashes and security questions; details of the backend Some uses of leaked password databases are clearly not and frontend servers used for attacks (including compromised ethical. leakedsource.com was shut down and its operators hosts); logs of connections to the site including IP addresses arrested as a result of its use of leaked passwords [83]. It and user agent strings; logs of attacks including target IP made password hashes (or even cracked passwords) available addresses, ports, domain names and the method used; tickets to anyone who was willing to pay for the access. This is in and messages sent between users and site owners; records of marked contrast to the ethical service haveibeenpwned.com payments; details of pricing plans; and chat logs of site oper- which never makes passwords available and doesn’t expose ators. Concretely, Thomas et al. used the attack logs [332]; any personal information without verification of control of Santanna et al. [;5] and Karami et al. [77] used attack logs, the email address for that leaked information [66]. havei- payment records, pricing plans and counts of users. The beenpwned.com maximises benefit by enabling users to find password hashes could be used for password research and the out if their data has been leaked and notifying them if their ticket databases for qualitative research into the attitudes of data is leaked in the future while minimising harm as it does booter users and operators [68]. not share private data with anyone or use it for any other purpose. 4The aim is usually illegal: attempting to stop someone’s Internet connection from working; and the mechanism is often also illegal: using UDP amplification attacks that make unauthorised use of misconfig- ured UDP servers or using botnets of compromised machines. IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al.

6.5.4 Patreon crowd-funding. In October 4237 the Patreon works are undoubtedly valuable to law enforcement agencies crowd-funding website was hacked and the entire site made and crime prevention strategies that arise as a result can available. This included data on projects, private messages, provide social benefits (e.g. preventing terrorist attacks, cy- source code, email addresses, and passwords. Poor and berattacks, or child sexual exploitation). However, there Davidson, who were conducting research based on incomplete are other ethical considerations: the research should con- data obtained by scraping the Patreon website would have sider potential harms to the users of the forums, including liked to use this data but concluded it would be unethical prosecution or physical threat, and compare them with the to do so [:7]. These data were publicly available and the benefits. None of the works mentioned use Safeguards to researchers hoped to serve the public through their research. protect the data, which was originally illegally obtained [338]. However, it would be hard for them to distinguish between Some authors have publicly re-released leaked datasets, even public and private data within the dump, and they might see including private information [36, :8]. While the goal is to private data unintentionally. Furthermore using the dump provide other researchers with datasets to conduct their ex- might legitimize criminal activity, violate user’s expectations periments and for reproducibility, public release means that of privacy, and the use of this data would be without their these datasets are potentially being shared with malicious consent. Importantly they also did not need to use this actors. data to do their research, as scraping the Patreon website would also provide the data they needed, without the risk 6.6 Financial data leaks of accidentally including private data. This is an example Leaks of commercially sensitive data such as data on contracts where the authors Identify harms and chose not to use data and client relationships can reveal the hidden behaviour of of illicit origin as a result since they could not ensure that companies and individuals in ways that would be difficult there would be no additional harm. to achieve through external measurement. In this section we describe the leak of data from a law firm that revealed 6.5.5 Underground forums. Underground forums focus on the financial behaviour of companies and individuals. Leaks trading, learning and discussing illegal or criminal topics, such of peering arrangements between ISPs together with traffic as hacking material, credit cards and drugs. These forums statistics likely present similar ethical issues, particularly if often also cover other non-illegal topics, such as video-games, the data demonstrate illegal behaviour such as contravening financial help, politics, and sport. network neutrality rules or the imposition of censorship. Several forum databases have been leaked in the past. The w2rm.ws forum database was hacked by an anonymous group Panama / Mossack Fonseca papers leak. In 4237 the internal of under the pseudonym “Peace of Mind”, allegedly in database of the Panamanian law firm Mossack Fonseca was response to some prior attack on the forum “Hell” [338]. All leaked to the German newspaper Süddeutsche Zeitung, which the forum content, personal data, as well as exploits and hack- shared it with the International Consortium of Investigative ing material was made public. The database of carders.cc, a Journalists (ICIJ) [32;]. In 4238 the consortium and its German forum focused on financial information trading (such partners released numerous reports based on analysis of the as credit cards or bank accounts) was hacked and leaked [7;]. leaked data [64]. World leaders, celebrities, and companies In 4238, the database of the forum nulled.io was leaked, were found to be using Mossack Fonseca’s services for tax containing “758 286 user accounts with :22 7;5 user per- evasion and other criminal purposes. Some of these data were sonal messages, 7 7:4 purchase records and 34 822 invoices made publicly available and Europol identified 5 722 crimi- which seem to include donation records” [;2]. Motoyama et nals among the clients of the firm using this data [:6]. There al. presented one of the first works analysing underground was a substantial Public interest to this release as it identified forums using leaked databases, however, they did not discuss criminal and unethical activity by numerous individuals and ethics [98]. Yip et al. perform social network analysis using companies. This revelation may result in money laundering a database of three carding forums (Cardersmarket, Dark- and tax evasion becoming more difficult in future due to market and Shadowcrew) which included private messages greater transparency, international cooperation, and fear of of the participants [345]. This research showed that forums disclosure. However, not all the clients of Mossack Fonseca are a preferred way for criminals to communicate. They were engaged in illegal or unethical behaviour. Trautman do not provide any discussion about the ethics of their re- surveys the consequences of the leak describing many of the search, however they indicate that the marketplace actors are media reports and investigations that resulted from it [333]. anonymous, so it is not possible to obtain Informed consent. Murphy states that money laundering and tax evasion are Analysing data from underground forums can provide valu- illegal, while tax avoidance is unethical [99]. He says tax is able insights into how markets for stolen data work [69], how the rightful property of the government, tax evasion is the malware configurations are shared [67], new forms of criminal theft of the government’s rightful property, and tax avoidance networks that arise in cyberspace [98], common goods and is a “con trick”. Promoting tax competition between countries assets being traded [:8], economy of spam campaigns [325] is seeking to undermine national sovereignty and subvert the or even to provide Indicators of Compromise (IoC) and other democratic (or other political) process and hence unethical. useful information for threat intelligence [8;, ;4]. This makes The Guardian reports that ;7% of Mossack Fonseca’s work it of Public interest. Indeed, the conclusions drawn from these involves selling financial products to avoid taxes [64] and Ethical issues in research using datasets of illicit origin IMC ’39, November 3–5, 4239, London, UK hence almost all of the work they do would be described by do not agree on the morality of using this information for Murphy as unethical. teaching foreign policy studies, despite the cables being a The Panama papers have been of interest to researchers, valuable teaching tool [47]. Barnard “borrowed” classified as well as journalists, and law enforcement. Sharife uses the documents from the “controversial WikiLeaks” to analyse Panama papers in a historical analysis to understand the covert relationships between USA and South Africa during downfall of Banco Espírito Santo [;9]. Walkowski uses the the Cold War [;]. The author claims that there were no Panama papers to argue for reform in legislation to increase ethical dilemmas since all the classified data used was open transparency and avoid tax competition [339]. O’Donovan source and declassified. However, there is no evidence that et al. evaluated the impact of the Panama papers on firm any of Manning’s WikiLeaks dump has been declassified. values and found it reduced market capitalisation of 5;9 Talarico and Zamparini analysed the smuggling of tobacco in firms implicated in the leak by US$357 billion or 2.9%[9;]. Italy between 4226 and 4236 [327]. They used a confidential Omartian used the Panama papers to investigate investor document from the American Embassy in Italy, obtained response to changes in tax legislation in terms of offshore through WikiLeaks that said that the USA government had entity usage [:4]. He uses the introduction of legislation: blacklisted an Italian harbour because of collusion by harbour European Union Savings Directive (EUSD), Tax Information staff. Berger references several Manning cables to study the Exchange Agreements (TIEAs), the Foreign Account Tax international restrictions on the trade of weapons with North Compliance Act (FATCA), and the Common Reporting Stan- Korea [34]. For example, Berger mentioned that the United dard for information exchange (CRS) as natural experiments, Arab Emirates bought missiles from North Korea, and a evaluating the impact on offshore activity as revealed by the diplomatic cable where USA thanked Iran for its cooperation Panama papers. He finds that they do have a significant in blocking one cargo from North Korea. impact. Researchers have used this data to better understand the None of these papers explicitly discuss the ethics of using diplomatic position of the USA government in several inter- this data; they implicitly argue that they are in the public national conflicts. As Barnard points out, these documents interest. O’Donovan et al. [9;], and Oei and Ring [:2] Identify are controversial [;]. However, none of the studied works harms resulting from the data being released while Omartian discussed the ethics of their research. Some consider Manning provides evidence for tax laws that provide more Justice [:4]. as a traitor while others consider her as a freedom fighter. In McGregor et al. use the successful collaborative investiga- the research community, there is no consensus as to whether tion into the Panama Papers that the ICIJ conducted as a publishing results based on these documents is ethical, and case study of secure collaboration [96]. They used a survey in general, authors prefer not to confront the question. of the journalists who used ICIJ’s systems and IRB approved interviews of ICIJ’s staff but did not analyse the content of 6.7.4 Snowden’s NSA data leak. In 4235 Edward Snow- the Panama Papers. They detail many of the safeguards den, a contractor for the USA’s National Security Agency used to protect the data and the investigation. (NSA), leaked large amounts of NSA and GCHQ data to journalists [:]. He was the latest in a long line of NSA leakers 6.7 Classified materials who have revealed various aspects of NSA programmes [344]. In this section we cover two well-known leaks of classified Landau provides an overview of the data that was revealed by information: Manning’s Wikileaks dump and Snowden’s NSA Snowden, covering early leaks [86] and later leaks [85]. She data leak. In both cases, classified documents from the USA criticises the ethics of some of the leaks since “the specifics government detailing war decisions, espionage or diplomatic on China had little to do with privacy and security of indi- activities were publicly leaked. viduals”, but is mostly critical of the NSA/GCHQ and the USA and UK governments, and she is mostly positive about 6.7.3 Manning’s WikiLeaks dump. In 4232 Chelsea (then Snowden’s actions. Bradley) Manning leaked 922 222 documents and diplomatic Several uses of the Snowden leaks make no mention of the cables from the USA’s government systems to WikiLeaks [95]. ethical considerations of doing so, but are implicitly critical This information was available to more than three million of the ethics of the activities exposed. In a newspaper article, USA government employees [87] and so it is likely that other Schneier uses documents leaked by Snowden to explain how parties such as Russia and China that have large intelligence the NSA unconditionally exploits Tor users’ browsers to agencies would already have had access to at least some of this install implants that exfiltrate data [;7]. In a magazine article information. Originally WikiLeaks shared the unredacted he argues that the metadata collection that was exposed documents with carefully selected journalists. However, later represents ubiquitous surveillance of everyone [;8]. RFC 9846 journalists published what they believed to be a temporary uses the Snowden leaks to inform a threat model for pervasive encryption password only to discover that a copy of the surveillance, in order to inform protocol design, such that archive encrypted under that password had been shared on the activities detailed in the Snowden leaks would be more BitTorrent [9]. Hence, the full and unredacted cables were difficult in future [32]. Lustgarten argued in 4237 that the publicly released. American Psychological Association had not taken account of The use of WikiLeaks diplomatic cables and documents the Snowden leaks in its ‘Ethics Code’ and ‘Record Keeping in the academy is controversial. For example, professors Guidelines’ as the leaks showed that the NSA would have IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al. access to client data stored on cloud servers. Clinicians were to be highly compartmentalised Top Secret (source code for responsible for protecting this client data, and had legal weaponised zero day exploits), but, due to the fact that if it protections against enforced disclosure. This then raises were classified it could not be deployed on enemy systems, ethical concerns for psychologists, as their clients would not it was unclassified. Additionally, as the USA government have given informed consent for access to their data by the cannot own copyrights, the source code for this state level NSA [89]. malware was not even protected by copyright [7]. The lack Others used the fact that the leaks had happened and their of legal protections might make it easier for researchers to impact rather than the actual content of the leaks for their work with this data. research. Preibusch used Snowden’s revelations to conduct an experiment on the privacy behaviour of people after a major privacy incident; he found little change in user behaviour [:9]. There has been substantial discussion of whether Snow- 7 ANALYSIS den’s actions were legal and ethical and whether the NSA’s In this section we analyse the case studies with respect to activities were legal or ethical. Scheuerman argues that the ethical and legal considerations described in Sections 4.3 Snowden’s actions were ethical civil disobedience and serve and 5. We first discuss common justifications made by au- as an example of correct behaviour [;6]. Kadidal describes thors to justify the use (or not) of these data. The main goal the impact on civil liberties of mass surveillance based on is to understand how the authors approached the legal and what Snowden (and others) revealed [75]. Walsh and Miller ethical issues as well as the justifications, safeguards, harms, provide an ethical and policy analysis of intelligence agency and benefits. A summary of this analysis is shown in Table 3. activity on the basis of Snowden’s revealing what current The acronyms used for safeguards, harms and benefits in the practice was [33:]. Lucas discusses the application of the table are given in brackets below. principle of informed consent to mass surveillance as revealed by Snowden, suggesting that revealing the outline of what kind of activity is being conducted to the public is necessary for the public to consent to it [88]. Barnett argues that 7.3 Justifications the NSA’s activities revealed in the Snowden leaks were un- We have observed common justifications made by the re- constitutional under the fourth amendment [33]. Inkster, a searchers regarding ethical issues in the case studies, which former British Secret Intelligence Service Director, provides a are summarized in Table 3. Here, we describe them and counter-narrative, claiming that the exposed activities were provide comments in italics. not illegal, but rather entirely proper, in particular he states Not the first Previous research using these data was pub- that collecting data on everyone is not mass surveillance if lished and peer-reviewed, and so our work must be ethical. this data is only processed by computer programs and not This is a poor argument: not all published work is ethical un- read by humans [73]. He implicitly argues that the newspa- der current norms, and while the work that was published may pers that published the leaked data acted unethically in doing have been ethical, if your work does something different with so and Snowden’s actions are clearly depicted as unethical these data then that requires its own justification. Public and illegal. data Since these data are publicly available, anything we do Since the leaked data was classified, and much of it remains with these data is ethical. The ethics of the work must still classified despite being publicly available, the use of it for be considered and in some cases REB review may still be re- research or in court cases may have unexpected difficulties. quired [54]. Researchers may develop or apply new techniques For example, in 4237 Barton Gellman gave a talk at Purdue to public data that, for example, deanonymise these data, University that included some classified NSA slides. Pur- and this may cause harm [6]. No additional harm Any due University had a facility security clearance to perform harms that might arise have already occurred and therefore classified USA government research and this incident was our work produces benefits and no (or negligible) additional treated as a classified information spillage. As a result, the harm it is ethical. For there to be no additional harms the video recording of the talk was destroyed [58]. There is thus research should not identify any natural persons and data an additional risk for researchers at institutions with facility may need to be stored and managed securely. In some cases security clearances as if they work with leaked classified data any use of the data of illicit origin is considered additional then they may find that all their resulting work is destroyed harm, for example with images of child abuse, every viewing is by facility authorities. In 4239 the UK government was considered additional abuse of the victim. Fight malicious considering making it an offence to obtain sensitive infor- use These data are already used by malicious actors, and so mation, and if that became law then any researcher using we also need to use these data to defend against them. If leaked classified data could be imprisoned, possibly for up researchers can use the same data to prevent or reduce harm to 36 years [56]. In the USA court cases have been thrown caused by malicious actors, without creating greater harm by out because the evidence that the government had the sup- doing so, then it may be ethical to do so. Necessary data plicants under surveillance was classified (even though the This research cannot be conducted without using this data. supplicants had that evidence) [75]. In contrast, the Vault 9 This might be a good justification if there is sufficient benefit leak of CIA data [8:] contained data that might be expected to the work (Public interest) and there is no additional harm. Ethical issues in research using datasets of illicit origin IMC ’39, November 3–5, 4239, London, UK

7.4 Safeguards 7.6 Benefits When dealing with leaked data, holders must take care as to In this section, we enumerate academic and social benefits how it is processed and stored so as to avoid further disclosure particular to research using data of illicit origin. Repro- of sensitive information. Here we analyse the actions taken ducibility (R) The data allows the comparison of different by the researchers to maintain the confidentiality, integrity algorithms or tools. This is the great benefit of data sharing, and privacy of these data and the stakeholders. Secure but Controlled Sharing is required when these data con- storage (SS) to protect the integrity and confidentiality tains sensitive information. Uniqueness (U) Data is either of these data maintained, e.g. by means of encryption and unique (cannot be obtained through other means) or histori- access control to avoid accidental leakage. Privacy (P) No cal (can no longer be obtained by other means), so similar deanonymisation is attempted and no identities are revealed. measurements on the same topic are hard or even impossible Controlled Sharing (CS) Researchers publish only partial to attain. This only becomes a benefit if the data is also / anonymised data, provide it under legal agreements that useful. Defence Mechanisms (DM) Data can be used to prevent harms or do not make these data publicly available. study the underground economy, new forms of cybercrime This includes approaches such as letting researchers visit the or new attack techniques. This allows new defences to be institution holding the data to analyse it, or the holding designed, such as anti-malware tools or efficient password institution performing analysis on behalf of other researchers, policies. Anthropology and Transparency (AT) Data such as by running their code. contains ground truth on the behaviour of human beings, which other methods could only obtain in a filtered or biased way. For example, data can reveal real human behaviour when creating passwords without the reporting bias of sur- veys or experiments with human participants. Additionally, 7.5 Harms data can provide transparency through information that aids Conducting research using data containing sensitive informa- understanding of government surveillance activities, exter- tion entails risks dependant on the consequences of the data nal relationships, or of company behaviour. The additional gathering, the research itself, or any further leakage of these transparency into state or corporate actors may have greater data maintained by the researchers. In general only harms benefits than if the data concerned individuals, as it may have to natural persons (rather than corporate persons), or to the additional public benefit by providing checks and balances environment, are considered during ethical review. Illicit on power. measurement (I) Research obtained these data by means of illicit activities such as hacking or paying the offenders, which can lead to researchers being prosecuted. Potential 7.7 Discussion Abuse (PA) Research results from these data can be used We observe a wide variation in the ethical issues mentioned by malicious actors to cause additional harm, for example by by the authors and their justifications for using these data, means of designing evasive malware or updating password even when they are using the same data. This is clear from cracking policies. De-Anonymization (DA) Research on studying Table 3. Two works stated that they were exempt these data can be used to de-anonymise or re-identify people from REB approval, two received REB approval and 46 did or networks. Also, identification of group of individuals may not mention REBs. The reasons given for exemption were: raise ethical concerns such as discrimination or violence to- no human subjects or ethical concerns [332]; no personally wards identified groups [57]. Sensitive Information (SI) identifiable information (despite email addresses and IPs) These data contains sensitive and private information, which and public data [77]. Both of these works used Safeguards to can be used to harm natural persons. For example, if the mitigate potential Harms and have clear ethical justifications. user password from one service is leaked, their credentials to These exemptions are all based on the absence of direct other services can be compromised due to password reuse [46]. human subjects. However, in each case they were measuring Researcher Harm (RH) The research can lead to the re- human behaviour and if they had tried to identify individuals searchers being prosecuted by law enforcement, since these they might have been successful. The absence of human data may include illegal material. Researchers could be subjects appears an artificial distinction in these cases, as threatened by criminals, e.g. in underground forums [94], there were human participants. Both of the papers that or by state or industry actors that dislike the work. There received REB approval [79, 46] obtained it not because of may also be a risk of emotional trauma to researchers if their usage of data of illicit origin, but because they also they come across distressing content, such as pornography conducted surveys or other human subject research. Not or violence, during the work. Behavioural Change (BC) having REB approval solely on the basis that data is public The research can change the behaviour of the stakeholders is contrary to the opinions of experts [54], since this data of these data, which may have negative consequences. For may contain private information. example, a market vendor can provide fake information if Explicit ethics sections were included in 34 of the 4: pa- she knows that she is being measured [323]. Alternatively pers. However, since we specifically selected papers for this the research may encourage future collection or use of data table because they talked about ethics, this is unlikely to of illicit origin. be representative. Nonetheless, it does show that a high IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al.

Legal issues Ethical issues Justifications XX

Sources 42 Harms Justice Benefits Category Reference Terrorism Copyright Safeguards Safeguards Year Public data Not the first Data privacy Ethics section REB approval Identify harms Public interest Necessary data Indecent images National security Computer misuse No additional harm Fight malicious use Identification of stakeholders AT&T database [328]a 32 ••    I,PA,SI,RH Pushdo/Cutwail botnet [325] 33 •••    R,U,DM 52 exploit kits [7:] 35 ••    DM,AT Carna Scan [3:]a 35 •    [92] 35 •    P,CS PA [84]a 36 •    Malware & exploitation [49]b 36 •   ∅ RH,BC 373 malware pieces [39] 38 ••    CS R,U,AT MS + 4 others [343] 2; ••    SS,P,CS SI,BC R,DM e MS,RY + 6 others [79] 34 ••    P SI DM MS,YV,FB + 9 others [46] 36 ••    P SI DM,AT

dumps MS,RY,YV [336] 37 ••    P SI DM Password MS,RY,FB [53] 37 ••    SS,P SI DM 8 underground forums [98] 33 •••••    U,DM,AT 5 carding forums [345] 35 ••••    DM,AT TwBooter [76] 35 •••    P SI TwBooter, 36 others [;5] 35 •••    P SI Asylum, Lizard, Vdos [77] 37 •••    E P SI Leaked c  G  databases Patreon [:7] 38 ••• ∅ SI,RH U,AT Vdos, CMDBooter [332] 39 ••    E P,CS SI,BC U,AT 6 underground forums [:8] 39 •••••    R,DM,AT Manning Wikileaks [34] 37 ••••    [;]a 38 ••••    [327] 39 ••••    Snowden NSA leaks [86] 35 •••••    [;7]a 35 •••••    materials Classified [32] 37 •••••   d  [33:] 38 •••••    Mossack Fonseca [:4] 38 ••••    DM database [9;] 38 ••••    BC data Financial a These works were not peer reviewed. b This paper analysed the ethics of the Carna scan and its use, but did not use it. c The authors did not use the leaked database. d Here the argument is that the NSA is the malicious actor. e MS: MySpace, RY: RockYou, FB: Facebook, YV: Yahoo Voices Table 3: Summary of the legal/ethical issues and the justifications made by the authors for each paper. Legal issues are defined in §5, ethical issues in §4.3 and justifications in §7.3. • means that the legal issue is applicable (even if it was not discussed).  means the issue was discussed or the justification was used,  means it was not. G means that the authors decided that the work could not be justified and did not use the dataset. In the REB approval column, ∅ means not applicable and E means exempt. Safeguards: SS Secure Storage, P Privacy, CS Controlled Sharing. Harms: I Illicit measurement, PA Potential Abuse, DA De-Anonymization, SI Sensitive Information, RH Researcher Harm, BC Behavioural Change. Benefits: R Reproducibility, U Uniqueness, DM Defence Mechanisms, AT Anthropology and Transparency. Ethical issues in research using datasets of illicit origin IMC ’39, November 3–5, 4239, London, UK proportion of papers using data of illicit origin do already in Information Communication Technology Research (ICTR), have ethics sections. We do not have enough information and to provide timely and well informed responses. to show any trend in this behaviour, and because we would In any case, papers using data of illicit origin should al- expect this behaviour to be field dependent, we would need ways have an ethics section, explaining how these data were a large representative sample from each field to be able to obtained, how it has been protected, analysing the harms, show any trend. benefits, and need for using such data. Conferences and Discussion of safeguards, harms and benefits in the papers journals should explicitly require such ethics sections and is highly variable. We included those that were implicitly or highlight the importance of considering ethics in their Calls explicitly discussed in the papers. However, we are aware for Papers. Researchers using data that has been shared that many more would apply but they were not mentioned with them under acceptable usage policies should, as Allman explicitly. In general, we observe that researchers appear to and Paxson suggest, cite the acceptable usage policy they be more reluctant to express the potential harms resulting are operating under [6]. Data providers should make their from their work than their benefits. acceptable usage policies publicly available so that they can Privacy preservation is one of the safeguards applied most be cited. Additionally, current efforts to advise on ethics, frequently, since it is easy to do (e.g. by refusing to attempt such as the Menlo Report [4:], should be updated and ex- to de-anomymize a data set or refusing to reveal private tended to provide a more comprehensive coverage of ICTR information). Only four of the papers discussed controlled and legal considerations to guide researchers aiming to use sharing (CS). To provide the maximum benefit, papers need datasets of illicit origin. Ethics is an issue that is receiving to be reproducible and so while open data is often not appro- greater interest within ICTR and standards are improving. priate for data of illicit origin, controlled sharing of data with Hence, we are hopeful that in the future better information researchers is important. It is possible the authors would on current practice, and better guidance, will be available. be willing to share if they were asked, but mentioning this increases the likelihood it will happen. There is a cost in ACKNOWLEDGMENTS controlled sharing as it places a burden on the authors of Daniel R. Thomas is supported by a grant from ThreatSTOP the work to put a robust legal framework in place and an Inc. All authors are supported by the EPSRC [grant number ongoing cost of vetting requests to access the data. However, EP/M242542/3]. The opinions, findings, and conclusions or this cost can be delegated, for example using initiatives such recommendations expressed are those of the authors and do as IMPACT [6;] in the USA or the Cambridge Cybercrime not necessarily reflect those of any of the funders. Thanks to Centre [42] in the UK. José Jair Santanna, Alan Blackwell, Justin Schlosberg, Brian Trammell (our shepherd) and to the anonymous reviewers for helpful comments on this paper. 8 CONCLUSION It is common to see research using data of illicit origin, such REFERENCES as leaked databases or classified documents. The use of [3] 3;:6. 3: U.S. Code § 3252 – Fraud and related activity in these data provide researchers with the opportunity to test connection with computers. their hypotheses against ground truth data. For example, it https://www.law.cornell.edu/uscode/text/3:/3252. improves our understanding of the evolution of malware or [4] 4225. 3: U.S. Code § 3688A – Obscene visual how users re-use passwords. This in turn helps to improve representations of the sexual abuse of children. (Apr. 52, cyber defenses or password policies. However, these data 4225). were originally collected by illicit means, and the research https://www.law.cornell.edu/uscode/text/3:/3688A. [5] Charu C. Aggarwal. 4227. On k-anonymity and the curse community must be aware of the ethical considerations of of dimensionality. In Proceedings of the 53st International using such data. Conference on Very Large Data Bases (VLDB ’27). By analysing current advice and previous work on ethics, VLDB Endowment, Trondheim, Norway, ;23–;2;. isbn: we have drawn out a set of ethical issues that should be 3-7;7;5-376-8. considered and reported when publishing results based on [6] Mark Alllman and Vern Paxson. 4229. Issues and data of illicit origin. However, we have shown that there is etiquette concerning use of shared measurement data. In a lack of consistency in the consideration of ethical issues, Proceedings of the 9th ACM SIGCOMM conference on which results in cases where insufficient safeguards are used Internet measurement. ACM, 357–362. to prevent harm. Few authors consulted their institution [7] Julian Assange. 4239. Vault 9: CIA hacking tools revealed. REB, and in most cases they were exempted on the basis WikiLeaks. (Mar. 4239). Retrieved Mar. 9, 4239 from https://wikileaks.org/ciav9p3/, https://archive.fo/iK6kO. that there were no human subjects involved in the research. [8] Associated Press in Manila. 4238. Kill drug dealers and This narrow focus on whether the research involves “human I’ll give you a medal, says Philippines president. subjects”, rather than a risk based analysis of the potential theguardian. Retrieved Sept. 4:, 4239 from https: harms to human participants is unhelpful. If research has //www.theguardian.com/world/4238/jun/27/kill-drug- potential to harm humans, even in absence of direct human dealers-medal-philippines-president-rodrigo-duterte, subjects, REB approval should be sought. REBs may need to https://perma.cc/7KTA-VR7R. adapt to understand the possible affect on humans of research IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al.

[9] James Ball. 4233. Unredacted US embassy cables available framework. Retrieved Sept. 4:, 4239 from online after WikiLeaks breach. theguardian. (Sept. 4233). https://www.cambridgecybercrime.uk/data.html, Retrieved Feb. 46, 4239 from https: https://perma.cc/8KM7-K6Q5. //www.theguardian.com/world/4233/sep/23/unredacted- [43] 3;;2. Computer Misuse Act. us-embassy-cables-online, https://archive.fo/ozPM9. http://www.legislation.gov.uk/ukpga/3;;2/3:/contents. [:] James Ball, Julian Borger, and Glenn Greenwald. 4235. [44] Council of the European Union. 4238. Regulation (EU) Revealed: How US and UK spy agencies defeat Internet 4238/89; of the European Parliament and of the Council privacy and security. theguardian. (Sept. 4235). Retrieved on the protection of natural persons with regard to the Feb. 45, 4239 from processing of personal data and on the free movement of https://www.theguardian.com/world/4235/sep/27/nsa- such data, and repealing Directive ;7/68/EC (General gchq-encryption-codes-security, Data Protection Regulation). European Union, (4238). https://archive.fo/;82QN. http://eur-lex.europa.eu/legal- [;] Tjaart Barnard. 4238. A cold relationship: United States content/EN/TXT/PDF/?uri=CELEX: foreign policy towards South Africa, 3;82–3;;2. 54238R289;&from=EN. Ph.D. Dissertation. Stellenbosch University. [45] British Society of Criminology. 4237. Statement of ethics. [32] Richard Barnes, Bruce Schneier, Cullen Jennings, Retrieved Sept. 4:, 4239 from Ted Hardie, Brian Trammell, Christian Huitema, and http://www.britsoccrim.org/ethics/, Daniel Borkmann. 4237. RFC 9846: Confidentiality in the https://perma.cc/K5MY-UG7U. face of pervasive surveillance: A threat model and [46] Anupam Das, Joseph Bonneau, Matthew Caesar, problem statement. Informational RFC. Internet Nikita Borisov, and XiaoFeng Wang. 4236. The tangled Architecture Board (IAB), (Aug. 4237), 3–46. web of password reuse. In Network and Distributed [33] Randy Barnett. 4237. Why the NSA data seizures are System Security Symposium (NDSS), 45–48. doi: unconstitutional. Harvard Journal of Law & Public 32.36944/ndss.4236.45579. Policy, 5:, 3, 5–42. [47] Jessica Deahl. 4233. Professors differ on ethics of using [34] Andrea Berger. 4237. North Korea in the global arms WikiLeaks cables. NPR. (Feb. 9, 4233). Retrieved Mar. 4;, market. Whitehall Papers, :6, 3, 34–56. Taylor & Francis. 4239 from [35] Joseph Bonneau. 4234. Guessing human-chosen secrets. http://www.npr.org/4233/24/29/355556524/professors- Tech. rep. :3;. University of Cambridge, Computer differ-on-ethics-of-using-wikileaks-cables, Laboratory, (May 4234). https://archive.fo/Q89:4. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR- [48] David Dittrich, Michael Bailey, and Erin Kenneally. 4235. :3;.html. Applying ethical principles to information and [36] Gwern Branwen et al. 4237. Dark net market archives, communication technology research: A companion to the 4233-4237. (July 4237). Retrieved Mar. 36, 4239 from Menlo Report. Tech. rep. U.S. Department of Homeland http://www.gwern.net/DNM%42archives, Security, (Oct. 4235). doi: 32.435;/ssrn.4564258. https://perma.cc/4Z98-JJ8W. [49] David Dittrich, Katherine Carpenter, and Manish Karir. [37] Aaron J. Burstein. 422:. Conducting cybersecurity 4236. An ethical examination of the Internet Census 4234 research legally and ethically. In Usenix Workshop on dataset: A Menlo report case study. In IEEE Large-Scale Exploits and Emergent Threats (LEET). International Symposium on Ethics in Science, USENIX, ::3–:::. Technology and Engineering (ETHICS). IEEE, (May [38] Aylin Caliskan-Islam, Richard Harang, Andrew Liu, 4236). doi: 32.332;/ETHICS.4236.8:;5638. Arvind Narayanan, Clare Voss, Fabian Yamaguchi, and [4:] David Dittrich and Erin Kenneally. 4234. The Menlo Rachel Greenstadt. 4237. De-anonymizing programmers Report: Ethical principles guiding information and via code stylometry. In 46th USENIX Security communication technology research. Tech. rep. U.S. Symposium (USENIX Security), Washington, DC. Department of Homeland Security, (Aug. 4234). doi: [39] Alejandro Calleja, Juan Tapiador, and Juan Caballero. 32.435;/ssrn.4667324. 4238. A look into 52 years of malware development from [4;] David Dittrich, Felix Leder, and Tillmann Werner. 4232. a software metrics perspective. In International A case study in ethical decision making regarding remote Symposium on Research in Attacks, Intrusions, and mitigation of botnets. International Conference on Defenses (RAID). Springer, 547–567. Financial Cryptography and Data Security (FC), LNCS [3:] 4235. Carna Botnet Scans. CAIDA website. (May 4:, 8276, 438–452. doi: 32.3229/;9:-5-864-36;;4-6_42. 4235). Retrieved Feb. 39, 4239 from [52] John E. Dunn. 4235. AT&T hacker ‘’ sentenced to 63 https://www.caida.org/research/security/carna/, months for iPad leak. techworld. (Mar. 4235). Retrieved https://archive.fo/uTygt. Feb. 39, 4239 from [3;] Jonathan Cave. 4238. The ethics of data and of data http://www.techworld.com/news/security/att-hacker- science: an economist’s perspective. Philosophical weev-sentenced-63-months-for-ipad-leak-5657;:7/, Transactions of the Royal Society of London A: https://archive.fo/NCASE. Mathematical, Physical and Engineering Sciences, 596, [53] Markus Dürmuth, Fabian Angelstorf, Claude Castelluccia, 42:5. The Royal Society. issn: 3586-725X. doi: Daniele Perito, and Abdelberi Chaabane. 4237. OMEN: 32.32;:/rsta.4238.2339. Faster password guessing using an ordered Markov [42] Richard Clayton, Julia Powles, and Cambridge University enumerator. In International Symposium on Engineering Legal. 4238. Cambridge Cybercrime Centre: Legal Secure Software and Systems. Springer, 33;–354. Ethical issues in research using datasets of illicit origin IMC ’39, November 3–5, 4239, London, UK

[54] Serge Egelman, Joseph Bonneau, Sonia Chiasson, Electronic Crime Research (eCrime). IEEE, Scottsdale, David Dittrich, and Stuart Schechter. 4234. It’s not AZ, USA. doi: 32.332;/ECRIME.4239.9;67274. stealing if you need it: A panel on the ethics of performing [68] Alice Hutchings and Richard Clayton. 4238. Exploring research using public data of illicit origin. In International the Provision of Online Booter Services. Deviant Conference on Financial Cryptography and Data Security Behaviour, 59, 32, 3385–339:. Taylor & Francis. issn: (FC). Vol. LNCS 95;:, 346–354. doi: 2385-;847. doi: 32.32:2/2385;847.4238.338;:4;. 32.3229/;9:-5-864-5685:-7_33. [69] Alice Hutchings and Thomas J. Holt. 4237. A crime script [55] Charles Ess and AoIR ethics working Committee. 4224. analysis of the online stolen data market. British Journal Ethical decision-making and Internet research. of Criminology, 77, 5, 7;8–836. issn: 3686574;. doi: Association of Internet Research Ethics Working 32.32;5/bjc/azu328. Committee. (Nov. 4224). Retrieved Sept. 4:, 4239 from [6:] M. Ilešič, A. Prechal, A. Rosas, C. Toader, and http://aoir.org/reports/ethics.pdf, E. Jarašiunas.¯ 4238. Breyer v Germany: Judgement of the https://perma.cc/8;UY-ETC9. court (Second Chamber). InfoCuria - Case-law of the [56] Rob Evans, Ian Cobain, and Nicola Slawson. 4239. Court of Justice. (Oct. 4238). Retrieved Feb. 43, 4239 Government accused of ‘full-frontal attack’ on from http://curia.europa.eu/juris/document/document. whistleblowers. theguardian. (Feb. 4239). Retrieved jsf?docid=3:688:&doclang=EN&cid=32;7733, Sept. 4:, 4239 from https://www.theguardian.com/uk- https://archive.fo/kl5m9. news/4239/feb/34/uk-government-accused-full-frontal- [6;] 4239. IMPACT cyber trust. Retrieved May 37, 4239 from attack-prison-whistleblowers-media-journalists, https://www.impactcybertrust.org/, https://archive.fo/ZwdDK. https://perma.cc/Y9FG-;QU9. [57] Luciano Floridi and Mariarosaria Taddeo. 4238. What is [72] Information Commisioners Office (ICO). 4239. Overview data ethics? Philosophical Transactions of the Royal of the General Data Protection Regulation (GDPR). Society of London A: Mathematical, Physical and (4239). Retrieved Aug. 33, 4239 from Engineering Sciences, 596, 42:5. The Royal Society. issn: https://ico.org.uk/for-organisations/data-protection- 3586-725X. doi: 32.32;:/rsta.4238.2582. reform/overview-of-the-gdpr/, [58] Barton Gellman. 4237. I showed leaked NSA slides at https://perma.cc/LZ;W-UQNR. Purdue, so feds demanded the video be destroyed. [73] Nigel Inkster. 4236. The Snowden Revelations: Myths and arstechnica. (Oct. 4237). Retrieved Feb. ;, 4239 from misapprehensions. Survival, 78, 3, 73–82. issn: 225;-855:. https://arstechnica.co.uk/tech-policy/4237/32/i-showed- doi: 32.32:2/225;855:.4236.::4373. leaked-nsa-slides-at-purdue-so-feds-demanded-the-video- [74] Mark Israel. 4226. Strictly confidential? Integrity and the be-destroyed/, https://archive.fo/Fkc:9. disclosure of criminological and socio-legal research. [59] 4235. German criminal code Section 3:6b: Distribution, British Journal of Criminology, 66, 7, 937–962. doi: acquisition and possession of child pornography. (Oct. 32, 32.32;5/bjc/azh255. 4235). https://www.gesetze-im- [75] Shayana Kadidal. 4236. NSA surveillance: The internet.de/englisch_stgb/englisch_stgb.html#p387;. implications for civil liberties. I/S: A Journal of Law and [5:] 4235. German criminal code Section 424a: Data Policy for the Information Society, 32, 4, 655–6:2. espionage. (Oct. 32, 4235). https://www.gesetze-im- Heinonline. internet.de/englisch_stgb/englisch_stgb.html#p396;. [76] Mohammad Karami and Dammon McCoy. 4235. Rent to [5;] 4235. German criminal code Section 485a: Computer pwn: Analyzing commodity booter DDoS services. Usenix fraud. (Oct. 32, 4235). https://www.gesetze-im- login; 5:, 8, 42–45. USENIX. internet.de/englisch_stgb/englisch_stgb.html#p43:9. [77] Mohammad Karami, Youngsam Park, and Damon McCoy. [62] 4235. German criminal code Section 525a: Data 4238. Stress testing the Booters: Understanding and tampering. (Oct. 32, 4235). https://www.gesetze-im- undermining the business of DDoS services. In internet.de/englisch_stgb/englisch_stgb.html#p469;. Proceedings of the 47th International Conference on [63] 4235. German criminal code Section 525b: Computer World Wide Web (WWW). ACM, 3255–3265. doi: sabotage. (Oct. 32, 4235). https://www.gesetze-im- 32.3367/4:94649.4::5226. internet.de/englisch_stgb/englisch_stgb.html#p469;. [78] Brian C. Keegan and J. Nathan Matias. 4237. Actually, [64] Guardian Reporters. 4238. Panama Papers: a special it’s about ethics in computational social science: A investigation. theguardian. (Apr. 4238). Retrieved multi-party risk-benefit framework for online community Feb. 45, 4239 from https: research. In CSCW workshop on human centered data //www.theguardian.com/news/series/panama-papers, science. https://archive.fo/g35Sn. [79] Patrick Gage Kelley, Saranga Komanduri, [65] Thorsten Holz. 4227. A short visit to the bot zoo Michelle L. Mazurek, Richard Shay, Timothy Vidas, [malicious bots software]. IEEE Security & Privacy, 5, 5, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, and 98–9;. IEEE. doi: 32.332;/MSP.4227.7:. Julio Lopez. 4234. Guess again (and again and again): [66] Troy Hunt. 4239. Thoughts on the LeakedSource take Measuring password strength by simulating down. troyhunt.com. (Jan. 4239). Retrieved Mar. 3, 4239 password-cracking algorithms. In IEEE Symposium on from https://www.troyhunt.com/thoughts-on-the- Security and Privacy (SP). IEEE. San Francisco, CA, leakedsource-take-down/, https://archive.fo/xSMBe. USA, 745–759. doi: 32.332;/SP.4234.5:. [67] Alice Hutchings and Richard Clayton. 4239. Configuring [7:] Vadim Kotov and Fabio Massacci. 4235. Anatomy of Zeus: A case study of online crime target selection and exploit kits. In International Symposium on Engineering knowledge transmission. In APWG Symposium on Secure Software and Systems (ESSoS). Springer, 3:3–3;8. doi: 32.3229/;9:-5-864-58785-:_35. IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al.

[7;] Brian Krebs. 4232. Fraud bazaar Carders.cc hacked. Policy, 57, :6–;3. Elsevier. doi: KrebsonSecurity. (May 4232). Retrieved Apr. 32, 4239 32.3238/j.drugpo.4238.27.228. from https://krebsonsecurity.com/4232/27/fraud-bazaar- [95] Ciara McCarthy. 4239. Chelsea Manning thanks Barack carders-cc-hacked/, https://perma.cc/P4JT-HJGJ. Obama for ‘giving me a chance’. theguardian. (Jan. 4239). [82] Brian Krebs. 4238. Source code for IoT botnet ‘Mirai’ Retrieved Feb. 46, 4239 from https: released. KrebsonSecurity. (Oct. 4238). Retrieved Feb. 45, //www.theguardian.com/us-news/4239/jan/3;/chelsea- 4239 from https://krebsonsecurity.com/4238/32/source- manning-thanks-barack-obama-commuted-sentence, code-for-iot-botnet-mirai-released/, https://archive.fo/jFtmJ. https://archive.fo/cNPLI. [96] Susan E. McGregor, Elizabeth Anne Watkins, [83] Brian Krebs. 4239. Who ran Leakedsource.com? Nasrullah Al-Ameen, Kelly Caine, and KrebsonSecurity. (Feb. 4239). Retrieved Feb. 39, 4239 Franziska Roesner. 4239. When the weakest link is strong: from https://krebsonsecurity.com/4239/24/who-ran- Secure collaboration in the case of the Panama Papers. In leakedsource-com/, https://archive.fo/;ZHIQ. 48th USENIX Security Symposium (USENIX Security). [84] Thomas Krenc, Oliver Hohlfeld, and Anja Feldmann. USENIX, Vancouver, BC, Canada, (Aug. 4239), 727–744. 4236. An Internet census taken by an illegal botnet — A isbn: ;9:3;53;9362;. qualitative assessment of published measurements. ACM [97] Tyler Moore and Richard Clayton. 4234. Ethical SIGCOMM Computer Communication Review, 66, 5, dilemmas in take-down research. Financial Cryptography 325–333. ACM. doi: 32.3367/4878:99.4878:;5. and Data Security (FC 4233), LNCS 9348, 376–38:. doi: [85] Susan Landau. 4236. Highlights from making sense of 32.3229/;9:-5-864-4;::;-;_36. Snowden, Part II: What’s significant in the NSA [98] Marti Motoyama, Damon McCoy, Kirill Levchenko, revelations. IEEE Security & Privacy, 34, 3, 84–86. IEEE. Stefan Savage, and Geoffrey M. Voelker. 4233. An analysis doi: 32.332;/MSP.4235.383. of underground forums. In Proceedings of the ACM [86] Susan Landau. 4235. Making sense from Snowden: What’s SIGCOMM Internet measurement conference (IMC). significant in the NSA surveillance revelations. IEEE ACM, 93–:2. Security & Privacy, 33, 6, 76–85. IEEE. doi: [99] Richard Murphy. 4237. The Joy of Tax. Bantam Press. 32.332;/MSP.4235.;2. isbn: ;9:-27;5297395. [87] David Leigh. 4232. US embassy cables leak sparks global [9:] Arvind Narayanan and Bendert Zevenbergen. 4237. No diplomatic crisis. theguardian. (Nov. 4232). Retrieved encore for Encore? Ethical questions for web-based Feb. 45, 4239 from censorship measurement. Technology Science, 3–45. issn: https://www.theguardian.com/world/4232/nov/4:/us- 3778-728:. doi: 32.435;/ssrn.488736:. embassy-cable-leak-diplomacy-crisis, [9;] James O’Donovan, Hannes Wagner, and Stefan Zeume. https://archive.fo/TQz2I. 4239. The Value of Offshore Secrets – Evidence from the [88] George R. Lucas. 4236. NSA management directive #646: Panama Papers. https://ssrn.com/abstract=49932;7. Secrecy and privacy in the aftermath of Edward Snowden. [:2] Shu-Yi Oei and Diane M. Ring. 423:. Leak-driven law. Ethics & International Affairs, 4:, 3, 4;–5:. doi: UCLA Law Review, 87. doi: 32.435;/ssrn.4;3:772. 32.3239/S2:;489;6352226::. [:3] Paul Ohm, Douglas C. Sicker, and Dirk Grunwald. 4229. [89] Samuel D. Lustgarten. 4237. Emerging ethical threats to Legal issues surrounding monitoring during network client privacy in cloud communication and data storage. research. Proceedings of the 9th ACM SIGCOMM Professional Psychology: Research and Practice, 68, 5, Internet measurement conference (IMC), 363–36:. doi: 376. American Psychological Association. 32.3367/34;:528.34;:529. [8:] Ewen MacAskill. 4239. WikiLeaks publishes ‘biggest ever [:4] Jim Omartian. 4238. Tax information exchange and leak of secret CIA documents’. theguardian. (Mar. 4239). offshore entities: Evidence from the Panama Papers. doi: Retrieved Mar. 9, 4239 from https: 32.435;/ssrn.4:58857. //www.theguardian.com/media/4239/mar/29/wikileaks- [:5] Craig Partridge and Mark Allman. 4238. Ethical publishes-biggest-ever-leak-of-secret-cia-documents- considerations in network measurement papers. hacking-surveillance, https://archive.fo/Ce:2x. Communications of the ACM, 7;, 32, 7:–86. doi: [8;] Mitch MacDonald, Richard Frank, Joseph Mei, and 32.3367/4:;8:38. Bryan Monk. 4237. Identifying digital threats in a hacker [:6] David Pegg. 4238. Panama Papers: Europol links 5,722 web forum. In Advances in Social Networks Analysis and names to suspected criminals. theguardian. (Dec. 4238). Mining (ASONAM), 4237 IEEE/ACM International Retrieved Feb. 45, 4239 from https: Conference on. IEEE, ;48–;55. //www.theguardian.com/news/4238/dec/23/panama- [92] Erwan Le Malécot and Daisuke Inoue. 4236. The Carna papers-europol-links-5722-names-to-suspected-criminals, Botnet through the lens of a network telescope. https://archive.fo/NGkLj. Foundations and Practice of Security, LNCS :574, [:7] Nathaniel Poor and Roei Davidson. 4238. The ethics of 648–663. Springer. doi: 32.3229/;9:-5-53;-27524-:_48. using hacked data: Patreon’s data hack and academic [93] Annette Markham and Elizabeth Buchanan. 4234. Ethical data standards. Data and Society. Tech. rep. Council for decision-making and Internet research: Recommendations big data, ethics and society, (Mar. 4238), 3–9. Retrieved from the AoIR Ethics Working Committee (Version 4.2). Sept. 4:, 4239 from http://bdes.datasociety.net/council- AOIR, (Aug. 4234), 3–3;. output/case-study-the-ethics-of-using-hacked-data- http://www.aoir.org/documents/ethics-guide. patreons-data-hack-and-academic-data-standards/, [94] James Martin and Nicolas Christin. 4238. Ethics in https://perma.cc/H:T8-QPYK. cryptomarket research. International Journal of Drug Ethical issues in research using datasets of illicit origin IMC ’39, November 3–5, 4239, London, UK

[:8] Rebecca Portnoff, Sadia Afroz, Greg Durrett, (Oct. 4235). Retrieved Feb. 39, 4239 from Jonathan Kummerfeld, Taylor Berg-Kirkpatrick, https://insights.sei.cmu.edu/cert/4235/32/working-with- Damon McCoy, Kirill Levchenko, and Vern Paxson. 4239. the-internet-census-4234.html, https://archive.fo/gOO9R. Tools for automated analysis of cybercriminal markets. In [322] Christopher Soghoian. 422:. Legal risks for phishing Proceedings of 48th International World Wide Web researchers. eCrime Researchers Summit. IEEE. issn: conference (WWW). ACM, 879–888. doi: 3778-728:. doi: 32.332;/ECRIME.422:.68;8;93. 32.3367/525:;34.5274822. [323] Kyle Soska and Nicolas Christin. 4237. Measuring the [:9] Sören Preibusch. 4237. Privacy Behaviors After Snowden. longitudinal evolution of the online anonymous Communications of the ACM, 7:, 7, 6:–77. ACM. doi: marketplace ecosystem. 46th USENIX Security 32.3367/4885563. Symposium, August, 55–6:. [::] 3;9:. Protection of Children Act. [324] Eugene H. Spafford. 3;;4. Are computer hacker break-ins http://www.legislation.gov.uk/ukpga/3;9:/59/contents. ethical? The Journal of Systems and Software, 39, 3, [:;] Ronald E. Rice and Christine L. Borgman. 3;:5. The use 63–69. doi: 32.3238/2386-3434(;4);229;-Y. of computer-monitored data in information science and [325] Brett Stone-Gross, Thorsten Holz, Gianluca Stringhini, communication research. Journal of the American Society and Giovanni Vigna. 4233. The underground economy of for Information Science and Technology, 56, 6, 469–478. spam: A botmaster’s perspective of coordinating doi: 32.3224/asi.6852562626. large-scale spam campaigns. Large-scale exploits and [;2] Risk-based-security. 4238. Nulled.IO: Should’ve expected emergent threats (LEET). the unexpected! Risk Based Security. (May 4238). [326] Alexandra Strigunkova. 4237. “To whom it’s not concern” Retrieved Apr. 4:, 4239 from (ethical problems of information leaks research). https://www.riskbasedsecurity.com/4238/27/nulled-io- PasswordsCon. Retrieved Feb. 45, 4239 from shouldve-expected-the-unexpected/. https://www.youtube.com/watch?v=XM7MO29fvtM. [;3] Bob Rudis and Deral Heiland. 4238. From Carna to Mirai: Video. Recovering from a lost opportunity. DarkReading. [327] Luca Talarico and Luca Zamparini. 4239. Intermodal (Dec. :, 4238). Retrieved Feb. 44, 4239 from http: transport and international flows of illicit substances: //www.darkreading.com/endpoint/from-carna-to-mirai- Geographical analysis of smuggled goods in Italy. Journal recovering-from-a-lost-opportunity/a/d-id/3549855, of Transport Geography, 82, 3–32. Elsevier. https://archive.fo/4hGOt. [328] Ryan Tate. 4232. Apple’s worst security breach: 336,222 [;4] Sagar Samtani, Ryan Chinn, and Hsinchun Chen. 4237. iPad owners exposed. Gawker. (June 4232). Retrieved Exploring hacker assets in underground forums. In IEEE Feb. 39, 4239 from http://gawker.com/777;568/apples- International Conference on Intelligence and Security worst-security-breach-336222-ipad-owners-exposed, Informatics (ISI). IEEE, 53–58. https://archive.fo/XqVE:. [;5] José Jair Santanna, Romain Durban, Anna Sperotto, and [329] Ryan Tate. 4232. FBI investigating iPad breach (update). Aiko Pras. 4237. Inside booters: An analysis on Gawker. (June 4232). Retrieved Feb. 39, 4239 from http: operational databases. Proceedings of the 4237 //gawker.com/7782764/fbi-investigating-ipad-breach, IFIP/IEEE International Symposium on Integrated https://archive.fo/P3A;d. Network Management, IM 4237, 654–662. doi: [32:] 4222. Terrorism Act. 32.332;/INM.4237.9362542. http://www.legislation.gov.uk/ukpga/4222/33/contents. [;6] William E. Scheuerman. 4236. Whistleblowing as civil [32;] The International consortium of Investigative Journalists disobedience: The case of Edward Snowden. Philosophy & (ICIJ). 4238. The Panama papers: Politicians, criminals Social Criticism, 62, 9, 82;–84:. doi: and the rogue industry that hides their cash. ICIJ website. 32.3399/23;3675936759485. (Apr. 4238). Retrieved Feb. 45, 4239 from [;7] Bruce Schneier. 4235. Attacking Tor: how the NSA https://panamapapers.icij.org/, targets users’ online anonymity. The Guardian, (Oct. https://archive.fo/aC73M. 4235). Retrieved Apr. 5, 4239 from [332] Daniel R. Thomas, Richard Clayton, and https://www.theguardian.com/world/4235/oct/26/tor- Alastair R. Beresford. 4239. 3222 days of UDP attacks-nsa-users-online-anonymity, amplification DDoS attacks. In APWG Symposium on https://archive.fo/gpShR. Electronic Crime Research (eCrime). IEEE, Scottsdale, [;8] Bruce Schneier. 4236. Metadata = Surveillance. IEEE AZ, USA. doi: 32.332;/ECRIME.4239.9;67279. Security and Privacy, 34, 4, :6. IEEE. issn: 3762-9;;5. [333] Lawrence J. Trautman. 4238. Following the money: doi: 32.332;/MSP.4236.4:. Lessons from the Panama Papers. Penn State Law [;9] Khadija Sharife. 4238. Out of luck: The fall of Banco Review, 3–99. https://ssrn.com/abstract=49:5725. Espírito Santo. World Policy Journal, 55, 5, 329–333. [334] United Nations General Assembly. 3;6:. Universal World Policy Institute. doi: 32.3437/29624997-5935299. Declaration of Human Rights. United Nations. http: [;:] Tom Simonite. 4234. Jail looms for man who revealed //www.un.org/en/universal-declaration-human-rights/. AT&T leaked iPad user e-mails. MIT Technology Review. [335] Universities UK. 4234. Oversight of security-sensitive (Nov. 4234). Retrieved Feb. 39, 4239 from research material in UK universities: guidance. (Oct. https://www.technologyreview.com/s/729883/jail-looms- 4234), 38 pages. Retrieved Sept. 4:, 4239 from for-man-who-revealed-att-leaked-ipad-user-e-mails/, http://www.universitiesuk.ac.uk/policy-and- https://archive.fo/s57I:. analysis/reports/Documents/4234/oversight-of-security- [;;] Timur Snoke, Deana Shick, and Angela Horneman. 4235. sensitive-research-material.pdf, Working with the Internet Census 4234. CERT/CC Blog. https://perma.cc/U:N9-KG8X. IMC ’39, November 3–5, 4239, London, UK Daniel R. Thomas et al.

[336] Blase Ur et al. 4237. Measuring real-world accuracies and [33;] Patrick L. Warren. 4238. Research with leaked data. biases in modeling password guessability. In USENIX Society for Institutional & Organizational Economics. Security, 685–6:3. (May 4238). Retrieved Apr. 32, 4239 from [337] Liis Vihul, Christian Czosseck, Katharina Ziolkowski, https://www.sioe.org/news/research-leaked-data, Lauri Aasmann, Ivo A. Ivanov, and https://archive.fo/F;2KP. Sebastian Brüggemann. 4234. Legal Implications of [342] Matt Weir, Sudhir Aggarwal, Michael Collins, and Countering Botnets. Tech. rep. NATO Cooperative Cyber Henry Stern. 4232. Testing metrics for password creation Defence Centre of Excellence, the European Network, and policies by attacking large sets of revealed passwords. In Information Security Agency (ENISA), 3–89. Retrieved Proceedings of the 39th ACM conference on Computer Sept. 4:, 4239 from and communications security (CCS). ACM, 384–397. https://ccdcoe.org/sites/default/files/multimedia/pdf/ [343] Matt Weir, Sudhir Aggarwal, Breno De Medeiros, and VihulCzosseckZiolikowskiAasmannIvanovBr%C5% Bill Glodek. 422;. Password cracking using probabilistic BCggemann4234_ context-free grammars. In 52th IEEE Symposium on LegalImplicationsOfCounteringBotnets_2.pdf, Security and Privacy, 5;3–627. https://perma.cc/8FH8-UGRU. [344] David Murakami Wood and Steve Wright. 4237. Before [338] Wagas. 4238. Hacking, Trading Forum w2rm.ws Hacked; and after Snowden. Surveillance & Society, 35, 4, 354–35:. Exploit Kits, Database Leaked. HackRead. Retrieved issn: 3699-96:9. Sept. 4:, 4239 from https://www.hackread.com/hacking- [345] Michael Yip, Nigel Shadbolt, and Craig Webber. 4235. forum-w2rm-ws-hacked-data-leaked/, Why forums? An empirical analysis into the facilitating https://perma.cc/V5SU-EUWX. factors of carding forums. In Proceedings of the 7th [339] Maciej Walkowski. 4238. The problem of mounting Annual ACM Web Science Conference. ACM, 675–684. income inequalities in the world vis-a-vis the phenomenon doi: 32.3367/4686686.4686746. of harmful tax competition. The ICIJ tracking down the [346] Ben Zhao. 4237. Is it legal, ethical and publishable to do greatest financial scandals of the 43st century. Przegląd academic research using data from the Ashley Madison Politologiczny, 4, 359–376. doi: 32.36968/pp.4238.43.4.33. leak? Quora. (Nov. 4237). Retrieved Apr. 32, 4239 from [33:] Patrick F. Walsh and Seumas Miller. 4238. Rethinking https://www.quora.com/Is-it-legal-ethical-and- ‘Five Eyes’ security intelligence collection policies and publishable-to-do-academic-research-using-data-from- practice post Snowden. Intelligence and National Security, the-Ashley-Madison-leak. 53, 5, 567–58:. Routledge. doi: 32.32:2/248:6749.4236.;;:658.