A Framework for Technology-Assisted Sensitivity Review: Using Sensitivity Classification to Prioritise Documents for Review

Graham McDonald

Submitted in fulfilment of the requirements for the Degree of Doctor of Philosophy

School of Computing Science
College of Science and Engineering
University of Glasgow

March 2019

Abstract

More than a hundred countries implement freedom of information laws. In the UK, the Freedom of Information Act 2000 (c. 36) (FOIA) states that the government's documents must be made freely available, or opened, to the public. Moreover, all central UK government departments' documents that have a historic value, for example the minutes from significant meetings, must be transferred to The National Archives (TNA) within twenty years of the document's creation. However, government documents can contain sensitive information, such as personal information or information that would likely damage the international relations of the UK if it was opened to the public. Therefore, all government documents that are to be publicly archived must be sensitivity reviewed to identify and redact the sensitive information, or close the document until the information is no longer sensitive. Historically, government documents have been stored in a structured file-plan that can reliably inform a sensitivity reviewer about the subject-matter and the likely sensitivities in the documents. However, the lack of structure in digital document collections and the volume of digital documents that are to be sensitivity reviewed mean that the traditional manual sensitivity review process is not practical for digital sensitivity review.

In this thesis, we argue that the automatic classification of documents that contain sensitive information, sensitivity classification, can be deployed to assist government departments and human reviewers to sensitivity review born-digital government documents. However, classifying sensitive information is a complex task, since sensitivity is context-dependent. For example, identifying if information is sensitive or not can require a human to judge the likely effect of releasing the information into the public domain. Moreover, sensitivity is not necessarily topic-oriented, i.e., it is usually dependent on a combination of what is being said and about whom. Furthermore, the vocabulary and entities that are associated with particular types of sensitive information, e.g., confidential information, can vary greatly between different collections.

We propose to address sensitivity classification as a text classification task. Moreover, through a thorough empirical evaluation, we show that text classification is effective for sensitivity classification and can be improved by identifying the vocabulary, syntactic and semantic document features that are reliable indicators of sensitive or non-sensitive text. Furthermore, we propose to reduce the number of documents that have to be reviewed to learn an effective sensitivity classifier through an active learning strategy in which a sensitivity reviewer redacts any sensitive text in a document as they review it, to construct a representation of the sensitivities in a collection.
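To make the approach concrete, the following is a minimal sketch, assuming a scikit-learn environment, of sensitivity classification framed as a binary text classification task with a simple active learning selection step. The document texts and labels are hypothetical placeholders, and the generic uncertainty-sampling criterion merely stands in for the redaction-informed selection strategy developed in this thesis.

# A minimal, illustrative sketch: sensitivity classification as binary text
# classification, plus a generic uncertainty-sampling step that stands in for
# the redaction-informed active learning strategy developed in the thesis.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical reviewed (labelled) and unreviewed documents.
reviewed_docs = ["minutes of a routine departmental meeting",
                 "memo naming a private individual's home address"]
reviewed_labels = [0, 1]  # 0 = non-sensitive, 1 = sensitive
unreviewed_docs = ["draft minutes of an internal meeting",
                   "note naming a private citizen"]

# Represent documents with term (TF-IDF) features.
vectorizer = TfidfVectorizer(stop_words="english")
X_reviewed = vectorizer.fit_transform(reviewed_docs)
X_unreviewed = vectorizer.transform(unreviewed_docs)

# Learn a linear sensitivity classifier from the documents reviewed so far.
classifier = LinearSVC()
classifier.fit(X_reviewed, reviewed_labels)

# Uncertainty sampling: prioritise for review the unreviewed documents whose
# predictions lie closest to the decision boundary (smallest absolute margin).
margins = np.abs(classifier.decision_function(X_unreviewed))
print([unreviewed_docs[i] for i in np.argsort(margins)])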
With this in mind, we propose a novel framework for technology-assisted sensitivity review that can prioritise the most appropriate documents to be reviewed at specific stages of the review process. Furthermore, our framework can provide the reviewers with useful information to assist them in making their reviewing decisions. Our framework consists of four components, namely the Document Representation, Document Prioritisation, Feedback Integration and Learned Predictions components, that can be instantiated to learn from the reviewers' feedback about the sensitivities in a collection or provide assistance to reviewers at different stages of the review. In particular, firstly, the Document Representation component encodes the document features that can be reliable indicators of the sensitivities in a collection. Secondly, the Document Prioritisation component identifies the documents that should be prioritised for review at a particular stage of the reviewing process, for example to provide the sensitivity classifier with information about the sensitivities in the collection or to focus the available reviewing resources on the documents that are the most likely to be released to the public. Thirdly, the Feedback Integration component integrates explicit feedback from a reviewer to construct a representation of the sensitivities in a collection and identify the features of a reviewer's interactions with the framework that indicate the amount of time that is required to sensitivity review a specific document. Finally, the Learned Predictions component combines the information that has been generated by the other three components and, as the final step in each iteration of the sensitivity review process, the Learned Predictions component is responsible for making accurate sensitivity classification and expected reviewing time predictions for the documents that have not yet been sensitivity reviewed.

In this thesis, we identify two realistic digital sensitivity review scenarios as user models and conduct two user studies to evaluate the effectiveness of our proposed framework for assisting digital sensitivity review. Firstly, in the limited review user model, which addresses a scenario in which there are insufficient reviewing resources available to sensitivity review all of the documents in a collection, we show that our proposed framework can increase the number of documents that can be reviewed and released to the public with the available reviewing resources. Secondly, in the exhaustive review user model, which addresses a scenario in which all of the documents in a collection will be manually sensitivity reviewed, we show that providing the reviewers with useful information about the documents in the collection that contain sensitive information can increase the reviewers' accuracy, reviewing speed and agreement.

This is the first thesis to investigate automatically classifying FOIA sensitive information to assist digital sensitivity review. The central contributions of this thesis are our proposed framework for technology-assisted sensitivity review and our sensitivity classification approaches. Our contributions are validated using a collection of government documents that are sensitivity reviewed by expert sensitivity reviewers to identify two FOIA sensitivities, namely international relations and personal information. The thesis draws insights from a thorough evaluation and analysis of our proposed framework and sensitivity classifier. Our results demonstrate that our proposed framework is a viable technology for assisting digital sensitivity review.
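The four components and the iterative review loop that connects them can be summarised by the following structural sketch. All class and method names are invented for illustration and do not correspond to an implementation described in this thesis; each placeholder body simply marks where a concrete instantiation of the component would go.

# A structural sketch of one iteration of the proposed review loop. All names
# are illustrative placeholders, not an implementation from the thesis.
class DocumentRepresentation:
    def encode(self, documents):
        # Encode features that can indicate sensitivity (e.g. terms, syntax, semantics).
        return [{"text": d} for d in documents]

class DocumentPrioritisation:
    def select(self, encoded, budget):
        # Choose which documents should be reviewed next under the current user model.
        return encoded[:budget]

class FeedbackIntegration:
    def integrate(self, reviewed):
        # Fold the reviewer's judgements, redactions and reviewing times into a
        # representation of the sensitivities in the collection.
        return {"labelled": reviewed}

class LearnedPredictions:
    def predict(self, knowledge, remaining):
        # Predict sensitivity and expected reviewing time for unreviewed documents.
        return [{"doc": d, "sensitive": None, "review_time": None} for d in remaining]

def review_iteration(documents, reviewer, budget):
    encoded = DocumentRepresentation().encode(documents)
    batch = DocumentPrioritisation().select(encoded, budget)
    reviewed = [reviewer(doc) for doc in batch]            # explicit reviewer feedback
    knowledge = FeedbackIntegration().integrate(reviewed)
    remaining = [d for d in encoded if d not in batch]
    return LearnedPredictions().predict(knowledge, remaining)

In the intended process, this loop is repeated, with each iteration's predictions informing which documents are prioritised next, until the available reviewing resources are exhausted or the collection has been fully reviewed.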
Acknowledgements

First and foremost I would like to thank my supervisors Iadh Ounis and Craig Macdonald for their endless support and inspiration, for the rigorous and lively debates that kept the work moving forward and for teaching me how to write better. Most importantly, without them and their attention to detail, I would not be in the position of now submitting this thesis.

A great deal of gratitude is due to my external supervisors from The National Archives. Firstly, to Tim Gollins, whose foresight and determination started this ball rolling. I am very happy to be able to submit this work in recognition of your efforts that made the undertaking of this thesis possible. Secondly, to Simon Lovett, whose invaluable experience and insights about sensitive information brought life to the subject. Thirdly, to David Willcox, whose boundless enthusiasm for the project meant that no operational difficulty was ever an obstacle. Lastly, to Anthea Seles and Anna Sexton for stepping into the role to make sure that there was always support from The National Archives. It has been a real pleasure working with all of you.

I would like to express my sincere gratitude to all the anonymous reviewers at The National Archives, the Foreign and Commonwealth Office and beyond who played such a vital role in constructing the test collection for this research. This thesis would not have been possible without all of your efforts. I would also like to thank Douglas W. Oard and Mary Ellen Foster for their insights and thoughtful feedback during my PhD viva. Thanks also to all my friends and colleagues in the TerrierTeam and The University of Glasgow who I have had the pleasure of working with, or just hanging out with, over the years. You have made this a very enjoyable endeavour.

Lastly, I would like to give an extra big thank you to: Mom and John, for always encouraging me to pursue whatever interests me in life; Anita and Martin for all the fun times and fantastic food over the years, you have kept me slightly sane; and Anita for being a true life-long friend.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Introduction
  1.2 Motivation
  1.3 Scope of the Thesis
  1.4 Thesis Statement
  1.5 Contributions
  1.6 Origins of Material
  1.7 Outline of Thesis

2 Background
  2.1 Introduction
  2.2 Text Classification
    2.2.1 Test Collection
    2.2.2 Document Representation
    2.2.3 Feature Reduction
    2.2.4 Effective Classifiers for Text Classification
    2.2.5 Evaluation and Metrics
  2.3 Active Learning
    2.3.1 Selecting Informative Documents
  2.4 Technology-Assisted Review
  2.5 Conclusions

3 Classification of Sensitive Information
  3.1 Introduction
  3.2 Identified Sensitivities
    3.2.1 Section 27: International Relations
    3.2.2 Section 40: Personal Information
  3.3 Previous Approaches for Classifying Sensitive Data
  3.4 A Test Collection for Sensitivity Classification
  3.5 Conclusions

4 A Framework for Technology-Assisted Sensitivity Review
  4.1 Introduction
  4.2 User Models for Technology-Assisted Sensitivity Review
  4.3 Framework Overview
  4.4 Document Representation
  4.5 Document Prioritisation
  4.6 Feedback Integration
  4.7 Learned Predictions
  4.8 Conclusions

5 Sensitivity Classification Baseline
  5.1 Introduction
  5.2 Masking Information that is Supplied in Confidence