Cloak and Swagger: Understanding Data Sensitivity Through the Lens of User Anonymity

Sai Teja Peddinti∗, Aleksandra Korolova†, Elie Bursztein†, and Geetanjali Sampemane†
∗Polytechnic School of Engineering, New York University, Brooklyn, NY 11201. Email: [email protected]
†Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043. Email: korolova, elieb, [email protected]

Abstract—Most of what we understand about data sensitivity is through user self-report (e.g., surveys); this paper is the first to use behavioral data to determine content sensitivity, via the clues that users give as to what information they consider private or sensitive through their use of privacy enhancing product features. We perform a large-scale analysis of user anonymity choices during their activity on Quora, a popular question-and-answer site. We identify categories of questions for which users are more likely to exercise anonymity and explore several machine learning approaches towards predicting whether a particular answer will be written anonymously. Our findings validate the viability of the proposed approach towards an automatic assessment of data sensitivity, show that data sensitivity is a nuanced measure that

In this work we explore whether it is possible to perform a large-scale behavioral data analysis, rather than to rely on surveys and self-report, in order to understand what topics users consider sensitive. Our goal is to help online service providers design policies and develop product features that promote user engagement and safer sharing and increase users' trust in online services' privacy practices.

Concretely, we perform analysis and data mining of the usage of privacy features on one of the largest question-and-answer sites, Quora [7], in order to identify topics potentially considered sensitive by its users.
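To make the prediction task concrete, the sketch below shows one plausible way to frame "will this answer be written anonymously?" as a supervised text-classification problem. It is an illustrative example only, not the authors' pipeline: the input file name (quora_answers.csv), the column names (question_text, is_anonymous), and the choice of TF-IDF features with logistic regression are all assumptions made for the sake of illustration.

# Minimal sketch (assumed setup, not the paper's actual method): predict whether
# an answer will be posted anonymously from the text of its question.
# Assumes a hypothetical CSV "quora_answers.csv" with columns
# "question_text" (string) and "is_anonymous" (0/1 label).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("quora_answers.csv")

# Hold out 20% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    df["question_text"], df["is_anonymous"], test_size=0.2, random_state=42
)

# TF-IDF over word unigrams/bigrams feeding a class-balanced logistic regression.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5, stop_words="english")),
    ("logreg", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(X_train, y_train)

# Report precision/recall for the anonymous vs. identified classes.
print(classification_report(y_test, clf.predict(X_test)))

Any classifier over question text would do here; the point is only that anonymity choices recorded in behavioral logs can serve directly as training labels, with no survey or self-report step involved.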