Characterizing Web Pornography Consumption from Passive Measurements
Passive and Active Measurement Conference
Puerto Varas (Chile) – March 27-29th, 2019

Characterizing web pornography consumption from passive measurements
Andrea Morichetta, Martino Trevisan, Luca Vassio

Introduction

Pornography and technology
● Accessibility
● Anonymity
● Privacy

Web Pornography (WP)
Definition: any online service intended to bring about sexual stimulation
● Important fraction of Internet traffic
– 13 PB of daily traffic for PornHub
● Number of websites
– we identified and monitored more than 300 000
● Number of users
– 93 million daily for PornHub

Studying WP consumption:
● Allows understanding human behavior
● Is crucial for medical and psychological research

Limitations of previous literature
● From the medical/psychology community
● Based on surveys of very few volunteers

Lack of data! People lie, even unconsciously!
Surveys are intrinsically unreliable and limited
We should use other sources of information

User behaviour papers: quick check
● YouTube: ~200 000
● Facebook: ~1 700 000
● PornHub/YouPorn/MindGeek: ~400

We are the first to use passive measurements to study user behavior on web pornography

Methodology

Detect interactions on web porn
● Network monitoring of residential customers
● HTTP-level passive traces
● 15 000 broadband users for 3 years (2014 – 2017)

March 2017: the first big player (MindGeek) switched to encrypted traffic, quickly followed by most of the others...

Passive network measurements
Tstat, deployed in the ISP network:
● Captures traffic and processes it in real time
● Logs information from every HTTP request and response

Definition of user: concatenation of the client IP address and the user-agent from the HTTP header
A single person may appear multiple times if she uses multiple devices or if major software updates occur

Identify user actions
User action: URL of a webpage intentionally visited by a user
Webpages have up to 100+ embedded objects (HTTP requests)
[Figure: a webpage and its embedded objects (HTML, images, CSS, HTML frame)]

Goal: find user actions among all HTTP requests in a trace
We introduced an automatic machine learning approach that uses several traffic features and is trained on real traces, improving on previous solutions
Accuracy 99.6%, F-measure 91.3%
Tested on smartphone traffic

Identify porn websites and sessions
Porn websites:
● Whitelist-based approach from publicly available lists
● Lists provide a set of domain names that offer different WP content
● We combine 3 different sources:
– 310 252 unique entries
– 460 top-level domains

Porn session:
● When a user accesses a porn website we open a new porn session
● We terminate a session if we do not observe any access to WP for a period of 30 minutes

Dataset
HTTP log
● Real traffic from HTTP logs
● Tstat installed at an Italian ISP
User actions
● Spark on a 20-machine Hadoop cluster
● Performance: ~48 hours to extract 1.5 billion user actions

Clients   Log size   HTTP requests   Years   User actions   Porn user actions   Porn domains
15 000    20.5 TB    138 billion     3       1.5 billion    58 million          59 898

~4% of all user actions are directed towards porn websites

Ethical concerns and limitations
● Only a regional sample of households in a single country
● Passive measurements threaten users’ privacy
● Data collection approved by the partner ISP
● Study subject to a privacy impact assessment from our institution
● Several countermeasures to avoid recording any personally identifiable information:
– All client identifiers are anonymized using the Crypto-PAn algorithm
– URLs are truncated to avoid recording URL-encoded parameters
– Encryption keys are varied on a monthly basis, to avoid persistent user tracking
– Sensitive information such as cookies and POST data is not monitored
– Logs are stored in a secured data center in an encrypted format
● We only refer to adult pornography websites from public lists, limited to content that is legal in the territories of the EU and USA

Results

How many?
Daily accesses: WP 12% – Netflix 3%, Instagram 25%, YouTube 45%, Facebook 60%

With which device?

Total daily time spent per active user

Porn sessions 1/2
[Plots: session duration [minutes]; sessions per day for an active user]

Porn sessions 2/2
[Plots: accessed webpages per session; accessed websites per session]

How does it change during the day?
● Peaks after lunch (14-15) and during the night (23-2)
● Less consumption in the morning (9-12) and in the afternoon (17-22)

How does it change during the week?
● Drop in usage on Saturday evenings
● Increase in the mornings (9-13) of Saturdays, Sundays and Mondays
● Cumulatively, Mondays have the most usage and Saturdays the least

Web porn services
● Cumulative visits astonishingly similar to overall traffic: the Top-15 websites account for around 60% of visits, the Top-200 for around 90%
● Ecosystem is led by a few big players in a dominant position: MindGeek websites account for more than 20% of accesses

Conclusions
● Precise quantitative information about interactions with pornographic websites: enhances the visibility and understanding of those topics
● Employed metrics taken from the surveys reported in medical literature
– Time and frequency of use
– Habits and trends
● Less mediated overview of user behavior, partially confirming what emerges from medical surveys

Open data!
● Anonymized dataset of visits to webpages belonging to web pornographic domains
https://smartdata.polito.it/adult-clickstreams/
● It is the only public dataset that includes WP accesses from regular Internet users

Luca Vassio [email protected].
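
Appendix: session rule sketch

The methodology slides define a user as the client IP address concatenated with the user-agent, and a porn session as a run of WP accesses that closes after 30 minutes without any new one. The sketch below illustrates one way those two rules could be applied to a list of porn user actions. It is not the authors' Spark pipeline: the function names, the tuple layout, the `|` separator and the choice to close a session when the gap is exactly 30 minutes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Idle timeout from the slides: 30 minutes without any WP access closes the session.
SESSION_TIMEOUT = timedelta(minutes=30)


def user_key(client_ip: str, user_agent: str) -> str:
    """Hypothetical user identifier: client IP concatenated with the user-agent.

    The '|' separator is an assumption; the slides only say 'concatenation'.
    """
    return f"{client_ip}|{user_agent}"


@dataclass
class Session:
    user: str
    start: datetime
    end: datetime
    urls: list = field(default_factory=list)


def build_sessions(actions):
    """Group porn user actions into sessions.

    `actions` is an iterable of (timestamp, client_ip, user_agent, url) tuples
    (an assumed layout). Actions are processed in time order; an action opens a
    new session when the user has no open session or when at least
    SESSION_TIMEOUT has passed since that user's previous action.
    """
    sessions = []
    open_sessions = {}  # user key -> most recent Session of that user
    for ts, ip, ua, url in sorted(actions, key=lambda a: a[0]):
        key = user_key(ip, ua)
        current = open_sessions.get(key)
        # Assumption: a gap of exactly 30 minutes already closes the session.
        if current is None or ts - current.end >= SESSION_TIMEOUT:
            current = Session(user=key, start=ts, end=ts)
            open_sessions[key] = current
            sessions.append(current)
        current.end = ts
        current.urls.append(url)
    return sessions
```

Session duration can then be read as `session.end - session.start`, and the number of accessed webpages per session as `len(session.urls)`. Because the user key depends on the user-agent, two devices behind the same IP address produce separate sessions, which matches the caveat on the slides that one person may appear as several users.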