Web-Browsing History

Web-browsing history As seen from the client side Michael Sonntag Institute of Networks and Security ELEMENTS OF WEB-BROWSING HISTORY History The list of URLs visited (+at which time, intentionality…) Provides general information on time and location of activity URL's may also contain form information: GET request parameters Might be very extensive, as a single page may consists of very many independent elements = URLs Pictures, Ads, JavaScript, CSS, iFrames, redirections… Depends on how “History” is interpreted/stored by the browser Cookies Which websites were visited when + additional information May allow determining whether the user was logged in Can survive much longer than the history Depends on the expiry date of the Cookie and the configuration Theoretically: Collection of third-party cookies can determine site Practically: Too much overlap; too many cookies 2 ELEMENTS OF WEB-BROWSING HISTORY Cache The content of the pages visited Incomplete: E.g. ad's will rarely be cached (“no-cache” headers) Provides the full content of what was seen, e.g. (in theory!) webmail content More exactly: What was delivered by the server to the client and was marked as “cacheable” Mostly not that interesting, as the relevant part (e.g. the actual Facebook content, the chat messages etc) are nowadays “dynamic” and therefore not cached any more What you get is the “standard border” every page has plus the icons, stylesheets, scripts, fonts etc 3 INTENTIONALITY Did the user visit the webpage intentionally? In general: If it's in the cache/history/cookie file: Yes See also: Bookmarks! BUT: What about pop-ups? E.g.: Pornography ads (no one views such content intentionally )! Password protected pages? Images/JavaScript can supply passwords too, when opening a URL! Plugins, Prefetching…? Investigation of other files/trying it out/content inspection/… are needed to verify, whether a page that was visited, was actually intended to be visited (and whether the content was expected!) Usually this should not be a problem: Logging in into the mail account Visiting a website after entering log-in data (username+password) Downloading files 4 INCREASING PROBLEM: DYNAMIC PAGES Web browsing history works well for “a page” Server sends something, browser displays, user clicks on link But today often content is retrieved dynamically by JavaScript One “page” with lots of different content over time Problems: JavaScript requests are rarely cached JavaScript requests do no show up in the browser history The “page” might be completely empty at the start “Repeating” works less well Many pages change very frequently Not every few weeks, but minutes/seconds! The “page” might be the same, but the content (see above) not Content may be adjusted to the viewer, i.e. dependent on language, browser, previous browsing (cookies!) etc. 5 WEB BROWSING PROCEDURE (classical!) User enters the URL Browser determines the IP address for the host part Browser connect to the IP address (+port if specified) Sends request With additional information, e.g. what compression is allowed May contain cookie(s) Retrieves response Headers and actual content Header may contain cookie Saved to memory (and perhaps the disk in the cache file) Depends on headers, settings etc Connection is closed Note: HTTP 1.1 typically keeps the connection open for further requests (incl. pipelining). This is especially useful for images, stylesheets, scripts etc from the same site! 6 THE HTTP PROTOCOL Basis of HTTP is a reliable stream protocol (usually TCP) The HTTP state diagram is very simple With some exceptions, e.g. authorization There is only a single request and a single response HTTP request methods: GET: Retrieve some content Should never change the state on the server! Especially important if caching takes place somewhere Parameters (optional) are encoded in the URL POST: Send data for processing and retrieve result To be used for requests changing the server state! Parameters are sent in the request body HEAD, PUT, DELETE, TRACE, OPTIONS, CONNECT… Of less importance and (very) rare Head (check freshness), Put (file upload) Forget the rest (or: suspicious if present investigate!) 7 THE HTTP PROTOCOL The response always includes a status code 1xx Informational 2xx Success 3xx Redirection (request should be sent again differently) 4xx Client side error (e.g. incorrect request, not existing) 5xx Server side error (should not be retried) Caching of HTTP: Commonly performed through proxies Must either be validated with the source ( HEAD) Or it is "fresh enough" according to client, server, and cache Note: Browsers often ignore this E.g. IE can be configured to never check for a newer version even if the cached page has already expired! This has no influence on what proxies on the transmission path do! 8 THE HTTP PROTOCOL Local (=browser) caches If a page has expired, it is not necessarily deleted from the local cache It might remain there for much longer Can keep even pages marked as "no-cache" and "no-store" "no-cache": Should not be cached for future requests But might still be written to disk (e.g. Firefox) "no-store": Should only be held in memory Users are still allowed to use "Save As"! All dynamically generated pages are typically marked as expiring in the past, no cache etc. Why? Because previously there was no specification, so browsers did different things and servers provided headers for every browser make and version to be on the safe side! Such pages must normally be generated anew for every visitor (=no caches on transmission path) and every visit (=no caching in browser) 9 THE HTTP PROTOCOL Local (=browser) caches This cache can be very large and contain very old files Important for computer forensics! Manual deletion or cleaner programs are simple and effective But must be used every time after surfing Attention: Many such programs just delete the files, only the more serious ones overwrite them securely! Deleted by Browser before cleaning No overwriting! Also, fragments of files might remain in unused areas, so all free sectors and slack spaces would have to be cleaned every time! See also swap file/partition, hibernation file ( URLs) See also “Private browsing mode” 10 THE HTTP PROTOCOL EXAMPLE: HTTPS://WWW.GOOGLE.AT/ Request: GET / HTTP/1.1 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Encoding: gzip, deflate, br Accept-Language: de,en-US;q=0.7,en;q=0.3 Cache-Control: max-age=0 Connection: keep-alive Cookie: NID=119=mg3iUhrNJHA5CfE6CG5KVd…017-12-14-7; OGPC=5061821-25: DNT: 1 Host: www.google.at Upgrade-Insecure-Requests: 1 User-Agent: Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/57.0 11 THE HTTP PROTOCOL EXAMPLE: HTTPS://WWW.GOOGLE.AT/ Response: HTTP/1.1 200 OK alt-svc: hq=":443"; ma=2592000; quic=51303431; quic=51303339; quic=51303338; quic=51303337; quic=51303335,quic=":443"; ma=2592000; v="41,39,38,37,35" cache-control: private, max-age=0 content-encoding: br “Do not cache content-type: text/html; charset=UTF-8 this response” date: Thu, 14 Dec 2017 07:32:10 GMT Page, not Cookie! expires: -1 p3p: CP="This is not a P3P policy! See g.co/p3phelp for more info." server: gws set-cookie: <Lots of cookie data>; expires=Fri, 15-Jun-2018 07:36:24 GMT; path=/; domain=.google.at; HttpOnly strict-transport-security: max-age=3600 X-Firefox-Spdy: h2 x-frame-options: SAMEORIGIN x-xss-protection: 1; mode=block 12 HTTP/2 The new(er) protocol introduces a few changes Data compression: Not really a problem – The content stays the same, compression was negotiable before too Mostly an issue for wiretapping Request pipelining: More pages over one connection Was also part of HTTP/1.1 Wiretapping issue Multiplexing requests: Concurrent/interleaved requests possible Similar as pipelining: Wiretapping issue Server push: Problem for intentionality The server can send you data before you requested it, because you are going to need it anyway (e.g. stylesheets, scripts) This can be “repurposed” to send you arbitrary data Which might be cached (even if not shown, because it is not needed!) 13 COOKIES What is a "cookie"? Small (max. 4 kB) text file with information Originates form the server Stored locally Transmitted back to server on "matching" requests Content (with exemplary data): Name: "session-id" Value: "303-1195544-4348244" Domain: ".amazon.de" Sent to all requests ("/") of Website path: "/" subdomains of ".amazon.de" None Till browser is Expiry date and time: 15.10.2007, 00:02:22 closed ("session cookie") Secure(https): * Will be sent also on non-HTTPS connections The data may have any meaning Very rarely this is some "plain-text data" Some part of it might be the IP address or the user name But usually it is just a (more or less!) random unique number 14 THIRD-PARTY COOKIES “Real” Third-Party Cookies: Site A sends Cookie with Domain value set to site B Would NOT be sent back to site A, but only to site B! Browsers generally do not accept such cookies any more Third-Party Cookies: Page is from site A, but some iframe/picture/… is from site B – and the request for it sets a Cookie for site B Cookies originates at site B and is sent back to site B, but the “overall”/”encompassing”/”top-level” page is from site A Many browsers block these now (optionally/by default); also AdBlockers often drop them Forensically the difference is unimportant: All are either stored or not, and all are stored in the same way Theoretically

Web-Browsing History

On the Uniqueness of Browser Extensions and Web Logins

Replication: Why We Still Can't Browse in Peace

Website Error Messages

Examining Older Users' Online Privacy-Enhancing

Web Tracking: Mechanisms, Implications, and Defenses Tomasz Bujlow, Member, IEEE, Valentín Carela-Español, Josep Solé-Pareta, and Pere Barlet-Ros

An Analysis of Web-Based User Tracking and Online Privacy

Emacs-W3m User's Manual

Erasing Your Web Browsing History

A Reference Architecture for Web Browsers

M Anaging Dependencies and Constraints Assembled Software

Using Web Browsing Histories to Facilitate Multi-Method Research

Clearing Browser Cache