DNS Weighted Footprints for Web Browsing Analytics
Total Page:16
File Type:pdf, Size:1020Kb
DNS weighted footprints for web browsing analytics Jos´eLuis Garc´ıa-Dorado,Javier Ramos, Miguel Rodr´ıguez,and Javier Aracil High Performance Computing and Networking Research Group, Universidad Aut´onomade Madrid, Spain. NOTE: This is a version of an unedited manuscript that was accepted for publica- tion. Please, cite as: J.L. Garc´ıa-Dorado,J. Ramos, M. Rodr´ıguez,J. Aracil. DNS weighted footprints for web browsing analytics. JOURNAL OF NETWORK AND COMPUTER APPLICATIONS, 111, 35-48, (2018). The final publication is available at: https://doi.org/10.1016/j.jnca.2018.03. 008 Abstract prints), we have measured that our proposal is able to identify visits and their durations with false and true positives rates between 2-9% and over 90%, The monetization of the large amount of data that respectively, at throughputs between 800,000 and ISPs have of their users is still in early stages. 1.4 million DNS packets per second in diverse sce- Specifically, the knowledge of the websites that spe- narios, thus proving both its refinement and ap- cific users or aggregates of users visit opens new plicability. Index Terms: Browsing analytics; opportunities of business, after the convenient san- Data monetization; DNS footprint; web-user pro- itization. However, the construction of accurate files; DNSPRINTS. DNS-based web-user profiles on large networks is a challenge not only because the requirements that capturing traffic entails, but also given the use of 1 Introduction DNS caches, the proliferation of botnets and the complexity of current websites (i.e., when a user The commercial exploitation of web browsing ana- visit a website a set of self-triggered DNS queries lytics is currently a fact, whereby the collaboration for banners, from both same company and third between both marketing and Internet communi- parties services, as well for some preloaded and ties has progressively strengthened after each busi- prefetching contents are in place). In this way, ness success. The information about users' prefer- we propose to count the intentional visits users ences that the different actors in the Internet arena make to websites by means of DNS weighted foot- gather can be used not only as inputs for tailored prints. Such novel approach consists of considering online-ads but for improved and novel mechanisms that a website was actively visited if an empirical- to monetize the rich set of available data [1, 2], estimated fraction of the DNS queries of both the even more with the advent of the Big Data era. In own website and the set of self-triggered websites paying attention to web access profiles, it becomes are found. This approach has been coded in a final apparent that the identification of a competitor at- system named DNSPRINTS. After its parameteri- tracting the interest among the customers of a given zation (i.e., balancing the importance of a website company is of paramount interest to react accord- in a footprint with respect to the total set of foot- ingly. Similarly, the interest in terms of visits of the 1 population of a given geographical area is a useful ated by actual users from a web browser, but input to plan the expansion for many retailers or from scripts that poll DNS servers with sin- to assess the impact of an advertising campaign. A gle queries searching for updates, information real-world example of this is Google AdWords ser- about the service or mail server addresses as vice [3], such that Google exploits keywords, statis- well as to perform Distributed Denial of Ser- tics of keywords, and search results of its users [4] vice (DDoS) attacks. and customizes ads according to their profiles. Nonetheless, the capacity of creating users' web- • The tangled web [10]: Almost any web includes profiles is not the exclusive preserve of web search a set of external contents (e.g., resources of engines as of today, but also others, such as ISPs, other webs of the same company or, simply, have the opportunities to exploit such business banners or ads) whose domain names have to model. Actually, in a richer fashion given that be resolved in order to fully load the requested search engines are only aware of the few first ac- web itself. This causes that additional DNS cesses users make to a website. After that, typi- traffic is generated without being explicitly re- cally, web names are retained in the browsing his- quested by users and makes users' web-profiles tory (or bookmarked), or simply memorized by the imprecise, especially for business exploitation. users. This way, search engines are no longer used • Websites' embedded content preload: as gateways to such sites. 1 Alternatively to client intrusive software and HTML5 recommendation defines special add-ons installed on clients' web browsers, such as websites' attributes such as preload that toolbars [5] or cookies [6], the ISP approach to this calls for an early load of resources not being issue is the apparently simple task of capturing part of either the main website loading or the and inspecting HTTP traffic. However, two con- rendering process. Such resources generate cerns arise. The unstoppable increase of HTTPS additional DNS queries. and the implicit difficulty of capturing (needless • Link prefetching: Popular browsers such as to say, inspecting) large volumes of HTTP traf- Google Chrome, Mozilla Firefox or Inter- fic potentially distributed in different geographical net Explorer perform link prefetching which areas. While the latter remains as an open re- generates non-explicitly requested extra traf- search area [7], the authors in [8, 9] pointed out fic. Prefetching can be performed at dif- that DNS can play a major role to reveal the traf- ferent levels. Nowadays almost any browser fic behind a HTTPS flow. Further, the authors pre-resolves domains present in every link of in [10, 11] successfully correlated traffic flows [12] a website. Moreover, some browsers may and temporally-close DNS queries, thus identify- not only resolve link addresses but prefetch ing IP addresses of users and the name of the ac- cacheable contents from websites referred by cessed hosts. Unfortunately, such approaches still such links, if possible. Even more, some require to collect all HTTP/HTTPS and DNS traf- browsers, such as Internet Explorer, perform fic to construct the flows to which relate the DNS image pre-rendering. Furthermore, websites protocol, and, importantly, other collateral issues may also indicate browsers to prefetch certain emerge: links, as defined formally in the HTML5 rec- • DNS Cache: Most of DNS resolvers and clients ommendations by means of special attributes implement a cache for DNS traffic where the of the link tag. In sum, all these approaches associated IP of a given domain name is tem- generate both HTTP/HTTPS and DNS extra porarily stored. As Time To Live (TTL) for traffic that impedes generating precise users' the cache entry can be long [13], chances are browsing profiles. that a point of presence monitoring traffic can- not see clients' DNS queries although they are In this light, we search for a mechanism that re- effectively visiting a web. lying on DNS traffic to detect every time a user ac- cesses a website is able to circumvent the problem • Automated clients, botnets and worms [14, 15, 16]: Often, DNS queries are not gener- 1https://www.w3.org/TR/html5/ 2 User requests a website with a browser DNS query to resolve root domain name DNS server resolves root domain name I HTTP request for the site HTTP server sends HTML file for the site Web browser identifies Figure 1: Sample website layout, with content that dependencies on the HTML DNS Queries for generates self-triggered DNS queries embedded on other domains several banners and external services DNS server resolves domains II III that some queries are hidden by caches, and, im- IV portantly, with the capacity to distinguish between HTTP requests for requested website visits and unintentional ones (let domains' content us refer to the latter as self-triggered webs or self- triggered traffic). To do so, we propose to explic- HTTP responses for itly exploit the tangled characteristic of websites domains' content and DNS preload/prefetch capabilities of current websites/browsers. Figure 1 features a sample layout for a website Web browser loads and displays the website with several embedded ads, banners and links to both external and same company services. After the main website is requested, while users wait for Figure 2: Flow graph describing the packet flow loading, a set of DNS queries/responses are gen- of DNS and HTTP traffic when a site is requested erated. Such behavior is illustrated in Figure 2, using a web browser where multiple resolution processes are in place. The set of different resolved domains that a given website triggers, defines the DNS footprint of such a website. This way, given a list of root domain websites, i.e., those websites of interest for business purposes, we construct their associated DNS foot- prints that will be used to search in traffic aggre- gates to determine the visits per user, potentially sanitized, or per aggregation in market segments and geographical areas. 3 In fact, being less restrictive with the footprint instance car manufacturers. To this end, the de- (i.e., accept a match although not all the websites partment provides a set of websites of interest for of a footprint are found) circumvents the cache the analysis, namely those of the relevant car man- problem while being more restrictive rules out self- ufacturers. Our purpose is detecting when a user triggered visits, in other words \false" resolutions has requested a certain website using a browser, out with respect to the actual website requested by the of such set of websites, only by inspecting the DNS user.