A Measurement Study of Web Redirections in the Internet

Krishna Bhargrava Douglas Brewer Kang Li Vangapandu Deparment of Computer Deparment of Computer Deparment of Computer Science Science Science University of Georgia University of Georgia University of Georgia 415 Boyd GSRC 415 Boyd GSRC 415 Boyd GSRC Athens, GA Athens, GA Athens, GA [email protected] [email protected] [email protected]

ABSTRACT legitimate (non-spam) also employ redirection for rea- The use of URL redirections has been recently studied to sons such as load balancing, link tracking, and bookmark filter spam as email and web spammers use redirection to preservation. To build accurate and effective spam detec- camouflage their web pages. However, many web sites also tion based on URL redirections, we need to understand the employ redirection for legitimate reasons such as logging, use of URL redirections for both spam and legitimate rea- localization, and load-balancing. While a majority of the sons. studies on URL redirection focused on spam redirection we Our study measures redirections in both legitimate URLs provide a holistic view of the use of URL redirections in and spam email URLs. We observe that redirection is com- the Internet. We performed a redirection study on vari- mon in both spam and legitimate URLs. Spam URLs used ous sets of URLs that includes known legitimate and spam redirection only slightly more than legitimate URLs at 43.63% . We observed that URL redirections are widely and 40.97% respectively. This means that while spam URLs used in today’s Internet with more than 40% of legitimate will use redirection, that fact that redirection occurs cannot URLs redirecting for various reasons. We also observed that be a means for determining whether an email should be con- server side redirection is prominent in both legitimate and sidered spam or not. spam redirection. Differing from legitimate URL redirec- To further our study, we broke down types of redirec- tion, JavaScript redirection is detected more often in spam tions and examined whether the redirections were internal websites. Furthermore, a very high percentage of spam redi- or external, whether the redirected domains were owned by rections lead to an external domain. Apart from providing the same organization. We found three types of redirec- a quantitative view of URL redirections, we also provide a tions used most often server-side, Javascript, and meta. A further classification of legitimate URL redirection based on clear difference between spam and legitimate URLs emerges the reason of redirection. We expect that our measurement when we categorize redirections into these types. We find results and classifications to provide a better understanding that spam URLs use Javascript redirection much more of- of the usage of URL redirection, which could help improving ten than do legitimate URLs 30% compared to 10%. Other the spam filtering and other applications that rely on URLs types of redirection where used more often by legitimate as the web identifiers. URLs, but they were not disparate enough with the usage by spam URLs to be of use. Breaking down redirections into external or internal yields a result where external redirects 1. INTRODUCTION are used more often by spam URLs and internal more often The point of most is to sell something to re- by legitimate URLs. cipient or steal personal information. This means that the The paper is organized as follows. Section 2 briefly de- spam email must contain some way for the recipient to pur- scribes the previous work on redirection. In section 3, the chase what is being peddled or trick the user into providing types of redirections, reasons for using redirection, how redi- their information. To this end, it is very common to see rections are detected, and classification of redirection is pre- spam emails contain a URL that directs the user to a web- sented. Section 4 describes the experiments conducted while site[?]. In spam filters, it is common to have blacklist filters section 5 provides a detailed analysis of the results observed. for URLs, but redirections make it hard for these filters to function correctly[?][?][?][?]. This means it is common to 2. RELATED WORK see spam messages that obscure their URLs with Most of the previous work in the field of web redirections redirection[?]. focused on spam redirections. In one of the early works on Since redirection is a widely used technique, much re- Web Spam classification, Gyongyi and H. Garcia- Molina search has focused on using URL redirections as an impor- [?] describe redirection as a spam hiding technique used tant factor in detecting spam. These previous works study by spammers to create doorway pages. Wu and Davison URL redirections in spam URLs exclusively. However, many [?] conducted a preliminary study that contributes with a quantitative analysis of the presence of in the In- ternet. They look at redirections as one of the techniques to CEAS 2009 - Sixth Conference on Email and Anti-Spam July 16-17, 2009, perform cloaking. Our study differs from these by consider Mountain View, California USA not only spam datasets but also a few common categories of legitimate web sites. rection would be triggered at the client browser to this target JavaScript redirections have been shown to be used by URL. The redirection happens after a browser finishes pars- spammers as a way to dupe users into viewing spam. Benczur ing a HTML page, then the action is triggered et al. [?] discovered numerous doorway pages which rely on to load content from the target URL. JavaScript redirection. Chellapilla and Maykov [?] look at JavaScript redirection explicitly with a focus on the tech- 3.2 Reasons for Using Redirection niques employed by the spammers. The Microsoft Strider As mentioned earlier, there are several reasons for URL team in their work on systematic discovery of spammers redirection – both legitimate and illegitimate. In this section emphasized URL redirection as a common spam technique. we look at some of the most common reasons for employing They developed a tool, Strider URL tracer, which can be redirection. The order of appearance is based on the popu- used to detect all the domains that a current web page con- larity among the classification results with most commonly nects to. With the aid of the Strider, Wang et al. [?] stud- observed reason listed first. ied URL redirections in the context that there are content Virtual hosting (and DNS aliasing) — Virtual Hosting providers which redirect the user to malicious sites. Niu et and DNS aliasing are where more than one sites (or just al. [?] conducted a study on forum with context- domain names) are mapped to a single IP address. This based analysis using the Strider to identify doorway pages. is commonly used by websites to register most commonly To our knowledge, most of the work mentioned above misspelled domains and redirect requests to these domains studied redirection in the context of spam. Our approach is to the original server. Such organizations register the same different from the above work as we look at general web with different top-level domains. For example, redirections on the whole. Our study involves detection requests to gooogle.com or google.net all redirect the user of URL redirection, classification of detected redirections to http://www.google.com. across multiple dimensions. Load balancing — Most of the top websites host content on several servers. These servers either host specialized con- 3. OVERVIEW OF REDIRECTION tent or mirror each other. In cases of high volume of web A URL is said to be redirected, if a client requests a re- traffic, requests are redirected to one of these servers; the source located at a specific URL, but the client’s final des- criteria of which depends on the website. For example, pop- tination at the end of the request is a different URL. This ular websites like search engines host different mirror servers section describes the types of redirection techniques as well and requests are redirected based on the nature of resource as the common reasons for redirections. requested. Link tracking (via indirections) — Many websites use redi- 3.1 Types of Redirection rection for statistical and logging reasons. For example, Based on the implementation techniques, URL redirec- websites log the advertisement clicks before it actually takes tions are classified into 1) Server-Side redirections, the user to the advertised webpage. For this to happen, ad- 2) JavaScript redirections, and 3) META redirections. vertisement clicks are taken to the originating website where Server-side redirections — Server-side redirection occurs the information is logged and then the request is redirected when a client requests a resource and the server issues a di- to the advertised webpage. rective in the of HTTP status codes which makes the Resisting web spam (via indirections) — Many websites client request through a different URL. HTTP reply status rewrite external links in their web pages by introducing a codes of type 3xx as well as some 4xx with a location field in level of indirection through a server that is not indexed by the header imply that the client has to redirect to a differ- search engines. For example, all user links posted at MyS- ent URL. For example, request for http://www.google.net pace are disguised as a link with indirection from the domain returns a status code 302 redirecting the request to name msplinks.com, not myspace.com. A web page linked http://www.google.com. from the former domain is much less valuable than a link Redirections can also be performed by using publicly avail- from the latter one. able services redirection services (such as tinyurl,shorturl). Providing Warnings (via indirections) — A level of indi- We have also observed that some other services are exploited rection is also used to provide a warning to users when they by spammers to perform redirection even though they are are about to leave the current domain. not exactly meant to generate redirections. An example for Security — Unauthenticated or unauthorized requests to this service is the “Google, I’m feeling lucky” which redirects a resource are usually redirected to a login page or an in- the user to the first query result. formation page. For example, visiting a protected profile JavaScript redirections — JavaScript redirections are ini- on MySpace redirects unauthenticated users to the login tiated at the client side through statements like page. Similarly, transition from HTTP to HTTPS is of- `‘window.location=someurl´’. These instructions are inserted ten enabled through redirection. For example, a request to into the script sections of the HTML page sent and when http://www.bankofamerica.com would be redirected to the client JavaScript engine executes these statements, it ://www.bankofamerica.com/index.jsp. is redirected to the URL specified within the script. Un- URL rewriting — Some websites rewrite long and script like server-side redirections, a web page is actually loaded oriented URLs with short and user-friendly URLs and later partially or completely into the client browser before the redirect to the actual URL. redirection occurs. Other reasons — Redirection is also used for many other META redirections — Another client side redirection is reasons, such as to keep bookmarks, to route requests in based the META tags located in the HEAD section of the CDN, to redirect requests for invalid resources, and to pro- HTML page. By setting the associated http-equiv attribute vide personalized content based on client geolocations or to refresh and the content attribute to a target URL, a redi- user-agent types. For example, we observed that 15 out of Alexa top 500 websites redirected differently when the user- For this reason, a detector should be capable of executing agent field are changed from a popular browser (Firefox) JavaScript and using a is the best way in which and a crawler (Google-bot). one could detect all complex redirections. Excepting the legitimate usage, URL redirection is one of the most exploited techniques for spamming, which have 3.3.2 Redirection Targets been used for cloaking, domain forwarding, doorway pages. The classification of redirections into external and inter- For details of the use of redirections in spam, please refer to nal ones is to validate a commonly held hypothesis, which early work on spam redirections [?][?][?] considers that redirections at spam websites often have more 3.3 Detecting and Classifying Redirection external redirections than the redirections used in legitimate sites. This section describes our efforts of classifying URL redi- The distinction between internal or external redirection rections into several categories based on the implementation is defined by the web domain ownership of the original and techniques (Server-side, Javascript, or META), the target of target URL. An external redirection is defined as a redi- redirections (external vs internal), and the reasons of redi- rection between two URL domains which are not owned or rection. managed by the same organization. 3.3.1 Redirection Techniques Detection of external redirections is not a straightforward task. Very often different domains are managed by the same Classification based on the type of redirection techniques organization. For example, a redirection from the msdn.com helps us identify the most commonly used techniques in domain to the microsoft.com is not considered as external various types of web sites. We briefly overview the re- redirection since both these websites are managed by the quired efforts to detect and classify redirections based on same organization. Unfortunately, there are no systematic their implementation techniques. The detection of redirec- methods to check if two sites are owned by the same orga- tions can be achieved with or without a browser. Obviously, nization. the detection of redirection techniques can be achieved by In addition to checking for identical domains, we adopt a browser. For example, the Microsoft Strider URL Tracer the following heuristics to differentiate redirections within is useful in detecting redirections from a URL by loading the same organization and external redirections. For a redi- the page into Internet Explorer. Browser-based approaches rection from the original URL domain X to a target domain often have high overhead and thus non-browser based ap- Y, we check 1) if one is a sub-domain of the other, 2) if they proaches have also been used. While detecting server-side share a common domain name server, 3) if they are two do- and META redirections are relatively straight forward with- mains with a common Top-Level- Domain, and 4) if their IP out a fullfeatured browser, JavaScript redirections are com- address is in the same Class B range. If any of these checks plicated to detect. Standalone Javascript interpreters, such return a positive result, we consider that X and Y belong to as the Rhino JavaScript engine [?], have been used but they the same organization. Obviously, these heuristics are not do not capture all intended JavaScript executions. We build always accurate. Therefore, we present the results of classi- our own detection and classification tool based on the SWT fication with and without each of these heuristics. And in Browser Widget. We did not use other browserbased tool, this study, if a redirection is not considered as external, it such as Strider, because they do not provide a full classifi- is considered as internal. cation function. Detecting Server-side redirection — The detection of Server- 3.3.3 Reasons for Redirection side redirections can be easily achieved by monitoring the status code returned in HTTP responses. Redirection oc- Classifying redirections based on reasons of redirection curs when the status code returned is of type 3xx or 4xx gives us an in depth understanding of the usage of URL with the location property in the response is set to a URL. redirections in various web sites. Since there is no precise Detecting META redirection — META redirections can information to determine the reason for using redirections, be detected by looking for META tags with the attribute we use the following methods to infer and classifying them http-equiv sets to refresh and the content attributes sets to into categories. First, we manually inspect a small set of a different URL. An example of such a META tag is shown URL redirections in the legitimate dataset, and infer the below: motivations of redirections based on the content of original and target URL and web content, and the naming conven- we believe represent each category and build our heuristic Detecting JavaScript Redirections — The detection of JavaScript classifier based on these keywords. For example, if keywords redirections is not straightforward because of the possible such as “login”, “auth”, and “https” appear in the destina- level of obfuscation [?]. For example, shown below is a sim- tion URL, then the redirection is classified under Security ple JavaScript statement that initiates redirection: category. Keywords such as “ad”, “click”, “overture” and window.location=“http://myspamlink.com” “adtmt” classify the redirection into Advertisements (Link Tracking) category. We apply these heuristics with the clas- Given the amount of flexibility that JavaScript offers to the sifier to the rest of the URL redirections and randomly sam- web developers, the same statement can be obfuscated to ple the outcome for manual verification of the classification many different versions that are hard to parse without exe- results. Throughout our study, the Keyword based URL cuting the script. For example, the above mentioned script heuristics are applied to identify redirections that are for could be written as link tracking, security, directory listings, etc. eval(“\“moc.knilmapsym//:ptth\”= Apart from the keyword heuristics, we also use other tech- noitacol.wodniw”.split(“”).reverse().join(“”)); niques to identify motivations such as load balancing and Figure 1: URL Redirection Comparison Table 1: Summary of Redirection Detection Dataset URL Total Redirections Legitimate (Alexa) 107300 40.97% virtual hosting. These are determined by matching the do- Legitimate (UGA) 1953 39.07% main names in the target and destination URL. For example, Legitimate (Blogs) 8878 25.03% there is a close match between the domain www0.shopping.com Spam Dataset 13451 43.63% and www1.shopping.com. The difference lies in the first part of the domain name with indications of a possible use of load balancing servers. Table 2: Impact of Popularity on Redirection Dataset % Redirections Alexa Top 100 18.00% 4. EXPERIMENTS Alexa Top 101-200 17.00% This section describes the system we used to detect redi- Alexa Top 201-300 22.00% rections and the datasets used in the study. Alexa Top 301-400 23.00% Alexa Top 401-500 17.00% 4.1 System Description The system we used to study redirections is comprised of a custom crawler used to collect the datasets, a redirection 5. RESULTS detector, an external redirection detector, and a heuristics This section describes the measurement results of URL based classifier. redirections on the datasets described earlier. The redirection detector uses SWT Browser Widget [?], a browser component that is commonly used in Javaenabled 5.1 Overview of Redirection Measurements applications. For example, Eclipse, a popular Java IDE uses The overall number of URLs in each dataset as well as the SWT Browser component in its in internal web browser. total number of redirections (in percentage) are listed in ta- Though the SWT Browser is a GUI widget, our system does ble 1. Figure 1 provides a further breakdown of these results not use a graphical user interface and thus does not require based on the techniques used to implement redirections. user interaction. Because the system uses a real browser, it Overall, we observed that URL redirection is common in can accurately detect all known types of redirections. all forms of websites irrespective of the nature of the data The redirection detector parses each URL and redirections set, and we found server-side redirections are predominant. detected are classified as a server-side redirect or client-side These observations are true for popular Internet sites as in- redirect. The URLs containing client-side redirections are dicated by the Alexa dataset, as well as the University and further parsed and classified into JavaScript or META redi- Blog dataset. About 25 to 40 percent of legitimate URLs rections. The system also contains a component that can be actually involve redirection. Among them, the observed used to detect external redirections. The internal and ex- redirections are mostly server-side redirections. Study on ternal redirections are further processed by a classifier com- Spam sites presents a similar result: overall 43.63% of the ponent which uses URL heuristics to logically classify the URL in the Spam dataset redirected to a different location. redirection based on the motivation. Among these, the dominant technology is server-side redi- rection (69.33%). 4.2 Datasets Although both legitimate and spam sites heavily use server In order to study the prominence of redirection, we have side redirection, there is a difference in their use of JavaScript considered different types of datasets. These include a spam redirections. Among all the datasets, Spam sites tend to dataset whos URLs are taken from spam honey pots and have a larger ratio (over 30%) of using JavaScript redirec- three legitimate datasets of different natures. tion than the ratio used by legitimate sites (less than 10%). 1)legitimate(Alexa) is a dataset of the Alexa Top 500 web- This indicates that JavaScript redirection should be more sites; 2) legitimate(UGA) include all the top level websites valuable when considering redirection behaviors as indica- hosted in a class B IP range (128.192.*.*) owned by the Uni- tions of spam. Many spam pages are hosted on exploited versity of Georgia, as a representative dataset for web sites servers which do not allow server configurations or server- in a university; and 3) legitimate(Blog) is a set of blog pages side scripts and JavaScript redirections are preferred over collected by continuously following the “next blog” link on META for their flexibility. blogger.com, which randomly return popular blogs from the The other significant difference in the results is actually blogger.com website. the META redirections among legitimate data set. It turns The selection of the first dataset, legitimate(Alexa), as out the university dataset has a relatively high percentage of a representative dataset for legitimate URL redirection is META redirection (19.79%) while the other datasets all have based on the assumption that the Alexa TOP 500 web sites very low percentage. META redirection is most likely used are spam free because they are the most popular Internet when the webmaster does not have control over the server sites. We did not validate this assumption. which is common with university hosted web pages. Given The selections of the other two datasets are validated by detection of META redirects is pretty straight forward, it us manually. Manual classification of spam and legitimate makes sense that spammers do not employ this technique websites often require domain knowledge and the authors are often and only 0.46% of the spam dataset redirected using familiar with the websites hosted on the university servers. META techniques. Manual testing confirmed that the URLs in this dataset are legitimate. 5.2 Impact of Popularity and Depth confirm a general suspicion that spam web sites redirect ex- Table 3: Impact of Depth on Redireciton in Alexa ternally much of time; as well, they confirm legitimate sites Top 100 try to keep redirection to within the same domain. Alexa Top 100 Level % Redirections We analyzed the results to see the percentage of redirec- Level 1 18.00% tions that are not within the same domain. It was observed Level 2 38.05% that the amount of external redirection observed in the spam Level 3 41.47% dataset is very high (46.81%) as opposed to that observed in the legitimate datasets ( 2%). The complete classification of the redirections into internal and external is shown in Table Figure 2: Comparison of Heuristics for External 4. Looking at the table you can see that the “Nameserver” Redirection Detection heuristic dominates the classification for internal redirec- tion. It should be acknowledged that it is not necessarily true that websites in the same/different domain have the While the overall results demonstrate the wide use of redi- same/different domain name servers, but when looking at rection in legitimate sites, it is still interesting to see whether Top 500 and the Spam dataset, we can conclude that the the use of redirections have any direct connections with the introduced error is not too significant. web site popularities and the depth of the URL on a web The amount of external redirection in the legitimate URL site. (0.39% - 2.00%) is observed to be less compared to the spam To study the impact and depth, we collected top level dataset (46.81%). Advertisement generated revenue is sig- (Level-1) URLs from the Alexa Top 100 to Alexa Top 500 nificant in the Internet and it is very common for the top websites, and we crawl to different level of Alexa Top 100 websites to host advertisements. This is consistent with our sites. It has been studied [?] that crawling websites up to measured result, where 53.10% of external redirections were 3-5 levels is sufficient to cover over 90% of the pages linked. to track advertisement clicks to help keep track of advertise- Therefore, for Alexa Top 100 websites, the first three levels ment revenue. of pages and those linked to them are crawled. For the spam dataset, as expected, a high percentage of The breakdown of percentage of redirections observed across redirections leave to an external domain (46.81%). Most the Alexa Top 500 websites (Level-1 pages) is shown in the of the client-side redirections are observed to be external. Table 2. From the results, it can be observed that the popu- 87% of JavaScript redirections and 97% of META redirects larity of a website has no significant impact on the amount of leave the current domain and redirect the user to an external redirection observed. In spite of 2-3% variation, each sector domain. On the other hand, only 43% of the Server-Side is close the average amount of redirection, 19.40% redirects actually leave the current domain, the majority of We further conducted experiments on the first three levels these server side redirections are from redirection services of Alexa Top 100 Websites and the amount of redirection ob- such as tinyurl. served is shown in the Table 3. From the results obtained, it appears that as one goes deeper into the websites, the 5.4 Study on Internal Redirections amount of redirection increases. This result meets our ex- We further classified the internal redirections of the Alexa pectation as the deeper a web site goes, the more possible dataset based on the reasons for redirection. The dominant reasons of redirections apply (see Section 3.2). internal redirections are primarily virtual hosting and load balancing. We observed that Virtual hosting redirections 5.3 Study on External Redirections account for 59.86% of the total observed internal redirections We further study the types of redirections based on the and load balancing accounts for 34.07%, followed by small target of redirection (external or internal) and the reasons percentage of URL redirection for the purpose of rewriting for the redirection. Given the usage of four heuristics in and security. determining whether a URL redirection goes to external do- This result matches with our expectation. Since Alexa main or not, we first evaluate the effect of these heuristics. top sites are popular sites based on the number of user ac- Each heuristic was individually applied and compared to the cesses, all of these sites would try their best to attract users. results where no heuristics were used (“None”) and those Naturally, these sites would have multiple names to attract were all four where used together (“All”). Figure 2 presents users, and to handle high volume of web access, multiple the comparison of external redirection detected with the use servers are likely to be used with load balancing applied. of each heuristic against the use of all heuristics and none in each dataset. The results indicate that these heuristics help 6. CONCLUSION identify actual external redirections more accurately. They In conducting our study on redirection, we found that the act of redirecting is not itself a good indicator of a spam URL. It turns out that spam and legitimate URLs use redi- Table 4: External vs. Internal Redirects rection with about the same frequency, 43.63% and 40.97% Type SPAM LEGITIMATE (Alexa) respectively. Since redirection itself is not a good indicator, Link Tracking (Advertisements) 53.10% Indirection (external links) 37.32% we studied the types of redirection and whether the redirec- External Redirection 46.81% 2% Indirection (anti-spam) 2.95% tions were external or internal. We found many good indi- Security 0.64% Others 5.83% cators of spam when looking at redirections in these ways. Virtual Hosting 59.86% Redirection when broken down by type server-side, Javascript, Load Balancing 34.07% and meta, shows that spam URLs have a tendency to use Internal Redirection 53.19% 98% URL Rewriting 1.3% Security 1.28% more Javascript redirects 30% compared to 10% for legiti- Others 4.07% mate URLs. Other types of redirection were used with sim- ilar frequency by both spam and legitimate URLs. This sented in this paper can be fairly sure a external redirection means that for detecting spam URLs, methods should only is a mark of spam. rely and must able to execute Javascript when looking at We expect that the results of our study provide can ben- just redirection. efit the applications that can potentially affected by URL Redirection broken down into external and internal redi- redirections, such as and content filter. We rections gives us a good separation of spam URLs and le- are refining our classification and detection tool and con- gitimate URLs. Spam URLs contain external redirection, tinuing our monitoring for the detection of new redirections 46.81%, with a much greater frequency than legitimate sites, techniques and the trend of redirection deployments. 2%. As such, a detection method using the heuristics pre-