An Investigation Into Better Techniques for Geo-Targeting by Shane Elliot Cruz
Total Page:16
File Type:pdf, Size:1020Kb
An Investigation into Better Techniques for Geo-Targeting by Shane Elliot Cruz Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of BARKER Master of Engineering in Electrical Engineering and Computer Science at the ASSACH NSTITUTE OF TERHNOLOGY MASSACHUSETTS INSTITUTE OF TECHNOLOGY 1 June 2002 JUL 3 2002 LIBRARIES © Shane Elliot Cruz, MMII. All rights reserved. The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part. A uthor ............... Department of Electrical Engineering and Computer c-ene May 17, 2002 Certified by.............................. Brian K. Smith Associate Professor, MIT Media Laboratory Thesis Supervisor Accepted by .................... Arthur C. Smith Chairman, Department Committee on Graduate Students 2 An Investigation into Better Techniques for Geo-Targeting by Shane Elliot Cruz Submitted to the Department of Electrical Engineering and Computer Science on May 17, 2002, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science Abstract This thesis discusses the possibility of determining the geographic location of Internet hosts solely from IP address information. The design, implementation, and evalua- tion of several unique components for geo-targeting led to an understanding of the strength and weakness of each strategy. Upon realization of the pros and cons of the location mapping designs, a new system was built that combines the best features of each technique. An overall evaluation of each component and the final system is then presented to demonstrate the accuracy gained with the use of a Domain-Based Corn- ponent Selection methodology. These findings demonstrate that location mapping can be used as a useful tool for many common Internet applications. Thesis Supervisor: Brian K. Smith Title: Associate Professor, MIT Media Laboratory 3 4 Acknowledgments It is difficult to overstate my gratitude to my advisor, Brian K. Smith, for the assis- tance he has given me throughout the entire thesis process. Without his open mind and insightful comments, this thesis would not be possible. I would also like to thank all the members of the Critical Computing group at the MIT Media Lab. Erik, Tara, Anna, Tim, Jeana, and Nell have all provided a great environment in which I was able to pursue my research. I am deeply indebted to all those who have helped me throughout my studies at MIT. This great circle of close friends helped me through times when I thought I could never make it. I would especially like to thank my roommate, Jang Kim, for being a great friend throughout both my undergraduate and graduate careers. Without his assistance and support, I don't know if my accomplishments would have been possible. I thank the many others who have stood by me throughout my life. My close friends and family members have given me the motivation to be where I am today. I would especially like to thank my brothers who have been my best friends throughout my life. Throughout the thick and the thin, I know that you will al- ways be there for me. Lastly, and most importantly, I would like to thank my parents. I wish I had the ability to express in words how much I owe you. Your love and encouragement have given me the opportunity to experience things that many could never imagine. To them I dedicate this thesis. 5 6 Contents 1 Introduction 15 1.1 A Location-Transparent Web ....... ........... .... 16 1.2 Research Contribution ..... ..................... 16 1.3 Thesis Outline. .................. ............ 17 2 Extended Example 19 2.1 Dynamic Webpage Interface ................ ....... 19 2.1.1 Country-Specific Customization .. ............... 20 2.1.2 Geo-targeted Sales and Advertising ............ ... 21 2.2 Regulatory Compliance .............. ............ 21 2.2.1 Location-Restricted Sales and Services ............. 22 2.2.2 Fraud Detection ................... ....... 22 2.2.3 Digital Rights Management ... .......... ...... 23 2.3 Business Intelligence ...... ..................... 23 2.4 Sum m ary ........ ............ ........... .. 25 3 Theory/Rationale 27 3.1 M otivation ... ..... ..... ..... 27 3.2 Lim itations . .... .... .... ... 28 3.3 Previous Research .. ..... ..... .. 29 3.3.1 Exhaustive Tabulation of Mapping 29 3.3.2 DNS-Encoded Information ..... 30 3.3.3 Ping Timing Information ..... 31 7 3.3.4 Whois ... ............... .. ... ... ... ... 3 2 3.3.5 Traceroute ..... .... .... ... .. .. .. ... .. 3 3 3.3.6 Cluster Analysis .... ..... ... ... ... ... .. 3 5 3.4 Proprietary Corporate Strategies . ... .... .... ..... 3 7 4 Design and Implementation 39 4.1 Component Design and Implementation ..... .. .. 39 4.1.1 Whois Parsing Component ... ..... .. .. 40 4.1.2 Traceroute Hostname Analysis .. ..... ... 44 4.1.3 Determining Geographic Network Clusters . ... 50 4.2 The Final System .. ...... ...... .... .... 56 4.2.1 Domain-Based Component Selection . .. .. .. 57 4.2.2 Detecting National Proxies ..... .... .. .. 59 4.3 Sum m ary .. ..... ...... ...... ... .... 60 5 Evaluation 61 5.1 Data Collection ... ... ... .. .. ................. 6 1 5.1.1 Research Community Input .. ................. 6 2 5.1.2 University Server Mapping . .. .......... ..... 6 2 5.1.3 Commercial Website Collection ................. 6 2 5.2 Evaluation Criteria . ..... .... .... .......... ... 6 3 5.2.1 Determining Performance . .. ................. 6 4 5.2.2 Limiting Incorrect Targeting .. .......... ....... 6 5 5.3 Component Analysis ...... .... ........ ......... 6 5 5.3.1 Whois Lookup Analysis .. .. ... ...... ..... ... 6 5 5.3.2 Traceroute Parsing Analysis .. .... .......... ... 6 7 5.3.3 Network Cluster Analysis . .. ................. 6 7 5.4 Final System Performance ...... ........ ......... 6 8 5.5 System Comparisons ........ .. .... ...... ..... .. 6 9 6 Conclusion 73 8 A Figures 77 9 10 List of Figures 4-1 A Sample Interaction Between a Client and the VeriSign Registry . 41 4-2 A Sample Interaction Between a Client and a Registrar ........ 42 4-3 A Simple Traceroute Execution ........ ............ 45 4-4 A Sample Parse File for the Ameritech Network ....... ..... 48 4-5 A Visualization of Common Internet Routing ........ ..... 51 5-1 The CDF of Each Component for the Set of University Web Servers . 66 5-2 The CDF of Each Design for the Entire Set of Test Data ....... 70 A-1 The Growth of Websites on the Internet[23] .............. 77 A-2 The Rapid Expansion of the Online Community[23] .......... 78 A-3 A More Complex Parse File Used for Genuity Routers ......... 79 11 12 List of Tables 4.1 Router Names with Common Location Codes ............. 46 4.2 The Various Network Classes and Their Respective Slashbits ..... 52 4.3 The Strengths and Weaknesses of Each Component Design ...... 57 5.1 The Top-Level Domains Represented in the Cumulative Test Data .. 64 5.2 The Statistical Analysis of Each Component for All Test Data .... 67 5.3 The Statistical Analysis of the Final System for All Test Data .... 69 13 14 Chapter 1 Introduction For many years following its introduction, the computer was used solely as a compu- tational tool to solve complex mathematical and scientific problems at the corporate and university level. As both the price of the hardware and the size of these ma- chines began to drop, computers started to make their way into residential homes as word-processing machines and entertainment devices. More recently, the connectivity presented by the Internet has shown that these devices can be used for much more than solving equations, typing a paper, or playing video games. The rapid growth of the World Wide Web (as seen in Figure A-1) has presented a whole new powerful information source that enables people across the globe to quickly access information that was once impossible to find. This large expansion of online information and the diminishing cost of connecting to the Internet have created an online community that spans the globe. As computer prices continue to drop and as portable devices continue to gain Internet connectiv- ity, the number of hosts connected to the Internet will only rise. This diverse global community, created by the vast array of networks, allows for international communi- cation across an entirely new medium in a form that was once impossible to imagine. Figure A-2 indicates the exponential climb in Internet hosts over the past several years. 15 1.1 A Location-Transparent Web While the Internet continues to provide useful information and services to a global community, the geographic topology of the World Wide Web remains insignificant. Due to geographically-blind packet routing protocols, a website visitor from London, England can access web services from a server in Boston, Massachusetts just as easily a student in the greater Boston area. This desirable location transparency has played a significant role in the flexibility and scalability of the Internet's networks. Unfortunately, with this location transparency comes an interesting challenge for web developers. Since Internet packet forwarding techniques make use of geography- free routing information (and there is no direct correlation between IP address and geographic location), a server cannot currently determine the location of the client with which it is communicating. Therefore, web developers and service providers must generalize their content and services to appeal to visitors from around