Search Engine Privacy Protection in the System the Simple Solution: Trust
Data Collection - Exploitation - Protection
29 April 2011
S. S. Bhople (TUE), A.W. Huijgen (UT), S.L.C. Verberkt (UT)

Outline
• Data Being Collected
  ◦ Microsoft
  ◦ Google
• Local Privacy Protection
• Privacy Protection in the System

Introduction
Search Engines
A search engine is a program for the retrieval of data from a database or network, especially the Internet.
Examples
• Google Search
• Bing
• AOL
• …

Question
Will you tell your personal information to strangers on the street? NO
Then why do it online?

Primer on information theory and privacy
Identification of the person = facts about the person?
What facts can be considered important for identifying a person?
• Zip code
• Date of birth
• Gender
• …

Mathematics (Information Theory)
Entropy: a measure of the uncertainty associated with a random variable.
Mathematical formulation: the entropy H of a discrete random variable X with possible values {x_1, ..., x_n} is
$H(X) = E(I(X))$
where E is the expected value and I is the information content of X. If p denotes the probability mass function of X, then
$H(X) = \sum_{i=1}^{n} p(x_i)\, I(x_i) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$
We continue with this basic knowledge…

Mathematical background
Entropy (H): the quantity that allows us to measure how close a fact comes to revealing somebody's identity uniquely.
How much entropy is needed to identify someone? Let's perform some simple calculations.
• Human population on Earth: approx. 7 billion.
• To identify someone from the entire population, 33 bits of information are needed.
• How? H = -log2(1/7,000,000,000) ≈ 33 bits (32.70 to be exact)

Mathematics (continued)
Consider the case of Mr. X. ΔH denotes the reduction in entropy.
• Knowing his birthday: ΔH = -log2(1/365) = 8.51 bits
• Knowing his zip code (say 5641) and the population of that area (561): ΔH = -log2(561 / 7 billion) = 23.57 bits
• Knowing his gender (here male): ΔH = -log2(1/2) = 1 bit
Adding these up, the total reduction in entropy is 33.08 bits, slightly more than the 32.70 bits needed. Can we therefore say that we can uniquely identify Mr. X just by knowing his date of birth, zip code and gender?
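
The arithmetic for Mr. X can be checked with a short Python sketch. It is only an illustration: the function names are hypothetical, the figures (7 billion people, 561 residents in the zip code area) are the ones used on the slides, and summing the individual reductions assumes the three facts are independent of one another.

```python
import math

WORLD_POPULATION = 7_000_000_000  # approximate figure used on the slides
ZIP_POPULATION = 561              # residents of the example zip code area (from the slides)


def bits_to_identify(population: int) -> float:
    """Bits of information needed to single out one person among `population`."""
    return math.log2(population)


def entropy_reduction(probability: float) -> float:
    """Reduction in entropy (bits) from learning a fact that holds with `probability`."""
    return -math.log2(probability)


needed = bits_to_identify(WORLD_POPULATION)
birthday = entropy_reduction(1 / 365)
zip_code = entropy_reduction(ZIP_POPULATION / WORLD_POPULATION)
gender = entropy_reduction(1 / 2)

# Summing the reductions assumes the facts are independent (an assumption of this sketch).
total = birthday + zip_code + gender

print(f"needed:   {needed:.2f} bits")
print(f"birthday: {birthday:.2f} bits")
print(f"zip code: {zip_code:.2f} bits")
print(f"gender:   {gender:.2f} bits")
print(f"total:    {total:.2f} bits")
print("enough to single out one person:", total >= needed)
```

Under these assumptions the three facts together carry slightly more information than is needed to single out one person among 7 billion, which is the point the slide is making.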
Evidence
Latanya Sweeney

Points violating individual privacy
• Retention of a log of queries.
• A log of queries associated with usernames.
• A log of queries kept for more than a day, a month, a year.
• Release of anonymized query data.
What is clear?

Case study of the AOL 2006 database scandal
What was released?
• Search data for roughly 658,000 anonymized users over a three-month period from March to May.
• There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.
• According to comScore Media Metrix, the AOL search network had 42.7 million unique visitors in May, so the total data set covered roughly 1.5% of May search users.

Case study (continued)
• 20 million search records
• U.S. searches
Result
• Violent debates.
• Apology from AOL.

Case study (continued)
Some issues
• Profiles of some AOL users were uniquely identified, e.g. Thelma Arnold, a 62-year-old woman.
• Some scary stuff: check the profile 17556639.xls

Case study (continued)
Do you think this person wanted to do something bad?
Can we use this to reduce the crime rate?
"I think freedom should be limited."
- Barack Obama, 2006

Google
Services offered
• Search: Google Search, Image Search, Video Search, etc.
• Advertising: AdSense, AdWords, DoubleClick
• Location: Google Maps, Google Earth, Google Building Maker
• Communication and Publishing: Gmail, G-talk, Orkut, Google Public DNS
• Online shopping: Google Store, Google Checkout, Google Base
• Personal Productivity: iGoogle, Google Toolbar, Google Desktop
• Business Solutions, Mobile, Development, Social Responsibility and many more.

Google collects data
Google (Normal Search)
• Search Engine Result Pages
• Country Code Domain
• Query
• IP address
• Language
• Number of results
• Safe search
• Additional Preferences Can Include
  ◦ Street Address
  ◦ City
  ◦ State
  ◦ Zip/Postal Code
• Clicks
• Server Log
  ◦ Query
  ◦ URL
  ◦ IP Address
  ◦ Cookie
  ◦ Browser
  ◦ Date
  ◦ Time

Google collects data (continued)
Google Personalized Search
• Logs every website visited as a result of a Google Search.
• Content analysis of visited websites
Google Account
• Used as a resource to compile information on individual users
• Sign Up
  ◦ Sign up date
  ◦ Username
  ◦ Password
  ◦ Alternate E-mail
  ◦ Location (Country)
  ◦ Personal Picture
  ◦ Usage
  ◦ Friends
  ◦ Google Services Usage
  ◦ Amount of Logins
Toolbar
• All Websites Visited
• Unique application number
• Sends all visited 404s to Google

Google collects data (continued)
• Toolbar Synchronization Function
  ◦ Stores Autofill info with Google Account.
  ◦ Sends structure of web forms to Google.
• Safe Browsing
  ◦ Stores Response to Security Warnings
• Stores Autofill Forms Data
• Sends Spellcheck Data to Google Servers
And the list continues…
The story is not very different for other companies' products (more specifically, other search engines).

Search Engines: Big Boss?
Do you get the feeling that someone is keeping a watch on you?
Do you think your privacy is disturbed?

A simple question
Are IP addresses personal?
"IP addresses recorded by every website on the planet without additional information should not be considered personal data, because these websites usually cannot identify the human beings behind these number strings."
Yes / No / ?

Some facts
• Websites like Google never store IP addresses devoid of context; instead, they store them connected to identity or behavior.
• Google probably knows from its log files, for example, that an IP address was used to access a particular email or calendar account, edit a particular word processing document, or send particular search queries to its search engine.
• By analyzing the connections woven throughout this mass of information, Google can draw some very accurate conclusions about the person linked to any particular IP address.

"We are moving to a Google that knows more about you."
- Google CEO Eric Schmidt, speaking to financial analysts on February 9, 2005, as quoted in the New York Times the next day

Why?
• Google says it needs to store search queries and gather information on online activity to improve its search results and to provide advertisers with correct billing information showing that genuine users are clicking on online ads.
• Google promises to deliver its wonderful innovations by studying the behavior of individuals.
• To fight fraud and to improve data security.

Anonymization by Google
Technique followed by Google to anonymize the IP address:
• An IPv4 address is composed of four equal pieces called octets.
• Google stores the first three octets and deletes the last, claiming that this practice protects user privacy sufficiently.
Really?
"After nine months, we will change some of the bits in the IP address in the logs. After 18 months we remove the last eight bits in the IP address and change the cookie information... It is difficult to guarantee complete anonymization, but we believe these changes will make it very unlikely users could be identified."
- Google

Outline
• Data Being Collected
• Local Privacy Protection
  ◦ Google Search
  ◦ Beyond the Searching Page
• Privacy Protection in the System

Local Privacy Protection
Google Search
• Hide
• Minimize
• Obfuscate
• Leave

Local Privacy Protection
Google Search: Hide [1/2]
• Application Level
  ◦ Scroogle.org
  ◦ Startingpage.com
  ◦ GoogleSharing

Local Privacy Protection
Google Search: Hide [2/2]
• IP Level
  ◦ HTTP Proxy
  ◦ SOCKS Proxy
  ◦ VPN
  ◦ The Onion Routing project

Local Privacy Protection
Google Search: Minimize [1/2]
• Startup Page
• Sign Out
• Autocomplete / Google Instant

Local Privacy Protection
Google Search: Minimize [2/2]
• Clicktracking
  ◦ User Script for Greasemonkey
  ◦ OptimizeGoogle Add-on
• Cookies
• JavaScript

Local Privacy Protection
Google Search: Obfuscate
• TrackMeNot
• User Scripts for Greasemonkey

Local Privacy Protection
Google Search: Leave

Local Privacy Protection
Quick Summary
• Hide
• Minimize
• Obfuscate
• Leave

Local Privacy Protection
Beyond the Searching Page
• Clicked Search Results
• Browser Software

Local Privacy Protection
Clicked Search Results
• Block Referers From Being Sent
  ◦ SSL
  ◦ HTTP POST Method
  ◦ RefControl for Firefox
• Block Google Analytics
  ◦ Ghostery for Firefox
• Block Google Ads
  ◦ Adblock Plus for Firefox

Local Privacy Protection
Browser Software
• General
  ◦ Change Startup Page
  ◦ Remove Google Toolbar
• Mozilla Firefox
  ◦ Disable Safebrowsing Feature and use WOT
• Google Chrome
  ◦ Use SRWare Iron Alternative
• Windows Internet Explorer
  ◦ Disable SmartScreen Filter

Outline
• Data Being Collected
• Local Privacy Protection
• Privacy Protection in the System
  ◦ Trusting the Search Engine
  ◦ Protection Against Information Leakage
  ◦ Protection Against an Untrusted Search Engine

Privacy protection in the system
The Simple Solution: Trust
• Establishing Trust
• Trust the Search Engine to Respect your Privacy
• End User Agreements
• Law

Privacy protection in the system
Information Leakage
• Using an untrusted channel
  ◦ Sending search terms or retrieving results
  ◦ Retrieving personalized search term suggestions
  ◦ Retrieving search history
• With a hijacked session
  ◦ Retrieving search history
  ◦ Retrieving search history using personalized search term suggestions

Privacy protection in the system
Information Leakage
• Example of misuse of personalized search suggestions:
  ◦ Malicious user supplies common two-letter prefixes: pr
  ◦ Suggest will reply with suggestions: privacy, protection, protocols
  ◦ If there are 3 suggestions, the malicious