Data Collection - Exploitation - Protection
29 April 2011

S. S. Bhople (TUE), A.W. Huijgen (UT), S.L.C. Verberkt (UT)

Outline

• Data Being Collected

• Local Privacy Protection

• Privacy Protection in the System

Outline

• Data Being Collected
  ◦ Microsoft
  ◦ …

• Local Privacy Protection

• Privacy Protection in the System

Introduction

Search Engines

A program for the retrieval of data from a database or network, esp. the Internet.

Examples

• Bing
• AOL
• …

Question

Will you tell your personal information to strangers on the street?

NO

Then why do it online?

Primer on information theory and privacy

Identification of a person = facts about the person?

Which facts are important for identifying a person?

•Zip code

•Date of birth

•Gender

•…

Mathematics (Information Theory)

Entropy: a measure of the uncertainty associated with a random variable.

Mathematical Formulation:

The entropy H of a discrete random variable X with possible values {x1, …, xn} is H(X) = E(I(X)), where E denotes the expected value and I the information content of X.

If p denotes the probability mass function of X, then H(X) = -Σ_i p(x_i) log2 p(x_i). With the logarithm taken in base 2, entropy is measured in bits.

We continue with this basic knowledge…

Mathematical background

Entropy (H): The quantity which allows us to measure how close a fact comes to revealing somebody's identity uniquely.

How much entropy is needed to identify someone?

Let's perform some simple calculations.

Human population on Earth: approximately 7 billion.

To identify someone out of the entire population, 33 bits of information are needed.

How?

H = -log2(1/7,000,000,000) = 33 bits (32.70 to be exact)

Mathematics (continued..)

Consider the case of Mr. X

• Knowing his birthday: ΔH = -log2(1/365) = 8.51 bits

• Knowing his zip code (say 5641) and the population in that area: ΔH = -log2(561/7,000,000,000) = 23.57 bits

• Knowing his gender (here male): ΔH = -log2(1/2) = 1 bit

Adding up the above, the total reduction in entropy is 33.08 bits, slightly more than the ~32.70 bits needed.
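A quick check of these numbers (a minimal sketch; the 7 billion population and the zip-code population of 561 are simply the figures used above):

```python
# Sketch: redo the entropy arithmetic from the slides above.
from math import log2

population = 7_000_000_000
print(log2(population))                    # ~32.70 bits needed to single out one person

delta_birthday = -log2(1 / 365)            # ~8.51 bits
delta_zipcode  = -log2(561 / population)   # ~23.57 bits
delta_gender   = -log2(1 / 2)              #  1.00 bit
print(delta_birthday + delta_zipcode + delta_gender)  # ~33.08 bits in total
```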

Therefore, can we say that we can uniquely identify Mr. X just by knowing his date of birth, zip code, and gender?

(ΔH: reduction in entropy)

Evidence

Latanya Sweeney

Points violating individual privacy:

•Retention of log of queries.

•A log of queries associated with usernames.

• Keeping a log of queries for more than a day, a month, a year.

•Anonymized query data release.

What is clear?

Case study of AOL 2006 database scandal

What was released?

• Search data for roughly 658,000 anonymized users over a three month period from March to May.

• There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.

• According to comScore Media Metrix, the AOL search network had 42.7 million unique visitors in May, so the total data set covered roughly 1.5% of May search users.

Case study (continued)

•20 million search records

•U.S. searches

Result

• Heated debates.

•Apology from AOL.

Case study (continued)

Some issues

• Profiles of some AOL users were uniquely identified.

Thelma Arnold, a 62-year-old woman.

•Some scary stuff

Check the profile 17556639.xls

Case study (continued)

Do you think this person wanted to do something bad?

Can we use this to reduce crime rate?

"I think freedom should be limited." -Barack Obama, 2006

Google Services offered

• Search: Google Search, Image Search, Video Search, etc.

• Advertising: AdSense, AdWords, DoubleClick

• Location: …, …, Google Building Maker

• Communication and Publishing: …, G-talk, …, Google Public DNS

• Online shopping: …, …

Google

• Personal Productivity: iGoogle, …

• Business Solutions, Mobile, Development, Social Responsibility, and many more.

Google collects data

Google (Normal Search)
• Result pages
• Country code domain
• Query
• IP address
• Language
• Clicks
• Number of results
• Safe search
• Server log: query, URL, IP address, cookie, browser, date, time
• Additional preferences can include: street address, city, state, zip/postal code

Google collects data (continued)

Google Personalized Search
• Logs every website visited as a result of a Google Search
• Content analysis of visited websites

Google Account
• Used as a resource to compile information on individual users
• Personal picture, usage, friends, Google services usage, amount of logins
• Sign up: sign-up date, username, password, alternate e-mail, location (country)

Toolbar
• All websites visited
• Unique application number
• Sends all visited 404s to Google

Google collects data (continued)

• Toolbar synchronization function
  ◦ Stores Autofill info with the Google Account
  ◦ Sends the structure of web forms to Google

• Safe Browsing
  ◦ Stores responses to security warnings

• Stores Autofill form data

• Sends spellcheck data to Google servers

And the list continues…

The story is not very different for other companies' products (more specifically, other search engines).

Search Engines : Big Boss?

Do you get the feeling that someone is keeping a watch on you?

Do you think your privacy is disturbed?

A simple question

Are IP addresses personal?

“IP addresses recorded by every website on the planet without additional information should not be considered personal data, because these websites usually cannot identify the human beings behind these number strings.”

Yes No ?

Some facts

• Websites like Google never store IP addresses devoid of context; instead, they store them connected to identity or behavior.

• Google probably knows from its logs, for example, that an IP address was used to access a particular email or calendar account, edit a particular word-processing document, or send particular search queries to its search engine.

• By analyzing the connections woven throughout this mass of information, Google can draw some very accurate conclusions about the person linked to any particular IP address.

"We are moving to a Google that knows more about you."
- Google CEO Eric Schmidt, speaking to financial analysts, February 9, 2005, as quoted the next day

Why?

• Google says it needs to store search queries and gather information on online activity to improve its search results and to provide advertisers with correct billing information that shows that genuine users are clicking on online ads.

• Google promises to deliver its innovations by studying the behavior of individuals.

• To fight fraud and to improve its services.

Anonymization by Google

Technique followed by Google to anonymize the IP address:

• An IPv4 address is composed of four 8-bit pieces called octets.

• Google stores the first three octets and deletes the last, claiming that this practice protects user privacy sufficiently.
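A minimal sketch of this octet-dropping (the function name is illustrative). Note that removing the last octet removes only 8 bits of information, so at most 256 candidate addresses remain:

```python
# Sketch: Google-style "anonymization" that keeps the first three octets.
def anonymize_ip(ip: str) -> str:
    """Keep the first three octets of an IPv4 address and zero out the last."""
    octets = ip.split(".")
    return ".".join(octets[:3] + ["0"])

print(anonymize_ip("123.45.67.89"))  # -> 123.45.67.0
# Only the last 8 bits are dropped: log2(256) = 8 of the ~33 bits needed to
# identify someone, and far fewer in combination with the rest of the log entry.
```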

Really?

"After nine months, we will change some of the bits in the IP address in the logs. After 18 months we remove the last eight bits in the IP address and change the cookie information...It is difficult to guarantee complete anonymization, but we believe these changes will make it very unlikely users could be identified.“ -Google Outline

• Data Being Collected

• Local Privacy Protection
  ◦ Google Search
  ◦ Beyond the Searching Page

• Privacy Protection in the System

Local Privacy Protection
Google Search

• Hide

• Minimize

• Obfuscate

• Leave

Local Privacy Protection
Google Search: Hide [1/2]

• Application Level

◦ Scroogle.org

◦ Startingpage.com

◦ GoogleSharing

Local Privacy Protection
Google Search: Hide [2/2]

• IP Level

• HTTP Proxy

• Socks Proxy

• VPN

• The Onion Routing project

Local Privacy Protection
Google Search: Minimize [1/2]

• Startup Page

• Sign Out

• Autocomplete / Google Instant

Local Privacy Protection
Google Search: Minimize [2/2]

• Clicktracking

• User Script for Greasemonkey

• OptimizeGoogle Add-on

• Cookies

• Javascript

Local Privacy Protection
Google Search: Obfuscate

• TrackMeNot

• User Scripts for Greasemonkey

Local Privacy Protection
Google Search: Leave

Local Privacy Protection
Quick Summary

• Hide

• Minimize

• Obfuscate

• Leave

Local Privacy Protection
Beyond the Searching Page

• Clicked Search Results

• Browser Software

Local Privacy Protection
Clicked Search Results

• Block referers from being sent
  ◦ SSL
  ◦ HTTP POST method
  ◦ RefControl for Firefox

• Block trackers
  ◦ Ghostery for Firefox

• Block ads
  ◦ Adblock Plus for Firefox

Local Privacy Protection
Browser Software

• General
  ◦ Change startup page
  ◦ Remove Google Toolbar

• Firefox
  ◦ Disable the Safe Browsing feature and use WOT

• Use the SRWare Iron alternative

• Windows
  ◦ Disable the SmartScreen Filter

Outline

• Data Being Collected

• Local Privacy Protection

• Privacy Protection in the System
  ◦ Trusting the Search Engine
  ◦ Protection Against Information Leakage
  ◦ Protection Against an Untrusted Search Engine

Privacy protection in the system
The Simple Solution: Trust

• Establishing Trust

• Trust the Search Engine to Respect your Privacy

• End User Agreements

• Law

Privacy protection in the system
Information Leakage

• Using an untrusted channel
  ◦ Sending search terms or retrieving results
  ◦ Retrieving personalized search term suggestions
  ◦ Retrieving search history

• With a hijacked session
  ◦ Retrieving search history
  ◦ Retrieving search history using personalized search term suggestions

Privacy protection in the system
Information Leakage

• Example of misuse of personalized search suggestions:

• The malicious user supplies common two-letter prefixes, e.g. "pr"

• Suggest will reply with suggestions: privacy, protection, protocols

• If there are 3 suggestions, the malicious user descends to three-letter prefixes
• Else, the malicious user has found them all
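A minimal sketch of this enumeration, assuming a hypothetical fetch_suggestions stand-in for the hijacked or eavesdropped personalized-suggestion endpoint, and a reply limit of three suggestions as on the slide:

```python
# Sketch: enumerate a victim's personalized suggestions by walking prefixes.
import string

MAX_SUGGESTIONS = 3  # assumed reply limit, mirroring the slide

def harvest(fetch_suggestions, prefix=""):
    """Recursively collect every suggestion reachable through prefix queries."""
    found = set()
    for letter in string.ascii_lowercase:
        candidate = prefix + letter
        suggestions = fetch_suggestions(candidate)
        found.update(suggestions)
        if len(suggestions) == MAX_SUGGESTIONS:
            # The reply may be truncated, so descend to longer prefixes.
            found.update(harvest(fetch_suggestions, candidate))
    return found

# Demo with a fake suggestion service built from a victim's search history.
HISTORY = ["privacy seminar", "privacy protocols", "protection", "proxy", "shoes"]
fake_suggest = lambda p: [q for q in HISTORY if q.startswith(p)][:MAX_SUGGESTIONS]
print(sorted(harvest(fake_suggest)))  # recovers the whole history
```

Privacy protection in the system
Information Leakage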

• Disabling personalized search suggestions.

• (Strongly protected) authentication

• Compartmentalised search history and suggestions

Privacy protection in the system
Untrusted Search Engine

• Perfect privacy when searching a database

• Replicate the complete database

• Search locally

Privacy protection in the system
Untrusted Search Engine: the distributed solution

• Like a puzzle
  ◦ Every server has a piece
  ◦ The user requests all pieces
  ◦ Combining the pieces yields the result

• Need to hide the search terms
  ◦ Request more than the wanted piece

Privacy protection in the system
Untrusted Search Engine: the distributed solution

• Two-server example scheme:

• The user selects a uniformly random subset S_a of all possible queries (each query is included with probability 1/2):

S_a = {cryptography, shoes, privacy, ministry of defense}

• The user also calculates the union with the wanted query i, i.e. S_a ∪ {i} (or S_a ∖ {i} if i is already included):

S_b = {cryptography, shoes, privacy, ministry of defense, privacy seminar}

Privacy protection in the system
Untrusted Search Engine: the distributed solution

• Two-server example scheme:

• The user sends S_a to the first server and S_b to the second server.

• Each server replies with the exclusive or of all results for the sent queries:

• R_a = R_cryptography ⊕ R_shoes ⊕ R_privacy ⊕ R_ministry of defense

• R_b = R_cryptography ⊕ R_shoes ⊕ R_privacy ⊕ R_ministry of defense ⊕ R_privacy seminar

Privacy protection in the system
Untrusted Search Engine: the distributed solution

• Two-server example scheme:

• The user calculates the exclusive or of R_a and R_b, which yields the result:

• R_a ⊕ R_b = R_cryptography ⊕ R_shoes ⊕ R_privacy ⊕ R_ministry of defense ⊕ R_cryptography ⊕ R_shoes ⊕ R_privacy ⊕ R_ministry of defense ⊕ R_privacy seminar = R_privacy seminar
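A toy sketch of this two-server scheme, assuming each result is a fixed-length byte string and that the two servers do not collude; the query universe and the fake database below are illustrative:

```python
# Sketch: two-server private information retrieval with XOR, as in the example above.
import secrets

UNIVERSE = ["cryptography", "shoes", "privacy", "ministry of defense",
            "privacy seminar", "lecture notes"]
# Both servers hold an identical database: query -> fixed-length (8-byte) result.
DATABASE = {q: q.encode().ljust(8, b"_")[:8] for q in UNIVERSE}

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def server_reply(queries):
    """Each server returns the XOR of the results for every query it receives."""
    reply = bytes(8)
    for q in queries:
        reply = xor(reply, DATABASE[q])
    return reply

def private_query(wanted: str) -> bytes:
    # S_a: a uniformly random subset of the universe (each query included w.p. 1/2).
    s_a = {q for q in UNIVERSE if secrets.randbelow(2)}
    # S_b: the same subset with the wanted query toggled (added or removed).
    s_b = s_a ^ {wanted}
    # Each server sees only a random-looking subset and learns nothing about `wanted`.
    r_a = server_reply(s_a)
    r_b = server_reply(s_b)
    # Every result except the wanted one occurs twice and cancels out in the XOR.
    return xor(r_a, r_b)

print(private_query("privacy seminar"))  # -> the first 8 bytes of the wanted result
```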

Privacy protection in the system
Untrusted Search Engine: the distributed solution

• The k-server protocol:

• k = 2^d for some d ≥ 1

Privacy protection in the system
Untrusted Search Engine: the distributed solution

• The k-server protocol:

• Based on the binary representation of the server number, α = σ1 σ2 … σd ∈ {0,1}^d, the user selects the subsets to send: {S1^σ1, S2^σ2, …, Sd^σd}.

• Thus, for k = 2^3 = 8, the user would send server 5 = 101 (binary) the query {S1^1, S2^0, S3^1}.

Privacy protection in the system
Untrusted Search Engine: the distributed solution

• The k-server protocol:

• Each server calculates the exclusive or of the results for the sent queries and sends these back.

• The user calculates the exclusive or of all the received results, which yields the wanted result, as sketched below.
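A toy sketch of the k = 2^d server protocol for d = 3 (so k = 8), over a database of single bits; the sizes, index names, and in-process "servers" are illustrative:

```python
# Sketch: 8-server private information retrieval over a 3-dimensional bit cube.
import itertools
import secrets

D = 3       # dimensions, so k = 2**D = 8 servers
SIDE = 4    # the database is viewed as a SIDE x SIDE x SIDE cube of bits

# The replicated database: one bit per coordinate triple.
DB = {coord: secrets.randbelow(2)
      for coord in itertools.product(range(SIDE), repeat=D)}

def server_reply(subsets):
    """XOR of every database bit whose coordinates lie in S_1 x S_2 x S_3."""
    total = 0
    for coord in itertools.product(*subsets):
        total ^= DB[coord]
    return total

def private_read(wanted):
    # One random subset per dimension; the "1" variant toggles the wanted coordinate.
    base = [{j for j in range(SIDE) if secrets.randbelow(2)} for _ in range(D)]
    variants = [(s, s ^ {wanted[dim]}) for dim, s in enumerate(base)]
    answer = 0
    # Server alpha = sigma_1 sigma_2 sigma_3 receives {S_1^s1, S_2^s2, S_3^s3}.
    for sigma in itertools.product((0, 1), repeat=D):
        answer ^= server_reply([variants[dim][bit] for dim, bit in enumerate(sigma)])
    return answer  # every bit except DB[wanted] cancels an even number of times

wanted = (2, 0, 3)
print(DB[wanted], private_read(wanted))  # the two values always match
```

Privacy protection in the system
Untrusted Search Engine: the distributed solution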

• Problem:

• The servers are not allowed to communicate

• Which they probably do…

Privacy protection in the system
Untrusted Search Engine: the homomorphic method

• Simplified example:

• The user has a query containing several search terms:
  ◦ "lecture notes privacy seminar"

• The user generates some (plausible) decoy terms and adds these:
  ◦ "management lecture notes paper book privacy democracy seminar"

• The user assigns every search term its value (1 for a real term, 0 for a decoy):
  ◦ management = 0, lecture = 1, notes = 1, paper = 0, book = 0, privacy = 1, democracy = 0, seminar = 1

Privacy protection in the system
Untrusted Search Engine: the homomorphic method

• Simplified example:

• The user encrypts the values using a homomorphic encryption scheme.

• Afterwards, the user sends the search terms with the encrypted values to the server:
  ◦ {(management, E(0)), …}

Privacy protection in the system
Untrusted Search Engine: the homomorphic method

• Simplified example:

• The server uses the encrypted values as factors when calculating the relevance scores:

• result_a = relevance to management · E(0) + relevance to lecture · E(1) + relevance to notes · E(1) + relevance to paper · E(0) + relevance to book · E(0) + relevance to privacy · E(1) + relevance to democracy · E(0) + relevance to seminar · E(1)

• The server sends these results to the user.

• The user decrypts the results and reorders them using the relevance scores, as sketched below.
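A toy end-to-end sketch of this flow, using the Paillier cryptosystem as one concrete additively homomorphic scheme; the tiny primes, term values, and relevance scores are illustrative and not secure:

```python
# Sketch: homomorphically weighted relevance scoring with a toy Paillier setup.
import math
import random

def paillier_keygen(p=10007, q=10009):
    # Tiny fixed primes for demonstration only; real use needs large random primes.
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)            # valid because we use g = n + 1
    return (n, n + 1), (lam, mu)    # public key (n, g), private key (lam, mu)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(2, n)      # demo randomness, not cryptographic quality
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n

# User side: real query terms get value 1, decoys get value 0, then encrypt.
pub, priv = paillier_keygen()
terms = {"management": 0, "lecture": 1, "notes": 1, "paper": 0,
         "book": 0, "privacy": 1, "democracy": 0, "seminar": 1}
enc_terms = {t: encrypt(pub, v) for t, v in terms.items()}

# Server side: E(v)^relevance = E(v * relevance); multiplying ciphertexts adds the
# plaintexts, so the server builds the encrypted score without learning which
# terms were real and which were decoys.
def score_document(pub, enc_terms, relevance):
    n, _ = pub
    acc = encrypt(pub, 0)
    for term, enc_v in enc_terms.items():
        acc = acc * pow(enc_v, relevance.get(term, 0), n * n) % (n * n)
    return acc

doc_relevance = {"lecture": 3, "notes": 2, "management": 5, "seminar": 1}
enc_score = score_document(pub, enc_terms, doc_relevance)

# User side: only the user can decrypt the true relevance score (3 + 2 + 1 = 6).
print(decrypt(pub, priv, enc_score))  # -> 6
```

Conclusion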

• Search engines certainly collect sensitive information about their users, and this information is supposed to be well protected. According to the service providers, the main purpose of this collection is to improve search results.

• There are a few good methods for the more tech-savvy user to preserve personal privacy. It is also possible to implement a privacy-preserving search system, but search engines will not advocate such methods by themselves, as this does not appear viable to them. Therefore, to build more privacy into search engines, strict legislation is required.

• Along with trust between users and service providers, there is also a need for assurance that the data is protected from any other possible unwanted entity.

References

• https://www.eff.org/deeplinks/2010/01/primer-information-theory-and-privacy
• http://en.wikipedia.org/wiki/Entropy_(information_theory)#Definition
• http://plentyoffish.wordpress.com/2006/08/07/aol-search-data-shows-users-planning-to-commit-murder/
• http://techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/
• http://windows.microsoft.com/en-US/internet-explorer/products/ie-8/privacy-statement
• http://memeburn.com/2011/02/google-accuses-bing-of-stealing-search-results/
• B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan. Private information retrieval. J. ACM, 45:965–981, November 1998.
• H. Pang, X. Ding, and X. Xiao. Embellishing text search queries to protect user privacy. Proc. VLDB Endow., 3:598–607, September 2010.
• O. Tene. What Google knows: Privacy and internet search engines. Utah Law Review, February 2008.

# Data being collected: Browser software

• Windows Internet Explorer
  ◦ Integrated Bing Search
  ◦ SmartScreen Filter
  ◦ Default starting page

• Google Chrome
  ◦ Integrated Google Search from the address bar
  ◦ Protection against phishing and malware
  ◦ Update checking

• Mozilla Firefox
  ◦ Safe Browsing feature
  ◦ Default starting page

# Browser basics

•Whenever you use the Internet, or software with Internet-enabled features, information about your computer ("standard computer information") is sent to the websites you visit and online services you use.

•Standard computer information includes your computer's IP address, browser type and language, access times, and referring website addresses. This information might be logged on those sites' web servers. Which information is logged and how that information is used depends on the privacy practices of the websites you visit and web services you use.

Google server logs

Like most websites, Google's servers automatically record the page requests made when you visit its sites. These "server logs" typically include your web request, Internet Protocol address, browser type, browser language, the date and time of your request, and one or more cookies that may uniquely identify your browser.

Example: 123.45.67.89 - 25/Mar/2003 10:15:32 - http://www.google.com/search?q=cars - Firefox 1.0.7; Windows NT 5.1 - 740674ce2123e969

Here 123.45.67.89 is the IP address assigned to the user, 25/Mar/2003 10:15:32 is the date and time of the query, http://www.google.com/search?q=cars is the requested URL including the search query, Firefox 1.0.7; Windows NT 5.1 is the browser and operating system being used, and 740674ce2123e969 is the unique cookie ID assigned to this particular computer the first time it visited Google. (Cookies can be deleted by users. If the user has deleted the cookie from the computer since the last time they visited Google, then it will be the unique cookie ID assigned the next time they visit Google from that particular computer.)

# Electronic Frontier Foundation (EFF)

Browser properties

• Demo: https://panopticlick.eff.org/

Google's privacy principles

5 principles

http://www.youtube.com/watch?v=5fvL3mNtl1g&feature=player_embedded

Google dashboard

http://www.youtube.com/watch?v=ZPaJPxhPq_g

Interested in some news and business

Bing (Microsoft) copies results from Google

Some evidence has been found suggesting that Bing copies the results of Google searches. Information has also been disclosed about Bing using a tool that studies Google query data.