Eindhoven University of Technology

MASTER

Content based access control in social networks sites

van den Munckhof, C.J.J.

Award date: 2011

MASTER THESIS

Content Based Access Control in Social Network Sites

EINDHOVEN UNIVERSITY OF TECHNOLOGY
DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE

Author: Coen VAN DEN MUNCKHOF
Supervisors: Dr. J.I. (Jerry) DEN HARTOG, Prof. Dr. R.E. (Ronald) LEENES

September 2011

Abstract

A string of news incidents in which users of social network sites (SNSs) unknowingly exposed their private information to the public forms the basis of this Master’s thesis. Many people appear to be unaware of the audiences that have access to the content they share on these SNSs, or unmotivated to pay attention to the privacy policies for their content. This is the main problem addressed in this thesis. First, an extensive study is carried out that includes an analysis of existing SNSs (a definition, a feature description, and a detailed look at their privacy settings and data management). The research phase continues with a detailed look at the available access control mechanisms (methods that determine who is granted access to certain digital content and who is not). It leads to the proposal of a new method driven by content based access control. The method automatically proposes an appropriate privacy policy for a user’s content. It does so by analyzing both the content of the message and the author’s profile and distilling keywords (or tags) that match those in an existing dataset. This set of so-called attributes is then matched against an existing dataset of privacy policies to determine the most appropriate policy. The thesis ends with an experiment in which an essential element of the proposed method is tested using data from Hyves and Wikipedia.

Contents

List of Figures

List of Tables

1 Introduction
  1.1 SNS Related Incidents
  1.2 Project Outline
    1.2.1 The Problem
    1.2.2 The Scope
    1.2.3 The Research Question
  1.3 Structure

2 Social Network Sites
  2.1 Popularity
  2.2 Philosophy and Different Types of SNSs
  2.3 Information on SNSs
  2.4 Privacy Settings
  2.5 Conclusion

3 Access Control
  3.1 Access Control Models
    3.1.1 Discretionary Access Control
    3.1.2 Mandatory Access Control
    3.1.3 Role-Based Access Control
  3.2 Access Control Implementations
    3.2.1 Attribute Based Access Control
    3.2.2 Content Based Access Control
    3.2.3 XACML
  3.3 Conclusion

4 Structured Data
  4.1 Taxonomies & Ontologies
  4.2 Existing Datastructures
  4.3 Conclusion

5 Content Based Access Control
  5.1 Current Model
  5.2 Content Based Model
  5.3 Policy Proposing Process
    5.3.1 Determine Attributes
    5.3.2 Select policy from attributes

6 Concept Testing
  6.1 Implementation for the Experiment
    6.1.1 Attributes Determination
    6.1.2 Select Policy from Attributes
    6.1.3 Internal Datasets
    6.1.4 External Datasets
    6.1.5 Solr and Content Querying
  6.2 Preparation Datasets
    6.2.1 Tag Data from Wikipedia
    6.2.2 Message and Profile Data from Hyves
    6.2.3 Policy data
  6.3 Analysis
    6.3.1 Evaluation of the Wiki Data
    6.3.2 Evaluation of the Hyves Data
    6.3.3 Attribute Matching Evaluation

7 Conclusion
  7.1 Conclusion
  7.2 Recommendations

A Interesting Data in Wikipedia

B Solr Matching
  B.1 Regular Expressions

Bibliography

List of Figures

1.1 Project structure
2.1 Tree of the data categories
2.2 Hyves friends model
2.3 Facebook’s custom privacy popup
3.1 Mandatory Access Control
3.2 XACML access request flow
5.1 Friends model
5.2 Content based model
5.3 Working of Content based access control
6.1 Attribute determination from message content
6.2 Policy determination from content attributes
6.3 Example of a blogpost
6.4 Example of ‘who what where’ message
6.5 Part of the policy dataset
6.6 Hyves users year of birth
6.7 Friends count Hyves users
A.1 Wikipedia’s infobox and visualization

List of Tables

6.1 Solr matches
6.2 Category data in the Wikipedia tables
6.3 Items fetched from Hyves
6.4 Ratio of public and hidden profiles
6.5 Gender distribution of fetched profiles
6.6 Percentage of profiles with a hometown or city filled
6.7 Five most popular cities
6.8 Ten most popular regions of interest in sport
6.9 Percentage of www messages with location field filled
6.10 Tags in blogs
6.11 Tags in status messages
6.12 Date & Time attributes in status messages
6.13 Date & Time attributes in blog posts
A.1 Infobox locations

1. Introduction

A news item from AFP [9] of March 15, 2011 reported the following:

An Australian schoolgirl had to cancel her 16th birthday party after her Facebook invitation went viral and close to 200,000 people said they would turn up at her house. The Sydney girl had wanted her schoolmates to attend, and the post – which included her address – said they could bring friends if they let her know, Sydney’s Daily Telegraph newspaper reported. “(It’s an) open house party as long as it doesn’t get out of hand,” she wrote, adding that she had not had time to invite everyone individually. But within 24 hours more than 20,000 people had replied to the public event to say they were attending and by Tuesday almost 200,000 potential partygoers had reportedly accepted the invitation. The girl’s father said his daughter had invited “a few friends” over Facebook but had initially been unaware of the settings required to stop strangers from viewing the information.

This is just one of many examples [8, 17, 39] of the consequences of poorly configured privacy settings on Facebook¹ or other social network sites (SNSs). Ten years ago, you would not have found this kind of news item. Back then, you would invite your friends by phone, by sending them an invitation by (electronic) mail, or in direct conversation. Either way, you did not have to worry about who might see the invitation. In this case, the ‘Sydney girl’ probably did not realize that publicly posting this invitation could reach so many people in such a short time. SNSs have become a popular and widely used medium on the internet for sharing interests, thoughts, photos, activities, etc. But besides all the nice features they provide, they also introduce new privacy issues.

¹ http://www.facebook.com


SNSs rely heavily on their users to create the proper access control policies. But many users, just like the Sydney girl, are unaware of the privacy settings they can define. Other users might be aware of them but find them too much of a hassle to configure, or find the given configuration possibilities too limited. And of course there is a group of users that simply does not care about their privacy settings.

1.1 SNS Related Incidents

In 2008, the weblog Ask The Judge [44] reported that several law schools in the U.S. were using SNSs as part of their admissions process. Fifteen percent of the surveyed admissions officers admitted to visiting applicants’ profiles on SNSs, and many found negative content that reflected poorly on the student. The SNSs allowed the admissions officers to freely investigate the applicants’ private lives.

In early 2009, a teenager was fired from her job after negatively commenting on it on Facebook. The girl explained: “They (her superiors) were just being nosey, going through everything”, and further, “it makes them look stupid that they are going to be so petty”. Her boss, however, stated that she had posted her comments and invited staff members to read them [39].

In November 2009 a Canadian woman who had taken long-term sick leave from her job lost her health benefits after her insurance company visited her Facebook profile and saw pictures of her at a male strip-tease show at a Chippendales bar, celebrating her birthday, and smiling in a bikini at the beach. Based on these ‘happy’ pictures, the insurance company claimed the woman was no longer depressed and should be able to work again [8].

On 2 March 2011, a 26-year-old mother changed her Facebook relationship status from ‘married’ to ‘single’. A few days later, she was fatally stabbed by her (ex) partner, who could not stand to lose her to another man [6].

This year, Ivan Kaspersky (son of the rich Eugene Kaspersky) was kidnapped in Russia when Ivan was walking from home to his office at InfoWatch. Quoted News [34] writes that the kidnappers may have gathered all the necessary information about Ivan from his profile on a Russian SNS which contained detailed information about his whereabouts.

All these incidents, some more extreme than others, are caused by people who are not aware of, or not concerned about, private information being exposed.


The Australian girl claimed she had initially been unaware of Facebook’s privacy settings making it possible to stop strangers from viewing her post. The other incidents too seem to be the result of either an unawareness or underestimation of the consequences of making private information public.

1.2 Project Outline

1.2.1 The Problem

The incidents in Section 1.1 are not isolated. There seems to be a general lack of action from users of SNSs to protect their content from undesired audiences. Acquisti and Gross [7] explain that there is a contradiction between the privacy concerns (high) and the actual information hiding strategies (low) of members of SNSs. Members believe the protection of their privacy is highly important, but they still expose private information to the public.

The reasons the authors give for this contradiction between members’ attitude and actual behavior are: an unawareness of the true visibility of their content; a high level of trust in the SNSs to protect them; and a misunderstanding or ignorance of how personal data is treated.

Strater and Richter [40] add that the complexity and ambiguity of SNS interfaces are also a significant inhibitor of appropriate use of privacy settings.

1.2.2 The Scope

A solution to these problems could be sought on the side of the members, by educating them better about the audiences of their content or by allowing them to define their audiences more elaborately. But this would still require members to put effort into understanding the information (which might be complex, especially for younger members) and into actually putting it into practice (which might become increasingly demanding as configurations get more elaborate).

Instead, this graduation project addresses the role of the SNSs. If members have a high level of trust in SNSs to protect their privacy, is it possible for the SNSs to actually meet these expectations? Can the SNSs do most of the work in setting the correct privacy policies, so that members are relieved of this effort? This project explores the possibilities of SNSs automatically and intelligently proposing appropriate privacy policies for the content that their members produce.


1.2.3 The Research Question

Is it possible for SNSs to automatically generate appropriate privacy policies for the content their members produce, and if so, how can this be realized in practice?

The research question is divided into three parts. The first concerns the current state of SNSs. Who uses them and how? And what sort of privacy protection do they currently support?

The second concerns the underlying mechanisms (access control mechanisms) that are available to protect data from undesired audiences. What types of policies do they support?

The third part is the actual proposal of an automated method to produce an appropriate privacy policy. What information about the user and his content is meaningful for defining such a policy? And how can one acquire this information?

1.3 Structure

As explained in Section 1.1, the inspiration for this graduation project comes from several news reports of privacy related incidents involving SNSs. This has led to the research question regarding the automatic generation of appropriate privacy policies for SNS users.

Chapters 2, 3, and 4 present an analysis of the existing situation and the available mechanisms regarding social network sites, access control, and data structures.

The results of the analysis phase lead to the introduction of the main model, which focuses on content based access control. This is described in detail in Chapter 5.

Chapter 6 puts the model into practice with an experiment in which the proposed model is tested using publicly available data from Hyves.

Finally, Chapter 7 concludes the study with conclusions and recommendations for future work.


Figure 1.1: Project structure. Flowchart of the chapters: 1. Introduction (privacy related incidents, literature research, research question); 2. Social Network Sites (available information, privacy settings); 3. Access Control (types of access control, implementations); 4. Structured Data (taxonomies & ontologies, existing data structures); 5. Content Based Access Control (model, propose policy); 6. Concept Testing (implementation, preparation, analysis); 7. Conclusion (conclusion, recommendations).


2. Social Network Sites

A social network site (SNS) is an online community that allows users to publish resources, share interests, and establish different types of relationships with other members for all kinds of purposes (business, entertainment, dating, etc.) [14]. It allows members to represent themselves and to define and explore their network of relationships [13]. In short, SNSs allow members to create their (digital) relationship network and communicate with other members. Most SNSs allow members to create their own profile with information about themselves, invite friends to their network, and share and communicate all types of content with one another.

2.1 Popularity

SNSs are hugely popular among the general public. Their use among adults in the U.S. rose from 8% in 2005 to 46% in 2009 [32]. According to the Centraal Bureau voor de Statistiek [15], at least 91% of Dutch ‘youngsters’ are active on SNSs such as Hyves¹, Twitter², and Facebook.

The biggest social community in the Netherlands is Hyves, with 11 million³ members, around 70% of the population. The largest social community in the world is Facebook, with more than 500 million active members. Its members share more than 30 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) each month [1].

2.2 Philosophy and Different Types of SNSs

Social network sites are built on the simple philosophy that people want to share and stay connected with their friends and the people around them. When you have control over what you share, you want to share more. When more is shared, the world becomes more open and connected [48].

¹ http://www.hyves.nl
² http://www.twitter.com
³ Specified on the front page of Hyves (March 2011)


Translate this philosophy to different fields and you get a large variety of SNSs, each with its own functionality. LinkedIn⁴ and Xing⁵, for instance, are specifically meant for jobs or business-related communications. Flickr⁶ and Youtube⁷ allow members to share graphic content such as photos and videos. Members of LastFM⁸ can share and discover each other’s taste in music. And the micro blogs of Twitter allow users to express themselves easily and quickly using short messages.

In this landscape of SNSs, Facebook is the world’s biggest (500 million members) and Hyves is the biggest in the Netherlands (11 million members). These two are frequently used for reference in this chapter since they are the most popular and the features they offer are interesting for the scope of this project, namely: an extensive and detailed user profile; inviting friends and creating a network; frequent sharing of interests and content; and communication with other members.

2.3 Information on SNSs

In order to let their members use all the features they offer, SNSs need to manage large amounts of data. This data can be categorized in different ways depending on the purpose (see for example [30, 37]). Both of these categorizations are useful for structuring the different kinds of SNS data, but neither fits well within the scope of this project. Therefore a more appropriate categorization is proposed in Figure 2.1.

This model makes a clear distinction between front-end and back-end data. Front-end data is data shown on user profiles and data that is actively submitted. Back-end data consists of data that is (mostly) not visible on the profile and is gathered by means other than user input (e.g. statistics of page views or log-ins).

The front-end data is divided into four categories: profile, relations, user content, and communication.

Profile

The profiles that members can make on SNSs can contain lots of information. In addition to ‘regular’ personal information such as name, date of birth, and gender, members can also specify their work experience, education, relationship status, sports, interests, and contact information. This information represents the user and can be interesting for other users. It is also interesting for SNSs, because they can categorize this data and make it usable, for instance for targeted advertisement⁹.

Figure 2.1: Tree of the data categories. Social Network Site data is split into front-end data (profile, relations, user content, communication) and back-end data.

⁴ https://www.linkedin.com/
⁵ https://www.xing.com/
⁶ http://www.flickr.com/
⁷ https://www.youtube.com
⁸ http://www.last.fm

Relations

Defining relations is one of the most important aspects of Hyves and Facebook. People take the time to create a profile and share their interests, pictures, and thoughts because they want other people to see them. A relation between two members in Hyves or Facebook is established when, say, Alice requests a relation with Bob, and Bob in return accepts it. In Hyves, Alice and Bob can (after creating a relation) describe their relationship by defining how they know each other. This information, however, is not directly visible and is also not used for ordering friends into meaningful groups (which could be useful for other purposes such as audience segregation, see Section 2.4). Facebook, on the other hand, gives both Alice and Bob the possibility to add the other person to a group. Section 2.4 further examines this feature.

User content

User content stands for all the content that users post when they want to share or communicate with others. Examples are uploading and sharing photographs of a holiday, writing a blog post, or writing what you are doing and where you are at that moment. User content is created and controlled by the content owner.

The representation of user content can differ between SNSs. Facebook, for instance, has a ‘wall’ on which all the user’s activities are posted (uploading photos, befriending people, etc.). Hyves separates much of the user content into segments such as scraps (or ‘krabbels’), videos, and blogs. In both cases the profile owner can control the content. But since Facebook’s ‘wall’ is a constant feed on which older content is pushed down by newer content, the older content disappears. On Hyves this also happens, but in a more segmented way. For instance, a user’s latest blog post might be six months old but still visible, while their status update of last week is long gone.

⁹ http://www.facebook.com/advertising/ (last visited September 1, 2011)

Communication

As a member of a social network site, you can communicate with other members (contacts or strangers) by writing something on their ‘wall’, by writing private messages, by live chat, or by public messages. People react to the content of others by writing, by posting, or by clicking the famous ‘like’ button. These different kinds of communication have different characteristics. Some are direct (live chat), some are one-way (the ‘like’ button), some are fast and short (scraps on walls), and others can be long and intimate (private conversations). These different communications produce content whose privacy sensitivity varies.

Back-end data

All data not actively submitted or created by an SNS member is considered to be back-end data and is not always visible on the SNS. This includes statistical data such as the number of page views (per time period, or per user), the number of times a member logs in, and how the member logs in (by smartphone or desktop PC). Both Facebook and Hyves collect this kind of data [2, 3, Section 2], but they do not reveal exactly what data is stored or how they process it.

2.4 Privacy Settings

To allow users to protect their data (and privacy) from undesired audiences, SNSs provide certain privacy settings for the content that users produce. These privacy settings differ between SNSs, but there are also similarities. Most offer a limited model that allows a user to protect his data by using the ‘friends’ model [19, 20].

This basically means that the SNS user can select content to be private, public, or only accessible to ‘friends’ or ‘friends of friends’. Both Hyves and Facebook use the friends model for regulating privacy and access. Hyves members have the extra option to make content available to all Hyves members. With this option, content is available to all Hyves members but remains inaccessible to external parties. This means, for instance, that search engines like Google and Yahoo cannot view and cache a user’s profile data. Figure 2.2 shows how this model is implemented in Hyves.

Figure 2.2: Friends model in Hyves, specifying that all Hyves members are allowed to see the user’s birthday.
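The visibility levels of the friends model can be illustrated with a minimal Python sketch. It is not taken from Hyves or Facebook; the names `Visibility` and `can_view`, and the representation of relations as a dictionary of sets, are assumptions made purely for illustration.

```python
from enum import Enum


class Visibility(Enum):
    PRIVATE = "private"
    FRIENDS = "friends"
    FRIENDS_OF_FRIENDS = "friends of friends"
    MEMBERS = "all members"  # the extra option described for Hyves
    PUBLIC = "public"


def can_view(viewer, owner, visibility, friends):
    """Return True if `viewer` may see content shared by `owner`.

    `viewer` is None for an external party that is not logged in
    (for example a search engine crawler). `friends` maps each member
    to the set of members with whom a mutual relation exists.
    """
    if viewer == owner:
        return True
    if visibility is Visibility.PUBLIC:
        return True
    if viewer is None or visibility is Visibility.PRIVATE:
        return False
    if visibility is Visibility.MEMBERS:
        return True
    owner_friends = friends.get(owner, set())
    if viewer in owner_friends:
        return True
    if visibility is Visibility.FRIENDS_OF_FRIENDS:
        # One extra hop: the viewer is a friend of one of the owner's friends.
        return any(viewer in friends.get(f, set()) for f in owner_friends)
    return False


# Hypothetical relations: Alice - Bob - Carol.
friends = {"alice": {"bob"}, "bob": {"alice", "carol"}, "carol": {"bob"}}
```

With these relations, Carol is refused content that Alice shares with ‘friends’ but admitted to content shared with ‘friends of friends’, while a crawler (`viewer=None`) only sees public content.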

Grouping

Facebook uses an expanded version of the friends model based on the idea that not all contacts should have the same access rights. Therefore, Facebook gives users the option to separate their contacts into meaningful groups, which can then be used in defining access policies. Besides the usual options (‘public’, ‘private’, ‘friends’, and ‘friends of friends’), Facebook members can explicitly grant or deny access to certain people, resulting in finer grained access control. This is shown in Figure 2.3.
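A custom policy with explicit grants and denials over groups and individuals might be modelled as in the following sketch. The class `CustomPolicy`, the function `is_allowed`, and in particular the rule that a denial overrides a grant are assumptions for illustration; the source does not describe how Facebook resolves such conflicts.

```python
from dataclasses import dataclass, field


@dataclass
class CustomPolicy:
    allowed_groups: set = field(default_factory=set)
    allowed_users: set = field(default_factory=set)
    denied_groups: set = field(default_factory=set)
    denied_users: set = field(default_factory=set)


def is_allowed(viewer, policy, groups):
    """Decide access for `viewer`; `groups` maps a group name to its members."""

    def in_any(group_names):
        return any(viewer in groups.get(name, set()) for name in group_names)

    # Assumption: an explicit denial always wins over a grant.
    if viewer in policy.denied_users or in_any(policy.denied_groups):
        return False
    return viewer in policy.allowed_users or in_any(policy.allowed_groups)


# Hypothetical groups and a policy "classmates, except Carol".
groups = {"family": {"mum", "dad"}, "classmates": {"bob", "carol"}}
policy = CustomPolicy(allowed_groups={"classmates"}, denied_users={"carol"})
```

Here Bob is admitted through his group, Carol is refused by the individual denial, and family members are refused because nothing grants them access.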

Shortcomings

At first sight, there seems to be nothing wrong with the friends model, since it is a simple and understandable basis for granting access. The problem is the notion of ‘friends’, which covers all contacts, ranging from best friends to colleagues and from family members to classmates. Information suitable for best friends might be unsuitable for family and colleagues, and vice versa [24].

In the ‘real’ world people adapt their messages to different audiences (audience segregation). When communicating, people are normally aware of their audience, which enables them to make intuitive decisions about what information to share and what to keep secret.

Figure 2.3: Facebook’s custom privacy popup, where you can explicitly grant or deny access to groups and individuals.

Leenes [24] argues that this ability to segregate one’s audience should be translated to SNSs as well, especially since on SNSs most people do not have a clear overview of their audience. In order to achieve this, more attention should be given to the social dynamics of SNSs and to the contexts of users and their posts [43].

Facebook already provides some functionality to segregate one’s audience with the grouping feature. It allows users to separate their contacts into groups and to use these groups to grant or deny access to their content. Holweg [21, pages 43-45], however, has shown that even though 57% of the users know about the grouping feature, only 10% of all members use it. The rest state that it takes too much time, or that they do not understand the feature.

An important weakness of Facebook’s grouping feature becomes clear when translating it to a real world situation. In the real world audiences differ from moment to moment, from context to context, from message to message. Let’s say Bob has a group ‘classmates’ with which he communicates about school related matters: homework, gossip, etc. In this case he wants to invite classmates to play soccer after school. But he wants a serious match, immediately after school. Therefore he does not want to invite the girls, or boys who are not interested in soccer, or boys who live too far from school. In this example a group is narrowed down to a very specific subgroup that is only useful for this one message (about playing soccer close to school). With Facebook’s grouping feature, Bob would have to create an entire group just for one message.
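Bob’s one-off audience can be read as a filter over attributes of his classmates rather than as a new, permanent group. The sketch below illustrates that reading; the `Contact` fields, the sample classmates, and the 2 km distance cut-off are all hypothetical and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    gender: str
    interests: frozenset
    km_from_school: float


def select_audience(group, predicates):
    """Keep the contacts in `group` that satisfy every predicate."""
    return [c for c in group if all(p(c) for p in predicates)]


# Hypothetical members of Bob's 'classmates' group.
classmates = [
    Contact("Carol", "f", frozenset({"soccer"}), 1.0),
    Contact("Dave", "m", frozenset({"soccer"}), 0.5),
    Contact("Erik", "m", frozenset({"chess"}), 0.8),
    Contact("Frank", "m", frozenset({"soccer"}), 12.0),
]

# The conditions of this single message, taken from the example above.
soccer_invite = [
    lambda c: c.gender == "m",
    lambda c: "soccer" in c.interests,
    lambda c: c.km_from_school <= 2.0,  # hypothetical cut-off for "close to school"
]

audience = select_audience(classmates, soccer_invite)
print([c.name for c in audience])
```

The list of predicates exists only for this message, so no extra group has to be created or maintained afterwards.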


2.5 Conclusion

This chapter has laid out what SNSs are and how people use them. It has also addressed the different types of data that these sites manage and store.

The privacy settings that the SNSs provide are much the same across the sites. The widely offered ‘friends’ model makes it easy for a user to quickly select an audience for their content, but it does not allow the selection of detailed characteristics of the audience.

Experts state a need for the possibility to segregate audiences in SNSs, as is done in real life, where people adapt their message to their audience. Facebook provides some audience segregation with its grouping feature, with which members can create groups. But this feature shows some weaknesses. It fails to take into account that every message can have a very specific, one-time audience (as shown in the example in Section 2.4); it would require a user to create a new group for every such message. Additionally, many users do not use the grouping feature because it takes too much time or because they do not understand it.


3. Access Control

The previous chapter described what SNSs are, what privacy settings they support, and where these settings fall short. This chapter introduces different existing access control mechanisms to find out whether better mechanisms are available for social network sites.

Access control in computer science is a broad term for determining and regulating which activities legitimate users may perform on resources, and under what conditions [22]. Its main objective is to protect resources against undesired access [23, 35], but it can also be described in terms of the optimal and controlled sharing of information [22].

Within access control, a number of closely related and interdependent aspects are important: authentication, authorization, administration, and auditing [35].

Authentication: Authentication is the process in which the user’s claimed identity is verified. Usually, a user is authenticated based on ‘something the user knows’, such as a username or email address combined with a password or PIN code. Authentication methods can also be based on ‘something the user has’ (for instance a smartcard or phone number) or ‘something the user is’, where a user is authenticated based on unique physical attributes such as a fingerprint [11]. Most SNSs, such as Facebook and Hyves, authenticate users based on a username and password.

Authorization: The process of finding out whether the identified user is allowed to access a specific resource, and under what circumstances, is called the authorization process. In SNSs, this is usually determined by finding out whether the user is a friend of the resource owner, or whether the user is a member of a particular group within the SNS (Chapter 2). Sometimes [42], a distinction is made between authorization based fully on user attributes and authorization based on other criteria, which may or may not have anything to do with the attributes of the particular user. An SNS might, for instance, have a policy rejecting old and insecure web browsers, or disallowing authenticated users who are younger than 18 from visiting ‘adult’ places of the SNS. In this thesis, such a distinction is not made. Note that when authorization is based on a user’s identity, the effectiveness of the authorization relies strongly on proper user authentication [35].

Administration & Access Policies: In order to determine whether or not an (authenticated) user is allowed to perform an operation, the access policies defined within the system are consulted. An access policy is a set of rules describing which actions are allowed on specific objects, and by whom. The access policies are created and maintained by the system administrators. In some cases users are also permitted to set certain access policies. For instance, users of SNSs can create their own account along with access policies for their content (the ‘friends’ model from Section 2.4).

Auditing: According to Sandhu and Samarati [35], access control is not a complete security service without auditing. Auditing requires the system to log all user activities so that these can be used for later analysis. Audit controls are useful for detecting suspicious user behavior (i.e. break-in attempts as well as authorized users misusing their privileges) and possible security flaws in the system. Also, when users are aware that their actions are being recorded, they are likely to be discouraged from performing ‘illegal’ activities. In the case of SNSs, auditing does not directly influence the behavior of the system, so users will not notice whether or not it is used as a security mechanism.
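How the four aspects fit together in a single access request can be sketched as follows. This is an illustrative toy rather than the mechanism of any particular SNS: the data in `PASSWORDS` and `POLICIES`, the function names, and the use of an unsalted SHA-256 hash are assumptions, and a real system would use a salted, deliberately slow password hash.

```python
import hashlib
import hmac

# Auditing: every decision is recorded for later analysis.
audit_log = []

# Hypothetical toy credentials ('something the user knows').
PASSWORDS = {
    "alice": hashlib.sha256(b"secret").hexdigest(),
    "carol": hashlib.sha256(b"hunter2").hexdigest(),
}

# Administration: an access policy maps an object to the subjects allowed to read it.
POLICIES = {"party-invitation": {"alice", "bob"}}


def authenticate(username, password):
    """Authentication: verify the claimed identity."""
    expected = PASSWORDS.get(username)
    given = hashlib.sha256(password.encode()).hexdigest()
    return expected is not None and hmac.compare_digest(expected, given)


def authorize(username, obj):
    """Authorization: consult the access policy for the requested object."""
    return username in POLICIES.get(obj, set())


def request_read(username, password, obj):
    """Run one read request through authentication, authorization, and auditing."""
    if not authenticate(username, password):
        decision = "authentication failed"
    elif not authorize(username, obj):
        decision = "denied"
    else:
        decision = "granted"
    audit_log.append((username, obj, decision))
    return decision == "granted"
```

Note that authorization is only consulted after authentication succeeds, which mirrors the remark above that identity-based authorization depends on proper authentication.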

The problem of the Australian girl (from the example in Chapter 1) was that the birthday invitation was publicly readable, although she did not want everybody who could access the invitation to come. In this case, there was nothing wrong with the authentication and authorization functionality of Facebook. The incident happened because the girl did not create an access policy that would only allow her friends to see the message. This is clearly an administrative issue, with the result that authenticated users were wrongly authorized to read the message. Since the other incidents are also results of this type of issue, the focus lies on access policies and user authorization.

In the context of access control, widely used terms are subjects and objects [26]. A subject is an active entity (initiator) that issues actions upon other entities (targets). These active entities are not only persons but can also be applications. In the context of social network sites, we define a subject as an SNS user. Subjects perform actions on other entities. Such other entities can be other subjects or passive resources such as files or printers. These passive resources are called objects. In the context of SNSs, we define an object as content placed on the SNS.

3.1 Access Control Models

Park and Sandhu [31] and Sandhu and Samarati [35] both describe three main access control models: mandatory access control (MAC), discretionary access control (DAC), and role-based access control (RBAC). Hu et al. [22] make a distinction between discretionary and non-discretionary access control (NDAC), of which MAC and RBAC are part. Both Anderson [10][Chap. 4 and 8] and Yuan and Tong [47] make a clear distinction between MAC and DAC. However, they differ in that Yuan and Tong describe RBAC as primarily a DAC mechanism, whereas Anderson states RBAC to be a framework for mandatory access control. This shows that there is no obvious categorization of the different models. All three models have specific properties and handle different kinds of access policies. In general, it is not the case that one is ‘better’ than the other; this highly depends on the security requirements within the environment. For instance, using a model providing very strict access control policies may be inappropriate in situations where greater flexibility is required [35]. Lindqvist [26] and Sandhu and Samarati [35] mention that, in less specific cases, the different models can be used in the same environment.

3.1.1 Discretionary Access Control

Jordan [23] defines discretionary access control as: “A means of restricting access to objects based on the identity of subjects and/or groups to which they belong. The controls are discretionary in the sense that a subject with a certain access permission is capable of passing that permission (perhaps indirectly) on to any other subject.” As content owners control other users’ access permissions to their content, no system administrators are required to create and maintain access policies. Within DAC, individual subjects have complete control over objects they own. They can specify for each user and each object which modes (e.g. read, write, execute) are allowed, making DAC very flexible and fine grained. This, however, is also a weakness: since access is decided by content owners rather than through a system-wide policy, the policies within the system can become inconsistent [22].


Another drawback is that DAC does not provide real assurance on confidentiality as information can be easily leaked. For example, when Alice is allowed to read object O and Bob is not, Alice can copy the content of O and pass it to Bob without the approval or knowledge of O’s content owner [35]. Bob may now do the same, passing the content on to others without Alice’s knowledge.

DAC is a common form of access control used in Windows and Linux, where content owners can grant or deny access to other users or groups of users. Within SNSs, users are also in control of their own content, which is a desired property. However, they cannot specify for each individual user whether he or she is allowed to access some content.

3.1.2 Mandatory Access Control

Within Mandatory Access Control (MAC), security levels are assigned to all objects and subjects corresponding with the ‘value’ they represent for the organization. A user’s security level (also known as clearance level) corresponds with the trust the user gained within the organization and the security level of an object (also sensitivity level) specifies the level of trust required to access the object. The security level for both objects and subjects is an element of a hierarchically ordered set. In the military, for instance, such a set generally consists of ‘top secret’ (TS), ‘secret’ (S), ‘confidential’ (C), and ‘unclassified’ (U) where

TS > S > C > U

In this hierarchy, each level dominates itself and all levels below it. In addition to the hierarchical security levels, categories (e.g. finance, security) can also be associated with the objects and subjects. In this case, the user’s clearance level and the object’s sensitivity level each consist of a pair composed of a security level and a set of categories. This concept results in a finer grained security classification [35].

Confidentiality Based & Integrity Based

Read or write access depends on both the subject’s and the object’s security levels and on whether the model is focused on confidentiality or integrity preservation. To protect confidentiality, unauthorized disclosure of high-level data to lower levels must be prevented. This is expressed in the following two principles [35]:


Read down A subject’s clearance must dominate the security level of the object being read.

Write up A subject’s clearance must be dominated by the security level of the object being written.

In this case, when Alice has clearance level ‘secret’ she is able to read ‘down’, meaning Alice can read unclassified, confidential and secret data but is not allowed to read data marked as ‘top secret’. The ‘Write up’ rule only allows Alice to write data as ‘secret’ or ‘top secret’. Note that the information can only flow upwards. This example is illustrated in Figure 3.1. The integrity based model is exactly the opposite.

Figure 3.1: Controlling information flow for confidentiality. (Diagram: subjects and objects at the levels TS, S, C and U, connected by read and write arrows, with an ‘Information flow’ arrow.)
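The read down and write up rules can be sketched in a few lines of Python. This is a minimal illustration only: the names `LEVELS`, `dominates`, `can_read` and `can_write` are assumptions made for this example, and categories are left out.

```python
# Illustrative sketch only; the names are assumptions, not part of the cited models.
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}  # TS > S > C > U


def dominates(a: str, b: str) -> bool:
    """Level a dominates level b if a is at least as high as b."""
    return LEVELS[a] >= LEVELS[b]


def can_read(clearance: str, object_level: str) -> bool:
    # Read down: the subject's clearance must dominate the object's level.
    return dominates(clearance, object_level)


def can_write(clearance: str, object_level: str) -> bool:
    # Write up: the object's level must dominate the subject's clearance.
    return dominates(object_level, clearance)


# Alice has clearance 'secret'.
readable = [level for level in LEVELS if can_read("S", level)]
writable = [level for level in LEVELS if can_write("S", level)]
```

With Alice's clearance set to S, `readable` holds the levels at or below S and `writable` the levels at or above S, so information can only move upwards.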

Note that using the mandatory access control model, individual subjects do not own objects and cannot specify access policies for specific objects. The policies are created and maintained by the system policy administrator.

3.1.3 Role-Based Access Control

Traditionally, access rights are directly assigned to users or groups in the system. Role based access control (RBAC) [35, 36] is a model which takes a more real world approach by not assigning permissions directly to users but to roles. Assigning users to appropriate roles greatly simplifies the management of permissions. For example, if both Alice and Bob are software developers in a company, they both can be assigned to the role ‘software developer’ which corresponds with specific permissions. If the access rights for all software developers should change, only the permissions assigned to the role need to be adjusted rather than those of each user. Also, when Alice is promoted to ‘head software development’, the only thing to do is to change her role in the system.

Although role based access control simplifies the management of access policies, the role and policy assignments still need to be made, otherwise RBAC does not work. At the moment, SNSs do not support the use of roles and assigning specific policies to roles. Also, since users are not grouping their contacts [21], they are probably not interested in fulfilling the administrative tasks RBAC requires either.

Li et al. [25], however, propose ‘Role Based Access Control for Social Network Sites’, which is based on the idea that the ‘roles’ are closely related to the relationships or connections in SNSs. In order to work, these relationships should be specified for each contact, which is very cumbersome and therefore, again, probably not desired by SNS users [21].
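The Alice and Bob example can be made concrete with a small Python sketch. The permission names (`read_source`, `commit_source`, `approve_release`, `run_tests`) are hypothetical and only serve to show that permissions hang on roles rather than on users.

```python
# Illustrative sketch; role and permission names are hypothetical.
role_permissions = {
    "software developer": {"read_source", "commit_source"},
    "head software development": {"read_source", "commit_source", "approve_release"},
}
user_roles = {"Alice": "software developer", "Bob": "software developer"}


def is_allowed(user: str, permission: str) -> bool:
    """A user holds a permission only through the role assigned to them."""
    return permission in role_permissions[user_roles[user]]


# Changing the permissions of a role affects every developer at once.
role_permissions["software developer"].add("run_tests")

# Promoting Alice only requires changing her role.
user_roles["Alice"] = "head software development"
```

No per-user permission list is touched in either change, which is the administrative simplification RBAC aims for.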

3.2 Access Control Implementations

This section discusses a number of access control implementations. Each implementation is assessed on its usability within SNSs according to its specific properties.

3.2.1 Attribute Based Access Control

In attribute based access control (ABAC), access is granted based on attributes. Yuan and Tong [47] define three different types of attributes:

1. Subject attributes. A subject’s attribute is a characteristic of a subject (in SNSs a user) such as the name, date-of-birth, job title, number of friends, nationality, e-mail address, or roles associated with the subject.

2. Object attributes. An object’s attribute is a characteristic of an object (in SNSs the content: post, message, etc.) such as creation date, title, page views, and number of comments.

3. Environment attributes. Environment attributes are all attributes not directly related to subjects or objects. An example is the current date and time.

Because SNSs allow users to create a profile containing structured information like name, education, etc. (see Section 2.3), some attributes are already present in an SNS. But whenever Alice wants to post a message on an SNS, she still has to define what attributes are needed to allow access to this post. As we have already seen [21], users tend not to care about grouping contacts, which can be seen as a rough form of defining attributes, or find it too time consuming. This indicates that getting people to actively define attributes for each of their messages does not fit with how people behave in practice. However, using structured data within the user’s profile as attributes might be useful.
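To illustrate how the three attribute types combine into one decision, the following Python sketch evaluates a single made-up rule. The rule itself, the function name `abac_allows` and all field names are assumptions chosen for this example; they are not taken from Yuan and Tong.

```python
from datetime import datetime


def abac_allows(subject: dict, obj: dict, env: dict) -> bool:
    """Hypothetical rule: adults from the author's hometown may read a post
    during the first week after it was created."""
    age_ok = subject["age"] >= 18                               # subject attribute
    same_town = subject["hometown"] == obj["author_hometown"]   # subject vs. object
    fresh = (env["now"] - obj["created"]).days < 7              # environment attribute
    return age_ok and same_town and fresh


subject = {"age": 25, "hometown": "Eindhoven"}
post = {"author_hometown": "Eindhoven", "created": datetime(2011, 9, 1)}
env = {"now": datetime(2011, 9, 3)}
decision = abac_allows(subject, post, env)
```

The subject attributes here could come straight from the profile fields an SNS already stores, which is the reuse suggested above.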

3.2.2 Content Based Access Control

Previous access control mechanisms require the user or system to set access policies for all resources to make sure access to resources is regulated according to the policies. Content based access control (CBAC) [19, 20, 28] lets users select or specify policies and, when a post or message is added, the system automatically applies the right policy based on its content. To do this, CBAC requires the system to ‘understand’ what the content is about. When the system misinterprets the content, a wrong policy could be selected for this content, which is a potential security/privacy risk. Therefore, Hart et al. [19] rightly state that CBAC systems should never be used for high-security matters.

3.2.3 XACML

XACML stands for eXtensible Access Control Markup Language and is an XML based standard describing an access control policy language, an access request language, and a response language. The request language is used to form a query to ask whether a given action is allowed. Using the right access policy, a decision is made and a response is formulated which includes the answer whether the request should be allowed. In Figure 3.2 (taken from [29, Page 18]), this process is shown. The two most important functionalities (in this context) offered by XACML are [29]:

Policy combination XACML provides the functionality to combine policies which are specified independently. In order to come up with an authorization decision, the policies are combined to form a single policy applicable to the request. This might be useful when, for instance, an underage SNS user specifies a policy (e.g. by using the friends model) and the SNS also specifies a specific policy for content posted by underage users.

Figure 3.2: XACML access request flow. (Diagram: domain-specific inputs form an XACML request (Request.xml); the PDP evaluates it against the XACML policy (Policy.xml) and returns an XACML response (Response.xml), which yields domain-specific outputs.)

In order to combine these policies, multiple combining algorithms are supported enabling the possibility to define fine-grained policies.

Policies based on attributes XACML is able to allow or deny access to a resource based on attributes associated with the subject (e.g. name, age, location, gender) and resources (e.g. type of data, creation date). Using this functionality, XACML can be used to implement RBAC [29]. XACML includes a number of built-in functions for comparing attribute values and also provides a method for adding self-created functions. Using this functionality, all kinds of attributes can be compared.

For an in-depth explanation of all possibilities of XACML and the underlying architecture see [29].
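The effect of combining two independently specified policies can be sketched as follows. This is a simplified Python illustration of a deny-overrides style combination, not XACML syntax; it ignores the Indeterminate decision, and both example policies and all field names are hypothetical.

```python
# Simplified illustration of a deny-overrides style combination.
# This is not XACML syntax and it ignores the Indeterminate decision.
PERMIT, DENY, NOT_APPLICABLE = "Permit", "Deny", "NotApplicable"


def user_policy(request: dict) -> str:
    # Hypothetical friends-model policy set by the user.
    return PERMIT if request["is_friend"] else DENY


def sns_minor_policy(request: dict) -> str:
    # Hypothetical site-wide policy for content posted by underage users.
    if request["author_age"] >= 18:
        return NOT_APPLICABLE
    return PERMIT if request["viewer_age"] < 18 else DENY


def deny_overrides(decisions: list) -> str:
    """A single Deny wins; otherwise a Permit wins; otherwise nothing applies."""
    if DENY in decisions:
        return DENY
    if PERMIT in decisions:
        return PERMIT
    return NOT_APPLICABLE


request = {"is_friend": True, "author_age": 15, "viewer_age": 40}
combined = deny_overrides([user_policy(request), sns_minor_policy(request)])
```

In this request the viewer is a friend, so the user's policy permits access, but the site-wide policy denies it and therefore determines the combined result.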

3.3 Conclusion

We believe that traditional access control mechanisms are too cumbersome and inflexible to use in SNSs. SNS members do not use the current access control mechanisms [21] or are not capable of doing so. Therefore, we believe that access control in SNSs should be easy and intuitive. Content based access control might be a good extension of access control in SNSs since it automatically tries to understand the content of data. Using this context, the system can propose some predefined policies based on preferences and/or attributes.

4. Structured Data

This chapter explores in more depth the possibility of content based access control. To perform access control based on content, a system needs to be able to ‘understand’ what the content is about. It needs to be able to identify the topics of the content. This can be achieved by transforming the raw, unstructured content into meaningful topics by using a word matching algorithm. The topics can be defined by matching words from the content with words from an existing dataset.

For instance, when a user posts a message about his new Mercedes, somewhere in his text he might use the word ‘car’. Let us say that this word also exists in the dataset, so the content can be ‘tagged’ with the term ‘car’. The algorithm also finds a match for ‘Mercedes’, so we now have another, more detailed tag ‘Mercedes’. This might become even more detailed with a tag for the type of Mercedes. But what if the user uses the abbreviation ‘Merc’? To incorporate all these match-words in the dataset in a structured way, a certain categorization and hierarchy must exist in the dataset (e.g. vehicle ⇒ car ⇒ Mercedes). This chapter shows how such a dataset can be structured into a certain hierarchy using some examples.
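The Mercedes example can be sketched as a small Python word matcher. The dataset below is a hypothetical miniature (three tags and three match-words), and the names `PARENT`, `MATCH_WORDS` and `tags_for` are assumptions for illustration.

```python
# Hypothetical miniature dataset: each tag points to its parent (None = root).
PARENT = {"vehicle": None, "car": "vehicle", "mercedes": "car"}
# Match-words, including abbreviations, mapped to the tag they stand for.
MATCH_WORDS = {"car": "car", "mercedes": "mercedes", "merc": "mercedes"}


def tags_for(message: str) -> set:
    """Return every matched tag together with all of its ancestors."""
    tags = set()
    for word in message.lower().split():
        tag = MATCH_WORDS.get(word.strip(".,!?"))
        # Walk up the hierarchy: mercedes => car => vehicle.
        while tag is not None:
            tags.add(tag)
            tag = PARENT[tag]
    return tags


found = tags_for("Picked up my new Merc today!")
```

Because the abbreviation ‘Merc’ is stored as a match-word for the tag ‘mercedes’, the message receives the detailed tag as well as the broader tags above it in the hierarchy.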

4.1 Taxonomies & Ontologies

There are a number of ways to structure datasets. Two well known approaches, taxonomies and ontologies, are defined in this section. Definitions from literature vary [12, 18, 46], but here they are defined as follows:

Taxonomy

A taxonomy is a hierarchical structure containing a large number of tags. This hierarchy is organized such that tags at a higher level have a broader meaning than those at a lower level. This inheritance concept is also known as supertype-subtype or parent-child relations. Therefore, in these taxonomies, the subtype has at least the same properties and behaviors as the supertype. Because the subtype has to be distinct from the supertype, it also has at least one additional property. For example, a bird is a subtype of animal: it has all the properties of an animal with one or more additions, one of them being that this animal can fly.

Ontology

An ontology defines a relation between an instance and a class. A parrot is an instance of the class bird, making a parrot also an animal (as can be concluded from the taxonomy example). Of course, this is a one-way relation as not all birds are parrots. Other relations are also possible, for instance an ‘is-born-in’ relation (Mark Rutte is-born-in Den Haag (The Hague)). Note that using this definition, ontologies are not a hierarchical structure but rather a relational structure. For instance, Mark Rutte and Den Haag can also be linked by another relation type: Mark Rutte has-his-job-in Den Haag.
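One common way to hold such relational facts is as (subject, relation, object) triples. The Python sketch below assumes that representation; the relation name `instance-of` and the function `relations_between` are chosen for this example only.

```python
# Hypothetical facts stored as (subject, relation, object) triples.
TRIPLES = {
    ("parrot", "instance-of", "bird"),
    ("Mark Rutte", "is-born-in", "Den Haag"),
    ("Mark Rutte", "has-his-job-in", "Den Haag"),
}


def relations_between(subject: str, obj: str) -> list:
    """List every relation type that links subject to obj, in sorted order."""
    return sorted(r for s, r, o in TRIPLES if s == subject and o == obj)


links = relations_between("Mark Rutte", "Den Haag")
reverse = relations_between("Den Haag", "Mark Rutte")
```

The same pair of entities is connected by two relation types, and the lookup in the opposite direction finds nothing, which reflects the one-way nature of the relations.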

Because they also take into account relations between tags, ontologies can contain much more information than taxonomies alone. But they do require more resources. What is important to state here is that the focus of this thesis is not creating taxonomies or ontologies. It is to explore access control policies, and the terms help explain the driving characteristics of content based access control.

4.2 Existing Datastructures

This section presents some examples of existing datastructures. An analysis of them reveals possibilities for using them for the automated proposition of privacy policies.

WordNet

WordNet is a lexical database organized in synsets consisting of synonyms between English nouns (‘car’ and ‘automobile’), verbs (‘buy’ and ‘purchase’), adjectives and adverbs. WordNet is also a taxonomy describing hierarchical relations among synsets. The most used relation is the is-a relation (also called super-sub relation or hypernymy), which links more generic terms such as vehicle to more specific ones such as car. Hypernymy is transitive; a car is-a vehicle and a sports car is-a car, implying a sports car to be a kind of vehicle. Currently, the WordNet database contains 155,287 words organized in 117,659 synsets for a total of 206,941 word-sense pairs [5]. A limitation of WordNet is that it does not contain many named entities such as names of persons, books and places [38]. Another limitation is that WordNet only consists of English terms, such that datasets of WordNet cannot be used in combination with non-English data.

YAGO

Yet Another Great Ontology (YAGO) is a huge semantic knowledge base, initially derived from Wikipedia and WordNet [41]. Most of the data is obtained from Wikipedia using infoboxes¹ defined in articles and Wikipedia’s category system. The ontology is then created by combining the data obtained from Wikipedia with the structured taxonomic relations from WordNet. According to Suchanek et al., it contained one million entities (such as persons, organizations, cities) and 5 million facts about these entities in 2007, and according to the website², YAGO currently has knowledge of more than 10 million entities containing more than 80 million facts. 5864 facts of the created ontology were evaluated by human judges, confirming an accuracy of 95%³. As YAGO is created using WordNet, it is also limited to the English language.

Wikipedia

Wikipedia is a free and open encyclopedia project supported by the Wikimedia Foundation. Currently, volunteers have written over 19 million articles spread over 282 languages, making Wikipedia the largest and most popular general reference work on the internet [4]. As Wikipedia is maintained by a lot of volunteers, it contains up to date information related to all kinds of topics [38].

An interesting aspect of Wikipedia is its categorization and linkage between the articles. Articles can be assigned to one or more categories of related topics [16]. However, the structure and description of these relations are not as rich and consistent compared to WordNet or YAGO and do not form a taxonomy [33, 38].

Another useful functionality of Wikipedia is that its articles can link directly to other articles. As most of these links are created for easy browsing, they also represent some kind of relationship between the pages. The current Dutch Wikipedia counts 39,741,124 links from one article to another [45].

Wikipedia articles can also contain structured data such as infoboxes. An infobox is a template about a topic which can be used in Wikipedia articles to present an overview of relevant information concerning the article. For instance, there exists an infobox with the topic ‘Infobox university’. This template consists of some properties related to universities such as the name, the name of the rector, the number of students, etc. When an article about a university is written (e.g. ‘Eindhoven University of Technology’), this infobox can be used such that the properties related to the university infobox can be filled in. Figure A.1 shows the infobox of the Dutch Wikipedia article of Eindhoven. At the left, the source is given showing the structured information, and the right shows the visual representation of the same infobox in the web browser.

Liu et al. [27] found that 44.2% of the articles⁴ have infoboxes while 80.6% of the articles are related to at least one category. The Dutch Wikipedia of April 18th 2011 consists of 1,647,627 articles categorized in 90,625 different categories [45].

¹ Structured data inside an article. Infoboxes are explained in the Wikipedia paragraph of Section 4.2.
² http://www.mpi-inf.mpg.de/yago-naga/yago/ Last visited on July 22, 2011.
³ http://www.mpi-inf.mpg.de/yago-naga/yago/evaluation.html Last visited on July 22, 2011.

4.3 Conclusion

This chapter is meant as background information about data structures for content based access control. It introduces some existing datasets available on the Internet, each with their own properties.

WordNet is a database consisting of synonyms between English words and a taxonomy describing hierarchical relations among the synonyms. The main disadvantages are that it contains only English words and that it does not contain many named entities, which is something people tend to write about in SNSs.

YAGO is an ontology derived from Wikipedia and WordNet and contains more than 10 million entities. But, as YAGO is derived from WordNet, it suffers from the same disadvantage as WordNet: it only contains English data.

Wikipedia is an open online encyclopedia and its main focus is to hold summaries of all kinds of information from all kinds of branches. All this information is stored in articles which are linked to an internal category structure. A disadvantage is that this category structure is not as strict as the ones in WordNet or YAGO and does not form a taxonomy, and not even a tree, as some categories have multiple parent-categories [38]. Advantages, however, are the multilingual aspect of Wikipedia, as articles are written in all kinds of languages, and the wide variety of topics it contains.

⁴ Liu et al. used the English Wikipedia dump of January 3rd 2008.

5. Content Based Access Control

In Chapter 2 we introduced social network sites and we have seen what kind of data they store and the functionality they provide, and we introduced the friends model, which is the current access control mechanism most SNSs use. We concluded that the friends model is too limited in its functionality as users can only make data public, private or available for friends or ‘friends of friends’. Facebook extended this friends model by allowing users to separate their friends into meaningful groups, which can be used for access control. However, a study of Holweg shows that people are not interested in creating such groups or are just not aware of this possibility. Chapter 3 describes different access control models and their properties. We have seen that models exist where access is granted (or denied) based on attributes. As SNSs allow users to create an extended profile consisting of structured data (attributes), we believe this can be useful for an intelligent mechanism called ‘content based access control’, which is described in the next sections.


5.1 Current Model

The process of posting a message on the SNS is simple. First the user types a message and, without any interaction with the SNS, he (or she) is allowed to specify who should be able to view the message. In the case of Hyves and Facebook, the user can initially choose public, private, ‘Friends’ and ‘Friends of friends’, and in the case of Facebook the user can perform some additional actions to select a predefined group.

Figure 5.1: The process of posting a message using the friends model to define an access policy. (Diagram: the user types a message, selects an audience and posts the message to the system.)

After submitting, the message and the selected privacy policy are stored in the social network site. This model is visualized in Figure 5.1.

5.2 Content Based Model

As already explained, we believe that the friends model is too limited in its policies. The extended friends model is suitable but requires too many administrative actions such as grouping people and defining policies. In the content based model, policies are automatically proposed. These policies are based on message content and user attributes. This way, the number of administrative actions for the user is minimized. The only thing he/she has to do is to accept the policy or overrule it. In Figure 5.2 this model is visualized.

Figure 5.2: The process of posting a message using the content based model in order to propose an access policy. (Diagram: the user types a message; the system processes it and proposes a policy; if the user accepts, the message is posted; if not, the user selects an audience and the message is posted.)

For example, when a message contains words like ‘soccer’, ‘goal’, ‘corner’ then the attribute ‘soccer’ is attached and a policy ‘friends with matching attributes in their profile’ is proposed. Another example: when a 13 year old child sends a message containing a timestamp, location, or specific words like ‘invitation’ or ‘party’ then the attribute ‘private’ is attached and a policy ‘teenage friends living in the same area’ is proposed. These policies require user attributes like ‘date of birth’, ‘residence’, ‘hobbies’, ‘work/school’ etc. Most profiles cover these fields.

5.3 Policy Proposing Process

To automatically propose a policy we first need to attach a set of content related attributes (tags) to a message. Then we use these attributes to find the best matching policy. This process is shown in Figure 5.3.

Figure 5.3: Two important aspects of the content based access control model are creating attributes from the content and using these attributes to resolve an access policy. (Diagram: content is extracted from the message and a profile from the user; a first matching step against the default tags yields a set of attributes, and a second matching step against the policies yields the proposed policy.)

5.3.1 Determine Attributes

Selecting attributes consists of two parts. The first part is to compare the message content to user profile attributes (work, school, hobby, etc.) and determine if the message is related to one or more of them. The second part is to compare the message content to a ‘tag’ database. This database contains predefined attributes, each with a related set of words or phrases. An example is the attribute ‘soccer’ with the set {‘goal’, ‘ball’, ..}. The attributes found in both parts are attached to the message.
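The two parts can be sketched in Python as follows. The tag database, the profile and the function name `determine_attributes` are hypothetical stand-ins, and plain substring and word-set matching is assumed here in place of a real search engine.

```python
# Hypothetical tag database and user profile, for illustration only.
TAG_DB = {"soccer": {"goal", "ball", "corner"}}
PROFILE = {"work": "Philips", "school": "TU/e", "hobby": "soccer"}


def determine_attributes(message: str, profile: dict, tag_db: dict) -> set:
    text = message.lower()
    words = set(text.replace(",", " ").replace(".", " ").split())
    attributes = set()
    # Part 1: profile fields whose value is mentioned in the message.
    for field, value in profile.items():
        if value.lower() in text:
            attributes.add(field)
    # Part 2: tags whose word set overlaps with the words of the message.
    for tag, tag_words in tag_db.items():
        if words & tag_words:
            attributes.add(tag)
    return attributes


attrs = determine_attributes(
    "Great goal after the corner, see you at Philips.", PROFILE, TAG_DB
)
```

In this example the message mentions the employer from the profile and two words from the ‘soccer’ set, so both the profile attribute and the tag are attached.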

5.3.2 Select policy from attributes

We define a limited set of policies in a database. Each policy contains a set of attributes. The attributes of a message are compared to the attributes of the policies and the best match is selected as a proposal. A policy contains:

Priority When multiple matching policies are found, the one with the highest priority is selected.

Name The name is used as an ID; names must be unique.

Explanation Used to explain to the user what kind of message the system has concluded it to be.

Attributes A list of policy attributes which are relevant for this policy. A matching score with the message attributes is calculated by the system.

Rules Defines how the elements in the destination user profile are matched with the message attributes. For example, allow only users with the same interest.

6. Concept Testing

In previous chapters we introduced the concept of Content Based Access Control. In this chapter, we want to investigate the feasibility of this concept. In order to do this, we need the following:

• Process implementation for experiment

• Data

    – Message content
    – User profile content
    – Search terms related to attributes and tags

We decided to use Hyves as a data source for the user profile and message content as we assumed it mainly consists of Dutch content. Facebook (more internationally oriented) probably contains content in all kinds of languages. To automatically fetch search terms, we used the Dutch Wikipedia as it is the only data source mentioned in Chapter 4 with Dutch content.

6.1 Implementation for the Experiment

Chapter 5 (and in particular Section 5.3) explains the idea of Content Based Access Control in social network sites resulting in a policy proposal. In order to submit a proposal, first the attributes related to a message have to be determined. In a second step, these attributes are used to find the best matching policy. This section explains how these two steps are implemented for the experiment.

6.1.1 Attributes Determination

For the experiment, a limited set of attributes is searched for in the message. This section explains for each attribute what the conditions are for a match.

Figure 6.1: First step is to determine the attributes of message content. (Diagram: the policy proposing process of Figure 5.3.)

Date A ‘date’ attribute is attached to the content when any date is found. To find a date, regular expressions are used together with some static words such as ‘vandaag’, ‘morgen’ and ‘overmorgen’ (today, tomorrow, the day after tomorrow). The patterns the regular expression matches as a date can be found in Appendix B.

Time To attach the ‘time’ attribute to the content, regular expressions are used. The patterns can be found in Appendix B.

Invitation To determine if the attribute ‘invitation’ should be attached to the content, the tag database is used. In this database, all search terms related to ‘invitation’ are fetched and the content is searched for these terms.

Residence The content can also contain the hometown of the user. This is determined in two steps. The first step retrieves the search terms for home (‘home’, ‘thuis’) from the tag database and searches the content for these words (the method is similar to invitation). The second step retrieves the value of the field ‘hometown’ in the user profile and searches the content for this value. If a match is found, then attribute ‘residence[]’ is attached.

School The method to attach this attribute is similar to attribute ‘residence’. Content is searched for words related to school and for the values in the field ‘school’ of the user profile.

Work The method is similar to attribute ‘residence’ and ‘school’.


Child The child attribute is attached to the content when the field ‘date of birth’ in the user profile reveals that the user’s age is < 18.

Adult This method is similar to the attribute ‘child’ except that the user’s age is ≥ 18.

Interest Suppose the user has an interest in ‘voetbal’; then the tag database is used to select all search terms related to ‘voetbal’. If one of the search terms is found in the message content, then attribute ‘interest[voetbal]’ is attached. For this experiment, voetbal will be the only tag we are looking at.
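Two of the attribute checks above, ‘date’ and ‘child’/‘adult’, can be sketched in Python. The regular expression below is an assumed stand-in that only recognises dates written as dd-mm-yyyy; the patterns actually used are the ones listed in Appendix B. The function names are likewise hypothetical.

```python
import re
from datetime import date

# Assumed stand-ins: the actual patterns are the ones listed in Appendix B.
STATIC_DATE_WORDS = {"vandaag", "morgen", "overmorgen"}
DATE_PATTERN = re.compile(r"\b\d{1,2}-\d{1,2}-\d{4}\b")  # e.g. 24-09-2011


def has_date(message: str) -> bool:
    """True if the message holds a static date word or a dd-mm-yyyy date."""
    words = set(re.findall(r"\w+", message.lower()))
    return bool(words & STATIC_DATE_WORDS) or bool(DATE_PATTERN.search(message))


def age_attribute(date_of_birth: date, today: date) -> str:
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    age = today.year - date_of_birth.year - (0 if had_birthday else 1)
    return "child" if age < 18 else "adult"
```

Passing `today` explicitly keeps the age check reproducible; a real implementation would presumably use the moment of posting.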

For searching and matching, the open source application Solr¹, which is described in Section 6.1.5, will be used.

¹ http://lucene.apache.org/solr/, last accessed September 6, 2011.


6.1.2 Select Policy from Attributes

Figure 6.2: Second step is to use the attributes to select the best policy. (Diagram: the policy proposing process of Figure 5.3.)

When the attributes of the message content are determined, the best matching policy must be selected from a list of policies, which will be defined manually. As explained in Section 5.3.2, each policy has a priority and contains a list of mandatory attributes. The selection algorithm first preselects all policies of which the set of attributes is a subset of the content attributes. From this preselection, the policy with the highest priority is chosen and proposed to the user.
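The selection algorithm described above (subset preselection followed by highest priority) can be sketched in Python. The `Policy` class, the example policies and the convention that a larger number means a higher priority are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    name: str
    priority: int  # assumption: a larger number means a higher priority
    attributes: frozenset


def select_policy(policies: list, content_attributes: set) -> Optional[Policy]:
    # Preselect policies whose attributes all occur in the content attributes.
    candidates = [p for p in policies if p.attributes <= content_attributes]
    # Propose the candidate with the highest priority, if there is one.
    return max(candidates, key=lambda p: p.priority, default=None)


# Hypothetical policies; the real policy list is defined manually.
POLICIES = [
    Policy("friends", 1, frozenset()),
    Policy("friends with same interest", 5, frozenset({"interest[voetbal]"})),
    Policy("teenage friends living nearby", 9,
           frozenset({"child", "invitation", "date"})),
]
proposal = select_policy(POLICIES, {"child", "invitation", "date", "time"})
```

A policy with an empty attribute set is a subset of any content attributes, so in this sketch it acts as a fallback that is proposed when nothing more specific matches.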

6.1.3 Internal Datasets

Tag Database

This database needs attributes and, for each attribute, a set of related words. For instance, the attribute ‘voetbal’ (soccer) is related to a hierarchy of tags all related to soccer. Wikipedia contains a category structure which is used to fill this database. The top level categories are the attributes. The attribute related words are extracted from the lower levels.

Policies

We will manually fill this database with a few basic policies. We will tune the policy database after evaluation of the attributes we found in the messages.


6.1.4 External Datasets

The messages and the user profiles are fetched from Hyves and stored in a MySQL database. The message content (consisting of blogs and status messages) is also stored as an XML file so that Solr can index it.

6.1.5 Solr and Content Querying

Scanning the message content for search terms belonging to an attribute or tag is not a simple text comparison. Upper and lower case characters can be mixed, dashes and quotes can be present, and words can be misspelled (especially by young users).

Therefore, to find the search terms of an attribute in the message content, a tool called Solr is used. Solr is a fast, open source, standalone search platform from the Apache Lucene project; its major feature is powerful full-text search.

First, Solr indexes the message content after applying various text transformations, such as removing special characters, removing a given set of common Dutch words (de, het, een), lowercasing the text and breaking the text into single terms at whitespace characters. Then Solr determines a score, which indicates how well the search words match the message content.
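To make the text transformations concrete, the sketch below imitates them in plain Python. It is not the Solr analyzer chain: the function name `analyze` is hypothetical, special characters are replaced by spaces rather than configured per filter, and the transformations are applied in a slightly different order (lowercasing first) for simplicity.

```python
import re

# The common Dutch words named in the text; a real stop word list is longer.
DUTCH_STOPWORDS = {"de", "het", "een"}


def analyze(text):
    """Lowercase, drop special characters, split on whitespace
    and remove common Dutch words."""
    lowered = text.lower()
    # Replace everything that is not a word character or whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    return [term for term in cleaned.split() if term not in DUTCH_STOPWORDS]
```

With this normalization, ‘NOORD-brabant’ and ‘Noord --brabant’ both reduce to the same two terms, so they can be compared directly.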

Query             Match              Explanation
Noord-Brabant~    Noord-Brabant      Identical
                  NOORD-brabant      Case insensitive
                  Noord --brabant    Spaces and special characters
                  Noord Brabaant     Misspelled

Table 6.1: Word examples matching in Solr.

Table 6.1 shows an example of matches after searching for ‘Noord-Brabant~’. Note the tilde (~) at the end of the query, which means that a marginal error is allowed. Because Solr expresses the match result as a score, it is possible to allow only slight mismatches or also larger ones.
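The idea of tolerating a marginal error can be illustrated with a plain edit-distance comparison. This is a stand-in under stated assumptions, not Solr's scoring: the functions `edit_distance`, `similarity` and `fuzzy_match`, and the default threshold of 0.8, are hypothetical choices for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the edit distance to a value between 0.0 and 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def fuzzy_match(query, candidate, threshold=0.8):
    """Accept the candidate when it is similar enough to the query."""
    return similarity(query.lower(), candidate.lower()) >= threshold
```

Raising the threshold towards 1.0 accepts only slight mismatches, while lowering it admits larger ones.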


6.2 Preparation Datasets

As described in the previous section, we need four types of data to run the experiment. This section explains how the data is obtained. The tags dataset is obtained from Wikipedia; this process is explained in Section 6.2.1. Section 6.2.2 describes how Hyves is used to obtain messages and user profiles. Finally, a set of default policies is created (Section 6.2.3), which is used to match a set of attributes to a policy that can then be proposed.

6.2.1 Tag Data from Wikipedia

The Wikipedia data is available as a database dump and XML files. We used the Dutch Wikipedia dump of 18th April 2011 and downloaded the following SQL and XML files. Both SQL files contain the structure and content of one table, and the XML file contains structured content which can be linked to the database entries.

pages-articles.xml This is a large file (2.6 GB) with the content of all 1,647,627 pages of the Wikipedia dump. For convenience, the file is split into multiple XML files, each containing one wiki page.

page.sql This file contains the metadata of all pages, such as the page id, title, namespace, creation date etc. Using the namespace² we are able to recognize page types such as user pages, help pages, categories and articles.

categorylinks.sql In this table, each of the 1,909,183 entries defines a link between a parent and a child page.

Tags

To obtain the soccer search words, we need both tables stored in the MySQL database. The algorithm to create the tags consists of the following steps:

1. Choose the tag name which is in our case ‘Voetbal’.

2. Find the entry in the page table with category name ‘Voetbal’; this name is used as the tag.

² http://en.wikipedia.org/wiki/Wikipedia:Namespace, last visited September 1, 2011.


3. Use the categorylinks table to collect all child entries connected to category ‘Voetbal’. The child names are used as search text for the tag ‘Voetbal’.
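The three steps above can be sketched as a breadth-first walk over parent/child links. The sketch works on an in-memory list of (child, parent) name pairs instead of the MySQL tables, and the function name `collect_search_terms` and the `max_depth` limit are assumptions introduced for the example; the text does not say how many levels of the hierarchy were followed.

```python
from collections import deque


def collect_search_terms(tag, category_links, max_depth=2):
    """Collect the tag name and the names of its descendants.

    category_links -- iterable of (child_name, parent_name) pairs,
                      a simplified stand-in for the categorylinks table
    max_depth      -- hypothetical limit on how many levels to descend
    """
    children = {}
    for child, parent in category_links:
        children.setdefault(parent, []).append(child)

    seen = {tag}
    terms = [tag]
    queue = deque([(tag, 0)])
    while queue:
        name, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for child in children.get(name, []):
            if child not in seen:  # guard against cycles and duplicates
                seen.add(child)
                terms.append(child)
                queue.append((child, depth + 1))
    return terms
```

The `seen` set matters because a category graph is not guaranteed to be a tree: the same page can be reachable through several parents.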

6.2.2 Message and Profile Data from Hyves

To fetch Hyves data, a Hyves API can be used. We have built a PHP program around this API to extract the data. We only extracted message content and profile data.

Message Content

We recognize two types of message content. Blog posts in Hyves consist of a number of fields, such as blogid, title, body and the userid, which can be linked to the user profile. Via Crontab³ the program was executed automatically every day for a period of approximately one month, resulting in a total of 34,103 blog posts.

Figure 6.3: Example of a blogpost

Another type is the ‘who what where’ status message with fields such as shorttext and location. An example is given in Figure 6.4. In the same period, we fetched a total of 238,222 publicly available status messages.

Figure 6.4: Example of ‘who what where’ message

³ Crontab is a time-based job scheduler in Unix enabling users to schedule jobs to run periodically at certain times or dates.

Profiles

Each blog and status message in Hyves contains a userid which can be used to extract the user profile. This can be done with the same API. Hyves profiles contain fields like firstname, lastname, birthday, gender, residence, hobbies and schools. Notice that some fields, for example schools, can contain a list. In total, we fetched 126,211 profiles.

6.2.3 Policy data

We have chosen to manually create a limited set of policies based on the incidents given in Section 1.1.

Priority: 1
Name: Private Invitation (child)
Explanation: Your content is marked as an invitation message as it contains a date, time, location and specific words identifying an invitation
Attributes: date, time, location, age, (party OR invitation)
Rules: is-friend, vicinity (10 km), age (+/- 5)

Priority: 2
Name: Private Invitation (adult)
Explanation: Your content is marked as an invitation message as it contains a date, time, location and specific words identifying an invitation
Attributes: date, time, location, age, (party OR invitation)
Rules: is-friend, vicinity (200 km)

......

Figure 6.5: Part of the policy dataset

Note that these policies can be described using the XACML policy language as XACML is able to allow or deny access to a resource based on attributes associated with the subject and resource.

6.3 Analysis

6.3.1 Evaluation of the Wiki Data

Data size

The Dutch Wikipedia database was examined for the total number of categories and the number of categories related to voetbal. A total of 403 such items were found (some of them are listed in the next paragraph). This listing shows not only words containing ‘voetbal’, but also less obvious items like ‘Ger Lagendijk’, who is a football agent, and ‘Old Firm’, which is a collective name for two clubs in Glasgow (Celtic and Glasgow Rangers).

Description                     Count
Category links                  1,909,183
Categories                      90,625
Categories linked to voetbal    403

Table 6.2: Category data in the Wikipedia tables

Example of search terms on tag Voetbal

Voetbal, Voetbalcompetitie, Voetbalplaatjes, Voetbaltoernooi, Voetbalwet (Nederland), Bekervoetbal, Voetbalbestuurder, Voetbalbond, Voetbalclub, Voetbalcoach, Voetbaltrivia, AS Stade Mandji, Association of Football Statisticians, Footgolf, GEA World, Holland Belgium All Star Team, Kemari, Marveldtoernooi, Non-FIFA-voetbal, Soccerserie, Superleague Formula, US Bitam, Voetbaloorlog, Gehandicaptenvoetbal, Geschiedenis van het voetbal, Jeugdvoetbal, Voetbal voor landenteams, Voetballied, Voetbalmakelaar, Militair voetbal, Voetbalprijs, Voetbalregel, Voetbalscheidsrechter, Voetbalstadion, Supportersgeweld, Voetbalterminologie, Voetbalvariant, Voetbal (voorwerp), Voetballer, Vrouwenvoetbal, Voetbalwedstrijd, Hong Kong FC, Hong Kong Rangers FC, Negeri Sembilan FA, Seiko SA, South China AA, Voetbalopleiding, Zaalvoetbalclub, UEFA-stadionclassificatie, Arnos Vale Stadium, Estadio Nacional 12 de Julho, Voetbalstadion tijdens het Europees kampioenschap voetbal, Voetbalstadion tijdens het wereldkampioenschap voetbal, Karel Jansen, Ger Lagendijk, Vlado Lemic, Soren Lerby, Alain De Nil, Edwin Olde Riekerink, Arnold Oosterveer, Voetballer, Voetbaltrainer, Coaches Betaald Voetbal, Amélété Abalo, Hossam El-Badry, Takeshi Okada, Voetbalbondscoach, Atletiba, Brimstone Cup, Clássico dos Gigantes, Clássico dos Milhões, Clássico Vovô, Derby della Lanterna, Derby du Rhône, Derby Paulista, Derby van Barcelona, Derby van Spakenburg, Dodenwedstrijd, Easter Open, El Clásico, Fla-Flu, Herman Teeuwen Memorial, IJsselderby, Jaarbeursstedenbekerbeslissingswedstrijd, De Klassieker, La Partita del Cuore, Merseyside-derby, Mistwedstrijd, MLS Cup, New Firm, Nieuwjaarswedstrijd, Old Firm, Paratiba, Premier League (Lesotho), Premier League (Malawi), Texas Derby, Twentse derby, Europees kampioenschap voetbal.


6.3.2 Evaluation of the Hyves Data

Data size

Table 6.3 shows that the number of retrieved items is high enough to be representative. Only 9.27 percent of the users have a hidden profile (Table 6.4), so the user profile statistics are also assumed to be representative. Note that the user count differs between male and female, and that a large share of users (21.80 percent) did not specify a gender (Table 6.5).

Totals          Count
Blogs           34,103
Www messages    238,222
Profiles        123,161

Table 6.3: Items fetched from Hyves

Profile    Pcnt
Hidden     9.27%
Public     90.73%

Table 6.4: Ratio of public and hidden profiles

Gender     Pcnt
Male       34.75%
Female     43.44%
Unknown    21.80%

Table 6.5: Gender distribution of fetched profiles

User characteristics

Approximately 30 percent of the users are below 18, which suggests building special profiles for children and teenagers. The average friend count per user is 276. When all users with more than 500 friends are left out, the average friend count is still 200. It seems unlikely that every message is meant for all of these friends.

[Plot: number of users (vertical axis, 0 to 4500) by year of birth (horizontal axis, 1940 to 2020)]

Figure 6.6: Hyves users and their year of birth

[Plot: number of users (vertical axis, 0 to 450) by number of friends (horizontal axis, 0 to 1200)]

Figure 6.7: Hyves users and their friend count

User residence

Table 6.6 shows that only 54 percent of the users have defined a hometown. This affects profiles that take the attribute ‘residence’ into account. Note that Rotterdam, with approximately 600,000 inhabitants, has a higher score than Amsterdam, with 780,000 inhabitants (Table 6.7). However, we still regard the data as representative.

Hometown     Pcnt
Defined      54.06%
Undefined    45.94%

Table 6.6: Percentage of profiles with a hometown or city filled


Cityname     Pcnt
Rotterdam    1.17%
Den Haag     1.13%
Amsterdam    1.03%
Groningen    0.67%
Tilburg      0.64%

Table 6.7: Five most popular cities

User interest or hobby

This section shows the top 10 interests and the percentage of users who mentioned each interest. Because ‘voetbal’ is on top, this tag was used for matching against the message content.

Interest in sport    Pcnt
Voetbal              9.61%
Fitness              5.50%
Zwemmen              4.21%
Paardrijden          3.25%
Tennis               3.04%
Schaatsen            2.78%
Hardlopen            2.66%
Skien                2.56%
Snowboarden          1.91%
Volleybal            1.61%

Table 6.8: Ten most popular regions of interest in sport

Locations mentioned in www messages

Fewer than half of the www messages contain a location. It must be mentioned that many of these locations are not traceable, because they contain arbitrary text instead of the word ‘thuis’ (home) or a city name. It is not certain that this location information is usable by a policy.


Location field    Pcnt
filled            42.91%
empty             57.09%

Table 6.9: Percentage of www messages with location field filled

6.3.3 Attribute Matching Evaluation

Example of attributes found in messages

This section shows the percentage of blogs and www messages that are related to the specified tags. The tables also list the search terms or keywords that were used in the match algorithm.

Tag name       [body]    [title]    Keywords used
voetbal        1.58%     0.23%      voetbal, goal, doelpunt
thuis          4.37%     0.27%      thuis
school         8.48%     1.55%      school, universiteit, hbo, atheneum, havo, ROC, gymnasium
werk           11.61%    1.13%      werk, bedrijf
feest          2.78%     0.48%      feest, party
verjaardag     3.66%     0.33%      verjaardag, jarig
uitnodiging    0.48%     0.01%      uitnodiging, nodig uit, uitnodigen, uitnodig

Table 6.10: Percentage of blog-bodies and blog-titles containing tag keywords

Tag name       [text]    [location]    Keywords used
voetbal        0.61%                   voetbal, goal, doelpunt
thuis          2.42%     8.93%         thuis
school         3.19%     0.49%         school, universiteit, hbo, atheneum, havo, ROC, gymnasium
werk           6.72%     0.54%         werk, bedrijf
feest          3.05%                   feest, party
verjaardag     1.78%                   verjaardag, jarig
uitnodiging    0.06%                   uitnodiging, nodig uit, uitnodigen, uitnodig

Table 6.11: Percentage of www-texts and www-locations containing tag keywords
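The percentages in Tables 6.10 and 6.11 express how many messages contain at least one keyword of a tag. A minimal sketch of that computation is shown below; the function name `tag_percentage` is hypothetical, and a case-insensitive substring test is assumed in place of the Solr query used in the experiment, so the numbers it produces would not necessarily equal those in the tables.

```python
def tag_percentage(messages, keywords):
    """Percentage of messages that contain at least one of the keywords."""
    if not messages:
        return 0.0
    lowered = [keyword.lower() for keyword in keywords]
    hits = sum(
        1 for message in messages
        if any(keyword in message.lower() for keyword in lowered)
    )
    return 100.0 * hits / len(messages)
```

Note that a substring test also counts ‘werken’ as a hit for ‘werk’, which may or may not be desirable depending on the tag.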

Tables 6.12 and 6.13 show the date and time matching on the blog posts and status messages. In Appendix B the regular expressions are shown corresponding to the column Keywords used.

Attribute name    [text]    Keywords used
date              0.04%     regular expression: YYYY MM DD
date              0.32%     regular expression: DD MM YYYY
date              0.87%     regular expression: textual
time              0.17%     regular expression: time ‘uur’
time              0.02%     regular expression: time abbreviation uur/hour
time              0.04%     regular expression: time ‘pm am’
time              0.85%     regular expression: time
date              3.76%     vandaag
date              5.68%     vandaag 0.8
date              4.22%     morgen
date              6.33%     morgen 0.8
date              12.51%    [all date keywords]
time              1.05%     [all time keywords]

Table 6.12: Percentage of www-texts containing date and time patterns based on regular expressions

Attribute name    [body]    [title]    Keywords used
date              2.70%     0.17%      regular expression: YYYY MM DD
date              1.40%     0.66%      regular expression: DD MM YYYY
date              5.08%     1.34%      regular expression: textual
time              0.45%     0.01%      regular expression: time ‘uur’
time              0.03%     0.00%      regular expression: time abbreviation uur/hour
time              0.15%     0.01%      regular expression: time ‘pm am’
time              2.93%     0.58%      regular expression: time
date              3.46%     0.18%      vandaag
date              5.34%     0.32%      vandaag 0.8
date              1.76%     0.07%      morgen
date              3.52%     0.15%      morgen 0.8
date              39.97%    2.62%      [all date keywords]
time              3.44%     0.61%      [all time keywords]

Table 6.13: Percentage of blog posts containing date and time patterns based on regular expressions
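To give an impression of what such date and time patterns can look like, the sketch below defines a few regular expressions in Python. They are illustrative assumptions only and are not the expressions from the Solr configuration in Appendix B; the pattern labels and the function name `find_patterns` are likewise hypothetical.

```python
import re

# Hypothetical patterns, loosely following the row labels of the tables.
PATTERNS = {
    "date: YYYY MM DD": re.compile(r"\b\d{4}[-/. ]\d{1,2}[-/. ]\d{1,2}\b"),
    "date: DD MM YYYY": re.compile(r"\b\d{1,2}[-/. ]\d{1,2}[-/. ]\d{4}\b"),
    "time: 'uur'": re.compile(r"\b\d{1,2}([:.]\d{2})?\s*uur\b", re.IGNORECASE),
    "time": re.compile(r"\b([01]?\d|2[0-3]):[0-5]\d\b"),
}


def find_patterns(text):
    """Return the labels of all patterns that occur in the text."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}
```

A message that yields both a date label and a time label is a candidate for the date and time attributes required by the invitation policies.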

7. Conclusion

7.1 Conclusion

The goal of this graduation project was to find out whether it is possible for SNSs to automatically generate appropriate privacy policies based on message content and user profiles and, if so, how this can be realized in practice.

First, extensive research was done, including an analysis of existing SNSs and a detailed look at the available access control mechanisms. The research phase has led to the proposal of a new method driven by content based access control. The method automatically proposes an appropriate privacy policy for a user’s content. It does so by analyzing both the content of the message and the author’s profile and extracting attributes that match those from an existing dataset. This set of attributes is then matched with an existing dataset of privacy policies to determine the most appropriate policy.

The first step of this method (extracting attributes from user content) was tested in an experiment using data from Hyves and Wikipedia. The results show that it is possible to distill attributes from an example message. The process of selecting the best policy is highly dependent on the success of attribute determination. This step remains theoretical and is yet to be investigated in an experiment.

7.2 Recommendations

Although the initial results are promising, the proposed method needs more development before it can be tested in the real world. First, the process of privacy policy determination needs to be tested and validated. Second, both steps need to be integrated into one method to simulate a real-world situation. This simulation might reveal problems or aspects that were overlooked, which can then be addressed in further development.

A. Interesting Data in Wikipedia

It may be interesting to select users within a maximum distance from the originator of a message. A location dataset would be useful in this case. Such a dataset can also be retrieved from Wikipedia data. In order to support policies using location based attributes, research was done to find out whether Dutch locations could be extracted from Wikipedia. Using the ‘infoboxes’¹ embedded in Wikipedia articles, a total of 12 provinces and 5411 towns and villages were extracted. Only a few of them lack GPS coordinates.

{{Infobox gemeente Nederland
| naam = Eindhoven
| bestandsnaam vlag = Eindhoven flag.svg
| bestandsnaam wapen = Eindhoven wapen.svg
| vlagartikel = Vlag van Eindhoven
| wapenartikel = Eindhoven
| locatie = LocatieEindhoven
| provincie = Noord-Brabant
| burgemeester = [[Rob van Gijzel]]
| hoofdplaats = '''Eindhoven'''
| breedtegraad = 51°26
| lengtegraad = 5°28
}}

Figure A.1: Wikipedia’s infobox of Eindhoven and a screenshot of its visualization.
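Extracting a location from such an infobox can be sketched as follows. This is a naive illustration under stated assumptions: the function names `parse_infobox` and `to_decimal_degrees` are hypothetical, the parser splits on every `|` and therefore does not handle nested templates or piped links, and coordinates are assumed to be in the degrees°minutes form shown in the figure.

```python
import re


def parse_infobox(wikitext):
    """Parse '| key = value' pairs from a single, flat infobox."""
    body = wikitext.strip()
    if body.startswith("{{") and body.endswith("}}"):
        body = body[2:-2]
    fields = {}
    # The first part is the template name, the rest are key/value pairs.
    for part in body.split("|")[1:]:
        key, separator, value = part.partition("=")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def to_decimal_degrees(value):
    """Convert a 'degrees°minutes' string such as '51°26' to a float."""
    match = re.fullmatch(r"(\d+)°(\d+)", value.strip())
    if match is None:
        return None
    degrees, minutes = (int(group) for group in match.groups())
    return degrees + minutes / 60.0
```

Returning `None` for an unparsable coordinate makes it easy to count the locations for which no usable GPS coordinates are specified.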

                    Found    GPS coordinates specified
Province            12       100%
Towns & Villages    5411     93%

Table A.1: Statistical overview of Dutch locations found using infobox data.

¹ http://en.wikipedia.org/wiki/Help:Infobox

B. Solr Matching

B.1 Regular Expressions

Part of the Solr configuration that enables finding dates and times using regular expressions.

<fieldType name="RegexDateTime" class="solr.TextField">

