Research Collection

Master Thesis

Social network portability and enhancement of the Origo platform

Author(s): Weiss, Ulrich Andreas

Publication Date: 2008

Permanent Link: https://doi.org/10.3929/ethz-a-005675938

Rights / License: In Copyright - Non-Commercial Use Permitted

Social Network Portability and Enhancement of the Origo Platform

Ulrich Andreas Weiss 03-911-690 Master Thesis

Chair of Software Engineering Department of Computer Science ETH Zürich

Dr. Till G. Bay Prof. Bertrand Meyer

March 2008 - August 2008

Abstract

Social networking graphs are often secluded and encapsulated in a single social network, unable to interact with or connect to other social graphs. Such networks are often referred to as “walled gardens”, and that is exactly what most popular social networks are. This master thesis gives an overview of the most important data portability standards and of what should be taken care of to improve the accessibility of data and information, eventually connecting the social graphs. These standards and concepts are analyzed and outlined critically, highlighting their privacy, authenticity and security implications. The thesis contains two parts: a theoretical part about data portability and social networks, and a practical part about enhancements to the Origo [1] platform.

Acknowledgment

I thank the entire Origo team for all their help and commitment to the project. Special thanks go to my supervisor Till Bay and to Dominique Schneider and Dennis Rietmann for their great support with Origo. Further, I thank family and friends for their continuous encouragement throughout my studies and this thesis.

Contents

Abstract
Acknowledgment

1 Social Network Portability
  1.1 Introduction
  1.2 Data Portability
    1.2.1 Some Standards
      1.2.1.1 FOAF
      1.2.1.2 OpenID
      1.2.1.3 Microformats
      1.2.1.4 OAuth
      1.2.1.5 SIOC
      1.2.1.6 DOAP
      1.2.1.7 RSS and Atom
  1.3 Existing Web Services
    1.3.1 MySpace
    1.3.2 Facebook
    1.3.3 Google Friend Connect
    1.3.4 Google OpenSocial
    1.3.5 Google Social Graph API
    1.3.6 Gravatar
    1.3.7 FriendFeed and Plaxo
    1.3.8 Proofile
  1.4 Conclusions
    1.4.1 Identity
    1.4.2 Authenticity
    1.4.3 Privacy
    1.4.4 Finding Friends

2 Origo
  2.1 Introduction
    2.1.1 Back-end
      2.1.1.1 Nodes
    2.1.2 Front-end
  2.2 Social Networking Integration
    2.2.1 Profile
      2.2.1.1 Profile Import
    2.2.2 FOAF
    2.2.3 OpenID
    2.2.4 SIOC
    2.2.5 DOAP
    2.2.6 Future Work
  2.3 Origo Enhancements
    2.3.1 Tooltips
    2.3.2 Password Recovery and Account Activation
    2.3.3 Project Information
    2.3.4 Origo Maintenance and Minor Enhancements
  2.4 Origo Issues
    2.4.1 Latency
    2.4.2 Application Programming Interface Development
    2.4.3 Scalability and Redundancy
  2.5 Concluding Remarks

Bibliography

CHAPTER 1

Social Network Portability

1.1 Introduction

The first chapter of this thesis is about portability of social networks and data portability in general. Some of the major technologies and services are introduced and analyzed.

Historically, social networks have been centralized, secluded islands; many people call them walled gardens. These web sites bind users to their services by making it hard to leave, and by requiring visitors to register in order to interact with existing users. Friendships in a social network, and in the real world for that matter, can be viewed as a graph of connections between real people. Every service requires its users to recreate their social graphs and user profiles over and over again, and a user’s online identities are not connected with each other in any way. To connect and interact with their friends and acquaintances, users need to join numerous social networks, which requires a lot of inconvenient and repetitive manual input.

Many have tried to tackle these problems; many incompatible standards have been defined and none has prevailed. This might change in the near future as some standards emerge, driven by the Data Portability Workgroup [2] and the companies supporting it.

1.2 Data Portability

The Data Portability initiative was created to promote the idea that individuals maintain control over their data by determining how it can be used and accessed. 1 The main idea is that users should be able to control what data can be used by whom and in what manner. The group does not seek to develop new standards; rather, it pursues these goals by promoting existing standards that enable data portability, by encouraging the development of these standards, and by identifying new standards that are required to fulfill the data portability vision.

The vision is that data can be shared and remixed across the borders of web sites. The owner of the data should be able to control who has access to it, and that access should not be limited to the place where the data was initially uploaded. Application Programming Interfaces have been developed and are being used by the user community, e.g. to create mashups of aggregated data. This is already a big step in the right direction; however, it is neither the end of the story nor the solution to all the problems.

1.2.1 Some Standards

I give a very brief introduction to some of the most important data portability and social networking technologies and standards, along with their implications.

1.2.1.1 FOAF

Friend of a Friend [3] (FOAF) is a Resource Description Framework [4] (RDF) vocabulary designed to describe people and the connections between them. RDF is a method for modeling information and meta-information, commonly serialized in an XML [5] format. Abstract concepts or meta-models build the foundation of this mechanism, denoting the traits of subject-to-object relations. RDF is a major component of W3C’s Semantic Web activity, which tries to create, exchange and use machine-readable information in a distributed fashion.

1 Some parts quoted from

The main aspects of the FOAF specification are the user’s profile data, links to other people the user knows, and identities or accounts that the user holds. Listing 1.1 shows an example of what a FOAF file could look like.

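Listing 1.1: A sample FOAF file

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">

  <foaf:Person>
    <foaf:name>Ueli Weiss</foaf:name>
    <foaf:mbox_sha1sum>937086423dcf54b784ec740aea134dfe4e879828</foaf:mbox_sha1sum>
    <foaf:openid>http://uweiss.myopenid.com/</foaf:openid>

    <foaf:holdsAccount>
      <foaf:OnlineAccount>
        <foaf:accountName>uweiss</foaf:accountName>
      </foaf:OnlineAccount>
    </foaf:holdsAccount>

    <foaf:knows>
      <foaf:Person>
        <foaf:name>John Doe</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>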

This effectively establishes a decentralized directed graph of connections between people. The mbox_sha1sum value is a SHA1 hash of the person’s e-mail address (more precisely, of its mailto: URI), which can be used as a unique identifier in the same way as an OpenID (cf. 1.2.1.2) identifies a person. These graphs can be crawled, or queried on one of the existing services.
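To illustrate how such an identifier is derived, the following Python sketch computes an mbox_sha1sum value. It assumes that the hash is taken over the mailbox URI including the mailto: prefix, as the FOAF vocabulary describes; the function name and the example address are made up for illustration.

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """Return the SHA1 hex digest of the mailbox URI for an e-mail address."""
    # Assumption: the digest covers the full mailbox URI ("mailto:" + address).
    mailbox_uri = "mailto:" + email.strip()
    return hashlib.sha1(mailbox_uri.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical address; the result is a 40-character hexadecimal string.
    print(mbox_sha1sum("jane@example.org"))
```

Anyone who already knows an address can recompute the digest and compare it, which is what makes the value usable as an identifier without publishing the address in clear text.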

The decentralized nature of this makes it possible for anyone to add bogus entries to the network or claim identities that are not actually theirs. This phenomenon has been observed on the web for many years, with search engine bombing, domain squatting, bogus web page networks being only some of the many examples. RDF search engines will have to deal with these problems, just like regular search engines deal with them on the web currently, assuming the semantic web ever gains enough popularity to attract spammers, or other malicious entities.

The first countermeasure that comes to mind is blacklisting. Blacklisting domains or documents is easy, but how do we ensure that only malicious content is denied? If spammers misuse a social network and generate bogus users, we cannot simply block the whole domain, since most user accounts are not malicious. In the end, unfortunately, not much more than spam detection techniques can be applied.

Advogato [6] tackles this issue by establishing a trust metric [7] for its users. Members can certify their trust in other members, leading to a reputation graph, or graph of trust. The algorithms are based on the distance to a seed node (a trusted source) and on network flow algorithms that create a cut of the graph, separating “good” from “bad” nodes. These ideas have helped Advogato prevent spamming and trolling; however, the solution does not work for web-sized problems, because it does not scale well and a decentralized trust ranking system is hardly feasible for this very general use case.
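The distance-to-seed idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Advogato's actual metric: it only performs a breadth-first search from the seed and accepts members within a cut-off distance, leaving out the network flow computation entirely. The function name, the graph representation and the member names are assumptions made for this example.

```python
from collections import deque


def trusted_members(certifications, seed, max_distance):
    """Return {member: distance} for members within max_distance of the seed.

    certifications maps a member to the members he or she has certified.
    """
    distances = {seed: 0}
    queue = deque([seed])
    while queue:
        member = queue.popleft()
        if distances[member] == max_distance:
            continue  # do not extend trust beyond the cut-off distance
        for certified in certifications.get(member, ()):
            if certified not in distances:
                distances[certified] = distances[member] + 1
                queue.append(certified)
    return distances


if __name__ == "__main__":
    graph = {
        "seed": ["alice"],
        "alice": ["bob"],
        "bob": ["carol"],
        "mallory": ["sybil"],  # bogus accounts certifying each other
        "sybil": ["mallory"],
    }
    print(trusted_members(graph, "seed", 2))
```

Accounts that only certify each other, without being certified by anyone reachable from the seed, never enter the result, which is the property that makes such metrics resistant to bogus entries.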

1.2.1.2 OpenID

OpenID [8] is a set of protocols for digital identities, providing a single sign-on solution. The protocols allow users to create online identities that can be used on any OpenID-enabled web site1, called a Relying Party (RP). OpenID Providers (OP)2 issue a unique URL, which can be used for signing in to many web sites. OpenID is an open, decentralized and user-centric solution: users can choose which OpenID provider they want to entrust with their online identity. The openness of the protocol also means that anybody can run their own OpenID provider, since no central authority is required to register OpenID providers or relying parties.

In essence, OpenID does two things for the user. Firstly, you can log in to multiple services with just one single OpenID and password, so you don’t have to remember all your usernames and passwords. Secondly it is a means of authenticating your identity – only you have access to this particular OpenID, which means that if your OpenID is displayed on your user profile, visitors can verify your identity with your other accounts.

The UML sequence diagram3 in figure 1.1 will walk you through the authentication protocol.

1 As of June 2008, there are more than 11’000 websites that act as OpenID relying parties

2 A list of the most popular providers can be found on

3 Diagrams generated using

Figure 1.1: OpenID sequence diagram

The authentication process is initiated by the end-user, who passes his OpenID identifier (usually a URL) to the Relying Party. Let’s go through an example where I log into Origo with my OpenID, uweiss.myopenid.com, which I acquired on my OP’s website. In the first step of the protocol I provide my identity by entering uweiss.myopenid.com into the login form. In step 2, the Relying Party normalizes this OpenID, obtaining the claimed identifier http://uweiss.myopenid.com/, and looks up the OP, which is referenced in the HTML source code of the identity page1.

1 example quoted from

Listing 1.2: OpenID Provider delegation

<link rel="openid.server"
      href="http://myprovider.com/openid/server" />
<link rel="openid.delegate"
      href="http://john.myprovider.com/" />
<!-- If the provider supports OpenID 2.0, also add: -->
<link rel="openid2.provider"
      href="http://myprovider.com/server" />
<link rel="openid2.local_id"
      href="http://john.myprovider.com/" />
<!-- If the provider supports OpenID 2.0 & XRDS (VeriSign & LinkSafe), also add: -->
<meta http-equiv="X-XRDS-Location"
      content="http://myprovider.com/oid_xrds/=ve7jtb" />
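The normalization and lookup performed by the Relying Party in step 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it only handles URL identifiers (no XRIs), it only reads link elements from the HTML page and ignores XRDS discovery, and the function and class names are chosen for this example rather than taken from any OpenID library.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


def normalize_identifier(user_input: str) -> str:
    """Turn user input such as 'uweiss.myopenid.com' into a claimed identifier URL."""
    identifier = user_input.strip()
    if "://" not in identifier:
        identifier = "http://" + identifier  # assume HTTP when no scheme is given
    parts = urlsplit(identifier)
    path = parts.path or "/"  # an empty path becomes "/"
    # The fragment is dropped; host names are case-insensitive.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


class OpenIDLinkParser(HTMLParser):
    """Collect OpenID-related <link> elements from an identity page."""

    def __init__(self):
        super().__init__()
        self.endpoints = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # A rel attribute may carry several space-separated values.
        for rel in (attributes.get("rel") or "").split():
            if href and rel.startswith(("openid.", "openid2.")):
                self.endpoints[rel] = href


def discover(html: str) -> dict:
    """Return a mapping such as {'openid.server': '...'} found in the page."""
    parser = OpenIDLinkParser()
    parser.feed(html)
    return parser.endpoints


if __name__ == "__main__":
    page = '<link rel="openid.server" href="http://myprovider.com/openid/server" />'
    print(normalize_identifier("uweiss.myopenid.com"))
    print(discover(page))
```

A real Relying Party would fetch the page at the claimed identifier over HTTP before parsing it; the fetch is left out here so that the sketch stays self-contained.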

The RP can establish an association with the OP using Diffie-Hellman key exchange, resulting in a shared secret. This secret is used for subsequent communication between the RP and the OP. The end-user is then redirected to the OP. Steps 5 and 6 authenticate the user with the OpenID Provider by requesting the password. This is a regular login process, and the login may also be saved in a cookie for the user’s convenience. The OP then asks the user whether he trusts the RP; the user can deny the authentication, accept it once, or accept it for multiple sessions. The user is then redirected back to the RP with either an assertion that authentication is approved or a message that authentication has failed. If the authentication is approved, the user can be logged into the RP.
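How both sides arrive at the same secret without ever transmitting it can be shown with a small Python sketch of the Diffie-Hellman exchange. The parameters below are toy values chosen for this illustration (the Mersenne prime 2^127 - 1 and base 5); they are an assumption of this sketch, far too small for real use, and not the values an OpenID association uses.

```python
import secrets

# Toy parameters for illustration only (assumption of this sketch).
P = 2**127 - 1  # a known Mersenne prime
G = 5


def make_keypair():
    """Return (private, public) where public = G^private mod P."""
    private = secrets.randbelow(P - 3) + 2  # uniformly chosen from [2, P - 2]
    public = pow(G, private, P)
    return private, public


def shared_secret(own_private: int, peer_public: int) -> int:
    """Combine one's own private value with the peer's public value."""
    return pow(peer_public, own_private, P)


if __name__ == "__main__":
    rp_private, rp_public = make_keypair()  # Relying Party
    op_private, op_public = make_keypair()  # OpenID Provider
    # Only the public values cross the network; both sides compute the same secret.
    print(shared_secret(rp_private, op_public) == shared_secret(op_private, rp_public))
```

The equality holds because (G^a)^b and (G^b)^a are the same value modulo P, while an eavesdropper who sees only the two public values has to solve a discrete logarithm to recover it.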

Now that you know how OpenID works and what it does, you should also be aware of its implications. An OpenID is a key to all the services you use; everybody who has this key can log in to those accounts, so you had better not lose it. Phishing is one of the many possible attacks by which your login credentials can be stolen. A phisher will create a site that gets you to log in using your OpenID and then redirects you to his own page, disguised as your very own OpenID server, which is easily done by fetching the content of your provider’s login screen. Most users are not aware of these issues and don’t actively look at the URL to verify that it is actually the correct address. Even if you do look at the browser’s address bar, the authentication process is only as secure as DNS itself, which can be hijacked using DNS IP spoofing or DNS cache poisoning. This is not technically a problem of the OpenID protocol, but the fact that the stolen credentials can be reused on so many sites makes OpenID more vulnerable. Some OpenID extensions [9] that leverage existing phishing prevention techniques, such as Windows Cardspace, have been developed. This solution requires users and Relying Parties to use or enable more services. Regular users are highly unlikely to use any of these extensions; most of them aren’t even aware of phishing at all. Phishing can only be prevented from the OP’s side, because there is no way a user can determine whether the RP is trustworthy and not compromised in any way; a simple cross-site scripting vulnerability suffices to compromise the integrity of the RP. VeriSign Lab’s Personal Identity Provider offers a browser extension that helps to detect and prevent phishing attacks, the only promising solution I have seen so far.

There are some serious privacy problems with OpenID. The obvious one is that your online accounts can be associated with each other if the RP displays your OpenID. Relying parties are therefore encouraged to allow the user to control whether his OpenID URL is disclosed. Another fundamental privacy issue is that your OpenID provider can track all the websites you log into; even worse, it can log into them and view your private data. You can argue that this is just a trust issue, but ask yourself, do you really trust all the employees who work for your OpenID provider? What about security vulnerabilities or viruses on the OP’s side?

The third major issue is availability. When your OpenID Provider is down, you will not be able to log into your regular services, to which you probably already have forgotten your usernames and passwords, because you never use them. The more interconnecting services are involved, the smaller the probability that all of them work at the same time.

1.2.1.3 Microformats

Microformats [10], sometimes referred to as µf, are a set of simple and open data formats designed to annotate content with semantic information, targeting very specific problems. Simplicity is a key feature: anybody can add microformats to existing solutions without having to reorganize the whole existing infrastructure.

XFN: With the XHTML Friends Network [11] (XFN), web links can express the relationship between the linking page and the linked page. Adding a rel="friend met" attribute to an <a> tag lets visitors and web services know that the linked URL belongs to a person who is a friend of the person at the linking page. This meta-information should only be added when both URLs “belong” to real people, otherwise very odd conclusions can be derived by aggregating this false information 1.

hCard and vCard: vCard [12], which is not a microformat, is a standard for electronic business cards and forms the basis for hCard [13], which is a microformat; that is why it is introduced here. vCard has its roots in corporate e-mail attachments and has been broadly implemented in address books and e-mail clients. The v in vCard stands for Versit, a consortium that consisted of Apple, AT&T, IBM and Siemens and aimed at creating industry standards. Typical data contained in a vCard are the name, organization, addresses, telephone numbers or e-mail addresses of a single person. The hCard specification reflects these values in XHTML notation.

Other Microformats: There are more microformats, some of the more notable ones being for calendars (hCalendar), licensing (rel-license) or tags (rel-tag). For full lists and specifications, please see the official microformats website.
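As an illustration of how a web service might read XFN annotations, the following Python sketch extracts the relationship values from the links of a page. It is a minimal example that assumes well-formed anchor tags; the class name and the sample URLs are invented for this purpose.

```python
from html.parser import HTMLParser


class XFNParser(HTMLParser):
    """Collect (href, relationship values) pairs from <a rel="..."> links."""

    def __init__(self):
        super().__init__()
        self.relationships = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        rel = attributes.get("rel")
        if href and rel:
            # rel holds space-separated values, e.g. "friend met".
            self.relationships.append((href, set(rel.split())))


if __name__ == "__main__":
    parser = XFNParser()
    parser.feed('<a href="http://example.org/john" rel="friend met">John</a>')
    print(parser.relationships)
```

Each extracted pair corresponds to one directed edge of the social graph, from the person who owns the linking page to the person at the linked URL.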

1 also see the Google Social Graph API in section 1.3.5

1.2.1.4 OAuth

OAuth [14] is an open and standardized protocol for secure API authentication and API access delegation. API access delegation means that a user, who has full access to a resource via his login credentials, would like to delegate access to a consumer service. Valet keys are a great analogy for this concept. You, the owner of a car (resource), would like to grant the valet (consumer) the ability to drive the car no further than 1 kilometer (restricted access), by giving him a valet key (access token). A real scenario where OAuth is being used would be Google contact information and address books. Your Google credentials do not have to be disclosed to third parties anymore, if you would like to search for your friends on the new social network you just joined. The OAuth protocol lets you give the consumer service restricted access to your account, only letting it read your contact information for a short period of time.

The authentication process is nothing radically new; it is simply an open standard built on well-known concepts, such as RSA public-key cryptography for authenticated communication. OAuth authentication is done in three basic steps. In the first step, the consumer requests and receives an unauthorized request token. In the second step, this token has to be authorized by the user. Usually this is done in a browser: the user enters his credentials on the service provider’s web page. The protocol requires the user to authenticate himself directly with the service provider, which ensures that the consumer service never sees the credentials and therefore cannot obtain full access. This part is important to understand, and the official website does not make it clear enough. OAuth is primarily designed for use on the web; where no browser is available, the authorization would have to take place via alternate channels such as SMS or e-mail, making the authentication process very frustrating and tedious. In the third and final step, the consumer sends the authorized request token to the service provider, who in turn returns the access token and access token secret.
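The three steps can be mimicked with a small in-memory Python model. This is only a sketch of the token life cycle under simplifying assumptions: there is no HTTP transport and no request signing, and the class and method names are invented for this illustration rather than taken from the OAuth specification or any library.

```python
import secrets


class ServiceProvider:
    """Toy model of the provider side of the three-step token exchange."""

    def __init__(self):
        self._request_tokens = {}  # request token -> authorized by the user?
        self._access_tokens = {}   # access token -> access token secret

    def issue_request_token(self) -> str:
        """Step 1: hand an unauthorized request token to the consumer."""
        token = secrets.token_hex(8)
        self._request_tokens[token] = False
        return token

    def authorize(self, request_token: str, user_logged_in: bool) -> None:
        """Step 2: runs on the provider's own page; the consumer never sees credentials."""
        if user_logged_in and request_token in self._request_tokens:
            self._request_tokens[request_token] = True

    def exchange(self, request_token: str):
        """Step 3: trade an authorized request token for an access token and secret."""
        if not self._request_tokens.pop(request_token, False):
            raise PermissionError("request token was not authorized by the user")
        access_token = secrets.token_hex(8)
        token_secret = secrets.token_hex(16)
        self._access_tokens[access_token] = token_secret
        return access_token, token_secret


if __name__ == "__main__":
    provider = ServiceProvider()
    request_token = provider.issue_request_token()
    provider.authorize(request_token, user_logged_in=True)
    print(provider.exchange(request_token))
```

Because the request token is removed when it is exchanged, it cannot be redeemed a second time, and a token the user never authorized is rejected.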

One concern about the protocol is the lack of user control. If the service provider simply implements the protocol, the user has no idea whatsoever what access rights he is granting the service consumer, and for how long. I outline some best practices that service providers are recommended to follow.

Service providers should:

• Let the user know or specify exactly what access rights are being granted to the consumer
• Let the user specify a time-frame for this access token
• Give an overview of what access tokens have been given to which consumers
• Allow the user to revoke these tokens at any time

Service providers may:

• Inform the user when access tokens are being used and what for
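The recommendations above can be made concrete with a short Python sketch of the bookkeeping a service provider would need. It is an illustration only: the class, the scope strings such as "contacts:read" and the consumer name are hypothetical, and a real provider would persist the tokens instead of keeping them in memory.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass
class AccessToken:
    consumer: str
    scopes: frozenset      # the access rights explicitly granted by the user
    expires_at: float      # end of the time-frame chosen by the user
    revoked: bool = False


class TokenRegistry:
    def __init__(self, clock=time.time):
        self._tokens = {}
        self._clock = clock  # injectable so that expiry can be tested

    def grant(self, consumer, scopes, lifetime_seconds):
        key = secrets.token_urlsafe(16)
        expires_at = self._clock() + lifetime_seconds
        self._tokens[key] = AccessToken(consumer, frozenset(scopes), expires_at)
        return key

    def is_allowed(self, key, scope):
        token = self._tokens.get(key)
        if token is None or token.revoked:
            return False
        return scope in token.scopes and self._clock() < token.expires_at

    def overview(self):
        """List active grants so the user can see which consumer holds what."""
        return [(t.consumer, sorted(t.scopes)) for t in self._tokens.values() if not t.revoked]

    def revoke(self, key):
        token = self._tokens.get(key)
        if token is not None:
            token.revoked = True
```

A consumer request is then only served if is_allowed returns True for the requested scope, which covers the restriction, the time limit and the revocation in a single check.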

1.2.1.5 SIOC

The SIOC (Semantically-Interlinked Online Communities) [15] project is an effort to make data available on the semantic web in a structured and standardized RDF format. It is an additional layer on top of the web, interconnecting data (e.g. blog posts, forum entries, mailing lists or users) over a broad spectrum of sites and formats and capturing its structure. Distributed conversations can be captured, making finding pingbacks 1 as simple as entering a SPARQL [16] query, just to give an example. The biggest problem with SIOC is that hardly anybody actually uses it, and even fewer enable interconnection on a wider scale.

1.2.1.6 DOAP

DOAP (Description of a Project) [17] offers a way to describe open-source projects in an RDF vocabulary, making the descriptions interoperable across projects. The initial release defines internationalizable descriptions, licensing, the programming languages used, supported operating systems and references to important resources such as the bug tracker, the repository and its web front-end representation.

1 A pingback is a method for notifying a blog that it has been referenced by another website or blog

1.2.1.7 RSS and Atom

Web syndication feeds have been growing steadily in popularity over the years. In essence, a feed is a standardized XML format for representing consecutive updates added to a resource. It is a (push-)pull system, much like POP3 1 is, which means that the documents have to be continually polled for updates. This continual polling has caused several popular feeds to be shut down due to bandwidth and request processing limitations. Feeds can only scale well if all or most of the readers and publishers respect the best practices for syndication hints [18], HTTP features or caching.
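One HTTP feature that reduces the cost of polling is the conditional GET: the reader sends back the validators it received earlier and the server answers with status 304 and an empty body if nothing has changed. The Python sketch below illustrates the client side. The choice of this particular feature is mine, and the fetch callable is a hypothetical stand-in for a real HTTP client so that the example remains self-contained.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build the request headers for a conditional GET from cached validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def poll_feed(url, cache, fetch):
    """Return the feed body if it changed since the last poll, otherwise None.

    fetch(url, headers) is assumed to return (status, response_headers, body).
    """
    entry = cache.get(url, {})
    request_headers = conditional_headers(entry.get("etag"), entry.get("last_modified"))
    status, response_headers, body = fetch(url, request_headers)
    if status == 304:
        return None  # not modified: nothing to download or parse
    # Remember the validators for the next poll.
    cache[url] = {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
    }
    return body
```

With such a scheme an unchanged feed costs the publisher only a short header exchange per poll instead of a full transfer of the document.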

1.3 Existing Web Services

This section introduces a few selected social networks and web services from a data portability and social network portability viewpoint. The first two, MySpace and Facebook, probably won’t need a long introduction; they are the largest and most influential English-language social networks 2.

1.3.1 MySpace

MySpace has not had a reputation for being very innovative or open. The announcement in May 2008 [19] to support data portability standards was anything but expected. They have finally made a bold move, hopefully leading others in the right direction. MySpace is trying to regain some ground lost to competitors such as Facebook by allowing users to share their data, giving them an incentive to save their data on MySpace. This all sounds great, but a closer look at the Developer Terms of Service [20] reveals more.

6.2 [...] You will not, and will not permit any person, directly or indirectly, to [...] distribute, sell, resell, lease, license, sublicense or transfer or otherwise make the Developer Platform or any User Information available to third parties (including by storing the Developer Platform or User Information in any manner which would enable a third party to access it (other than in the case of the User from which such User Information was retrieved ("Originating User")).

1 Post Office Protocol version 3, a protocol for retrieving e-mail messages from a server

2 as of June 2008

...

7.2.9 unless data is otherwise designated in writing by MySpace, you may not export any User Information and must cease using and delete any User Information or other MySpace Content within 24 hours after the time which you obtained such data;

Effectively they are prohibiting other web services from even displaying 1 the data, except to the originating user. So much for data portability.

1 because displaying makes the user information available to third parties

Here’s a nice quote from the official release note [19], great stuff!

As the developer of an independent website you can now enable your users to leverage the power of their social data outside of the MySpace.com domain. Our users spend hours updating and making changes to their profiles, uploading content, and building friend relationships. With your help that data can now be available to MySpace users no matter where they go on the .

Unfortunately, most users will remain ignorant of these facts, or they simply do not care. To me, this looks very much like an attempt to expand their walled garden.

1.3.2 Facebook

On the 8th of January 2008, another great publicity stunt: the headline reads “Facebook, Google and Plaxo join the DataPortability Workgroup”. Six months later, the only commitment to data portability has been hCard annotations on public 1 profiles. Facebook, the mother of all walled gardens, is slightly more open than MySpace with respect to its terms of service [21]. Although they do have an API for external applications, the only data that may be stored for more than 24 hours are user IDs, linked content IDs, note counts and profile update times. User relations may be displayed, but not stored. What would happen if social graph data is displayed by one client and then parsed and stored by a third party? Surely a Facebook lawyer will tell you exactly what happens; after all, Facebook cannot afford to lose its greatest asset, the social graph.

Seemingly in response to MySpace’s announcement [19], Facebook Connect was announced [22] one day later, with Google following another three days later with Google Friend Connect [23]. Not much information about the concepts of Facebook Connect has been released, but it seems to be very similar to MySpace’s endeavors. The sudden movement is motivated by these companies realizing that they will eventually have to offer some sort of hub for profile information and friend connections; otherwise they risk losing control over the user data and the invaluable user traffic.

1.3.3 Google Friend Connect

Google will launch its own data portability effort called Friend Connect. Web developers can add a Friend Connect widget to their (possibly static) web pages, adding pluggable functionality such as interaction with friends, comments, social networking 2 and OpenSocial integration. Third-party developers can develop applications on top of Friend Connect. Soon after the initial release, Facebook banned Google Friend Connect’s API access to the Facebook platform due to an alleged violation of its terms of service and privacy policy. Google and Facebook claim to be in contact, trying to resolve the issues.

1 very brief profiles listed to be crawled by search engines

2 upon initial release Friend Connect was using Google, MySpace and Facebook APIs

1.3.4 Google OpenSocial

OpenSocial is a platform for social applications, i.e. gadgets, to be hosted on any social network or website that implements the API, containing interfaces to information such as friends or profile information. A website hosting an OpenSocial API is called an OpenSocial container. Google’s business model of having an open interface, as opposed to the other big players who have had applications on their closed and proprietary platforms for a long time, also has the benefit of distributing their framework throughout the whole web, reaching many more people. Facebook in response started developing an interoperable interface for its applications in order to keep up with Google.

1.3.5 Google Social Graph API

Google has been developing an API [24] for the global and public social graph that is created by the open standards FOAF and XFN. The team, led by Brad Fitzpatrick1, has set up an API for developers to query and search nodes in the social graph. For example, one could query for an e-mail address and search where it is referenced from, or ask who has claimed ownership by adding an XFN attribute rel="me", a FOAF attribute mbox_sha1sum or the cleartext attribute mbox. The Social Graph API uses a canonical representation of URLs, which maps all the URLs of the same user on one social network to one unified canonical form, beginning with “sgn://”. This is done using the SocialGraph Node Mapper (SGNodemapper) [25], a community project written in JavaScript.

1.3.6 Gravatar

Gravatar [26] is mentioned here because of privacy implications that web sites and users might not be aware of. Gravatar is an abbreviation for globally recognized avatar; the service associates e-mail addresses with an avatar. Whenever you post on a Gravatar-enabled blog or web site with an e-mail address that is registered with Gravatar, your avatar will be displayed with the post. Since the image URL contains a hash sum of your e-mail address, all your posts and web accounts using Gravatar can be linked together.

1 http://bradfitz.com
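The linkability of Gravatar images follows directly from the way the image URL is built, as the following Python sketch shows. It assumes the commonly documented scheme in which the address is trimmed, lowercased and hashed with MD5; the function name and the example address are made up.

```python
import hashlib


def gravatar_url(email: str) -> str:
    """Build the avatar URL from the MD5 digest of the normalized address."""
    normalized = email.strip().lower()
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    return "https://www.gravatar.com/avatar/" + digest


if __name__ == "__main__":
    # The same address always yields the same URL, on whatever site it is used.
    print(gravatar_url("Jane@Example.org") == gravatar_url(" jane@example.org "))
```

Anyone who collects avatar URLs from different sites can therefore group posts by author, and anyone who knows or guesses an e-mail address can recompute the digest and confirm which posts belong to it.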

1.3.7 FriendFeed and Plaxo

FriendFeed [27] and Plaxo [28] are very similar, so they are introduced together. These services allow users to link the other feeds they generate, such as blogs, uploaded images, Twitter messages or planned travels on Dopplr. Plaxo was originally a platform for synchronizing calendar entries and other personal data and has grown into a full-blown social network and synchronization platform. FriendFeed allows users to keep up to date with their friends’ online activities by presenting them with feed aggregations, generally known as lifestreams.

1.3.8 Proofile

This is a side project I have been working on in my spare time. Proofile [29] aims to fill a gap in social graphs: the lack of inter-user connections. It is based on the open standards FOAF, XFN and hCard and is to date one of the few sites on the web with full FOAF and hCard support. Proofile connects the online identities of a user, so that they can be associated with each other, for example by using the Social Graph API or other semantic web search engines. Most of the data portability standards have been integrated already, building a solid foundation for applications that use these connections.

1.4 Conclusions

In the past six months a lot has happened in the area of data portability, social networking and social graphs. Big corporations have started opening up their social graphs, pushing to get their content onto external web sites and expanding their frontiers. I expect smaller social networks to follow the data portability movement by adding FOAF support, building a counterweight to the big players. The vision of a global social graph is still far away, but sure to come.

1.4.1 Identity

Many organizations have tried to establish online identification methods, but none of them was ever really used on a large scale, except maybe Microsoft Passport [30], which is limited to Microsoft’s own services and a handful of other supporting services. OpenID is well on its way to becoming a global identification method, with over 11’000 OpenID-enabled sites.

1.4.2 Authenticity

Online, anybody can claim anything. Verifying statements and detecting bogus entries is hard, making an open social network very prone to malicious activities such as spam. E-mail is the best example of an open system that is vulnerable in this very respect. I can hardly wait for all the spim 1 pouring into XMPP [31] as soon as it becomes popular enough; luckily, spammers always lag a few years behind new technology.

1.4.3 Privacy

By enabling all these new services, many users unwittingly trade privacy for access. They join service after service, making their private information publicly available for everybody to see. The internet is a one-way host of information: there is no guarantee that you will ever be able to remove information once it has been made available.

1 Spam on instant messengers

1.4.4 Finding Friends

An area not covered in this thesis is finding friends. Initially I had planned to implement an open-source friend finding service, but this idea was dropped due to time constraints and other, more important goals. The most obvious algorithm is to compare a set of friends, identified by e-mail address or OpenID, with the entries available on the system on which the friends are searched for. The set of identifiers can be imported from other social networks or contact lists, returning a list of matching users. Another algorithm counts the number of mutual acquaintances you have with somebody else: the more mutual acquaintances, the more likely it is that you know each other. These features have come to be expected and are supported by most major social networks.

CHAPTER 2

Origo

2.1 Introduction

Origo [1, 32] is a distributed development platform. It is open, modular, extensible and integrates various tools. It consists of a back-end and a front-end, which are separated. The back-end is responsible for controlling and managing data; the front-end is responsible for displaying data and information retrieved from the back-end. The interaction between front-end and back-end is done using API [33] calls, which are implemented as XMLRPC calls [34].
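To illustrate what such an API call looks like on the wire, the following minimal sketch builds and decodes an XMLRPC request with Python’s standard library. The method name `user.retrieve_information` appears later in this chapter; the parameter order (session key first, then user name) and the helper `build_request` are assumptions made for this example, not Origo’s documented signature.

```python
import xmlrpc.client


def build_request(method_name: str, session_key: str, *arguments) -> str:
    """Serialize one API call as an XMLRPC request document."""
    # Assumption: the session key travels as the first positional parameter.
    return xmlrpc.client.dumps((session_key, *arguments), methodname=method_name)


def parse_request(document: str):
    """Decode a request document back into (method name, parameters)."""
    params, method_name = xmlrpc.client.loads(document)
    return method_name, params


if __name__ == "__main__":
    request = build_request("user.retrieve_information", "some-session-key", "alice")
    print(request)
```

The front-end would send such a document over HTTP to an API node, which decodes it in much the same way as `parse_request` before constructing its internal message.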

An important feature of Origo is the notion of a work item. Work items can be used by any Origo user to keep track of progress in their own projects as well as in bookmarked projects.

2.1.1 Back-end

The back-end of Origo is a middleware architecture and control infrastructure for the Origo platform. The back-end contains programmed use cases that direct the interaction with the different services. The main design goals were very good scalability and extensibility. For more detailed information about the back-end cf. [35].

2.1.1.1 Nodes

The back-end consists of nodes of different types, each having a different role and function. Nodes work in a P2P environment which makes it easy to add and remove nodes dynamically to and from the network. The inter-node communication uses a reliable message passing layer implemented as a JXTA service [36] using VamPeer [37].

For each node type, it is possible to have multiple actual nodes running. This allows good scalability and redundancy. Nodes can be addressed directly using their name.

Core node: Message bus and controller of the back-end. It controls the other nodes according to user defined use cases by sending control instructions.

API node: Provides an XMLRPC interface for Origo using Goanna [38]. For each incoming XMLRPC, a specific message is constructed and sent to the Core node. An API node can be started in a normal mode and in an internal mode. The internal mode provides special services that should only be available for the front-end.

Config node: Used to execute configuration scripts and generate configuration and access control files directly on the server. This is used e.g. for the FTP and Subversion [39] access rights management.

Storage node: Responsible for managing all Origo related data. Internally a MySQL [40] database is used to store data persistently.

Mail node: Allows sending of mails to Origo users. It can use a local or remote mail server to deliver mails.

Build node: Starts a compilation of a given project. Not used at the moment.

2.1.2 Front-end

The front-end of Origo is based on the Drupal [41] content management system. It provides all necessary functionality to manage a project in Origo using the back-end. Drupal has a flexible extension system for including custom themes and modules. Custom modules use hooks to interact with the Drupal core and internal processes. Each Origo project is a separate Drupal site which theoretically could have its own special modules and themes. Each Origo project also has its own Drupal database, which allows the platform to scale.

An important feature is a single sign-on mechanism which allows a user to browse every project without having to log in every time. This is necessary because each project has its own Drupal site and therefore its own database and user table. The single sign-on mechanism uses two cookies to store a Drupal session in one cookie and an Origo user name together with the encrypted password in the other cookie. For more details cf. Section 3.1 in [42].

The main page is Origo Home where work items of owned and bookmarked projects are displayed. This page is independent of the currently browsed project and retrieves its data directly from the back-end using API calls. For more detailed information about the front-end cf. [42].

2.2 Social Networking Integration

In this section the technologies from Chapter 1 are assumed to be known; basic concepts will not be repeated here. The main goal of this master thesis was initially set to be data portability of social networks, starting off with “Thoughts on the Social Graph” [43] by Brad Fitzpatrick¹ and David Recordon². Their main concerns are the portability of social networks and the non-existence of decentralized social graph solutions. This was in August 2007, and a lot has happened since then. In the end I settled on some standards that are supported by the Data Portability [2] project, which was initiated in November 2007. I concluded that these standards form a good set of technologies needed to realize the vision of a decentralized social network.

¹ http://bradfitz.com
² http://davidrecordon.com

Origo has very basic concepts of friendship and communities [44], which are limited to simply displaying a list of friends on the profile page. This was a very nice entry point for my work on integrating some data portability standards.

2.2.1 Profile

A first step in making Origo’s data more accessible was to enable profile markup by annotating the data with hCard microformatting. This is not as easy as it might seem. At the time, Origo’s user profiles were only viewable by logged-in users, which made hCard annotations useless, because web services would not even be able to parse them.

The Origo API did not support any anonymous requests, which means that only logged-in users could view data on the Origo front-end originating from the back-end. This problem was solved by adding a new pseudo-session for anonymous users and associating it with an access role. From now on, API calls could be issued without requiring a valid session key. If a blank session is provided, the API node checks the access rights for anonymous users for the particular call.
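The check described above can be sketched as follows. This is a minimal illustration in Python, not Origo’s Eiffel implementation: the role names, the contents of the access table and the helper names are assumptions made for the example; only the two call names are taken from the Origo API mentioned later in this chapter.

```python
# Hypothetical access table: API call name -> roles allowed to invoke it.
ACCESS_RIGHTS = {
    "user.retrieve_information": {"anonymous", "user"},
    "user.change_information": {"user"},
}

# Hypothetical session store: session key -> role of the logged-in user.
SESSIONS = {"abc123": "user"}


def resolve_role(session_key: str) -> str:
    """Map a session key to a role; a blank key selects the anonymous pseudo-session."""
    if not session_key:
        return "anonymous"
    if session_key not in SESSIONS:
        raise PermissionError("unknown or expired session")
    return SESSIONS[session_key]


def is_allowed(call_name: str, session_key: str) -> bool:
    """Check whether the role behind the session may invoke the given call."""
    role = resolve_role(session_key)
    # Calls missing from the table are denied for every role.
    return role in ACCESS_RIGHTS.get(call_name, set())
```

With this layout an anonymous visitor can read a profile (`is_allowed("user.retrieve_information", "")`) but cannot change it, while unknown calls are denied by default.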

The profile pages were cleaned up such that all the values could be changed by new “[edit]” links. Hooks for external modules to hook into the profile were added¹. Furthermore, the profile pages have also been annotated with XHTML Friends Network (XFN, cf. 1.2.1.3) and a vCard (cf. 1.2.1.3) profile download is available.

2.2.1.1 Profile Import

As a part of my thesis, I implemented a profile importer [45] to be used on Origo, written in PHP and distributed on Origo as open-source software. Users can import their personal data into their profiles by launching the JavaScript application. A URL pointing to an hCard-enabled profile may then be entered into the appropriate field. This URL is parsed by the back-end using hkit [46] and the result is presented to the user, who may then click on the fields they would like to have imported.

¹ This is currently being done by the FOAF module

Figure 2.1: The profile importer on an Origo profile page

The PHP Profile Importer is based on PHP 5 [47], hkit [46], JQuery [48] and SimpleModal [49].

The retrieving of information is divided into 5 steps:

1. An asynchronous JSON-encoded [50] JavaScript message containing the requested URL is sent to the back-end.
2. The back-end sends the URL through W3’s Tidy Proxy [51] and receives a cleaned up HTML file.
3. This HTML file is parsed using PHP 5’s new HTML parsing¹ functionality.
4. hkit then gathers all the hCard data and sends back a JSON object to the browser.
5. The profile importer can now display the hCard fields.
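Steps 3 and 4 above can be sketched in a few lines. The importer itself is written in PHP on top of hkit; the following Python version only illustrates the idea under simplifying assumptions: the input is already cleaned up HTML, only a small subset of hCard properties is collected, and properties are matched anywhere in the document rather than strictly inside a `vcard` root. The class `HCardParser` and the function `extract_hcard` are names invented for this example.

```python
import json
from html.parser import HTMLParser

# Subset of hCard property class names collected by this sketch.
HCARD_FIELDS = {"fn", "given-name", "family-name", "email", "url", "org"}


class HCardParser(HTMLParser):
    """Collect the text of elements whose class attribute names an hCard property."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open = []  # stack of (tag, hCard property or None) for open elements

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        prop = next((c for c in classes if c in HCARD_FIELDS), None)
        self._open.append((tag, prop))

    def handle_endtag(self, tag):
        # Pop up to and including the matching tag; this also discards
        # void elements such as <br> that never receive an end tag.
        while self._open:
            open_tag, _ = self._open.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text belongs to every enclosing hCard property (e.g. fn and given-name).
        for _, prop in self._open:
            if prop:
                self.fields[prop] = (self.fields.get(prop, "") + " " + text).strip()


def extract_hcard(html_text: str) -> str:
    """Return the collected hCard fields as a JSON object, as in step 4."""
    parser = HCardParser()
    parser.feed(html_text)
    return json.dumps(parser.fields)
```

The JSON string returned by `extract_hcard` corresponds to the object that the browser receives in step 5 and turns into selectable fields.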

When generating the profile importer’s source code, all the field IDs have to be associated with their specific hCard types. For example, if we have a text field with the ID “firstnamefield”, this ID has to be associated with the appropriate hCard notation, “given-name”. This associative array is passed as a parameter to the JavaScript-generating function. Note that these generated files can be cached, so that they don’t have to be generated every single time somebody visits a page containing a profile importer.
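A small sketch of such an association, written in Python for illustration. Only the pair “firstnamefield” and “given-name” comes from the text above; the other field IDs and the helper `values_for_form` are hypothetical.

```python
# Mapping from form field IDs to hCard property names.
# Only the first entry is taken from the text; the others are hypothetical.
FIELD_TO_HCARD = {
    "firstnamefield": "given-name",
    "lastnamefield": "family-name",
    "emailfield": "email",
}


def values_for_form(hcard: dict, mapping: dict = FIELD_TO_HCARD) -> dict:
    """Return {field ID: value} for every mapped property present in the hCard data."""
    return {
        field_id: hcard[prop]
        for field_id, prop in mapping.items()
        if prop in hcard
    }
```

Properties that have no corresponding form field are simply ignored, so the same parsed hCard can feed forms with different sets of fields.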

Integrating the Profile Importer with Origo was a rather straightforward task, except for some JQuery incompatibilities with the version² of JQuery that is used by Drupal. The profile importer can be enabled or disabled by simply enabling or disabling the profile_import module.

2.2.2 FOAF³

Having a modular content management system with an abundance of well-written, user-contributed modules reduces the workload for website administrators. Unfortunately, Origo Home [42] is not compatible with Drupal’s user profile modules. This is an area of possible future work for Origo, where interoperability could be improved by using front-end caching while porting the code base to Drupal 6.

The Origo user profile fields and friend connections are retrieved from the back-end through the Origo API and are then presented to the client in a FOAF RDF file. The following improvements have been made in the implementation.

¹ The parser is not very lenient, which is why the tidy proxy is required
² JQuery 1.0.5 is used in Drupal 5
³ Friend of a Friend, cf. 1.2.1.1

• Optional OpenID visibility: The user may choose to have his OpenID exported. See section 1.2.1.2 for reasons to do so.
• Optional hashed e-mail (mbox_sha1sum) export. (cf. 1.2.1.1)
• Optional references to other FOAF files of the same user.
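For the hashed e-mail export, FOAF defines mbox_sha1sum as the SHA-1 hash of the mailbox URI including the `mailto:` scheme. The following Python sketch shows the computation; the function name is chosen for this example, and stripping surrounding whitespace is an assumption about input handling rather than part of the FOAF definition.

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """Return the FOAF mbox_sha1sum: the SHA-1 hex digest of the mailto: URI."""
    uri = "mailto:" + email.strip()
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()
```

Two FOAF files that carry the same mbox_sha1sum can be recognized as describing the same person without either file revealing the address in clear text.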

2.2.3 OpenID

Drupal’s OpenID¹ implementation proved to be a good starting point for getting OpenID support on Origo; only cosmetic changes² were made to the actual module. Integrating it with Origo required an additional module called origo_openid.

OpenID is designed for use on the web and hence cannot be used to authenticate users via other interfaces (Origo API, SVN, FTP etc.). It is possible to simply redirect the credentials to the OpenID provider and authenticate the user with the provider, however this is discouraged for obvious reasons. An untrustworthy relying party could simply save those redirected OpenID credentials and log into any service that the particular user has authenticated his OpenIDs with.

2.2.4 SIOC

Integrating SIOC³ with Origo was very easy and quick, thanks to the SIOC module for Drupal. Blog posts and forum entries are now automatically made available in SIOC, which effectively establishes a semantic connection from a post to a user.

¹ cf. 1.2.1.2
² These JavaScript changes are present in the Drupal 6 OpenID module release, but not in the Drupal 5 release that Origo is using
³ Semantically-Interlinked Online Communities, cf. 1.2.1.5

2.2.5 DOAP

There are no open-source implementations for DOAP, so the generation had to be done manually. The very brief RDF schema allowed an implementation in a short time. There exists no real documentation other than the schema, so some details were not exactly clear. For example, the developers intended the license tag to contain a reference URL to an RDF schema or official web page for that license, which is too limiting in my opinion. Instead of adding the URL as an attribute, I added the license name as a string, which still validates against the schema.
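How such a manual generation might look is sketched below in Python with the standard XML library. This is not the Origo module’s code: the function `build_doap` and the chosen subset of properties are assumptions for illustration. The license is written as a plain string, as described above, rather than as a resource URL.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DOAP_NS = "http://usefulinc.com/ns/doap#"

# Use the conventional prefixes in the serialized output.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("doap", DOAP_NS)


def build_doap(name, description, license_name, languages):
    """Serialize a minimal DOAP project description as RDF/XML."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    project = ET.SubElement(root, f"{{{DOAP_NS}}}Project")
    ET.SubElement(project, f"{{{DOAP_NS}}}name").text = name
    ET.SubElement(project, f"{{{DOAP_NS}}}description").text = description
    # License stored as a literal name instead of an rdf:resource URL.
    ET.SubElement(project, f"{{{DOAP_NS}}}license").text = license_name
    for language in languages:
        ET.SubElement(project, f"{{{DOAP_NS}}}programming-language").text = language
    return ET.tostring(root, encoding="unicode")
```

Because the fields come from a predefined set, the same function can be fed directly from the project information settings without any free-text parsing.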

2.2.6 Future Work

There are many social networking and data portability features that could be added to the Origo platform; however, this is not the primary objective for a software repository like Origo. There are more important issues to be solved and enhancements to be added, since social networking portability is hardly a reason to choose one repository over another.

Adding OAuth (cf. 1.2.1.4) API authentication would certainly be a great extension to Origo; however, the current access control system should be reconsidered and refactored while doing so. Basically, the idea of having keys for logging in, granting access rights to a subset of API calls, is what Origo suggests for external API authentication. Login via a simple key (not using public key authentication) is already implemented, but at the moment access is granted to the complete set of API calls, including retrieving the user’s password and e-mail address. This renders the whole concept useless; even worse, it gives API clients a false sense of security. The implementation should allow users to generate a set of different keys, each with its own access control list (ACL), granting the key only a subset of the user’s actual access rights. For example, an administrator of a project may generate a key for inviting new users to the project, but not for viewing his private data.
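The proposed key scheme can be sketched as follows. This is a minimal Python illustration of the idea, not an existing Origo feature: the classes `ApiKey` and `KeyStore` and the call name `project.invite_user` are hypothetical.

```python
import secrets
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ApiKey:
    """A login key restricted to an explicit list of API calls."""
    owner: str
    allowed_calls: frozenset
    token: str = field(default_factory=lambda: secrets.token_urlsafe(32))


class KeyStore:
    def __init__(self, user_rights):
        # user_rights: user name -> set of API calls the user may invoke.
        self._user_rights = user_rights
        self._keys = {}

    def generate_key(self, owner, calls):
        """Create a key whose ACL is a subset of the owner's own rights."""
        requested = frozenset(calls)
        if not requested <= self._user_rights.get(owner, set()):
            raise PermissionError("a key may not exceed the owner's own rights")
        key = ApiKey(owner, requested)
        self._keys[key.token] = key
        return key.token

    def authorize(self, token, call_name):
        """Allow a call only if the key exists and its ACL lists the call."""
        key = self._keys.get(token)
        return key is not None and call_name in key.allowed_calls
```

An administrator could then hand out a token created with `generate_key("admin", ["project.invite_user"])`; a client holding it can invite users but is refused every other call, including those that expose private data.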

2.3 Origo Enhancements

During my master thesis I added some functionality to Origo which was not directly related to my primary objective. The following list is comprehensive, yet incomplete. Please refer to my personal Origo wiki page¹ for a more complete list of my work.

2.3.1 Tooltips

While writing the FOAF and OpenID modules there was a need for a way to explain what certain settings do. Adding large texts to the web forms would have bloated up the pages and the important parts – the actual settings – would have been lost in huge patches of text. JavaScript tooltips seemed to be an optimal solution, providing the required information in a short and concise form, leading the user to a wiki page with extended information by the click of a mouse button. Users who do not have JavaScript enabled will simply see the well-known question mark icon and be able to visit the help page without experiencing any inconvenience due to the lack of tooltips.

¹ http://www.origo.ethz.ch/wiki/uelis_workitems

Figure 2.2: Tooltips using JTip

My personal preference for non-moving tooltips and the restriction to JQuery as our JavaScript library led me to JTip [52]. JTip requests the content from text resources through asynchronous JavaScript requests and then displays it in the tooltip. To meet our requirements, I added some functionality to JTip while remaining 100% backwards compatible. The improvements make it possible to bind tooltip content (the actual tooltip text) to any HTML tag, not just <a> tags. No additional JavaScript is required to set up a tooltip and no extra page requests are needed anymore, because the tooltip content is already in the HTML source, though not visible to the viewer.

Listing 2.1: HTML code for a JTip tooltip

[Listing markup lost in extraction. Recoverable text: the trigger element reads “move the mouse over here!” and the tooltip element reads “The Tooltip text goes here.”]

Note the IDs of the elements – JTip automatically binds the element with the class jTip_element_XXToolTipID1 to the tooltip element with the ID XXToolTipID1.

2.3.2 Password Recovery and Account Activation

These were more important open tasks for general Origo development. The password recovery has been improved and e-mail verification is now required for new accounts to be activated.

Before, anybody could enter a username into the password recovery field, resulting in the reset of the particular user’s password, an unsatisfactory situation.

A use case of the new password recovery system:

1. The user enters his username or e-mail address into the password request field and presses the submit button.
2. An e-mail is sent to that user, containing an authentication URL. This URL is valid for 7 days and may be requested at any time. No data about this link has to be saved in the back-end, because the link itself contains enough information to verify its authenticity.
3. Whoever enters a valid password request link has proven access to the user’s associated e-mail address and is therefore able to set a new password.
4. After entering a new password, the user is automatically logged in to Origo.
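The text does not state how the link verifies itself. One common construction that needs no server-side state is a keyed hash (HMAC) over the user name and an expiry time, sketched below in Python. The secret, the token layout and the function names are assumptions for this illustration and do not describe Origo’s actual link format.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-server-side-secret"  # assumption: read from configuration
VALIDITY_SECONDS = 7 * 24 * 60 * 60  # links are valid for 7 days


def _sign(username: str, expires: int) -> str:
    message = f"{username}|{expires}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_token(username: str, now=None) -> str:
    """Build the token that is embedded in the e-mailed URL."""
    issued = int(time.time() if now is None else now)
    expires = issued + VALIDITY_SECONDS
    return f"{username}:{expires}:{_sign(username, expires)}"


def verify_token(token: str, now=None):
    """Return the user name if the token is authentic and unexpired, else None."""
    try:
        username, expires_text, signature = token.rsplit(":", 2)
        expires = int(expires_text)
    except ValueError:
        return None
    current = time.time() if now is None else now
    if current > expires:
        return None
    expected = _sign(username, expires)
    # Constant-time comparison avoids leaking how many characters matched.
    if hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return username
    return None
```

Since the signature can only be produced with the server-side secret, a visitor cannot forge a link for another user name or extend the expiry time, and nothing has to be stored until the new password is actually set.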

When creating a new account, an e-mail address has to be entered, which is now actually verified before the account can be used. The authentication links for this e-mail verification are very similar to the links used in the password recovery mechanism. New accounts are purged if the e-mail address is not verified within a week.

2.3.3 Project Information

Project information and specification are very important, informing the visitor or user about the software product. Origo has a general wiki page, where the maintainers of the projects can add general information, which is rarely done.

Having a consistent presentation of general information such as licensing, description, programming languages or supported operating systems encourages the maintainers to take the time to enter the data. The meaning and semantics of the provided data are known because of the predefined set of fields, and can be preserved by exporting the information in DOAP (cf. 1.2.1.6, 2.2.5), an RDF [4] vocabulary.

Figure 2.3: Project information settings page, JavaScript disabled

One of the decisions that had to be made was how to let the users select the fields, such as licensing, programming languages used in the project, and operating systems the software runs on. Figure 2.3 shows what the page looks like for users that do not have JavaScript enabled. Consider that a very large portion of Origo users don’t actually deactivate JavaScript in their browsers, so it’s not a big issue that users without JavaScript are not presented with lists of possible selections. All others are rewarded with very extensive lists of licenses, programming languages and operating systems.

Figure 2.4: JavaScript enhanced project information settings

Predefined lists help keep the entries consistent, so they can be used or converted later. This prevents, or at least reduces, dirty data right from the beginning. The selection drop-downs can be converted into text fields for values that are not contained in the predefined drop-downs, relieving the Origo team of having to maintain them all too often.

Figure 2.5: JavaScript enhanced project information settings

2.3.4 Origo Maintenance and Minor Enhancements

The following incomplete list contains minor enhancements and maintenance work done for Origo. It is ordered from oldest to newest.

Added icons to the user profiles: There are three new icons for the user profiles: one for vCard, one indicating the use of microformats and one linking to the user’s FOAF file.

Remove hard-coded references to origo.ethz.ch: All hard-coded URL references to origo.ethz.ch were removed and are now read from a configuration file. Our long-term goal is to have Origo available as a portable open-source software-repository solution, so that it can be set up easily on servers other than ETH’s own. This is a must-have for hosting confidential closed-source products, since most commercial closed-source software producers cannot let external people view the source code.

Unambiguous profile URLs: All profile URLs have been set to lower case.

Security issue with encryption key in Origo’s source code: Origo’s session handling mechanism uses a key for encrypting the cookies. This key was saved in Origo’s source code, which is freely available for everybody to view. The implication is that anybody could decrypt a cookie and read the cleartext password from it. The issue was solved by reading the secret key from a configuration file which is changed before being deployed onto the live Origo servers.

User icon generation: User icons are now resized to widths of both 100 and 200 pixels, using an ImageMagick [53] script.

Platform auto-selector: When releasing many files, selecting the platforms for each and every file can be very cumbersome. This JavaScript helps automate the task by matching the file names to possible platform selections.

Storage node debugging: Eiffel’s database interface EiffelStore has proven to be erratic at times. Breaking a foreign key constraint caused the storage node to omit some results, leading to many contract violations and test suite failures for several days. The errors and warnings in the log files were either nonexistent or inconclusive. Only direct debugging of the storage node revealed the cause.

Refactoring Origo Home: I started to refactor the origo_home module by loading the included files only when they are actually required. This should be done for the rest of Origo Home as well.

Cross-Site Scripting: I detected and fixed a Cross-Site Scripting vulnerability in Origo Home, i.e. an arbitrary script could be inserted into a profile page. In the long run we should prevent such incidents directly from the API node. This security hole could be misused to steal users’ passwords, because the cookie can be read. The cookie then allows the script to retrieve the password and e-mail address by using the Origo session key and calling the XMLRPC API directly. Related to this subject is the earlier stated password recovery issue in section 2.2.6.
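The usual defence against the Cross-Site Scripting issue described in the last item is to escape user-supplied text before it is written into a page. The text does not describe how the Origo fix was made; the following Python sketch only illustrates the principle, and the function `render_profile_field` is hypothetical.

```python
import html


def render_profile_field(label: str, value: str) -> str:
    """Render one profile row, escaping user-supplied text before it reaches the page."""
    # html.escape converts <, >, & and quotes into entities, so an injected
    # <script> element is displayed as text instead of being executed.
    return f"<tr><th>{html.escape(label)}</th><td>{html.escape(value)}</td></tr>"
```

Escaping at output time protects a single page; validating or escaping in the API node, as suggested above, would protect every front-end that consumes the data.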

2.4 Origo Issues

While working on Origo I came across some issues that should be addressed in the near future. I present some of the problems along with possible solutions.

2.4.1 Latency

Some Origo users mentioned that the web interface seemed to be a bit slow or sluggish. Origo was designed to be scalable and fast, yet there is room for improvement in the implementation in terms of latency.

Figure 2.6: Origo web interface sequence diagram

As can be seen, an API call requires at least 9 messages to be sent in sequence from one node to another. For five API calls on one single page, not uncommon at all, there are at least 47 messages being sent around sequentially. This appears to be a lot, but still, the setup manages to deliver the pages at acceptable, yet improvable delays. After all this, the web server (“Front End” in figure 2.6) can start generating the HTML file and then send it to the browser. At this point, the browser has only received the HTML file and then starts fetching all the other files – images, CSS files and JavaScript files – required to fully display the actual page.

There are many ways to improve latency; some of them are listed here. These optimizations may be combined for even better results.

File aggregation: External files should be aggregated into one single file. Origo Home currently requests up to 9 CSS files and 5 JavaScript files. This should be reduced to one of each, either by enabling Drupal’s aggregation functionality or by merging the files manually.

API call aggregation: The basic idea is to aggregate multiple API calls into one single call, effectively reducing the number of required API calls per page to one. The call is then disassembled at the target node, which for most calls is the storage node. In case of multiple target nodes, the core node must separate the calls and dispatch the newly formed aggregations to the target nodes. If one call depends on the result of another call, they should be sent sequentially. Aggregation is certainly also possible for dependent calls, but this would immensely complicate the code.

Asynchronous API calls: The API calls can be sent asynchronously to the back-end. This improves the latency significantly, even though the total number of messages is not reduced. In a scalable P2P system, the total number of messages is not very relevant. What holds for API call aggregation also holds for asynchronous calls: API calls that depend on the answers of previous calls have to wait for their completion.

Authentication delay: If authentication is done on the same node as the procedure call itself, authentication can be delayed until right before the procedure call, to avoid the need for two extra messages.

Client side API calls: Some calls can be initiated from the browser, by using asynchronous JavaScript to call the XMLRPC interface. I do not generally recommend this option, but it is viable for unimportant information, such as a user’s bookmarks on his profile page.

Faster messaging: Speed tests should be conducted to see where the most time is spent. Surely, there are solutions for node-to-node messaging, queuing and serialization more suitable than how it is done in the Goanna Framework [38].

Prefetching: Instead of waiting for the results of sequential and dependent calls, the subsequent call might be requested even though it is not certain whether the data will ever be used at all.

Caching: Since most of Origo’s data rarely changes, it could be cached on the web front-end and invalidated from the back-end as soon as it has been updated. This however defeats the whole idea of having a P2P back-end, so it is not desirable for Origo.
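The effect of asynchronous API calls from the list above can be illustrated with a short Python sketch. The function `api_call` is a stand-in that merely simulates a round trip to the back-end; it and the call names other than `user.retrieve_information` are assumptions for this example.

```python
import asyncio


async def api_call(name: str, delay: float = 0.01) -> str:
    """Stand-in for one API round trip; the delay simulates back-end latency."""
    await asyncio.sleep(delay)
    return f"result of {name}"


async def load_page(independent_calls, dependent_call):
    # Independent calls are issued concurrently, so the page waits roughly
    # for the slowest call instead of the sum of all round trips.
    results = await asyncio.gather(*(api_call(name) for name in independent_calls))
    # A call that depends on earlier answers still has to wait for them.
    final = await api_call(f"{dependent_call}({len(results)} inputs)")
    return list(results), final


if __name__ == "__main__":
    page = asyncio.run(
        load_page(["user.retrieve_information", "project.list"], "workitem.list")
    )
    print(page)
```

The number of messages is the same as in the sequential case; only the waiting time overlaps, which matches the observation that message count matters less than latency in a scalable P2P system.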

2.4.2 Application Programming Interface Development

Adding new API calls to Origo is a very tedious and repetitive task. With all the copying and pasting, mistakes are very likely to happen, and in such a case it can take several hours to add a simple API call. An IDL¹ compiler could be used to automatically generate all the necessary Eiffel code.

Adding general retrieve and change functions (see the API [33] calls user.retrieve_information and user.change_information) is definitely a step in the right direction.

¹ Interface Definition Language

2.4.3 Scalability and Redundancy

The current setup and implementation do not allow for scaling of storage nodes yet, though the architecture of Origo makes adding this a fairly easy task, should the storage node become the bottleneck. Data that is hardly ever updated or doesn’t require real-time updates can be replicated onto different storage nodes; only the updates need to be synchronized. Project specific data can be horizontally partitioned, for example by using a hash function on the project id, implicitly leading to load balancing of the storage nodes.
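A minimal sketch of such a partitioning scheme in Python. The node names and the function `storage_node_for` are hypothetical; the text does not prescribe a particular hash function, so SHA-1 is used here only because it is stable across processes.

```python
import hashlib


def storage_node_for(project_id: int, nodes: list) -> str:
    """Pick the storage node responsible for a project by hashing its id."""
    digest = hashlib.sha1(str(project_id).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and map it onto the node list.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Every node that knows the list of storage nodes computes the same answer without coordination. A plain modulo reassigns many projects when a node is added or removed; a consistent hashing scheme would reduce that movement at the cost of a more involved implementation.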

A large number of not-so-stable nodes is better than a small number of highly stable nodes, as the success of Google’s architecture has proven. Hardware failures happen all the time in large networks; hence, establishing an automated fallback solution is a good move.

2.5 Concluding Remarks

In the practical part of this thesis, Origo’s data was made accessible and machine readable in the major semantic web and internet technologies. Substantial contributions were made to both the back-end and the front-end of Origo, which required in-depth knowledge of the entire system.

The original plan was to implement not only a profile import mechanism, but also an automated friend finding mechanism. The latter was dropped due to time constraints and because the project information and specifications were considered to be of higher priority for the Origo platform.

The initial orientation and familiarization with the Origo system and code, and the installation, demand quite some effort and time. Developers new to Origo should not let themselves get frustrated by this. They are encouraged to read all relevant wiki pages and documentation before starting, and, most importantly, to consult with other current or former developers in case of questions.

Bibliography

[1] Origo.

[2] Data Portability.

[3] Friend of a Friend (FOAF).

[4] Resource Description Framework (RDF).

[5] Extensible Markup Language (XML).

[6] Advogato.

[7] Raph Levien: Attack Resistant Trust Metrics, 2004

[8] OpenID.

[9] OpenID Provider Authentication Policy Extension (PAPE).

[10] Microformats.

[11] Xhtml Friends Network (XFN).

[12] vCard.

[13] hCard.

[14] OAuth.

[15] Semantically-Interlinked Online Communities (SIOC).

[16] SPARQL Query Language for RDF.

[17] Description of a Project (DOAP).

[18] HowTo RSS Feed State.

[19] MySpace Developer Team. Data Availability has Arrived!, 2008-06-26

[20] MySpace Developer Terms of Service, as of 2008-06-26

[21] Facebook Developer Terms of Service, as of 2008-06-26

[22] Facebook, Announcing Facebook Connect, 2008-05-09

[23] Google, A friend connected web, 2008-05-12

[24] Google, Social Graph API.

[25] SocialGraph Node Mapper.

[26] Gravatar.

[27] Friendfeed.

[28] Plaxo.

[29] Proofile.org

[30] Windows Live ID, formerly known as Microsoft Passport.

[31] Extensible Messaging and Presence Protocol (XMPP).

[32] Till G. Bay: Hosting distributed software projects: concepts, framework, and the Origo experience, 2008

[33] Origo Application Programming Interface.

[34] XMLRPC specification, October 1999.

[35] Patrick Ruckstuhl: Origo Core: Middleware and Controller for Origo, July 2007

[36] JXTA Project.

[37] Beat Strasser: Vampeer, JXTA Implementation for Eiffel, March 2007

[38] Goanna, The Eiffel Web Application Framework.

[39] Subversion.

[40] MySQL.

[41] Drupal.

[42] Peter Wyss: Origo Home: Web Interface Design and Work for Origo, July 2007

[43] Brad Fitzpatrick: Thoughts on the Social Graph.

[44] Marco Ziezling: Origo Communities and General Work on the Platform, January 2008

[45] PHP Profile Importer.

[46] hKit.

[47] PHP.

[48] JQuery.

[49] SimpleModal, a JQuery Plugin.

[50] JavaScript Object Notation (JSON).

[51] W3 Tidy Proxy Service.

[52] JTip, a JQuery Tooltip.

[53] ImageMagick.

List of Figures

1.1 OpenID sequence diagram

2.1 The profile importer on an Origo profile page
2.2 Tooltips using JTip
2.3 Project information settings page, JavaScript disabled
2.4 JavaScript enhanced project information settings
2.5 JavaScript enhanced project information settings
2.6 Origo web interface sequence diagram

List of Listings

1.1 A sample FOAF file

1.2 OpenID Provider delegation

2.1 HTML code for a JTip tooltip