Final Report

Bachelor project: Simlike platform

July 11, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Commission: Nathan Navarro, Peter van Nieuwenhuizen, Willem Paul Brinkman

Bachelor students: Joris Albeda (1514172), Jeroen Dijkhuizen (1521950), Joey Ezechiëls (1338994), Volker Lanting (1513273)


Preface

This is the final report of the bachelor project of Computer Science at the Delft University of Technology. During the bachelor project a product called Simlike was made for Nerval Limited. It is a concept for a new social media platform. More information about the product can be found in the introduction and Attachment K.

The report is written for people who are interested in the development of a new social media platform for the internet and internet applications. Future developers of Simlike can find most information in Section Recommendations and Attachments J and L.

The bachelor project team consists of the following persons:

● Joris Albeda, Human-Machine-Interaction.
● Jeroen Dijkhuizen, planning and external communication.
● Joey Ezechiëls, code quality manager.
● Volker Lanting, work flow and priorities manager.

The following people are also involved in realising Simlike:

● Nathan Navarro, primary contact of the company.
● Paris Hidden, graphical design.
● Erwin Schuur, marketing and sales.

The project team would like to thank the following people:

Drs. Peter van Nieuwenhuizen of the Delft University of Technology and Nathan Navarro of Nerval Limited for their advice, guidance and supervision of this bachelor project, Paris Hidden for the graphical design and mock-ups, Ir. Pascal Wiggers, also from the Delft University of Technology, for his advice on machine learning and clustering algorithms, and last but definitely not least, Hendrika Vreugdenhil for supplying us with a steady stream of candy throughout the project.


Table of contents

Summary
Introduction
  About Nerval Limited
  Current situation
  Project description
  Contents of the report
Orientation
  Requirements
  Plan of approach
  Preliminary study
  Research study
Setting up the development environment
  Tools
  Process
Design
  Functional design
  Technical Design
    Foundation level
    Simlike application level
Implementation
  Sprint 1
  Sprint 2
  Sprint 3
  Sprint 4
  Sprint 5
  Sprint 6
  Sprint 7
  Sprint 8
  Sprint 9
  Sprint 10
  Final sprint
Quality assurance
  Security
  Law and licenses
  Availability/Reliability
  Expandability/Maintainability
  Test environment
Evaluation
  Development environment
  Progress
  Collaboration
  Quality assurance
  Final result
Recommendations
  Improvements
  Quality assurance
  Security
  New features
  Algorithm platform
  Finer grained authorisation
  Collaboration
  Development process
  Continuation of Simlike
References
Attachments
  Attachment A - Time schedule
  Attachment B - Plan of approach
  Attachment C - Requirements analysis
  Attachment D - Functional Design Document
  Attachment E - Technical Design Document
  Attachment F1 - Progress Report Sprint #1
  Attachment F2 - Progress Report Sprint #2
  Attachment F3 - Progress Report Sprint #3
  Attachment F4 - Progress Report Sprint #4
  Attachment F5 - Progress Report Sprint #5
  Attachment F6 - Progress Report Sprint #6
  Attachment F7 - Progress Report Sprint #7
  Attachment F8 - Progress Report Sprint #8
  Attachment F9 - Progress Report Sprint #9
  Attachment F10 - Progress Report Sprint #10
  Attachment G - Definitions
  Attachment H - Security orientation & design guidelines
  Attachment I - Preliminary study
  Attachment J - Nerval Limited Code Style
  Attachment K - Project assignment
  Attachment L - Bachelor project Code analysis
  Attachment M - Study report


Summary

This is the final report for the bachelor project of the study Computer Science at the Delft University of Technology. This report contains an overview of the project and reflects upon the process. It also discusses some recommendations for continuation of the created product. The contents of the report are summarized in this section.

The project assignment

During the bachelor project a prototype of a social media platform called Simlike was developed for the company Nerval Limited. Nerval Limited is an internet start-up with no prior development experience. Therefore a development environment had to be set up as well.

Simlike focuses on making new friends and on the importance of real-life relations. The main strength of Simlike is its matching function, which can match people with similar interests to provide useful suggestions for new relations. The goal for the bachelor project was to make a working version of Simlike as a Facebook application. This way, the user base of Facebook can be used to launch Simlike. A rudimentary mobile application was also to be created to take advantage of the location-based features of smartphones.

The process

Researching and setting up a new development environment which could be reused for future development projects at Nerval Limited took more time than expected. In the end a lot of tools were selected and installed, ranging from code analysis and test tools to planning and project management tools. Also, a code style has been made with corresponding configuration for the code style checking tools. The idea is that all these tools can be reused by new developers.

During the development process, time was a big issue. The product as we had envisioned turned out to be too big to be completed successfully within the given time frame. The graphical user interface of the mobile application was left unfinished to complete more features of the Facebook application.

In the end a professional development environment was delivered along with a functional Facebook application. The mobile application was left unfinished, but serves as a good proof of concept. All deliverables were delivered, but two must-have features were not completed.


Introduction

This document describes the result of 11 weeks of effort by 4 persons (44 weeks full time equivalent), which resulted in a prototype for a product of the company Nerval Limited. The contents of this document are intended for developers who wish to continue the project, as well as for supervisors to get an indication of how far the project has progressed.

About Nerval Limited

The client of the assignment is Nerval Limited, an internet company founded in March 2011 which focuses on innovative social media products. Nerval Limited holds intellectual property for future social applications for social media.

Current situation

Nerval Limited does not have an internal development department, but wishes to develop its new product in-house. For this purpose, students of the TU Delft are invited to develop the new product as a bachelor project at Nerval Limited. Since there are no existing development guidelines, the students will have to set up all the development processes and the key performance indicators which secure these processes.

Project description

The assignment entails the construction of a new social platform called Simlike, which is coupled with Facebook. Facebook was chosen because of its huge user base: 700 million people around the world use Facebook. Simlike must be designed for high volume traffic, as well as high volume data processing. This platform will be available to end-users as a Facebook application and a mobile application.

Simlike offers a new view on online social interactions. Where other social media mainly focus on managing real-life relations online, Simlike will focus on facilitating new real-life relations. When the product is launched, people will be able to meet new people through Simlike, based on similar interests, and Simlike will help move these relations from online to real life. An ambitious goal has been set to realise and launch the product in August 2011.


Contents of the report

This report describes and evaluates the development process of Simlike and provides recommendations for future development. The remainder of the report is organised as follows.

In Section Orientation the orientation phase of the project is discussed. In this phase research was done to find suitable tools, techniques and external libraries to use for the development of Simlike.

The setup of the development environment from scratch is discussed in Section Setting up the development environment. This development environment can be used for future work on Simlike as well.

The design and implementation of the product are discussed in Sections Design and Implementation respectively. These sections can be used by future developers of Simlike as an introduction to the current working of the system.

In Section Evaluation the development process is evaluated. The conclusions of this section are listed as recommendations for future development of the Simlike platform in Section Recommendations.


Orientation

The starting phase of the project is the Orientation phase. In this phase, preparations were made before starting the implementation. This was necessary for both the client and the project team to get an idea of what was needed and to determine an efficient approach. Firstly, the project assignment and the client’s requirements were analysed. After that, a plan of approach was presented to document the expectations and agreements between the students and the client. A preliminary study was performed to gain insight into the scope of the domain. Another research study was performed to compare the existing solutions that were identified in the preliminary study.

Requirements

The requirements specified by the client have been analysed and documented in the Requirements analysis report. This report is included as Attachment C. The client has many plans for features to be added to the system after its first release. Since the full list of features is much longer than the project allows for, the team and the client have ordered the requirements by importance. The team used the MoSCoW method for this. This method is intended for reaching a consensus between stakeholders. MoSCoW classifies requirements into the following categories:

● M: Must have this.
● S: Should have this, if at all possible.
● C: Could have this, if it does not affect anything else.
● W: Won’t have this this time, but would like to have in the future.

The most important features to be implemented, the ‘must haves’, are as follows:

● Matching users: a matching algorithm that can match users, based on characteristics that are distinctive for a friendship. This is one of the Unique Selling Points (USPs) of Simlike.
● Matching one-on-one: an algorithm that, given two users, determines what they have in common.
● Search for users: this is another USP of Simlike. Not only is it possible to receive suggestions based on your own interests, it must also be possible to receive suggestions based on interests entered in a search form.
● Chat functionality
  ○ photo sharing: let users share their Facebook photo albums in the chat.
  ○ interest sharing: let users share their interests in the chat.
  ○ places sharing: let users share the places they have visited in the chat.
● Private messaging: functionality to send private messages between users.
● Top interests: a mechanism for users to display their favourite interests. This feature is used by the algorithms, in which top interests carry more weight than normal interests. This is also a USP of Simlike.


The following non-feature requirements were also added as must haves:

● Mobile App: a mobile app which acts as a portal to Simlike.
● Security: the platform must be secure in terms of both user experience and data handling.
● Activity monitoring: users must be able to determine whether a user is online.
● Usability: Simlike must be easy to use for all users, including inexperienced users.
● Responsive: Simlike must respond quickly enough to give a real-time experience.
● Reliable: the platform uptime must be reliable.
● Scalable: the platform must be easily scalable.
● Chat - multiplicity: the chat feature must be designed to support at least 50 people in one chat.
● Expandable: adding new features must be relatively easy.

Plan of approach

The plan of approach acts as a contract between the project team and Nerval Limited; it describes the course of action and the deliverables. The plan of approach is included as Attachment B. The following deliverables were agreed upon by both parties:

1. Preliminary study, a small study to familiarise the team with the project domain.
2. Plan of approach, describes the expectations and agreements between the team and the client.
3. Weekly progress reports.
4. Final report, a report which describes the entire project to bring other readers up to date with the project.
5. Technical Design Document, describing the technical design of Simlike in detail.
6. Presentation, given at the end of the project at the TU Delft.
7. Website, a small website online at www.simlike.com which redirects visitors to the Facebook application.
8. Facebook Application, a working prototype of the implementation.
9. Test environment, a working test environment for future developers.

The following phases were planned; the exact delivery dates for the deliverables can be found in Attachment A:

Study phase (18 April - 24 April) with deliverables:
● Preliminary study
● Plan of approach

Global development phase (25 April - 9 July) with deliverables:
● Weekly progress reports
● Final report
● Technical Design Document
● Website
● Facebook application
● Test environment

Final phase (11 July - 15 July) with deliverables:
● Presentation

Preliminary study

A very important and demanding part of the Orientation phase was to lay a solid architectural foundation for Simlike. Since there have been no prior projects at Nerval Limited, the tools, environment and languages to use were not predefined. Therefore, a study was required to determine what was available. Areas in which candidates were identified are hosting platforms, database solutions, programming languages, software tools and development methodologies. The results of this study were used in a subsequent research study, the results of which are described in the next section. The preliminary study is included as Attachment I.

Research study

The preliminary study identified some tools, external software and services that could be used for this project. To make a well informed decision, more research had to be done and the tools, services and software had to be compared to each other. This research and comparison was done in the research study. The research study was also done to be able to design a proper architecture for the Simlike platform. The complete study may be found in Attachment M. A brief overview of the results of the study will be given in this section.

The architecture is as follows. It is split up into two areas:

1. Simlike: this area consists of the services which are visible to the users of Simlike and to third party developers. These services are:
   a. Mobile application
   b. Facebook application
   c. Website
   d. Background services (e.g. email notifications)
   e. Application Programming Interface, a central component which is responsible for communication between the other components.
2. Foundation: this area contains the components which are required by Simlike, but are not directly accessible by the outside world. The architectural foundation is composed of four parts:
   a. Hardware platform - the hardware on which the other components are hosted.
   b. Web platform - this component is responsible for connecting everything to the internet so that it is accessible from all around the world.
   c. Database platform - provides secure data storage and data access.
   d. Algorithm platform - responsible for executing the matching between users.


Figure 1. Overview of the Simlike architecture.

The architecture is shown in Figure 1. The final choices for the software and services used to run the foundational platforms are as follows:

● Amazon Web Services Elastic Compute Cloud (AWS EC2) is used as the hardware platform [AWS]. The servers hosted at EC2 are virtual private servers (VPS). The main advantage is that servers can be added and replaced easily (literally with a click of a button). EC2 also has an attractive cost model, in which billing is based on the resources that are actually used. This makes it an attractive, scalable solution from both an economic and a technical perspective.

● Apache Cassandra is used as the database platform [CAS]. It was chosen for its scaling abilities and real time performance. Another important aspect of Apache Cassandra is that it does not have a single point of failure.

● DECLARED CLASSIFIED BY NERVAL LIMITED

● A combination of AWS Elastic Beanstalk (AEB) and Amazon CloudFront is used as the web platform [AEB, ACF]. These services are specialised in high availability, high reliability and scalability.


Setting up the development environment

This section elaborates on the development environment that was set up and used during the bachelor project. The development environment had to be set up by the team, as there had not been any prior development at Nerval Limited. Such an environment is required by the developers. It allows for a structured and faster development process, which can greatly improve the quality of the product.

First, the tools that were chosen for this project are listed. These tools were not selected randomly. Some were chosen because of positive experiences of some of the team members. Others were found after research and comparison with other, similar tools. This research and its results can be found in the study report (Attachment M). Second, the development process itself is described.

Tools

The development environment of the project team consists of no fewer than 27 different programs, not counting the tools that were used but later discarded. These are the tools that are still in use at the end of the project:

● Maven: a very versatile build tool that is capable of performing a variety of supporting tasks. The functionality is provided by plugins. Examples of such tasks are generating Javadocs in HTML and starting a web server that can be used to test HTML and JavaScript code.
● Eclipse with a number of plugins, such as m2e (a Maven plugin for Eclipse) and EclEmma, Checkstyle, PMD and the Metrics plugin (plugins for code analysis).
● The Mobl toolchain, which includes an Eclipse plugin and a command-line compiler.
● Mozilla Firefox with the Firebug add-on and Google Chrome, to develop and debug HTML/JavaScript/CSS code.
● Sonar: a tool for static code analysis which stores its data in a MySQL database.
● The application server Apache Tomcat and the web servers Jetty and nginx to host the API and websites.
● ejabberd: a Jabber chat server underlying the Simlike chat feature.
● Ubuntu as test server OS for the Simlike API. The production version of the API is to be hosted on Amazon’s Elastic Beanstalk, so this is a temporary arrangement.
● Apache Cassandra, the distributed database that was used.
● DECLARED CLASSIFIED BY NERVAL LIMITED
● Javadoc: a tool for generating API documentation for methods, constants, classes and packages on the source level.
● Git and GitHub: Git is a powerful distributed revision control system, used to store incremental changes to the Simlike source code. GitHub extends Git by storing a copy of the entire Simlike Git repository on the GitHub servers, which all but guarantees the safe storage of the Simlike source code.
● UmlGraph, for automatic class diagram generation of the Java source code.
● JUnit, a software library for unit tests for Java code.

Process

To develop the product, the team used the Scrum method. The process was divided into multiple time units called “sprints”, and a number of features were planned for each sprint.

The Scrum method defines a number of roles:

● The “Scrum master”: This is the person who manages the sprint, and bears the main responsibility for it. Every sprint can have a different Scrum master.
● The “product owner”: This person represents the interests of the company, if applicable.
● The team: This is the development team.

The idea behind Scrum is that every sprint results in an incremental update to the product, while retaining its “ready for deployment” status. This is in contrast to the Waterfall model, in which a development build is often not ready for deployment, and is often broken. Scrum provides flexibility through its sprints: each sprint typically lasts a relatively short time, meaning that fewer features are developed each sprint. In addition, since requirements can and often do change during development, the team can switch with relative ease to accommodate the new requirements. Indeed, this project team has found this a most useful property of the Scrum development method.

Which features are developed during a sprint is determined at the start of a sprint, during a meeting called a “sprint planning meeting”. These are chosen from a backlog of features, which in turn are derived from the requirements – functional or otherwise – determined by the team and product owner (Nerval Limited).

The sprint planning meeting is intended to be the longest meeting during a sprint. After it has taken place, every working day starts and ends with a short meeting. In the morning meeting the team members tell each other what they plan to do; in the closing meeting they report the progress they have made that day, as well as any difficulties encountered during development. These meetings are intentionally kept short and to the point by encouraging team members to stand.

While not strictly part of the Scrum methodology, the project team has decided to end each sprint with a sprint report in order to be able to learn from past mistakes. This adds some overhead in the form of time, but is well worth the additional effort.


Design

This section explains the design of the Simlike platform. The design describes how the platform is constructed. This is an important part of the realisation of Simlike. It promotes working efficiently between developers and it documents the product, which is required in order to maintain the product.

The design is split up into two parts: a Functional Design and a Technical Design. The functional design describes the design in terms of functional specifications. The technical design describes the design in terms of technical specifications. The technical design may be validated by the functional design.

Functional design

In this section a brief overview of the functional design will be given. The functional design describes the functionality of the system and is based on the requirements specified by the client. It is an important part of the design process, because it is used to validate whether the developers and the client have the same application in mind. After the implementation, it can be used to validate the final product.

The final Facebook app contains the following features:

● Auto matching - The system can match users, based on characteristics that are distinctive for a friendship.
● One-on-one matching - Given two users, the system can discover what they have in common.
● Search matching - The user can search for users based on interests. The results will be matched with the user and sorted by the number of common interests.
● Chat - Users can communicate using a chat. This chat has the following additional features:
  ○ Photo sharing - Users can share photos while having a conversation (photos in which they have been tagged on Facebook). These photos can be scrolled through, so multiple photos can be shared during the chat.
  ○ Interest sharing - Users can share their interests visually (e.g. with a photo of the cover of their favourite book).
  ○ Places sharing - Users can share places (e.g. a Google Street View location of their favourite restaurant).
● Top interests - A user can specify five interests as his top five interests. This means they get higher priority when matching with other users, and they can be used as a simple statement to others about what the user likes.
● Messaging - Users may send each other private messages.
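The search matching feature in the list above can be illustrated with a small sketch. It is written in Python purely for illustration and is not the Simlike implementation: the function name `search_users`, the sample profiles and the rule that a single searched interest is enough to count as a hit are all assumptions.

```python
def search_users(searcher_interests, query, profiles):
    """Find users who have a searched interest and rank them.

    searcher_interests: interests of the user performing the search.
    query: interests entered in the search form.
    profiles: dict mapping a user name to that user's interests
              (hypothetical data layout, without the searcher).
    Returns (user, number_of_common_interests) pairs, best match first.
    """
    own = set(searcher_interests)
    wanted = set(query)
    hits = []
    for user, interests in profiles.items():
        interests = set(interests)
        # Assumption: sharing one searched interest makes the user a hit.
        if wanted & interests:
            hits.append((user, len(own & interests)))
    # Sort by number of interests in common with the searcher, highest
    # first; ties are broken by name to keep the order deterministic.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


profiles = {
    "bob": ["hiking", "football"],
    "dave": ["hiking", "jazz", "chess"],
    "carol": ["cooking"],
}
print(search_users(["jazz", "hiking", "chess"], ["hiking"], profiles))
# prints [('dave', 3), ('bob', 1)]
```

Carol does not appear because she has none of the searched interests; Dave ranks above Bob because he shares more interests with the searching user.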

The mobile app contains the following features, which are also included in the Facebook app:

● Auto matching.
● Search matching.
● Messaging.
● Top Interests.

Additionally, the mobile app contains these features:

● Buddy List - The user can keep and edit a list of friends to access quickly.
● Location - The user can view his own location on a map.

The result of the functional design phase is the Functional Design Document, which can be found in Attachment D. The functionality is described from the viewpoint of the user, in the form of use cases. An example is the use case for searching:

Use Case F2.1
Summary: The user searches for a co-user based on specific interests.
Situation: The user can be in any screen.
Step 1: The user clicks in the search bar.
Step 2: The user types the interests and submits the search.
Step 3: The system switches to the search screen and displays a list of the results.
Step 4: The user can scroll through this list. He selects an outcome that interests him.
Result: The user can view his choice’s profile, and click one of the buttons for some extra actions regarding his choice.

These use cases were set up together with the client, to ensure that the team and the client have the same global idea in mind of how the functionality will work. Use cases have been made for the Must Have requirements and for the Should Have requirements.

In each sprint, a list of use cases was included in the backlog to implement. Apart from a few, all of the use cases for the Must Have requirements have been implemented. Three use cases have not been implemented:

● The original design for the GUI of the Facebook app contained a separate Chat screen. Use cases F3.2 and F3.3 concerned “leaving the chat screen” and “returning to the chat screen”, respectively. In the end, there was not enough time to implement this. The Chat conversations are now draggable windows on the main screen. The separate window can still be implemented in the Facebook application after the project, however.
● Use case F4.1, “Reading a new message”, was also based on a design that did not fit into the scope of the project in the end. The main window was supposed to show a welcome screen which showed the user’s matches, the last users the user had chatted with, and a list of new mails. In the product of this project, the user must open his inbox to view his mails.


To make the use cases more meaningful, Graphical User Interface (GUI) mock-ups have been created to illustrate the expected functionality. To give an impression of how the functional specifications lead to an implemented version, a series of GUI specifications is displayed:

Figure 2 shows the first mock-up of the main screen. It offers an interface for adjusting the “Top likes” (the top section, use cases F5.1 and F5.2), a menu (just below the top likes) and a chat interface (use cases F3.1 to F3.4).

Once the layout and functionality were determined, the mock-ups were redesigned to make them more aesthetically pleasing. The result is visible in figure 3.

As can be seen in the mock-ups, the look of the Facebook application has changed significantly over the project. The functionality, however, has stayed the same. A screenshot from the final application can be found below in figure 4.

The same process was followed for the mobile application. A mock-up of the mobile application is displayed in figure 5. More use cases, mock-ups and more detailed explanations can be found in the Functional Design Document (Attachment D).


Figure 2. The first mock-up of the GUI main screen.


Figure 3. The revised Main Screen mock-up


Figure 4. A screenshot from the actual Facebook application at the end of the project.


Figure 5. A mock-up of the mobile application.


Technical Design

This section describes the general technical design of the Simlike platform. The technical design is an architectural overview of the system, which specifies how the different parts of Simlike cooperate. It also describes how these parts work and what tools or techniques they use. The technical design gives a good overview of the entire system and is therefore one of the most important pieces of documentation for future developers of Simlike. The detailed Technical Design Document is included as Attachment E.

As happens with a lot of complex software projects, the Simlike platform as a whole has been split up into several components, where each component has a well-defined function and interface to communicate with other components. The modules are divided up into two parts: the foundation level and the Simlike application level (see figure 6). These parts will now be discussed in more detail.

Figure 6. Overview of the Simlike architecture.

Foundation level

The foundation entails everything that provides core services to the rest of the Simlike platform as a whole, from more mundane tasks such as data persistence to services such as matching Simlike users. The foundation entails a module for the hardware, one for the database and one for the algorithms used within Simlike.

Hardware platform

The Hardware platform consists of the hardware required to host the Web and Database platforms. The team opted to spend as little time as possible on maintenance of the hardware platform. To guarantee the quality of this platform, maintenance is outsourced to Amazon Web Services (AWS). Outsourcing the hardware to Amazon’s cloud is a strategic choice that allows the developers to focus more on the features of Simlike itself, instead of setting up and maintaining servers in different countries.

All servers will be hosted on the Elastic Compute Cloud (EC2) service. This service provides Virtual Private Servers (VPS) which are extremely scalable and very maintenance friendly. Literally at the press of a button, extra VPSs can be added or replaced, which makes hardware maintenance quick and easy. AWS offers extra services to make the life of the developers easier:

● Load balancing. Automatically spread the load over multiple VPSs so that the service is equally responsive to all users.
● Auto scaling. Automatically add or remove VPSs based on the load of the platform.
● Backups. It is possible to create snapshots of VPSs. This allows a VPS to be restored with the click of a button in case of failure. There is no need to reinstall and reconfigure all the services installed on that VPS.
● World-wide coverage. AWS has data centres spread out over all major continents, making it possible to provide a reliable service to users world-wide.

Other services are also offered by AWS. The services that are of interest to the design of the system are:

● Amazon Elastic MapReduce. Amazon offers out-of-the-box Hadoop MapReduce functionality. It is possible to run a MapReduce job on a cluster of servers, with control over how many servers run the job. There is no need to configure the infrastructure, as Amazon takes care of this.
● Amazon CloudFront. This is a content delivery service, which allows for low latency and high speed data transfer. This is used by the Web platform.
● AWS Elastic Beanstalk. A distribution system for web applications, it can handle capacity provisioning, load balancing, auto-scaling, and application health monitoring automatically. This is used by the Web platform.
● Amazon Simple Storage Service and Amazon Elastic Block Store. Both are scalable persistent storage platforms, particularly of use to the Database and Web platforms.

Database platform

The database is where the user information is stored. All Simlike applications will (indirectly) use the database, so it is important that the database is responsive and the data is secured.

Apache Cassandra will be used as the data store for the database platform. It was chosen for its performance and its completely distributed nature: it has no single point of failure. The considerations for using Cassandra instead of another database system are listed in the Study Report, which is included as Attachment M.

Cassandra has no notion of tables; instead – in a somewhat simplified view – so-called Keyspaces are defined, which essentially act as groups for entities called Column Families. Column Families, in turn, are used to group rows, which can be added and deleted at will. Each row can be – independently of all other rows – furnished with almost any number of columns, at the software architect’s discretion. Finally, it is these columns that contain data.
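The hierarchy described above can be pictured as nested maps. The following is a minimal sketch in Python (the project itself is written in Java and JavaScript). It models only the shape of the data, not Cassandra's client API, and the column family, row keys and column names are hypothetical.

```python
# Illustrative model of one Keyspace: column family -> row -> column.
# All names below are hypothetical; this is not Cassandra's client API.
keyspace = {
    "Users": {                      # a Column Family groups rows
        "user:1": {                 # a row, identified by its row key
            "name": "Alice",
            "city": "Delft",
        },
        "user:2": {                 # rows need not share the same columns
            "name": "Bob",
            "top_like_1": "Tennis",
            "top_like_2": "Films",
        },
    },
}


def get_column(keyspace, column_family, row_key, column, default=None):
    """Look up one column value, returning default if any level is missing."""
    return keyspace.get(column_family, {}).get(row_key, {}).get(column, default)


# Rows can be added and deleted at will, each with its own set of columns.
keyspace["Users"]["user:3"] = {"name": "Carol"}
del keyspace["Users"]["user:1"]
```

Because every row carries its own columns, adding a new attribute to one user does not require a schema change for the others.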

This architecture enables Cassandra to scale excellently. Coupled with Amazon’s EC2 and Elastic Beanstalk, it means that the development team needs relatively little effort to ensure scalability.

Algorithm platform

The main innovative function of Simlike is the way it looks for people who have common interests and facilitates interaction between users. This matching feature requires the use of several algorithms. These algorithms are only addressed broadly here. For a more detailed discussion please consult the Technical Design Document.

One-to-one matching

When given two users, one-to-one matching determines how well they match based on their respective interests. The algorithm is currently used to determine the Facebook likes the two users have in common. However, it can be extended to work on more data, such as location-based data from the mobile application.
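The idea can be sketched in a few lines of Python. The function `common_likes` follows the description above; the scoring rule in `match_score` (Jaccard similarity) is an assumption made for illustration and is not taken from the Technical Design Document.

```python
def common_likes(likes_a, likes_b):
    """Return the likes that two users have in common."""
    return set(likes_a) & set(likes_b)


def match_score(likes_a, likes_b):
    """Jaccard similarity in [0, 1]; an assumed scoring rule, not the report's."""
    a, b = set(likes_a), set(likes_b)
    if not a and not b:
        return 0.0                  # two empty profiles say nothing about a match
    return len(a & b) / len(a | b)
```

For two users who like {Tennis, Films} and {Films, Chess}, the common likes are {Films} and the assumed score is one shared like out of three distinct likes.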

Automatic matching

When a user logs in, Simlike will automatically calculate which users match well with the given user. This process is called automatic matching and is an easy way for users to meet new people.

Search matching

If a user wants to find people with specific interests, a search function is provided. This search matching algorithm takes a list of interests as input and returns an ordered set of users that have the given simlikes and match well with the user who performed the search.
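Automatic matching and search matching differ mainly in which candidates are considered before ranking. The Python sketch below illustrates that relationship under stated assumptions: the in-memory `likes_by_user` mapping, the `similarity` rule (Jaccard) and the `limit` parameter are all hypothetical and do not describe the actual implementation.

```python
def similarity(likes_a, likes_b):
    """Assumed scoring rule: shared likes divided by all distinct likes."""
    union = likes_a | likes_b
    return len(likes_a & likes_b) / len(union) if union else 0.0


def rank_candidates(user, candidates, likes_by_user, limit):
    """Order candidates by descending similarity to user (ties by name)."""
    own = likes_by_user[user]
    ranked = sorted(
        candidates,
        key=lambda other: (-similarity(own, likes_by_user[other]), other),
    )
    return ranked[:limit]


def automatic_matches(user, likes_by_user, limit=10):
    """Automatic matching: every other user is a candidate."""
    candidates = [other for other in likes_by_user if other != user]
    return rank_candidates(user, candidates, likes_by_user, limit)


def search_matches(user, interests, likes_by_user, limit=10):
    """Search matching: only users that have all the given interests."""
    wanted = set(interests)
    candidates = [
        other
        for other, likes in likes_by_user.items()
        if other != user and wanted <= likes
    ]
    return rank_candidates(user, candidates, likes_by_user, limit)
```

Sharing the ranking step between both functions is one way to read the later observation that search matching was finished as a by-product of automatic matching.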

Characteristics allocation

For now Simlike only uses Facebook likes to identify the interests of a user. In Facebook, a “like” is a page that users have liked. This means there can be several likes representing the same actual interest (e.g. ‘Tennis’ and ‘tennis’). To find better matches it would be beneficial to group together all likes that represent the same (general) interest. This functionality will be implemented by the Characteristics allocation algorithm. Given a set of likes, this algorithm clusters these characteristics into larger topics (e.g. clustering ‘movies’ and ‘films’ into the same topic).
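The intended effect can be illustrated with a deliberately simple stand-in. The Python sketch below does not implement a clustering algorithm; it only normalises spelling and consults a small, hypothetical synonym table (`SYNONYMS`) to show what grouping likes into topics would look like.

```python
from collections import defaultdict

# Hypothetical synonym table; a real system would have to learn such groupings.
SYNONYMS = {"films": "movies"}


def topic_of(like):
    """Map a like to a topic name by normalising case and known synonyms."""
    key = like.strip().lower()
    return SYNONYMS.get(key, key)


def group_likes(likes):
    """Group likes that represent the same general interest."""
    topics = defaultdict(set)
    for like in likes:
        topics[topic_of(like)].add(like)
    return dict(topics)
```

With this stand-in, the likes Tennis, tennis, movies and Films fall into two topics, tennis and movies.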

Simlike application level

The Simlike application level entails the modules that meet at least one of these criteria:
● It provides some directly usable functionality to either end-users or third party developers.
● It uses a module in the Simlike application level.

From the user’s point of view this is the most important level, as they will actually interact with these modules.

Currently there are five modules in this layer:
1. Application Programming Interface (API): this is the abstraction of all the components which are in the foundation level of the platform.
2. Facebook application.
3. Mobile application.
4. Website.
5. Background services: services that interact with the outside world in the background, such as sending notification emails, updating user data from Facebook, etc.

Application Programming Interface (API)

The Simlike API module offers several services which can be used by developers. It is designed to be as stable as possible to external clients (such as the Facebook application). It is an abstraction of the layers below it. This abstracted design makes it possible to change the underlying architecture without impacting the clients. Both Simlike and third party developers wishing to make use of the functionality Simlike offers can use the API.

Furthermore, the API is used for communication between modules. Since there is only “vertical communication” (see figure 6) in the architecture, the modules may not communicate directly with each other. For example, the mobile application may not talk directly to the Facebook application; it must do so via the API. This is done to keep the maintenance effort as low as possible. If all modules talked to each other, a coupling explosion would occur, which would have negative consequences for code maintenance (e.g. a change to the Facebook application could break all other components, which is undesired behaviour). By restricting the modules to talk to the API only, only one interface has to be maintained and components can work independently of each other.
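The size of such a coupling explosion can be made concrete with a little arithmetic. The Python sketch below is an illustration only; the function names and the module counts used in the example are assumptions, not figures from the project.

```python
def direct_links(n_modules):
    """Interfaces needed if every module may talk to every other module."""
    return n_modules * (n_modules - 1) // 2


def api_links(n_modules):
    """Interfaces needed if every module talks to one shared API only."""
    return n_modules
```

For four client modules the difference is small (6 direct links against 4 links to the API), but for ten modules it is 45 against 10: direct coupling grows quadratically, while coupling through the API grows linearly.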

The API is used to provide the Facebook and mobile applications with (most of) their functionality. However, since the API was developed as part of this project, it is not yet used by any third party developers. It provides the following functionality:
● Suggest other people to the user based on common interests.
● Search for other people that have the given interests.
● Calculate the interests two users have in common.
● Retrieve a user’s profile information.
● Enable users to chat and share photos, simlikes and places they have visited while they are chatting.
● Enable users to exchange messages.
● Retrieve and modify a user’s top likes.
● Synchronise a user’s profile information with Facebook.
● Automatically perform authentication for a user (once he is logged into Facebook) when using one of the above functionalities.

JavaScript SDK

A JavaScript Software Development Kit (JSSDK) has been developed to expose the Simlike functionality of the API to web platforms. Currently it is used by the Facebook application and the Mobile application. Reusing the same code for both applications saved a lot of development time. The JSSDK consists of the following components:
● Simlike, the core JSSDK component. It is used to talk to the API servers.
● Simlike.Facebook, which handles all the Simlike/Facebook interaction (e.g. logging into Simlike with a Facebook account).
● Simlike.Chat, which makes the chat and real-time sharing functionality available to applications.

Facebook application

The application central to this project is the Facebook application, which makes the functionality that Simlike offers available to all Facebook users. Users with a Facebook account can log in to the application, at which point Simlike is displayed to them. They can start using it immediately once they enter the application.

The application is written in HTML, JavaScript and CSS. It communicates with the JavaScript SDK to access the API functions and the Facebook API to retrieve data to display in the GUI. The JavaScript part of the application was originally a single JavaScript file that handled the management of the GUI, the communication with the API, the connection to Facebook and other functionality such as Chat and Mail. In the final weeks of the project, however, the functionality was divided in a more object oriented fashion, which also resulted in the JavaScript SDK, which can be used by any JavaScript application. The JavaScript that still belongs to the Facebook application consists of the following components:
● Simlike.Gui.FbApp, which edits the HTML using jQuery templates (jQuery is a third party JavaScript library). It requests the data from the JSSDK.
● Simlike.Gui.FbApp.Chat, which calls the JSSDK’s Chat function and displays it in conversation screens.
● Simlike.Gui.FbApp.Mail, which sends and displays mails. It communicates with the API through the JSSDK’s Simlike.api() call.

Mobile application

The mobile application is another manner in which the services provided by the Simlike API are exposed to users. The mobile application has a slightly different feature set. In comparison to the Facebook application, it does not support any chat functionality. It does support basic location sensing, which the Facebook application does not have. The two applications are meant to be complementary.

To accelerate the development of the mobile application, the project team has decided to use a declarative programming language called Mobl, developed by Zef Hemel at Delft University of Technology [MOBL]. The Simlike API calls are executed asynchronously. Normally this brings extra complications to building a mobile application, but Mobl code is synchronous. This relieves the developers of accounting for the asynchronous nature of the API. Mobl compiles to HTML, CSS and asynchronous JavaScript.
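The benefit of writing sequential-looking code on top of asynchronous calls can be illustrated by analogy. The sketch below uses Python's `asyncio` rather than Mobl, so it is only a comparison of styles; `fetch_profile`, `fetch_top_likes` and `profile_screen` are hypothetical stand-ins and not Simlike API calls.

```python
import asyncio


async def fetch_profile(user_id):
    """Stand-in for an asynchronous API call (hypothetical)."""
    await asyncio.sleep(0)          # yields control, like a network round trip
    return {"id": user_id, "name": "user-%d" % user_id}


async def fetch_top_likes(user_id):
    """Second hypothetical asynchronous call."""
    await asyncio.sleep(0)
    return ["Tennis", "Films"]


async def profile_screen(user_id):
    # Reads top to bottom like synchronous code, yet each call is asynchronous.
    profile = await fetch_profile(user_id)
    likes = await fetch_top_likes(user_id)
    return "%s likes %s" % (profile["name"], ", ".join(likes))
```

Running `asyncio.run(profile_screen(7))` executes both calls in order without any explicit callbacks, which is the kind of bookkeeping the developers did not have to write by hand.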

The mobile application has been designed as a collection of screens, where each screen provides specific functionality. The main screen (see figure 5), for example, serves as an entry point, and leads – among others – to a screen to view and edit top likes, as well as another to search for other users based on the user’s likes.

Simlike website

The website is meant as a simple web page which redirects users to the Facebook application once it is ready. For now, it contains a message to users who visit the website before the Facebook application has launched. The website is accessible at www.simlike.com. The planned launch date for the Facebook application is in August 2011.

Background services

The background services component contains all services that communicate with external services. For example, Facebook’s terms of use dictate that user data originating from Facebook should be kept as up-to-date as possible. To keep the user information synchronised between Facebook and Simlike, a supporting background service has been designed. It works by asynchronously pulling user data from Facebook and incorporating it into Simlike’s infrastructure, in such a way that it does not use resources that are needed to serve the users who are online at that moment.

In the future, email notification services may also be integrated into this component.

Implementation

This section describes the course of the implementation. The implementation took the form of SCRUM sprints. For each sprint a progress report was created. These progress reports are attached to this report as Attachments F1 through F10, describing in detail sprints 1 through 10 respectively. A summary of these sprints is given in this section, which provides a good overview of the implementation phase.

Sprint 1

The requirements have been gathered and analysed, and they have been incorporated in the functional design. Several requirements have use cases; more will be added to the functional design document as more sprints are executed. Simple GUI mock-ups have been created as part of the functional design. The team communication is going reasonably well, but there is room for improvement. This is to be expected, however, as the team members are still getting used to each other.

Sprint 2

For this sprint, the team had planned to finish the first version of the Technical Design Document (Attachment E). The team members were not able to complete all of the planned work. Some details about the algorithm, databases and hardware have been deferred to following sprints.

On a general level, the team defined the product’s architecture, decided on an approach for the required algorithms, determined the structure and requirements of the graphical user interface (GUI) and did hardware and database research. As stated above, this research has not yet been completed.

Sprint 3

In this sprint, further steps were made in the design phase. There were still a lot of decisions to make, and motivating them took considerable time. These decisions include selecting a programming language, a database and a web framework. Work on the Technical Design Document (Attachment E) has been done. A design for the first algorithm has been made and some of the most important tools have been configured.

It was planned to make a first version of the class diagram, but that proved to be too ambitious.

Sprint 4

In this sprint the first prototype was created. A very basic and still quite non-functional Facebook application was constructed, as well as an interface to the database. The database itself is not yet completed and neither is the coupling between the interface, the database and the Facebook application. This is the reason the Facebook application is not yet very functional.

The workload was grossly underestimated for this sprint and the team did not work efficiently enough. To counter this problem, one team member is designated as a “help desk” who can help out other team members on the fly. When he is not helping others, he can work on the reports and documentation.

Sprint 5

The GUI and API are working and have been integrated. The database is not ready due to setbacks. Some tasks that depend on the database could therefore not be done. A test suite has been set up and test cases have been written. The test cases all succeeded.

Sprint 6

In this sprint test specifications and test cases were created for the API and database. More documentation has been created to describe the interactions of the different libraries, classes and modules.

The team members are now using Maven to quickly compile the project, so dependencies no longer have to be managed manually. Previously the team had been using Apache Ant. The switch also led to a refined toolchain, using Mockito instead of Mockrunner and Jetty instead of Tomcat.

The database is now complete and can be used, but the communication between the database and the API still needs to be implemented. The communication between applications (like the Facebook application) and the API has been defined, including the URLs to visit and the format of the JSON objects that are returned. Users can also be authenticated before they can use the API.

There were technical difficulties and too many hours were planned, which put the team further behind schedule. To prevent this from happening again, an explicit feature pool will be used, with time indicators per task. This allows closer monitoring of progress against the planning, both as a team and as individuals.

Sprint 7

In this sprint the documentation has been brought up to date and the code has been tested and checked for style errors.

The communication between the API and the database has been implemented. The development of a small application, called Project ADEL, has been started. Project ADEL gathers demo data with consent from real Facebook users. Users are asked if they want to “donate” their Facebook data, which will be used to fine tune the algorithm. However, not much data has been gathered from it yet. This test data will be removed afterwards. Project ADEL also serves as an integration test: it can communicate with Facebook, a user, the API and (through the API) the database.

A lot of tasks that had not been completed in earlier sprints were finished, but on the other hand, some tasks planned for this sprint were not finished. The most important unfinished task is the creation of a global class diagram.

Sprint 8

In this sprint, Project ADEL and the possibility to select, change and retrieve top interests were finished. In addition, all mocks for the database have been replaced by real implementations. Only matching, chat and interest-clustering remain to be implemented for the Facebook application.

From this sprint on the team members will work on separate tasks in teams of two people. The team expects that this will make it possible to implement the last features faster and to be in time for the deadline.

Sprint 9

DECLARED CLASSIFIED BY NERVAL LIMITED

Two algorithms were implemented: automatic matching and search matching. The latter was not planned, but finished as a by-product of automatic matching. One-to-one matching, which was planned, was not completed.

The chat system is up and running. The chat client GUI was not finished due to sickness of a team member.

DECLARED CLASSIFIED BY NERVAL LIMITED

A start was made on the development of the mobile application.

This week the team was divided into smaller groups to work on more features simultaneously, which proved to be very effective. Nonetheless, not all tasks were finished this week.

Sprint 10

With the approaching deadline, the team decided to spend more time on finishing features and less time on reports. The progress will have to be documented eventually. The entire last week is planned for this. In the meantime, the technical design will be kept up to date in the draft of the Technical Design Document (TDD). This saves time because the team does not have to merge this week’s progress report into the TDD.

DECLARED CLASSIFIED BY NERVAL LIMITED

A JavaScript SDK was added to integrate features faster on multiple platforms and to increase maintainability.

A lot of the must-have features are implemented in the API and in the JavaScript SDK. However, not all of them are fully integrated in the GUI yet.

The team members are working overtime to integrate all the must-have features in the GUI.

Final sprint

In this sprint, the final report, the code analysis report (as an alternative to the SIG code review), the functional design document and the technical design document were finalised.

Chat functionality was added to the Facebook application and the JSSDK. It includes support for sending text, sharing photos, places and interests.

The following functionality was added to the Mobile application:
● A buddy list, which enables users to view a list of their buddies, mail with them, and view their profiles.
● Location sensing, which enables users to see where they are at the moment.
● Mail inbox, which gives an overview of all the mails users have sent and received. In addition, they can reply to received mails.
● Suggestions, which suggests potential buddies to users.
● View simlikes, which enables users to see their simlikes and select their top simlikes.

Quality assurance

This section discusses the way the project team aims to provide some assurance of the quality of the delivered product. It is important that the product is of a high enough quality to be used commercially. Therefore some measures were taken to monitor the quality. These measures are discussed in this section.

There are a lot of factors that determine the quality of Simlike. To ensure a reasonable level of quality, all of these aspects need to be taken into account or monitored during the process. The most important of these are:
● Security - Simlike will eventually possess a lot of personal information about users. This information will need to be stored securely.
● Law and licenses - When working with personal information, privacy laws have to be adhered to. Also, the use of the Facebook API and other external tools binds Simlike to several licenses.
● Availability/Reliability - The system will need to be online and available to a lot of users at the same time all over the world.
● Expandability/Maintainability - The client plans to keep innovating and therefore changing or adding to the product. Therefore Simlike will have to be easy to maintain and expand.

These aspects, and what was done to assure their quality, will now be discussed. Afterwards the test environment that was constructed for this project will be discussed.

Security

Security is not absolute; it is a matter of risk management, and it should be built into the application, not tacked onto it as an afterthought. This means security needs to be taken into account when designing the product’s features, which required some research on security. The results of this orienting research can be found in Attachment H. These are the most important measures that have been taken to promote security:
● Using an Access Control List in combination with the Least Privilege Principle.
● Ensuring availability by designing enough redundancy/duplication into the system to make it more difficult for DoS attacks to succeed.
● Storing only salted credentials in the database.
● No “turtle shell” design.
● Using Apache Shiro to take care of most of the authentication and authorization code.
● No Security Through Obscurity.
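To illustrate what storing only salted credentials means in practice, the following Python sketch derives a salted hash with the standard library. It is not the project's implementation (which relies on Apache Shiro); the choice of PBKDF2 with SHA-256, the iteration count and the function names are assumptions made for this example.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, chosen for illustration only


def hash_password(password, salt=None):
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)
```

Only the salt and the digest would be stored. Because each user gets a different salt, two users with the same password end up with different digests, which makes precomputed lookup tables far less useful to an attacker.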

Law and licenses

When working with personal information of users, it is important to keep this information safe. If not enough care is taken to secure the information, Nerval Limited may be held accountable for any resulting damages. Also, some countries have specific laws for collecting and processing personal information. First the Dutch laws for the protection of personal information will be discussed, as Simlike will be launched in the Netherlands. Then the legal issues concerning the use of open source software will be discussed.

Dutch privacy law

Simlike will be launched in the Netherlands and will therefore have to comply with Dutch privacy laws. The Dutch privacy law is applicable to the collection, storage and processing of data which can be coupled to specific persons. This means that Simlike will have to abide by this law, unless the data is sufficiently anonymous. Neither Facebook nor Hyves (both popular social media active in the Netherlands) have registered themselves at the College Bescherming Persoonsgegevens (CBP). This means that the Dutch privacy law is not applicable to their information. As Simlike will not use more personal information than either of the two aforementioned companies, the team is confident the law will not be applicable to our information either.

Even so, it is recommended to apply for a registration at the CBP. The team believes this best communicates the intentions and beliefs of Nerval Limited. A privacy statement will also have to be made by the company, describing its intentions regarding the use of the users’ data and who will have access to it. Also, when users ask for an overview of their data, it will have to be given upon request. A small administration fee may be charged to the user for this service. These are the only steps needed to abide by the Dutch privacy law. If Nerval Limited does not wish to register itself at the CBP, it must take enough active security measures that it cannot be held responsible for potential attacks; this is required by Dutch law.

External software and licenses

A lot of external libraries and software packages were used to be able to complete the project within the time given for this bachelor project. This external software is mostly open source and comes with a range of licenses:

● GNU General Public License v2: This is the archetypal copyleft license, which requires the source code of any such licensed product, if the product is delivered to users, to also be made available to those users upon request. Furthermore it also requires that any derivative work of any such licensed product is also licensed under the GPLv2 or higher if it is ever released to the public.
● GNU Lesser General Public License v2: This is equivalent to the GPLv2 except for one clause, which allows software licensed under the LGPLv2 to be used by software that is not licensed under this license. However, the rules for such use are quite specific: the non-(L)GPL software may not be a derivative work of the used code.
● Apache License v2: This is a permissive free (as in free speech) license, allowing software licensed under it to be included in proprietary code while allowing the proprietary code to remain proprietary. It does require, however, that every file licensed under the ALv2 retains any copyrighted material, including the license itself.
● MIT license: While there is no single “MIT license”, most versions are very similar to the Apache License v2. The main difference is that some incarnations of the MIT license contain a clause prohibiting advertising the names of the authors of software licensed under it.

However, it is important to note that no software developed as part of Simlike will be made available to users to run on their own terms; the software is built only to enable the delivery of multiple services to users. In this model many of the obligations imposed on Nerval Limited by the licenses do not apply. Under the GPL, for instance, the sources need not be delivered because no binary is delivered.

Apart from the use of actual external software, Simlike also uses the services of Facebook’s API. The use of these services also comes with some obligations.

Facebook has a policy regarding the data that is pulled from Facebook [FB]. In a nutshell:
● It is only permitted to request the data needed to operate Simlike.
● It is allowed to cache the data, though it should be kept up-to-date as much as possible.
● A privacy statement about the data is required.
● The data which is relevant to a particular user’s friends may only be used in the context of that friendship (this is to prevent leaking of personal data to other people).
● Facebook data (and derivatives) may not be transferred to third party advertising networks or data brokers.
● Facebook data may not be sold. If Nerval Limited is acquired by or merges with a third party, it is allowed to continue to use the data within the application the license was given to.
● If the license is terminated, Nerval Limited must delete all Facebook data.
● A user may request the deletion of all of his data; Nerval Limited is obligated to comply.
● Facebook data may not be included in any advertisements.

Availability/Reliability

For a social media platform to be successful it requires a large user base and very little down time. To achieve this, some redundancy is needed in data storage and no single point of failure should exist. As discussed in Section Technical Design, this was achieved by using Cassandra as a database and by using Amazon Web Services. Also, tests were written to increase the reliability of the system.

The tests have identified bugs in the behaviour of the Simlike platform code, and the project team is quite confident in the correctness of their implementation. At the end of the project only half of the system had been covered by automated unit tests. The advantage of these tests is that they can easily be rerun after changes to the code. The other half of the system has been tested manually, so this entire half has to be manually tested again after each small change. Therefore the creation of automated unit tests for the other half of the system has a very high priority.

Expandability/Maintainability

Code that is simple, easy to understand and well documented is easier to maintain. Therefore the tools Checkstyle, PMD and Sonar were used to analyse the style of the code and safeguard its maintainability. Each Java class and method has Javadoc documentation. For each class, a class diagram covering the scope of the class is automatically generated. An example of such a diagram can be seen in figure 7.

Figure 7. The class diagram belonging to the Javadoc of the FacebookRealm class.

Test environment

Although it is not feasible to prove that a complex collection of software is correct, testing its behaviour can provide some level of confidence in its correctness. For this purpose a test environment has been set up. The test environment includes JUnit, QUnit and Mockito.

Both JUnit and QUnit make it possible to create automated tests, the advantage of which is that relatively little overhead is added each time they are run; in contrast, manual testing adds quite a big overhead. The fact that these test cases can be rerun makes it easier to test whether or not any change to the system has been performed correctly. This increases the maintainability of the system.

Some classes depend heavily on others. For example, a servlet needs an implementation of the HttpServletRequest interface to function properly. Therefore the interface needs to be mocked in order to create unit tests for a servlet. The mocking process is taken care of by Mockito, which offers an easy and flexible way to mock any class.
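The mocking idea can be shown with an analogous example. The sketch below uses Python's `unittest.mock` instead of Mockito, so it illustrates the technique rather than the project's test code; the `greet` handler and the `get_parameter` method are hypothetical.

```python
from unittest.mock import Mock


def greet(request):
    """Hypothetical handler: reads the 'name' parameter from a request object."""
    name = request.get_parameter("name") or "stranger"
    return "Hello, %s!" % name


def test_greet_uses_name_parameter():
    request = Mock()                                 # stands in for a real request
    request.get_parameter.return_value = "Alice"     # canned answer for the test
    assert greet(request) == "Hello, Alice!"
    request.get_parameter.assert_called_once_with("name")


def test_greet_falls_back_when_parameter_is_missing():
    request = Mock()
    request.get_parameter.return_value = None
    assert greet(request) == "Hello, stranger!"
```

The handler is exercised without a running web server: the mock supplies the one method the handler needs and records how it was called.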

The test suite may be run from any project using the `mvn test` command. This will automatically run all test cases and display the results on the command line. A graphical overview of the test results may be obtained through Sonar.

Evaluation

The goal of this section is to evaluate the work and development process of the bachelor project. This is mainly useful for the team members to reflect upon and learn from the mistakes that were made. It also provides an overview of the pitfalls that were encountered during the development, which can be used by future developers to avoid making the same mistakes.

Development environment

This section evaluates the current development environment at Nerval Limited. The development environment was set up by the team members, since there was no prior in-house technical development at Nerval Limited.

The development environment has grown to a mature and professional status. A test environment is set up for the software. Developers can run tests and the software on their local computers. Once the software is stable, it is committed to the central product. The central product is hosted on a development server. This way the developers can immediately benefit from the new working functionality, which aids the collaboration between developers. Before any products are deployed to a production environment, a set of unit tests and integration tests is run. This is done to minimise the chance of software failure in the production environment.

Another key aspect of the development environment is developer mentality. All developers agree that code quality is a key aspect of development and that it should not be neglected. This is part of a healthy developer mentality. To get an impression of the code quality, consider the documented code snippet in figure 8:

Figure 8. Code snippet of the source code produced by the project team.

Tools for keeping up the code quality are included in the standard developer’s toolkit. The standard toolkit for a developer consists of professional tools like Eclipse, JUnit, Maven, Checkstyle, PMD and a -based . For a complete list of tools, please see Section “Setting up the development environment” of this report.

There were a number of tools which were used, but have been discarded in the course of the project. A summary of the discarded tools:

1. Various UML tools, such as Papyrus, ArgoUML, TaylorMDA and the Eclipse UML2 Tools. They have been discarded because none of them proved stable and user-friendly enough to be considered production quality or even productivity enhancing. As an alternative, the UML diagrams are now automatically generated and inserted into the Javadocs by a Maven plugin for UmlGraph.
2. DECLARED CLASSIFIED BY NERVAL LIMITED

3. Apache Ant, a build script tool designed specifically with Java projects in mind. It has been discarded because Simlike consists of more than only Java code and Ant build scripts are quite maintenance intensive relative to Maven, which uses the coding by convention paradigm to achieve the same results (and in some cases, more) with considerably less configuration.

4. Jenkins, an automatic build tool that can integrate with Maven and GitHub to create a Continuous Integration flow in which Jenkins is notified of every git push to GitHub, subsequently pulls from GitHub and builds the result. It has been discarded because of a software flaw in GitHub where pushed code triggers no notification towards Jenkins. This caused the added value of CI for the team to evaporate.
5. Mantis, an Open Source bug tracker. While there is, to the knowledge of the project team, nothing wrong with Mantis itself on a technical level, the flaw here was more of a social one: it simply did not fit naturally with the work flow the team adopted during the time spent on the BSc project.

Progress

In this section the planning and the scrum process are evaluated. The development process changed considerably during the development of Simlike.

Planning

In this section the planning of the project is evaluated. First the global planning is discussed, then the planning during the scrum sprints.

The first part of the global planning of the bachelor project consisted of selecting the requirements that could be met during the project and ranking them as must haves. The team underestimated the work that was involved with setting up and learning to work with so many external software packages. This led to a lack of time to complete everything and the testing process suffered because of it.

The planning of features in the scrum process turned out to be quite difficult. In the beginning the number of planned features per sprint was far too large. After a couple of sprints the team got better at using planning poker to estimate the workload of features. It turned out that planning 25 to 30 hours per person per week was a good estimate. The rest of the time was lost on communication, unexpected problems and helping other team members.

The team also learnt that big features should be split into smaller tasks or structured in some other way. If this is not done, people may get 'lost' in the big task and it will take longer to complete. A little planning overhead is actually quite useful in these cases.

It is possible to decrease the communication overhead by working in smaller teams. Small decisions can be made without discussing them with the whole team, which saves a lot of time. Also, assigning only a small number of tasks to one team member, so that he can serve as a help desk for the others, greatly speeds up the work of the other team members.

All the tasks were listed on a Gantt chart (see Attachment A). A printout of each week was also made for the supervisors; these are attached to the Progress Reports (Attachments F1 through F10). For the team members, daily TODO lists were also printed out, based on the tasks from the Gantt chart. This way the team members had an overview of the tasks to be done on a global, weekly and daily scale.

Orientation

A lot of attention was paid to orientation and researching design decisions. For the web and database platform this paid off. Most of the tools, techniques and external libraries have turned out to be excellent choices. However, the algorithm platform research could have been performed in a more efficient manner. The algorithm was researched and developed before choosing a specific algorithm platform. It would have been better to have looked for existing algorithm platforms first, to see what was already possible in terms of algorithms. DECLARED CLASSIFIED BY NERVAL LIMITED. It was a setback that a first algorithm implementation was too slow. DECLARED CLASSIFIED BY NERVAL LIMITED.

The project team planned to use Mantis to list all features and tasks and keep track of all bugs, including everything that was discovered during implementation. In the end Mantis was not used. Instead, the team used a Gantt chart and placed 'todo' comments in the code, which can easily be tracked with the grep command-line tool. This was mainly because the use of Mantis (or any other bug tracker) did not fit into the workflow adopted during the project; another factor was that Mantis did not have Gantt chart support. Using Mantis would probably have improved the overview of all tasks, bugs and problems, which in turn could have helped in improving the planning on the Gantt chart.
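
The 'todo' tracking described above can be illustrated with a small sketch. The class below is not part of the Simlike code base: the name TodoScanner, the restriction to .java files and the case-insensitive match on "todo" are all assumptions made for this example. It lists matching lines in roughly the way a recursive grep with line numbers would.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not from Simlike): lists source lines that contain a todo marker. */
class TodoScanner {

    /** Returns "file:line: text" for every line under root that contains "todo" (any case). */
    static List<String> findTodos(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(p -> p.toString().endsWith(".java")) // assumption: Java sources only
                        .sorted()
                        .collect(Collectors.toList());
        }
        List<String> hits = new ArrayList<>();
        for (Path file : files) {
            List<String> lines = Files.readAllLines(file); // assumes UTF-8 encoded sources
            for (int i = 0; i < lines.size(); i++) {
                String line = lines.get(i);
                if (line.toLowerCase(Locale.ROOT).contains("todo")) {
                    hits.add(file + ":" + (i + 1) + ": " + line.trim());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Scan the directory given as first argument, or the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        findTodos(root).forEach(System.out::println);
    }
}
```

Such a listing gives an overview of open items, but unlike a bug tracker it records no priority, owner or history, which matches the trade-off discussed above.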

The use of a mocking framework to test some of the code was unforeseen. Instead of taking some time to look for frameworks and compare them, Mockrunner was chosen because it was the first one that was found. It turned out Mockrunner was rather limited in its applicability so in the end research still had to be done. Mockito is now used, as it offers great flexibility and control in mocking classes.

Implementation

This section reflects upon the implementation of Simlike's features. This includes the use of the implementation tools and external software, and the implementation process itself.

The most important development tool for this project was Eclipse. It caused the project team some problems with badly configured classpaths and build paths, but thanks to the available plugins more time was saved than lost by using Eclipse. One of the plugins used was the Maven plugin for automatic building of the project. This plugin solved the build path problems; only a Jetty-related classpath problem remained. The style and code analysis plugins for Eclipse were also very useful.

The most difficult external software package used for this project was Cassandra. It took a whole different mindset to get to know the workings of Cassandra. Since the schema used for storing data in Cassandra has an impact on the efficiency, it was especially important to get familiar with it first. Although the setup of Cassandra took a lot of time, it also saved a lot of time by taking care of scalability, availability and reliability (data duplication, no single point of failure).

Collaboration

In this section the collaboration between team members is evaluated. The lessons learned from this project can be applied to other projects.

The scrum concept of a list of features to be completed during the sprint proved important. Using such a list improved the planning and gave each project team member a better overview of not only his own tasks, but also those of others. This list helped prevent the pitfall of poor awareness of other members' work and eventually replaced the scrum meetings at the end of the day. The list does not have to be made as a formal document and can simply be written on a blackboard or piece of paper. This is part of the more informal process of scrum development.

As was mentioned before, not all of the daily meetings of the scrum process were held. The meetings at the end of the day could be replaced by the feature lists and during the project the meetings got more informal. This informal style of development worked quite well, as it was a good start of the day to talk and discuss problems with each other in a relaxed fashion.

The informal management style and the team building trips led to a good collaboration. The team members had a lot of fun working with each other.

Quality assurance

Several tasks concerning quality assurance did not go according to plan. Due to pressure from deadlines, a lot of untested code was created. This is not good for reliability, and for future projects it would be wise to plan fewer features in order to be able to create more test cases.

Because of trouble with non-disclosure agreements the code could not be checked by the Software Improvement Group (SIG). As an alternative, the SIG suggested that the team analyse the code itself using Sonar. This gave a lot of new insights into the code quality.

It turns out that the code analysis tool Sonar provides exactly the same information as SIG would have provided (and more). The SIG developed a plugin for Sonar which ranks the quality of the code on a scale from 1 through 5. Although there are enough possible improvements, the code produced by the project team is of high quality. It scored 3.5 out of 5 on the SIG maintainability ranking, and this can be boosted to 4.5 by increasing test coverage. Reaching this score is estimated to take three days of effort.

Although there is no time left to improve the code now, quite extensive and detailed recommendations are given in Attachment L. The main areas of improvement reside in test coverage and cutting dependencies. For the JavaScript code, the main improvements are test cases and code style. The results are also documented in Attachment L.

The JavaScript code could not be analysed as well as the Java code, because Sonar could not deal with the object-oriented approach spread over several files. However, apart from the absence of unit tests, it scored quite well according to the SIG plugin for Sonar. Also, the lack of any tools to test MOBL code made the MOBL code difficult to test automatically.

Final result

The final result of the project is evaluated in terms of the MoSCoW model. The final result of the bachelor project contains almost all must have features. Only the mobile application and activity monitoring were not completely finished. The project team did finish the mobile application in terms of functionality, but its GUI was not finished. The activity monitoring feature was not implemented.

There were two main reasons why these two must have features were not finished. First, the team decided to implement the API, which is a should have, before implementing some of the must haves. This was a good choice, because it saves enormous amounts of time and money in the long run: it greatly reduces the code complexity of the entire platform and increases the maintainability. For the client, this means huge savings on future development costs. For future developers, it saves huge headaches. In fact, the API should have been a must have.

Second, the number of new tools to get familiar with was greatly underestimated. In total, 38 tools were used and learned by the team. Of these 38 tools, 31 were new to the team members. At the start of the project only 4 new tools were identified (see Attachment B). Thus, the research and use of so many tools and external libraries took more time than anticipated. As a result, too many features had been classified as must haves, given the time left for implementation.

Recommendations

This section provides several recommendations for the continuation of the project, covering necessary improvements, potential new features, the collaboration process and the continuation of Simlike.

Improvements

The following improvements will have to be made to the Simlike platform.

Quality assurance

In the beginning of the project the team decided to use Lombok to reduce boilerplate Java code. An ugly side effect of Lombok is that quite a few tools are incompatible with it. PMD, Checkstyle and Sonar give false negatives because of Lombok annotations. Javadoc does not recognise the generated getters and setters; this requires running a tool called delombok, which converts the Lombok code to regular Java code. Also, the way Lombok is installed into Eclipse (it installs a JAR file directly into Eclipse) is not very maintainable, because it does not adhere to the Eclipse way of maintaining plugins. Given these limitations, and the fact that Lombok is used in only 17 files, the code quality would benefit if the Lombok library were removed from the project.
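
To give an impression of what removing Lombok would involve, the sketch below shows a class with hand-written accessors next to the Lombok form it would replace. The class Interest and its field are hypothetical and are not taken from the Simlike code base; the explicit methods are roughly what delombok would produce for the annotated version.

```java
/*
 * Hypothetical example class, not taken from the Simlike code base.
 * With Lombok it might have been written as:
 *
 *     @Getter @Setter
 *     class Interest {
 *         private String name;
 *     }
 *
 * Below, the accessors are written out by hand, so that Javadoc, PMD,
 * Checkstyle and Sonar all see ordinary Java methods.
 */
class Interest {

    private String name;

    /** Returns the name of this interest. */
    public String getName() {
        return name;
    }

    /** Sets the name of this interest. */
    public void setName(String name) {
        this.name = name;
    }
}

/** Small demonstration of the explicit accessors. */
class InterestDemo {
    public static void main(String[] args) {
        Interest interest = new Interest();
        interest.setName("music");
        System.out.println(interest.getName());
    }
}
```

With only 17 files affected, writing out (or generating once with delombok) the accessors in this way is a limited, mostly mechanical change.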

Because it was not possible to send the source code to the Software Improvement Group, a detailed analysis of the source code was made by the team itself. The results can be found as Attachment L. The report contains detailed recommendations for code improvement and these recommendations should be executed before any new functionality is added to Simlike.

Security

Security is very important for a social media platform. Therefore it is important to use the HTTPS protocol instead of the less secure HTTP protocol for the API calls. Sensitive user data is sent to the API, meaning data encryption is required. This is important enough to delay the Simlike launch until it is implemented. To implement HTTPS, an SSL certificate must be purchased from a trusted party and configured on the web server.
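
Besides configuring the certificate on the server, a client of the API could refuse to send data over plain HTTP. The following is a minimal sketch of such a guard, under the assumption that API clients are written in Java; the class name ApiEndpoints, the method requireHttps and the example.com URLs are hypothetical and do not appear in the Simlike code.

```java
import java.net.URI;

/** Hypothetical client-side guard (not from Simlike): only HTTPS endpoints are accepted. */
class ApiEndpoints {

    /** Parses the URL and rejects anything that does not use the https scheme. */
    static URI requireHttps(String url) {
        URI uri = URI.create(url);
        // getScheme() is null for relative URIs; those are rejected as well.
        if (!"https".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("API calls must use HTTPS: " + url);
        }
        return uri;
    }

    public static void main(String[] args) {
        // Accepted: encrypted endpoint (placeholder host).
        System.out.println(requireHttps("https://api.example.com/users"));
        // Rejected: unencrypted endpoint.
        try {
            requireHttps("http://api.example.com/users");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A guard like this does not replace the server-side configuration; it only prevents a client from accidentally sending sensitive data unencrypted.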

New features

The following features could be of high interest to Simlike in the future.

Algorithm platform

DECLARED CLASSIFIED BY NERVAL LIMITED

Finer grained authorisation

The database has been set up in a way that allows a finer grained authorisation mechanism to be implemented easily, should it become necessary. The recommended way to implement such a mechanism is to grant permissions to a simtoken. These permissions are then stored in the column value of a simtoken in the database (see Section “Cassandra Schema” in Attachment E). The column value is currently not used, and has been left empty for this purpose. This would allow different permissions per authenticated session. This is useful, for example, when a third party wants to access Simlike on behalf of a user, but the user does not want the third party to be able to access his/her private messages.
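
As an illustration of this recommendation, the sketch below shows one way permissions could be serialised into the column value of a simtoken and checked afterwards. Everything in it is an assumption made for the example: the permission names, the comma-separated encoding and the class SimtokenPermissions are hypothetical, and reading from or writing to Cassandra is left out.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch (not from Simlike): permissions stored in a simtoken column value. */
class SimtokenPermissions {

    /** Assumed permission names; the report does not define a concrete set. */
    enum Permission { READ_PROFILE, READ_MATCHES, READ_PRIVATE_MESSAGES }

    /** Encodes the permissions as a comma-separated string, in declaration order. */
    static String encode(Set<Permission> permissions) {
        return permissions.stream()
                          .sorted()
                          .map(Enum::name)
                          .collect(Collectors.joining(","));
    }

    /** Decodes a column value; in this sketch an empty value grants no permissions. */
    static Set<Permission> decode(String columnValue) {
        Set<Permission> result = EnumSet.noneOf(Permission.class);
        if (columnValue == null || columnValue.isEmpty()) {
            return result;
        }
        for (String name : columnValue.split(",")) {
            result.add(Permission.valueOf(name.trim()));
        }
        return result;
    }

    /** Checks whether the session behind this column value may perform the action. */
    static boolean isAllowed(String columnValue, Permission required) {
        return decode(columnValue).contains(required);
    }

    public static void main(String[] args) {
        // A third party gets access to profile and matches, but not to private messages.
        String value = encode(EnumSet.of(Permission.READ_PROFILE, Permission.READ_MATCHES));
        System.out.println(value);
        System.out.println(isAllowed(value, Permission.READ_PRIVATE_MESSAGES));
    }
}
```

Because existing simtokens have an empty column value, a real implementation would have to decide whether an empty value means full access (as today) or no access (as in this sketch).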

Collaboration

While the development team remains small, an informal management and development style works great for team building and motivation. If the development team gets bigger, it is recommended to split it up into smaller teams. The smaller teams can work informally among themselves and have formal meetings with the project manager and the other teams.

Development process

Because social media is constantly evolving, it is important to use a flexible development approach. SCRUM is recommended, as it is a flexible approach to development that worked out great for the project team. However, it is important not to lose sight of the global planning when using SCRUM.

The use of Eclipse, Checkstyle, PMD, Sonar, Maven, git, JUnit, Mockito, Cassandra, and Shiro is recommended, as these tools, libraries and software packages worked out great during the development.

For future development either Mantis or Jira is recommended to keep better track of the tasks and bugs that need to be completed or fixed. DECLARED CLASSIFIED BY NERVAL LIMITED.

Continuation of Simlike

It is recommended to attract a team of at least five developers to continue the development of Simlike. This recommendation is based on the number of features planned for the end of this year (these features were not part of the bachelor project). The project team believes that Nerval Limited wishes to launch the first version of Simlike in August 2011; for that, at least three developers will have to continue the project immediately after it has been transferred back to Nerval Limited.

References

[AWS] Amazon, “Amazon Web Services”, http://aws.amazon.com/, accessed on July 9, 2011.
[AEB] Amazon Web Services, “AWS Elastic Beanstalk”, http://aws.amazon.com/elasticbeanstalk/, accessed on July 9, 2011.
[ACF] Amazon Web Services, “Amazon CloudFront”, http://aws.amazon.com/cloudfront/, accessed on July 9, 2011.
[CAS] Cassandra, “The Apache Cassandra Project”, http://cassandra.apache.org/, accessed on July 9, 2011.
[FB] Facebook, “Facebook platform policies”, https://developers.facebook.com/policy/, accessed on July 10, 2011.
[MOBL] Mobl, “The new language of the mobile web”, http://www.mobl-lang.org/, accessed on July 10, 2011.

Attachments

Legend: Todo, Ready, Overdue
Weeks 1-15 start on: 18-Apr, 25-Apr, 2-May, 9-May, 16-May, 23-May, 30-May, 6-Jun, 13-Jun, 20-Jun, 27-Jun, 4-Jul, 11-Jul, 18-Jul, 25-Jul

Description | Start date | End date | % Complete | Assigned to | Weeks

Study phase | 18-Apr | 24-Apr | 100.00% | Everyone | X
Project kick-off | 18-Apr | 18-Apr | 100.00% | Everyone | X
Study report | 18-Apr | 20-Apr | 100.00% | Everyone | X
Plan of approach | 18-Apr | 22-Apr | 100.00% | Everyone | X

Global development phase | 25-Apr | 9-Jul | 98.15% | Everyone | X X X X X X X X X X due
Requirements analysis | 26-Apr | 26-Apr | 100.00% | Everyone | X
Functional design | 27-Apr | 30-Apr | 100.00% | Everyone | X
Technical design | 2-May | 13-May | 100.00% | Everyone | X X
Final report | 15-Jun | 9-Jul | 100.00% | Everyone | X X X X
Website www.simlike.com | 4-Jul | 10-Jul | 100.00% | Jeroen | X
Facebook application | 16-May | 6-Jul | 99.00% | Everyone | X X X X X X X due
Test environment | 16-May | 6-Jul | 100.00% | Everyone | X X X X X X X X

Sprint #1 | 25-Apr | 1-May | 100.00% | Everyone | X
Meeting with client | 26-Apr | 26-Apr | 100.00% | Everyone | X
Daily meetings | 26-Apr | 26-Apr | 100.00% | Everyone | X
Review plan of approach | 26-Apr | 26-Apr | 100.00% | Everyone | X
Translate user cases into English | 26-Apr | 26-Apr | 100.00% | Joris | X
Research software licenses | 26-Apr | 26-Apr | 100.00% | Jeroen | X
Write summary plan of approach | 26-Apr | 26-Apr | 100.00% | Volker | X
Spellcheck plan of approach | 26-Apr | 26-Apr | 100.00% | Joris | X
Install Mantis bugtracker | 26-Apr | 26-Apr | 100.00% | Joey | X
Weekly report | 26-Apr | 29-Apr | 100.00% | Everyone | X
Requirements analysis | 26-Apr | 26-Apr | 100.00% | Everyone | X
Functional design | 27-Apr | 29-Apr | 100.00% | Everyone | X
Daily meetings | 27-Apr | 27-Apr | 100.00% | Everyone | X
Rent books @ TU library | 27-Apr | 1-May | 100.00% | Joey | X
Mail Peter about omittance of scenarios | 27-Apr | 27-Apr | 100.00% | Volker | X
Functional design: how and why? | 27-Apr | 27-Apr | 100.00% | Joey | X
Security research orientation | 27-Apr | 29-Apr | 100.00% | Joey | X
Configure Mantis bugtracker | 27-Apr | 27-Apr | 100.00% | Joey | X
Daily meetings | 28-Apr | 28-Apr | 100.00% | Everyone | X
Daily meetings | 29-Apr | 29-Apr | 100.00% | Everyone | X
Validate FD with client | 29-Apr | 29-Apr | 100.00% | Joris | X

Sprint #2 | 2-May | 8-May | 100.00% | Everyone | X
Meeting with client and TU Delft supervisor | 2-May | 2-May | 100.00% | Everyone | X
Daily meetings | 2-May | 2-May | 100.00% | Everyone | X
Weekly report | 2-May | 6-May | 100.00% | Everyone | X
Technical design | 2-May | 6-May | 100.00% | Everyone | X
TDD: algorithms | 2-May | 2-May | 100.00% | Volker | X
TDD: GUI | 2-May | 2-May | 100.00% | Joris | X
TDD: database/storage | 2-May | 2-May | 100.00% | Jeroen | X
TDD: Quality assurance | 2-May | 6-May | 100.00% | Everyone | X
Daily meetings | 3-May | 3-May | 100.00% | Everyone | X
Daily meetings | 4-May | 4-May | 100.00% | Everyone | X
TDD: VPS security w.r.t. Amazon vs. Rackspace | 4-May | 4-May | 100.00% | Joey | X
Daily meetings | 5-May | 5-May | 100.00% | Everyone | X
Daily meetings | 6-May | 6-May | 100.00% | Everyone | X

Sprint #3 | 9-May | 15-May | 100.00% | Everyone | X
Meeting with client | 9-May | 9-May | 100.00% | Everyone | X
Plan days of absent team members | 9-May | 9-May | 100.00% | Everyone | X
Daily meetings | 9-May | 9-May | 100.00% | Everyone | X
Daily meetings | 10-May | 10-May | 100.00% | Everyone | X
Daily meetings | 11-May | 11-May | 100.00% | Everyone | X
Daily meetings | 12-May | 12-May | 100.00% | Everyone | X
Daily meetings | 13-May | 13-May | 100.00% | Everyone | X
Weekly report | 9-May | 13-May | 100.00% | Everyone | X
Technical design | 9-May | 13-May | 100.00% | Everyone | X
Finalise reports and drafts of last week for handing in | 9-May | 9-May | 100.00% | Everyone | X
Make appointment with Pascal Wiggers | 9-May | 9-May | 100.00% | Everyone | X
Meeting with Pascal Wiggers | 11-May | 11-May | 100.00% | Everyone | X
Install tools | 10-May | 11-May | 100.00% | Everyone | X
Install platforms | 10-May | 11-May | 100.00% | Everyone | X
Create a prototype web application | 11-May | 13-May | 100.00% | Everyone | X
Implement 1-to-1 matching algorithm | 11-May | 13-May | 100.00% | Everyone | X
Implement automatic matching algorithm | 11-May | 13-May | 100.00% | Everyone | X
Implement searching algorithm | 11-May | 13-May | 100.00% | Everyone | X

Sprint #4 | 16-May | 22-May | 93.64% | Everyone | due
Meeting with client and TU Delft supervisor | 16-May | 16-May | 100.00% | Everyone | X
Rent books @ university library | 16-May | 16-May | 100.00% | Joey | X
Daily meetings | 16-May | 16-May | 100.00% | Everyone | X
Daily meetings | 17-May | 17-May | 100.00% | Everyone | X
Daily meetings | 18-May | 18-May | 100.00% | Everyone | X
Daily meetings | 19-May | 19-May | 100.00% | Everyone | X
Daily meetings | 20-May | 20-May | 100.00% | Everyone | X
Lookup 5 star code requirements for SIG (especially cyclomatic complexity) | 16-May | 17-May | 100.00% | Joris | X
Refactor research study (orientatieverslag) and TDD | 16-May | 17-May | 100.00% | Everyone | X
Setup continuous integration | 16-May | 16-May | 100.00% | Joey | X
Technical design | 16-May | 20-May | 100.00% | Everyone | X
TDD: global class diagram | 16-May | 20-May | 100.00% | Everyone | X
TDD: Design Cassandra/algorithm database mapping | 17-May | 18-May | 100.00% | Jeroen | X
Generate test data | 18-May | 18-May | 100.00% | Everyone | X
Create backup of test data | 18-May | 18-May | 100.00% | Everyone | X
TDD: facebook app | 16-May | 20-May | 100.00% | Everyone | X
TDD: mobile app | 16-May | 20-May | 0.00% | Everyone | X
Facebook app prototype | 16-May | 20-May | 100.00% | Joris | X
API prototype | 16-May | 20-May | 100.00% | Volker | X
Authentication using Apache Shiro | 16-May | 20-May | 100.00% | Joey | X
Implement mobile app | 16-May | 20-May | 60.00% | Everyone | due
Progress report | 16-May | 20-May | 100.00% | Everyone | X

Sprint #5 | 23-May | 29-May | 100.00% | Everyone | X
Meeting with client and TU Delft supervisor | 23-May | 24-May | 100.00% | Everyone | X
Daily meetings | 23-May | 23-May | 100.00% | Everyone | X
Daily meetings | 24-May | 24-May | 100.00% | Everyone | X
Daily meetings | 25-May | 25-May | 100.00% | Everyone | X
Daily meetings | 26-May | 26-May | 100.00% | Everyone | X
Daily meetings | 27-May | 27-May | 100.00% | Everyone | X
Progress report | 23-May | 27-May | 100.00% | Everyone | X
Retrieve data from Facebook: design | 23-May | 23-May | 100.00% | Everyone | X
Retrieve data from Facebook: do it! | 23-May | 23-May | 100.00% | Everyone | X
Set up qUnit | 23-May | 23-May | 100.00% | Joris | X
Integrate database with API (requires DB) | 25-May | 25-May | 100.00% | Everyone | X
UI styling (requires Paris) | 25-May | 25-May | 100.00% | Jeroen | X
Integrate API with Facebook app | 23-May | 23-May | 100.00% | Everyone | X
Set up test framework | 23-May | 23-May | 100.00% | Volker | X

Sprint #6 | 30-May | 5-Jun | 100.00% | Everyone | X
Meeting with client and TU supervisor | 30-May | 30-May | 100.00% | Everyone | X
Daily meetings | 30-May | 30-May | 100.00% | Everyone | X
Daily meetings | 31-May | 31-May | 100.00% | Everyone | X
Daily meetings | 1-Jun | 1-Jun | 100.00% | Everyone | X
Daily meetings | 2-Jun | 2-Jun | 100.00% | Everyone | X
Daily meetings | 3-Jun | 3-Jun | 100.00% | Everyone | X
Integrate database with API (requires DB) | 30-May | 3-Jun | 100.00% | Everyone | X
Retrieve data from Facebook: do it! | 30-May | 3-Jun | 100.00% | Joris | X
Create documentation (TDD) | 30-May | 3-Jun | 100.00% | Everyone | X
Create test specifications: authorisation | 30-May | 3-Jun | 100.00% | Joey | X
Create test specifications: authentication | 30-May | 3-Jun | 100.00% | Joey | X
Create test specifications: API classes | 30-May | 3-Jun | 100.00% | Volker | X
Create test specifications: javascript | 30-May | 3-Jun | 100.00% | Joris | X
Create test specifications: database | 30-May | 3-Jun | 100.00% | Jeroen | X
Implement test cases: authorisation | 30-May | 3-Jun | 100.00% | Joey | X
Implement test cases: authentication | 30-May | 3-Jun | 100.00% | Joey | X
Implement test cases: API classes | 30-May | 3-Jun | 100.00% | Volker | X
Implement test cases: javascript | 30-May | 3-Jun | 100.00% | Joris | X
Implement test cases: database | 30-May | 3-Jun | 100.00% | Jeroen | X
Create the database | 30-May | 3-Jun | 100.00% | Jeroen | X
Implement database-API communication | 30-May | 3-Jun | 100.00% | Jeroen & Joey | X
Create test data | 30-May | 3-Jun | 100.00% | Joris & Jeroen | X
Complete facebook authentication | 30-May | 3-Jun | 100.00% | Joris & Joey | X
Complete authorization | 30-May | 3-Jun | 100.00% | Volker & Joey | X
Flying goalie | 30-May | 3-Jun | 100.00% | Jeroen | X
Concept meeting for testdata application | 30-May | 3-Jun | 100.00% | Everyone | X
Concept meeting for authorization | 30-May | 3-Jun | 100.00% | Everyone | X
Progress Report with "Two Weekly Progress" attachment | 30-May | 3-Jun | 100.00% | Everyone | X
Replace Mockrunner with easyMock | 30-May | 3-Jun | 100.00% | Volker | X

Sprint #7 | 6-Jun | 12-Jun | 100.00% | Everyone | X
Meeting with client and TU Delft Supervisor | 6-Jun | 8-Jun | 100.00% | Everyone | X
Daily meetings | 7-Jun | 7-Jun | 100.00% | Everyone | X
Daily meetings | 8-Jun | 8-Jun | 100.00% | Everyone | X
Daily meetings | 9-Jun | 9-Jun | 100.00% | Everyone | X
Daily meetings | 10-Jun | 10-Jun | 100.00% | Everyone | X
Daily meetings | 11-Jun | 11-Jun | 100.00% | Everyone | X
TDD: global class diagram | 6-Jun | 12-Jun | 100.00% | Jeroen | X
TDD: facebook app | 6-Jun | 12-Jun | 100.00% | Joris | X
Integrate database with API (requires DB) | 6-Jun | 12-Jun | 100.00% | Everyone | X
Create documentation (TDD) | 6-Jun | 12-Jun | 100.00% | Everyone | X
Create test specifications: authorisation | 6-Jun | 12-Jun | 100.00% | Everyone | X
Create test specifications: authentication | 6-Jun | 12-Jun | 100.00% | Everyone | X
Create test specifications: javascript | 6-Jun | 12-Jun | 100.00% | Everyone | X
Implement test cases: authorisation | 6-Jun | 12-Jun | 100.00% | Everyone | X
Implement test cases: authentication | 6-Jun | 12-Jun | 100.00% | Everyone | X
Implement test cases: database | 6-Jun | 12-Jun | 100.00% | Everyone | X
Create test data | 6-Jun | 12-Jun | 100.00% | Everyone | X
Create backup of test data | 6-Jun | 12-Jun | 100.00% | Everyone | X
Complete authorization | 6-Jun | 12-Jun | 100.00% | Everyone | X
Discuss exception handling | 6-Jun | 12-Jun | 100.00% | Everyone | X
Discuss authorization for Facebook application | 6-Jun | 6-Jun | 100.00% | Everyone | X
Review progress on the ADEL project and how it corresponds to the API’s update functions | 6-Jun | 6-Jun | 100.00% | Everyone | X
Finish progress report sprint 6 | 6-Jun | 6-Jun | 100.00% | Volker & Jeroen | X
Change code for new way of exception handling | 6-Jun | 7-Jun | 100.00% | Everyone | X
Create weekplanning | 6-Jun | 6-Jun | 100.00% | Everyone | X
Review and solve PMD warnings and errors | 6-Jun | 7-Jun | 100.00% | Everyone | X
Finish project ADEL | 6-Jun | 12-Jun | 100.00% | Everyone | X
Create test specifications for the new API functions | 6-Jun | 12-Jun | 100.00% | Everyone | X
Implement test cases for the new API functions | 6-Jun | 12-Jun | 100.00% | Everyone | X
Create test specifications (general) | 6-Jun | 12-Jun | 100.00% | Everyone | X
Implement test cases (general) | 6-Jun | 12-Jun | 100.00% | Everyone | X
Review simToken authentication | 6-Jun | 12-Jun | 100.00% | Joey & Jeroen | X
Update the Facebook application to use the changed API | 6-Jun | 12-Jun | 100.00% | Everyone | X
Update the Facebook application to use Facebook data | 6-Jun | 12-Jun | 100.00% | Everyone | X
progress report | 6-Jun | 12-Jun | 100.00% | Everyone | X
Prepare for the meeting with Zef | 6-Jun | 12-Jun | 100.00% | Everyone | X
Visit Zef to discuss the use of MOBL in our project | 6-Jun | 12-Jun | 100.00% | Joey | X
Send code to SIG | 6-Jun | 12-Jun | 100.00% | Jeroen | X

Sprint #8 | 13-Jun | 19-Jun | 100.00% | Everyone | X
Deadline #1 send in code to SIG | 14-Jun | 14-Jun | 100.00% | Everyone | X
Visit Zef to discuss the use of MOBL in our project | 14-Jun | 14-Jun | 100.00% | Jeroen & Joey | X
Finish ADEL | 14-Jun | 17-Jun | 100.00% | Everyone | X
Finish F22 Top Interests | 14-Jun | 17-Jun | 100.00% | Volker & Joris | X
Implement all database mockups | 14-Jun | 17-Jun | 100.00% | Everyone | X

Sprint #9 | 20-Jun | 26-Jun | 100.00% | Everyone | X
Meeting with TU supervisor and client | 22-Jun | 22-Jun | 100.00% | Volker & Joris | X
Set up MOBL environment | 20-Jun | 26-Jun | 100.00% | Joey | X
Orientation research on chat protocols | 20-Jun | 26-Jun | 100.00% | Volker & Joris | X
Set up Algorithm platform | 20-Jun | 26-Jun | 100.00% | Jeroen | X
Set up new development server | 20-Jun | 26-Jun | 100.00% | Jeroen | X
Chat | 20-Jun | 26-Jun | 100.00% | Volker & Joris | X
Offline messaging | 20-Jun | 26-Jun | 100.00% | Volker & Joris | X
One-To-One Matching | 20-Jun | 26-Jun | 100.00% | Jeroen | X
Automatic Matching | 20-Jun | 26-Jun | 100.00% | Jeroen | X
Progress Report | 20-Jun | 26-Jun | 100.00% | Everyone | X

Sprint #10 | 27-Jun | 3-Jul | 97.14% | Everyone | due
Chat integration (F4) | 27-Jun | 3-Jul | 100.00% | Joris & Volker | X
F4.3 Chat: interest sharing | 27-Jun | 3-Jul | 100.00% | Joris | X
Algorithm platform integration | 29-Jun | 29-Jun | 100.00% | Jeroen | X
F5 Private messaging | 28-Jun | 1-Jul | 100.00% | Volker | X
O1 Mobile App | 29-Jun | 3-Jul | 60.00% | Joey | due
Prepare demonstration for Peter | 2-Jul | 3-Jul | 100.00% | Joris | X
Make a room reservation for the presentation | 27-Jun | 27-Jun | 100.00% | Jeroen | X
Sprint report | 2-Jul | 3-Jul | 100.00% | Everyone | X
F1 Matching users | 29-Jun | 30-Jun | 100.00% | Jeroen | X
F2 Matching one-on-one | 30-Jun | 1-Jul | 100.00% | Volker | X
F3 Search for users | 29-Jun | 30-Jun | 100.00% | Jeroen | X
Install Sonar | 2-Jul | 3-Jul | 100.00% | Jeroen | X
host API | 30-Jun | 30-Jun | 100.00% | Volker & Jeroen | X
host GUI | 30-Jun | 30-Jun | 100.00% | Volker | X

Sprint #11 | 4-Jul | 10-Jul | 88.89% | Everyone | due
Give demonstration to Peter | 6-Jul | 6-Jul | 100.00% | Joris | X
Deadline #2 send in code to SIG | 8-Jul | 8-Jul | 100.00% | Everyone | X
O4 Activity monitoring | 4-Jul | 4-Jul | 0.00% | Not assigned | due
F4.2 Chat: photo sharing | 4-Jul | 10-Jul | 100.00% | Joris | X
F4.4 Chat: places sharing | 4-Jul | 10-Jul | 100.00% | Joris | X
Final report | 4-Jul | 10-Jul | 100.00% | Everyone | X
Functional design | 4-Jul | 10-Jul | 100.00% | Joris | X
Technical design | 4-Jul | 10-Jul | 100.00% | Everyone | X
Alternative assignment SIG | 4-Jul | 10-Jul | 100.00% | Volker | X
Website www.simlike.com | 4-Jul | 10-Jul | 100.00% | Jeroen | X
Finalise Facebook application | 4-Jul | 6-Jul | 99.00% | Everyone | due
Finalise Mobile application | 4-Jul | 6-Jul | 60.00% | Joey | due

Final phase | 11-Jul | 15-Jul | 0.00% | Everyone | X
Feedback final report | 11-Jul | 15-Jul | 0.00% | Everyone | X
Prepare presentation | 11-Jul | 13-Jul | 0.00% | Jeroen | X
Presentation | 15-Jul | 15-Jul | 0.00% | Volker | X
Bug fixing | 11-Jul | 14-Jul | 0.00% | Everyone | X
Increase test coverage | 11-Jul | 14-Jul | 0.00% | Everyone | X

Plan of approach

Bachelor internship commissioned by Nerval Limited

26 April, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This project is carried out as part of the bachelor project of the Computer Science programme at the Delft University of Technology. The client is an internet startup that focuses on social media. Within the company a mix of skills is available, including sales, marketing, design and engineering. Technical staff are still under-represented, which is where this internship assignment comes in.

This plan of approach is intended for the client and the project team. It can be used to establish the expectations of the client and the project team, as well as to determine the approach for the development process.

Table of contents

Summary
Introduction
Explanation of the structure of the plan
Project description
Approach and time planning
Requirements analysis
Functional design
Technical design
Quality assurance
Implementation
Deliverables
Reports
End products
Project setup
Organisation
Requirements for project members
Administrative procedures
Financing
Reporting
Resources
People
Quality assurance
Client and project team
Development tools
Avoiding risks

Summary

This is the plan of approach for a bachelor project of the Computer Science programme at the Delft University of Technology. The client for the project is Nerval Limited, an internet startup. The client has devised a new concept for a social media platform, but has too few technical staff to realise this idea. Therefore, during this project, which lasts 3 months, a start will be made on this platform.

This plan of approach covers the requirements and expectations of the client and the project team. The product will be developed according to the SCRUM method. This means that every week a few features will be implemented and a fully working and tested intermediate product will be delivered. This way the requirements can still be adjusted during the project.

The people involved are:
1. Nathan Navarro - representative of the client.
2. Peter van Nieuwenhuizen - project supervisor on behalf of the Delft University of Technology.
3. Paris Hidden - designer on behalf of the client.
4. Jeroen Dijkhuizen - Computer Science student.
5. Joey Ezechiëls - Computer Science student.
6. Joris Albeda - Computer Science student.
7. Volker Lanting - Computer Science student.

The end product is a social media platform where users can find new friends based on similarity in interests, location and age. The requirements for the end product are as follows:
1. A solid foundation for the future further development of the web application.
2. A first, working implementation in the form of a Facebook application.
3. Quality requirements must be drawn up.
4. The product must be user-friendly.
5. The product must be maintainable and extensible in the future.
6. The product must be scalable and thus suitable for large numbers of users.
7. If feasible within three months, there must also be an implementation on mobile devices.

De volgende producten zullen geleverd worden: 1. Orientatieverslag 2. Plan van Aanpak 3. Tussenverslag per SCRUM “sprint” 4. Eindverslag 5. Technische handleiding 6. Presentatie 7. Website 8. Facebook Applicatie 9. Testomgeving

Dit document beschrijft het contract tussen opdrachtgever en projectleden. De volgende zaken worden van de opdrachtgever verwacht: 1. Aanwezigheid bij elke wekelijks SCRUM meeting om het tussenproduct te valideren. 2. Aanwezigheid bij de requirements analyse

4

3. Aanwezigheid bij discussies over technieken en hun voordelen, nadelen en kosten. 4. Beschikbaarheid van een kantoor ruimte waar ongestoord kan worden gewerkt, met stroom en internetaansluiting voor tenminste 4 computers.

De volgende zaken worden van de projectleden verwacht: 1. Het leveren van een product dat voldoet aan de gestelde eisen. 2. Het leveren van alle afgesproken producten. 3. Werken aan het project van 9 tot 5 en op donderdag van 10 tot 6. 4. Meebrengen van een eigen laptop. 5. Kennis van de gebruikte libraries, tools en programmeertalen. 6. Installatie van benodigde software op de eigen laptop. 7. Onderzoek naar veiligheid, beschikbare libraries en ontwikkeltools. 8. Het gebruik van best practices, conventies en ontwikkeltools. 9. Aanwezigheid bij wekelijkse SCRUM meetings. 10. Aanwezigheid bij dagelijkse voortgangsbespreking.

Om de kwaliteit te waarborgen zullen de volgende ontwikkeltools worden gebruikt: 1. Buildbot 2. Mantis 3. Git 4. Github

Introduction

This plan of approach describes how this internship assignment is carried out for the client. It serves as a contract between the project team and the client. Additions and changes can be incorporated into the plan of approach in the course of the project.

The plan of approach is drawn up by the project team and then approved by both the client and the supervisor from TU Delft. Expectations may be adjusted in the course of the project. Such a change to the contract must be recorded in the plan of approach, after which it is again approved by the client and the TU Delft supervisor. A flexible approach has been chosen deliberately, since in practice it is not unusual for requirements to need adjusting, for example when extra resources are needed or when the planning is at risk.

Explanation of the structure of the plan

The section Project description explains the product to be developed and formulates the requirements it must meet. This section can be used to form a picture of what the end product will look like.

Next, the section Approach and time planning discusses the development method used and gives a global schedule. This section can be used to consult the schedule and to determine the progress and course of the development process.

The section Deliverables specifies the reports and products to be produced. This section can be used for planning and gives a clear overview of the workload.

The section Project organisation specifies the contract between the project members and the client. The required resources and the people involved are also listed here. This section can be used to find out what is expected of the various actors.

Finally, the section Quality assurance discusses how quality can be safeguarded during the project. This section can be used to check whether quality continues to be safeguarded.

Project description

The client wants to set up a social media platform, initially with a Facebook application. The intention is that Facebook users make their profile data available to the social platform. Based on the available data, users will be introduced to each other. The idea is to predict the likely success of a friendship based on shared interests, education, work, games, etc. The geographical location and age of the people involved also play a role. The users do not need to know each other yet. The aim is to meet new, like-minded friends via Facebook, but the approach can also be applied to existing friendships to see to what extent they are based on like-mindedness.

To introduce Facebook users to each other, an algorithm must be designed. Based on the available data, this algorithm will compute a “success score”. This score indicates how likely it is that a first contact will result in a friendship. The design must take into account that Facebook's user base runs to millions of users. In other words, the time and space complexity of the algorithm must be as low as possible. The algorithm must also be scalable; for this reason it is required to be parallelisable. An additional wish is that a searching user can specify preferences that influence the search result. It will have to be investigated to what extent this is feasible in combination with the aforementioned requirements.
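
As an illustration only, the following Python sketch shows one way such a success score could be computed. The `Profile` fields, the weights and the decay constants are assumptions made for this example and do not describe the algorithm that was eventually designed. Each pair is scored independently of all others, which is the property that makes the computation straightforward to parallelise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical subset of the profile data a user makes available."""
    name: str
    interests: frozenset
    age: int
    distance_km: float = 0.0  # assumed: distance to the searching user


def jaccard(a, b):
    """Fraction of shared items: |a & b| / |a | b|, or 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def success_score(user, other, w_interest=0.6, w_age=0.2, w_distance=0.2):
    """Weighted score in [0, 1]; weights and decay constants are illustrative."""
    interest = jaccard(user.interests, other.interests)
    age = 1.0 / (1.0 + abs(user.age - other.age) / 10.0)
    distance = 1.0 / (1.0 + other.distance_km / 50.0)
    return w_interest * interest + w_age * age + w_distance * distance


def rank_candidates(user, candidates):
    """Score every candidate independently and return the best matches first."""
    return sorted(candidates, key=lambda c: success_score(user, c), reverse=True)


user = Profile("A", frozenset({"chess", "jazz", "hiking"}), 24)
candidates = [
    Profile("B", frozenset({"football"}), 51, distance_km=300.0),
    Profile("C", frozenset({"chess", "jazz"}), 25, distance_km=5.0),
]
for match in rank_candidates(user, candidates):
    print(match.name, round(100 * success_score(user, match)), "%")
```

Because `success_score` depends only on the two profiles passed in, the candidate list can be split over several workers without coordination. Scoring every candidate for every searching user would still be expensive at the scale of millions of users, so a real implementation would probably need to narrow down the candidates first.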

Ultimately, the algorithm must be implemented as a user-friendly Facebook application. Users can then sign up with their Facebook profile. This makes the required data available, so that the application can suggest a set of potential friends. For each suggested friend, a percentage is shown that is based on how good the match is (on a certain scale), where this person is located geographically and the number of friends this person has. Users can then send each other messages to get to know each other and send each other a friend request. The end result must also meet the wishes of the end users. Among other things, aspects such as privacy will need to be considered: is it desirable for people's real names to appear in search results, should names be omitted, should names be changed, or can the user configure this themselves?

The goal of the project is to build a first working version of the product and put it online. The company must then be able to continue working on this first version. The client also wants a working mobile application with which the product can be used.

The project covers designing, implementing and testing the software. Attachment C contains the list of requested features, ordered by priority.

In addition, some conditions are placed on the product. The product must be user-friendly: it must be accessible to users with little computer experience. The product must also be adaptable, since the company wants to extend it later. No conditions have been set on the programming language or the tools, so a choice still has to be made between possible tools and development methods. Since the company has no fixed quality standards, these still have to be drawn up.

In summary, the requirements are as follows:
1. A solid foundation for the future development of the web application.
2. A first working implementation in the form of a Facebook application.
3. Quality requirements must be drawn up.
4. The product must be user-friendly.
5. The product must be maintainable and extensible in the future.
6. The product must be scalable and therefore suitable for large numbers of users.
7. If feasible within three months, there must also be an implementation for mobile devices.

Approach and time planning

This section covers the approach to the project. First, the development method to be applied is introduced. Then the different phases of the project are discussed: what has to be done in each and how long it takes.

Agile Development will be used; this is a development method that can absorb changes in requirements and is customer-oriented through continuous validation. Agile Development was chosen because the client is a start-up and social media change quickly. An assumption for this project is that the requirements can still change during the project.

The specific Agile Development method that will be applied is called SCRUM. In this method, a global design and a global schedule are made first. Then, each time, a few features are selected to be developed in a short period. Such a short period is called a “sprint”. At the end of the sprint there is a deliverable intermediate product that has been tested and can be validated. For that reason every sprint has a SCRUM meeting, where the client can validate the intermediate product and the project members can discuss progress. Every day there is a meeting for the project members, in which everyone states what they have done and what they are going to do. This takes place both at the start and at the end of the working day. These meetings last at most ten minutes.

The team members can determine for themselves how long a sprint lasts. There are a number of guidelines for this. A sprint should not last too long, because it then starts to resemble the waterfall model too much. Normally, a sprint for a team of seven to eight members lasts between two and four weeks. We choose to limit these periods to one or at most two weeks, for the following reasons:
● At the end of a sprint there is an intermediate product. These moments are used to check with the client whether the right things have been done.
● One-week periods are easy to schedule and provide a good rhythm.
● In this way the project requirements can remain somewhat flexible; if they change, this does not cause a large overrun.

The total period reserved for this project runs from 18 April 2011 to 15 July 2011. The first week serves as the orientation phase, the last week as the wrap-up phase. The remaining time is used as the development period. See Attachment B for a detailed planning schedule. The development phase consists of a few phases, which are briefly explained below.
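
The length of the development period follows from these dates. The short Python calculation below counts the calendar weeks in the project period and subtracts the orientation and wrap-up weeks; the variable names are chosen for this example only.

```python
from datetime import date

project_start = date(2011, 4, 18)  # first day of the project (a Monday)
project_end = date(2011, 7, 15)    # last day of the project (a Friday)

# The start is a Monday, so integer division by 7 counts the full weeks
# before the final week; adding 1 includes that final week as well.
total_weeks = (project_end - project_start).days // 7 + 1

# The first week is the orientation phase, the last week the wrap-up phase.
development_weeks = total_weeks - 2

print(total_weeks, development_weeks)  # prints "13 11"
```

Under these assumptions the project spans 13 calendar weeks, which leaves 11 weeks for the development period.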

Requirements analysis

In the development phase, the requirements will first be gathered and analysed. These requirements are re-evaluated and ordered by priority every sprint. A list of features is derived from the requirements. Each sprint, a subset of these features will be implemented.

Functional design

Once the requirements have been drawn up, the graphical user interface (GUI) is designed globally by means of mock-ups. Once per sprint, all features selected for that sprint are worked out in use cases, so that the developers have a clear picture of exactly what each feature must do. Ultimately, all these use cases and GUI mock-ups together form a functional design.

Technical design

Once a start has been made on the functional design, a technical design will be made. First the global architecture is determined; then, per sprint, the selected features are specified further. Where necessary, the global architecture can also be adjusted during the sprints.

Quality assurance

In establishing the quality requirements we make use of tools such as unit testing and coverage tools. Two working days are reserved for monitoring the quality requirements.

The work comprises:
1. Once per sprint: designing the test requirements for the features planned for that sprint.
2. Once per sprint: designing the test cases based on the test requirements.
3. Once per sprint: state diagrams. Every source code unit can be represented as a state diagram, and using these makes it possible to guarantee, to some extent, that the product works correctly. Although it would be desirable to design these diagrams for all source code units, this is too time-consuming, so they are only drawn up for the crucial parts.
4. Once per sprint: designing class diagrams for unit tests, but only for unit tests that do not have a flat structure at class level. This clarifies the relationships between unit tests for future maintenance, in the same way a class diagram does for non-test code.

Implementation

The implementation takes place in short sprints; the total time for development from nothing to the end product is estimated at nine working weeks.

Once per sprint, the planned features will be implemented, including test cases where applicable. At the end of each sprint a working product is delivered. Once all sprints have been completed, the product will therefore be finished.

Deliverables

A package of end products and a number of accompanying reports are delivered. What these will look like is discussed here.

Reports

To begin with, an orientation report is drawn up to become familiar with the requirements and with any existing techniques for the end product. In addition, this plan of approach is important for recording the agreements between the client and the project team. A final report will also be produced. The final report will cover the design of the product as well as the reasons for particular design choices. Aspects included in it are: minutes of conversations with actors and experts, the design, quality requirements, a description of the implementation and maintenance recommendations. A separate technical manual will be written for the client, describing how the product should be implemented and maintained.

Orientation report

Research is needed to produce the orientation report. This can be done by consulting sources on the internet and in the library. The orientation report presents the results of the research into existing methods and techniques. Based on these results, a choice will be made regarding the development methodologies to be applied. Just under a working week is set aside for doing the research and writing up the findings.

The work for this report comprises:
1. Finding out which development tools and methodologies could be used by the development team.
2. Finding out which solutions already exist and can be reused for the product to be developed.

Plan of approach

In drawing up this plan of approach, decisions have to be made based on the research from the orientation report. The requirements and expectations between the client and the project team must also be identified.

The work for this report comprises:
1. Determining which development methodologies will and will not be used, including the reasoning.
2. Agreeing on the features that will be implemented during the project.
3. Giving an indication of the total workload by means of a global schedule.

Interim reports

After each “sprint” (see the section Approach and time planning), a report will be made of what has been designed and implemented. It will also report on what has been learned from the previous report. The complete design and implementation will be included in a final report. These reports will be written in English, because the client intends to hire international staff. In this way, foreign staff can continue where the bachelor project ended.

All interim reports and the final report will contain the following parts:
● Requirements analysis: an overview of the requirements placed on the product, sorted by priority and feasibility.
● Functional design: a functional design of the requirements to be implemented.
● Technical design: a design describing how this functionality will be realised technically.
● Quality assurance: measures taken to deliver a high-quality product.
● Implementation: how the product can be rolled out in a production environment.
● Maintenance: how the product can be maintained.

Maintenance

This sets out how the implementation should be maintained. No further work is expected here, because such work initially falls under development rather than maintenance.

Technical manual

The project team will also provide a report containing a technical manual that describes how the product can be rolled out and maintained. This report must be written in English.

End products

In addition to the reports, a number of end products will be delivered.

Website

A minimal website will be made for simlike.com. The Facebook application is hosted on this web domain. The website serves as a business card and can redirect the user to the Facebook application in one step.

Facebook application

This is the product that this project is really about: the application that will bring users into contact with each other.

Mobile app

If feasible, a mobile application will be delivered with part of the functionality of the Facebook application.

Test environment

The test environment contains all the means the team has used to safeguard the quality of the product. The client wishes to keep using it after the project ends.

Final report

At the end of the project, all designs and design choices from the interim reports are aggregated into a single report. This final report will give a good overview of the course of the project and the product made.

Presentation

The project is concluded with a final presentation centred on the delivered product, the experience within the team and with the client, and lessons for the future.

Project organisation

This section discusses how the project is organised. It specifies the expectations of the project team and the client. The goal is to clarify the contract between the project team and the client. First we discuss the division of responsibilities within the project team, then the requirements placed on the project members. After that, the administrative procedures are covered.

Organisation

Within the project team, tasks are divided per sprint. There are, however, a few areas for which particular project members are responsible. The project team consists of the following members:

● Jeroen - responsible for the planning and for contact with the client and external advisers.
● Joris - responsible for human-machine interaction.
● Paris - responsible, on behalf of the client, for the design of the user interface.
● Joey - responsible for best practices and conventions.
● Volker - responsible for the workflow and for setting priorities.

Furthermore, the project members will work on the implementation individually or in changing teams.

Requirements for project members

All project members are expected to work on the project from 9 to 5 on working days. Thursday is the only exception: work then runs from 10 to 6. Half an hour is set aside for the lunch break. Project members responsible for the implementation are expected to familiarise themselves with the programming language, libraries, tools and algorithms used, so that they can use them effortlessly.

Administrative procedures

A number of administrative procedures will be used to monitor the project and keep it on track. To begin with, meetings will take place regularly:

1. Daily meetings with the team.
   a. In the morning: discuss who is going to do what.
   b. In the afternoon: discuss who has done what and whether any problems have occurred.
2. Two weekly meetings with the client, on Monday and Friday.
   a. Monday:
      i. Discuss the requirements and order them by priority.
      ii. Choose features for the sprint.
      iii. Discuss the features, so that use cases can be made.
   b. Friday:
      i. Discuss completed features.
      ii. Discuss uncompleted features and problems.
      iii. Validate the result of the sprint.
3. Fortnightly meetings with the TU Delft supervisor.
   a. Discuss interim reports.
   b. Discuss progress and problems.

In addition, the reports produced will be submitted to the client and to the TU Delft supervisor. Any feedback will follow, and once it has been sufficiently incorporated, the client and the TU Delft supervisor will give their approval.

Financing

The client will be involved in discussions about the possible technical resources and the funding they require. On that basis the client will make a budget available.

Reporting

During the weekly SCRUM meetings held for each sprint, progress will be reported and the current product validated. The presence of the client is required for this.

Resources

We will use personal laptops to develop the code, most likely in the Eclipse Integrated Development Environment (IDE). The code will be shared and maintained using Git. We take measures to make it easy to share the source code with each other. In addition, we want our product to be built and unit-tested automatically. For this reason we use a Github account that serves as the central access point to the source code.

We will need multiple servers to host the web application, for example via the “Cloud”. This lets us distribute the load and guarantee better reliability and availability. The database containing all data will be distributed for reliability, availability and scalability. Servers will have to be rented and configured for this as well.

For the mobile app we use Titanium Appcelerator. This development tool enables us to design a mobile version of the product in a single programming language. We can then release this version on multiple mobile devices. This improves the maintainability of the mobile version of the product.

The workplaces must have power and an internet connection for all laptops. It is also important that the room can be closed off, so that confidential information can be discussed safely. From Monday to Wednesday we will work at the Drebbelweg location of TU Delft, in lab room DW0.010. On Thursday and Friday we will work at the 's-Gravenzandseweg in Naaldwijk.

People

Besides the project team, several others are involved in this project. An overview follows:

Nathan Navarro is the client's contact person and will, on behalf of the client, oversee the progress of the project as well as the finances, the validation of the product and the provision of a group of test users.

Paris Hidden is, on behalf of the client, closely involved in the design of the interface and the functional design.

Peter van Nieuwenhuizen will supervise the project on behalf of TU Delft. He will give feedback on all reports produced and will oversee the progress of the project.

Pascal Wiggers is available for advice on the use of machine learning methods in the project.

Cees Witteveen is available for advice on the complexity and parallelisation of possible algorithms.

The Software Improvement Group will check the code twice for properties such as maintainability and conformance to conventions.

Quality assurance

To safeguard the quality of the product, it is important that both the project team and the client pay attention to it. First we discuss how the client and the project team can raise quality. Then we cover the choice of development tools that will be used to safeguard quality. Finally, we discuss the activities that will be undertaken to avoid known risks.

Client and project team

To safeguard quality, it is important to know how the various actors can influence it. These aspects are covered in this section.

The project members can influence and raise quality in the following ways:
● Research into security - This allows security to be taken into account from the design stage, so that the product is ultimately easier to secure.
● Research into alternative techniques - When alternative techniques are investigated, better trade-offs can be made between price, quality and maintainability.
● Best practices and conventions - Adhering to conventions makes the code easier to understand and maintain, and reduces the chance of errors.
● Using existing libraries - This saves time and results in fewer errors than the alternative, provided the libraries are well tested.
● Setting up a development environment - Setting up a development environment with the right tools simplifies programming and thereby reduces the chance of errors.
● Self-regulation (reading each other's code) - Since people make mistakes, it is good for team members to peer-review each other's work.

The client can raise quality in the following ways:
● Stating requirements and expectations clearly - The more clearly the requirements and expectations come across, the easier it is to make the product match them.
● Evaluating progress - Keeping an eye on progress reduces the chance of overruns and rushed work.
● Validating intermediate products - Validation checks whether the product matches the client's wishes. If this happens regularly, the end product will match the client's wishes more closely.
● Close involvement in the functional design phase - In the functional design phase, the functions of the product are designed. Doing this together with the client makes the design and the idea of the end product match the client's wishes better.

Finally, every sprint includes an interim validation. Once per SCRUM meeting we will check with the client whether the product meets the stated requirements, or is at least heading in that direction. At the end of the project we schedule one more week for the final validation.

Development tools

To improve communication, product quality and the workflow, we make use of development tools. A number of development tools are applicable to the development of Simlike:

Buildbot

This lets us continuously monitor the quality of the development, by automatically building and testing the software. Buildbot is a somewhat complex build server when it comes to configuration, but it offers a lot of flexibility in return. Buildbot is useful for compiled languages; however, scripting languages can also benefit from Buildbot through automated unit testing.

Mantis

The purpose of Mantis is to keep track of which bugs are currently known in the project and how far along their resolution is. It also lets everyone see each other's progress and tasks. Mantis is a web-based bug tracker with which both bugs and newly planned features can be tracked. It is simple and clear in daily use, and also relatively easy to set up and manage. It supports Github integration and can even be configured as a somewhat more generic planning tool.

Git

Git reduces the dependence on team members and on central resources, such as an internet connection, without compromising collaboration. Git is a distributed version control system with a fairly strong emphasis on speed. What makes Git technically different from, for example, SVN is its distributed nature (meaning that a copy and an original are on an equal footing and have no hierarchical relationship). Another difference is that every “working directory” in Git has the complete revision history of the source code at its disposal and therefore does not depend on a network connection to do its work. If the internet connection fails, each team member can therefore simply keep working. Commits are made locally first, after which the local source repository can be synchronised with another repository, which may or may not be on the same machine.

Github

Github is a website where Git repositories can be stored and maintained. The strength of Github is twofold:
● Because the data is stored in Github's server park, they take care of backups, which means we no longer have to do that ourselves.
● Github makes collaboration easier by visualising certain data, such as the revision tree of commits in a source tree.

Avoiding risks

Lastly, the measures that both parties will take to prevent known risks are described:
● Planning poker - Planning poker is a means of sharpening the planning and giving everyone involved insight into the difficulty of particular parts. Using it makes the planning more realistic and ensures it is known to all team members.
● Research into security - For web applications that store a lot of personal data, security is of great importance. By researching security in advance, we avoid known security problems as much as possible.
● Daily meetings - These keep every project member involved in the project as a whole and prevent everyone from working in isolation.
● Pair programming for crucial or difficult parts - Difficult and crucial parts require more attention; by having two project members work on them, we prevent one person from getting stuck and at the same time carry out an extra check.
● Regular validation of intermediate products - By validating the intermediate product with the client every week, we prevent the development from drifting away from the client's wishes.

Requirements Analysis

Simlike platform

25 April, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the requirements analysis of the Simlike platform. This document was written as part of a bachelor project at TU Delft, for Nerval Limited. More details about the Simlike platform and the authors can be found in [001]. This document will be used as a backlog containing all requirements of the requested product. It can be changed and updated during the project.

Table of Contents

Introduction
Functional requirements
F1 Matching users
F2 Matching one-on-one
F3 Search for users
F4 Chat
F4.1 Chat: Invite users
F4.2 Chat: Photo sharing
F4.3 Chat: Interest sharing
F4.4 Chat: Places sharing
F4.5 Chat: Offline chat
F4.6 Chat: enable/disable
F4.7 Chat: ignore function
F4.8 Chat: status message
F5 Private messaging
F8 Application Program Interface (API)
F10 Public wall
F22 Top interests
Other requirements
O1 Mobile app
O2 Security
O4 Activity monitoring
O5 Usability
O6 Responsive
O7 Reliable
O8 Scalable
O9 Chat: multiplicity
O10 Expandable
Priorities
Must have
Should have
Could have
Won’t have
References

Introduction

Before the product can be implemented, it is important to know what is required of it. In this document, the requirements of the product are identified and ordered by priority. Functional requirements, which represent features of the product, are listed separately from non-functional requirements. Differentiating between functional and other requirements gives a clear overview of all required features. This list of features can be reused later in the development process.

Functional requirements

Here the functional requirements of the project are listed.

F1 Matching users This functionality encompasses the matching algorithm that can match users, based on characteristics that are distinctive for a friendship.

F2 Matching one-on-one Given two users, what do they have in common? (i.e. what is their “success score”?)

F3 Search for users This functionality enables the user to search through the matches that have been generated.

F4 Chat Users can communicate using a chat. This chat must have these features:

F4.1 Chat: Invite users Invite other users into the conversation.

F4.2 Chat: Photo sharing Users can share photos while having a conversation (e.g. photos from their Facebook albums). These photos can be scrolled through, so multiple photos can be shared during the chat.

F4.3 Chat: Interest sharing Users must be able to share their interests visually (e.g. with a photo of the cover of their favourite book).

F4.4 Chat: Places sharing Users can share places (e.g. a Google streetview location of their favourite restaurant).

F4.5 Chat: Offline chat Users must be able to send chat messages even if the other user is offline.

F4.6 Chat: enable/disable Ability to enable or disable the chat.

F4.7 Chat: ignore function Ability to ignore certain users.

F4.8 Chat: status message A small message that indicates what the user is currently doing.

F5 Private messaging Users can send each other private messages. The difference with the chat functionality is that the recipient does not have to be online.

F6 DECLARED CLASSIFIED BY NERVAL LIMITED F7 DECLARED CLASSIFIED BY NERVAL LIMITED

F8 Application Program Interface (API) An API is required to integrate Simlike functionality on external websites.

F9 DECLARED CLASSIFIED BY NERVAL LIMITED

F10 Public wall A public message board where users can post messages for everyone to read.

F11 DECLARED CLASSIFIED BY NERVAL LIMITED F12 DECLARED CLASSIFIED BY NERVAL LIMITED F13 DECLARED CLASSIFIED BY NERVAL LIMITED F14 DECLARED CLASSIFIED BY NERVAL LIMITED F15 DECLARED CLASSIFIED BY NERVAL LIMITED F16 DECLARED CLASSIFIED BY NERVAL LIMITED F17 DECLARED CLASSIFIED BY NERVAL LIMITED F18 DECLARED CLASSIFIED BY NERVAL LIMITED F19 DECLARED CLASSIFIED BY NERVAL LIMITED F20 DECLARED CLASSIFIED BY NERVAL LIMITED F21 DECLARED CLASSIFIED BY NERVAL LIMITED

F22 Top interests A user must be able to specify five interests as his top five interests. These interests get higher priority when matching with other users and can be used as a simple statement to others about what the user likes.

Other requirements

Requirements that are not functional requirements include performance requirements, complementary products and technical requirements. These non-functional requirements are listed in this section.

O1 Mobile app A mobile app on which users can search for matches based on their geographic location. The mobile app must at least offer a possibility to initiate contact, and bring geographically close users together in real life.

O2 Security Users will be sharing personal information from their Facebook profile in order for Simlike to match users. This information needs to be stored and handled in a secure way.

O3 DECLARED CLASSIFIED BY NERVAL LIMITED

O4 Activity monitoring Users’ activity has to be monitored so that Simlike users can see who is online and who is not.

O5 Usability Simlike must be easy to use for all users, including inexperienced users.

O6 Responsive Simlike must respond quickly enough to offer a real-time experience.

O7 Reliable Social media must always be accessible, especially when operating on a global scale. Therefore Simlike must be reliable and some redundancy is needed to prevent data loss or loss of service.

O8 Scalable Simlike aims to achieve a user base of millions of users. This means the product should be easily scalable.

O9 Chat: multiplicity Since users should be able to invite others into the chat, it is important that the chat function can support around 50 people.

O10 Expandable The Simlike platform has to be easily expandable, so new features can be added without much trouble.
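
As a rough illustration of how requirement F4.1 (inviting users) and requirement O9 (around 50 participants) could interact, the following minimal Python sketch caps a conversation at a fixed size. The class name `ChatRoom`, the constant `MAX_PARTICIPANTS` and the choice to treat 50 as a hard limit are assumptions made for illustration; they are not taken from the Simlike implementation.

```python
MAX_PARTICIPANTS = 50  # assumed hard limit, based on "around 50 people" in O9


class ChatRoom:
    """Hypothetical chat conversation that users can be invited into."""

    def __init__(self, creator):
        self.participants = {creator}

    def invite(self, user):
        """Add a user to the conversation; return False if the room is full."""
        if user in self.participants:
            return True  # already present, nothing to do
        if len(self.participants) >= MAX_PARTICIPANTS:
            return False
        self.participants.add(user)
        return True


room = ChatRoom("alice")
results = [room.invite(f"user{i}") for i in range(60)]
```

In this sketch, invitations beyond the limit are simply refused. A real implementation might instead queue them or notify the inviting user.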

Priorities

The requirements are prioritized in accordance with the MoSCoW method. This method is intended for reaching a consensus between stakeholders [001]. MoSCoW embodies the following categories in which requirements are classified [001]:
● M: Must have this.
● S: Should have this, if at all possible.
● C: Could have this, if it does not affect anything else.
● W: Won’t have this time, but would like to have in the future.

The requirements are classified as follows:

Must have
● F1 Matching users
● F2 Matching one-on-one
● F3 Search for users
● F4 Chat functionality
● F4.2 Chat: photo sharing
● F4.3 Chat: interest sharing
● F4.4 Chat: places sharing
● F5 Private messaging
● F22 Top interests
● O1 Mobile App
● O2 Security
● O4 Activity monitoring
● O5 Usability
● O6 Responsive
● O7 Reliable
● O8 Scalable
● O9 Chat: multiplicity
● O10 Expandable

Should have
● F8 API
● F4.1 Chat: invite users
● F4.5 Chat: offline chat
● F4.6 Chat: enable/disable
● F4.7 Chat: ignore function
● F4.8 Chat: status message

Could have
● F10 Public wall

Won’t have DECLARED CLASSIFIED BY NERVAL LIMITED
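
The classification above could be kept in machine-readable form so that the backlog can be sorted by priority. The sketch below is only an illustration: the dictionary layout and the helper `by_priority` are hypothetical, and only a handful of the requirements listed above are included.

```python
# MoSCoW categories in descending order of priority.
MOSCOW_ORDER = ["Must", "Should", "Could", "Won't"]

# A small excerpt of the backlog; names and categories are copied from the lists above.
backlog = {
    "F10 Public wall": "Could",
    "F1 Matching users": "Must",
    "F8 API": "Should",
    "F4.1 Chat: invite users": "Should",
    "O2 Security": "Must",
}


def by_priority(items):
    """Return requirement names ordered by MoSCoW category, then by name."""
    return sorted(items, key=lambda name: (MOSCOW_ORDER.index(items[name]), name))


ordered = by_priority(backlog)
```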

References

[001] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”

Functional Design Document

Simlike platform

April 29, 2011, Delft

Joris Albeda Jeroen Dijkhuizen Joey Ezechiëls Volker Lanting

Preface

This document is part of a series of documents produced as part of the development of the Simlike platform. In particular it serves as an instrument in communication between developers and the Simlike management. Readers that are mostly interested in the use cases can find them in chapter Use cases.

Table of contents

Summary
Introduction
Facebook Application
    Use cases
        Must haves
        Should Haves
    GUI mockups
Mobile Application
    Use Cases
        Auto matching
        Buddy List
        Location
        Messaging
        Top simlikes
    GUI Mockups
References

Summary

This document describes the functionality of the system that will be developed. A description of the requirements specified by the client is given, as well as some Graphical User Interface mock-ups to give a first impression of the system. The central functionality of the system is the matching function, which matches two users for friendship. Other required functions are Search, Chat and Messaging. The client has specified more functions, some of which will be implemented if possible.

A description of how the user will access these functions is provided in the report in the form of use cases, as well as several pictures of the graphical user interface. The user interface will consist of a section with some information about the user, a menu and a main section showing the current active functionality.

There are two applications to be designed: a Facebook application and a mobile application.

Introduction

This Functional Design Document (FDD) is intended primarily as an account of the various functional features the team agreed to include in the project with the stakeholders, the feature priorities and any relevant contexts.

The features are both prioritized and categorized in the section Priorities, and the features themselves are presented as complete use cases. Internally, however, the team uses only user stories when designing features. User stories are concise descriptions of features which the user may directly interact with by means of a user interface. This functional design process is intended to make the design mesh better with the development methodology used by the team, i.e. SCRUM.

In every SCRUM sprint, several user stories are chosen from the backlog (a pool of user stories that can be chosen for design and implementation). The chosen user stories are then used to discuss the associated features and reach consensus about their design and implementation. This discussion is verbal and is crucial to the SCRUM process. The notes from this meeting are used afterwards to construct proper use cases.

The SCRUM method has been chosen over the traditional waterfall model, where the functional design would have to be finished before the team would be able to start on the technical design. The advantage of SCRUM is that its inherent flexibility is maintained since user stories are only properly designed and implemented after they’ve been elected for a certain SCRUM sprint.

Facebook Application

The Facebook application is a small website which is integrated in Facebook’s social media platform. The first main version of Simlike is the Facebook application. Users will get to know Simlike for the first time via the Facebook application. It integrates the user’s Facebook profile into the Simlike services. Its functionality is described below.

Use cases This chapter lists the use cases, based on the functional requirements specified by the client [REQ]. Note that use cases for the non-functional requirements are not listed: functional requirements are the requirements that represent features of the product. For an overview of non-functional requirements, see the Requirements Analysis [REQ].

Must haves This section describes the use cases for the features that must be implemented:

1. Matching The system can match users, based on characteristics that are distinctive for a friendship. The use cases here describe how the user can start this functionality, and what kind of actions the user can take.

Use case F1.1 Summary: The user chooses one of the possible matches. Situation: The user is logged on. Step 1: The system calculates the matches and shows the user the main screen. On the right-hand side of the main section, there is a list of the best matches. Step 2: The user can scroll through this list. He selects a match that appears interesting and clicks that match’s name. Step 3: The system shows the match’s profile. Result: The user can view the match’s information and click one of the buttons for some extra actions regarding his choice.

Use Case F1.2 Summary: The user chooses one of the possible matches from a different screen. Situation: The user is in any screen. Step 1: The user selects “Auto matching” from the menu. Step 2: The system switches to the auto matching screen and displays the user’s best matches. Step 3: The user can scroll through this list. He selects a match that appears interesting and clicks that match’s name. Step 4: The system shows the match’s profile. Result: The user can view the match’s information and click one of the buttons for some extra actions regarding his choice.

Use Case F1.3 Summary: The user sends a message to one of the possible matches. Situation: The user has chosen a match. Step 1: The user presses the "send mail" button. Step 2: The system opens the screen for making a new post. Step 3: The user types his message and presses "Send". Step 4: The system sends the message and goes to the user’s Sent messages screen.

Use Case F1.4 Summary: The user opens a chat with one of the potential matches. Precondition: The match in question is online. Situation: The user is viewing a profile. Step 1: The user presses the "Chat" button. Step 2: The system opens a chat window with that person.

2. Search The user can search for users based on interests. The results will be matched with the user and sorted by the number of common interests.

Use Case F2.1 Summary: The user searches for a co-user based on specific interests. Situation: The user can be in any screen. Step 1: The user clicks in the search bar. Step 2: The user types the interests and presses enter. Step 3: The system switches to the search screen and displays a list of the results. Step 4: The user can scroll through this list. He selects a result that interests him. Result: The user can view his choice’s profile, and click one of the buttons for some extra actions regarding his choice.
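
The ordering of search results by number of common interests can be sketched as follows. This is a minimal illustration and not the Simlike matching algorithm. The function name `search_users`, the representation of interests as sets of strings and the sample data are all assumptions.

```python
def search_users(searcher_interests, query_interests, profiles):
    """Find users with at least one queried interest, best overlap with the searcher first.

    profiles maps a user name to that user's set of interests.
    """
    hits = [
        (name, len(interests & searcher_interests))
        for name, interests in profiles.items()
        if interests & query_interests
    ]
    # Sort by number of common interests (descending), then by name for a stable order.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


profiles = {
    "bob": {"jazz", "chess", "hiking"},
    "carol": {"jazz", "sailing"},
    "dave": {"football"},
}
me = {"jazz", "chess", "hiking", "cooking"}
results = search_users(me, {"jazz"}, profiles)
```

Here the query selects the candidates, while the overlap with the searching user's own interests determines their order.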

3. Chat The previous two features concerned finding a user. The chat functionality enables a user to actually communicate with a user he has found. In a chat, two users can talk to each other and share their interests, photos or places.

Use Case F3.1 Summary: The user opens a chat with a co-user. Situation: The user is viewing the profile of the co-user. Step 1: The user presses the chat button. Step 2: The system switches to the Chat panel and opens a chat window with that person.

Use Case F3.2 Summary: The user returns from his chat to where he was before. Situation: The user is in the chat menu. Step 1: The user presses the back button. Step 2: The system saves the chat and goes back to the page it was on before opening the chat.

Use Case F3.3 Summary: The user wants to continue a saved chat. Situation: The user is on a page other than the chat menu. Step 1: The system shows all active conversations in the chat bar. When one of the chat partners says something, the corresponding entry in the chat bar flashes. Step 2: The user selects the conversation in the chat bar. Step 3: The system switches back to the chat menu and displays the conversation.

Use Case F3.4 Summary: The user shares one of his interests. (Photos and places are shared in the same way) Situation: The user is in the chat menu and has an active chat. Step 1: The user presses the “share” button. Step 2: The system opens a list of the user’s interests, photos and places. Step 3: The user drags an interest to the chat screen. Step 4: The system displays the interest on the user’s and the chat partner’s screen.

4. Messaging For two users to be able to chat, both of them have to be online. The messaging functionality enables a user to send a message to an offline user.

Use Case F4.1 Summary: The user reads a new message. Precondition: The message is new. Initial situation: The user is logged on. Step 1: The system shows the main screen. In the main section, there is a list of notifications, including any new messages. Step 2: The user selects a message. Step 3: The system switches to the message menu and displays the message.

Use Case F4.2 Summary: The user goes to the "messages" menu. Situation: The user is logged on. Step 1: The system shows the main screen. Step 2: The user selects "mail" from the menu. Step 3: The system opens the messages menu.

Use Case F4.3 Summary: The user reads a message. Situation: The user is in the messages menu. Step 1: The system displays a list of all messages. Step 2: The user can scroll through this list and view all of his messages.

Use Case F4.4 Summary: The user views his sent messages. Situation: The user is in the messages menu. Step 1: The user clicks the “Sent Messages” button. Step 2: The system displays a list of all messages the user has sent. Step 3: The user can scroll through this list and view all of his sent messages.

Use Case F4.5 Summary: The user deletes one of his messages. Situation: The user is in the messages menu, either in the inbox or the sent messages folder. Step 1: The user selects the checkbox next to the message he wants to delete. Step 2: The user presses the “delete selected messages” button. Step 3: The system removes the selected messages from the folder.
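
The messaging use cases above imply a per-user inbox and sent-messages folder that work independently of whether the recipient is online. The following Python sketch shows one possible shape for this. The names `Message`, `Mailbox`, `send` and `delete_selected` are hypothetical, and the in-memory dictionary merely stands in for real storage.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare messages by identity, so equal texts stay distinct
class Message:
    sender: str
    recipient: str
    body: str


@dataclass
class Mailbox:
    """Hypothetical per-user store with an inbox and a sent-messages folder."""
    inbox: list = field(default_factory=list)
    sent: list = field(default_factory=list)


mailboxes = {}


def send(sender, recipient, body):
    """Store the message for both parties; the recipient need not be online."""
    message = Message(sender, recipient, body)
    mailboxes.setdefault(sender, Mailbox()).sent.append(message)
    mailboxes.setdefault(recipient, Mailbox()).inbox.append(message)
    return message


def delete_selected(folder, selected):
    """Remove the messages whose checkboxes were ticked (use case F4.5)."""
    folder[:] = [m for m in folder if m not in selected]


first = send("alice", "bob", "Hi Bob!")
second = send("alice", "bob", "Are you there?")
delete_selected(mailboxes["bob"].inbox, [first])
```

Deleting a message from the recipient's inbox leaves the sender's copy in the sent folder untouched, which matches the per-folder deletion described in use case F4.5.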

5. Top Interests A user can specify five interests as his top five interests. These interests get higher priority when matching with other users and can be used as a simple statement to others about what the user likes.

Use Case F5.1 Summary: The user selects an interest as one of his top interests. Situation: The user is in any menu. The user’s top interests are displayed in the top of the screen. Step 1: The user hovers the mouse cursor above the top interests. Step 2: The system opens up a list of all the user’s interests. Step 3: The user drags the interest of his choice to one of the top interest slots. Step 4: The user moves the mouse cursor away from the interests. Step 5: The system saves the top interests as they are now.

Use Case F5.2 Summary: The user removes one of his top interests. Situation: The user is in any menu. Step 1: The user hovers the mouse cursor above the top interests. Step 2: The user drags the top interest of his choice away from the slot. Step 3: The system empties the slot. Step 4: The user moves the mouse cursor away from the interests. Step 5: The system saves the top interests as they are now.
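
The slot behaviour of use cases F5.1 and F5.2 can be modelled with a fixed list of five positions. The sketch below is an assumption about how such a model might look. The class `TopInterests` and its method names are hypothetical, and the rule that an interest occupies at most one slot is our own reading of the use cases.

```python
MAX_TOP_INTERESTS = 5  # the requirements allow five top interests


class TopInterests:
    """Hypothetical model of the five top-interest slots."""

    def __init__(self, all_interests):
        self.all_interests = set(all_interests)
        self.slots = [None] * MAX_TOP_INTERESTS

    def place(self, interest, slot):
        """Drag an interest into a slot (use case F5.1)."""
        if interest not in self.all_interests:
            raise ValueError(f"{interest!r} is not one of the user's interests")
        if interest in self.slots:
            # Moving an interest that is already a top interest frees its old slot.
            self.slots[self.slots.index(interest)] = None
        self.slots[slot] = interest

    def remove(self, slot):
        """Drag an interest away from its slot (use case F5.2)."""
        self.slots[slot] = None

    def saved(self):
        """Return the top interests to persist, skipping empty slots."""
        return [interest for interest in self.slots if interest is not None]


top = TopInterests({"jazz", "chess", "hiking"})
top.place("jazz", 0)
top.place("chess", 1)
top.place("jazz", 4)  # moves jazz from slot 0 to slot 4
top.remove(1)
```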

Should Haves This section contains use cases that were designed for the features that should be implemented in the project. Features which are not implemented can be used in the design for future work on the system.

6. Trust button The Trust button allows users to show that another user can be trusted. Each user has a “Trust” count that increments each time a different user presses his Trust button. It can be compared to Facebook’s Like button.

Use Case F6.1 Summary: The user indicates that a match can be trusted. Situation: The user is viewing the profile of the match. Step 1: The user presses the Trust button. Step 2: The system records that the match has a new Trust.

Use Case F6.2 Summary: The user withdraws his trust. Situation: The user has chosen a match. Precondition: The user has previously expressed his trust towards the match. Step 1: The system has replaced the Trust button with a message that the user trusts this match and a Distrust button. Step 2: The user presses the Distrust button. Step 3: The system records that the match has one Trust fewer.
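
Because the Trust count increments only when a different user presses the button, it is natural to store the set of trusting users rather than a bare counter. The following sketch illustrates this idea. `TrustRegistry` and its methods are hypothetical names, and the rule that users cannot trust themselves is an assumption that the use cases do not state.

```python
class TrustRegistry:
    """Hypothetical store of who trusts whom."""

    def __init__(self):
        self._trusted_by = {}  # user -> set of users who trust that user

    def trust(self, truster, trustee):
        """Record a Trust (use case F6.1); repeated presses do not count twice."""
        if truster == trustee:
            raise ValueError("users cannot trust themselves")  # assumption
        self._trusted_by.setdefault(trustee, set()).add(truster)

    def distrust(self, truster, trustee):
        """Withdraw a previously given Trust (use case F6.2)."""
        self._trusted_by.get(trustee, set()).discard(truster)

    def trusts(self, truster, trustee):
        """Tell the interface whether to show the Trust or the Distrust button."""
        return truster in self._trusted_by.get(trustee, set())

    def count(self, user):
        return len(self._trusted_by.get(user, set()))


registry = TrustRegistry()
registry.trust("alice", "carol")
registry.trust("bob", "carol")
registry.trust("alice", "carol")  # second press by the same user
registry.distrust("bob", "carol")
```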

7. Mood status Each user has a “mood bar”, which visualises the user’s mood with a colour.

Use Case F7.1 Summary: The user adjusts his mood status. Situation: The user is logged on. Step 1: The system shows the main screen. Next to the profile picture, the mood bar and the mood box for today are displayed. Step 2: The user presses the mood box. Step 3: The system shows a drop-down menu with different colours to choose from. Step 4: The user chooses a colour. Step 5: The system removes the menu, fills the mood box with the chosen colour and extends the mood bar with the new colour.

8. Chat function on and off Some users might want to use the services of Simlike without other users being able to start a chat with them. This functionality allows them to switch the chat off. Other users will not be able to start a chat with a user who has switched his chat off.

Use Case F8.1 Summary: The user switches his chat function on or off. Initial situation: The user is in the chat menu. Step 1: The system shows the chat menu, including a section with a message indicating if the user’s chat function is on or off, and a button "Disable Chat" or "Enable Chat". Step 2: The user presses the button. Step 3: The system saves the user’s choice.

9. Additional features in Chat This section describes a number of features to expand the chat functionality with. New features are blocking a specific user from chat and changing one’s status (away, online, busy, etc.).

Use Case F9.1 Summary: The user blocks a chat partner. Situation: The user is viewing an active chat window with a fellow user. Step 1: The user presses the "Block" button in the chat window. Step 2: The system asks for confirmation. Step 3: The user presses the "Yes" button. Step 4: The system displays a popup stating that the co-user is blocked. Step 5: The system closes the chat window. Result: The system has registered that the user has blocked the co-user. The co-user will be unable to see the user in the future.

Use Case F9.2 Summary: The user blocks a co-user. Situation: The user is viewing the profile of the co-user. Step 1: The user presses the "Block" button on the page. Step 2: The system asks for confirmation. Step 3: The user presses the "Yes" button. Step 4: The system displays a popup stating that the co-user is blocked. Step 5: The system remains on the co-user’s profile. Result: The system has registered that the user has blocked the co-user. The co-user will be unable to see the user in the future.

Use Case F9.3 Summary: The user unblocks a co-user. Situation: The user is viewing the profile of the co-user. Step 1: The user presses the "Unblock" button on the page. Step 2: The system asks for confirmation. Step 3: The user presses the "Yes" button. Step 4: The system displays a popup stating that the co-user is no longer blocked. Step 5: The system remains on the co-user’s profile. Result: The system has registered that the user is no longer blocking the co-user.

Use Case F9.4 Summary: The user changes his status. Situation: The user is in the Chat menu. Step 1: The user selects "status". Step 2: The system displays a drop-down menu with the different possible states. Step 3: The user chooses a state. Step 4: The system closes the menu. Result: The system has registered that the user has a new status.
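
The blocking behaviour of use cases F9.1 to F9.3 amounts to a one-directional visibility rule: a blocked co-user can no longer see the user who blocked him. A minimal sketch of that rule is given below. `BlockList`, `can_see` and `visible_matches` are hypothetical names, and the in-memory dictionary stands in for real storage.

```python
class BlockList:
    """Hypothetical record of which users have blocked which co-users."""

    def __init__(self):
        self._blocked = {}  # user -> set of co-users that user has blocked

    def block(self, user, co_user):
        self._blocked.setdefault(user, set()).add(co_user)

    def unblock(self, user, co_user):
        self._blocked.get(user, set()).discard(co_user)

    def can_see(self, viewer, target):
        """A viewer cannot see a target who has blocked him."""
        return viewer not in self._blocked.get(target, set())


def visible_matches(viewer, candidates, blocks):
    """Filter out the candidates who have blocked the viewer."""
    return [c for c in candidates if blocks.can_see(viewer, c)]


blocks = BlockList()
blocks.block("alice", "bob")  # alice blocks bob
before = visible_matches("bob", ["alice", "carol"], blocks)
blocks.unblock("alice", "bob")
after = visible_matches("bob", ["alice", "carol"], blocks)
```

Note that in this sketch the blocking user can still see the blocked co-user, which is consistent with use cases F9.2 and F9.3, where the system remains on the co-user's profile.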

GUI mockups This chapter contains a number of images designed to give an idea of what the user interface will look like. The final GUI has been based on these designs. Listed are a screenshot of the main screen, the chat function, two messaging mock-ups, and the top interests.

Mockup 1: The main screen

Mockup 2: Chat screen

Mockup 3: Message system - Inbox

Mockup 4: Viewing a message

Mockup 5: Editing one’s top interests

Mobile Application

The mobile application is made for smartphone users, who can log in with their Facebook account. The mobile application provides the following features:
● Auto matching: suggests other users who have the same interests as the user.
● Buddy List: a list of users whom the user has bookmarked.
● Location: determines the user’s physical location to offer better suggestions when matching.
● Messaging: the ability to send each other messages regardless of the recipient’s offline/online status.
● Top Interests: view and edit a user’s top interests.

Use Cases This section lists the use cases for the mobile application. These use cases serve as a validation of the implementation.

Auto matching Like the Facebook app, the mobile app supports auto matching. The functionality is the same.

Use Case M1.1 Summary: The user views his matches. Situation: The user is logged in. Step 1: The user presses the “Auto matching” button. Step 2: The system displays a list of the user’s best matches. Result: The user can scroll through the list of matches.

Buddy List The mobile app features a buddy list where users can insert their favourite co-users.

Use Case M2.1 Summary: The user views a buddy’s profile. Situation: The user is logged in. Step 1: The user presses the “Buddy list” button. Step 2: The system displays the user’s buddy list. Step 3: The user selects the user of his choice and clicks the “Profile” button. Step 4: The system displays additional information about the chosen user’s profile.

Use Case M2.2 Summary: The user adds a co-user to his buddy list. Situation: The user is viewing a co-user’s information. Step 1: The user presses the “Add to buddy List” button. Step 2: The system adds the co-user to the user’s buddy list.

Location The mobile app uses the smartphone’s GPS to determine the user’s location. The location can be viewed on an interface similar to Google Maps.

Use Case M3.1 Summary: The user views his own location. Situation: The user is logged in. Step 1: The user presses the “Map” button. Step 2: The system displays the map, and on it a marker indicating the user’s location. Result: The user can scroll around the map, viewing the area around his location.
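
To use GPS coordinates for finding geographically close users, some measure of distance between two positions is needed. The sketch below uses the haversine formula for the great-circle distance. This is our own choice for illustration, and the documents do not state which method Simlike uses. The function names, the 25 km radius and the sample coordinates are likewise assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(user_position, others, radius_km):
    """Return the names of users within radius_km of the user, closest first."""
    distances = [
        (distance_km(*user_position, *position), name)
        for name, position in others.items()
    ]
    return [name for dist, name in sorted(distances) if dist <= radius_km]


delft = (52.0116, 4.3571)
others = {
    "rotterdam_user": (51.9244, 4.4777),
    "london_user": (51.5074, -0.1278),
}
close = nearby(delft, others, radius_km=25)
```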

Messaging As with the Facebook app, the mobile app allows users to send offline messages to each other.

Use Case M4.1 Summary: The user reads a message. Situation: The user is logged in. Step 1: The user presses the “Messages” button. Step 2: The system displays a list of all messages. Step 3: The user can scroll through this list and view all of his messages.

Use Case M4.2 Summary: The user sends a message. Situation: The user is viewing a user’s information. Step 1: The user presses the "Mail" button. Step 2: The system opens the screen for typing a new message. Step 3: The user types his message and presses the "Send" button. Step 4: The system sends the message and returns to the previous menu.

Use Case M4.3 Summary: The user replies to a message with a new message of his own. Situation: The user is viewing a message. Step 1: The user presses the “reply” button. Step 2: The system adds an interface for typing a new message to the screen. Step 3: The user types his message and presses the "Send" button. Step 4: The system sends the message and returns to the message.

Top simlikes Users can edit their favourite five simlikes in the mobile app.

Use Case M5.1 Summary: The user edits his top interests. Situation: The user is logged in. Step 1: The user presses the “Edit top interests” button at the top of his screen. Step 2: The system displays a list of the user’s top interests with checkboxes. A marked checkbox indicates a top interest. Step 3: The user marks the checkboxes of the interests of his choice. Step 4: The user presses the “Save” button. Step 5: The system registers the changed top interests. Result: The user is returned to the main menu.

GUI Mockups As with the Facebook App, the Mobile App’s design is based on a number of mockups. Please note that some of the mockups contain links to functionality that is not planned for this project, such as the chat function. These are not implemented in the final product, and are not discussed in the documentation. However, they are included in the mockups to give an idea of what the design is aiming for.

Mockup 6: Matching

Mockup 7: Messaging

Mockup 8: Location

Mockup 9: Buddy List

Mockup 10: Top Interests

Mockup 11: Main menu

References

[REQ] Albeda, Dijkhuizen, Ezechiëls, Lanting, 2011, “Requirements analysis”.

Technical Design Document

Simlike platform

July 7, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface This document, the Technical Design Document, is part of the documents produced in the development of the Simlike platform. It contains all the technical details and decisions of the design. It can be used to review the decisions made while designing, as well as the questions that are yet unanswered. The document will mainly be used by the team, and by any who may wish to continue the project.

For more information about the product, called Simlike, see [PVA].

Table of contents

Summary
Introduction
Architecture
    Foundation
        Algorithm platform
        Database Platform
        Hardware platform
        Web platform
    Simlike
        Mobile application
        Facebook application
        Website
        API
        Java packages
        JavaScript SDK
Quality assurance
    Security
    Licenses
    Availability/Reliability
    Expandability/Maintainability
Implementation recommendations
References

Summary

This document describes the technical design of a social media platform called Simlike. Simlike is unique in the sense that it focuses on expanding a user’s social network with new connections. This is achieved by a unique matching algorithm, which matches users based on interests they have in common. This is in contrast with other popular social media, where the focus lies on maintaining current social relations.

The Simlike platform offers several services to its users: a Facebook application, a mobile application and a website. An application programming interface is set up as the core of these services. The application programming interface may also be offered to third parties in the future.

The infrastructure for the services is designed to be as scalable, reliable, secure and available as possible, while ensuring that more features can be added relatively easily.

At the core of the platform, several algorithms are at work which are of great importance to Simlike’s functionality. These algorithms operate on potentially huge amounts of data, which have to be stored and processed securely and efficiently.

Special measures have been put into place to safeguard the quality of Simlike’s services and the design of the products. These measures relate to security, licenses, availability, reliability and maintainability. These aspects matter because they play an important role in the user experience. Maintainability is mainly important for keeping development costs at a reasonable level.

Introduction

This document describes the architecture of the Simlike system as a whole that will be created for the client. It starts off with a global overview of the architecture in section Architecture and proceeds to discuss how the quality of the product will be assured in section Quality Assurance. The document finishes with some recommendations for the implementation of the product in section Implementation Recommendations.

The architecture consists of several layers. First the foundation layer is discussed in section Foundation. This layer contains various required elements for Simlike such as the hardware, databases and algorithms. Next the Simlike application layer is discussed in section Simlike. This layer contains the applications users will interface with and abstract services used by these applications.

The requirements and functional design of the product can be found at [REQ] and [FUNC] respectively.

Architecture

This chapter covers the architectural design choices for the Simlike platform. First, the foundation for the Simlike platform will be laid out. Secondly, the Simlike platform is explained. The architecture for the Simlike platform can be visualised as follows (figure 1):

Figure 1. Simlike platform

Starting in the top layer, there are four components which are interfaces to the user: the mobile application, the Facebook application, the website and background services. The mobile application is an interface to Simlike on mobile devices and is a tool which facilitates location-based functionality. The Facebook application will be the main interface to users on the web. The website will act as a signpost that directs visitors to the Facebook application. A fourth component, “background services”, is mainly responsible for integration with third party web services, such as updating user data from Facebook in the background.

All these components utilise the same interface to the underlying services: the Application Programming Interface (API). The API abstracts the entire foundation from the upper services.

Beneath Simlike lies a foundational layer containing four platforms which are not directly exposed to third parties. It contains components such as algorithms and data storage. The algorithms can work on data from the database. The results of the algorithms are made available to the applications in the top layer via the API. Note that there is only vertical communication. For example, if the mobile application wants to communicate with the Facebook app, it has to do so via functionality of the API.
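
The rule that communication is only vertical can be illustrated with a small sketch in which every top-layer component holds a reference to the API and to nothing else. All class and method names below are hypothetical and chosen only to mirror the layers described above. They do not reflect the actual Simlike API.

```python
class Foundation:
    """Stand-in for the foundation layer (algorithms and data storage)."""

    def __init__(self):
        self._messages = []

    def store_message(self, sender, recipient, body):
        self._messages.append((sender, recipient, body))

    def messages_for(self, recipient):
        return [m for m in self._messages if m[1] == recipient]


class Api:
    """The only interface the top-layer components are given."""

    def __init__(self, foundation):
        self._foundation = foundation

    def send_message(self, sender, recipient, body):
        self._foundation.store_message(sender, recipient, body)

    def inbox(self, user):
        return self._foundation.messages_for(user)


class MobileApp:
    def __init__(self, api):
        self.api = api  # no reference to the Facebook app or the foundation


class FacebookApp:
    def __init__(self, api):
        self.api = api


api = Api(Foundation())
mobile, facebook = MobileApp(api), FacebookApp(api)
mobile.api.send_message("alice", "bob", "sent from the phone")
received = facebook.api.inbox("bob")
```

A message sent from the mobile application thus reaches the Facebook application only through the shared API, never through a direct link between the two applications.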

Foundation

The foundation entails everything that provides core services to the rest of the Simlike platform, from mundane tasks such as data persistence to services such as matching Simlike users. The foundation consists of four platforms:
1. Algorithm platform: responsible for processing the data.
2. Database platform: responsible for storing the data.
3. Hardware platform: responsible for running the other platforms.
4. Web platform: responsible for making the services available to the outside world.

Algorithm platform

In this section the algorithms that are needed for Simlike’s features will be discussed. A brief overview will be given of the possible methods to solve the problems and the decisions made. These algorithms involve a concept called a simlike. This is a piece of information which identifies a user interest, which could be shared with other users.

DECLARED CLASSIFIED BY NERVAL LIMITED


Database Platform

The database platform is responsible for storing all the data. It must do this in a secure, scalable and reliable way. Furthermore, it must be able to serve the data in a reliable and responsive manner. The database of choice is Cassandra. This is the result of a study done by the development team in [SR].

Cassandra

Cassandra will be used as the storage platform. The “Cassandra database schema” is very different from traditional SQL tables. In fact, it does not even use tables. Cassandra is known as a “schema-less data store”. Let us look at the terminology first, aided by a few examples. These are the terms to get familiar with: Keyspace, Column Family, Row and Column. There is a hierarchical relation between these terms, which may help to understand their function. This relation is visualised in figure 6:

Figure 6. Cassandra data model abstracted hierarchy


Multiple Keyspaces may exist at the architect’s discretion. Typically there is one Keyspace per application. In this sense a Keyspace is analogous to a database in traditional Database Management Systems (DBMS), but be careful not to think too traditionally. Thinking in Columns and Column Families is quite a paradigm shift from traditional DBMS. A Keyspace may contain an unlimited number of Column Families. A Column Family is used to group relevant Columns and may contain an unlimited number of Columns. The Cassandra Wiki says the following about Column Families [AC1]:

“Each column family is stored in a separate file, and the file is sorted in row (i.e. key) major order. Related columns (those that you'll access together) should be kept within the same column family. The row key is what determines what machine data is stored on. Thus, for each key you can have data from multiple column families associated with it. However, these are logically distinct.”

The file is sorted in row major order, meaning rows are stored one after another (as opposed to column major order, where the columns are stored one after another) [WP1]. The file on disk that contains the rows is immutable [CDG]. This means a file is never modified in place once it has been written: new data is only ever written sequentially, and this is how Cassandra guarantees fast writes. Cassandra periodically merges and rebuilds the storage files in the background to optimise read performance.

There are two types of columns: a Column and a Super Column. A Super Column can contain other Columns, but not other Super Columns. This limits it to a two-layer column architecture. Likewise, the Column Families come in two flavours: Standard and Super. If a Column Family contains Columns, it is called a Standard Column Family. If a Column Family contains Super Columns, it is called a Super Column Family. A Column Family and Super Column Family can contain an unlimited number of Columns and Super Columns, respectively. Each Super Column in a Super Column Family behaves like a small Standard Column Family; a Super Column Family therefore cannot contain another Super Column Family. See figure 7. A Column which resides in a Super Column is often referred to as a Subcolumn, though this is not standard notation.

Figure 7. The two possible ways to use Column Families: Standard and Super.


The final aspect of the Cassandra data model is the Row. A Row is contained within a Column Family and has a row key. Furthermore, the Row contains values for (some) Columns. Cassandra indexes the data by storing the rows and (Super) Columns in a sorted manner. The rows are sorted by their keys. (Super) Columns are sorted by their names (and not by Column value!). Which name is chosen for a column is an important design decision, because this has implications for querying the data. For example, queries which return a time-sorted result benefit from naming the columns with a unique timestamp (e.g. a Time UUID) [FB1]. The big picture, visualised as a class diagram in UML, looks like this (figure 8):

Figure 8. Cassandra data model as a class diagram in UML.
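
The hierarchy described above can be mimicked with ordinary nested maps. The following is a minimal sketch in Python, not Cassandra's API: the class ColumnFamily, its methods and the sample row key “simid-1” are hypothetical names chosen for illustration. It only shows that columns within a row are kept sorted by name, which is what makes range queries on column names cheap.

```python
from bisect import insort


class ColumnFamily:
    """Toy model of a Standard Column Family: row key -> columns sorted by name."""

    def __init__(self):
        # row key -> list of (column name, value), kept sorted by column name
        self.rows = {}

    def insert(self, row_key, name, value):
        columns = self.rows.setdefault(row_key, [])
        # A column with the same name is overwritten, never duplicated.
        columns[:] = [(n, v) for n, v in columns if n != name]
        insort(columns, (name, value))

    def slice(self, row_key, start, finish):
        """Return the columns whose names lie in [start, finish], in name order."""
        return [(n, v) for n, v in self.rows.get(row_key, []) if start <= n <= finish]


keyspace = {"User": ColumnFamily()}  # a Keyspace groups Column Families
users = keyspace["User"]
users.insert("simid-1", "last_name", "Doe")    # hypothetical sample data
users.insert("simid-1", "first_name", "Jane")
users.insert("simid-1", "gender", "f")

names = [n for n, _ in users.rows["simid-1"]]
```

Although the columns were inserted in arbitrary order, names comes back sorted, and users.slice("simid-1", "f", "h") returns only the columns whose names fall in that range.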

As mentioned earlier, the data is stored in a sorted data structure. To be able to sort (Super) Columns the data has to be compared in some manner. (Super) Column names can be compared by a number of built-in comparators. It is also possible to design a custom comparator. The built-in comparators are:
● AsciiType, which works the same as BytesType, but ensures that column names can only be ASCII strings.
● BytesType, which compares the column names byte for byte (this is a lexical comparison).
● LexicalUUIDType, which works the same as BytesType, but ensures column names can only be 16-byte (128-bit) UUIDs.
● LongType, which sorts column names as 8-byte (64-bit) long integers.
● IntegerType, which sorts integers of any length and works faster than LongType.
● TimeUUIDType, which sorts by a 16-byte (128-bit) version 1 time UUID.
● UTF8Type, which works exactly like BytesType, but also validates the column names as UTF-8.
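
The choice of comparator changes the order in which columns are returned. The sketch below is an illustration in Python rather than Cassandra code: it sorts the same four numeric column names once as text, byte for byte (as BytesType or UTF8Type would), and once as 8-byte big-endian integers (the encoding assumed here for a LongType-style comparison).

```python
import struct

names = [1, 2, 10, 100]

# BytesType/UTF8Type view: names stored as decimal strings, compared byte for byte.
as_text = sorted(str(n).encode("ascii") for n in names)

# LongType view: names stored as 8-byte big-endian integers. For non-negative
# values the byte order of this encoding equals the numeric order.
packed = sorted(struct.pack(">q", n) for n in names)
as_long = [struct.unpack(">q", p)[0] for p in packed]
```

The textual order is 1, 10, 100, 2, while the numeric order is 1, 2, 10, 100. A column family whose names are numbers should therefore not use a lexical comparator unless the names are padded to a fixed length.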


Comparators are specified per Column Family. In the case of a Super Column Family, a comparator and subcomparator need to be specified. The comparator compares the Super Column names and the subcomparator compares the Column names. In the case of a Standard Column Family, only a comparator is needed which compares the Column names. Cassandra never compares values stored in the database, only (Super) Column names. Comparators are not used for Rows either; for this purpose there are partitioners. In this way, a partitioner for Rows is analogous to a comparator for columns (though in fact partitioners are slightly more complicated). The partitioner is a property which is defined per cluster.

The partitioner which is easiest to understand is the OrderPreservingPartitioner. It stores the rows sorted by row key. This has performance benefits when it is necessary to iterate sequentially over a sorted set of rows. Because rows are stored in row-major order, they can be accessed on disk in one fell swoop.

The standard partitioner is RandomPartitioner. It hashes the row keys with the MD5 algorithm and then stores the rows ordered by row key hash, in the same way the OrderPreservingPartitioner orders them by row key. A property of MD5 is that the hashes of a sorted list of values are spread approximately uniformly at random. Because of this property, the data is also distributed uniformly over the cluster. This helps to prevent so-called “data hot spots” (spots of frequently accessed data). Note that it is still possible to iterate over the rows, but they will be returned in random order, in contrast to the OrderPreservingPartitioner.
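
The effect of hashing the row keys can be illustrated with a few lines of Python. This is a simplified sketch, not Cassandra's implementation: the functions token and node_for, the equal-slice placement rule and the four-node cluster are assumptions made for the example.

```python
import hashlib


def token(row_key: str) -> int:
    """MD5 digest of the row key, read as an unsigned 128-bit integer."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


def node_for(row_key: str, node_count: int) -> int:
    """Hypothetical placement rule: split the token range into equal slices."""
    return token(row_key) * node_count // 2**128


keys = [f"user{i:04d}" for i in range(1000)]  # row keys in sorted order
by_token = sorted(keys, key=token)            # the order in which rows are stored

load = [0] * 4                                # rows per node in a 4-node cluster
for key in keys:
    load[node_for(key, 4)] += 1
```

Neighbouring keys such as “user0001” and “user0002” end up at unrelated positions in by_token, and load shows every node receiving a share of the rows. The price is that iterating over the rows no longer follows the key order.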

The RandomPartitioner will be used, because of its excellent load balancing properties. This must be taken into account in the design of the Column Families. If it is necessary to iterate over values in sequential order, these values should be placed in the column names and not in the row keys. This is because all columns of a row are always stored on the same node in the cluster. Rows may be distributed over the entire cluster, which would introduce high latencies for queries that try to access them in sequential order.

Cassandra Schema

This schema explicitly defines the structure the data in Cassandra should adhere to. The defined Column Families are constructed to guarantee optimal performance. Wherever it is optimal, disk space is sacrificed in order to achieve less disk seek time and network latency.

FacebookSimlike

Column Family Name: FacebookSimlike
Column Family Type: Standard
Comparator: UTF8Type
Description: An index to look up a Simlike user ID given a Facebook user ID.
Row key:
Columns:
● Name: “simid” - Value: contains the Simlike user ID.

Simtoken

Column Family Name: Simtoken
Column Family Type: Standard
Comparator: BytesType
Description: Contains the Simlike tokens used for authentication.
Row key:
Columns:
● Name: - Value:

Design notes: the simtokens are stored in the column names. This makes it possible to log the user out of all sessions at once by deleting all the columns in one query (i.e. the number of queries needed for this is constant). If the simtoken were used as the row key, another column family would be needed to support a global logout function. Another possibility would be to put the simtokens in the column values and use a secondary index. This approach has serious performance issues, because secondary indexes are only preferred on data with a low cardinality (i.e. a lot of duplicate values). Simtokens are unique values, hence a custom index in the form of a column family is used.
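
The design choice can be sketched as follows. This is a Python model using a plain dictionary, not the actual implementation. It assumes that the row key is the Simlike user ID (the schema above leaves the row key unspecified), and the function names and the empty column value are hypothetical.

```python
import secrets

simtoken_cf = {}  # row key (assumed: simid) -> {simtoken as column name: value}


def login(simid: str) -> str:
    """Create a session by adding one column named after a fresh token."""
    simtoken = secrets.token_hex(16)
    simtoken_cf.setdefault(simid, {})[simtoken] = ""  # placeholder value
    return simtoken


def is_valid(simid: str, simtoken: str) -> bool:
    """A token is valid as long as a column with that name exists in the row."""
    return simtoken in simtoken_cf.get(simid, {})


def logout_everywhere(simid: str) -> None:
    """One row deletion ends every session, however many tokens exist."""
    simtoken_cf.pop(simid, None)


phone = login("simid-1")
laptop = login("simid-1")
both_valid = is_valid("simid-1", phone) and is_valid("simid-1", laptop)
logout_everywhere("simid-1")
```

After logout_everywhere, neither token is accepted any more, and the cost of the operation does not depend on the number of open sessions.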

User

Column Family Name: User
Column Family Type: Standard
Comparator: UTF8Type
Description: Contains the Simlike-specific information about the users.
Row key:
Columns:
● Name: “first_name” - Value:
● Name: “last_name” - Value:
● Name: “full_name” - Value:
● Name: “gender” - Value:
● Name: “date_of_birth” - Value:
● Name: “current_location” - Value:
● Name: “last_activity_timestamp” - Value:
● Name: “last_login_timestamp” - Value:
● Name: “last_login_source” - Value:
● Name: “last_login_address” - Value:
● Name: “profile_privacy” - Value:
● Name: “photo” - Value:
● Name: “top1” - Value:
● Name: “top2” - Value:
● ...
● Name: “top” + - Value:

Design notes: Roughly contains the basic user information and a little more (e.g. activity information).


UserFriends

Column Family Name: UserFriends
Column Family Type: Standard
Comparator: UTF8Type
Description: Contains the Simlike user IDs (simids) of the users who are (Facebook) friends of the given user. This includes the user him-/herself.
Row key:
Columns:
● Name: - Value: “” - The value is not important; it’s the existence (or absence) of the column that matters.
● ...
● Name: - Value: “”

UserFrof

Column Family Name: UserFrof
Column Family Type: Standard
Comparator: UTF8Type
Description: Contains the Simlike user IDs (simids) of the users who are (Facebook) friends of the given user or friends of those friends. This includes the user him-/herself.
Row key:
Columns:
● Name: - Value: “” - The value is not important; it’s the existence (or absence) of the column that matters.
● ...
● Name: - Value: “”
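
Both column families use the column names as a set: a user is a friend if and only if a column with that simid exists. The Python sketch below models the rows as sets and shows one way a UserFrof row could be derived from UserFriends rows. The sample simids and the function names are hypothetical; the report does not state how UserFrof is actually filled.

```python
user_friends = {  # row key: simid -> column names: simids of friends (incl. self)
    "a": {"a", "b"},
    "b": {"b", "a", "c"},
    "c": {"c", "b", "d"},
    "d": {"d", "c"},
}


def build_frof_row(simid: str) -> set:
    """Union of the friend rows of every friend: friends plus friends of friends."""
    frof = set()
    for friend in user_friends.get(simid, set()):
        frof |= user_friends.get(friend, set())
    return frof


user_frof = {simid: build_frof_row(simid) for simid in user_friends}


def is_friend_or_frof(simid: str, other: str) -> bool:
    """Membership test: only the existence of the column matters."""
    return other in user_frof.get(simid, set())
```

Because each friends row includes the user him-/herself, the union automatically contains the direct friends as well. In this example, “c” is a friend of a friend of “a”, while “d” is one step too far away.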

UserData

Column Family Name: UserData
Column Family Type: Super
Comparator: UTF8Type
Subcomparator: UTF8Type
Description: Contains third party specific information about the users, but in a source specific format (e.g. JSON).
Row key:
Supercolumns:
● Name: + “:” +
● Name: + “:” +
● …

Subcolumns:
● Name: - Value:
● Name: - Value:
● …


Design notes: Contains all kinds of data about a user. The super column name uniquely identifies the information. The part has to be unique within the source it came from. (By prefixing the super column name with the , the super column name is guaranteed to be unique within Simlike). The subcolumns contain more information about the element, e.g. date of creation, name, URL, who created it, type of data etc.

UserDataExternal

Column Family Name: UserDataExternal
Column Family Type: Super
Comparator: UTF8Type
Subcomparator: UTF8Type
Description: Contains third party information about users which is specific for the Simlike application. This should only contain frequently accessed information.
Row key:
Supercolumns:
● Name:
● Name:
● ...

Subcolumns:
● Name: - Value:
● Name: - Value:
● ...

Design notes: it is important that the number of subcolumns does not grow too big. The reason for this is that all subcolumns are deserialized (i.e. loaded into memory) upon accessing a supercolumn. This is a Cassandra limitation. External information such as “Facebook likes” is not stored in this table; for this purpose there is another table called UserData. It is important that this column family only contains data that requires frequent lookups, such as IDs, index names, etc. Less frequently used data should be moved to another column family to keep up performance.

UserDataRaw

Column Family Name: UserDataRaw
Column Family Type: Standard
Comparator: UTF8Type
Description: Contains third party specific information about the users, but in a source specific format (e.g. JSON).
Row key:
Columns:
● Name: + “:” + - Value:
● Name: + “:” + - Value:
● …


Design notes: Usage of this column family should be avoided in production. It serves as a mock data source. It should only be written to by the API and read by the background service. The column name is a composite name to avoid using a super column family (which would deserialize all columns upon information retrieval, which is undesired behaviour because of the performance penalty).

UserSimlike

Column Family Name: UserSimlike
Column Family Type: Standard
Comparator: BytesType
Description: Contains the simlikes (a.k.a. UserData which are visible to other users) sorted by their recency.
Row key:
Columns:
● Name: + “:” + - Value:
● Name: + “:” + - Value:
● …

Design notes: The timestamp part of the column name is fixed length. This forces Cassandra to sort the columns by timestamp. If the finer grained parts (e.g. minutes, seconds) of the timestamp are not known, fill in zeros for those parts.
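
Why the fixed length matters can be shown with a small Python example. The column name layout used here (a 14-digit timestamp of the form YYYYMMDDhhmmss, a colon and an identifier) is an assumption made for illustration; the report only states that the timestamp part has a fixed length and that unknown parts are filled with zeros.

```python
def column_name(timestamp: str, item_id: str) -> bytes:
    """Hypothetical layout: 14-digit timestamp (YYYYMMDDhhmmss), ':', then an ID."""
    padded = timestamp.ljust(14, "0")  # unknown finer-grained parts become zeros
    return f"{padded}:{item_id}".encode("ascii")


names = [
    column_name("20110711093000", "x"),
    column_name("201107", "y"),          # only year and month are known
    column_name("20101231235959", "z"),
]
ordered = sorted(names)  # byte-for-byte order, as BytesType would sort

# Without a fixed length the lexical order breaks: "10" sorts before "9".
unpadded = sorted([b"9:x", b"10:y"])
```

With fixed-length timestamps the byte-for-byte order of BytesType coincides with chronological order, so the most recent simlikes sit at one end of the row.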

UserMessages

Column Family Name: UserMessages
Column Family Type: Super
Comparator: TimeUUIDType
Subcomparator: UTF8Type
Description: Contains all messages of the users.
Row key:
SuperColumns:
● Name:
● Name:
● …

SubColumns:
● Name: “subject” - Value:
● Name: “to” - Value:
● Name: “from” - Value:
● Name: “content” - Value:
● Name: “map:” + - Value: empty
● …
● Name: “map:” + - Value: empty


Design notes: The column name is unique (as is required) and this way it is easy to request the latest messages, since Cassandra sorts the columns by timestamp. Each message stores the basic information (subject, to, from, content) and a subcolumn per map it is stored in. This makes it possible to store messages in several maps, without copying the entire message.

UserMessageMaps

Column Family Name: UserMessageMaps
Column Family Type: Standard
Comparator: TimeUUIDType
Description: Contains all message maps of the user.
Row key: + “:” +
Columns:
● Name: Value:
● Name: Value:
● ...

Design notes: Multiple maps are allowed per message. When a message is added to this map, a subcolumn map: should be added to the message in the UserMessages column family. Just as maps can contain messages, it is possible to create a new column family that lets unique subjects contain messages (all replies to that subject). This way all replies to a message can be obtained and a “homopost” feature can be implemented.

UserMessageMapNames

Column Family Name: UserMessageMapNames
Column Family Type: Standard
Comparator: UTF8Type
Description: Contains all the names of the message maps of the user.
Row key:
Columns:
● Name: Value:
● Name: Value:
● ...

Design notes: This column family is needed, since map names are stored implicitly in row keys in UserMessageMaps. It is important to insert the map name in this CF when a message is added to a map. Checking for the map’s existence first would require an extra call, so just insert it instead. Inserting a column that already exists simply overwrites it, so Cassandra will do the rest.
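
The interplay between the three message column families can be sketched in Python with dictionaries. The row keys (the simid, and the simid followed by “:” and the map name), the “map:” subcolumn naming, the message ID “msg-1” and the function add_to_map are assumptions made for this illustration; in the real schema the message identifiers are Time UUIDs.

```python
message_map_names = {}  # row key (assumed: simid) -> {map name: value}
message_maps = {}       # row key (assumed: simid + ":" + map name) -> {message id: value}
messages = {}           # row key (assumed: simid) -> {message id: subcolumns}


def add_to_map(simid: str, map_name: str, message_id: str) -> None:
    # Blind insert of the map name: writing an existing column just overwrites it,
    # so no existence check (and no extra call) is needed.
    message_map_names.setdefault(simid, {})[map_name] = ""
    message_maps.setdefault(f"{simid}:{map_name}", {})[message_id] = ""
    # Mirror the membership on the message itself as a "map:" subcolumn.
    messages[simid][message_id][f"map:{map_name}"] = ""


messages["simid-1"] = {"msg-1": {"subject": "Hi", "content": "Hello"}}
add_to_map("simid-1", "inbox", "msg-1")
add_to_map("simid-1", "inbox", "msg-1")    # repeating the insert changes nothing
add_to_map("simid-1", "holiday", "msg-1")  # same message, second map, no copy
```

The message body is stored once, while both maps and the message itself record the membership. Repeating an insert is harmless, which is why the map name can be written without checking for its existence first.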

Hardware platform

The Hardware platform consists of the hardware required to host the Web and Database platforms. The team has opted to spend as little time as possible on maintenance of the hardware platform. To guarantee the quality of this platform, maintenance is outsourced to Amazon Web Services (AWS). All servers will be hosted on the Elastic Compute Cloud (EC2) service. This service provides Virtual Private Servers (VPS) which are extremely scalable and very maintenance friendly. Extra VPSs can be added or replaced at the press of a button, which makes hardware maintenance quick and easy. Extra services are offered by AWS to make the life of the developers easier:
● Load balancing. Automatically spread the load over multiple VPSs so that the service is equally responsive to all users.
● Auto scaling. Automatically add or remove VPSs based on the load of the platform.
● Backups. It is possible to create snapshots of VPSs. This allows restoring a VPS in case of failure with the click of a button. There is no need to reinstall and reconfigure all the services installed on that VPS.
● Worldwide coverage. AWS has data centres spread out over all major continents, making it possible to provide a reliable service to users worldwide.

AWS offers other services as well. The services that are of interest to the design of the system are:
● Amazon Elastic MapReduce. Amazon offers out-of-the-box Hadoop MapReduce functionality. It is possible to run a MapReduce job on a cluster of servers, with control over how many servers run the job. There is no need to configure the infrastructure, as Amazon takes care of this. This service is potentially of interest for the Algorithm platform.
● Amazon CloudFront. This is a content delivery service, which allows for low latency and high speed data transfer. This is used by the Web platform.
● AWS Elastic Beanstalk [AAEB]. A distribution system for web applications, which can handle capacity provisioning, load balancing, auto-scaling and application health monitoring automatically. This is used by the Web platform.
● Amazon Route 53. A scalable DNS service, capable of load balancing requests to the data centres with the best network conditions for the incoming request.
● Amazon Simple Storage Service and Amazon Elastic Block Store. Both are scalable persistent storage platforms, particularly of use to the Database and Web platforms.

Web platform

This section describes the web platform, the responsibilities of the web platform and how it is set up.

The web platform consists of all the services which are connected to the internet directly, for example web services, load balancers, firewalls and proxies. The web platform must guarantee the following properties (as dictated by the requirements):
● Responsive: it must not take too long to respond, to allow for a real-time experience.
● Reliable: it must always be accessible.
● Scalable: it should easily be scalable.
● Secure: information needs to be stored and handled in a secure way.
● Expandable: adding new features should be possible without much trouble.

There are two kinds of content that are served from the web platform: static content and dynamic content. Static content consists of files which do not change after they have been uploaded, e.g. JavaScript files and HTML files. Dynamic content is not actually uploaded. Dynamic content is generated on the server side, e.g. it could be generated by a Java servlet.

The static content will be served using Amazon CloudFront: a Content Delivery Network (CDN). A CDN specialises in delivering static web content to end-users with low latency and high availability. It achieves this by storing copies of the files in various geographically distinct locations. Users are redirected to the nearest copy of the file. Currently they are hosted at 19 different locations spread out over three continents: North America, Europe and Asia. Any changes to the files will be automatically propagated to these locations. For a list of these specific locations, please see [ACF].

The servers which serve the dynamic content will be hosted at AWS Elastic Beanstalk (AEB). This is the result of a study done prior to constructing this TDD [SR]. One of the alternatives was hosting a custom maintained web server on Amazon’s Elastic Compute Cloud (EC2). This option was not chosen for the API, because it is more maintenance intensive. The AEB servers can be configured to run from multiple data centres which are close together, in order to minimise the impact of a data centre outage.

It is also possible to run the AEB servers from multiple data centres which are on different continents. This brings down the latency for dynamic content worldwide. It also minimises the chances of complete service outage in the case of e.g. a natural disaster. This requires extra work, as this setup is not supported as an Amazon Service. The solution would be to host two identical versions of the web platform and make them synchronise with each other. A geographical location aware DNS service could be used to send visitors to the data centre which is nearest to them for optimal response times. In the case of a natural disaster, all traffic could be redirected to the other continent, so that the Simlike service remains up and running. However, this requires twice the infrastructure, which also means double costs. Next to that, extra costs would be generated by the traffic between the continents, which is needed to synchronise the services. A setup with multiple data centres on multiple continents for AEB will not be implemented. Instead, a setup with multiple data centres close to each other will be implemented.

The AEB servers run an Apache Tomcat Java server. This means Java will be used to serve dynamic content to the internet. A property of AEB is that extra servers are added automatically if the load becomes too high. This also works the other way: servers are automatically removed when the load is very low. The total load is balanced over the active servers. This saves a great deal of money, by paying only for what you use.


For the chat service, an Extensible Messaging and Presence Protocol (XMPP) service will be hosted. XMPP is a chat protocol, which is often referred to as Jabber. The software used for the server side of the chat service is ejabberd.

Openfire and ejabberd are the two most used XMPP chat servers. The biggest difference between the two is that commercial support can be bought for Openfire, whereas ejabberd offers easy clustering. Clustering is very important to keep the chat service scalable, so ejabberd was chosen.

The chat service is exposed to the internet via a proxy. Nginx is used for hosting this proxy. It is capable of handling more than 10,000 connections per second, which makes it perfectly suited for proxying chat traffic. It is possible to install Nginx on the AEB servers, so that no extra servers are needed for the XMPP service. This way, the proxies are also automatically load balanced.

There is another advantage of hosting the Nginx proxies on the AEB servers. This makes the AEB the only group of servers which offer services to the internet. All these servers share the same firewall rules, thanks to the Amazon Web Services (AWS) management console. The management console allows configuring all the servers by a single set of firewall rules.

More security measures, besides firewalls, have been taken. A number of recommendations have been made in [SEC], and a subset of these has been used in the design and implementation of the Simlike platform:
● An access control list (ACL) has been implemented to control what Simlike users can do within the system. This has been done to ease the implementation of other kinds of users (e.g. companies) in the future.
● The Least Privilege Principle has been used, which means that users have the rights to do what they need to do, but no more than that. Because of this, the project team has also had to think about which users should have access to which pieces of data belonging to other users.
● Availability of the system has been ensured by using Amazon’s EC2 and AEB offerings, so that more or fewer resources are used as necessary.
● The Simlike system is designed to be secure itself, that is, no “turtle shell” design has been chosen.
● By using Apache Shiro, the Simlike system is using as much reusable security-related code as possible.
● The Simlike system treats any input that comes directly from users purely as data and never as any form of code, thereby avoiding means of attack such as SQL injections.
● The Simlike system has been designed to be as technically simple as possible; complexity is only added when absolutely necessary, thereby avoiding bugs resulting from unnecessary complexity. In combination with using Apache Shiro, this also avoids Security Through Obscurity.
● There is only one way to accomplish any action within both the Simlike Facebook application and the mobile application: via the API. This means that any action that can be taken is necessarily done in a secure manner, which avoids big security leaks.
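
The combination of an access control list and the Least Privilege Principle comes down to a deny-by-default permission check. The Python sketch below illustrates that idea only. It is not Apache Shiro's API, and the roles and permission names are hypothetical.

```python
ACL = {  # hypothetical roles and the permissions explicitly granted to them
    "user": {"profile:read", "profile:update_own", "message:send"},
    "company": {"profile:read"},
}


def is_permitted(role: str, permission: str) -> bool:
    """Deny by default: only explicitly granted permissions are allowed."""
    return permission in ACL.get(role, set())


user_may_send = is_permitted("user", "message:send")
company_may_send = is_permitted("company", "message:send")
unknown_may_read = is_permitted("visitor", "profile:read")
```

A new kind of user, such as a company, is introduced by adding one entry to the list. Anything that is not listed, including every request from an unknown role, is refused.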

With respect to the expandability of the platform: AWS also offers other services, besides AEB and EC2, which could be used to expand the Simlike platform. For a complete overview of available services, please see [AWS]. A significant advantage of adopting these services is that they are hosted in the same data centres. This eliminates data traffic costs between these services; there is no charge for data traffic inside the same data centre.

To summarise, all the criteria have been met:
● Responsive: by using a CDN, load balancers, a geographical location aware DNS service and a high performance proxy, all content is served with low latency everywhere around the world.
● Reliable: the content is served from multiple data centres. The static content is served from multiple data centres on multiple continents, while the dynamic content is served from multiple data centres on one continent.
● Scalable: by using the AWS Auto Scaling feature, more servers are automatically added if the load becomes too high.
● Secure: measures are in place to fend off attacks. Other measures are in place to minimise the impact of successful attacks.
● Expandable: the AWS infrastructure offers other services which could be adopted relatively easily at relatively low cost.


Simlike

The outer shell of Simlike is composed of four public services: a mobile application, a Facebook application, a Website and the Simlike API (also see figure 1). There is also a service running in the background which takes care of asynchronous tasks, such as updating the Simlike user data from Facebook.

Mobile application

One of the must-haves defined in the requirements analysis is that Simlike should support mobile phones. This has led to the development of a web-based application with a user interface designed for smartphones [REQ]. The team has decided to use Mobl for this, since it shows great promise in reducing boilerplate code while encouraging an application logic flow that is easy to understand for developers, regardless of their experience with the Mobl code base [MOBL].

The mobile application stands on roughly equal footing with the Facebook application, in the sense that both are little more than presentation layers towards users for the services implemented in lower architectural layers, such as the automatic matching of users, enabling users to indicate what their top simlikes are, and data storage in the cloud. This is possible because of the Simlike API, which provides these services both to Simlike and to third party developers.

However, there are also differences between the mobile and the Facebook applications: no chat functionality will be available in the mobile application, while the nature of smartphones allows the mobile application to offer, for example, location-based services which will allow users to see which of their contacts are close by and available for an activity.

When designing a new software product it is important to take the licenses of the used components into account, and the mobile application is no different. The most prominent of these is Zef Hemel’s Mobl project, which provides the domain-specific language used to write most of the code for the mobile application. The Mobl code is licensed under an MIT license. The other main external component that warrants mentioning is jQuery, which is licensed under both the GPL and an MIT license.

To introduce and retain as much modularity as possible in the mobile application, it has been divided up in multiple screens, where a screen is a well-defined part of the GUI that can be viewed separately from other screens if required data is provided. While not enforced by Mobl, every screen has been given a focused responsibility. This is made easy and more natural to developers by the fact that Mobl supports this directly by means of the “screen” syntactic construct.

These screens have been defined:
● root: A screen that is invisible to users. This is required by Mobl and is akin to the “main” methods and functions in programming languages such as C, C++ and Java. It is used here to perform authentication before any other part of the mobile GUI is presented to the user.
● main: The entry point of the application from the perspective of the user. This enables users to select which activity they want to perform.
● buddy list: Displays an overview of the user’s contacts and enables the user to interact with them.
● location: Displays a user’s current location.
● mail overview: Displays an overview of mails the user has received and enables the user to select a received mail for viewing. It also enables the user to send new mails.
● mail reply: Enables a user to view and reply to a mail that was selected in the mail overview screen.
● suggestions: Suggests other Simlike users to the user based on his/her simlikes.
● top simlikes: Displays the user’s simlikes, and enables the user to select his/her top simlikes.

Facebook application

The Facebook application is a small website which is integrated in Facebook’s social media platform. The first main version of Simlike is the Facebook application. Users will get to know Simlike for the first time via the Facebook application. It integrates the user’s Facebook profile into the Simlike services. The application is written in HTML, JavaScript and CSS. It communicates with the API via the JavaScript SDK, to request any data from the database.

Graphical User Interface

The Facebook app requires the design of a Graphical User Interface (GUI). The colours and graphical details have been made by Paris Hidden. It is vital that the GUI is user-friendly. GUI mock-ups can be found in the Functional Design Document.

The GUI consists of a frame with three sections. The top section displays information about the user, such as the user’s name and the user’s top five interests. The left section is a side menu with the user’s profile, and the menu options. The middle section, the content section, displays the current mode (automatic matching, chat, etc.). See figure 9. The JavaScript contains a function for each menu item. Each function inserts the appropriate content into the content section. The functions make use of jQuery templating. jQuery templating allows separating the GUI from the GUI logic. The HTML file contains templates for the functions to fill in, separating static text from variables. This makes the GUI adjustable, since for each new menu item, a new function and template can be added.


Figure 9. Facebook Application Graphical User Interface Layout.

JavaScript

Most of the JavaScript functions require API calls. The GUI uses the Simlike JavaScript SDK (JSSDK) for the interaction with the API. When the user triggers the GUI to request information, the JavaScript makes a call to the JSSDK to process the request. The result is then processed by jQuery templates to fill in the content section. The GUI uses the Simlike.Gui.FbApp JavaScript component for the business logic related to the Facebook application. The Simlike.Gui.FbApp component contains various methods for the GUI to call. Upon initialisation, it loads the other necessary JavaScript files, logs the user in on Facebook and Simlike and fetches the user data necessary for loading the main screen. Then the showMainScreen function is called, which actually displays the user data. Additionally, Simlike.Gui.FbApp contains functions such as autoMatching and viewProfile, which insert data into the main screen. These methods serve as a link between the HTML and the API.

To support the chat and mail functionality, the Simlike.Gui.FbApp.Chat and Simlike.Gui.FbApp.Mail components were created. The Simlike.Gui.FbApp.Chat component is designed to work with the corresponding Simlike.Chat component of the JSSDK. It manages all conversations as Simlike.Gui.FbApp.Chat.Conversation objects and has functions for opening a conversation and for reacting to an incoming conversation. A conversation object can be called to send and receive messages. The Simlike.Gui.FbApp.Mail component contains functions for viewing and sending mails. It calls the API through the Simlike.api method. An overview of the communication flow can be found in figure 10.

Figure 10. Communication overview between the Facebook application and the JSSDK.

Authentication

Before the user can access the GUI, he must be logged in on both Facebook and Simlike. Logging in on Simlike happens automatically once the user has logged in on Facebook. If the user is not yet logged in on Facebook, he is redirected to the Facebook login page. The first time the user logs into Simlike with his Facebook account, he is asked to grant a number of permissions. These permissions are required to download the user data from Facebook into Simlike.

Logging in provides the Facebook application with a session object, which includes a Facebook access token and a Facebook user ID. These two values are required to authenticate the user on the Simlike platform. They are passed to the Simlike authentication servlet with an authenticate call. The servlet processes the request and returns the matching Simlike ID (simid) and a generated Simlike token (simtoken). The app uses the simid and the simtoken to authenticate every API call. After the user has been logged in, the GUI becomes accessible. For more detail on the API communication, see the API section of this report.

Layout

The layout defines the flow of the user's interaction with the GUI. This is an important aspect of realising a user-friendly experience. The following requirements are set for the GUI layout:

Flow
● With each action, the user should get feedback immediately.
● The user needs to be able to perform each action efficiently.
● The user's train of thought must be disturbed as little as possible.
● The application should always provide something new to see (new matches, a notifications section, etc.).

Art direction and design

The application should have a good balance between art direction and design. Design is about the how: what the interface looks like and how the user can access the functions. Art direction is about the why: what the atmosphere should be, the general feeling of the interface, and what message the look of the application should convey to the user.

User interface design

The user interface design inherits the guidelines of the layout. This translates to the following criteria for the user interface:

General design
● The design should be consistent.
● Pop-ups should be avoided as much as possible.
● The user should be able to switch to a different feature in one click (for example, from chat to messaging).

Navigation
● A page should never contain a link to itself.
● It should always be clear to the user which page he is currently viewing.

Website

Simlike will have its own website, which will be a gateway to the Facebook application of Simlike. Until the Facebook application is launched, a simple web page asks visitors to check back after the launch. See figure 11.


Figure 11. The temporary website of Simlike.

The infrastructure for the website has been set up. Currently it is being served via Amazon CloudFront: a Content Delivery Network (CDN) service of Amazon Web Services (AWS) [ACF].

The website can be modified directly from the AWS console by selecting the “Amazon S3” tab and uploading new files to the website. See figure 12. Alternatively, third-party tools offer other interfaces to Amazon S3, such as FTP. None of these tools were used, because the AWS console proved to be sufficient.


Figure 12. Uploading the website to the AWS infrastructure.

API

The API acts as a bridge between a client application and the technical foundation of Simlike. It was created to keep the different parts of the Simlike system as loosely coupled as possible. It processes HTTP requests from a client (e.g. the Facebook or mobile application) and makes calls to the database. It then converts the acquired information into JSON objects, which are returned to the client as an HTTP response. This section first defines the contracts for these JSON objects and then the API calls.

JSON contracts

The servlets return the data in JSON format. The JSON formats are stored in template files for reference and easy testing. New (key, value) pairs can be added by altering the corresponding Servlet class and template file. The JSON contracts are as follows:

● BasicUser
{
  "simid" : "Simlike user ID of the requested user",
  "first_name" : "first name",
  "last_name" : "last name",
  "full_name" : "full name",
  "age" : "age of the user",
  "photo" : "URL to photo",
  "gender" : "gender of the user",
  "current_location" : "city",
  "matches" : number of matching simlikes, as a number
}

● MatchNumber
{ "matches" : number of matching simlikes, as a number }

● MatchingResult
{ "users" : [ {BasicUser}, …, {BasicUser} ] }

● Profile
{
  "simid" : "Simlike user ID of the requested user",
  "first_name" : "first name",
  "last_name" : "last name",
  "full_name" : "full name",
  "age" : "age of the user",
  "photo" : "URL to photo",
  "gender" : "gender of the user",
  "date_of_birth" : "YYYYMMDD",
  "current_location" : "city",
  "simlikes" : [ {simlike}, {simlike}, {simlike}, {simlike} ],
  "top_simlikes" : [ {simlike}, {simlike}, {simlike} ],
  "matching_simlikes" : [ {simlike}, {simlike}, {simlike} ]
}

● simlike
{
  "ciid" : "Id of the simlike",
  "name" : "name of the ContextItem",
  "photo" : "URL to photo"
}

● SimlikeCollection
{ "simlikes" : [ {simlike}, {simlike}, {simlike} ] }

● Error
{ "error" : "message" }

● SimAuth
{
  "authsimid" : "Simlike user ID",
  "simtoken" : "A token used for authentication on the Simlike Realm"
}

● Message
{
  "to" : "Simlike user ID of receiver",
  "to_photo" : "URL to photo of receiver",
  "to_name" : "full name of receiver",
  "from" : "Simlike user ID of sender",
  "from_photo" : "URL to photo of sender",
  "from_name" : "full name of sender",
  "subject" : "subject of the message",
  "text" : "content of the message"
}
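A client could check a response against one of these contracts before using it. The sketch below does this for BasicUser. It is an illustration only: the checkContract helper is hypothetical, and the value types (strings for every field except matches) are inferred from the quoting used in the contract, not confirmed by the servlet code.

```javascript
// Keys of the BasicUser contract and their assumed JavaScript types.
const BASIC_USER_CONTRACT = {
  simid: "string", first_name: "string", last_name: "string",
  full_name: "string", age: "string", photo: "string",
  gender: "string", current_location: "string", matches: "number"
};

// Returns a list of problems; an empty list means the object fits the contract.
function checkContract(obj, contract) {
  const problems = [];
  Object.keys(contract).forEach(function (key) {
    if (!(key in obj)) {
      problems.push("missing key: " + key);
    } else if (typeof obj[key] !== contract[key]) {
      problems.push("wrong type for " + key);
    }
  });
  return problems;
}

// Example response body with made-up values.
const sampleBasicUser = JSON.parse(
  '{"simid":"42","first_name":"Ann","last_name":"Example",' +
  '"full_name":"Ann Example","age":"25","photo":"https://example.invalid/a.png",' +
  '"gender":"female","current_location":"Delft","matches":3}'
);
const sampleProblems = checkContract(sampleBasicUser, BASIC_USER_CONTRACT);
```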

API calls

The front end of the API is formed by servlets. Each servlet handles a different API call and calls a class that translates the call into database calls. The servlets are located in the Java package com.simlike.api.servlets; the database classes are located in the Java package com.simlike.api.db.controllers. The following API calls have been created:

● automaticMatching - This call returns a list of BasicUsers that match well with the user that made the API call (the caller). A BasicUser does not contain all information about a user and is visible to all Simlike users. This list of BasicUsers is called a MatchingResult.
● searchMatching - Works the same as automaticMatching, but requires a parameter:
  ○ cNames: a list of simlike names to search for matching users.
● simpleOneToOneMatching - This call accepts the simid of a user and returns the number of simlikes the user and the caller have in common as a MatchNumber.
  ○ simid: the simid of the other user to calculate the number of matching simlikes with.
● extendedOneToOneMatching - This call accepts the simid of a user and returns the simlikes the user and the caller have in common as a SimlikeCollection.
  ○ simid: the simid of the other user to calculate the matching simlikes with.
● getBasicUserInformation - This call accepts the simid of the user whose information should be returned and returns the result as a BasicUser. Users can use privacy settings to keep their profile information hidden from other users.
  ○ simid: the simid of the user to request information about.
● getAllUserInformation - This call accepts the simid of the user whose information should be returned and returns the result as a Profile. Users can use privacy settings to keep their profile information hidden from other users. If a user is not authorized to view the complete profile, a call to getBasicUserInformation can be made to obtain the basic information that is public for every user.
  ○ simid: the simid of the user to request information about.
● getSimlike - This call accepts the ID of a simlike and returns the item as a simlike JSON object.
  ○ sid: the simlike ID.
● getTopSimlikes - This call accepts the simid of the user whose top simlikes should be returned and returns the requested simlikes as a SimlikeCollection. The privacy settings that apply to the Profile data also apply to this call.
  ○ simid: the simid of the user to request the top simlikes from.
● updateUser - This call updates the local data of the user that made the API call with the current data from Facebook.
● sendMail - This call accepts the receiver's Simlike ID, the message subject and the message content as parameters and constructs a Message from them. The Message is stored in the inbox of the receiver and in the sent folder of the sender.
● deleteMail - This call accepts the folder from which the mails should be deleted and the IDs of the mails that should be deleted, and removes the messages from the given folder. If no folder contains the given messages any more, they are removed from the database entirely.
● getMail - This call accepts the name of the folder from which the messages should be retrieved, the number of messages that should be retrieved and the ID of the start message. It then retrieves a sequence of messages from the database, ordered by creation time and starting from the given start ID (or from the most recent message if this ID is null). The length of this sequence is equal to the given number.
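The paging behaviour described for getMail can be sketched as follows. This is an illustration under stated assumptions, not the servlet implementation: messages are plain objects with hypothetical id and created fields, the sequence is assumed to run from newest to oldest, an unknown start ID is assumed to yield an empty result, and fewer messages are returned when the folder does not hold enough.

```javascript
// messages: array of { id, created } objects from one folder.
// Returns up to `count` messages, newest first, starting at `startId`
// (or at the most recent message when startId is null).
function getMailPage(messages, count, startId) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = messages.slice().sort(function (a, b) {
    return b.created - a.created;
  });
  let start = 0;
  if (startId !== null) {
    start = sorted.findIndex(function (m) { return m.id === startId; });
    if (start === -1) {
      return []; // assumed behaviour for an unknown start ID
    }
  }
  return sorted.slice(start, start + count);
}
```

A client can fetch the next page by passing the ID of the oldest message it has already received and dropping the first element of the result.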

The calls are made as follows. The client (e.g. Facebook app) links the user to a URL which defines the call to issue, along with the parameters. For example, a user searching for “soccer” would be linked to https://[API URL]/searchMatching?cNames=soccer&[AUTHENTICATION PARAMETERS]. The API would then respond by making a request to the database, and returning the results as a JSON list to the client. The client can then process this list and display it to the user. The communication is visualised in the sequence diagram in figure 13.

Figure 13. Sequence diagram for SearchMatching.
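A client could assemble such a call URL as sketched below. The helper buildApiUrl and the base URL are hypothetical, and the names of the authentication parameters (authsimid and simtoken) are an assumption taken from the SimAuth contract; the report does not spell out the exact query parameter names.

```javascript
// Builds the URL for an API call from the call name, the call parameters
// and the authentication values returned by the authenticate call.
function buildApiUrl(baseUrl, call, params, auth) {
  const all = Object.assign({}, params, {
    authsimid: auth.authsimid, // assumed parameter name
    simtoken: auth.simtoken    // assumed parameter name
  });
  const query = Object.keys(all).map(function (key) {
    // Encode keys and values so that spaces and symbols survive the URL.
    return encodeURIComponent(key) + "=" + encodeURIComponent(all[key]);
  }).join("&");
  return baseUrl + "/" + call + "?" + query;
}

// Example: the "soccer" search from the text, with placeholder credentials.
const exampleSearchUrl = buildApiUrl(
  "https://api.example.invalid",
  "searchMatching",
  { cNames: "soccer" },
  { authsimid: "42", simtoken: "abc" }
);
```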

Authentication

To make sure that only registered users can use the API, authentication is used. Before making any API calls, the client must authenticate via the AuthenticationServlet, which is accessible via https://[API URL]/authenticate. Currently it supports authentication with Facebook credentials, which is done by passing three parameters:

1. type. The type of authentication requested. Currently only the value “facebook” is supported.
2. fbid. The Facebook user ID.
3. fbtoken. The Facebook access token. This can be requested from the Facebook SDK [FB2].

The response is a SimAuth object. The parameters from this response should be passed along with every API call made hereafter.
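The authenticate call and the handling of its response can be sketched as follows. The helper names and the base URL are hypothetical. The sketch also assumes that a failed authentication is reported with the Error contract; the report does not state how failures are returned.

```javascript
// Builds the authenticate URL with the three documented parameters.
function buildAuthenticateUrl(baseUrl, fbid, fbtoken) {
  return baseUrl + "/authenticate?type=facebook" +
    "&fbid=" + encodeURIComponent(fbid) +
    "&fbtoken=" + encodeURIComponent(fbtoken);
}

// Interprets a response body as either a SimAuth object or an Error object.
function parseAuthResponse(body) {
  const data = JSON.parse(body);
  if (typeof data.error === "string") {
    // Assumed failure format: the Error contract { "error" : "message" }.
    throw new Error("Authentication failed: " + data.error);
  }
  return { authsimid: data.authsimid, simtoken: data.simtoken };
}
```

The object returned by parseAuthResponse holds exactly the two values that have to accompany every subsequent API call.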

Apache Shiro

Internally, the API works with Apache Shiro to authenticate and authorize API calls. Two types of authentication and authorization are currently implemented: Facebook and Simlike. These are implemented in the form of Realms. The realms are located in the Java package com.simlike.api.auth.realms.

Authentication

Authentication is the act of verifying that a user is who he says he is. The user supplies a (principals, credentials) pair, where the principals are the information that identifies the user and the credentials are the information that can be used to check whether the user is “telling the truth”.

For example, in real life a name can be a principal (although it should be a unique identifier in Simlike) and a passport can be the credentials.

Facebook Realm

When a user supplies his Facebook token (fbtoken) and user ID (fbid) to the AuthenticationServlet and requests authentication on the FacebookRealm (type = facebook), Shiro asks the FacebookRealm for the AuthenticationInfo that belongs to the (fbid, fbtoken) pair. The realm asks Facebook for the ID (principals) belonging to the fbtoken (credentials) and returns (fbid, fbtoken, FacebookRealm) as AuthenticationInfo if the given principals match those obtained from Facebook. Otherwise an AuthenticationException is thrown. If the information is returned, the user is authenticated on the FacebookRealm and the information can be used to generate a simtoken.

Simlike Realm

When a user supplies his simid and simtoken along with an API call, the AbstractSimlikeServlet asks Shiro to authenticate the user at the SimlikeRealm with a (simid, simtoken) pair. Shiro calls the SimlikeRealm's doGetAuthenticationInfo() method. This method uses the simtoken to read the corresponding simid from the database. This work is done by the method SimlikeController.read(). When the returned simid matches the given simid, the user is authenticated and the (simid, simtoken, SimlikeRealm) AuthenticationInfo is returned.
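The check performed by the SimlikeRealm can be illustrated with the sketch below. The actual realm is a Java class built on Apache Shiro; this JavaScript version only mirrors the comparison logic. The in-memory tokenStore stands in for the database lookup done by SimlikeController.read(), and the function name and the shape of the returned object are hypothetical.

```javascript
// In-memory stand-in for the database: maps a simtoken to the simid it was
// issued for. The entry below is made-up example data.
const tokenStore = new Map([["token-abc", "42"]]);

// Looks up the simid stored for the simtoken and compares it with the simid
// supplied by the caller, as described for doGetAuthenticationInfo().
function authenticateOnSimlikeRealm(simid, simtoken) {
  const storedSimid = tokenStore.get(simtoken);
  if (storedSimid === undefined || storedSimid !== simid) {
    // Plays the role of Shiro's AuthenticationException in this sketch.
    throw new Error("AuthenticationException: simid and simtoken do not match");
  }
  return { principal: simid, credentials: simtoken, realm: "SimlikeRealm" };
}
```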

Authorization

Authorization is the act of checking whether a user (once authenticated) has permission to do what he is trying to do. A driver's license is an example of authorization. If you own a driver's license you have the permission (you are authorized) to drive a car.

Shiro offers the isPermitted() method to check whether a user has a given permission. When this call is made, Shiro calls the doGetAuthorizationInfo() method on the realms where the user is authenticated and supplies it with the principals of the user.

Facebook

The FacebookRealm's authorization is currently not used. Users only authenticate on this realm to obtain a simtoken. All users therefore have the implicit permission to request a simtoken.

Simlike

When doGetAuthorizationInfo() of the SimlikeRealm is called, it loads the Role corresponding to the given principals. A Role can be viewed as a set of permissions, so a user with a certain Role has all the permissions that belong to that Role. These Roles and Permissions are returned in the AuthorizationInfo and used by Shiro to determine whether the user is authorized. The Roles provide a way to distribute permissions among large groups of users. However, to implement privacy settings that control the access to a user's profile data, individual permissions are needed. Suppose user A requests the profile of user B. This is only allowed if A has the permission “access:profile:B”. To generate this individual permission dynamically, the simid of B is added to the principals of A. This way the doGetAuthorizationInfo() method of a Realm can ask the PermissionChecker whether A has permission to access B's profile. If A has permission, the actual permission is added and the AuthorizationInfo is returned. Shiro does the rest, so a simple call to isPermitted() can be used in the servlet.

Roles

The following Roles and their permissions exist:
● Admin - An administrator has all rights, but his API calls are logged so that it can be checked that the admin is not misusing any privileges.
● Person - A person is the basic user of Simlike. The following permissions are given to all Persons:
  ○ automaticMatching
  ○ searchMatching
  ○ simpleOneToOneMatching
  ○ getBasicUserInformation
  ○ updateUser
● Background Updater - The background updater is responsible for updating user information (either in batch jobs or for single users).
  ○ updateUser
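The combination of role permissions and the dynamically generated profile permission can be sketched as follows. The real implementation is Java code on top of Apache Shiro; this JavaScript version only illustrates the logic. The function names, the shape of the info object and the mayAccessProfile callback (a stand-in for the PermissionChecker) are hypothetical.

```javascript
// Permissions per role, as listed above. Admin is handled separately
// because an administrator has all rights.
const ROLE_PERMISSIONS = {
  "Person": ["automaticMatching", "searchMatching", "simpleOneToOneMatching",
    "getBasicUserInformation", "updateUser"],
  "Background Updater": ["updateUser"]
};

// Collects the role permissions of `user` and, when a target profile is
// requested and access is allowed, adds the individual profile permission.
function getAuthorizationInfo(user, targetSimid, mayAccessProfile) {
  const permissions = new Set(ROLE_PERMISSIONS[user.role] || []);
  if (targetSimid !== null && mayAccessProfile(user.simid, targetSimid)) {
    permissions.add("access:profile:" + targetSimid);
  }
  return { isAdmin: user.role === "Admin", permissions: permissions };
}

// Counterpart of the isPermitted() check made in the servlet.
function isPermitted(info, permission) {
  return info.isAdmin || info.permissions.has(permission);
}
```

With this structure the servlet only needs one isPermitted() call, while the decision about an individual profile stays in one place.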

Authentication flow

The authentication flow of clients is illustrated using the Facebook application as an example. The Facebook app requests authentication from the web server (Tomcat). The servlet running on the web server passes the request to Apache Shiro, which performs the authentication using the Facebook Realm. The data needed to authenticate is pulled from the database using the UserController. A simtoken is generated and persisted in the database so that it can be used to authenticate subsequent requests. This flow is illustrated in the sequence diagram in figure 14.


Figure 14. Sequence diagram for logging in.


Privacy

A user can restrict access to his/her simlikes and date of birth by adjusting privacy settings. The following privacy settings are possible:
● Public - When the information is public, all authenticated Simlike users are able to access the information.
● Friends only - Information for friends only is displayed only to the user him-/herself and to the user's Facebook friends.
● Friends of friends - When information is accessible to friends of friends, it can only be viewed by the user him-/herself, the user's friends and their friends.
● Private - Private information can only be accessed by the user who owns the information.

There are a few exceptions to these rules. Administrators have access to the information. Also, a user can request access to the information of another user. If the request is granted, the user will be able to view the information until the permission is revoked.
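The four privacy settings and the two exceptions can be expressed as a single decision function. The sketch below is an illustration only: the setting labels, the friends map and the options object are hypothetical, the friendship relation is assumed to be symmetric, and the viewer is assumed to be an authenticated Simlike user.

```javascript
// friends: Map from a simid to the Set of simids of that user's friends.
// options.isAdmin and options.accessGranted model the two exceptions.
function canView(setting, viewer, owner, friends, options) {
  const opts = options || {};
  if (viewer === owner || opts.isAdmin || opts.accessGranted) {
    return true;
  }
  const ownerFriends = friends.get(owner) || new Set();
  switch (setting) {
    case "public":
      return true;
    case "friends":
      return ownerFriends.has(viewer);
    case "friends_of_friends":
      if (ownerFriends.has(viewer)) {
        return true;
      }
      // Check whether the viewer is a friend of one of the owner's friends.
      for (const friend of ownerFriends) {
        if ((friends.get(friend) || new Set()).has(viewer)) {
          return true;
        }
      }
      return false;
    default:
      return false; // "private" and any unknown setting
  }
}
```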

Java packages

The Java code for the API has been divided into the following packages:
● com.simlike.api: the main Java package.
● com.simlike.api.auth: contains all the code relevant to authentication and authorization.
● com.simlike.api.auth.realms: contains all classes that extend Apache Shiro Realms.
● com.simlike.api.auth.tokens: contains all tokens which are authenticated by the realms.
● com.simlike.api.db: contains all the database-specific code.
● com.simlike.api.db.controllers: contains the controllers which operate on the database.
● com.simlike.api.db.data: contains entities that can be written to and read from the database.
● com.simlike.api.exceptions: contains all custom exceptions for the Simlike platform.
● com.simlike.api.facebook: contains all Facebook-specific code.
● com.simlike.api.filters: contains filters that operate on servlet requests or responses.
● com.simlike.api.jabber: contains all chat-specific code.
● com.simlike.api.matching: contains all the matching-specific code.
● com.simlike.api.servlets: contains all the servlets.

A diagram showing the dependencies between the most important packages can be found in figure 15. The blocks that do not belong to the API have a different background colour. Shiro is the authentication and authorization framework that was used. The Applications block represents all applications that use the API. This could be the Facebook application, a mobile application or a third-party application. Class diagrams for all classes can be found in the Javadoc of the project. The Javadoc can be generated by running the `mvn site` command from any of the project directories. This generates all documentation in the [project_dir]/target/site/ folder.


Figure 15. The dependencies between the packages of the API.

JavaScript SDK

A JavaScript SDK (JSSDK) has been developed to expose the Simlike functionality to web platforms. It consists of the following components:
● Simlike - this is the core JSSDK component.
● Simlike.Facebook - handles all the Simlike/Facebook interaction.
● Simlike.Chat - exposes the chat functionality to web applications.

Simlike

The Simlike component gives the caller access to the other components, as well as to the API. The three most important functions it provides are the load() function, the api() function and the authenticate() function. The load() function loads any of the other components. The api() function makes any GET request to the API. The authenticate() function logs the user on to Simlike and retrieves the user's simid and a matching simtoken.

Simlike.Facebook

The Simlike.Facebook component serves as an interface to the Facebook API. It contains functions to log in to Facebook and to request various data from Facebook.

Simlike.Chat

One of the services that the JavaScript SDK provides is the chat functionality. For this purpose, a chat server has been set up. Jabber was chosen as the server: a chat server that implements the Extensible Messaging and Presence Protocol (XMPP), a chat protocol using XML [XMPP][JB]. The chat client uses Strophe, a JavaScript library for XMPP [STR]. When the user logs in on Simlike, the app creates a Jabber ID based on the user's simid and connects the user to Jabber behind the scenes. The user can open a conversation with any online user, send messages and share simlikes, places or photos.

The Simlike.Chat component contains functions to access this functionality. These functions allow the caller to connect to the chat server, start a new conversation, and to set a callback function to be called when a new conversation is opened (by another user). To support this, the Simlike.Chat.Conversation class was made. Each conversation has its own Simlike.Chat.Conversation object, which can be called to send messages and to set a callback function for incoming messages.
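The relation between the chat component and its conversation objects can be sketched as follows. This is not the JSSDK source: the constructor names Chat and Conversation, the method names and the transport object are hypothetical. In the real SDK the role of the transport is played by the Strophe XMPP connection.

```javascript
// `transport` is a hypothetical object with a send(jid, text) method.
function Chat(transport) {
  this.transport = transport;
  this.conversations = {};
  // Callback for conversations opened by another user; a no-op by default.
  this.onNewConversation = function () {};
}

// Opens (or returns the existing) conversation with the given Jabber ID.
Chat.prototype.open = function (jid) {
  if (!this.conversations[jid]) {
    this.conversations[jid] = new Conversation(this.transport, jid);
  }
  return this.conversations[jid];
};

// Called by the transport layer whenever a message arrives.
Chat.prototype.receive = function (fromJid, text) {
  const isNew = !this.conversations[fromJid];
  const conversation = this.open(fromJid);
  if (isNew) {
    this.onNewConversation(conversation);
  }
  conversation.onMessage(text);
};

// One object per conversation, with its own incoming-message callback.
function Conversation(transport, jid) {
  this.transport = transport;
  this.jid = jid;
  this.onMessage = function () {};
}

Conversation.prototype.send = function (text) {
  this.transport.send(this.jid, text);
};
```

Because the new-conversation callback runs before the first message is delivered, a caller can install its onMessage handler there without losing that message.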

The flow of the chat functionality is documented in the sequence diagrams in figures 16, 17 and 18.


Figure 16. Connecting to the Jabber server. This is done when the user is logged in.

Figure 17. Sending a message to another user. This is done when the user presses the “Send” button.


Figure 18. Receiving a message. This occurs whenever a message is sent to the current user's Jabber ID.


Quality assurance

Many aspects determine the quality of Simlike. To ensure a useful level of quality, all of these aspects need to be taken into account or monitored during the process. The most important of these are:
● Security - Simlike will eventually possess a lot of personal information about users. This information will need to be stored securely.
● Licenses - Using external software and services usually involves licenses. It is important to know the legal implications of using external software or services.
● Availability/Reliability - The system will need to be online and available to a lot of users at the same time all over the world.
● Expandability/Maintainability - The client plans to keep innovating and therefore to keep changing or adding to the product. Simlike will therefore have to be easy to maintain and expand.

Security

Security is not absolute. It is a matter of risk management, and it should be built into the application rather than tacked on as an afterthought. This means security needs to be taken into account when designing the product's features. To keep the product secure, some research on security was therefore required. The results of this orienting research can be found in [SEC].

Apache Shiro will be used for authentication and authorization. Using an external library should be safer than trying to invent our own.

Licenses

To make a Facebook application, Facebook's policy regarding the data that is pulled from Facebook [FB1] should be adhered to. The following are the most important points of this policy:
● It is only permitted to request the data needed to operate Simlike.
● It is allowed to cache the data, though it should be kept up-to-date as much as possible.
● A privacy statement about the data is required.
● The data which is relevant to a particular user's friends may only be used in the context of that friendship (this is to prevent leaking of personal data to other people).
● Facebook data (and derivatives) may not be transferred to third-party advertising networks or data brokers.
● Facebook data may not be sold. If Nerval Limited is acquired by or merges with a third party, it is allowed to continue to use the data within the application the license was given to.
● If the license is terminated, Nerval Limited must delete all Facebook data.
● A user may request to delete all of his data. In this event, Nerval Limited is obligated to comply.
● Facebook data may not be included in any advertisements.

Apart from the Facebook policy, the use of external software also brings licenses along that need to be adhered to. Most of the open source licenses will not be a problem, since no binary files will be distributed to clients.

Availability/Reliability

For a social media platform to be successful it needs to have a large user base and be almost always accessible. To achieve this, some redundancy is needed in data storage and no single point of failure should exist. This can be done by using clusters of computers and storing copies of data on multiple machines.

The distributed database system Cassandra will take care of the data duplication and is easy to cluster. These qualities make it a good fit for this project.

To assure an acceptable level of correctness and reliability, Test Driven Development will be used. This means the important aspects of correct behaviour of features will be identified and a test suite will be made before the features are implemented. These test suites will contain unit tests. Design by Contract will be used to make it easier to spot faults. When the contract is violated an exception will be thrown to prevent wrong use of the application and create higher fault sensitivity, which is to say that a triggered fault will more easily generate some kind of exception.
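The idea of raising an exception when a contract is violated can be sketched as follows. The report does not describe the mechanism used in the Java code, so the requireThat helper and the countMatches example are hypothetical and only show the principle of checking preconditions and postconditions explicitly.

```javascript
// Throws when a contract condition does not hold, so that misuse fails fast
// instead of silently producing a wrong result.
function requireThat(condition, message) {
  if (!condition) {
    throw new Error("Contract violated: " + message);
  }
}

// Returns the number of distinct names that occur in both lists.
function countMatches(simlikesA, simlikesB) {
  // Preconditions: both arguments must be arrays.
  requireThat(Array.isArray(simlikesA), "simlikesA must be an array");
  requireThat(Array.isArray(simlikesB), "simlikesB must be an array");
  const inB = new Set(simlikesB);
  const shared = new Set(simlikesA.filter(function (name) {
    return inB.has(name);
  }));
  // Postcondition: there cannot be more shared names than the shorter list holds.
  requireThat(shared.size <= Math.min(simlikesA.length, simlikesB.length),
    "result exceeds input size");
  return shared.size;
}
```

In a test-first workflow, checks such as countMatches(["soccer", "jazz"], ["jazz", "chess"]) returning 1 and a call with null throwing would be written before the function body.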

The test suite will be analysed to obtain its coverage (the code it executes). If this coverage is too low, it is a sign that the test suite does not cover all important aspects of the feature, so more test cases should be made. High coverage, however, does not by itself say anything about the quality of the test suite. Coverage is therefore merely a tool to identify a poor test suite, rather than a way to determine whether a test suite is good.

Except for trivial methods like (non-singleton) getters and setters all methods should be covered. In other words, the method coverage of non-trivial methods should be as close to 100% as is reasonably achievable. Statement coverage will have to be assessed manually by the developers. When using coverage tools like Emma, it is easy to see which code has no coverage and whether or not that is a problem.
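The method coverage target can be made concrete with a small calculation. The sketch below is an illustration only: the input format (objects with name, trivial and covered fields) is hypothetical and is not the output format of Emma or any other coverage tool.

```javascript
// methods: array of { name, trivial, covered } objects.
// Returns the fraction of non-trivial methods that are covered by tests.
function nonTrivialMethodCoverage(methods) {
  const relevant = methods.filter(function (m) { return !m.trivial; });
  if (relevant.length === 0) {
    return 1; // nothing that needs coverage
  }
  const covered = relevant.filter(function (m) { return m.covered; }).length;
  return covered / relevant.length;
}

// Example: the uncovered getter is trivial and does not lower the result;
// two of the three non-trivial methods are covered.
const exampleCoverage = nonTrivialMethodCoverage([
  { name: "getName", trivial: true, covered: false },
  { name: "authenticate", trivial: false, covered: true },
  { name: "sendMail", trivial: false, covered: true },
  { name: "deleteMail", trivial: false, covered: false }
]);
```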

Expandability/Maintainability

Code that is simple, easy to understand and well documented is easier to maintain. Therefore tools like Checkstyle, PMD or Sonar will be used to analyse the style of the code and safeguard its maintainability.

To assure easy expandability several well-known Design Patterns can be used. Possible examples are the Template and Strategy Patterns. Using these patterns will make it easier for new developers to understand how the product works. Together with proper documentation this will make the product easily expandable.


Implementation recommendations

In this section some recommendations for the implementation are given. Adhering to these recommendations will increase the quality of the end product.
● General (language independent):
  ○ Use as much well-tested, already-written code as possible, usually in the form of external libraries. This decreases the implementation workload on the team, because the maintenance, including testing, is done by an external party that is dedicated to the library.
  ○ Use design patterns where applicable. Since the team has no extensive experience with this yet, this implies looking at existing design patterns every time a feature is to be implemented. This overhead will decrease with time.
  ○ Write unit tests (which are based on test specifications) before writing the functionality. The idea is that the feature must be designed before implementing it.
● Java:
  ○ Use generics when it makes the code better, but always ask the question “Is it necessary?”. Generics, especially when nested, can make Java code a lot harder to read.
  ○ Try to use Google Guava collections (e.g. ImmutableList) whenever possible and appropriate. The invocations are usually shorter, and the implementations are quite efficient.


References

[AAEB] Amazon, “AWS Elastic Beanstalk”, http://aws.amazon.com/elasticbeanstalk/, accessed on May 9, 2011.
[AC1] Cassandra Wiki, “DataModel”, http://wiki.apache.org/cassandra/DataModel, accessed on May 11, 2011.
[ACF] Amazon Web Services, “Amazon CloudFront”, http://aws.amazon.com/cloudfront/, accessed on July 8, 2011.
[AWS] Amazon, “Amazon Web Services”, http://aws.amazon.com/, accessed on July 8, 2011.
[CDG] Eben Hewitt, 2010, “Cassandra: The Definitive Guide”, O'Reilly Media, 978-1-449-39041-9.
[CLUST] Lin D. (1998), “Automatic Retrieval and Clustering of Words”, Department of Computer Science, University of Manitoba, http://acl.ldc.upenn.edu/P/P98/P98-2127.pdf, accessed on July 11, 2011.
[FB1] Avinash Lakshman and Prashant Malik, 2010, “Cassandra: a decentralized structured storage system”, SIGOPS Oper. Syst. Rev. 44, 2 (April 2010), 35-40, DOI=10.1145/1773912.1773922, http://doi.acm.org/10.1145/1773912.1773922, accessed on July 11, 2011.
[FB2] Facebook developers, “FB.getLoginStatus”, http://developers.facebook.com/docs/reference/javascript/FB.getLoginStatus/, accessed on July 7, 2011.
[FUNC] Albeda, Dijkhuizen, Ezechiëls, Lanting, April 29, 2011, “Functional Design Document”.
[JB] Peter Saint-Andre, “jabber.org”, http://www.jabber.org/, accessed on July 5, 2011.
[MOBL] The Mobl project homepage, http://www.mobl-lang.org/, accessed on July 5, 2011.
[PVA] Albeda, Dijkhuizen, Ezechiëls, Lanting, April 26, 2011, “Plan van Aanpak”.
[REQ] Albeda, Dijkhuizen, Ezechiëls, Lanting, April 25, 2011, “Requirements analysis”.
[SEC] Ezechiëls, J., May 4, 2011, “Security orientation & design guidelines”.
[SR] Albeda, Dijkhuizen, Ezechiëls, Lanting, May 16, 2011, “Study report: Simlike platform”.
[STR] Jack Moffitt, “Strophe.js”, http://strophe.im/strophejs, accessed on July 5, 2011.
[WP1] Wikipedia, “Row-major order”, http://en.wikipedia.org/wiki/Row-major_order, accessed on May 11, 2011.
[XMPP] XMPP Standards Foundation, “XMPP Standards Foundation”, http://xmpp.org/, accessed on July 5, 2011.


Progress Report

Simlike platform, sprint #1

April 29, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the progress report of the first sprint of the development of the Simlike platform. It is part of a series of reports that each describe the development process of a single SCRUM sprint [001]. More details about the Simlike platform and the authors can be found in [002]. This report is meant as a reference for the developers and their supervisors and can be used to gain insight into, and analyze, the development process of the Simlike platform.


Table of contents

Progress Report
Simlike platform, sprint #1
Table of contents
Summary
Introduction
Progress
Requirements analysis
Functional Design
Technical Design
Retrospective
References
Attachment A - Gantt time schedule


Summary

We have made significant progress and are ahead of schedule. The requirements have been gathered and analysed, and they have been incorporated in the functional design. Several requirements have use cases; more will be added to the functional design document as more sprints are executed. Simple GUI mock-ups were created as part of the functional design. The team communication is going reasonably well, but there is room for improvement. This is to be expected, however, as we are still getting used to each other.


Introduction

This document describes the design and design choices made in the first sprint of the development of the Simlike platform. In each sprint a selection of features of the product will be implemented. In this sprint a start was made on the global design. The requirements of the product are analyzed and listed in section Requirements Analysis. Since SCRUM is used as the development process, this list can still change during the project. It can be used as the backlog of requirements that still have to be implemented.

The functional requirements are further elaborated in section Functional Design. This section contains the global functional design. Not all functional requirements have use cases yet. This is done because the functional requirements may yet change, so the use cases for a requirement will be made when the requirement is planned for a sprint. The global functional design will be updated after each sprint and can be used as a reference of what things should do.

Next, the global technical design is discussed in section Technical Design. During this sprint only a part of the global technical design was made. Once it is completed it can be used as a reference during implementation and can be used by future developers to gain insight in the architecture of the system.

In section Retrospective we reflect upon the progress during the sprint. It can be used to learn from mistakes and identify problems which should be avoided in other sprints.

A Gantt time schedule of activities is attached to this report.

Progress

Here the progress made in the first sprint is described. Originally it was planned to gain more insight into the requirements that were elicited, and create a functional design. We got ahead of schedule, so we decided to also start with the technical design.

Requirements analysis

The goal of this sprint was to analyze the requirements. In addition to extra functional requirements, several non-functional requirements were identified that had not been made explicit before. A list prioritized according to the MoSCoW method [003] was created in discussion with the client. For the result of this analysis, see [005].

Functional Design

In order to gauge the client's wishes and vision of the final product, a functional design was created [006]. It contains a feature "backlog": a list of user stories that define the functionality of Simlike. During the next sprints, we will select several of these user stories to implement. The user stories can be used to derive test specifications and as a reference during implementation. The functional design also includes several GUI mock-ups, which were created together with the client.

Technical Design

A preliminary study was performed in the context of the technical design. In the first study report [004], several existing technologies were identified that could be adopted by the Simlike platform. This sprint, the trade-offs implicit in using these existing technologies were examined. It was discovered that the algorithm plays a key role in choosing the underlying hardware and storage platform. The focus will therefore be on developing the algorithm further, to the point where the trade-offs become clear, so that a sound decision about the underlying architecture can be made.

Retrospective

In this section we will review progress since the last report. This enables us to better learn from our past experiences. Also, by documenting these efforts, we hope to gain feedback from our client, as well as our TU Delft supervisor, which we can then apply to accomplish work of higher quality. Last, but not least, these lessons may be of value to the team which will pick up this project after this internship assignment is over.

Initially we wrote in our plan of approach that we would have two meetings of at most 10 minutes per day. The first meeting serves to create awareness of the time schedule and tasks for that day. The second meeting serves to discuss the progress made. We managed to hold all the first meetings at the beginning of the workday. The second meeting was held on only 50% of the planned occasions. The reason is that we communicate during the day, so there is some awareness of general progress. However, there is a pitfall here, as this leads to a false sense of awareness. Team members had the impression they knew who did what, but it became clear that this was not entirely true. We have to pay extra attention to these second meetings, since they were designed exactly to prevent this pitfall.

So far, we have not received feedback from the TU Delft supervisor. The client was invited for feedback sessions on Tuesday and Friday. On Tuesday we discussed the requirements in more detail; on Friday we discussed the functional design. In hindsight, it would have been better to plan the feedback session for the functional design on Thursday. On Friday we started on the GUI part of the technical design while the functional design had not yet been approved by the client. This could have led to wasted effort if the design had not been approved. Fortunately, the client's feedback did not render the work done in advance useless, but this is something to watch out for in the future.

We organise our tasks in a Gantt time schedule, which is working very well. The schedule automatically generates a task list on a daily basis. Tasks that are not completed on time are carried into the next daily list and flagged "overdue". In our first daily meeting we discuss the tasks, put them in the schedule and print out a copy for the team members. Besides a daily view, we also generate weekly views and a global view. The weekly view is attached to this report; it is well suited for the supervisors to monitor our weekly progress.

References

[001] Schwaber, Sutherland, “The Scrum Guide”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”
[003] Wikipedia, “MoSCoW Method”, http://en.wikipedia.org/wiki/MoSCoW_Method, accessed on April 24, 2011.
[004] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Oriëntatieverslag”
[005] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Requirements analysis”
[006] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Functional Design”

Attachment A - Gantt time schedule

Progress Report

Simlike platform, sprint #2

May 6, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the progress report of the second sprint of the development of the Simlike platform. It is part of a series of reports that each describe the development process of a single SCRUM sprint [001]. More details about the Simlike platform and the authors can be found in [002]. This report is meant as a reference for the developers and their supervisors and can be used to gain insight into and analyze the development process of the Simlike platform.

Table of contents

Summary
Introduction
Progress
    Technical Design
Retrospective
Discussion
References

Summary

For this sprint, the team had planned to finish the Technical Design Document [003]. We were not able to complete all of the planned work. Some details about the algorithm, databases and hardware have been deferred to following sprints.

On a general level, we have defined the product’s architecture, decided on an approach for the required algorithms, determined the structure and requirements of the graphical user interface (GUI) and carried out hardware and database research. As stated above, this research has not yet been completed.

We had some problems with the planning this week, but overall the development process seems to be going well.

Introduction

This document describes the design and choices made during the second sprint of the development of the Simlike platform. In each sprint a selection of product features will be implemented. In this sprint the team has continued working on the global design and has written this down in the form of the Technical Design Document [003].

The section Technical Design discusses the architectural design of Simlike, as well as the requirements concerning the quality of our work and how we intend to meet those requirements. The Technical Design Document can serve as a basis for choices made during the implementation sprints and as a general reference regarding the product’s architecture.

In the section Discussion the work that did not get finished is discussed and the causes are analyzed.

Progress

This section describes the progress made in the second sprint. The global Technical Design was scheduled for this sprint. We fell behind schedule, due to poor planning around the absence of one of our team members and to some discussions about the algorithms that took longer than planned.

Technical Design

The product has been split into two layers:
● Foundation - This layer contains the lower-level platforms and databases. It is the foundation of Simlike in the sense that all Simlike applications will use the data stored and the services offered by the platforms in this layer.
● Simlike - This layer contains the applications the user interacts with. The applications use the services offered by the platforms in the foundation.

What was done while creating the Technical Design Document:
● (Matching) Algorithms - The team decided on specific algorithms to match and search for people. Decisions on some details of these algorithms have been deferred to the sprint in which they are to be implemented.
DECLARED CLASSIFIED BY NERVAL LIMITED
● Graphical User Interface - The design for the GUI has been changed. The team chose not to put the menu in a bar. Two alternatives were presented: one where the options are displayed in a side bar and one where the menu items are listed as icons. Displaying these icons on a panel in the main screen would take too much space, and it would not be user-friendly to make the user click a button to access the icons. Therefore, the current idea is a button that unfolds to display the icons when the user hovers over it. The side bar and this unfolding icon button are the two options under consideration at this moment.
● Database / storage - The team has researched how to model data storage so that it is accessible with low latency. MySQL and other relational databases have been ruled out because the team judged that they would not scale well enough for Simlike. Currently the team is evaluating key-value stores (Amazon SimpleDB), BigTable clones (Brisk) and graph databases (Neo4j). The last two are the most promising options at this point.
● Mobile application - The mobile application will most likely not be part of this project.
● Quality Assurance - To ensure the product’s high quality, a number of aspects need to be taken into account, including security, availability & reliability, and expandability & maintainability. To gain some much-needed insight into these issues, the team has conducted introductory research into the respective areas, and as a result now has a good idea of what measures need to be taken, on a technical as well as a social level.
● Virtual Private Server (VPS) - Two decisions had to be made. First, the team had to choose between a VPS of some kind and dedicated hardware for the Simlike platform. After opting for Amazon’s VPS system, the team was faced with a choice between using Amazon’s AWS Elastic Beanstalk (AAEB) and using Amazon’s EC2 VPS system directly. This final decision has been deferred to sprint #3.

Retrospective

In this section we will review the progress since the last report. This will enable us to better learn from our past experiences. Also, by documenting these efforts, we hope to gain feedback from our client, as well as our TU Delft supervisor, which we can then use to heighten the quality of our work. These lessons may also be of value to the team that will pick up this project after this internship assignment is over.

In sprint #2 the team had its second meeting with the TU supervisor (Peter van Nieuwenhuizen), whose feedback mostly concerned spelling errors and incorrect sentences in the reports that were previously handed in. The team has since corrected these in new versions of the corresponding reports. This suggests that the quality of the content of our work was sufficient, so the team seems to be doing well so far.

However, the planning was not very accurate. The team worked with three members from Monday until Wednesday, but the workload was still based on four people. This shows that we should take the days off of our team members into account when planning.

The discussions about the algorithms required more time than the planning indicated. This is important to note, since not all decisions about the algorithms have been made yet. The fact that this took longer than expected is useful information for planning further discussions about the algorithms, e.g. when extending them.

Discussion

In this section we will discuss the planned work that has not yet been finished. We will also analyze the causes of the problem.

Among the things that have not yet been finished during this sprint is the Technical Design Document. The following decisions still need to be made:
● Which web platform to use (Amazon Elastic Beanstalk or Amazon EC2).
● How to use weights in the algorithms.
● Which database to use (a graph database or Brisk).
● Which programming languages to use.

At its core, the cause of this is a shortage of time: we planned too much work for one sprint, we had one team member less for three days, or some combination of the two. To correct this, every team member has since entered his days off into the shared calendar.

In the case of Elastic Beanstalk and graph databases, the problem was insufficient orientation research. These concepts were new to us and therefore had to be properly researched before we could make a decision.

The programming languages we can use depend on the hardware and database platform; these decisions could therefore not yet be made and are deferred to the next sprint.

References

[001] Schwaber, Sutherland, “The Scrum Guide”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”
[003] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Technical Design Document”

Progress Report

Sprint #3

May 13, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the progress report of the third sprint of the development of the Simlike platform. It is part of a series of reports that each describe the development process of a single SCRUM sprint [001]. More details about the Simlike platform and the authors can be found in [002]. This report is meant as a reference for the authors and can be used to gain insight into, and analyze, the development process of the Simlike platform.

Table of contents

Summary
Introduction
Progress
    Tasks
    Features
    Functional Design
        1. One-to-one matching
        2. Facebook application (start)
        3. Simlike website (start)
        4. Automatic matching
        5. Search matching
    Technical Design
        Database platform and bridge to Algorithm platform
Quality assurance
    Development strategy
    Checkstyle and PMD
    Web framework
Retrospective
Discussion
References

Summary

In this sprint, we made further steps in the design phase. There were still a lot of decisions to make, and motivating them took considerable time. These decisions include selecting a programming language, a database and a web framework. Work on the Technical Design Document [003] has been done, a design for the first algorithm has been made and some of the most important tools have been configured.

We had planned to make a first version of the class diagram, but that proved to be too ambitious.

Introduction

This document describes the design and design choices made in the third sprint of the development of the Simlike platform. In each sprint a selection of features of the product will be implemented. These features are listed in section Features and further explained in section Functional Design. These sections can be used as reference during implementation and can later be used as documentation on how the features should be used.

Next, the technical design of the features is discussed in section Technical Design. This can be used as a reference during implementation and by future developers to gain insight into the architecture of the system.

In the section Quality Assurance the measures that were taken to provide quality results are discussed. This is mainly used to reflect upon the development process.

Section Retrospective discusses the progress that was made during this sprint and reflects on the process. It can be used to learn from mistakes that were made and to improve the quality process.

The report ends with the section Discussion, in which problems and noteworthy details are discussed, as well as any features left unimplemented and the reasons why. This section can be used to create a better plan for the next sprint and to gain insight into the difficulty of some features.

Progress

Tasks

Other than the implementation of product features, a number of tasks were planned for this sprint:
● Pick a database platform - We chose Cassandra, because its 'unbound' indexing scheme makes it possible to simulate graphs. This and the easy setup on Amazon make it preferable to HBase and SimpleDB. Its support for map-reduce makes it a better choice than Neo4j, since using Neo4j would require us to redesign the algorithm, and experience has taught us that this would take too much time. The benefits of Neo4j do not outweigh its setup time on Amazon and its 'pricey' scalability (the required license costs several thousand dollars per month).
● Pick programming languages - We will be using Amazon Elastic Beanstalk, which limits the programming languages we can use. Java will be used for the program, with a Java framework to create the web application.
● Finish the technical design document / study report - After picking a database platform and programming languages, the technical design document can be finished.
● Install and configure software and servers - The Amazon servers have to be configured, and Cassandra has to be installed and configured. Then we can set up our tools like Github, Mantis, Eclipse and Tomcat, and testing and code quality tools like Emma and Checkstyle.
● Create test data and a backup - To test our implementation, some data should be present in the database. This data is created and a backup is made, in case something goes wrong during testing and implementation.
● Test web application - The test web application will be used to give some insight into what the server is doing and whether it is configured correctly. It will basically be used as a manual inspection tool for the server.
● Install tool chain - There are some tools we would like to install to aid in development, both to make our lives easier and to increase the quality of the product. Among these are the Eclipse IDE, Checkstyle, PMD, Maven, Jenkins, git and github, the Mantis bugtracker and the Tomcat application server.

Features

In this section the features that were planned for implementation in this sprint are discussed. These features are briefly explained in the form of user stories.
● One-to-one matching - Users can be matched with each other. This means two users will be passed to the algorithm, and it will return the number of interests they have in common.
● Facebook application (start) - Users should be able to log in and link their Facebook account to the application. The application should retrieve the person's personal data and 'likes'. It should then show the Facebook application page.
● Simlike website (start) - The domain www.simlike.com should be registered, and its home screen should link to the Facebook web application. It does not have to be online yet.
● Automatic matching - When a user logs in, a list of best matches is created and shown to the user. The algorithm accepts a user as input and returns a list of people, ordered best match first.
● Search matching - A user should be able to search for people that match one or several characteristics that the user supplies. The algorithm should accept one or several characteristics and a user as input and return a list of people, ordered by how well they match the given characteristics, best match first. Ties will be broken by how well the returned people match with the user.
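
The three matching user stories above can be illustrated with a minimal Java sketch. The actual Simlike algorithms are classified and are not described in this report, so the sketch assumes the simplest reading of the user stories: a user is a name plus a set of interests, and a match score is the number of shared interests. The names MatchingSketch, User, commonInterests, bestMatches, matchedCharacteristics and search are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the matching user stories; not the Simlike algorithm. */
final class MatchingSketch {

    /** Assumed user model: a name and a set of interests ('likes'). */
    static final class User {
        final String name;
        final Set<String> interests;

        User(String name, Set<String> interests) {
            this.name = name;
            this.interests = interests;
        }
    }

    /** One-to-one matching: the number of interests two users share. */
    static int commonInterests(User a, User b) {
        Set<String> shared = new HashSet<>(a.interests);
        shared.retainAll(b.interests);
        return shared.size();
    }

    /** Automatic matching: all other users, ordered best match first. */
    static List<User> bestMatches(User user, List<User> everyone) {
        List<User> others = new ArrayList<>();
        for (User candidate : everyone) {
            if (candidate != user) {
                others.add(candidate);
            }
        }
        others.sort(Comparator.comparingInt((User c) -> commonInterests(user, c)).reversed());
        return others;
    }

    /** Counts how many of the searched characteristics a candidate has. */
    static int matchedCharacteristics(User candidate, Set<String> characteristics) {
        Set<String> matched = new HashSet<>(characteristics);
        matched.retainAll(candidate.interests);
        return matched.size();
    }

    /**
     * Search matching: candidates with at least one searched characteristic,
     * ordered by characteristics matched; ties are broken by match with the user.
     */
    static List<User> search(User user, Set<String> characteristics, List<User> everyone) {
        List<User> results = new ArrayList<>();
        for (User candidate : everyone) {
            if (candidate != user && matchedCharacteristics(candidate, characteristics) > 0) {
                results.add(candidate);
            }
        }
        Comparator<User> bySearch =
            Comparator.comparingInt((User c) -> matchedCharacteristics(c, characteristics));
        Comparator<User> byUserMatch =
            Comparator.comparingInt((User c) -> commonInterests(user, c));
        results.sort(bySearch.reversed().thenComparing(byUserMatch.reversed()));
        return results;
    }
}
```

In this sketch the tie-breaking rule of the search user story is expressed by chaining two comparators: the search score decides first, and the one-to-one score with the searching user decides only between candidates with equal search scores.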

Functional Design

In this section the features that were planned for this sprint are discussed in more detail in the form of use cases. These use cases can be used to derive test specifications and as a reference during implementation.

1. One-to-one matching

Use case 1:
Summary: The user visits the Simlike profile of another user.
Situation: The user is logged in.
Step 1: The user navigates to the Simlike profile of another user.
Result: The user can see the number of characteristics he has in common with the other user.

2. Facebook application (start)

Use case 2.1:
Summary: The user logs in to the Facebook application, via Facebook, for the first time.
Situation: The user is logged in to Facebook.
Step 1: The user navigates to the Simlike Facebook page.
Step 2: The user clicks 'start' or a similar button.
Step 3: Facebook asks the user for his permission to grant Simlike access to the required data.
Step 4: The user grants permission.
Result: The Simlike Facebook application home page is displayed.

Use case 2.2:
Summary: The user logs in to the Facebook application, via Facebook, and has already logged in to the application at least once before.
Situation: The user is logged in to Facebook.
Step 1: The user navigates to the Simlike Facebook page.
Step 2: The user clicks 'start' or a similar button.
Result: The Simlike Facebook application home page is displayed.

3. Simlike website (start)

Use case 3.1:
Summary: The user logs in to the Facebook application, via the Simlike website, for the first time.
Situation: The user has started his/her internet browser.
Step 1: The user navigates to www.simlike.com.
Step 2: The Simlike home page is displayed.
Step 3: The user clicks 'login' or something similar.
Step 4: Facebook asks the user for his permission to grant Simlike access to the required data.
Step 5: The user grants permission.
Result: The Simlike Facebook application home page is displayed.

Use case 3.2:
Summary: The user logs in to the Facebook application, via the Simlike website, and has already logged in to the application at least once before.
Situation: The user has started his/her internet browser.
Step 1: The user navigates to www.simlike.com.
Step 2: The Simlike home page is displayed.
Step 3: The user clicks 'login' or something similar.
Result: The Simlike Facebook application home page is displayed.

4. Automatic matching

Use case 4.1:
Summary: The user chooses one of the possible matches.
Situation: The user is logged in.
Step 1: The system calculates the matches and shows the user the main screen. In the right section of the main panel, there is a list of the best matches.
Step 2: The user can scroll through this list. He selects a match that appears interesting.
Result: The user can view his choice and click one of the buttons for some extra actions regarding his choice.

5. Search matching

Use case 5.1:
Summary: The user searches for another user based on specific interests.
Situation: The user can be in any screen.
Step 1: The user clicks in the search bar.
Step 2: The user types the interests and presses enter or the search button.
Step 3: The system switches to the search screen and displays a list of the results.
Step 4: The user can scroll through this list. He selects a result that interests him.
Result: The user can view his choice and click one of the buttons for some extra actions regarding his choice.

Use case 5.2:
Summary: The user goes to the search menu.
Situation: The user is logged in.
Step 1: The system shows the user the main screen.
Step 2: The user selects the "Search" entry from the menu.
Step 3: The system switches to the "Search" menu.
Result: The user can now execute a more detailed search.

Use case 5.3:
Summary: The user searches for another user based on various criteria.
Situation: The user is in the search menu.
Step 1: The user enters one or more criteria.
Step 2: The user presses enter or the search button.
Step 3: The system displays a list of the results.
Step 4: The user can scroll through this list. He selects a result that interests him.
Result: The user can view his choice and click one of the buttons for some extra actions regarding his choice.

Technical Design

In this section the technical design of the features is discussed, as well as the way the new features are merged into the existing product.

Database platform and bridge to Algorithm platform

The final choice for the database platform is Cassandra.

DECLARED CLASSIFIED BY NERVAL LIMITED

The next step is to design a database for Cassandra which holds the user data for Simlike. Cassandra uses a schemaless design, which is a paradigm shift from traditional SQL schemas. This requires careful thought, because the design decisions influence the scalability of the entire platform. The design of the database determines on which machine(s), and on how many machines, the data is stored. It also involves making decisions about database load balancing. For example, if the data is stored sorted by key in ascending order, there is a risk of creating data 'hot spots'. Data hot spots are places with a lot of data that is frequently accessed, which could lead to a higher load on particular machines. One way to counteract this phenomenon is to store the data in random order instead of sequential order; this results in a more evenly balanced load. A book about Cassandra has been ordered to aid in these design decisions; it will arrive on Monday 16 May or Tuesday 17 May, 2011.
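
The hot-spot effect described above can be illustrated with a small Java sketch. It is a simplified model and not Cassandra code: the class PlacementSketch, its methods and the four-node cluster used in the example are hypothetical, and the hashed variant only mimics the idea of random placement by taking an MD5 digest of the key. With ordered placement, keys that are close together (for example users who registered around the same time) land on the same node; with hashed placement they are spread across the nodes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical model of key placement; not Cassandra's actual partitioner. */
final class PlacementSketch {

    /** Ordered placement: each node owns a contiguous range of keys. */
    static int orderedNode(int key, int maxKey, int nodes) {
        return (int) ((long) key * nodes / (maxKey + 1L));
    }

    /** Hashed placement: the node is derived from an MD5 digest of the key. */
    static int hashedNode(int key, int nodes) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(Integer.toString(key).getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest).mod(BigInteger.valueOf(nodes)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** Counts how many keys in [from, to) each node would receive. */
    static int[] load(int from, int to, int maxKey, int nodes, boolean hashed) {
        int[] counts = new int[nodes];
        for (int key = from; key < to; key++) {
            int node = hashed ? hashedNode(key, nodes) : orderedNode(key, maxKey, nodes);
            counts[node]++;
        }
        return counts;
    }
}
```

For example, load(0, 1000, 99999, 4, false) assigns all 1000 consecutive keys to node 0, which is exactly the hot spot described above, while the hashed variant with the same arguments should spread those keys over all four nodes.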

DECLARED CLASSIFIED BY NERVAL LIMITED

Quality assurance

In this section we explain and motivate the strategy used to produce high-quality results.

Development strategy

The development strategy includes all design decisions (e.g. programming language, design patterns, tools) relevant for this sprint. This can be used to gain insight into the development process and as a reference of design decisions. Any global design decisions already stated in [002] are omitted.

Checkstyle and PMD

In this sprint a configuration for Checkstyle has been made. This configuration contains the style we will adhere to during this project. This style is described in [STYLE] and will probably be applied to all code written for Nerval Limited. For PMD we will first activate all rules and then remove any that strike us as unnecessary during development. These decisions will be explained in the corresponding sprint reports. The rules that remain can be reviewed after the project, to create a PMD configuration for future development at Nerval Limited. The ShortVariable rule has already been removed, since names like 'x' and 'dx' can actually be quite descriptive. Ignoring the warning would of course also be possible, but adding '//NOPMD' comments should not become a habit.
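
As an illustration of the ShortVariable decision, the following Java sketch shows a method in which the short names x and dx are arguably the most readable choice, and how a NOPMD comment would suppress PMD warnings on a single line if the rule were still active. The class Translation and its methods are hypothetical and are not taken from the Simlike code base.

```java
/** Hypothetical example; not taken from the Simlike code base. */
final class Translation {

    /** Moves a coordinate by a delta; 'x' and 'dx' mirror the usual maths notation. */
    static int translate(int x, int dx) {
        return x + dx;
    }

    /** Same logic; the trailing marker asks PMD to skip violations reported on that line. */
    static int translateSuppressed(int x, int dx) { // NOPMD - short names are intentional here
        return x + dx;
    }
}
```

Removing the rule from the configuration keeps the code free of such markers, which is why it is preferred here over suppressing the warning at each occurrence.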

Web framework

In this sprint, different web frameworks for Java have been researched and compared. The following criteria were used for choosing a framework:
● It must use a template engine. This allows the HTML to be separated from the Java code.
● It must be compatible with Tomcat. Tomcat is the web server used by Amazon Elastic Beanstalk, on which Simlike will run.
● It must be lightweight.
● Preferably, it should offer form validation to aid in validating user input.
● It must not have an integrated web server, as this adds significant overhead to the framework (we already have a web server, namely Tomcat).

There are dozens of web frameworks for Java, and it proved quite difficult to make a choice. The following frameworks were considered:
● Spring makes use of MVC. It is a rather popular framework. It proved difficult to install, and it is a very large framework with a lot of additional features not meant for web development.
● Apache Wicket is a component-based framework. It is comparable to Swing, which is an advantage because all team members have experience with Swing. It does, however, make the layout difficult to adjust once it has been made, and the code would become very large. For these reasons, Wicket has been discarded as an option.
● Play! is a relatively new framework. It is made to be easy to work with and comprehensible. Its documentation is very clear, unlike that of most other frameworks. However, it uses its own web server, which conflicts with one of our criteria.

Finally, an alternative was found: to use only a Template Engine and not a whole framework. A Template Engine named FreeMarker was found, satisfying all the criteria. The choice for now is to use FreeMarker, though there are a few doubts since a framework could still make it easier to develop the web application.

Retrospective

In this section we will review progress since the last report. This enables us to better learn from our past experiences. Also, by documenting these efforts, we hope to gain feedback from our client, as well as our TU Delft supervisor, which we can then apply to accomplish work of higher quality. Last, but not least, these lessons may be of value to the team which will pick up this project after this internship assignment is over.

We purposely planned more features this week than we could complete. The spirit of SCRUM is that it is not a problem when not all the work is completed, and it encourages planning too many features (though this should respect some form of global planning). We wanted to see how much we could achieve in a sprint, and, as expected, did not manage to finish everything. The schedule backlog from the last sprint has been eliminated.

It is important to start the sprint meetings first thing on Monday morning. This avoids situations where some people are still busy with tasks from the last sprint while others are done with theirs and have to wait before they can continue working, which is not very efficient. So despite any unfinished features, the new sprint will be planned every Monday morning, and unfinished work will be planned into the new sprint if necessary.

Until now, tasks were divided among the team members, and a team member would not get another significant task until the current one was finished. The problem that occurred here is that there were no guidelines as to how long a certain task was allowed to take. This led to certain tasks taking too long. This had not happened before, because the tasks were not very big. To avoid spending too much time on tasks, the team has agreed on the following:
● Before starting a task, a time limit is set. If this limit is exceeded, you must discuss with another team member why the task is taking longer than expected. There are two reasons for this. First, other people may have insights that can speed up the task in case of difficulties. Second, it allows the team to check that more important tasks are not being neglected because too many team members are spending too much time on less important tasks.
● If a task is expected to take longer than 3 hours, the team has to discuss the goals that have to be met when working on this task. E.g. in the case of research: what is being researched? Why is it being researched? What are we going to do with the conclusion of the research? These goals are necessary to avoid drifting too far from the original goal of the task.

Discussion

Except for the Continuous Integration (CI) part of Jenkins and a decent UML2 tool, the task of installing the development tools is finished. Github accounts and repositories have been set up and github itself integrates nicely with Jenkins, which is for now only able to build on command. Furthermore, Eclipse, complete with a number of plugins (among which the AWS, PMD, checkstyle and maven plugins) has been set up for all the members. The parts that are not finished have a high priority for next week (CI with Jenkins and UML modelling tools).

This week we decided to move parts of the current TDD to the study report [003]. We have made comparisons based on the requirements to discover which technologies we can use. Currently these comparisons are in the TDD, but it would be preferable to put those in the study report. This way the focus of the TDD remains strongly on the technical design of Simlike and the study report gains some “extra weight”. We will perform this task next week. Originally, the TU supervisor expressed that this might not be necessary, but keep in mind we are designing a development methodology for a new company from scratch. The company has to continue working with our documents, so we prefer to have them as neat, relevant and structured as possible.

Because of the focus on substantiating design decisions, we did not get as far with the actual technical design as planned. Originally we aimed to make a class diagram this week, but substantiating our decisions took so much time that we did not get around to it. In retrospect, trying to make a class diagram this week was too optimistic. The lesson here is that motivating design decisions takes more time than anticipated.

Another task that was not finished this week is creating test data for the algorithm. This was due to a missing dependency: to create the test data in Cassandra, we need a database design, and making a database design for Cassandra is not trivial. We'll have to postpone this task until the database design is complete, which will be around Wednesday next week (we have to wait 1 or 2 days for a book about Cassandra to arrive before we can start on the database design; in the meantime there are enough other tasks to complete).


References

[001] Schwaber, Sutherland, “The Scrum Guide”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Orientatieverslag”
[003] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Technical Design Document”
[STYLE] Lanting, “Nerval Limited Code Style”


Progress Report

Sprint #4

May 23, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the progress report of the fourth sprint of the development of the Simlike platform. It is a part of a series of reports that each describe the development process of a single SCRUM sprint. [001] More details about the Simlike platform and the authors can be found in [002]. This report is meant as a reference for the authors and can be used to gain insight in, and analyze, the development process of the Simlike platform.


Table of contents

Summary
Introduction
Progress
Tasks
Features
Functional Design
Technical Design
Database platform
Application Programming Interface (API)
Quality assurance
Testing plan
Development strategy
Integration
Retrospective
Discussion
References


Summary

In this sprint we got our first prototype for validation. We made a very basic and still quite non-functional Facebook application and an interface to the database. The database itself is not yet completed and neither is the coupling between the interface, the database and the Facebook application. This is the reason the Facebook application is not yet very functional.

We grossly underestimated the workload for this sprint and we did not work efficiently enough. To counter this problem we will assign one team member as a “help desk” who can help out other team members on the fly. When he is not helping others, he can work on the reports and documentation.

Introduction

This document describes the design and design choices made in the fourth sprint of the development of the Simlike platform. In each sprint a selection of features of the product will be implemented. These features are listed in section Features and further explained in section Functional Design. These sections can be used as reference during implementation and can later be used as documentation on how the features should be used.

Next the technical design of the features is discussed in section Technical Design. This can be used as a reference during implementation and can be used by future developers to gain insight in the architecture of the system.

In the section Quality Assurance the measures that were taken to provide quality results are discussed. This is mainly used to reflect upon the development process.

Documentation on how to integrate the new features into the product environment is supplied in section Integration.

Section Retrospective discusses the progress that was made during this sprint and reflects on the process. It can be used to learn from mistakes that were made and to improve the quality process.

This report ends with a discussion, where problems and noteworthy details are discussed. Any features left unimplemented and the reason why are discussed as well. This section can be used to create a better plan for the next sprint and to gain insight in the difficulty of some features.


Progress

Tasks

The tasks that are required for the system to work, but are not features themselves, are listed in this section.

● Create a database using Cassandra - This includes the setup of indexing schemas and the installation and configuration of Cassandra.
● Create database mapping from Cassandra to the algorithm and vice versa.
● Create an API for database and algorithm access - To retrieve and store information in the database from an application, an application programming interface (API) is needed.
● Link the Facebook application to the API - The Facebook application will use the API to retrieve data from the database.
● Embed authentication and authorization functionality into the API using Apache Shiro - Make the API require users to authenticate themselves (i.e. log in) and then decide, based on their roles, whether or not they have the privilege to perform certain actions.
● Link the Cassandra database layer to the API - Implement the API stubs to make them functional (and therefore unit testable).

Features

In this section the features that were planned for implementation in this sprint are discussed. These features are briefly explained in the form of user stories.

● Logging in to the Facebook app - To log in to the Facebook application the foundation of the app needs to be completed. The GUI, authentication system and database communication will be implemented.
● Logging in to the mobile app - The mobile application can use the same API to communicate with the database, so only a GUI and authentication system need to be implemented.
● Matching functionality - The system should be able to find users matching specific characteristics and determine how many characteristics two people have in common.
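The last user story, counting how many characteristics two people have in common, can be read as the size of a set intersection. The sketch below illustrates only that reading; the class name, method name and sample characteristics are hypothetical, and it is not the matching algorithm used by Simlike.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: counts the characteristics two users share. */
final class CommonCharacteristics {

    /** Returns the number of characteristics that occur in both sets. */
    static int countCommon(Set<String> first, Set<String> second) {
        Set<String> shared = new HashSet<>(first); // copy, so the inputs stay untouched
        shared.retainAll(second);                  // keep only the intersection
        return shared.size();
    }

    public static void main(String[] args) {
        // Sample data, made up for illustration.
        Set<String> firstUser = Set.of("jazz", "cycling", "chess");
        Set<String> secondUser = Set.of("chess", "jazz", "sailing");
        System.out.println(countCommon(firstUser, secondUser));
    }
}
```

Copying the first set keeps the method free of side effects, which makes it straightforward to unit test.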

Functional Design

In this section the features that were planned for this sprint are discussed in more detail in the form of use cases. These use cases can be used to derive test specifications and as reference during implementation.

Use case 1
Summary: The user logs in on the Facebook app.
Situation: The user has started up the Simlike Facebook app.
Step 1: The system shows the user the welcome screen, with the Facebook log in button.
Step 2: The user presses the log in button.
Step 3: The system shows the user a pop-up window with the Facebook log in screen.
Step 4: The user fills in his username and password, and presses “Log in”.
Step 4a: If this is the user’s first time logging in, the user also has to give permission to the app to access his profile and likes.
Step 5: The system shows the user the main screen.
Result: The user is logged in.

Use case 2
Summary: The user chooses one of the possible matches.
Situation: The user is logged on.
Step 1: The system calculates the matches and shows the user the main screen. In the right section of the main panel, there is a list of the best matches.
Step 2: The user can scroll through this list. He selects a match that appears interesting.

Use case 3
Summary: The user searches for a co-user based on specific interests.
Situation: The user can be in any screen.
Step 1: The user clicks in the search bar.
Step 2: The user types the interests and presses enter or the search button.
Step 3: The system switches to the search screen and displays a list of the results.
Step 4: The user can scroll through this list. He selects a result that interests him.

Use case 4
Summary: The user views some additional information about a co-user.
Situation: The user has chosen a co-user.
Step 1: The user presses the co-user’s name.
Step 2: The system shows the co-user’s profile.

Technical Design

In this section the technical design of the features is discussed, as well as the way the new features are merged into the existing product.

Database platform

Cassandra will be used as the storage platform. More specifically, CassandraFS will be used. DECLARED CLASSIFIED BY NERVAL LIMITED

DECLARED CLASSIFIED BY NERVAL LIMITED


The “Cassandra database schema” is very different from traditional SQL tables. In fact, it does not even use tables. Cassandra is known as a “schema-less data-store”. Let's look at the terminology first, aided by a few examples, to get the gist of it. These are the terms to get familiar with: Keyspace, Column Family, Row and Column. There is a hierarchical relation between these terms, which may help to understand their function. This relation can be visualised as follows:

Figure 1. Cassandra data model abstracted hierarchy

Multiple Keyspaces may exist at the architect’s discretion. Typically there is one Keyspace per application. In this sense a Keyspace is analogous to a database in traditional Database Management Systems (DBMS), but be careful not to think too traditionally: thinking in Columns and Column Families is quite a paradigm shift from a traditional DBMS. A Keyspace may contain an unlimited number of Column Families. A Column Family is used to group relevant Columns and may contain an unlimited number of Columns. The Cassandra Wiki says the following about Column Families [AC1]:

“Each column family is stored in a separate file, and the file is sorted in row (i.e. key) major order. Related columns (those that you'll access together) should be kept within the same column family. The row key is what determines what machine data is stored on. Thus, for each key you can have data from multiple column families associated with it. However, these are logically distinct.”

The file is sorted in row major order, meaning rows are stored one after another (as opposed to column major order, where the columns are stored one after another) [WP1]. The file on disk that contains the rows is immutable [CDG]. This means stored rows are never modified in place; new data is only ever written sequentially, which is how Cassandra guarantees fast writes. Cassandra periodically rebuilds the storage file in the background to optimise read performance.

Column Families come in two flavours: Standard and Super. If a Column Family contains Columns, it is called a Standard Column Family. If a Column Family contains a Standard Column Family, it is called a Super Column Family. A Super Column Family can contain an unlimited number of Standard Column Families. A Super Column Family can only contain Standard Column Families and thus cannot contain other Super Column Families.

Figure 2. The two possible ways to use Column Families: Standard and Super.

The final aspect of the Cassandra data model is the Row. A Row is contained within a Column Family and has a row key. Furthermore, the Row contains values for (some) Columns. Cassandra indexes the data by storing the rows and columns in a sorted manner. The rows are sorted by their keys. Columns can either be sorted by their names or their timestamps (and not by value!). Which of the two is used for column sorting, is an important design decision which has implications for querying the data. E.g. queries which return a time sorted result would benefit from the decision to index the columns by timestamp [FB1]. Column Families are sorted by their names, as they do not contain a timestamp. The big picture visualised as class diagram in UML looks like this:


Figure 3. Cassandra data model as a class diagram in UML.
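To make the hierarchy more tangible, the following is a minimal in-memory sketch of one Keyspace as nested sorted maps (Column Family name, then row key, then column name). It is only an illustration of the terminology, not a Cassandra client; the class name, the method names and the sample data are assumptions, Super Column Families are left out, and columns are sorted by name only (sorting by timestamp is not modelled).

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Illustrative in-memory model of one Keyspace; this is not a Cassandra client. */
final class KeyspaceSketch {

    // Column Family name -> row key -> column name -> value.
    // TreeMap keeps every level sorted by its key, mirroring the sorted storage described above.
    private final SortedMap<String, SortedMap<String, SortedMap<String, String>>> columnFamilies =
            new TreeMap<>();

    /** Stores one column value, creating the Column Family and Row on first use. */
    void put(String columnFamily, String rowKey, String columnName, String value) {
        columnFamilies
                .computeIfAbsent(columnFamily, name -> new TreeMap<>())
                .computeIfAbsent(rowKey, key -> new TreeMap<>())
                .put(columnName, value);
    }

    /** Returns the columns of a row sorted by column name, or an empty map if the row is unknown. */
    SortedMap<String, String> row(String columnFamily, String rowKey) {
        SortedMap<String, SortedMap<String, String>> rows = columnFamilies.get(columnFamily);
        if (rows == null || !rows.containsKey(rowKey)) {
            return new TreeMap<>();
        }
        return rows.get(rowKey);
    }

    public static void main(String[] args) {
        KeyspaceSketch keyspace = new KeyspaceSketch();
        // Hypothetical Column Family, row key and columns.
        keyspace.put("Users", "user42", "name", "Alice");
        keyspace.put("Users", "user42", "city", "Delft");
        System.out.println(keyspace.row("Users", "user42"));
    }
}
```

Note that a row only holds the columns that were actually written for it, which matches the remark that a Row contains values for some Columns.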

The remainder of the Database platform technical design is still under development.

Application Programming Interface (API)

The API has to be accessible via the Internet, since it should run on a different server than the Facebook application. It will also run on a different server than the database itself. Calls to the API are therefore made as HTTP requests, with the parameters passed along using the GET method. The API then accesses the information in the database via sockets and passes it back to the requesting interface as a JSON object.
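As a rough sketch of this request/response shape, the class below splits a GET query string into parameters and renders a flat result as a JSON object. It uses only the Java standard library; the parameter names (userId, limit) are hypothetical, the JSON escaping is deliberately minimal, and the actual API may be structured quite differently.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of handling GET parameters and answering with JSON. */
final class ApiCallSketch {

    /** Splits a query string such as "userId=42&limit=5" into its decoded parameters. */
    static Map<String, String> parseQuery(String query) {
        Map<String, String> parameters = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return parameters;
        }
        for (String pair : query.split("&")) {
            int separator = pair.indexOf('=');
            String key = separator < 0 ? pair : pair.substring(0, separator);
            String value = separator < 0 ? "" : pair.substring(separator + 1);
            parameters.put(
                    URLDecoder.decode(key, StandardCharsets.UTF_8),
                    URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return parameters;
    }

    /** Renders a flat map as a JSON object with string values. */
    static String toJson(Map<String, String> fields) {
        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (json.length() > 1) {
                json.append(',');
            }
            json.append('"').append(escape(field.getKey())).append("\":\"")
                    .append(escape(field.getValue())).append('"');
        }
        return json.append('}').toString();
    }

    // Escapes only backslashes and double quotes; control characters are not handled here.
    private static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(toJson(parseQuery("userId=42&limit=5")));
    }
}
```

In a real deployment a JSON library would normally take over the rendering; the hand-written version here only keeps the sketch self-contained.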


Quality assurance

In this section the team motivates and explains which strategy has been used to produce features of high quality. First the testing plan of the features is discussed, then the overall development strategy for this sprint is discussed.

Testing plan

The testing plan of the features of this sprint includes all test specifications and reasons for adding them. A testing plan can be used to gain insight in the test suite, to provide some level of confidence in the correctness of the implementation.

Given the amount of implemented code, testing itself is not very useful, since testing a stub gives no information that could not have been inferred directly by the team members. However, writing the test specifications and unit tests will still be valuable, because once the planned features are implemented the tests will convey useful information.

Unfortunately, the team did not write a test plan for this sprint, which shows in the lack of test specifications (and corresponding unit tests, where applicable) for the designed features. This is due to time constraints, as we had already fallen behind on our own schedule on Monday.

Development strategy

The development strategy includes all design decisions (e.g. programming language, design patterns, tools) relevant for this sprint. This can be used to gain insight in the development process and as a reference of design decisions. Any global design decisions as stated in [002] are omitted.

One of the features we have chosen to implement this sprint is API authentication and authorization. In order to achieve this we have decided to use Apache Shiro, which enables the domain-agnostic development of security features, even in the context of a web-based environment.

The API itself will be implemented in Java, using Tomcat and HttpServlets for the web interface; JSON strings will be handled with the JSON-Simple library.

For the Facebook application the team has decided to use JavaScript, HTML and CSS.

A few Checkstyle rules have been altered or removed altogether:

● “UnusedImports” is removed because, as it turns out, Java imports are necessary when linking to classes in other packages in Javadoc.
● A //NOCHECK rule has been added to be able to ignore Checkstyle when warranted, if the rule itself still remains useful overall. The comment is followed by a number to indicate the number of lines after the comment that Checkstyle should ignore.
● The check for duplicate string literals is removed, since its usefulness does not outweigh the time and effort it takes to make it ignore string commands like “\n”.
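The first point can be illustrated with a small, hypothetical class in which an import is referenced only from a Javadoc link. Whether Checkstyle flags such an import depends on its version and on how the check is configured (the check has a processJavadoc property), so this is a sketch of the situation the team describes rather than a statement about their exact setup.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical example class; the names are for illustration only. */
final class LikeList {

    private final List<String> likes;

    LikeList(List<String> likes) {
        this.likes = List.copyOf(likes); // defensive, immutable copy
    }

    /**
     * Returns the likes as an immutable list.
     * Callers that need them sorted can apply {@link Collections#sort(List)} to their own copy.
     * The import of Collections is referenced only from this comment, not from code.
     */
    List<String> likes() {
        return likes;
    }

    public static void main(String[] args) {
        System.out.println(new LikeList(List.of("jazz", "chess")).likes());
    }
}
```

Without the import, the Javadoc link would need the fully qualified name java.util.Collections to resolve.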

This week started with a planning technique called “Planning Poker”. It works as follows: start with a list of features and tasks. For each task, every participant gives an estimate of the amount of work required to finish it by playing a card that represents a certain amount of work (hence the name “Planning Poker”). E.g. suppose a specific task is “Facebook application technical design”. Some participants rated this as 13 hours of work, one as 8 hours and another as 20 hours. In this case the person who called it 8 hours may have underestimated the task, while the person claiming 20 hours may have overestimated it. It is also possible that one of them sees a shortcut or pitfall that none of the other team members had seen, so the participants with the lowest and highest estimates explain their reasoning before the team settles on an estimate.
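One round of this technique can be sketched as follows: given each participant's estimate, find who played the lowest and the highest card (the people whose reasoning is worth hearing) and check whether everyone already agrees. The class and method names are hypothetical and the participant labels A to D are made up; the hours follow the example in the text.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of evaluating a single Planning Poker round. */
final class PlanningPokerRound {

    /** Returns the participant with the lowest estimate (the first one found on a tie). */
    static String lowestEstimator(Map<String, Integer> estimates) {
        return Collections.min(estimates.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    /** Returns the participant with the highest estimate (the first one found on a tie). */
    static String highestEstimator(Map<String, Integer> estimates) {
        return Collections.max(estimates.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    /** True when every participant played the same card. */
    static boolean hasConsensus(Map<String, Integer> estimates) {
        return new HashSet<>(estimates.values()).size() == 1;
    }

    public static void main(String[] args) {
        // Estimates in hours for "Facebook application technical design".
        Map<String, Integer> estimates = new LinkedHashMap<>();
        estimates.put("A", 13);
        estimates.put("B", 8);
        estimates.put("C", 20);
        estimates.put("D", 13);

        if (!hasConsensus(estimates)) {
            System.out.println("Lowest estimate by " + lowestEstimator(estimates)
                    + ", highest by " + highestEstimator(estimates) + ": both explain, then re-estimate.");
        }
    }
}
```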

Integration

In this section we discuss the integration of the new features into the existing product.

The (as of this sprint still rudimentary) Facebook application will be integrated with the API in order to implement basic use cases e.g. logging into the Facebook application. The Facebook application will run on a separate server, so the API calls have to be made via the internet.

The API application will run on servers in the cloud (AWS Elastic Beanstalk). The calls to the API can be done with simple HTTP requests and the API will collect the requested data from the database via sockets.

To test the application, for the moment the Tomcat API application has to be started locally in order for the Facebook application to connect to it.

Retrospective

In this section we will review progress since the last report. This enables us to better learn from our past experiences. Also, by documenting these efforts, we hope to gain feedback from our client, as well as our TU Delft supervisor, which we can then apply to accomplish work of higher quality. Last, but not least, these lessons may be of value to the team which will pick up this project after this internship assignment is over.


This was the first time we used Planning Poker. This led to a gross underestimation of the time required for the tasks. This week we completed only 64% of the tasks planned. This mainly had two causes:

Firstly, we need to gain more experience with Planning Poker and more work experience in order to get better work load estimations.

Secondly, we are still struggling with communication overhead. We have been trying to counteract this for the past two weeks, but our attempts have not been successful enough so far. The main problem is that team members sometimes work individually and are not up to date with the tasks of others (this is why we started using Planning Poker). This can lead to extra work because the activities are not synchronised with each other. Another communication problem arises from the difference in experience between team members. Less experienced team members often need assistance from more experienced ones. When this occurs, one of two things happens: either the experienced team member has to interrupt his task in order to assist, or the team member needing help has to wait for him to become available. Both possibilities are very inefficient.

Next week we will try an entirely new approach to counteract communication overhead: one of the more experienced team members will act as a “help desk”, whom we call the “flying goalie”. Every other member is allowed to interrupt him at any time for assistance, at which point he drops his task to help. This minimises the waiting time for the three other team members when they require assistance. It also means the flying goalie may not receive tasks with a firm deadline or tasks that require a great amount of concentration. The flying goalie will be charged with finishing the reports in the time he is not needed. If no assistance is required and the reports are done, he can peer review the work of others. He is also responsible for guiding the overall integration. We expect this new approach to be more efficient for the team.

We will continue with using Planning Poker for at least another 2 weeks to see if the planning will become more accurate.

Discussion

● Mobile application - We did not finish the work on the API, the database and the Facebook application, so no start was made on the mobile application.
● Setting up the database - Designing the database indexing schemas and learning a whole new technology like Cassandra took more time than expected. Also, many interruptions slowed this task down considerably. This task is not yet finished.
● Authentication and authorization on the API - Shiro requires configuration which was surprisingly hard to set up. Combined with the use of Tomcat the configuration is even harder, and this is the reason we did not complete this task.

When looking at the code examples presented by the Apache Software Foundation and Katasoft (the company responsible for commercial support), it looks like Shiro is actually very easy to use. Coupled with its domain-agnostic approach, it becomes a highly desirable library to use. However, getting it to play nicely with the Tomcat application server required far more work (and frustration) than anticipated.

With this in mind the recommendation is that the steps required to properly set up both Tomcat and Shiro-in-a-web-environment are thoroughly documented (and possibly submitted to the Shiro documentation project so that it may be of use to others) in order to prevent such problems in the future.
● Connecting the Facebook application to the API - When integrating the API with the Facebook application, browsers blocked the calls because of their security settings regarding cross-site scripting. This problem, and the fact that the API is still hosted locally, was the reason we chose to spend our time on other tasks and features.
● Connecting the API to the database - The database is not fully set up yet, so it was not yet possible to create the link between the database and the API.

References

[001] Schwaber, Sutherland, “The Scrum Guide”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”
[FB1] Avinash Lakshman and Prashant Malik, 2010, “Cassandra: a decentralized structured storage system”. SIGOPS Oper. Syst. Rev. 44, 2 (April 2010), 35-40. DOI=10.1145/1773912.1773922 http://doi.acm.org/10.1145/1773912.1773922
[AC1] Cassandra Wiki, “DataModel”, http://wiki.apache.org/cassandra/DataModel, accessed May 11, 2011.
[WP1] Wikipedia, “Row-major order”, http://en.wikipedia.org/wiki/Row-major_order, accessed May 11, 2011.
[CDG] Eben Hewitt, 2010, “Cassandra: The Definitive Guide”, O’Reilly Media, 978-1-449-39041-9.


Progress Report

Sprint #5

May 30, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the progress report of the fifth sprint of the development of the Simlike platform. It is a part of a series of reports that each describe the development process of a single SCRUM sprint. [001] More details about the Simlike platform and the authors can be found in [002]. This report is meant as a reference for the authors and can be used to gain insight in, and analyse, the development process of the Simlike platform.


Table of contents

Summary
Introduction
Progress
Tasks
Features
Technical Design
Quality assurance
Testing plan
API layer test specifications
Development strategy
Retrospective
Discussion
References


Summary

The GUI and API are working and have been integrated. The database is not ready due to setbacks. There are some tasks which depend on the database which could not be done. A test suite has been set up and test cases have been written. The test cases all succeed.


Introduction

This document describes the design and design choices made in the fifth sprint of the development of the Simlike platform. In each sprint a selection of features of the product will be implemented. Often some other tasks will have to be executed before a feature can be implemented. These tasks are listed in section Tasks. The features are listed in section Features and further explained in section Functional Design. These sections can be used as reference during implementation and can later be used as documentation on how the features should be used.

Next the technical design of the features and tasks is discussed in section Technical Design. This can be used as a reference during implementation and can be used by future developers to gain insight in the architecture of the system.

In the section Quality Assurance the measures that were taken to provide quality results are discussed. This is mainly used to reflect upon the development process.

Section Retrospective discusses the progress that was made during this sprint and reflects on the process. It can be used to learn from mistakes that were made and to improve the quality process.

This report ends with a discussion, where problems and noteworthy details are discussed. Any features left unimplemented and the reason why are discussed as well. This section can be used to create a better plan for the next sprint and to gain insight in the difficulty of some features.


Progress

Tasks

The tasks that should be completed this sprint, but that are not features, are listed in this section.

● Set up the general test structure - To make sure the testing is done in a uniform way, the directory structure for testing needs to be set up and appropriate libraries installed.
● Create test cases for the API - The application programming interface (API) to the database was implemented last sprint, but test specifications still have to be made.
● Extend the API to allow updates of user data for the database - The API should have a call that will retrieve information from Facebook for a given user. Also a function has to be made that will update the information of all users that have not been updated for a while.
● GUI layout - The Facebook app has been created in the previous sprint, but the planned layout has not been applied yet. The style and structure of the app must be adjusted to match the layout.

Additionally, the following tasks from the previous sprint still need to be finished:

● Create a database using Cassandra - This includes the setup of indexing schemas and the installation and configuration of Cassandra.
● Create database mapping - from Cassandra to the algorithm and vice versa.
● Link the Facebook application to the API - The Facebook application will use the API to retrieve data from the database.
● Embed authentication and authorization functionality into the API using Apache Shiro - Make the API require users to authenticate themselves (i.e. log in) and then decide, based on their roles, whether or not they have the privilege to perform certain actions.
● Link the Cassandra database layer to the API - Implement the API stubs to make them functional (and therefore unit testable).

Features

In this sprint, the features from the previous sprint are to be finished. Therefore, there are no new features to describe.


Technical Design

In this section the technical design of the features is discussed, as well as the way the new features are merged into the existing product.

Two update functions have been designed: one to update a user's profile information when the user logs in, and one to update the information of all users at once, to be invoked on a regular basis.

When a user logs in, two important steps have to be taken: the profile information to be displayed on the main screen must be gathered, and the profile information in the database must be updated. The first step must be done as quickly as possible, so that the user is kept waiting as briefly as possible. That is why the Facebook app first retrieves profile information from Facebook, retrieves the information that comes solely from Simlike (for now, only the top five simlikes) from the database, and displays both right away. When asking Facebook for information, the app also requests the time of the last update. The answer is passed to the database, which compares it to the last time it updated its own information. If the Facebook information was updated later than the database information, the database updates its information directly from Facebook.
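The comparison at the end of this flow can be sketched as a single predicate. The class and method names are hypothetical, and treating a user without any stored update time as needing a refresh is an assumption that the text does not cover.

```java
import java.time.Instant;

/** Hypothetical sketch of the "is the database copy outdated?" decision made at log in. */
final class ProfileFreshness {

    /**
     * Returns true when Facebook reports an update that is newer than the database copy.
     * A null databaseUpdatedAt means nothing is stored yet (assumption: refresh in that case).
     */
    static boolean needsRefresh(Instant facebookUpdatedAt, Instant databaseUpdatedAt) {
        return databaseUpdatedAt == null || facebookUpdatedAt.isAfter(databaseUpdatedAt);
    }

    public static void main(String[] args) {
        Instant facebookUpdatedAt = Instant.parse("2011-05-28T10:00:00Z"); // sample values
        Instant databaseUpdatedAt = Instant.parse("2011-05-20T10:00:00Z");
        System.out.println(needsRefresh(facebookUpdatedAt, databaseUpdatedAt));
    }
}
```

Because the check is a pure function of two timestamps, it can run after the main screen has been displayed without delaying the user.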

The function that updates the information of all users has to deal with a lot of data. Therefore only users that have not been updated for a while will be updated. These users' database ids, their Facebook ids and the date their information was last updated are loaded into the Java application. Then, for all of these users, the information is retrieved from Facebook, including their likes. These users and likes are stored as JSON objects and written to the database all at once. To achieve this, a machine with a lot of RAM will be rented using Amazon Spot Instances.
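The selection step of this batch update might look like the sketch below. The text does not say how long "a while" is, so the maximum age is a parameter here; the class names and field names are hypothetical and simply mirror the three values the text says are loaded per user.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of choosing which users the batch update should refresh. */
final class StaleUserSelector {

    /** The three values loaded per user according to the design. */
    static final class UserRef {
        final String databaseId;
        final String facebookId;
        final Instant lastUpdated;

        UserRef(String databaseId, String facebookId, Instant lastUpdated) {
            this.databaseId = databaseId;
            this.facebookId = facebookId;
            this.lastUpdated = lastUpdated;
        }
    }

    /** Returns the users whose last update lies more than maxAge before now. */
    static List<UserRef> selectStale(List<UserRef> users, Instant now, Duration maxAge) {
        Instant cutoff = now.minus(maxAge);
        List<UserRef> stale = new ArrayList<>();
        for (UserRef user : users) {
            if (user.lastUpdated.isBefore(cutoff)) {
                stale.add(user);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2011-05-30T00:00:00Z"); // sample data
        List<UserRef> users = List.of(
                new UserRef("db1", "fb1", Instant.parse("2011-05-01T00:00:00Z")),
                new UserRef("db2", "fb2", Instant.parse("2011-05-29T00:00:00Z")));
        for (UserRef user : selectStale(users, now, Duration.ofDays(7))) {
            System.out.println(user.databaseId);
        }
    }
}
```

Passing the current time in as an argument, instead of reading the clock inside the method, keeps the selection deterministic in tests.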


Quality assurance

In this section the team motivates and explains which strategy was used to produce high-quality results. First the testing plan of the features is discussed, and then the overall development strategy for this sprint.

Testing plan

The testing plan of the features of this sprint includes all test specifications and reasons for adding them. A testing plan can be used to gain insight in the test suite, to provide some level of confidence in the correctness of the implementation.

To test the JavaScript functions of the Facebook application, qUnit will be used. This allows us to write unit tests for JavaScript files. To test the API, jUnit will be used in combination with Mockrunner to test the servlets and the database classes without having to start a servlet container or database.

API layer test specifications

The Application Programming Interface (API) layer consists of 2 sub-layers:

1. The API call layer - Provides a way to access the functionality of the API middle layer through simple HTTP requests.
2. The API middle layer - Provides a link between the API call layer and the database API.

To test the API, jUnit will be used in combination with Mockrunner to test the API call layer (servlets) and the API middle layer without having to start a servlet container or database.

Testing the API call layer

The servlets should accept appropriate parameters and return the requested result. We will not create decision tables for combinatorial testing of the servlets, since this takes a significant amount of time. They are relatively trivial and the consequences when they fail will not be critical. We will only apply MC/DC testing to critical parts of the system. When testing, we differentiate between two types of parameters: authentication parameters and API call parameters.

Testing authentication

The first type consists of the authentication parameters. All servlets extend the authentication servlet, so the authentication parameters will be tested on only one servlet. However, multiple ways of authentication (AuthTypes) are possible and should thus be tested.

For authentication testing the following cases will be tested:

Globally:
1. An invalid authentication type (AuthType) - an AuthenticationException should be thrown.
2. Per valid AuthType parameter:
   a. All parameters are given, along with principals that don't exist in the database (e.g. a nonexistent username) - an AuthenticationException should be thrown.
   b. All parameters are given, along with credentials that don't match those in the database (e.g. a wrong password) - an AuthenticationException should be thrown.
   c. A valid case - authentication should succeed.
3. Per authentication parameter:
   a. A missing parameter - a MandatoryKeyMissingException should be thrown. Note that a parameter with an invalid key is equal to this situation with one difference: there is also an extra (unused) parameter present in the request.
   b. Multiple values given for a parameter - a MultipleValuesForParameterException should be thrown.
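Cases 1 and 2 above can be exercised against a small stand-in like the one below. It is not the Shiro-based implementation: the exception is a local stand-in with the same name as in the test cases, the AuthType value "password" is an assumption (the report does not list the available AuthTypes), and the in-memory map replaces the database.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for the authentication step; all names except the exception name are hypothetical. */
final class AuthenticationSketch {

    /** Local stand-in for the AuthenticationException named in the test cases. */
    static final class AuthenticationException extends RuntimeException {
        AuthenticationException(String message) {
            super(message);
        }
    }

    private static final String PASSWORD_AUTH = "password"; // assumed AuthType value

    private final Map<String, String> credentialsByPrincipal; // replaces the database

    AuthenticationSketch(Map<String, String> credentialsByPrincipal) {
        this.credentialsByPrincipal = new HashMap<>(credentialsByPrincipal);
    }

    /** Throws AuthenticationException unless AuthType, principal and credential are all valid. */
    void authenticate(String authType, String principal, String credential) {
        if (!PASSWORD_AUTH.equals(authType)) {
            throw new AuthenticationException("Invalid AuthType: " + authType);     // case 1
        }
        String expected = credentialsByPrincipal.get(principal);
        if (expected == null || !expected.equals(credential)) {
            throw new AuthenticationException("Unknown principal or wrong credential"); // cases 2a, 2b
        }
    }

    /** Convenience for tests: returns true if the attempt is rejected. */
    boolean rejects(String authType, String principal, String credential) {
        try {
            authenticate(authType, principal, credential);
            return false;
        } catch (AuthenticationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        AuthenticationSketch auth = new AuthenticationSketch(Map.of("alice", "secret"));
        System.out.println(auth.rejects("token", "alice", "secret"));    // case 1
        System.out.println(auth.rejects("password", "bob", "secret"));   // case 2a
        System.out.println(auth.rejects("password", "alice", "wrong"));  // case 2b
        System.out.println(auth.rejects("password", "alice", "secret")); // case 2c
    }
}
```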

Testing API calls

The second type consists of the parameters needed for the calls to the API (e.g. a user ID). Once authentication has been tested, we can simply provide correct authentication information to the servlets that implement calls to the API.

The following cases will be tested:
Per parameter:
● A missing parameter - a MandatoryKeyMissingException should be thrown.
● A wrongly named parameter - a MandatoryKeyMissingException should be thrown.
● Multiple values given for a parameter - a MultipleValuesForParameterException should be thrown.
General tests:
● Valid parameters, but the caller is not authorized to request the information - an AuthorizationException should be thrown.
● Valid parameters, but the requested information does not exist - a NoResultException should be thrown.
● Valid case - the requested information should be returned.

Testing the API middle layer

For the classes in the API that provide the link between the database and the call layer, the following cases will be tested:
● An 'invalid' parameter combination (no corresponding information exists in the database) - a NoResultException should be thrown.
● A valid parameter combination - the information in the database should be returned in the correct JSON format.
Since the parameters to these calls are supplied by the API call layer, a parameter can never be null (a missing value already causes a MandatoryKeyMissingException in the call layer) unless null is an intended value. Therefore we will not test null parameters.


Development strategy

The development strategy includes all design decisions (e.g. programming language, design patterns, tools) relevant for this sprint. This can be used to gain insight into the development process and as a reference of design decisions. Any global design decisions as stated in [002] are omitted.

For testing the following new tools are used:
● QUnit - For testing the JavaScript code.
● Mockrunner - To test servlets and classes with database communication without having to start a servlet container or database.

The following Checkstyle checks have been altered or removed:
● The //NOCHECK comment can now be used without a parameter to ignore the line it is on and the line directly following the comment.
● The DeclarationOrder rule, which forces Java class field declarations to be done in a specific order, is removed. As it started to interfere with field declarations that necessarily had to be in a different order, it became apparent that this is an undesirable rule.


Retrospective

In this section we will review progress since the last report. This enables us to better learn from our past experiences. Also, by documenting these efforts, we hope to gain feedback from our client, as well as our TU Delft supervisor, which we can then apply to accomplish work of higher quality. Last, but not least, these lessons may be of value to the team which will pick up this project after this internship assignment is over.

The progress made on the individual tasks is as follows:
● Connecting the Facebook application to the API - We have succeeded in linking the Facebook application to the API. It is now possible to make requests to the API and show the results.
● Styling the Facebook application - The Facebook application's GUI is now in accordance with the design made by Paris Hidden.
● A test suite has been made - The API and Facebook application are now tested.
● Mobile application design - A design for the GUI of the mobile application has been created.
● Authentication and authorization - When using the API, you now have to be authenticated first. Once authenticated, a check is made to see whether you are authorized to make the API call.


Discussion

The mobile application has been given more priority by the client. It is now a must have. To level the workload, the chat of the Facebook application has been moved to could have.

We got a little better at Planning Poker. We finished almost all the tasks we planned for this week. Two tasks were not completed:
● Integrate the API with the database. The database is still not ready. This task suffered from a lot of setbacks. Firstly, we can only work on this in Naaldwijk, where we have a local test server running the database software, limiting our time window for solving this task to two days. Next, installing the Java client to communicate with the database did not succeed because of software dependency problems.
● Retrieve data from Facebook: do it! This feature was not finished because of time limitations.

From Sprint #4, a number of tasks are also still not finished (see attached Gantt chart). Many of these depend on the database, hence the tasks have not been closed yet. Once the database is up and running, we expect to close these tasks quickly.

We implemented a number of test cases; all tests pass.


References

[001] Schwaber, Sutherland, “The Scrum Guide”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”


Progress Report

Sprint #6

June 6, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the progress report of the sixth sprint of the development of the Simlike platform. It is part of a series of reports that each describe the development process of a single Scrum sprint [001]. More details about the Simlike platform and the authors can be found in [002]. This report is meant as a reference for the authors and can be used to gain insight into, and analyse, the development process of the Simlike platform.


Table of contents

Summary ...... 4 Introduction ...... 5 Progress ...... 6 Tasks...... 6 Technical Design ...... 6 API ...... 6 Authentication and Authorization ...... 7 JSON formats ...... 8 Quality assurance ...... 10 Testing plan ...... 10 API test specifications ...... 10 Database test specifications ...... 12 Development strategy ...... 13 Integration ...... 15 Maintenance ...... 17 Retrospective ...... 18 Discussion ...... 20 References ...... 21


Summary

In this sprint we created test specifications and cases for the API and database. More documentation has been created to describe the interactions of the different libraries, classes and modules.

We are now using Maven to quickly compile our project. This means we do not need to put all our dependencies in a directory. It also led to a refined tool chain, using Mockito instead of Mockrunner and Jetty instead of Tomcat.

The database is now complete and can be used, but the communication between the database and the API still needs to be implemented. The communication between applications (like the Facebook application) and the API has been defined, including the URLs to visit and the format of the JSON objects that are returned. Users can now also be authenticated before they use the API.

We had some technical difficulties and planned too many hours, which caused us to fall further behind schedule. To prevent this from happening next time, we will use an explicit feature pool with time indicators per task. This allows us to monitor our progress against the planning more closely, both as a team and as individuals.


Introduction

This document describes the design and design choices made in the sixth sprint of the development of the Simlike platform. In each sprint a selection of features of the product will be implemented. Often some other tasks will have to be executed before a feature can be implemented. These tasks are listed in section Tasks.

Next, the technical design of the features is discussed in section Technical Design. This can be used as a reference during implementation and can be used by future developers to gain insight into the architecture of the system.

In the section Quality Assurance the measures that were taken to provide quality results are discussed. This is mainly used to reflect upon the development process.

Documentation on how to integrate the new features into the product environment and on the maintenance of the application is supplied in sections Integration and Maintenance respectively.

Section Retrospective discusses the progress that was made during this sprint and reflects on the process. It can be used to learn from mistakes that were made and to improve the quality process.

This report ends with a discussion, where problems and noteworthy details are discussed. Any features left unimplemented, and the reasons why, are discussed as well. This section can be used to create a better plan for the next sprint and to gain insight into the difficulty of some features.


Progress

Tasks

The tasks that should be completed this sprint, but that are not features, are listed in this section.
● Documentation - Documentation of the implemented features and structure should be created for future developers.
● Test specifications and test cases - Unfinished test specifications should be completed and new specifications should be made for the new features. These specifications should be implemented in test cases.
● Database - The database should be finished, using Cassandra for scalable storage.
● Database communication - The API should be able to communicate with the database.
● Test data - Test data should be created by retrieving information of Facebook users and storing it in the database.
● Authentication and Authorization - Both authentication and authorization should be finished.

Technical Design

In this section the technical design of the features is discussed, as well as the way the new features are merged into the existing product.

API

The API calls have been changed to allow for better authorization. The new list of calls is as follows (the corresponding JSON formats are listed in section JSON formats):
● searchMatching/automaticMatching - These calls accept the Id of the user that wants to be matched and return a list of BasicUsers. A BasicUser contains less information than a normal User and is visible to all Simlike users. This list of BasicUsers is called a MatchingResult. searchMatching also needs a list of ContextItem names to search matching users for.
● simpleOneToOneMatching/extendedOneToOneMatching - These calls accept two user Ids and return, respectively, the number of simlikes the users have in common as a MatchNumber or the actual shared simlikes as a SimlikeCollection.
● getBasicUserInformation/getAllUserInformation - These calls accept the Id of the user whose information should be returned and return the result as a BasicUser or Profile respectively. Users can use privacy settings to keep their profile information hidden from other users. If a user is not authorized to view the complete profile, a call to getBasicUserInformation can be made to obtain basic information that is public for every user.
● DECLARED CLASSIFIED BY NERVAL LIMITED
● getTopSimlikes - This call accepts the Id of the user whose top simlikes should be returned and returns the requested simlikes as a SimlikeCollection.
● updateUser - This call accepts the Id of a user and a Facebook token to access his/her information as inputs and updates the local data of the user from Facebook.

The API call layer consists of the Servlets that make the API calls accessible to others over the net. The API calls getSimlike and getTopSimlikes will not yet be implemented in Servlets.

Authentication and Authorization

For authentication Shiro is used. In Shiro, authentication uses a combination of principals and credentials. The principals identify the user and the credentials prove that he or she is indeed the user that was specified.

Shiro allows the creation of realms. A realm is an object corresponding with a data source, against which a user can be authenticated [REALMS]. Therefore realms can be used for both authentication and authorization.

More fine-grained authorization can be achieved by using roles. A role is a set of permissions, and users can be assigned one or more roles. This way you can explicitly state what users can and cannot do.

All API calls require authentication, to be sure we are dealing with someone that is allowed to see the information. After authentication is done by the AbstractSimlikeServlet, control will be handed over to the actual API call servlet. This servlet will be supplied with the Subject representing the caller.

Some API calls require special permissions to assure the privacy of other users. In the AbstractSimlikeServlet, after authentication, the first authorization happens based on the Roles of the caller. The following Roles and their permissions exist:
● Admin - An administrator has all rights, but API calls will be logged so that it can be checked whether the admin is misusing any privileges.
● Person - A person is the basic user of Simlike. The following permissions are given to all Persons:
  ○ automaticMatching
  ○ searchMatching
  ○ simpleOneToOneMatching
  ○ getBasicUserInformation
● Background Updater - The background updater is responsible for updating user information (either in batch jobs or for single users).
  ○ updateUser
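The role table above can be read as a simple lookup from role to permitted calls. The following is a minimal sketch of that idea only, not the Shiro configuration used in the project: the enum and class names are hypothetical, only the role-gated calls listed above are included, and the logging of admin calls is left out.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical names; only the roles and the calls they may make come from the report.
enum Role { ADMIN, PERSON, BACKGROUND_UPDATER }

enum ApiCall {
    AUTOMATIC_MATCHING, SEARCH_MATCHING, SIMPLE_ONE_TO_ONE_MATCHING,
    GET_BASIC_USER_INFORMATION, UPDATE_USER
}

final class RolePermissionsSketch {
    private static final Map<Role, Set<ApiCall>> PERMISSIONS = new EnumMap<>(Role.class);

    static {
        // An administrator has all rights.
        PERMISSIONS.put(Role.ADMIN, EnumSet.allOf(ApiCall.class));
        PERMISSIONS.put(Role.PERSON, EnumSet.of(
                ApiCall.AUTOMATIC_MATCHING, ApiCall.SEARCH_MATCHING,
                ApiCall.SIMPLE_ONE_TO_ONE_MATCHING, ApiCall.GET_BASIC_USER_INFORMATION));
        PERMISSIONS.put(Role.BACKGROUND_UPDATER, EnumSet.of(ApiCall.UPDATE_USER));
    }

    // A caller is permitted if at least one of its roles grants the call.
    static boolean isPermitted(Set<Role> callerRoles, ApiCall call) {
        for (Role role : callerRoles) {
            if (PERMISSIONS.get(role).contains(call)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, RolePermissionsSketch.isPermitted(EnumSet.of(Role.PERSON), ApiCall.UPDATE_USER) is false, while the same call with Role.BACKGROUND_UPDATER is true.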

A user can restrict access to his/her simlikes and date of birth by adjusting privacy settings. The following privacy settings are possible:
● Public - When the information is public, all authenticated Simlike users will be able to access the information.
● Friends only - Information for friends only will only be displayed to the user him-/herself and to the user's Facebook friends.
● Friends of friends - When information is accessible to friends of friends, it can only be viewed by the user him-/herself, the user's friends and their friends.
● Private - Private information can only be accessed by the user who owns the information.

There are a few exceptions to these rules. As stated before, administrators have access to the information. Also, a user can request access to the information of another user. If the request is granted, the user will be able to view the information until the permission is revoked.

These restrictions mean that whether a caller is allowed to perform the calls extendedOneToOneMatching and getAllUserInformation depends on the caller, the given user Id and the privacy settings of the user corresponding to the given user Id. Therefore it is not possible to authorize callers by using general Roles. To solve this, there will be an extra authorization check in the servlets that implement these two API calls.

Once control is handed to the servlets they are responsible for extra authorization. This can be achieved by using the Subject that was passed to them by the AbstractSimlikeServlet. The authorization checks are then delegated to the PermissionChecker, which determines whether a Subject is authorized based on the Subject and the user Id of the user whose profile is requested.
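The privacy rules that such a check has to evaluate can be sketched as follows. This is an illustration under stated assumptions, not the project's PermissionChecker: it uses in-memory maps instead of a Shiro Subject and the database, all names are hypothetical, a user without a stored setting is treated as private, and granted access requests are left out.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

enum PrivacySetting { PUBLIC, FRIENDS_OF_FRIENDS, FRIENDS_ONLY, PRIVATE }

final class PermissionCheckerSketch {
    private final Map<String, Set<String>> friendsByUser;        // user Id -> Ids of friends
    private final Map<String, PrivacySetting> settingByUser;     // user Id -> privacy setting

    PermissionCheckerSketch(Map<String, Set<String>> friendsByUser,
                            Map<String, PrivacySetting> settingByUser) {
        this.friendsByUser = friendsByUser;
        this.settingByUser = settingByUser;
    }

    // Decides whether the caller may view the protected information of the owner.
    boolean mayView(String callerId, boolean callerIsAdmin, String ownerId) {
        if (callerIsAdmin || callerId.equals(ownerId)) {
            return true;
        }
        PrivacySetting setting = settingByUser.get(ownerId);
        if (setting == null) {
            return false; // assumption: no stored setting is treated as private
        }
        switch (setting) {
            case PUBLIC:
                return true;
            case FRIENDS_ONLY:
                return isFriend(callerId, ownerId);
            case FRIENDS_OF_FRIENDS:
                return isFriend(callerId, ownerId) || isFriendOfFriend(callerId, ownerId);
            default:
                return false; // PRIVATE
        }
    }

    private Set<String> friendsOf(String userId) {
        Set<String> friends = friendsByUser.get(userId);
        return friends == null ? Collections.<String>emptySet() : friends;
    }

    private boolean isFriend(String callerId, String ownerId) {
        return friendsOf(ownerId).contains(callerId);
    }

    private boolean isFriendOfFriend(String callerId, String ownerId) {
        for (String friend : friendsOf(ownerId)) {
            if (friendsOf(friend).contains(callerId)) {
                return true;
            }
        }
        return false;
    }
}
```

Because the outcome depends on both the caller and the owner of the information, a check of this shape cannot be expressed as a general Role, which is why it is performed per call.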

JSON formats

The results of API calls are represented as JSON objects. In this section the formats of these objects are listed.

● MatchingResult
  { "users" : [ {BasicUser}, …, {BasicUser} ] }

● BasicUser
  {
    "uId" : "Id of the user",
    "firstname" : "first",
    "lastname" : "last",
    "age" : "age of the user",
    "photo" : "URL to photo",
    "location" : "city",
    "matching_simlikes" : number of matching simlikes (as a number)
  }

● Profile
  {
    "uId" : "Id of the user",
    "firstname" : "first",
    "lastname" : "last",
    "age" : "age of the user",
    "photo" : "URL to photo",
    "date_of_birth" : "YYYYMMDD",
    "location" : "city",
    "simlikes" : [ {simlike}, {simlike}, {simlike}, {simlike} ],
    "top_simlikes" : [ {simlike}, {simlike}, {simlike} ],
    "matching_simlikes" : [ {simlike}, {simlike}, {simlike} ]
  }

● ContextItem
  {
    "cId" : "Id of the ContextItem",
    "name" : "name of the ContextItem",
    "photo" : "URL to photo"
  }

● SimlikeCollection
  { "simlikes" : [ {simlike}, {simlike}, {simlike} ] }

● simlike
  A simlike is just a ContextItem, so the JSON format is the same. See [DEFS] for more information about the difference between a simlike and a context item.

● MatchNumber
  { "simlikes" : number of matching simlikes }

● Error
  { "errors" : { "error classname" : "error message", …, "error classname" : "error message" } }
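As an illustration of the two smallest formats, the sketch below renders a MatchNumber and an Error object as JSON text. It is a hand-written example with hypothetical class and method names, not the project's serialisation code, and it escapes only double quotes and backslashes inside strings.

```java
import java.util.Map;

final class JsonFormatSketch {
    // Wraps a string in double quotes, escaping only '"' and '\' (minimal escaping).
    static String quote(String text) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : text.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append('"').toString();
    }

    // MatchNumber: the number of matching simlikes.
    static String matchNumber(int matchingSimlikes) {
        return "{\"simlikes\": " + matchingSimlikes + "}";
    }

    // Error: maps each error class name to its message, in iteration order of the map.
    static String errors(Map<String, String> messageByClassName) {
        StringBuilder sb = new StringBuilder("{\"errors\": {");
        boolean first = true;
        for (Map.Entry<String, String> entry : messageByClassName.entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(quote(entry.getKey())).append(": ").append(quote(entry.getValue()));
            first = false;
        }
        return sb.append("}}").toString();
    }
}
```

Under these assumptions, matchNumber(3) yields {"simlikes": 3}, and an error map with the single entry NoResultException / "No user with Id 42" yields {"errors": {"NoResultException": "No user with Id 42"}}.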


Quality assurance

This section motivates the strategy that was used to ensure high quality. First the testing plan of the features is discussed, then the overall development strategy for this sprint.

Testing plan

The testing plan of the features of this sprint includes all test specifications and the reasons for adding them. A testing plan can be used to gain insight into the test suite and to provide some level of confidence in the correctness of the implementation.

API test specifications

The middle layer of the API consists of the 'helper' class RequestParameterExtractor and the two database communication classes Matcher and PrivateDatabaseAPI. The test specifications are split along these two categories.

Helper classes test specifications

RequestParameterExtractor extracts parameters from a HttpServletRequest. The following cases will be tested:
From a request with multiple parameters:
● Request a non-existent parameter - Throw a MandatoryKeyMissingException.
● Request multiple non-existent parameters - Throw a MandatoryKeyMissingException.
● Request a non-existent parameter and an existing one - Throw a MandatoryKeyMissingException.
● Request an existing parameter - Return the corresponding value.
● Request multiple existing parameters - Return the corresponding values.
● Request a parameter with multiple values - Throw a MultipleValuesForParameterException.
From a request without any parameters:
● Request a parameter - Throw a MandatoryKeyMissingException.
● Request multiple parameters - Throw a MandatoryKeyMissingException.
From a request with one parameter:
● Request a non-existent parameter - Throw a MandatoryKeyMissingException.
● Request an existing parameter - Return the corresponding value.
● Request a parameter with multiple values - Throw a MultipleValuesForParameterException.
● Request a non-existent parameter and an existing one - Throw a MandatoryKeyMissingException.
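The behaviour these cases describe can be illustrated with a small stand-alone sketch. It is not the project's RequestParameterExtractor: it works on the Map<String, String[]> shape that HttpServletRequest.getParameterMap() returns so that no servlet container is needed, the class and method names are hypothetical, and the two exception classes are unchecked stand-ins that share only their names with the project's classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-ins for the project's exception classes; only the names come from the report.
class MandatoryKeyMissingException extends RuntimeException {
    MandatoryKeyMissingException(String key) {
        super("Missing mandatory parameter: " + key);
    }
}

class MultipleValuesForParameterException extends RuntimeException {
    MultipleValuesForParameterException(String key) {
        super("Multiple values for parameter: " + key);
    }
}

final class ParameterExtractorSketch {
    private final Map<String, String[]> parameters;

    ParameterExtractorSketch(Map<String, String[]> parameters) {
        this.parameters = parameters;
    }

    // Returns the single value of a mandatory parameter.
    String extract(String key) {
        String[] values = parameters.get(key);
        if (values == null || values.length == 0) {
            throw new MandatoryKeyMissingException(key);
        }
        if (values.length > 1) {
            throw new MultipleValuesForParameterException(key);
        }
        return values[0];
    }

    // Returns the values of several mandatory parameters, in the requested order.
    // Fails as soon as one of them is missing or has multiple values.
    List<String> extractAll(String... keys) {
        List<String> result = new ArrayList<>();
        for (String key : keys) {
            result.add(extract(key));
        }
        return result;
    }
}
```

Each bullet above then corresponds to one call on such an object, for example extract("uId") on a request that contains uId once, twice or not at all.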

Database communication classes test specifications

PrivateDatabaseAPI is responsible for the retrieval of 'simple' data. More specifically, it can retrieve users and simlikes.
For retrieving a user the following cases will be tested:
● Request a valid user as BasicUser - The user's basic information should be returned as a BasicUser.
● Request a valid user as Profile - The user's information should be returned as a Profile.
● Request a non-existent user as BasicUser - A NoResultException should be thrown.
● Request a non-existent user as Profile - A NoResultException should be thrown.
For retrieving context items and top simlikes the following cases will be tested:
● Request a valid context item - The simlike's information should be returned as a ContextItem.
● Request a non-existent context item - A NoResultException should be thrown.
● Request a valid user's top simlikes - The simlikes should be returned as a SimlikeCollection.
● Request a non-existent user's top simlikes - A NoResultException should be thrown.

The Matcher class handles the more intricate data retrieval. It is responsible for executing the matching algorithms and retrieving the data from the database.
For oneToOneMatching the following cases will be tested:
● Request simple matching of two valid users - The number of corresponding simlikes should be returned as a MatchNumber.
● Request extended matching of two valid users - The list of corresponding simlikes should be returned as a SimlikeCollection.
● Request simple matching of two non-existent users - A NoResultException should be thrown.
● Request extended matching of two non-existent users - A NoResultException should be thrown.
● Request simple matching of a valid and a non-existent user - A NoResultException should be thrown.
● Request extended matching of a valid and a non-existent user - A NoResultException should be thrown.
For automaticMatching the following cases will be tested:
● Request matching for a valid user - Return the results as a MatchingResult.
● Request matching for a non-existent user - A NoResultException should be thrown.
● Request matching for a valid user who has no simlikes - A NoResultException should be thrown.
● Request matching for a valid user who has no simlikes in common with any other users - A NoResultException should be thrown.
For searchMatching the following cases will be tested:
● Request a search started by a non-existent user - A NoResultException should be thrown.
● Request a search started by a valid user with no search criteria (cNames) - A NoResultException should be thrown.
● Request a search started by a valid user with a non-existent cName - A NoResultException should be thrown.
● Request a search started by a valid user with non-existent cNames - A NoResultException should be thrown.
● Request a search started by a valid user with existing cNames - The users that were found should be returned as a MatchingResult.
● Request a search started by a valid user with existing and non-existent cNames - The result should be returned as if the non-existent cNames were never supplied.
● Request a search started by a valid user with an existing cName, but no possible matches - A NoResultException should be thrown.
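For illustration, the one-to-one matching cases above can be read as a set intersection. The sketch below is an assumption about the intended behaviour, not the project's Matcher: it uses an in-memory map instead of Cassandra, identifies simlikes by name, and all class and method names (including the unchecked NoResultException stand-in) are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Stand-in for the project's exception class; only the name comes from the report.
class NoResultException extends RuntimeException {
    NoResultException(String message) {
        super(message);
    }
}

final class OneToOneMatcherSketch {
    private final Map<String, Set<String>> simlikesByUser; // user Id -> names of the user's simlikes

    OneToOneMatcherSketch(Map<String, Set<String>> simlikesByUser) {
        this.simlikesByUser = simlikesByUser;
    }

    private Set<String> simlikesOf(String userId) {
        Set<String> simlikes = simlikesByUser.get(userId);
        if (simlikes == null) {
            throw new NoResultException("No such user: " + userId);
        }
        return simlikes;
    }

    // Extended matching: the simlikes both users have in common (sorted for stable output).
    Set<String> extendedOneToOneMatching(String firstUserId, String secondUserId) {
        Set<String> common = new TreeSet<>(simlikesOf(firstUserId));
        common.retainAll(simlikesOf(secondUserId));
        return common;
    }

    // Simple matching: only the number of simlikes both users have in common.
    int simpleOneToOneMatching(String firstUserId, String secondUserId) {
        return extendedOneToOneMatching(firstUserId, secondUserId).size();
    }
}
```

Both variants look up both users first, so a request involving a non-existent user fails with a NoResultException in either case, as the specifications require.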

Authentication test cases:
1. An invalid authentication type (AuthType) - an AuthenticationException should be thrown.
2. For every valid AuthType parameter:
   a. All required parameters are given, with principals that don't exist in the database (e.g. a non-existent username) - an AuthenticationException should be thrown.
   b. All required parameters are given, with credentials that don't exist in the database (e.g. a non-existent password) - an AuthenticationException should be thrown.
   c. A valid case (i.e. both principals and credentials are correct, along with any other mandatory parameters) - authentication should succeed and the request should be handled by the servlet.
3. Per authentication parameter:
   a. A missing parameter - a MandatoryKeyMissingException should be thrown. Note that a parameter with an invalid key is equal to this situation with one difference: there is also an extra (unused) parameter present in the request.
   b. Multiple values given for a parameter - a MultipleValuesForParameterException should be thrown.
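Cases 1 and 2 above can be illustrated with the following sketch. It is not the Shiro-based implementation: the AuthType values are hypothetical (the report does not list them), credentials are compared as plain strings from an in-memory map purely to keep the example short, and AuthenticationException is an unchecked stand-in that shares only its name with the project's class.

```java
import java.util.Map;

// Stand-in for the project's exception class; only the name comes from the report.
class AuthenticationException extends RuntimeException {
    AuthenticationException(String message) {
        super(message);
    }
}

// Hypothetical AuthType values; the report does not list the real ones.
enum AuthType { SIMLIKE, FACEBOOK }

final class AuthenticationSketch {
    // In-memory stand-in for the data source behind a realm:
    // per AuthType, a map from principal to the expected credentials.
    private final Map<AuthType, Map<String, String>> credentialsByType;

    AuthenticationSketch(Map<AuthType, Map<String, String>> credentialsByType) {
        this.credentialsByType = credentialsByType;
    }

    // Returns normally when authentication succeeds, throws otherwise.
    // The authTypeName is assumed to be non-null (a missing parameter is a separate case).
    void authenticate(String authTypeName, String principal, String credentials) {
        AuthType authType;
        try {
            authType = AuthType.valueOf(authTypeName);
        } catch (IllegalArgumentException e) {
            throw new AuthenticationException("Invalid AuthType: " + authTypeName);
        }
        Map<String, String> known = credentialsByType.get(authType);
        String expected = known == null ? null : known.get(principal);
        if (expected == null) {
            throw new AuthenticationException("Unknown principal");
        }
        if (!expected.equals(credentials)) {
            throw new AuthenticationException("Invalid credentials");
        }
    }
}
```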

Database test specifications

The database reads its configuration from an INI file. There is a CassandraConfig class which handles this. The following test cases are defined for this class:
● Test loading the default configuration. This is loaded from a configuration file.
● Test loading an alternative configuration. This is loaded from the same configuration file, but from within another section.
● Test loading a non-existent configuration section. Should throw an IllegalStateException.
● Test loading a configuration with an illegal parameter. Should throw an IllegalArgumentException.
● Test connecting to a keyspace and disconnecting again. This should work.
● Test loading all possible configuration settings all at once. No exceptions should be thrown if valid values are given.
● Test loading the configuration file when the file is not found. Should throw a FileNotFoundException.
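The configuration-loading cases can be illustrated with a small INI reader. This is a sketch under assumptions, not the CassandraConfig class: the setting names are invented for the example, an "illegal parameter" is interpreted here as an unknown key, and connecting to a keyspace is left out entirely.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IniConfigSketch {
    // Hypothetical setting names; the report does not list the real ones.
    private static final Set<String> KNOWN_KEYS =
            new HashSet<>(Arrays.asList("hosts", "keyspace", "max_active"));

    // Returns the settings of one section of the given INI lines.
    static Map<String, String> parse(Iterable<String> lines, String section) {
        Map<String, String> settings = new LinkedHashMap<>();
        boolean sectionFound = false;
        boolean inSection = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith(";") || line.startsWith("#")) {
                continue; // blank line or comment
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                inSection = line.substring(1, line.length() - 1).trim().equals(section);
                sectionFound |= inSection;
                continue;
            }
            if (!inSection) {
                continue; // setting of another section
            }
            int eq = line.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Not a key=value line: " + line);
            }
            String key = line.substring(0, eq).trim();
            if (!KNOWN_KEYS.contains(key)) {
                throw new IllegalArgumentException("Illegal parameter: " + key);
            }
            settings.put(key, line.substring(eq + 1).trim());
        }
        if (!sectionFound) {
            throw new IllegalStateException("No such configuration section: " + section);
        }
        return settings;
    }

    // Reads the file and parses one section; FileReader throws a
    // FileNotFoundException (a subclass of IOException) if the file does not exist.
    static Map<String, String> load(String path, String section) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return parse(lines, section);
    }
}
```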

Development strategy

The development strategy includes all design decisions (e.g. programming language, design patterns, tools) relevant for this sprint. This can be used to gain insight into the development process and as a reference of design decisions. Any global design decisions as stated in [002] are omitted.

This week we switched from the build tool Ant to the build tool Maven. Initially we chose Ant because all team members are familiar with it. However, we reached the point where managing the dependency hierarchy of the project was no longer feasible with Ant. We currently have 42 dependencies. Maven relieves us of doing the dependency management manually. We also integrated Checkstyle, PMD, javadoc, Emma, unit testing and code duplication reports in our build. This gives us a central point where we can gather our reports.

We used Mockrunner to test the servlets, but this did not work with Maven. Therefore we are now using Mockito instead to perform the servlet tests. Mockrunner allowed us to run a Servlet by extending a special class that made it possible to set up the HttpServletRequest and read the output of the Servlet by using the class's methods. This made it extremely easy to test Servlets.

Mockito does not offer the same functionality. Instead, it allows us to mock the HttpServletRequest and HttpServletResponse by specifying the way some of their methods should react. This meant that we had to build the framework to set up the mock HttpServletRequest parameters and get the output of the mock HttpServletResponse ourselves. By creating our own special class we mimicked Mockrunner's behaviour to keep changes to the existing test cases to a minimum.

We integrated a feature in javadoc: UML class diagrams are automatically created and integrated in the javadoc. We are planning to integrate our sequence diagrams in the javadoc as well. Currently we draw the sequence diagrams by hand, which is easier and faster when communicating with each other. These drawings will be digitized and included in the javadocs and in the TDD. To generate the UML diagrams in the javadoc, we use an external library called Graphviz.

For local web server testing, we switched from Tomcat to Jetty. Jetty has superior support for Maven, which accelerates the development significantly. For example, the Jetty Maven plugin detects changes in the source code and recompiles the application on-the-fly, enabling us to see the changes immediately while developing. This is preferable to having to compile manually with each change, which takes more effort and time. During integration testing we will have to test the application with Tomcat as well, since Tomcat is used by our hosting platform AWS Elastic Beanstalk.


Maven also provides support for integration testing with Cassandra. We installed a Cassandra Maven plugin in our project, which provides a local Cassandra server for every developer. This allows for basic testing, such as testing database communication used for authentication and authorisation. DECLARED CLASSIFIED BY NERVAL LIMITED

We are also exploring how to use the Maven Failsafe Plugin. This plugin enforces that the integration tests must succeed in order to deploy through Maven. This acts, as the plugin name suggests, as a failsafe for deploying tested software.

Another plugin we are exploring is the Versions plugin. This plugin alerts us whenever a new version of a dependency is available, ensuring we can develop and test with the most up-to-date software. This reduces the chance that known bugs in outdated third-party software lead to security risks in the application.


Integration

In this section the integration of the new features into the existing product is discussed.

For Facebook integration we use RestFB, a Java client for Facebook. It provides support for fetching JSON objects containing user data from Facebook. Another supported feature is the Facebook Query Language (FQL), which can be used to query data from Facebook using an SQL-like query language.

For the integration with Cassandra (database), we use Hector. Hector is a Java client library for connecting to and communicating with Cassandra. It offers failover and load balancing functionality.

Shiro is integrated into the current product to enable authentication and authorization. The servlets of the API extend the AbstractSimlikeServlet, which uses the parameters given via the HTTP request in combination with Shiro to authenticate the caller and check if he/she is authorized. For more specific, individual authorization the servlet itself should perform the authorization check by using the PermissionChecker. To authenticate a subject, the principals and credentials given in the HTTP request are compared to those stored in the database via the SimlikeRealm.

The GUI for the Facebook application will be hosted on a Content Delivery Network (CDN). A CDN offers low latency, which results in a real-time loading experience of the GUI. The API will be hosted on AWS Elastic Beanstalk, where it will be deployed as a Web application ARchive (WAR) file.

The Facebook application (and possibly other applications) is linked to the API via simple HTTP requests. The GUI links the user to a URL which contains the call to issue, along with the parameters. For example, a user with ID 12345 searching for “soccer” would be linked to [API URL]/searchMatching?uId=12345&cNames=soccer&[AUTHENTICATION PARAMETERS]. The API would then respond by making a request to the database and returning the results as a JSON list to the GUI. The GUI can then process this list and display it to the user. The communication is visualised in the following flow diagram:

[Figure: flow diagram of the communication between the GUI, the API and the database]
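How a client could assemble such a request URL is sketched below. The helper class and the base URL used in the example are hypothetical, and the authentication parameters are left out because the report does not spell them out; the sketch only shows that parameter names and values should be URL-encoded before they are appended.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

final class ApiUrlSketch {
    // URL-encodes a single query component as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Builds "<apiBaseUrl>/<call>?<key>=<value>&..." in the iteration order of the map.
    static String buildCallUrl(String apiBaseUrl, String call, Map<String, String> parameters) {
        StringBuilder url = new StringBuilder(apiBaseUrl).append('/').append(call);
        char separator = '?';
        for (Map.Entry<String, String> parameter : parameters.entrySet()) {
            url.append(separator)
               .append(encode(parameter.getKey()))
               .append('=')
               .append(encode(parameter.getValue()));
            separator = '&';
        }
        return url.toString();
    }
}
```

With an insertion-ordered map containing uId=12345 and cNames=soccer, and the placeholder base URL https://api.example.invalid, this yields https://api.example.invalid/searchMatching?uId=12345&cNames=soccer.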

Maintenance

When new social media are integrated in Simlike, new authorisation functionality will have to be added as well. To make this work, two new components have to be added to the security mechanism. A new AuthenticationToken and a new AuthorizingRealm have to be implemented for the new social media integration. For an example of how this works, see com.simlike.api.auth.tokens.Simtoken and com.simlike.api.auth.realms.SimlikeRealm, respectively.

Hector is designed to be able to connect to another database server in the cluster if the first connection attempt fails. If new database servers are added to the cluster, the configuration in “api/src/main/resources/cassandra.ini” has to be updated accordingly. It is also possible to fine-tune the other Hector settings in this configuration file.

As described in section Integration, the API returns JSON objects. When, for some reason, the format of these objects needs to be changed, it can be done in the PrivateDatabaseAPI class. The current formats are listed in template files in the src/test/resources directory. These templates can be used for reference and are also used for testing. Therefore these templates should also be changed when changes to the JSON format are made.


Retrospective

In this section we will review progress since the last report. This enables us to better learn from our past experiences. Also, by documenting these efforts, we hope to gain feedback from our client, as well as our TU Delft supervisor, which we can then apply to accomplish work of higher quality. Last, but not least, these lessons may be of value to the team which will pick up this project after this internship assignment is over.

This week we encountered several setbacks. The first setback was caused by Eclipse. Due to a bug in the IDE certain libraries would not load. This meant we could not run our software. It took some time to figure out that the IDE was the cause. Initially it looked like Tomcat was the culprit, so we tried out an alternative for Tomcat, namely Jetty. This worked like a charm. Once we tried to integrate Jetty in our IDE, the same errors came back, indicating this was an Eclipse problem. We solved this by running Jetty through Maven, which has the same benefits as running it with the IDE. This took an entire day (Full Time Equivalent) to fix.

Another issue came up with the hard disk of one of the team members. We had to replace the hard disk. After reinstalling, which took almost the entire day, the new hard disk started to fail and eventually crashed the same day. This took 1.5 days FTE out of our planning.

We also encountered a hard-to-solve bug in our authentication mechanism. This took half a day FTE to debug and solve.

With all these problems, we spent 3 days FTE less on our planned tasks than we had anticipated (out of the 20 days FTE we had at our disposal).

In sprint 4 an explicit list of tasks and features (representing the feature pool for the sprint) was made. This list was used during the planning poker meeting and for reference by both the project team and the client. The use of this list was not planned and so we didn't use it in sprint 5. However, the client missed the list, since it gave him a good overview of what we were doing and how we were progressing. It also provided more structure to the planning phase and made it easier for the project team to find out what to do next. Therefore this list of tasks and features with the planned time per feature will be used every sprint from now on. The tasks are also listed per team member. This way each team member will have an easy overview of where he is now and what has to be done. For an example of the list, see [LIJST].

This week, the list indicated that everyone had about 35 hours of tasks planned (one team member had less). It turned out that 35 hours of tasks is not feasible in a work week, because time is not spent 100% efficiently. It is more realistic to plan 25-30 hours of tasks per week per team member. In total we planned 122.5 hours of tasks this week, while realistically we only had 100-120 hours at our disposal. This could indicate we had a task surplus of 0 - 2.5 days FTE in our planning this week. To counteract this, we will aim to have around 110 hours of tasks per week.

All in all, quantifying the discrepancies in our planning reveals that we would have needed an extra team member to complete all the work this week. Hence quite a few tasks were not finished.


Discussion

Several tasks were not finished (completely) this week:
● Integrate database with API (requires DB). Due to the setbacks at the start of the week, the database was ready later than expected. This task depends on the database, hence it started later and is not yet finished.
● Create documentation (TDD). All documentation is present, though it has not yet been combined entirely in the TDD. The documentation is currently spread out over the concept TDD, several progress reports and paper notes.
● Create test specifications: authorisation. This task has not been started yet.
● Create test specifications: authentication. This task has not been started yet.
● Create test specifications: JavaScript. This task is halfway done. All the specifications are done, but exist only on paper. They still need to be put in the TDD.
● Implement test cases: authorisation. The test specs for these test cases have not yet been devised, hence they are not yet implemented.
● Implement test cases: authentication. The test specs for these test cases have not yet been devised, hence they are not yet implemented.
● Implement test cases: database. All specs have been implemented, except for test spec “Test loading configuration file when file is not found”.
● Create test data. The application to gather test data is ready. Now we need volunteers (besides ourselves) to donate their data.
● Complete authorization. The implementation is halfway ready. The business logic is ready; we only need to implement the authorisation data source.


References

[001] Schwaber, Sutherland, “The Scrum Guide”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”
[REALMS] Apache, “Realm | Apache Shiro”, http://shiro.apache.org/realm.html, accessed on May 30, 2011
[DEFS] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Definitions”
[LIJST] Lanting, “Feature Backlog Sprint 6”


Progress Report

Sprint #7

June 10, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the progress report of the seventh sprint of the development of the Simlike platform. It is a part of a series of reports that each describe the development process of a single SCRUM sprint [001]. More details about the Simlike platform and the authors can be found in [002]. This report is meant as a reference for the authors and can be used to gain insight in, and analyse, the development process of the Simlike platform.


Table of contents

Summary
Introduction
Progress
Tasks
Technical Design
API
Authentication
Authorization
Quality assurance
Testing plan
Facebook Application Test
API test specifications
Development strategy
Maintenance
Retrospective
Discussion
References


Summary

In this sprint the documentation has been brought up to date and the code has been tested and checked for style errors.

The communication between the API and the database has been implemented. As a result, project ADEL has been tested and can communicate with Facebook, a user, our API and (through the API) the database. However, we have not yet gathered much data with it.

We finished a lot of tasks that were not completed in earlier sprints, but we did not manage to finish some tasks planned for this sprint. The most important of these is the creation of a global class diagram.


Introduction

This document describes the design and design choices made in the seventh sprint of the development of the Simlike platform. In each sprint a selection of features of the product will be implemented. Often some other tasks will have to be executed before a feature can be implemented. These tasks are listed in section Tasks.

Next the technical design of the features is discussed in section Technical Design. This can be used as a reference during implementation and can be used by future developers to gain insight in the architecture of the system.

In the section Quality Assurance the measures that were taken to provide quality results are discussed. This is mainly used to reflect upon the development process.

Documentation on the maintenance of the application is supplied in section Maintenance.

Section Retrospective discusses the progress that was made during this sprint and reflects on the process. It can be used to learn from mistakes that were made and to improve the quality process.

This report ends with a discussion, where problems and noteworthy details are discussed. Any features left unimplemented and the reason why are discussed as well. This section can be used to create a better plan for the next sprint and to gain insight in the difficulty of some features.


Progress

Tasks

The tasks that should be completed this sprint but are not features are listed in this section.

The new tasks for this sprint are:
● Change the code for exception handling - Exceptions should be logged and not reported back to the application, to improve security.
● Finish the TDD to reflect all changes - The TDD should be up to date to avoid having to rewrite it entirely.
● Solve PMD warnings and errors - Like Checkstyle, PMD should become a tool that we use every time we write code. We haven’t so far, so we need to catch up on all the PMD errors and warnings.
● Complete project ADEL - The Facebook application that retrieves a user’s information and stores it in our database should be complete at the end of the sprint.
● Create test specifications and cases - The new code should be tested and any lack of coverage should be justified.
● Complete authentication - The authentication process should be completed.
● Complete authorization - The current plan for authorization needs to be approved by all team members and implemented.
● Update Facebook application - The Facebook application should be able to communicate with the API that was changed last sprint and it should be able to pull the current user’s data from Facebook.
● Send code to SIG - The deadline for sending the code to the Software Improvement Group (SIG) is June 14, but we want to send it a little earlier to receive a response sooner.
● Visit Zef to discuss the use of MOBL in our project - MOBL is a mobile programming language that is being developed at the TU Delft. It looks like we could use it to develop the mobile application. We will visit the author to discuss the possibilities.

Other tasks that have been re-planned for this sprint:
● TDD: global class diagram - Create a global class diagram for an overview of all the Java classes. By inspecting the connections between classes, it could be possible to identify design complications.
● TDD: Facebook app - Add the Facebook application to the TDD.
● Integrate database with API (requires DB).
● Create documentation (TDD) - Gather all the documentation in one document.
● Create test specifications: authorisation.
● Create test specifications: authentication.
● Create test specifications: JavaScript.
● Implement test cases: authorisation.
● Implement test cases: authentication.
● Implement test cases: database.
● Create test data - Test data is needed in order to run the algorithm.
● Create backup of test data - Back up the test data so we can always revert to a working version of the database.
● Complete implementation of authorization.

Technical Design

In this section the technical design of the features is discussed, as well as the way the new features are merged into the existing product.

API

The name of a user id used to be uId. This is now called simid when referring to the user id used by Simlike and fbid when referring to the user id used by Facebook. This changes the names of the parameters of the API calls. Every parameter with uId in its name is now called simid instead.

Some parameters of API calls were redundant and a possible security risk. These parameters refer to the simid of the user that made the API call. This simid can be obtained from the authentication principals and is therefore not needed as a separate parameter.

The following API calls were added:
● oneToOneMatching - This call takes a simid as parameter. If the caller is authorized to access extendedOneToOneMatching, a call will be made to it, passing the simid parameter along. Otherwise a call to simpleOneToOneMatching will be made.
● getUserInformation - This call takes a simid as parameter. If the caller is authorized to access getAllUserInformation, a call will be made to it, passing the simid parameter along. Otherwise a call to getBasicUserInformation will be made.
● authenticate - This call takes a Facebook id (fbid) and a Facebook token (fbtoken) as parameters and returns either the (simid, simtoken) pair that corresponds to the given information or (if not present) creates a new (simid, simtoken) pair and returns it. The answer will be given as a SimAuth format JSON object. It is implemented to also work with any Realms that we might create later on and can be used by applications to get the required simid and simtoken so they can access the other API calls.
● getTopSimlikes - This call accepts a simid as parameter and returns the corresponding user’s top five simlikes from the database. It returns them as a SimlikeCollection.
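The dispatching behaviour of oneToOneMatching can be illustrated with a minimal sketch. It is not the project's servlet code: the Caller class is a hypothetical stand-in for the Shiro subject, the permission name is assumed to equal the name of the call, and the returned strings only indicate which call would be forwarded to.

```java
import java.util.Set;

/** Hypothetical stand-in for the authenticated caller; the real check goes through Shiro. */
final class Caller {
    private final Set<String> permissions;

    Caller(Set<String> permissions) {
        this.permissions = permissions;
    }

    boolean isPermitted(String permission) {
        return permissions.contains(permission);
    }
}

final class MatchingDispatcher {
    private MatchingDispatcher() { }

    /** Returns a description of the call that oneToOneMatching forwards to. */
    static String oneToOneMatching(Caller caller, String simid) {
        // Assumption: the permission is named after the call it protects.
        if (caller.isPermitted("extendedOneToOneMatching")) {
            return "extendedOneToOneMatching(" + simid + ")";
        }
        // Callers without the extended permission fall back to the simple variant.
        return "simpleOneToOneMatching(" + simid + ")";
    }
}
```

getUserInformation would follow the same pattern, choosing between getAllUserInformation and getBasicUserInformation.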

JSON formats

The following formats were added or changed:
● SimAuth
{
  "authsimid" : "simid that can be used for authentication",
  "simtoken" : "simtoken that can be used for authentication"
}
● Error
{
  "error" : "error message"
}
● Every “uid” has been replaced with a “simid”.

Authentication

Authentication is the act of verifying whether a user is who he says he is. The user supplies a (principals, credentials) pair, where the principals are the information that identifies the user and the credentials are the information that can be used to check whether the user is ‘telling the truth’. For example, a name can be a principal (although it should be a unique identifier in Simlike) and a passport can be the credentials.

Facebook

When a user supplies his Facebook token (fbtoken) and id (fbid) to the AuthenticationServlet and requests authentication on the FacebookRealm (type = Facebook), Shiro will ask the FacebookRealm for the AuthenticationInfo that belongs to the (fbid, fbtoken) pair. The realm asks Facebook for the id (principals) belonging to the fbtoken (credentials) and returns (fbid, fbtoken, FacebookRealm) as AuthenticationInfo if the given principals match those obtained from Facebook. Otherwise an AuthenticationException is thrown. If the information is returned, the user is authenticated on the FacebookRealm and the information can be used to generate a simtoken.

Simlike

When a user supplies his simid and simtoken along with an API call, the AbstractSimlikeServlet will ask Shiro to authenticate the user at the SimlikeRealm with a (simid, simtoken) pair. Shiro will call the SimlikeRealm’s doGetAuthenticationInfo() method. This method will use the simtoken to read the simid corresponding to this simtoken from the database. This work is done by the method SimlikeController.read(). When the returned simid matches the given simid, the user is authenticated and the (simid, simtoken, SimlikeRealm) AuthenticationInfo is returned.
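The token check of the SimlikeRealm can be sketched as follows. This is an illustration under assumptions, not the actual realm: an in-memory map stands in for the database lookup done by SimlikeController.read(), the class and method names are hypothetical, and a boolean result replaces the AuthenticationInfo that Shiro expects.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the (simid, simtoken) check; not the project's SimlikeRealm. */
final class SimlikeAuthenticator {
    // Stand-in for the database: maps a simtoken to the simid it was issued for.
    private final Map<String, String> simidByToken = new HashMap<>();

    /** Registers a (simid, simtoken) pair, as the authenticate call would after a Facebook login. */
    void store(String simid, String simtoken) {
        simidByToken.put(simtoken, simid);
    }

    /** The caller is authenticated only if the token is known and belongs to the claimed simid. */
    boolean authenticate(String simid, String simtoken) {
        String storedSimid = simidByToken.get(simtoken);
        return storedSimid != null && storedSimid.equals(simid);
    }
}
```

Looking up the simid by token and comparing it with the supplied simid means that a valid token cannot be used to act as a different user.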

Authorization

Authorization is the act of checking whether a user (once authenticated) has permission to do what he is trying to do. A driver’s license is an example of authorization. If you own a driver’s license you have the permission (you are authorized) to drive a car.

Shiro offers the isPermitted() method to check whether a user has a given permission. When this call is made, Shiro calls the doGetAuthorizationInfo() method on the realms where the user is authenticated and supplies it with the principals of the user.


Facebook

The FacebookRealm’s authorization is currently not used. Users only authenticate on this realm to obtain a simtoken. All users therefore have the implicit permission to request a simtoken.

Simlike

When doGetAuthorizationInfo() of the SimlikeRealm is called, it loads the Role corresponding to the given principals. A Role can be viewed as a set of permissions, so if a user has a certain Role, he has all the permissions that belong to that Role. These Roles and Permissions are returned in the AuthorizationInfo and used by Shiro to determine whether the user is authorized or not.

The Roles provide a way to distribute permissions amongst large groups of users. However, to implement privacy settings that control the access to a user’s profile data, individual permissions are needed. Let’s say user A requests the profile of user B. This is only allowed if A has the permission “access:profile:B”. To dynamically generate the individual permission, the simid of B is added to the principals of A. This way the doGetAuthorizationInfo() method of a Realm can ask the PermissionChecker whether A has permission to access B’s profile. If A has permission, the actual permission can be added and the AuthorizationInfo can be returned. Shiro will do the rest, so a simple call to isPermitted() can be used in the servlet.
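A minimal sketch of this dynamic permission generation is given below. It does not use Shiro: the two maps are hypothetical stand-ins for the Role data and for the PermissionChecker, and the permission string “call:getTopSimlikes” used in the usage note is invented for illustration. Only the “access:profile:B” format is taken from the text above.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: role permissions plus a per-request profile permission. */
final class ProfilePermissions {
    private final Map<String, Set<String>> rolePermissions; // simid -> permissions from the Role
    private final Map<String, Set<String>> allowedViewers;  // profile owner -> simids allowed to view

    ProfilePermissions(Map<String, Set<String>> rolePermissions,
                       Map<String, Set<String>> allowedViewers) {
        this.rolePermissions = rolePermissions;
        this.allowedViewers = allowedViewers;
    }

    /** Mirrors doGetAuthorizationInfo(): the target simid travels along with the caller's principals. */
    Set<String> permissionsFor(String callerSimid, String targetSimid) {
        Set<String> permissions = new HashSet<>(
                rolePermissions.getOrDefault(callerSimid, Collections.<String>emptySet()));
        Set<String> viewers =
                allowedViewers.getOrDefault(targetSimid, Collections.<String>emptySet());
        if (viewers.contains(callerSimid)) {
            // The individual permission is only added when the privacy check passes.
            permissions.add("access:profile:" + targetSimid);
        }
        return permissions;
    }

    /** Mirrors the isPermitted() call made in the servlet. */
    boolean isPermitted(String callerSimid, String targetSimid, String permission) {
        return permissionsFor(callerSimid, targetSimid).contains(permission);
    }
}
```

For example, if B allows A as a viewer, isPermitted("A", "B", "access:profile:B") holds, while any other caller is refused the same permission.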


Quality assurance

This section explains and motivates the strategy that was used to produce high-quality results. First the testing plan of the features is discussed, then the overall development strategy for this sprint.

Testing plan

The testing plan of the features of this sprint includes all test specifications and the reasons for adding them. A testing plan can be used to gain insight into the test suite and to provide some level of confidence in the correctness of the implementation.

Facebook Application Test

The top layer, the application itself, also requires testing. For this purpose, the team has used qUnit to create unit tests for the JavaScript functions.

At start-up:
● Check the ‘content’ div - Content should be empty at start-up.

Showing and hiding content:
● Call the hideContent() function - Check if the menu and content are actually hidden after calling the function.
● Call the showContent() function - Check if the menu and content are made visible after calling the function.

Searching:
● Request an empty search - If no query is specified, the system should not perform a full search.
● Request a valid search - There must be some result if the query is valid, so we expect the content section to be updated. Additionally, the currentList must be updated.

Viewing a profile by index:
● Request an invalid index - If an index is specified out of the array’s bounds, then an error should occur.
● Request a valid index - If a valid index is specified, we expect the content section to be updated.

Viewing a profile by user ID:
● Request an invalid ID - If a non-existing ID is specified, then an error should occur.
● Request a valid ID - If a valid ID is specified, we expect the content section to be updated. The currentList must also be updated.

Automatching (takes no parameters):
● Perform the automatching - After automatching, we expect the content section and the currentList to be updated.

Viewing one’s own profile:
● Request the profile - Check to see if the information from the profile is correctly displayed.

API test specifications

GetAllUserInformation

The following cases have been added:
● Request a user’s information when the caller is not authorized - An AuthorizationException should be thrown and a general error printed to the http response.

ExtendedOneToOneMatching
● Request a user’s information when the caller is not authorized - An AuthorizationException should be thrown and a general error printed to the http response.

GetTopSimlikes
● An authorized caller requests the simlikes of an existing user - The simlikes of the user corresponding to the simid should be returned.
● A user that is not authorized to view the simlikes makes the API call - An AuthorizationException should be thrown, causing a standard error to be printed to the http response.
● No simid is supplied to the call - A MandatoryKeyMissingException should be thrown, causing a standard error to be printed to the http response.
● A call is made with a wrong name for the simid parameter - A MandatoryKeyMissingException should be thrown, causing a standard error to be printed to the http response.
● A call is made with multiple values for the simid - A MultipleValuesForParameterException should be thrown, causing a standard error to be printed to the http response.
● The simid of a non-existent user is supplied - A NoResultException should be thrown, causing a standard error to be printed to the http response.

ProfilePrivacySettings

Users can restrict access to their simlikes with privacy settings. The following cases are tested:

For each setting:
● A user requesting access to his own data - Access should be granted.
● A random user that is neither friend nor friend of friend requesting access - Only with public information should access be granted, otherwise it should be denied.

For privacy settings set to friends:
● A friend requesting access - Access should be granted.
● A friend of a friend requesting access - Access should be denied.

For privacy settings set to friends of friends:
● A friend requesting access - Access should be granted.
● A friend of a friend requesting access - Access should be granted.
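The expected outcomes of these privacy test cases can be summarised as a small decision function. The sketch below is an assumption-laden illustration: the enum names PrivacySetting and Relation are invented here, and only the three settings named in the test cases (public, friends, friends of friends) are modelled.

```java
/** Hypothetical names for the three settings mentioned in the test cases. */
enum PrivacySetting { PUBLIC, FRIENDS_OF_FRIENDS, FRIENDS }

/** Hypothetical relation between the requesting user and the owner of the data. */
enum Relation { SELF, FRIEND, FRIEND_OF_FRIEND, STRANGER }

final class PrivacyCheck {
    private PrivacyCheck() { }

    /** Returns true when the test specifications expect access to be granted. */
    static boolean mayAccess(PrivacySetting setting, Relation relation) {
        switch (relation) {
            case SELF:
                return true; // a user can always access his own data
            case FRIEND:
                return true; // friends pass under every modelled setting
            case FRIEND_OF_FRIEND:
                return setting != PrivacySetting.FRIENDS; // denied only when limited to friends
            default:
                return setting == PrivacySetting.PUBLIC; // strangers only see public information
        }
    }
}
```

Each bullet above corresponds to one (setting, relation) pair, so the test cases can be read as a table over this function.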

AuthenticationType

While this is a core class within the simlike code, the team found that only a specific set of code paths was covered, so these test specs were devised:
● Get an AuthenticationType using a valid lowercase name. Should yield the corresponding AuthenticationType.
● Get an AuthenticationType using a valid uppercase name. Should yield the corresponding AuthenticationType.
● Get an AuthenticationType using an invalid name. Should yield an IllegalArgumentException.
● Get a new valid RememberMeAuthenticationToken object from a request. Should yield the token object.
● Get a new RememberMeAuthenticationToken object using an invalid principal request parameter key. Should yield a MandatoryKeyMissingException.
● Get a new RememberMeAuthenticationToken object using an invalid credential request parameter key. Should yield a MandatoryKeyMissingException.
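The first three test specifications describe a case-insensitive lookup by name. A minimal sketch of such a lookup is shown below. The enum name AuthenticationTypeSketch, its constants and the method fromName are assumptions made for illustration, not the project's AuthenticationType.

```java
import java.util.Locale;

/** Hypothetical sketch of a case-insensitive lookup; constants are assumed from the two realms. */
enum AuthenticationTypeSketch {
    FACEBOOK,
    SIMLIKE;

    /**
     * Resolves a type from its name regardless of case.
     * Enum.valueOf throws IllegalArgumentException when no constant matches.
     */
    static AuthenticationTypeSketch fromName(String name) {
        // Locale.ROOT keeps the conversion independent of the server's default locale.
        return valueOf(name.toUpperCase(Locale.ROOT));
    }
}
```

With this approach, "facebook" and "FACEBOOK" both resolve to the same constant, and an unknown name fails with the IllegalArgumentException that the third test specification expects.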

AbstractSimlikeServlet / Authentication in general:
● Non-existent authsimid: should yield a MandatoryKeyMissingException.
● Non-existent simtoken: should yield a MandatoryKeyMissingException.
● Multiple values for authsimid: should yield a MultipleValuesForParameterException.
● Multiple values for simtoken: should yield a MultipleValuesForParameterException.
● Non-existent authsimid and simtoken: should yield a MandatoryKeyMissingException.
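The parameter checks behind these test cases can be sketched as a single helper. This is an illustration, not the code of AbstractSimlikeServlet: the two exception classes are re-declared here (as unchecked exceptions, which is an assumption) so the example is self-contained, and the helper name single is hypothetical. The Map<String, String[]> shape matches what the Servlet API's getParameterMap() returns.

```java
import java.util.Map;

/** Re-declared for this sketch; assumed to be unchecked. */
class MandatoryKeyMissingException extends RuntimeException {
    MandatoryKeyMissingException(String key) {
        super("Missing mandatory parameter: " + key);
    }
}

/** Re-declared for this sketch; assumed to be unchecked. */
class MultipleValuesForParameterException extends RuntimeException {
    MultipleValuesForParameterException(String key) {
        super("Multiple values for parameter: " + key);
    }
}

final class RequestParameters {
    private RequestParameters() { }

    /** Returns the single value of a mandatory parameter, or throws if it is absent or repeated. */
    static String single(Map<String, String[]> parameters, String key) {
        String[] values = parameters.get(key);
        if (values == null || values.length == 0) {
            throw new MandatoryKeyMissingException(key);
        }
        if (values.length > 1) {
            throw new MultipleValuesForParameterException(key);
        }
        return values[0];
    }
}
```

A misspelled parameter name is simply an absent key, which is why the test specifications expect a MandatoryKeyMissingException in that case as well.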

Simtoken & FacebookToken:

One test class has been created to test both token classes. We test each class’ equals method as follows:
● Same object: should return true.
● Different ID: should return false.
● Different chars: should return false.
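An equals method that satisfies these three cases could look like the sketch below. The class SketchToken is hypothetical and its fields are assumed (an id plus a char array holding the token); it is not the project's Simtoken or FacebookToken.

```java
import java.util.Arrays;

/** Hypothetical token with an id and a char array, as assumed from the test cases. */
final class SketchToken {
    private final String id;
    private final char[] chars;

    SketchToken(String id, char[] chars) {
        this.id = id;
        this.chars = chars; // stored as-is, so the caller can wipe the array after use
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true; // same object
        }
        if (!(other instanceof SketchToken)) {
            return false;
        }
        SketchToken that = (SketchToken) other;
        // Arrays.equals compares array contents; chars.equals(...) would compare references.
        return id.equals(that.id) && Arrays.equals(chars, that.chars);
    }

    @Override
    public int hashCode() {
        return 31 * id.hashCode() + Arrays.hashCode(chars);
    }
}
```

The hashCode override is included because equal objects must have equal hash codes in Java.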

New calls have been added to the API since last sprint. We will test these calls as follows:

UpdateUser

This class takes no additional parameters, other than the simlike ID and the simlike token. It assumes the user is already logged in. The following test cases have been specified for it:
● Update as a new user - A new entry should be created for this user in the database.
● Update as an existing user - The existing data should be overwritten in the database.
● Update as an up-to-date user - The database should not be changed.
● Supply no parameters - A MandatoryKeyMissingException should be thrown and a standard Error should be returned.
● Supply a multivalued parameter - A MultipleValuesForParameterException should be thrown and a standard Error should be returned.


AuthenticationServlet
● Authenticate a new Facebook user - A new simlike account should be created for this user and stored in the database. The simlike id and token should be returned.
● Authenticate a Facebook user with an existing simlike account - The matching simlike account should be retrieved from the database and returned along with a newly generated token.
● Authenticate a Facebook user with a wrong specified Realm - An AuthenticationException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with a non-existent specified Realm - An AuthenticationException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with no specified Realm - A MandatoryKeyMissingException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with multiple Realms - A MultipleValuesForParameterException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with wrong credentials - An AuthenticationException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with no credentials - A MandatoryKeyMissingException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with non-existent credentials - An AuthenticationException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with multiple credentials - A MultipleValuesForParameterException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with non-existent fbid - An AuthenticationException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with wrong fbid - An AuthenticationException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with no fbid - A MandatoryKeyMissingException should be thrown, causing a StandardJsonError to be written.
● Authenticate a Facebook user with multiple simids - A MultipleValuesForParameterException should be thrown, causing a StandardJsonError to be written.
● Request authentication with a false API parameter name - A MandatoryKeyMissingException should be thrown, causing a StandardJsonError to be written.
● Request authentication without parameters - A MandatoryKeyMissingException should be thrown, causing a StandardJsonError to be written.

OneToOneMatching


● Supply no parameters - A MandatoryKeyMissingException should be thrown, causing an Error to be written.
● Supply multiple values for the parameter - A MultipleValuesForParameterException should be thrown, causing an Error to be written.
● Request matching when the caller is authorized for extendedOneToOneMatching - A SimlikeCollection should be returned.
● Request matching when the caller is not authorized for extendedOneToOneMatching - A MatchNumber should be returned.

GetUserInformation
● Supply no parameters - A MandatoryKeyMissingException should be thrown, causing an Error to be written.
● Supply multiple values for the parameter - A MultipleValuesForParameterException should be thrown, causing an Error to be written.
● Request user information when the caller is authorized for getAllUserInformation - A Profile should be returned.
● Request user information when the caller is not authorized for getAllUserInformation - A BasicUser should be returned.

Development strategy

The development strategy includes all design decisions (e.g. programming language, design patterns, tools) relevant for this sprint. This can be used to gain insight into the development process and as a reference for design decisions. Any global design decisions as stated in [002] are omitted.

The following PMD rules have been removed this sprint:
● UnusedPrivateField: PMD emits a warning when a private field is unused in the code. While this is usually a helpful rule, PMD is unable to recognize that fields annotated with Lombok’s @Getter annotation are, in fact, used. This generates quite a few unjustified warnings.
● DataFlowAnomalyAnalysis: This generates an Info (which is just as annoying as a Warning) whenever a local variable is reassigned a value (which is, technically speaking, necessary to properly clean up a resource such as a PrintWriter).
● LongVariable: Generates a warning whenever a variable name is considered “excessively long”. However, we do not agree with PMD’s definition of “excessively”.
● AvoidFinalLocalVariable: This is in contradiction with another PMD rule which suggests that unaltered local variables should be final. The team decided it sends a good signal to others reading the code if a local variable that is not reassigned is marked as final.
● AbstractClassWithoutAbstractMethod: There are legitimate cases in the simlike code where an abstract class has no abstract methods.
● SingularField:
● UseLocaleWithCaseConversions:
● BeanMembersShouldSerialize:
● ArrayIsStoredDirectly: In Simtoken, for example, a char array is stored as-is by the constructor. This is done because String objects are immutable and therefore no guarantees can be given about when they will be cleaned up by the garbage collector. This presents a security risk.
● MethodReturnsInternalArray: This is analogous to ArrayIsStoredDirectly, as it mostly concerns getters for the internal char array object in Simtoken.
● AvoidDuplicateLiterals: While it is good practice to extract multiple relevant (otherwise identical) strings to a class or local constant, getting warnings about trivial strings is counterproductive.
● UnusedImports: This rule is redundant as the Eclipse JDT plugin already generates a warning for unused imports.

We are now using slf4j to log events. This means all exceptions that are expected are caught and logged.
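The principle of logging an exception while returning only a generic error to the caller can be sketched as follows. To keep the example self-contained, java.util.logging is used as a stand-in for slf4j. The class SafeApiCall, the method respond and the text of the generic error message are assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper: details go to the log, the caller only sees a generic error. */
final class SafeApiCall {
    // java.util.logging stands in for slf4j in this sketch.
    private static final Logger LOG = Logger.getLogger(SafeApiCall.class.getName());

    /** Generic error in the Error JSON format; the message text is an assumption. */
    static final String GENERIC_ERROR = "{ \"error\" : \"The request could not be processed\" }";

    private SafeApiCall() { }

    static String respond(Callable<String> call) {
        try {
            return call.call();
        } catch (Exception e) {
            // The stack trace and message stay on the server side.
            LOG.log(Level.WARNING, "API call failed", e);
            return GENERIC_ERROR;
        }
    }
}
```

Because the exception message never reaches the response, internal details such as class names or database errors are not exposed to the application.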


Maintenance

In this section the maintenance of the created features is discussed. Most important in this respect are the database, the Facebook application and the API.

Database

New column families can be added to the existing keyspace at any time. Due to the flexible nature of Cassandra, extra columns can be added on the fly. Cassandra does not enforce a strict table schema; this is the designer’s responsibility.

The schema is defined in src/cassandra/cli/load.script. This script is loaded in a local instance of Cassandra for running unit tests and integration tests. Any changes to the database schema should be reflected here as well to be sure all test cases pass.

Facebook application

Changes to the layout of the Facebook application can be made in the CSS files. When the API calls change, these changes should be reflected in the JavaScript file script.js, where the API calls are made. If more (key, value) pairs are added to the JSON objects that are passed to the Facebook application or the current names of the keys change, the templates in the HTML file should be changed accordingly.

API

Changes to the front end of the API can be made by adding, removing or editing the servlets. When the functionality changes, the test cases should be changed accordingly. Any changes to the JSON formats that are used should also be made in the corresponding template files. Changes to the database schema mean the corresponding controllers in the API will also need to be changed.

Retrospective

In this section we will review progress since the last report. This enables us to better learn from our past experiences. Also, by documenting these efforts, we hope to gain feedback from our client, as well as our TU Delft supervisor, which we can then apply to accomplish work of higher quality. Last, but not least, these lessons may be of value to the team which will pick up this project after this internship assignment is over.

This sprint we finished a lot of work left over from previous sprints. Apparently we planned too many features in those sprints. Experience has shown us that 25-30 hours of planned tasks per team member per week is the maximum.


The development process has started to become more structured during this sprint. Test cases were made for new functionality, and bugs were found and solved because of them. Also, the constant remodelling of the API’s interface should finally be over. We learned that we need to plan more ‘concept meetings’ where the whole team discusses the approach that will be taken for a specific feature. This leaves us with a better understanding of the situation and of what should be taken into account, and therefore leads to less remodelling of already implemented features.

We ran into some more unplanned problems this sprint and learned that a lot of tasks depend on others. This can lead to people not being able to continue with their task until a problem with another task is solved. In these cases it is best if two people deal with the problem. Having a list of small possible tasks next to the sprint’s feature pool can make it easier for the other team members to find a new task to work on when their other tasks are blocked by the problem. The tasks on this list need to be small, so that once the problem is solved the others can finish what they are working on and switch back to their original tasks without losing much time.

Discussion

Some tasks were not completed:
● TDD: global class diagram - We use UMLGraph to generate our class diagrams automatically. It adds the diagrams to the javadoc. It is, however, not able to generate a global class diagram containing all classes (at least, not using the Maven plugin). For now, we will fall back on making the global diagram manually. We will do this once all classes are present to save time, because such a diagram requires intensive maintenance.
● Send code to SIG - We could not send the code this week because the non-disclosure agreement (NDA) between Nerval Limited and SIG has yet to be drawn up. Once the NDA is ready, we will send in the code. The deadline has been extended for this purpose.
● Visit Zef to discuss the use of MOBL in our project - Zef was not available this week.
● Project ADEL and test data - Project ADEL (the application that retrieves user data and stores it in our database) is functionally complete. However, its appearance should be changed to give a more professional impression. Also, the format in which the data is stored could be refined to save time later on. Once these tasks are completed we can ask people to use project ADEL. Until then we only have test data for our test cases, not for our algorithms.

It is also important to note that some members will not be present every day during the next sprint. This means our planning should be kept small, and it might be a good idea to take another look at the global planning to see how we can make up for this loss of manpower.


References

[001] Schwaber, Sutherland, “The Scrum Guide”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”


Progress Report

Sprint #8

June 17, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the progress report of the eighth sprint of the development of the Simlike platform. It is a part of a series of reports that each describe the development process of a single SCRUM sprint [001]. More details about the Simlike platform and the authors can be found in [002]. This report is meant as a reference for the authors and can be used to gain insight in, and analyse, the development process of the Simlike platform.

Table of contents

Summary
Introduction
Progress
    Tasks
    Features
    Functional Design
    Technical Design
        API
        Database
Quality assurance
    Testing plan
    Development strategy
Retrospective
Discussion
References

Summary

In this sprint, project ADEL and the ability to select, change and retrieve top interests were finished. In addition, all database mocks were replaced by real implementations. Only matching, chat and like-clustering remain for the Facebook application.

From this sprint on we will start working on separate tasks in teams of two people. This way we hope to be able to implement the last features faster and be in time for the deadline.

Introduction

This document describes the design and design choices made in the eighth sprint of the development of the Simlike platform. In each sprint a selection of features of the product will be implemented. Often some other tasks will have to be executed before a feature can be implemented. These tasks are listed in section Tasks. The features are listed in section Features and further explained in section Functional Design. These sections can be used as reference during implementation and can later be used as documentation on how the features should be used.

Next, the technical design of the features is discussed in section Technical Design. This can be used as a reference during implementation and can be used by future developers to gain insight into the architecture of the system.

In the section Quality Assurance the measures that were taken to provide quality results are discussed. This is mainly used to reflect upon the development process.

Section Retrospective discusses the progress that was made during this sprint and reflects on the process. It can be used to learn from mistakes that were made and to improve the quality process.

This report ends with a discussion of problems and noteworthy details. Any features left unimplemented, and the reasons why, are discussed as well. This section can be used to create a better plan for the next sprint and to gain insight into the difficulty of some features.

Progress

Tasks

The tasks that should be completed this sprint, but that are not features, are listed in this section.
● UpdateUser should work in accordance with the database schema - To achieve this, the user info that is retrieved from Facebook should be extracted and put into the right columns in the database.
● UpdateUser should work without parameters - The standard authentication parameters are the only parameters necessary for this call, since the Facebook id and token should be retrieved from our database.
● Make project ADEL look more trustworthy - Project ADEL (an application to retrieve a Facebook user’s data) should be made to look a bit more professional. It should also contain a little information about what we will do with the user’s data, and it should give feedback when the user presses the button to share his/her information with us.
● Visit Zef to discuss MOBL - We need to know whether we can use MOBL for our project or not.

Features

In this section the features that were planned for implementation in this sprint are discussed. These features are briefly explained in the form of user stories.
● F22 Top Simlikes - It is now possible to add, remove and retrieve top simlikes.

Functional Design

In this section the features that were planned for this sprint are discussed in more detail in the form of use cases. These use cases can be used to derive test specifications and as a reference during implementation. For the Top Interests, the following use cases have been made:

Use case 1
Summary: The user chooses one of his simlikes as a top simlike.
Situation: The user is logged on.
Step 1: The system shows the user the main screen. At the top of the screen are the user’s top simlikes.
Step 2: The user presses the Edit button.
Step 3: The system unfolds a menu with the user’s simlikes.
Step 4: The user drags the simlike of his choice to one of the five positions.
Step 5: The user presses the Save button.
Step 6: The system saves the top simlikes in the database and removes the menu.
Result: The user’s simlike is displayed as a top simlike.

Use case 2
Summary: The user removes one of his top simlikes.
Situation: The user is logged on.
Step 1: The system shows the user the main screen. At the top of the screen are the user’s top simlikes.
Step 2: The user presses the Edit button.
Step 3: The system unfolds a menu with the user’s simlikes.
Step 4: The user drags the top simlike of his choice outside of its frame.
Step 5: The system removes the top simlike from the frame.
Step 6: The user presses the Save button.
Step 7: The system saves the top simlikes in the database and removes the menu.
Result: The top simlike’s frame is empty.

Use case 3
Summary: The user undoes the changes he made to his top simlikes.
Situation: The user has made changes as in the first two use cases, but has not saved yet.
Step 1: The user presses the Cancel button.
Step 2: The system reverts to the old top simlikes and removes the menu.
Result: The top simlikes are restored to how they were before the user pressed the Edit button.

Technical Design

In this section the technical design of the features is discussed, as well as the way the new features are merged into the existing product.

API

The following API calls have been added or changed:
● getTopSimlikes(simid) - This call accepts a Simlike user id and returns the top simlikes of the corresponding user if the caller is authorized to view them. The simlikes are returned as a SimlikeCollection. Top simlikes that are not set are represented as an empty JSON object.
● setTopSimlikes(ciids) - This call accepts a list of context item ids as input. The length of this list should be equal to the number of top simlikes a user can have. The top simlikes of the caller will be set to the given ids. An empty top simlike can be represented by an empty string in ciids. The call returns the simlikes that were just set as a SimlikeCollection. Top simlikes that are not set are represented as an empty JSON object.
● UpdateUser() - This call retrieves the Facebook data of the caller and updates the data in our database. A BasicUser (without matches) and a SimlikeCollection are returned, containing the information of the user.

Database

The top1, top2, ... columns in the User column family contain references (super column names) to the actual simlikes in the UserDataExternal column family. When the top1 column exists in the database, it should point to the first top simlike of the user. The top columns are not required. If top2 does not exist, the API assumes the user hasn’t defined his/her second top simlike yet. When getTopSimlikes is called, these unset top simlikes are represented by empty JSON objects in the SimlikeCollection. An example of the result of a call to getTopSimlikes when top simlike 2 is not set and the maximum number of top simlikes is 3 is as follows:

{"simlikes": [{"ciid":"id", "name":"name", "photo":"url"}, {}, {"ciid":"id", "name":"name", "photo":"url"}]}

It is therefore possible to set the top1 and top3 columns without creating a top2 column.
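The handling of unset top columns can be illustrated with a minimal Java sketch. It is not the project's implementation: the class name, the in-memory map standing in for a row of the User column family, and the hand-rolled JSON rendering (which does no escaping) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical. */
class TopSimlikesJsonSketch {

    /**
     * Renders the top simlike slots of one user.
     *
     * @param columns stand-in for the user's row: "top1", "top2", ... mapped to
     *                the already resolved simlike fields; a missing key means
     *                the column does not exist
     * @param maxTop  the maximum number of top simlikes a user can have
     */
    static String render(Map<String, Map<String, String>> columns, int maxTop) {
        List<String> slots = new ArrayList<>();
        for (int i = 1; i <= maxTop; i++) {
            Map<String, String> simlike = columns.get("top" + i);
            // A missing column becomes an empty JSON object, keeping slot positions stable.
            slots.add(simlike == null ? "{}" : toJsonObject(simlike));
        }
        return "{\"simlikes\":[" + String.join(",", slots) + "]}";
    }

    // Naive rendering without escaping; sufficient for the illustration only.
    private static String toJsonObject(Map<String, String> fields) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            pairs.add("\"" + e.getKey() + "\":\"" + e.getValue() + "\"");
        }
        return "{" + String.join(",", pairs) + "}";
    }
}
```

Because every slot is always emitted, a client can rely on the position in the array to know which top simlike it is looking at.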

Quality assurance

This section explains and motivates the strategy that was used to produce a high-quality result. First the testing plan of the features is discussed, then the overall development strategy for this sprint.

Testing plan

The testing plan of the features of this sprint includes all test specifications and the reasons for adding them. A testing plan can be used to gain insight into the test suite and to provide some level of confidence in the correctness of the implementation.

GetTopSimlikes

The following cases are tested:
● Supply no parameter - A MandatoryKeyMissingException should be thrown, causing a standard error to be written to the HTTP response.
● Supply a wrongly named parameter - A MandatoryKeyMissingException should be thrown, causing a standard error to be written to the HTTP response.
● Supply a parameter with multiple values - A MultipleValuesForParameterException should be thrown, causing a standard error to be written to the HTTP response.
● Test a valid request when the caller is not authorized - An AuthorizationException should be thrown, causing a standard error to be written to the HTTP response.
● Test a valid case - The top simlikes of the user should be returned as a SimlikeCollection.
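The first three cases above amount to a check that a mandatory parameter is present exactly once. A minimal sketch of such a check is given below. The exception names come from the test cases, but their definitions here, the helper class and the method name are assumptions. The parameter map has the shape returned by the servlet API's getParameterMap().

```java
import java.util.Map;

// Stand-in definitions; the project's real exception classes are not shown in this report.
class MandatoryKeyMissingException extends RuntimeException {
    MandatoryKeyMissingException(String key) { super("Missing parameter: " + key); }
}

class MultipleValuesForParameterException extends RuntimeException {
    MultipleValuesForParameterException(String key) { super("Multiple values for: " + key); }
}

/** Illustrative sketch only; class and method names are hypothetical. */
class ParameterSketch {

    /** Returns the single value of a mandatory request parameter. */
    static String single(Map<String, String[]> params, String key) {
        String[] values = params.get(key);
        // Covers both "no parameter" and "wrongly named parameter".
        if (values == null || values.length == 0) {
            throw new MandatoryKeyMissingException(key);
        }
        if (values.length > 1) {
            throw new MultipleValuesForParameterException(key);
        }
        return values[0];
    }
}
```

A servlet could catch these exceptions in one place and translate them into the standard error response.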

SetTopSimlikes

The following cases are tested:
● Supply no parameter - A MandatoryKeyMissingException should be thrown, causing a standard error to be written to the HTTP response.
● Supply a wrongly named parameter - A MandatoryKeyMissingException should be thrown, causing a standard error to be written to the HTTP response.
● Supply a parameter with fewer ciids than the number of top simlikes - An IndexOutOfBoundsException should be thrown, causing a standard error to be written to the HTTP response.
● Supply a parameter with more ciids than the number of top simlikes - The extra ciids will simply be ignored; the top simlikes of the caller should be set to the first ciids.
● Test a valid case - The top simlikes of the caller should now be the given ciids.
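The behaviour for lists that are too short or too long can be sketched as follows. This is an assumption about how the ciids might be mapped onto slots, not the project's code. The class and method names are hypothetical, and null is used here to mean "not set".

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names are hypothetical. */
class SetTopSimlikesSketch {

    /**
     * Maps the supplied ciids onto exactly maxTop slots.
     * An empty string means "no top simlike in this slot" and becomes null.
     */
    static List<String> toSlots(List<String> ciids, int maxTop) {
        List<String> slots = new ArrayList<>(maxTop);
        for (int i = 0; i < maxTop; i++) {
            // get(i) throws IndexOutOfBoundsException when fewer than maxTop ciids are supplied.
            String ciid = ciids.get(i);
            slots.add(ciid.isEmpty() ? null : ciid);
        }
        // Any ciids beyond maxTop are never read, so they are ignored.
        return slots;
    }
}
```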

Development strategy

The development strategy includes all design decisions (e.g. programming language, design patterns, tools) relevant for this sprint. This can be used to gain insight into the development process and as a reference of design decisions. Any global design decisions as stated in [002] are omitted.

jQuery will be used to apply ‘effects’ to the GUI of the Facebook application. Possible examples are sliding windows and drag-and-drop effects.

After meeting with Zef, we have decided to use MOBL for our mobile application. We believe it has all the features we are looking for to create our application.

Retrospective

In this section we will review progress since the last report. This enables us to better learn from our past experiences. Also, by documenting these efforts, we hope to gain feedback from our client, as well as our TU Delft supervisor, which we can then apply to accomplish work of higher quality. Last, but not least, these lessons may be of value to the team which will pick up this project after this internship assignment is over.

Monday was a holiday, and because of exams we missed 5 more of the 16 workdays (4 workdays per team member). Because of the limited man-hours available this sprint, we did not plan many tasks. To make sure we will have a working product by the end of the project, we decided to complete all the work we had started earlier. This means that ADEL is now completely finished and ready to be put online. In addition, all API call servlets do what they have to do and all database communication mocks have been replaced by real code.

Discussion

We could not send the code to SIG because no agreement could be reached on the contents of the Non-Disclosure Agreement (NDA) between Nerval Limited and SIG. Instead, we will check our code with the two tools that SIG recommended, in addition to the ones we are already using. If they provide new insights or prove to be more useful than the ones we are using now, our development strategy can be changed to use those tools.

Apart from the SIG task, all tasks that were planned for this sprint were finished. However, we are closing in on the deadline of this project. Therefore, from this sprint on, all features should be completed before new ones are started, and we will work in teams of two to make sure we get all must-haves finished before the project is over. In the following sprints, the matching, chat, mobile application and like-clustering should be completed, in that order of importance.

References

[001] Schwaber, Sutherland, “The Scrum Guide”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”

Progress Report

Sprint #9

June 27, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the progress report of the ninth sprint of the development of the Simlike platform. It is part of a series of reports that each describe the development process of a single Scrum sprint [001]. More details about the Simlike platform and the authors can be found in [002]. This report is meant as a reference for the authors and can be used to gain insight into, and analyse, the development process of the Simlike platform.

Table of contents

Summary
Introduction
Progress
    Tasks
    Features
    Functional Design
    Technical Design
        Chat
        Offline messaging
        Matching algorithm
Quality assurance
    Testing plan
    Development strategy
Integration
Retrospective
Discussion
References

Summary

A new development server was installed in the data centre DECLARED CLASSIFIED BY NERVAL LIMITED

Two algorithms were implemented: automatic matching and search matching. The latter was not planned, but finished as a by-product of automatic matching. One-to-one matching, which was planned, was not completed.

The chat system is up and running. The chat client GUI was not finished due to sickness of a team member.

DECLARED CLASSIFIED BY NERVAL LIMITED

This week the team was divided into smaller groups to work on more features simultaneously, which proved to be very effective. Nonetheless, not all tasks were finished this week.

Introduction

This document describes the design and design choices made in the ninth sprint of the development of the Simlike platform. In each sprint a selection of features of the product will be implemented. Often some other tasks will have to be executed before a feature can be implemented. These tasks are listed in section Tasks. The features are listed in section Features and further explained in section Functional Design. These sections can be used as reference during implementation and can later be used as documentation on how the features should be used.

Next, the technical design of the features is discussed in section Technical Design. This can be used as a reference during implementation and can be used by future developers to gain insight into the architecture of the system.

In the section Quality Assurance the measures that were taken to provide quality results are discussed. This is mainly used to reflect upon the development process.

Documentation on how to integrate the new features into the product environment is supplied in section Integration.

Section Retrospective discusses the progress that was made during this sprint and reflects on the process. It can be used to learn from mistakes that were made and to improve the quality process.

This report ends with a discussion of problems and noteworthy details. Any features left unimplemented, and the reasons why, are discussed as well. This section can be used to create a better plan for the next sprint and to gain insight into the difficulty of some features.

Progress

Tasks

The tasks that should be completed this sprint, but that are not features, are listed in this section.
● Set up MOBL environment - The development environment should be set up (using MOBL) so we can start on the mobile application next sprint.
● Orientation research on chat protocols - A protocol (and possibly an existing server and/or client) for the chat feature has to be chosen.
● DECLARED CLASSIFIED BY NERVAL LIMITED

● DECLARED CLASSIFIED BY NERVAL LIMITED

● DECLARED CLASSIFIED BY NERVAL LIMITED

● Set up new development server - A new more powerful development server will be installed in the datacentre.

Features

In this section the features that were planned for implementation in this sprint are discussed. These features are briefly explained in the form of user stories.
● Chat - A user should be able to start a private chat with another user. File sharing and multi-user chats are not implemented this week.
● Offline messaging - A user should be able to send another user a message when he or she is offline.
● One-To-One Matching - A user should be able to view how many simlikes he/she and another user have in common.
● Automatic Matching - A user should be able to request a list of other users that match well with him/her.

Functional Design

In this section the features that were planned for this sprint are discussed in more detail in the form of use cases. These use cases can be used to derive test specifications and as a reference during implementation.

Use case 1.1
Summary: The user chooses one of the possible matches.
Situation: The user is logged on.
Step 1: The system calculates the matches and shows the user the main screen. In the right section of the main panel, there is a list of the best matches.
Step 2: The user can scroll through this list. He selects a match that appears interesting.
Result: The user can view his choice and click one of the buttons for some extra actions regarding his choice.

Use case 1.2
Summary: The user views some additional information about one of the possible matches.
Situation: The user has chosen a match.
Step 1: The user presses the match’s name.
Step 2: The system shows the match’s profile.

Use case 1.3
Summary: The user sends a message to one of the possible matches.
Situation: The user has chosen a match.
Step 1: The user presses the "Message" button.
Step 2: The system opens the screen for making a new post.
Step 3: The user types his message and presses "Send".
Step 4: The system sends the message and returns to the main menu, displaying the match.

Use case 1.4
Summary: The user opens a chat with one of the potential matches.
Precondition: The match in question is online.
Situation: The user has chosen a match.
Step 1: The user presses the "Chat" button.
Step 2: The system switches to the Chat panel and opens a chat window with that person.

Technical Design

In this section the technical design of the features is discussed, as well as the way the new features are merged into the existing product.

Chat

The chat consists of a chat server and a chat client. The client is part of the Facebook application; the server is a separately running ejabberd instance. New users are registered on the chat server through a simple script that is accessible over HTTPS. This script will be served by nginx, because it is a very lightweight server and we don’t need any advanced features.

The API will return the user’s password on the chat server and the hostname of the chat server. The username on the chat server will be the user’s simid followed by ‘@’ and the hostname. These (hostname, password) combinations will be stored in the UserDataExternal column family in the Cassandra database. For security reasons the passwords will expire every day.
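A minimal sketch of these chat credentials is given below, under stated assumptions. "Expire every day" is read here as a lifetime of 24 hours after issuing. The random password scheme, the class name and all method names are hypothetical and not taken from the project.

```java
import java.security.SecureRandom;
import java.time.Duration;
import java.time.Instant;
import java.util.Base64;

/** Illustrative sketch only; names and the password scheme are assumptions. */
class JabberCredentialSketch {

    // Assumption: "expire every day" means 24 hours after the password was issued.
    static final Duration LIFETIME = Duration.ofDays(1);

    private static final SecureRandom RANDOM = new SecureRandom();

    /** The chat username: the user's simid, an '@' and the chat server hostname. */
    static String jid(String simid, String hostname) {
        return simid + "@" + hostname;
    }

    /** Generates a random password (18 random bytes, URL-safe Base64 without padding). */
    static String newPassword() {
        byte[] bytes = new byte[18];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** True once the password has reached its lifetime and should be replaced. */
    static boolean isExpired(Instant issuedAt, Instant now) {
        return !now.isBefore(issuedAt.plus(LIFETIME));
    }
}
```

With short-lived passwords, a credential that leaks from the client side is only usable for a limited time.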

Offline messaging

Offline messages will be simple strings in the Cassandra database. They will be stored in the UserMessageInbox and UserMessageSent column families, at both the sender’s and the receiver’s end.

Matching algorithm DECLARED CLASSIFIED BY NERVAL LIMITED

Quality assurance

This section explains and motivates the strategy that was used to produce a high-quality result. First the testing plan of the features is discussed, then the overall development strategy for this sprint.

Testing plan

The testing plan of the features of this sprint includes all test specifications and the reasons for adding them. A testing plan can be used to gain insight into the test suite and to provide some level of confidence in the correctness of the implementation.

JabberDataController

This class is responsible for storing and retrieving the Jabber data.
● Request non-existent data of an existing user - An empty Map is returned, or a null value by the methods that don’t return a Map. If the getOrCreateJabberData method is used, the user should afterwards be a registered Jabber user.
● Request data of a non-existent user - An empty Map or a null value is returned. This case is not tested for getOrCreateJabberData, because it would result in ‘ghost’ data in the database.

The chat server - client integration will be tested manually, just like the correctness of the matching results.

Development strategy

The development strategy includes all design decisions (e.g. programming language, design patterns, tools) relevant for this sprint. This can be used to gain insight into the development process and as a reference of design decisions. Any global design decisions as stated in [002] are omitted.

We will use the Jabber/XMPP protocol for the chat, with ejabberd as the chat server, since it is easy and free to cluster. nginx will be used to host a script to register/update users on the ejabberd server.

Integration

In this section the integration of the new features into the existing product is discussed.

The chat server should be hosted on a server. At least one of the chat servers of the cluster needs to run nginx with the user registration script, so the API can register/update Jabber users. ejabberd should be configured so that its distributed database (Mnesia) automatically replicates the table with user information throughout the cluster. The API’s class JabberUserAdministration should be updated to contain a reference to the server where the script is located.

DECLARED CLASSIFIED BY NERVAL LIMITED

Retrospective

In this section we will review progress since the last report. This enables us to better learn from our past experiences. Also, by documenting these efforts, we hope to gain feedback from our client, as well as our TU Delft supervisor, which we can then apply to accomplish work of higher quality. Last, but not least, these lessons may be of value to the team which will pick up this project after this internship assignment is over.

This week we split up into two smaller groups so we could work on more tasks simultaneously. This proved to be very effective, because:
● Individual team members had to make more decisions by themselves, which resulted in faster decision making. Until now, most decisions were discussed first.
● There was less communication overhead overall.
The quality of the decisions did not decrease. We will continue making decisions this way. All decisions will be evaluated while writing the progress report.

DECLARED CLASSIFIED BY NERVAL LIMITED

Discussion

Offline messaging was not implemented. We had planned to do it via ejabberd, but after some more research it turned out that doing it via Cassandra will be easier and more efficient. It was not completed this sprint because it had a lower priority than the chat.

DECLARED CLASSIFIED BY NERVAL LIMITED

The chat server and the user registration script are set up, so the server side of the chat is finished. The client side is not yet finished, because one of our team members fell ill and was not able to work on Friday.

DECLARED CLASSIFIED BY NERVAL LIMITED

DECLARED CLASSIFIED BY NERVAL LIMITED

This week a new development server arrived and was installed in the datacentre. This gives us more flexibility as to where we can work. Before, sometimes a team member had to work in Naaldwijk because that is where the previous development server was located. This was not an issue, but since we split into smaller groups, this is very convenient (we could work physically independently from each other if needed).

The specifications of the development server are:

● 12 GB memory
● 2x 1 TB disk space configured in RAID 1
● Intel Core i7 950 processor
● 100 Mbit/s connection

DECLARED CLASSIFIED BY NERVAL LIMITED

According to the planning, we finished 80% of the tasks this week. Some time was lost because one team member was sick on Friday, and the amount of work planned was a little too much. Nonetheless, we are satisfied with the amount of work that was finished.

References

[001] Schwaber, Sutherland, “The Scrum Guide”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”
[003] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Sprint Report #3”

Progress Report

Sprint #10

July 5, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This is the progress report of the tenth sprint of the development of the Simlike platform. It is part of a series of reports that each describe the development process of a single Scrum sprint [001]. More details about the Simlike platform and the authors can be found in [002]. This report is meant as a reference for the authors and can be used to gain insight into, and analyse, the development process of the Simlike platform.

Table of contents

Summary
Introduction
Progress
    Tasks
    Features
    Technical Design
        Messaging
        JavaScript SDK
Quality assurance
    Testing plan
Retrospective
Discussion
    Features
References

Summary

This sprint’s progress report is more concise than the previous progress reports. With the deadline approaching, we decided to spend more time on finishing features and less time on reports. The progress will have to be documented eventually; we have reserved the entire last week for this. In the meantime, the technical design will be kept up to date in the draft of the Technical Design Document (TDD). This saves time because we do not have to merge this week’s progress report into the TDD.

We added a JavaScript SDK to integrate features faster on multiple platforms and to increase maintainability.

Many of the must-have features are implemented in the API and in the JavaScript SDK. However, not all of them are fully integrated in the GUI yet.

Currently we are working overtime to integrate all the must-have features in the GUI.

Introduction

This document describes the design and design choices made in the tenth sprint of the development of the Simlike platform. In each sprint a selection of features of the product will be implemented. Often some other tasks will have to be executed before a feature can be implemented. These tasks are listed in section Tasks. The features are listed in section Features. These sections can be used as reference during implementation and can later be used as documentation on how the features should be used.

Next, the technical design of the features is discussed in section Technical Design. This can be used as a reference during implementation and can be used by future developers to gain insight into the architecture of the system.

In the section Quality Assurance the measures that were taken to provide quality results are discussed. This is mainly used to reflect upon the development process.

Section Retrospective discusses the progress that was made during this sprint and reflects on the process. It can be used to learn from mistakes that were made and to improve the quality process.

This report ends with a discussion of problems and noteworthy details. Any features left unimplemented, and the reasons why, are discussed as well. This section can be used to create a better plan for the next sprint and to gain insight into the difficulty of some features.

Progress

Tasks

The tasks that should be completed this sprint, but that are not features, are listed in this section.
● DECLARED CLASSIFIED BY NERVAL LIMITED
● Prepare demonstration for Peter - Prepare a working demonstration for Peter.
● Make a room reservation for the presentation.
● Sprint report.
● Install Sonar - As part of the alternative SIG assignment.
● Host API - Host the API on a server in the datacentre.
● Host GUI - Host the GUI on a server in the datacentre.

Features

In this section the features that were planned for implementation in this sprint are discussed.
● Offline messaging - Users should be able to send messages to each other, comparable to email.
● Chat one on one - Users should have a nice GUI for the one-on-one chat.
● Chat photo sharing - Users should be able to share photos over the chat.
● Automatic matching - Automatically match users and fully integrate this in the web interface.
● Search matching - Users should be able to search for users based on their simlikes. Fully integrate this in the web interface.
● One-on-one matching - Display common interests (a.k.a. simlikes) and integrate this in the web interface.
● Mobile app - Fully functional mobile application (styling will be finished later).

Technical Design

In this section the technical design of the features is discussed, as well as the way the new features are merged into the existing product.

The progress report of this sprint has less focus on the technical design. Many features must still be finished, so to save time the technical documentation will be written directly in the Technical Design Document next week.

Messaging

Messages will be stored in the database by the MessageController. This controller will be able to retrieve the most recent messages and store new messages. The Cassandra data structure we use allows the creation of maps (folders) so users can organise their messages. Every message can belong to at most one map. The structure also leaves room for adding ‘homopost’ functionality (posting responses to messages underneath each other). During this bachelor project the homopost functionality and extra maps (other than the inbox and sent maps) will not be implemented, and no limit will be set on the number of messages users can have. These are recommended features for future development.

MessageCenter
● sendMessage(subject, content, receiver) - This call sends a message with the given subject and content from the caller to the receiver.
● readMessages(map, [start-id], number) - Reads the most recent messages from the given map of the user. The maximum number of messages that should be read is supplied as a parameter. Optionally, reading can start from a given message id. This means that the most recent messages starting from the arrival time of the given message will be returned.
● removeMessage(map, message) - Removes the given message from the user’s map. If no maps contain the message, it will be deleted.
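The three calls can be illustrated with a small in-memory Java sketch. It is not the Cassandra-backed MessageController. The class, its fields and the explicit user arguments are hypothetical, message ids are assumed to increase with arrival time, and the optional start id is interpreted as "this message and older ones". The sketch does not check whether the receiver exists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory sketch only; not the project's storage layer. */
class MessageStoreSketch {

    static final class Message {
        final long id;
        final String sender, receiver, subject, content;

        Message(long id, String sender, String receiver, String subject, String content) {
            this.id = id;
            this.sender = sender;
            this.receiver = receiver;
            this.subject = subject;
            this.content = content;
        }
    }

    private long nextId = 1; // assumption: ids increase with arrival time

    // user -> map (folder) name -> messages, oldest first
    private final Map<String, Map<String, List<Message>>> boxes = new HashMap<>();

    private List<Message> box(String user, String map) {
        return boxes.computeIfAbsent(user, u -> new HashMap<>())
                    .computeIfAbsent(map, m -> new ArrayList<>());
    }

    /** Stores the message in the sender's "sent" map and the receiver's "inbox" map. */
    long sendMessage(String sender, String subject, String content, String receiver) {
        Message message = new Message(nextId++, sender, receiver, subject, content);
        box(sender, "sent").add(message);
        box(receiver, "inbox").add(message);
        return message.id;
    }

    /** Returns at most 'number' messages, newest first; startId may be null. */
    List<Message> readMessages(String user, String map, Long startId, int number) {
        List<Message> result = new ArrayList<>();
        Map<String, List<Message>> userBoxes = boxes.get(user);
        List<Message> all = userBoxes == null ? null : userBoxes.get(map);
        if (all == null) {
            return result; // non-existent map: nothing to read
        }
        for (int i = all.size() - 1; i >= 0 && result.size() < number; i--) {
            Message message = all.get(i);
            if (startId == null || message.id <= startId) {
                result.add(message);
            }
        }
        return result;
    }

    /** Removes the message from one map; returns false when it was not there. */
    boolean removeMessage(String user, String map, long messageId) {
        Map<String, List<Message>> userBoxes = boxes.get(user);
        List<Message> all = userBoxes == null ? null : userBoxes.get(map);
        // Once no map references the message any more, it is gone from this store.
        return all != null && all.removeIf(m -> m.id == messageId);
    }
}
```

Passing the id of the oldest message from a previous result as the start id allows a client to page backwards through a map; since the start message itself is returned again, the client should skip it.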

JavaScript SDK

We added a single JavaScript SDK, which can be reused by the Facebook application and the mobile application. In future development, it can also be used for the web platform. This speeds up development, since a feature can be implemented for two platforms simultaneously. Currently implemented functionality:
● API Authentication - Authenticate to Simlike and access credentials.
● Facebook integration - Includes logging into Simlike with your Facebook account. Also manages Facebook permissions (e.g. allowing access to Facebook photo albums).
● Chat - An abstraction layer for the Simlike chat, exposed through simple methods like connect(), disconnect(), sendMessage(), sendPhoto() etc.
● Messaging system - An abstraction layer for the offline messaging functionality. Provides support for sending offline messages to users, including a subject. Also has support for message folders.

Quality assurance

This section explains and motivates the strategy that was used to produce a high-quality result. For this sprint only the testing plan of the features is discussed.

Testing plan

The testing plan of the features of this sprint includes all test specifications and the reasons for adding them. A testing plan can be used to gain insight into the test suite and to provide some level of confidence in the correctness of the implementation.

MessageController
● Send a new message to a non-existent user - An IllegalArgumentException will be thrown, causing a standard JSON error to be returned to the client.
● Send a new message to an existing user - The message will be stored in the ‘sent’ box of the sender and the inbox of the receiver.
● Send a new message to yourself - The message will be stored in the inbox and the ‘sent’ box of the caller.
● Remove multiple messages - The messages will be removed from the given map.
● Remove a message from all maps - The message will be completely removed.
● Remove non-existent messages from both a valid and an invalid map - Nothing will happen, since there is no message to delete.
● Read messages from a non-existent map - An empty object will be returned.


Retrospective

In this section we will review progress since the last report. This enables us to better learn from our past experiences. Also, by documenting these efforts, we hope to gain feedback from our client, as well as our TU Delft supervisor, which we can then apply to accomplish work of higher quality. Last, but not least, these lessons may be of value to the team which will pick up this project after this internship assignment is over.

We applied the process of “finishing what you start”, as we decided in previous sprints. This means that we could finally finish some real features and create results that can be demonstrated. The features are being implemented at a steady pace. We will keep up this way of working.

This sprint’s progress report is more concise than the previous progress reports. With the approaching deadline, we decided to spend more time on finishing features and less time on reports. The progress will have to be documented eventually. We planned the entire last week for this. In the meantime, the technical design will be kept up to date in the draft of the Technical Design Document (TDD). This saves time because we do not have to merge this week’s progress report into the TDD.

We worked the entire weekend to finish as many features as possible. We communicated via Skype. This was less efficient than working together in the same room, so we got less done than we wanted. The mobile app in particular suffered from this. We will be working overtime next week to finish all the must-have features.


Discussion

Almost all the features and tasks that were planned were finished. Only the integration of the parts remains. Additionally, we made a JavaScript SDK to reuse code between the mobile app and the Facebook application.

● DECLARED CLASSIFIED BY NERVAL LIMITED
● Prepare demonstration for Peter - This task depended on the features that should be finished this week. Not all features were finished; hence this task is not complete.
● Make a room reservation for the presentation - We reserved room C at EWI on July 15th at 10:00, a little earlier than the start of the presentation so that we have time to set up.
● Sprint report - Finished.
● Install Sonar - Sonar is up and running.
● Host API - The API is now hosted on the development server in the datacentre.
● Host GUI - The GUI is now hosted on the development server in the datacentre.

Features

In this section the features that were planned for implementation in this sprint are discussed.
● Offline messaging - API and JavaScript SDK are fully implemented. Integration in the GUI is not finished yet.
● Chat one on one - Chat is working. We moved it into the JavaScript SDK so the mobile app is able to reuse the code (this was less work than writing it explicitly for the mobile app). GUI integration is half done.
● Chat photo sharing - Finished, needs more GUI styling.
● Automatic matching - Finished.
● Search matching - Finished.
● One-on-one matching - Implemented in the API, must still be integrated in the GUI.
● Mobile app - We spent more time than anticipated on this, due to the learning curve for Mobl. 60% done.


References

[001] Schwaber, Sutherland, “The Scrum Guide”
[002] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”


Definitions

This document contains a list of definitions used in the documents of the development process of the Simlike platform. Each definition is briefly explained and, when applicable, its acronym is listed. Some descriptions have been borrowed from Wikipedia (www.wikipedia.org) or the respective product websites.

● AEB - See AWS Elastic Beanstalk.
● Amazon Web Services (AWS) - Amazon’s collection of web services, including database platforms and hosting clusters of servers. See also: Auto Scaling, Elastic Compute Cloud, Elastic Load Balancing, Spot Instances.
● Auto Scaling - A feature of EC2, allowing the user to automatically scale the number of servers up or down according to user-defined conditions. See also: Amazon Web Services, Elastic Compute Cloud.
● Android - Google’s open source operating system for smart phones.
● Apache Cassandra - An open source database platform from Apache.
● Apache Hadoop - A collection of open-source software for reliable, scalable, distributed computing. See also: Apache HBase, Hadoop Distributed File System.
● Apache HBase - An open source database platform from Apache.
● Apache Tomcat - Implements the Java Servlet and the JavaServer Pages (JSP) specifications from Sun Microsystems, and provides a "pure Java" HTTP web server environment for Java code to run. See also: Hypertext Transfer Protocol, Java Server Pages, Java Servlet, Jetty.
● API - See Application Programming Interface.
● App - Short for application.
● Application Programming Interface (API) - A particular set of rules and specifications that software programs can follow to communicate with each other. It serves as an interface between different software programs and facilitates their interaction, similar to the way the user interface facilitates interaction between humans and computers.
● AWS - See Amazon Web Services.
● AWS Elastic Beanstalk (AEB) - Amazon’s web application hosting platform. See also: Amazon Web Services.
● CDN - See Content Delivery Network.
● CGI - See Common Gateway Interface.
● Common Gateway Interface (CGI) - A standard defining the delegation of web-page generation from a server to a stand-alone application.
● Content Delivery Network (CDN) - A system of computers containing copies of data placed at various nodes of a network. A CDN can improve access to the data it caches by increasing access bandwidth and redundancy and reducing access latency.
● Context Item - A context item is a piece of information that can be used to identify the context of a user. A Facebook user has likes, activities, places and interests that can be clustered. These clusters are the context items identifying the user’s context.
● Data Base Management System (DBMS) - Collection of software to control the use and maintenance of databases. See also: DDBMS, RDBMS.
● DBMS - See Data Base Management System.
● DDBMS - See Distributed Data Base Management System.
● Distributed Data Base Management System (DDBMS) - Collection of software that permits the management of a distributed database and makes the distribution transparent to the users. See also: Data Base Management System.
● Elastic Compute Cloud (EC2) - Amazon’s platform for hosting virtual private servers. See also: Amazon Web Services, Auto Scaling, Elastic Load Balancing, Virtual Private Server.
● EC2 - See Elastic Compute Cloud.
● Elastic Load Balancing - A feature of EC2, allowing the load on the user’s EC2 servers to be automatically divided amongst the machines. See also: Amazon Web Services, Elastic Compute Cloud.
● Extensible Messaging and Presence Protocol (XMPP) - An open technology for real-time communication, which powers a wide range of applications including instant messaging, presence, multi-party chat, voice and video calls, collaboration, lightweight middleware, content syndication, and generalized routing of XML data. See also: Jabber.
● Facebook - A popular online social network.
● Facebook Graph API - An API which discloses the information contained within Facebook. See also: Facebook, Application Programming Interface.
● Facebook Query Language (FQL) - A SQL-style interface to query the data exposed by Facebook’s Graph API. It provides some advanced features not available in the Graph API, including batching multiple queries into a single call. See also: Facebook Graph API.
● FDD - See Functional Design Document.
● FQL - See Facebook Query Language.
● Functional Design Document (FDD) - A document describing the functionality and rationale behind the functionality of a software system.
● GAE - See Google App Engine.
● Google App Engine (GAE) - Google’s web application hosting platform.
● Graphical User Interface (GUI) - The graphical interface the user uses to interact with the application.
● GUI - See Graphical User Interface.
● Hadoop Distributed File System (HDFS) - Creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster. See also: Apache Hadoop.
● HDFS - See Hadoop Distributed File System.
● HTTP - See Hypertext Transfer Protocol.
● Hypertext Transfer Protocol (HTTP) - A request-response protocol in the client-server computing model. In HTTP, a web browser, for example, acts as a client, while an application running on a computer hosting a web site functions as a server.
● I/O - Short for the much used expression Input/Output.
● iOS - Apple’s operating system for their iPhone and iPad.
● Jabber - A chat server that implements the Extensible Messaging and Presence Protocol (XMPP). See also: Extensible Messaging and Presence Protocol.
● JAR - See Java Archive.
● JavaScript - A scripting language, mostly used for client side computing on the internet.
● JavaScript Object Notation (JSON) - A lightweight text-based open standard designed for human-readable data interchange. Despite the name it is platform independent.
● Java Archive (JAR) - A JAR file allows Java runtimes to efficiently deploy a set of classes and their associated resources. The elements in a JAR file can be compressed.
● Java Server Pages (JSP) - A way of generating dynamic web content by referencing Java code in static content.
● Java Servlet - A Java class that is used to extend a server’s capabilities.
● Jetty - A pure Java-based HTTP server and servlet container (application server) developed as a free and open source project as part of the Eclipse Foundation. See also: Apache Tomcat.
● JS - See JavaScript.
● JSON - See JavaScript Object Notation.
● JSP - See Java Server Pages.
● JSSDK - Acronym for JavaScript Software Development Kit. See also: JavaScript, Software Development Kit.
● lighttpd - Open source web-server software.
● MVC - See Model-View-Controller.
● Model-View-Controller (MVC) - A software architectural pattern that isolates "domain logic" from the user interface, permitting independent development, testing and maintenance of each.
● nginx - Open source web-server software.
● P2P - See Peer-to-peer.
● Peer-to-peer (P2P) - A distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the application.
● PHP - See PHP: Hypertext Preprocessor.
● PHP: Hypertext Preprocessor (PHP) - A server-side scripting language, mostly used for web applications.
● Python - A scripting language.
● RDBMS - See Relational Data Base Management System.
● Relational Data Base Management System (RDBMS) - A DBMS in which data is stored in the form of tables and the relationship among the data is also stored in the form of tables. See also: Data Base Management System.
● SCRUM - An iterative, incremental framework for project management often seen in agile software development, a type of software engineering.
● SDK - See Software Development Kit.
● Simid - Abbreviation for Simlike user ID.
● simlike - A simlike (with a small s) is an item of a user that represents something he or she likes (e.g. a sport, band or brand).
● Simlike - Simlike is a social platform for interaction based on common ground.
● Simtoken - Abbreviation for a Simlike authentication token.
● Single Point Of Failure (SPOF) - A critical point in the system whose failure brings down the entire system.
● Software Development Kit (SDK) - A set of development tools that allows for the creation of applications for a certain software package, or in this context, for the Simlike platform.
● SPOF - See Single Point Of Failure.
● Spot Instances - An Amazon EC2 service, which allows the user to bid for computing time and power. See also: Amazon Web Services, Elastic Compute Cloud.
● TDD - See Technical Design Document.
● Technical Design Document (TDD) - A document describing the technical decisions and foundations of a software system.
● Tomcat - See Apache Tomcat.
● Titanium Appcelerator - Software to compile JavaScript applications to applications for iOS and Android.
● UML - See Unified Modelling Language.
● Unified Modelling Language (UML) - A set of graphic notation techniques to create visual models of object-oriented software-intensive systems.
● Virtual Private Server (VPS) - A server that gives the user the illusion of having access to the entire machine when, in fact, multiple users might be using the same machine.
● VPS - See Virtual Private Server.
● WAR - See Web application Archive.
● Web application Archive (WAR) - A WAR file is a Java Archive (JAR) file used to distribute a collection of JavaServer Pages, servlets, Java classes, XML files, tag libraries and static Web pages (HTML and related files) that together constitute a Web application. See also: Java Archive, Java Server Pages, Java Servlet.
● XMPP - See Extensible Messaging and Presence Protocol.

Security orientation & design guidelines

4 May, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

This document is intended to outline the necessities that must be implemented in order to guarantee some acceptable measure of security, and to provide some concrete suggestions to reach that goal. How exactly this will be achieved is left to another study. As a guideline, Google’s “What every programmer needs to know about web security”[1] is used, because Google knows like no other what can and should be done to design a system that is secure enough for productive real-world use.


Table of Contents

What is security?
How can a workable amount of security be achieved?
Design suggestions
Other suggestions
References


What is security?

Strictly speaking, all systems in existence today are insecure, so the question left to be asked is “how insecure are they, exactly?”. In practice, however, a system is said to be secure if the cost for the attacker to break the system exceeds the reward he/she would gain by breaking it. In other words, security is a game of economics and risk management; it is not primarily a problem of a technical nature. The tools to combat system insecurity, however, have both technical and social components.


How can a workable amount of security be achieved?

To properly defend against any potential attacks it will be necessary to assess which kinds of attacks are the most likely to be aimed at Simlike, both shortly after launch and long-term.

There are 3 central aspects to securing any given project, and all 3 are imperative if the project is truly to be secure:

1. Physical security. This is about limiting physical access to any and all relevant assets. Within the project, 2 things need to be secured at the hardware level:
   a. The collaboration software, the bug tracker and the project code itself. All of these are hosted on virtual servers: Google Apps is used by the team for collaboration, the MantisBT bug tracker is hosted on a virtual private server (VPS), and the project’s release-ready code is hosted on Amazon’s cloud in compiled form, while development code in source form is hosted on project members’ private computers. Physical access is therefore already severely restricted.
   b. Any and all relevant documents. These are hosted within Google Docs, which is a part of Google Apps, so access to these is restricted as well.

2. Technological security. This consists of 3 parts:
   a. Application level security: This is about making sure that the Simlike application code itself is secure by designing in safeguards while designing the application itself, not as an afterthought. The reasoning is that it is very difficult to add security later, when the product is already in production use and should not experience functional breakage. This can be done, for instance, by making sure that there are no flaws in the identity verification process, that the application server is properly configured, and that any input data is interpreted robustly.
   b. OS level security: Given the non-triviality of the programs contained within the collection of software known as an Operating System (e.g. Windows 7, Debian Linux), this is more about managing potential exploits in the underlying OS than anything else. This may be achieved by installing verified security updates as they become available, and by hardening the OS, that is, removing any and all unnecessary parts of the OS to reduce the number of potential exploits. For example, on a Linux-based server often no GUI packages have to be installed, since such a server can be administered from the command-line interface (CLI).
   c. Network level security: This can be done by taking steps at the network stack level within the OS and in the software running on any deployed specialized network hardware such as routers and switches, and by using tools such as firewalls and intrusion detection systems to mitigate malicious traffic.

3. Policies and procedures. This is about protecting and strengthening something that is often the weakest link in security: people. For instance, a successful social engineering attack (in which an attacker takes advantage of unsuspecting employees by making them divulge information of a sensitive nature) can in some instances reduce or even eliminate the need for technical attacks. In the case of this project, social engineering in all likelihood won’t be an issue until the Simlike website goes online. However, once it does, any and all Nerval Ltd. employees, even the non-technical ones, will need to be aware of what social engineering is and will need to be educated on how to detect such an attack. In addition, they will need to be trained to be somewhat paranoid and vigilant. For example, if there is an office building with a restricted “for employees only” section, a policy could be created and enforced which explicitly states that tailgating is not allowed.

According to Google, there are 7 key security concepts[2]:

1. Authentication: The central question here is “How can person A be sure he is communicating with person B and not with some attacker C?”. There are 3 general ways to provide authentication, which can be combined to provide a stronger form of authentication:
   a. Something you know (e.g. a password). This system is generally simple to implement and simple for users to understand, but passwords become less user friendly and harder to remember as they are required to be stronger. The result is often that passwords are reused in multiple places, thereby decreasing security. In addition, users will write down the passwords that are more difficult to remember, making it easy for anyone with physical access to the user’s machine to gain entry into the system.
   b. Something you have (e.g. an ATM card). So-called one-time password (OTP) cards generate a new password upon every user login. There are also Smart Cards, which are fairly tamper-resistant and are intended to be inserted into a card reader. Here the strength of the authentication depends on the difficulty of forging such a card. A card is easy to use, but can be lost, at which point security may be compromised, especially if this is the only authentication method used.
   c. Something you are (i.e. biometrics). It is easy to use, and helps somewhat with strengthening security, but as of April 2011, the most popular forms of biometrics (fingerprint and iris scans) result in a high rate of false positives (an impostor is accepted) and false negatives (an authentic user is rejected). Therefore this should not be used on its own; instead, it should only be used to bolster a previously existing authentication method.

2. Authorization: This deals with the question “Does user X have permission to perform action Y?”, e.g. “Does Alice have permission to read a critical system configuration file?”. This can be managed with a so-called Access Control List (ACL), which is a list of 3-tuples (User, Resource, Privileges), where each 3-tuple specifies which User can access which Resource, and what privileges are available to that User when accessing that Resource. An alternative is to use Roles instead of Users in an ACL, where every User can take some subset of Roles. This would result in a list of 3-tuples (Role, Resource, Privileges).

3. Confidentiality: The goal here is to keep the data in storage or the contents of the communication between parties secret. This can be accomplished by means of a key, which is a secret of some form that is shared between those parties. Confidentiality can be achieved, for example, through cryptography (where the key is used to encrypt and decrypt longer messages), steganography (which is writing hidden messages in such a way that no one apart from the sender and intended recipient even suspects the existence of the message), access controls and database views.

4. Data integrity: This primarily deals with the prevention and detection of data corruption, which can be either accidental (e.g. a faultily downloaded file) or intentional (e.g. a man-in-the-middle attack, in which an attacker mediates between parties in order to gain information from and/or spread information between authentic parties). Note that this is different from confidentiality: the goal here is to ensure that data is not altered between writing it and reading it back in, whereas confidentiality deals with keeping the data secret from unauthorized parties. To ensure data integrity, several measures can be taken, such as hashing (e.g. using algorithms like SHA-1 or MD5), checksums (e.g. CRC) and Message Authentication Codes (MACs).

5. Accountability: Being able to find out what or who caused a security breach can be of vital importance in controlling the damage done. Accountability can be achieved by logging and audit trails (an audit trail is a chronological record of system activities that enables the reconstruction and examination of the sequence of events and/or changes in an event). However, this requires secure time stamping, data integrity in the logs and audit trails, immutability of the audit trails and the ability to detect modification of the logs. If any of those is missing, an attacker can easily cover his/her tracks.

6. Availability: This is about achieving long uptimes, limiting any downtime of the Simlike system and getting good server response times. To meet these goals, redundancy (duplication of data and servers) can be added in order to remove any single point of failure. In addition, limits can be imposed upon users, e.g. a user might be limited by the system to 3 requests per second. DoS (Denial of Service) attacks are aimed at reducing the system’s availability by sending exorbitant amounts of data to the server, which then becomes overwhelmed and therefore can’t do its job. If there is no explicit design to deal with this, the system will be vulnerable to this type of attack.

7. Non-repudiation: If a transaction has happened somewhere in the system, its existence must be irrefutable. This can be achieved by generating some sort of evidence (e.g. receipts or digitally signed statements) for every transaction. However, it might not be feasible to do this throughout the entire Simlike product because of financial, data storage or time complexity costs. In that case it may be an option to do it only for the most critical functionalities of the system.
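
The role-based variant of the Access Control List described under Authorization can be sketched in a few lines. The Python example below is only an illustration under stated assumptions: the role names, resources, privileges and user names are invented for this sketch and do not come from the Simlike design.

```python
# Each ACL entry is a 3-tuple: (role, resource, privilege).
# All names below are hypothetical and serve only to illustrate the idea.
ACL = {
    ("member", "profile", "read"),
    ("member", "message", "send"),
    ("moderator", "profile", "read"),
    ("moderator", "message", "delete"),
}

# Every user holds some subset of the available roles.
USER_ROLES = {
    "alice": {"member"},
    "bob": {"member", "moderator"},
}


def is_allowed(user, resource, privilege):
    """Return True if any role held by the user grants the privilege on the resource."""
    roles = USER_ROLES.get(user, set())  # unknown users hold no roles
    return any((role, resource, privilege) in ACL for role in roles)
```

With these tables, is_allowed("bob", "message", "delete") is True, while the same check for "alice" or for an unknown user is False. Anything not explicitly listed is denied, so privileges have to be granted deliberately per role.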


Design suggestions

There are a number of design decisions the team could make in future sprints to get a good measure of security into the Simlike product:

1. Validate all user input: This can prevent attacks such as SQL injections.
2. Use Access Control: Using a role-based access control model will make it easier to manage users, privileges and resources.
3. Least Privilege Principle: Give the various ACL roles no more privileges than they need to do their job.
4. Ensure Availability by designing enough redundancy/duplication into the system to make it very difficult for DoS attacks to succeed.
5. Store passwords securely with a salted hash[3].
6. No “turtle shell” architectural design, where there is a system specifically designed to protect another, inherently insecure system. This is because if an attacker does manage to get past the outer system, he/she will have free rein over the vulnerable inner system.
7. Encrypt all communication between the client-side interface and the server, for instance by using HTTPS.
8. Reuse as much well-tested security-related code as possible, so that the team avoids having to write it themselves and making mistakes in the implementation.
9. Treat malicious traffic as fact and not as an exceptional condition. This implies explicitly designing a way to detect and handle such traffic.
10. Force users to create strong passwords by rejecting any password that isn’t. At the same time, encourage them to make the password easy to remember, for example by suggesting that they use normal but less common words separated by spaces or another special character.
11. Design the software to be as simple as technically possible, since more complex software is likelier to harbor bugs. In addition, simple software is easier to understand and debug.
12. Prevent users from performing insecure actions. Instead, help them perform those actions in a secure manner.
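
Suggestion 5 (storing passwords with a salted hash) can be illustrated with Python's standard library. This is a sketch under assumptions, not the method prescribed by this document or by reference [3]: the choice of PBKDF2 with SHA-256, the iteration count, the salt length and the function names are all picked for this example.

```python
import hashlib
import hmac
import os

# Illustrative work factor; an assumption of this sketch, not a recommendation.
ITERATIONS = 100_000


def hash_password(password, salt=None):
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the salted hash and compare it with the stored one in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)
```

Only the salt and the digest would be stored. Because every user gets a different random salt, two users with the same password end up with different digests, which defeats precomputed lookup tables.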


Other suggestions

1. Define concrete, measurable security goals for Simlike, for example:
   ○ Only certain kinds of users should be able to see any given user’s profile information.
2. No security through obscurity (STO) without proper security backup: in practice STO is hardly a deterrent for a determined attacker. The preference here is to not use STO at all, as it complicates the system.
3. Keep track of usability when designing for security: often security and convenience have an inversely proportional relationship, which means that as a system gets more secure it becomes less convenient for users. If usability suffers too much for the sake of security, users will try to circumvent inconvenient security measures, for example by choosing weak but easy to remember passwords, or they might simply write the passwords down.
4. Keeping true to the Least Privilege Principle, no Simlike application server should run as root / administrator. Instead, it should run as a specially created user and group that have just enough privileges within the OS to get the job done.
5. No defensive programming[4]. While it might increase security by catching an error that wasn’t caught earlier, it can increase the maintenance burden significantly, which in turn may distract future Simlike maintainers from other security-related issues.
6. Instead, unambiguously design by contract on the class and integration levels, and anything that does not explicitly fit into the contract should cause an AssertionError or something of equivalent semantics as soon as possible. This increases both maintainability and security by simultaneously making it harder for an attacker to get malicious code in, and making it possible for maintainers to quickly find the root of a lot of bugs (Fail Early, Fail Loudly)[5].
7. Create a containment plan in preparation for a breach, and test it. This can shorten the response time if a breach actually occurs.
8. Create and maintain awareness within the development team that implementing security features does not imply the product is secure, in much the same way that testing non-trivial software does not imply that it is free of bugs. An attacker needs to find only one flaw to successfully penetrate the system, but the team (especially the team that takes over after the project team has finished) will need to check for all possible flaws.
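
The design-by-contract suggestion (item 6) can be made concrete with a small sketch. The Python example below is an illustration only: the helper require, the function move_message and the representation of folders as lists of message ids are assumptions of this sketch and do not describe the Simlike code base.

```python
def require(condition, message):
    """Fail early and loudly: raise AssertionError as soon as a contract is violated."""
    if not condition:
        raise AssertionError(message)


def move_message(folders, message_id, source, target):
    """Move a message id from one named folder (a list of ids) to another."""
    # Preconditions: reject anything outside the contract before touching any state.
    require(source in folders, "unknown source folder")
    require(target in folders, "unknown target folder")
    require(source != target, "source and target must differ")
    require(folders[source].count(message_id) == 1,
            "message must occur exactly once in source")

    folders[source].remove(message_id)
    folders[target].append(message_id)

    # Postcondition: the message now lives in the target folder only.
    require(message_id not in folders[source] and message_id in folders[target],
            "postcondition violated")
```

A call that violates the contract, such as moving a message that is not in the source folder, raises immediately and leaves the folders untouched, so the error surfaces at its origin. An explicit helper is used instead of Python's assert statement because assert statements are removed when the interpreter runs with the -O option.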


References

[1]: http://code.google.com/edu/submissions/daswani/index.html
[2]: http://gcu.googlecode.com/files/1.ppt
[3]: http://www.aspheute.com/english/20040105.asp
[4]: http://en.wikipedia.org/wiki/Defensive_programming
[5]: http://oncodingstyle.blogspot.com/2008/10/fail-early-fail-loudly.html


Orientation report

Bachelor internship commissioned by Nerval Limited

26 April 2011, Delft

Jeroen Dijkhuizen Joey Ezechiëls Joris Albeda Volker Lanting

Preface

The purpose of this orientation report is to become familiar with the waters we will find ourselves in while developing the final product that is central to our bachelor internship.

Table of contents

Introduction
Simlike platform
Hardware platform
Google App Engine (GAE)
Amazon Web Services (AWS)
Dedicated Server
Web platform
Apache
nginx
lighttpd
Mobile app platform
Titanium Appcelerator
Database platform
Apache Hadoop
Amazon Simple DB
Amazon Elastic MapReduce (EMR)
Apache Cassandra
Google Datastore
Algorithms platform
Pre-processing
Searching
Algorithms
Clustering algorithms
Filtering and recommendation algorithms
Latent Dirichlet Allocation
Pattern mining
Classifiers
Development analysis
Development methodologies
Waterfall method
Agile Development
Test-driven Development
SCRUM
Software tools
Facebook SDK
Integrated Development Environment (Eclipse)
Software versioning (Git)
Build tools
Bug tracker (Mantis)
Programming languages
Java
Python
PHP
HTML5 / Javascript / CSS3
HTML4 / Javascript / CSS
Bibliography

Introduction

This orientation report investigates which existing software, development tools and algorithms are available that we can apply within the bachelor project. Central to the project is the development of the Simlike application, a new social medium on which users can get in touch on a friendly basis, based on shared characteristics. The ultimate goal is to develop a Facebook application, a mobile app and a website. These three end products use a search algorithm to match and search on characteristics of users. The algorithm in turn consists of two parts. The first is a pre-processing part, which distils characteristics from the data and builds search structures on them. The other part of the algorithm covers searching within these data structures. No choices are made in this report yet; these will be made during the design phase.

Simlike platform

The foundation of the Simlike platform consists of 5 components:
1. Hardware platform: the equipment on which the platform is hosted.
2. Web platform: an interface that enables use via a web browser.
3. Mobile platform: a platform for mobile use.
4. Database platform: stores all data.
5. Algorithms platform: processes and searches the data.

We examine which existing software and techniques (algorithms and the like) could be applied to support the Simlike platform.

Hardware platform

There are a number of virtual and hardware candidates on which the platform can be hosted. The aspects of main interest here are scalability and cost: the platform must be as scalable as possible at the lowest possible cost.

Google App Engine (GAE)

Google has its own hosting platform for web applications. This is what Google itself says about it:

“App Engine is a complete development stack that uses familiar technologies to build and host web applications. With App Engine you write your application code, test it on your local machine and upload it to Google with a simple click of a button or command line script. Once your application is uploaded to Google we host and scale your application for you. You no longer need to worry about system administration, bringing up new instances of your application, sharding your database or buying machines. We take care of all the maintenance so you can focus on features for your users.” - Google App Engine website[1]

There are also critical voices about GAE:

“The bottom line is using Google App Engine will be more restrictive than the development and deployment environments that you are used to. The datastore is more difficult to use than a relational database (which for example has global transactions, joins on tables, and a subset of the types of queries that you can do with a relational database). Not being able to start and manage long-running processes also makes some kinds of applications difficult to write.” - developer.com[2]

A major disadvantage of choosing GAE is that Google Datastore is the only option left for data storage. The number of usable libraries is also limited[3]. GAE is relatively cheap when an application is hosted on GAE in its entirety, but it is not flexible and does not support background processes. The only supported languages are Java and Python. Compared with its competitors, the platform is also very slow. Scaling the application is no concern, as it happens automatically (although an initial set-up is needed).

Pros:
● Cost scales with use, starting from 0 euros.
● No worries about hardware scalability.

Cons:
● No background processes possible.
● Response times relatively slow compared with the competition.
● Little choice of libraries.
● Very restrictive in the choice of the underlying database system.

Amazon Web Services (AWS)

AWS offers several web services that can be combined into a hosting platform. The first service is Elastic Compute Cloud (EC2), a platform on which virtual private servers (VPS) are hosted. You can create your own images, which are then run as VPSs. This gives more freedom than, for example, GAE, because within a VPS you can choose your own OS and software (Apache, nginx, PHP, etc.). With Auto Scaling, the number of VPS instances can scale dynamically with the load on the application. With Elastic Load Balancing, this load can also be spread across multiple data centres.

With so-called “Spot Instances”, computing power can be bought more cheaply. This is ideal for background processes, for example pre-processing and keeping data up to date:

“Spot Instances allow customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and customers whose bids meet or exceed it gain access to the available Spot Instances. If you have flexibility in when your applications can run, Spot Instances can significantly lower your Amazon EC2 costs.” - Amazon.com[4]

AWS is not the cheapest option, but it is well priced. The fact that it is not the cheapest can be compensated for by bidding on compute time, which can save significantly on running background processes. AWS is very flexible thanks to the variety of services it offers, although this does require more knowledge and initial configuration. The speed of the platform is partly in your own hands, through the software choices you make and the number of VPS instances you deploy. Once it is set up properly, scaling the platform needs no further attention.
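The bidding rule from the Amazon quote can be illustrated with a small simulation. This is a minimal sketch, not AWS code: the hourly prices are made up, the function names `hours_running` and `spot_cost` are hypothetical, and it assumes that each running hour is billed at the spot price of that hour.

```python
def hours_running(spot_prices, bid):
    """Return the hours (list indices) in which a spot instance would run.

    Follows the quoted rule: the instance runs as long as the bid
    meets or exceeds the current spot price.
    """
    return [hour for hour, price in enumerate(spot_prices) if bid >= price]


def spot_cost(spot_prices, bid):
    """Total cost, assuming each running hour is billed at that hour's spot price."""
    return sum(spot_prices[hour] for hour in hours_running(spot_prices, bid))


# Hypothetical hourly spot prices in dollars.
prices = [0.03, 0.04, 0.07, 0.05, 0.03]
print(hours_running(prices, bid=0.05))
print(round(spot_cost(prices, bid=0.05), 2))
```

A background job that tolerates interruption, such as pre-processing, simply does not run in the hours where the price exceeds the bid.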

Pros:
● Attractively priced; costs scale well with use (from 40-100 euros per month).
● Supports background processes.
● Flexible thanks to the combination of systems.
● Practically no limit to scalability.

Cons:
● Knowledge of multiple systems required.
● Official support is too expensive, so you depend on community support.

Dedicated Server

Another available option is to buy or rent a physical dedicated server. The disadvantage is that its maintenance must also be taken care of, and expanding means ordering more servers under long-running contracts (monthly contracts are sometimes possible, but even those are long-running compared with the previous two options). Scalability is poor, but with heavy I/O the costs can be more favourable, for example if Apache Cassandra (see further on in this report) were chosen as the database platform[5].

Pros:
● Full control over the hardware platform.
● For some purposes the cheapest hardware solution (not counting maintenance etc.).

Cons:
● Financially and technically hard to scale.
● Requires a lot of knowledge of both hardware and software.
● Additional cost item “maintenance”: done in-house or outsourced.

Web platform

This covers our own website and the Facebook app. The mobile app also partly falls under it, because the mobile app calls the web platform. The web platform must be able to handle many requests at high speed, both at peak and off-peak hours.

Apache

Apache is the most widely used web server in the world and offers many features and libraries[6]. Apache is not intended to be the fastest web server, but it is high-performing. The reason is its modular design, which trades some speed for flexibility. There is no support for Java Servlets. PHP and Python are supported through a module or the Common Gateway Interface (CGI).

Pros:
● Modularly extensible.
● A lot of functionality.

Cons:
● Can have a lot of overhead.
● Cannot handle more than 10,000 simultaneous requests.

nginx

nginx was originally developed for a website that handled 500 million requests per day. nginx can handle more than 10,000 simultaneous requests. There is no support for Java Servlets. PHP and Python are supported through CGI. It is very well suited to use in combination with PHP[7].

Pros:
● Can handle more than 10,000 simultaneous requests.
● Fastest web server for PHP requests.
● Small application “footprint”.

Cons:
● No .htaccess configuration possible (configuration changes require a restart)[8].

lighttpd

lighttpd is also intended for handling many simultaneous requests. There is no support for Java Servlets. PHP and Python are supported through CGI[9].

Pros:
● Can handle more than 10,000 simultaneous requests.
● Small application “footprint”.

Cons:
● Has a track record of serious bugs.
● No .htaccess configuration possible.

Mobile app platform

The three largest smartphone operating systems are Android (Google), iOS (Apple) and BlackBerry OS (RIM). The preference is to eventually have an app for each of these operating systems. Because building native apps for the different systems would take a great deal of time, this is not feasible within this bachelor project. The choice is therefore between making no mobile app and using Titanium Appcelerator.

Titanium Appcelerator

With Titanium Appcelerator you can build apps that are compatible with iOS and Android. Support for BlackBerry OS is planned; for now it is only available as a closed beta[10]. With its iOS support, Titanium Appcelerator is suitable for developing iPhone and iPad apps. It is an SDK that comes with a developer application, with which apps written in HTML and JavaScript can easily be compiled into native mobile apps. The biggest advantage is that only a single codebase needs to be maintained for all mobile operating systems.

Pros:
● Simple user interface.
● Maintainable (a single app that is portable to iOS and Android).
● Many APIs available to use.
● Free for small companies[11].

Cons:
● Native APIs may have to be written for some things.

Database platform

The platform must be able to handle large amounts of data and read and write it quickly, all while being highly scalable and fault tolerant (for example in the event of hardware failure).

Apache Hadoop

Hadoop stores information in multiple locations. This increases reliability in the event of data loss and machine failure[12]. Functions can be executed over the data in parallel using MapReduce, which can save a lot of time when there is a lot of data. The data can be read at multiple points and the computations run on the nearest data. Only one write at a time is possible, so there are no batch jobs.

Hadoop has no support for transactions and can be slow because files are not indexed. This makes it suitable mainly for running computations over large amounts of data, not for user interaction. Hadoop uses Java.

Pros:
● MapReduce capabilities (parallel computations).
● High scalability.
● Reliability (in the event of data loss or machine failure).

Cons:
● No indexing of data (lookups can be slow when there is a lot of data).
● No transactions.
● No batch jobs.
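The MapReduce idea mentioned above can be shown without Hadoop. The following is a single-process sketch in plain Python, not the Hadoop API: the map step emits (key, value) pairs, a shuffle step groups them by key, and the reduce step combines each group. The profile data and all function names are hypothetical. In a real cluster the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit an (interest, 1) pair for every interest of one user."""
    _user, interests = record
    for interest in interests:
        yield interest, 1


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single count."""
    return key, sum(values)


def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical user profiles: (user, interests).
profiles = [("ann", ["chess", "jazz"]), ("bob", ["jazz"]), ("cas", ["chess", "jazz"])]
counts = map_reduce(profiles)
print(counts)
```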

Amazon SimpleDB

Amazon SimpleDB is a non-relational database, but Amazon does offer a relational alternative (Amazon RDS) if desired. You can choose between consistent and eventually consistent reads per query [13]. The data itself must be stored on Amazon Simple Storage Service (S3). A tutorial on this can be found in [14].

Via Amazon Elastic MapReduce you can use MapReduce with SimpleDB and S3 [15]. The prices are as follows [16]:
● first 25 hours free, then $0.154 per hour
● first GB of traffic free, then (up to 10 TB) $0.10 in and $0.15 out
● first GB of storage free, then $0.275 per GB
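The price list above can be turned into a rough cost estimate. This is a sketch under stated assumptions, not an official calculator: it assumes the traffic prices are per GB, that the free first GB applies to incoming and outgoing traffic separately, and that usage stays within the tier of up to 10 TB. The function name `simpledb_cost` and the usage figures are hypothetical.

```python
FREE_HOURS = 25
PRICE_PER_HOUR = 0.154
PRICE_PER_GB_IN = 0.10
PRICE_PER_GB_OUT = 0.15
PRICE_PER_GB_STORED = 0.275
TIER_LIMIT_GB = 10 * 1024  # the listed traffic prices apply up to 10 TB


def simpledb_cost(hours, gb_in, gb_out, gb_stored):
    """Estimate the cost in dollars for one billing period."""
    if gb_in > TIER_LIMIT_GB or gb_out > TIER_LIMIT_GB:
        raise ValueError("traffic above 10 TB is not covered by this sketch")
    return (
        max(0.0, hours - FREE_HOURS) * PRICE_PER_HOUR
        # Assumption: one free GB per traffic direction.
        + max(0.0, gb_in - 1) * PRICE_PER_GB_IN
        + max(0.0, gb_out - 1) * PRICE_PER_GB_OUT
        + max(0.0, gb_stored - 1) * PRICE_PER_GB_STORED
    )


# Hypothetical usage: 100 hours, 5 GB in, 11 GB out, 3 GB stored.
print(round(simpledb_cost(100, 5, 11, 3), 2))
```

With these figures the billable amounts are 75 hours, 4 GB in, 10 GB out and 2 GB of storage.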

Pros:
● Server set-up and database configuration are largely taken care of.
● Scales well.
● You pay only for what you use.
● MapReduce (parallel computations possible).
● Reliable in the event of data loss or machine failure.
● Indexed data, which makes fast searching possible.
● Batch jobs of up to 25 commands possible.
● Immediate consistency is available but not required.

Cons:
● Forces you to use Amazon S3 as well.
● No transactions or joins.

Amazon Elastic MapReduce (EMR)

EMR is an implementation of Hadoop offered as software-as-a-service. Amazon says the following about its own product:

“Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).” - Amazon.com[17]

Pros:
● You pay only for the resources actually used.
● Integrates well with Amazon EC2.

Cons:
● Only works in combination with Hadoop.

Apache Cassandra

Cassandra is a distributed database with (key, value) pairs and the option of subcolumns. These are indexed but stored in a distributed manner. This makes fast searching possible while still offering reliability in the event of data loss or machine failure. Writes are collected and flushed, which means data can be temporarily inconsistent.

The Cassandra website [18] describes Cassandra's limitations. Each row of the database must fit on a single machine, with a maximum of 2 billion columns per row; each column may be at most 2 GB in size. Cassandra is unsuitable for streaming. It can, however, be used in combination with applications on Amazon EC2[5]. The recommendation is to run it on EC2 L instances, on machines with at least 4 GB of RAM[19].

Complex operations must be split into smaller operations that can be executed independently. Transactions are possible, but only with a library called Cages [20].

Pros:
● Scales well.
● Reliable in the event of data loss or machine failure.
● Fast searching possible.
● Transactions possible.

Cons:
● No joins possible.
● No ready-made MapReduce capability.
● You must set up servers and configure Cassandra yourself.
● No option for immediate consistency of the data.

Google Datastore

Google Datastore is a form of distributed storage developed and offered by Google. The datastore is “schemaless”, meaning that the application using it is itself responsible for creating and maintaining the structure of the data [21]; the datastore does not enforce it. There are two variants: Master/slave and High replication [22].
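What “the application is responsible for the structure” means in practice can be sketched as follows. The example uses plain Python dictionaries and does not call the Datastore API; the schema `USER_SCHEMA` and the function `validate` are hypothetical names, and the field names are made up for illustration.

```python
# Hypothetical schema kept by the application, since the store enforces none.
USER_SCHEMA = {"name": str, "age": int, "interests": list}


def validate(entity, schema):
    """Return a list of problems; an empty list means the entity fits the schema."""
    problems = []
    for field, expected_type in schema.items():
        if field not in entity:
            problems.append(f"missing field: {field}")
        elif not isinstance(entity[field], expected_type):
            problems.append(f"wrong type for {field}")
    for field in entity:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


print(validate({"name": "Ann", "age": 30, "interests": []}, USER_SCHEMA))
print(validate({"name": "Ann", "age": "30", "hobby": "chess"}, USER_SCHEMA))
```

An application would run such a check before every write, because a schemaless store would accept both entities without complaint.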

Pros:
● Part of the Google App Engine.
● Distributed storage, and therefore redundant and safe.
● Transaction and query support.
● Scalable.
● Choice between the Master/slave structure and the High replication datastore.
● The High replication datastore has high uptime for data availability.
● “Schemalessness” gives enormous flexibility.

Cons:
● The Master/slave structure can lead to temporary unavailability of the data.
● High replication is roughly 3x more expensive in CPU and storage costs than Master/slave.
● High replication has higher read/write latencies than Master/slave.
● The “schemalessness” of the Datastore does not enforce relational integrity and places a lot of responsibility on the application itself.

Algorithms platform

The algorithms platform takes care of processing the large amounts of data and consists of two parts. The first part pre-processes the data: it creates search structures for the available data that can be searched efficiently. The second part takes care of searching the structured data.

Pre-processing

DECLARED CLASSIFIED BY NERVAL LIMITED

Searching

DECLARED CLASSIFIED BY NERVAL LIMITED

Algorithms

The platform must be able to recognise by itself which user characteristics are related and how heavily certain characteristics should count when suggesting new friends. The following kinds of algorithms could be used for this purpose: clustering algorithms, filtering and recommendation algorithms, latent Dirichlet allocation, pattern mining and classifiers.

Cluster algorithms

These algorithms cluster points that lie close together. Points within a cluster are then regarded as related to each other. You can define the distance function yourself, and thereby the criteria on which items are compared[30].
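As an illustration of clustering with a pluggable distance function, here is a minimal k-means sketch in plain Python (k-means is the algorithm described in [30]). The points, the starting centroids and the function names are hypothetical. Note that updating a centroid to the mean of its cluster is only strictly justified for Euclidean distance; with another distance function it is a heuristic.

```python
def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def k_means(points, centroids, distance, max_iterations=100):
    """Assign points to the nearest centroid and move centroids until stable."""
    clusters = []
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(point, centroids[i]))
            clusters[nearest].append(point)
        # New centroid = mean of the cluster; an empty cluster keeps its old centroid.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical users described by two numeric characteristics.
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, clusters = k_means(points, [(0, 0), (10, 10)], euclidean)
print(centroids)
print(clusters)
```

Swapping `euclidean` for another function changes the criteria on which users are considered similar without touching the rest of the algorithm.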

Filtering and recommendation algorithms

These use a large amount of data to make predictions about smaller parts of the data. For example, many people who like XFactor also like Popstars, so you can predict that an individual who likes XFactor may also like Popstars[31].
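The XFactor/Popstars example can be expressed as a very simple form of collaborative filtering. This is a sketch, not a production recommender: it scores each item a user does not yet like by how much overlap the user has with the people who do like it. The user names, the extra item "Idols" and the function `recommend` are hypothetical.

```python
from collections import Counter


def recommend(likes, user, top_n=3):
    """Suggest items liked by users who share at least one like with `user`."""
    own = likes[user]
    scores = Counter()
    for other, items in likes.items():
        if other == user:
            continue
        overlap = len(own & items)
        if overlap == 0:
            continue  # users without shared likes carry no information here
        for item in items - own:
            scores[item] += overlap
    return [item for item, _score in scores.most_common(top_n)]


# Hypothetical data: which user likes which shows.
likes = {
    "ann": {"XFactor", "Popstars"},
    "bob": {"XFactor", "Popstars", "Idols"},
    "cas": {"XFactor"},
    "dee": {"Chess"},
}
print(recommend(likes, "cas"))
```

Here "cas" only likes XFactor; both other XFactor fans like Popstars, so Popstars is ranked first.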

DECLARED CLASSIFIED BY NERVAL LIMITED

Pattern mining

These algorithms find frequently occurring patterns in data. This lets us discover which users have the same interests[33].
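A very small instance of pattern mining is counting which pairs of interests occur together in at least a given number of profiles. The sketch below does this by brute force in plain Python; it is not the parallel algorithm of [33], and the profile data and the function name `frequent_pairs` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(profiles, min_support):
    """Return the interest pairs that appear together in at least min_support profiles."""
    counts = Counter()
    for interests in profiles:
        # Sort so that a pair is always counted under the same key.
        for pair in combinations(sorted(set(interests)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical interest lists, one per user.
profiles = [
    ["jazz", "chess", "tennis"],
    ["jazz", "chess"],
    ["chess", "tennis"],
    ["jazz"],
]
print(frequent_pairs(profiles, min_support=2))
```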

Classifiers

After training, these algorithms can determine which subset of the data new data fits best. This lets us match new users right after they have signed up, before they have been processed in the database[34].

Development analysis

The company behind Simlike is a new start-up and therefore has no existing development methods yet. We have to establish these ourselves, which is why this aspect also receives attention in the orientation report. We look at tools and techniques that can ease and support development.

Development methodologies

From the university we have experience with the Waterfall method and Test-driven Development. We ourselves are also interested in Agile Development and SCRUM.

Waterfall method

A way of developing software in which one phase must be completed before the next can begin.

Pros:
● Phases can be clearly delimited and do not overlap in time.
● Accepted development method.

Cons:
● Not flexible.
● If the schedule for a release cycle is overrun, the consequences for the project can be far-reaching.
● A customer often cannot change their mind about the required features, which in a rapidly evolving market can lead to a product that is outdated by the time it is finished.

Agile Development

A development methodology in which features and maintenance are planned incrementally and iteratively by a “self-organizing” team, meaning that nobody dictates from above what must be done when within the team.

Pros:
● No management overhead thanks to self-organizing teams, so development can often proceed faster.

Cons:
● Unlike the waterfall model, which has central leadership, members of an Agile team must show initiative and take responsibility themselves.

Test-driven Development

Here the software is designed to be testable from the start. The idea is that testing is an integral part of the development process rather than an afterthought, which should benefit the quality of the product.

Pros:
● Better designed software.
● Higher software quality.

Cons:
● The tests have to be designed.

SCRUM

An incremental and iterative Agile development methodology with fairly short milestone cycles, called “sprints”, and a number of predefined roles:
● The “scrum master”: comparable to the project manager, but focused more on a single sprint.
● The “product owner”: represents the stakeholders and the company, where applicable.
● The “team”: the group that actually develops the product.

The idea is that in each sprint the team develops a deployable incremental update of the product. Which features go into a given sprint is decided during a so-called “sprint planning meeting”, held at the start of each sprint. A list of features is derived from the requirements, and for each sprint a few of these features are chosen for implementation. Since the requirements can still change, this list of features can change too, which makes the development process flexible.

Pros:
● Overrunning the schedule in a sprint has no far-reaching consequences if it can be absorbed in later sprints. In that case the problem is analysed and the feature is implemented in a later sprint.
● Fewer features to focus on per cycle.
● Flexible; acknowledges that customers can change their mind about what they want or need.
● Pragmatic approach.
● Nobody may change the set of features in a sprint once it has started.

Cons:
● Overlap between the different development stages, such as Requirements analysis and Functional Design.
● No proper prototype can be shown while the application foundation is being developed, even though strictly speaking that is a requirement.

Software tools

We investigated which tools could save us work and which could safeguard the quality of the product. In addition, some software is required by the client. No choice has yet been made about which software and/or programming language we will use; we will investigate this in the plan of approach. For now there is only an overview of potential candidates.

Facebook SDK

Facebook has its own API. For developing a Facebook app (one of the client's requirements) the Facebook SDK is a necessity. An example of how the SDK is used to request and use profile data can be found in [35]. The technical documentation for the API calls relating to Facebook users can be found in [36]. Facebook provides SDKs for several languages:

“JavaScript SDK[37] The JavaScript SDK enables you to access all of the features of the Graph API and Dialogs via JavaScript. It provides a rich set of client-side functionality for authentication and rendering the XFBML versions of our Social Plugins.
iOS SDK (iPhone & iPad)[38] The iOS SDK provides first-class Facebook Platform support for iPhone, iPad and iPod Touch apps written in Objective-C. You can utilize single-sign-on, call the Graph API and display Platform Dialogs.
Android SDK[39] Our Android SDK brings the Facebook Platform to the Android Platform (mobile & devices). You can use this SDK to add single-sign-on to your Android apps, invoke the Graph API and more.
PHP SDK[40] This SDK provides Facebook Platform support to your PHP-based web apps. This library helps you add Facebook Login and Graph API support to your Website.
Tools[41] We provide a variety of development tools that you can use to develop, test and monitor your app.” - developers.facebook.com[42]

Pros:
● We do not have to write the Facebook integration ourselves.

Cons:
● No alternative possible.

Integrated Development Environment (Eclipse)

Eclipse is an Integrated Development Environment (IDE) that supports development in Java, Python and PHP. Eclipse can be downloaded from its website [44].

Pros:
● Shows all packages in a clear overview.
● Gives suggestions while typing (auto-completion).
● Has a handy refactoring function.
● Many useful plugins.
● Support for the Google App Engine in the PyDev plugin.
● No costs.
● We all have experience with it.

Cons:
● None.

Software versioning (Git)

Git is a software versioning system. It lets us work on the same code on different computers and exchange the code we write in a structured way. On GitHub, $12 per month buys 10 private repositories that are available to 5 users[43].

Pros:
● Convenient way to share files.
● Repositories are easy to create.
● Everyone has their own copy, so nobody is hindered by other people's source code that does not compile.
● Distributed versioning makes it easier to work on parts of the system.
● Makes it easy to use a workflow that supports Scrum.

Cons:
● Steep learning curve; it is hard to figure out which command does what.

Build tools

To ease building, generating documentation and testing, we can have these run automatically using build tools. This helps us detect the moment at which errors are introduced. It can also save work: because everything happens automatically, the process does not have to be carried out by hand each time.

Maven

Maven is a build tool from Apache[45]. A review can be found in [46].

Pros:
● Does the dirty work for you. Convention over configuration: you need to write little to achieve a lot, especially compared with Apache Ant.

Cons:
● The idea of convention over configuration can be frustrating: Maven makes assumptions that may be wrong. With Ant you have to do more yourself, but Ant is simpler in design.

Hudson

Hudson can be used to build and test the software continuously[47].

Pros:
● Automatic builds and tests.
● Many plugins (e.g. Git).
● Builds can be requested via the web.

Cons:
● Requires Java Servlet/JSP support, which none of the web servers examined provides.

Buildbot

Buildbot is a system for automating the compile/test cycle by testing automatically right before and after each commit. As a result, errors in tested code are noticed sooner, which leads to less annoyance among fellow developers[48].

Pros:
● Automatic builds and tests.
● Clear interface.

Cons:
● Installation and configuration needed.

Bugtracker (Mantis)

Mantis is a bug tracker with which you can centrally keep track of who still has to do which work. For this reason we consider it good for communication. A plugin exists for Mantis that integrates with Buildbot, which increases the efficiency of the workflow.

Pros:
● Good overview of what needs to be done.
● Overview of who has to do what.
● Progress is easy to compare with the planning.

Cons:
● Difficult interface.

Programming languages

The following programming languages could be suitable, based on the existing solutions examined.

Java

Compatible with: Facebook SDK, Hadoop, Apache Mahout.

Pros:
● Broad support from both companies and F/OSS projects.
● Static type checking eliminates certain kinds of programming errors.
● Already known to all team members (no learning curve for the syntax).
● Support in the Google App Engine.

Cons:
● Code can become verbose.
● Static type checking leads to less flexibility.
● Relatively high initialisation costs.

Python

Compatible with: Apache, nginx, lighttpd, Facebook SDK.

Pros:
● Dynamic typing and duck typing.
● Functions are first-class objects.
● Simple syntax.
● Support for web programming, from both F/OSS projects and commercial parties.
● Support in the Google App Engine.

Cons:
● Not known to all team members.

PHP

Compatible with: Apache, nginx, lighttpd, Facebook SDK.

Pros:
● Dynamic typing.
● Designed for server-side programming.
● One team member has 10 years of expertise.
● Easy to find experienced staff to maintain PHP code.

Cons:
● So far no significant experience among 2 team members.

HTML5 / Javascript / CSS3

Compatible with: Facebook SDK.

Pros:
● Future-proof.
● Designed for client-side programming.
● Modern, desktop-like GUIs possible.
● Universal web client language.

Cons:
● Officially still under development.
● Not backwards compatible with HTML 4.
● HTML5 support not yet available everywhere.

HTML4 / Javascript / CSS

Compatible with: Facebook SDK.

Pros:
● Designed for client-side programming.
● Universal web client language.
● Supported on the vast majority of clients.

Cons:
● By now strongly suboptimal from a technical standpoint.

References

All sites were last accessed in April 2011.

[1] Google, “Why app engine?”, http://code.google.com/appengine/whyappengine.html, accessed April 2011
[2] Watson M., “GAE vs. AWS”, http://www.developer.com/services/article.php/3873286/Google-App-Engine-vs-Amazon-Web-Services-The-Developer-Challenges.htm, accessed April 2011
[3] Punt F., “Google App Engine, de voor en nadelen”, http://blog.finalist.com/2009/09/10/google-app-engine/, accessed April 2011
[4] Amazon, “Amazon EC2 spot instances”, http://aws.amazon.com/ec2/spot-instances/, accessed April 2011
[5] Gardner D., “Running Cassandra on EC2”, http://www.slideshare.net/davegardnerisme/running-cassandra-on-amazon-ec2, accessed April 2011
[6] Apache, “The Apache HTTP Server Project”, http://httpd.apache.org/, accessed April 2011
[7] Interfacelabs, “nginx”, http://interfacelab.com/nginx-php-fpm-apc-awesome/, accessed April 2011
[8] Fjordvald M., “Nginx Primer 2: From Apache to Nginx”, http://blog.martinfjordvald.com/2011/02/nginx-primer-2-from-apache-to-nginx/, accessed April 2011
[9] Lighttpd, “lighttpd fly light”, http://www.lighttpd.net/, accessed April 2011
[10] Wikipedia, “Appcelerator Titanium”, http://en.wikipedia.org/wiki/Appcelerator_Titanium, accessed April 2011
[11] Titanium, “Plans & Pricing”, http://www.appcelerator.com/products/plans-pricing/, accessed April 2011
[12] Apache, “Welcome to Apache Hadoop”, http://hadoop.apache.org/, accessed April 2011
[13] Amazon, “Amazon SimpleDB”, http://aws.amazon.com/simpledb/#highlights, accessed April 2011
[14] Amazon, “Amazon SimpleDB”, http://aws.amazon.com/simpledb/#sdb-vs-s3, accessed April 2011
[15] Amazon, “Amazon Elastic Map Reduce”, http://aws.amazon.com/elasticmapreduce/faqs/#gen-1, accessed April 2011
[16] Amazon, “Amazon SimpleDB”, http://aws.amazon.com/simpledb/#pricing, accessed April 2011
[17] Amazon, “Amazon Elastic MapReduce”, http://aws.amazon.com/elasticmapreduce/, accessed April 2011
[18] Apache, “CassandraLimitations”, http://wiki.apache.org/cassandra/CassandraLimitations, accessed April 2011
[19] Apache, “CassandraHardware”, http://wiki.apache.org/cassandra/CassandraHardware, accessed April 2011
[20] Williams D., “Locking and Transactions over Cassandra using Cages”, http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/, accessed April 2011
[21] Google, “What is Google App Engine”, http://code.google.com/appengine/docs/whatisgoogleappengine.html, accessed April 2011
[22] Google, “Choosing a Datastore”, http://code.google.com/appengine/docs/java/datastore/hr/, accessed April 2011
[23] Dean J. and Ghemawat S., “Google Research Publication: mapreduce”, http://labs.google.com/papers/mapreduce.html, accessed April 2011
[24] Apache, “MapReduce Tutorial”, http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Overview, accessed April 2011
[25] Apache, “Apache Mahout; Bringing Machine Learning to Industrial Strength”, https://cwiki.apache.org/MAHOUT/books-tutorials-and-talks.data/froscon., accessed April 2011
[26] Apache, “Introducing Apache Mahout”, http://www.ibm.com/developerworks/java/library/j-mahout/, accessed April 2011
[27] Apache, “Apache Mahout”, http://mahout.apache.org/, accessed April 2011
[28] Apache, “Algorithms - Apache Mahout”, https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms, accessed April 2011
[29] Apache, “Apache Lucene”, http://lucene.apache.org/java/docs/index.html, accessed April 2011
[30] Duda R. O., “k-means”, http://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/C/k_means.htm, accessed April 2011
[31] Wikipedia, “Collaborative Filtering”, http://en.wikipedia.org/wiki/Collaborative_filtering, accessed April 2011
[32] Wikipedia, “Latent Dirichlet allocation”, http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation, accessed April 2011
[33] Tanbeer, Ahmed, Jeong, “Parallel and Distributed Frequent Pattern Mining in Large Databases”, http://www.computer.org/portal/web/csdl/doi/10.1109/HPCC.2009.37, accessed April 2011
[34] Wikipedia, “Classifier”, http://en.wikipedia.org/wiki/Classifier_(mathematics), accessed April 2011
[35] Mehta V., “How-to: use the Graph API to pull the movies friends like”, http://developers.facebook.com/blog/post/481, accessed April 2011
[36] Facebook, “user - Facebook Developers”, http://developers.facebook.com/docs/reference/fql/user/, accessed April 2011
[37] Facebook, “Javascript SDK”, http://developers.facebook.com/docs/reference/javascript/, accessed April 2011
[38] Facebook, “facebook-ios-sdk”, http://developers.facebook.com/docs/reference/ios, accessed April 2011
[39] Facebook, “facebook-android-sdk”, http://developers.facebook.com/docs/reference/android, accessed April 2011
[40] Facebook, “facebook-php-sdk”, http://github.com/facebook/php-sdk/, accessed April 2011
[41] Facebook, “Tools - Facebook Developers”, http://developers.facebook.com/tools, accessed April 2011
[42] Facebook, “Facebook Developers”, http://developers.facebook.com, accessed April 2011
[43] Github, “Plans & Pricing”, https://github.com/plans, accessed April 2011
[44] Eclipse, “Eclipse Downloads”, http://www.eclipse.org/downloads/, accessed April 2011
[45] Apache, “Apache Maven Project”, http://maven.apache.org/, accessed April 2011
[46] Siddiqi Z., “The pain of migrating from Ant to Maven”, http://weblogs.java.net/blog/zarar/archive/2006/12/the_pain_of_mig_1.html, accessed April 2011
[47] Hudson, “Meet Hudson”, http://wiki.hudson-ci.org/display/HUDSON/Meet+Hudson, accessed April 2011
[48] Buildbot, “Buildbot”, http://trac.buildbot.net/, accessed April 2011

Nerval Limited Code Style

Introduction
Code style
Javadoc
Naming conventions
Code size restrictions
Layout
Whitespaces
Blocks
Type-specific rules
Classes
Imports
Methods
Statements
Variables and fields
Avoiding well known problems
Metrics
Code duplication

Introduction

This document describes the code style that developers at Nerval Limited should adhere to.

Code style

This section describes the code style rules that should be adhered to when developing code for Nerval Limited.

Javadoc

Javadoc is important for (future) developers to understand the code and for users of any external API to be able to work with it quickly. Therefore two sets of Javadoc API documentation should be created: one for people who want to use the external (public) API and one for maintenance and reference during development.

The first (user API) will have to contain all public fields and methods. Get and Set methods created with the Lombok project [LOM] do not need to contain a descriptive text in the Javadoc. Any public method that is not created by this project (including Get and Set methods) should contain a descriptive text and all parameters and return values should be documented. A package needs a package-info.java file, containing the description of that package. All public fields should have a Javadoc description.

The latter (developer API) will extend the user API. All methods need to have a Javadoc description and all parameters and return values should be documented. All fields used outside the class should have a Javadoc description.

Naming conventions

The use of naming conventions will create some form of uniformity, making it easier to read, and therefore maintain, all code of Nerval Limited.

Packages should start with one or more lowercase letters, possibly followed by strings consisting of lowercase letters, numbers and underscores. These strings should start with either a lowercase letter or an underscore and need to be separated from other strings by a dot.

Classes start with a capital and can be followed by letters and numbers. Abstract classes should begin with ‘Abstract’ or end with ‘Factory’. No non-abstract classes are allowed to start with ‘Abstract’. Interfaces should begin with a capital ‘I’, followed by a normal class name. This means interfaces will always start with two capitals.

Any type parameter can consist of one capital or two capitals if the use of a single capital is not descriptive enough.

Constant names can only contain capitals and numbers, possibly separated by an underscore.

Variables and methods should start with a lowercase letter, which can be followed by one or more letters, either lower- or uppercase, and numbers. No more than one uppercase letter can appear in a row. Methods of JUnit test cases should be named properly (e.g. setUp()). Not naming a JUnit method properly might lead to it not being executed.
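The naming rules above can be expressed as regular expressions. The following is a minimal sketch, not the actual Checkstyle configuration of Nerval Limited: the class name NamingRules and the exact patterns are assumptions derived from the wording of this section.

```java
import java.util.regex.Pattern;

/** Illustrative checks for the naming rules; all patterns are assumptions. */
final class NamingRules {
    /** Package: lowercase start, then dot-separated parts of lowercase letters, digits and underscores. */
    static final Pattern PACKAGE = Pattern.compile("^[a-z]+(\\.[a-z_][a-z0-9_]*)*$");
    /** Interface: a capital 'I' followed by a normal class name. */
    static final Pattern INTERFACE = Pattern.compile("^I[A-Z][a-zA-Z0-9]*$");
    /** Constant: capitals and digits, optionally separated by single underscores. */
    static final Pattern CONSTANT = Pattern.compile("^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$");
    /** Variable or method: starts lowercase and never has two uppercase letters in a row. */
    static final Pattern MEMBER = Pattern.compile("^[a-z][a-z0-9]*([A-Z][a-z0-9]+)*[A-Z]?$");

    private NamingRules() {}

    /** Returns true if the whole name matches the given rule. */
    static boolean matches(Pattern rule, String name) {
        return rule.matcher(name).matches();
    }
}
```

Under these assumed patterns, a name such as getUserId is accepted as a method name, while getUserID is rejected because it contains two uppercase letters in a row.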

Code size restrictions

To keep the classes and methods readable and understandable, their size will be restricted. The maximum length of a line is 110 characters, so the code can be displayed on most computer screens without having to scroll horizontally.

Methods can be at most 150 lines or 50 statements and there should be no more than 100 methods per class. If a method is bigger, it probably does too much and some of this work should be delegated to other methods. If more methods are present in a class, this class probably does too much work and some of this work should be done by other (existing or new) classes. A method can have at most 7 parameters. Having more means the data should probably be covered by an abstraction layer. Parameters are not allowed to be assigned a value. Also, to keep the control flow easy to understand, at most 2 return statements can be present in a method (equals methods excluded).

Anonymous inner classes can have a maximum length of 30 lines.

Layout

To improve readability and create some form of uniformity amongst all code of Nerval Limited, some rules about the layout of the code are determined. Some of these rules increase the readability of the code, others are purely aesthetic choices, but they should still be followed for uniformity.

Whitespaces

No spaces are allowed between a method's name and the parentheses. The same holds for a constructor. In methods and constructors, no space should be placed immediately after the left parenthesis or immediately before the right parenthesis. This whitespace is allowed in statement parentheses (e.g. for, if, while) if the developer thinks doing so will increase the readability of the (loop) condition. A space is required after a typecast. Also, no spaces are allowed around ‘<’ and ‘>’ when they are not used as an operator (e.g. new List<T>()), except to separate words (e.g. List<T> name).

Spaces are required after a comma and semicolon and around all operators and keywords. Curly braces should typically be surrounded by whitespaces, unless in an empty method (e.g. ‘public void methodName() {}’).

Blocks

Blocks should not be empty (e.g. empty if-statements) and should be surrounded by curly braces, even when there is only one statement inside the block (e.g. one statement inside an if-statement). The left curly brace should be placed at the end of a line and the right curly brace should be placed on a new line. Keywords like else are allowed on the same line after the right curly brace. An exception is made for single statement blocks. There the curly braces can be on the same line, but the right curly brace should end the line (e.g. ‘if (bool) { doSomething(); }’).

Type-specific rules

This section contains the general rules for classes, imports, methods, statements and variables.

Classes

Every file can contain at most one outer type (e.g. class). This makes the code more readable. To enforce a proper file architecture for the product, every class should have a package name corresponding to the directory it is in. A class should start with its fields, ordered from public to private, followed by the constructors, then the methods and finally the inner classes.

Classes with only static methods in their API should not have a public constructor. There is no need to instantiate a utility class, so it should not be possible. Classes with only private constructors should be final, because they can’t be extended properly and so it shouldn’t be possible.
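As an illustration of the rule for utility classes, the sketch below shows a final class with only static methods and a private constructor. The class name TextUtil and its method are hypothetical and do not come from the Simlike code base.

```java
/** Hypothetical utility class; the name and method are illustrative only. */
final class TextUtil {

    /** Private constructor: a class with only static methods is never instantiated. */
    private TextUtil() {}

    /**
     * Checks whether a text is null or contains only whitespace.
     *
     * @param text the text to check, may be null
     * @return true if the text is null or blank
     */
    static boolean isBlank(String text) {
        return text == null || text.trim().isEmpty();
    }
}
```

Because the only constructor is private, the class cannot be extended from outside, so declaring it final makes that restriction explicit.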

Interfaces should define a type, so Interfaces that serve as a bag of constants are not allowed.

Exceptions should only contain final fields. It is undesirable to change the fields of an exception, since the cause for the exception could get blurred.

Imports

Importing unused code is unnecessary and does not add to a clean, readable code style. Therefore no star, unused or redundant imports should be used. Also, no static imports should be used, because they blur the line between what is a local method and what is not.

Methods

Apart from the size and name restrictions on methods, the use of the throws keyword is restricted as well. To keep the contract of the method clear, throws should only reflect exceptions that can actually occur and a method should not throw a subclass of an exception that is also thrown by the method. Throwing or catching exceptions that are too general (Exception, RuntimeException, Error or Throwable) is not allowed. Using exceptions that are too general can cause specific exceptions to be handled incorrectly. Also, no more than 2 throws per method are allowed. Having more only clutters the contract.

Furthermore, the clone method is not allowed to be overridden and no finalize method is allowed. Both methods have a complicated contract and are usually not needed.

The equals method also has some extra restrictions. When implementing an equals method with a more specific parameter type (e.g. equals(MyClass)), the equals(Object) method should be overridden as well. This prevents problems where the default equals(Object) method is used when the custom method was supposed to be used. When overriding the equals method, the hashCode method should be overridden as well. This is required for hash-based collections to work properly with the object. Calls to the equals method should be made in a way that prevents null pointer exceptions as much as possible (e.g. by calling equals on a constant rather than on a variable that may be null).
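The equals rules can be illustrated with a small value class. This is a sketch under assumptions: the class LikeId and its fields are hypothetical, and java.util.Objects requires Java 7 or later.

```java
import java.util.Objects;

/** Hypothetical value class used to illustrate the equals rules. */
final class LikeId {
    private final String source;
    private final long number;

    LikeId(String source, long number) {
        this.source = source;
        this.number = number;
    }

    /** Overrides equals(Object), so collections and direct callers use the same comparison. */
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof LikeId)) {
            return false; // also covers other == null
        }
        LikeId that = (LikeId) other;
        // Objects.equals is null-safe, so a null source does not cause a NullPointerException.
        return this.number == that.number && Objects.equals(this.source, that.source);
    }

    /** Overridden together with equals, so that equal objects share a hash code. */
    @Override
    public int hashCode() {
        return Objects.hash(this.source, this.number);
    }
}
```

When comparing against a known value, calling equals on the constant (e.g. "facebook".equals(source)) avoids a null pointer exception if the variable turns out to be null.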

Statements

To increase the readability of the code, only one statement per line is allowed and empty statements (i.e. just a semicolon) are not allowed.

For- and if-statements can be nested 2 levels deep. Try-statements can only be nested 1 level deep. If more levels of nested statements are used, the method will be too complex to be readable. If more levels are needed, the motivation for this decision should be documented.

Switch statements should always contain a default case. This improves maintainability, as an unforeseen value (e.g. a newly added enum constant) will not cause the switch to be skipped silently. If the developer is sure all cases are covered, the default case can contain an assert to make sure. The default case should always come last for readability. Cases in switch statements should end with a break. When a break is not present, a comment ‘//falls through’ should be added to indicate that the fall-through is intended.
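A switch statement that follows these rules could look like the sketch below. The enum Level and the class PrivacyLevels are hypothetical names chosen for illustration; the default case throws a specific exception here, which is one possible choice next to the assert mentioned above.

```java
/** Hypothetical example of a switch that follows the rules above. */
final class PrivacyLevels {

    /** Illustrative enum; not part of the Simlike code base. */
    enum Level { PUBLIC, FRIENDS, PRIVATE }

    private PrivacyLevels() {}

    /** Describes a level; every case ends with a break or is marked as falling through. */
    static String describe(Level level) {
        String description;
        switch (level) {
            case PUBLIC:
                description = "visible to everyone";
                break;
            case FRIENDS:
                //falls through
            case PRIVATE:
                description = "restricted";
                break;
            default:
                // Comes last and guards against constants added later.
                throw new IllegalArgumentException("Unknown level: " + level);
        }
        return description;
    }
}
```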

Variables and fields

Local variables should not have the same name as fields, to prevent confusion about which field is being used. For the same reason, fields should be preceded by the this keyword when they are used. Variable assignments and declarations should be placed on a separate line to increase readability.

All public fields should be static and final to avoid setting them directly from other classes. A set method should be used instead.

Duplicate String literals are not allowed. Using them probably means they should be defined as a constant. Empty strings are excluded from this rule.

Also, the index of a for loop should not be changed within the loop to increase the readability.

Avoiding well known problems

To avoid some well known problems when using Java, some rules have been determined. These rules are checked by the Checkstyle configuration to help developers detect bugs before they are triggered.
● No double checked locking (DCL) is allowed, because DCL is easy to get wrong in Java: it is only correct when the shared field is declared volatile (Java 5 and later).
● Numbers other than -1, 0, 1 and 2 should be constants or enums of some sort. Using a number other than -1, 0, 1 and 2 usually means you missed a constant or used a number where enums are more appropriate. In some cases, like some for-statements that just need to run a fixed number of times (not related to a parameter), the use of a ‘magic number’ can be justified. If this is the case, a comment should be made, specifying why the use of the number is justified.
● Boolean expressions should be simplified if possible to make them easier to understand.
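One common way to get lazy initialization without double checked locking is the initialization-on-demand holder idiom, sketched below together with a named constant replacing a magic number. The class ConfigRegistry and the constant MAX_RETRIES are hypothetical; this style guide does not prescribe this particular idiom.

```java
/** Hypothetical lazily created singleton that avoids double checked locking. */
final class ConfigRegistry {
    /** Named constant instead of a 'magic number' in the code (illustrative value). */
    static final int MAX_RETRIES = 5;

    private ConfigRegistry() {}

    /** Returns the single instance; it is created on the first call. */
    static ConfigRegistry getInstance() {
        return Holder.INSTANCE;
    }

    /** The JVM initializes Holder on first use, and class initialization is thread-safe. */
    private static final class Holder {
        private static final ConfigRegistry INSTANCE = new ConfigRegistry();
    }
}
```

The idiom relies on the class initialization guarantees of the Java language, so no explicit locking or volatile field is needed.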

Metrics

To assure the quality of the code and its design, the code will have to comply with some specified metrics.

Fanout/fanin

When classes depend on a lot of other classes, or a specific class is used by a lot of other classes, the system becomes tightly coupled. This is bad for maintainability, so the following restrictions are set:
● A class can instantiate at most seven different classes.
● A class can depend on at most twenty different classes.

Complexity

To keep the code readable and more maintainable, methods should have a cyclomatic complexity of at most seven and no more than 200 possible execution paths.

Code duplication

Code duplication should be avoided, as it is usually a sign of bad architecture. Checkstyle offers a simple copy-paste check, which can help developers. For big projects, an external code duplication detection tool (like CCFinder or Simian) can be used that does more than just simple copy-paste detection.

Project description

The project consists of designing a social platform linked to Facebook. The project has two parts. The first part is designing efficient algorithms that can process large amounts of data from Facebook users. The second part consists of implementing these algorithms in a Facebook Application. There is a possibility of also building a mobile phone application linked to the Facebook Application; this depends on the time it takes to implement the Facebook Application and on the number of people participating in the project.

Algorithms

The intention is that Facebook users make their profile data available to the social platform. Based on the available data, users will be suggested to each other. The idea is that a prediction has to be made about the potential success of a friendship, based on shared interests, education, work, games, etc. The geographical location and age of people also play a role. The users do not need to know each other yet. The aim is to meet new like-minded friends via Facebook, but the approach can also be applied to existing friendships to see to what extent an existing friendship is based on like-mindedness.

To suggest Facebook users to each other, an algorithm must be designed. Based on the available data, this algorithm will calculate a “success percentage”, which indicates how likely it is that the first contact will result in a friendship. A Facebook user domain of millions of users has to be taken into account. In other words, the time and space complexity of the algorithm must be as low as possible. The algorithm must also be scalable; for this reason it is required to be parallelizable. An additional wish is that a searching user can specify preferences that influence the search result. To what extent this is feasible in combination with the aforementioned requirements will have to be investigated.

Outline of the algorithm: the algorithm consists of two parts. One part is a search algorithm, which searches through the user data to find potential friendships. The other part is an initialization algorithm, which will create the search structures that the search algorithm will use. These can be anything: search trees, (weighted) graphs, etc. It is up to the project team itself to design these algorithms.

Facebook Application

The algorithm must ultimately be implemented as a user-friendly Facebook Application. Users can then sign up with their Facebook profile. This makes the required data available, and the application can suggest a set of potential friends. For each suggested friend, a percentage is shown that indicates how good the match is (on a certain scale), along with where this person is located geographically and the number of friends this person already has.

Users can then send each other messages to get to know each other and send each other a friend request. The end result must also match the wishes of the end users. Among other things, aspects such as privacy will have to be considered. Is it desirable for people's real names to appear in search results, should names be omitted, should names be changed, or can the user configure this themselves?

Mobile phone application

A mobile phone application can add a new dimension to the social platform. On the phone, one could search for potential friends in the immediate vicinity, for example people in the same city, stadium, nightclub, etc. This is based on real-time location tracking of the users of the phone application who are active at that moment and on the “success percentages” between users. When a user wants to approach a suggested friend, they can send the other person an instant message containing a request to meet (the user can customise this message).

Bachelorproject Code Analysis

Volker Lanting

July 7, 2011

Contents

1 Summary
2 Introduction
3 API - Java
    3.1 Sonar
        3.1.1 Cohesion
        3.1.2 Complexity
        3.1.3 Duplication
        3.1.4 Dependencies
        3.1.5 Test coverage
    3.2 PMD
    3.3 Checkstyle
    3.4 Other refactoring
4 Facebook Application - JavaScript
    4.1 Rule violations
5 Discussion

1 Summary

We analyzed our code using Sonar, PMD and Checkstyle. Our Software Improvement Group maintainability rating was 3.5 stars. The main improvements lie in test coverage and cutting dependencies. For the JavaScript code, the main improvements are test cases and code style.

2 Introduction

In this report we will analyse the code that we made during our bachelor project. This was supposed to be done by the Software Improvement Group (SIG), but because of trouble with non-disclosure agreements we did not get to make use of their service. Therefore we will thoroughly analyse our own code. We used the SIG plugin for Sonar to determine our maintainability rating. It averaged 3.5 stars, mainly because of the lack of test coverage.

Several automated analysis tools were used to indicate where possible problems might lie. The indicated areas were then manually inspected to see whether there really is a problem and, if so, how it could be solved. Our code consists of three main parts in different languages and therefore different tools were used for the analysis.

• API
The API is programmed in Java. Since there are a lot of code metric tools for Java, we were able to analyse the code quite thoroughly. We used the following programs:

– Sonar
– PMD
– Checkstyle

• Facebook application
The Facebook application was made in HTML and JavaScript. This makes unit testing quite hard, and we were only able to find one tool for the code metrics. This tool is Sonar, but it was not able to cope well with our object-oriented JavaScript model. However, it still gave quite a few good pointers.

• Mobile application
The mobile application was made in MOBL. It is still a new language, so we did not have any way to test or analyze our code automatically. We decided not to discuss the mobile application in this report, since it is not ready for commercial use yet. If the company decides to continue development of the mobile application using MOBL, we suggest close (manual) monitoring of the code until automatic analysis tools become available.

3 API - Java

The API offers a way for applications, such as our Facebook application, to access the information that we have stored in our database. It consists of three layers:

• Servlets
The servlets offer a way for applications to access our data over the internet.

• Data control
The data control layer consists of several low-level controllers. Each controller talks directly to the database (using Hector) and is responsible for the control of a single source of information or conceptual object (e.g. data from Facebook or data belonging to a user).

• Business logic
The business logic is the code that retrieves the ‘raw’ data from the data controllers and transforms it to a meaningful format so the servlets can pass it along to the user.

We used the PMD and Checkstyle plugins for our development environment, Eclipse. This made it easier to monitor our code quality during development. It resulted in quite a steady level of ‘rule compliance’, as can be seen in Figure 1.

3.1 Sonar

Sonar runs tests and analyses code based on a whole range of metrics. It generates visual representations of these metrics and a list of rule violations. The list of violations is useful to spot possible problems directly. The visual representations of the metrics can be used to identify pieces of code that have a higher risk of containing problems or are badly designed. These pieces of code can then be checked manually to see if there really is a problem.

We used the SIG plugin for Sonar. This plugin uses the data that Sonar provides to calculate how the code scores in more general areas. The current status of the code can be seen in Figure 2.

Figure 1: A graph representing the progress on the complexity, rule compliance and test coverage during the last couple of days.

Figure 2: The current state of our code in the Sonar plugin of SIG

In the next sections we will discuss the metrics that Sonar uses and which pieces of our code ‘violated’ these metrics and why.

3.1.1 Cohesion

Sonar checks for low cohesion between methods. This could indicate that we put too much functionality in a single class. We discovered that using a class-specific Logger in all methods of a class actually fools the cohesion check by making methods look more cohesive than they are. Therefore it is important to keep this in mind when looking for possible low cohesion.

These cases of low cohesion were found:

• DatabaseController
The DatabaseController is a broad abstraction that implements a lot of business logic for the servlets. Because of this it has no clear functionality and there is low cohesion between the methods. The output of Sonar for this metric on the DatabaseController can be seen in Figure 3. This abstraction should be replaced by the smaller abstractions provided by the controllers. For every source of data and ‘conceptual object’ there should be a controller, like the FacebookDataController for data from the source Facebook and the UserController for the conceptual object User. These controllers abstract the database implementation and can therefore be used directly by the servlets. If the database changes, the controllers can simply be replaced by other controllers with the same interface towards the business logic layer.

Figure 3: The cohesion of the Database controller. It shows two connected components, which is caused by the fact that the class contains business logic for several servlets. Some of these servlets use the UserController and others the SimlikeController.

• MatchController
Just like the DatabaseController, this class contains business logic that should not all be handled by one class. It can be refactored by putting the logic into the servlets themselves, or by splitting the functionality over several classes.

• Controllers
The primary function of the controllers is to abstract reading, storing and deleting data from the database. These are actually three different tasks, which is the cause of the low cohesion. The metrics are just indicators of possible problems and do not necessarily mean the code is wrong. In the case of the controllers we feel their primary function is specific enough and therefore we will not refactor these classes.

• JabberPasswordGenerator
The JabberPasswordGenerator generates a random password consisting of a set of allowed characters. These characters are created in the constructor instead of being given as a constant. The low cohesion of this class is caused by the generation of the allowed characters. We will make constants of them.

• Servlets
All servlets have a method that is used by Shiro to check if someone who calls the servlet is authorized to do so. This method has no relation to the functionality of the servlet, causing low cohesion. This is not a problem; it is simply a result of the use of Shiro.

• Realms
We use Shiro for authentication and role-based authorization. To do this, Realms had to be created. They have low cohesion, but are implemented according to the interface of Shiro and in our opinion there is nothing wrong. Therefore we will not refactor them.

• CassandraConfig
The CassandraConfig has low cohesion, because the loadConfig method has no ‘relation’ to the other methods. This is actually not completely true, since the CassandraConfig represents a specific instance of a database configuration. The loadConfig method is called only by the constructor and it instantiates the particular configuration, making it a perfectly valid design.
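The refactoring proposed for the JabberPasswordGenerator, defining the allowed characters as a constant instead of building them in the constructor, could look roughly like the sketch below. This is not the project's actual class: the class name, the character set and the use of SecureRandom are assumptions made for illustration.

```java
import java.security.SecureRandom;

/** Sketch only: not the project's actual JabberPasswordGenerator. */
final class PasswordGeneratorSketch {
    /** Allowed characters as a constant instead of being built in the constructor (assumed set). */
    private static final String ALLOWED_CHARACTERS =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    private final SecureRandom random = new SecureRandom();

    /** Generates a random password of the given length from the allowed characters. */
    String generate(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must not be negative");
        }
        StringBuilder password = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int index = this.random.nextInt(ALLOWED_CHARACTERS.length());
            password.append(ALLOWED_CHARACTERS.charAt(index));
        }
        return password.toString();
    }
}
```

With the characters defined as a constant, the constructor no longer needs any set-up logic, which removes the unrelated work that lowered the cohesion.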

3.1.2 Complexity

The complexity of a piece of code is determined by the number of possible execution paths through the code. High complexity means an increased risk of faults and code that is hard to read. However, high complexity cannot be avoided in non-trivial tasks. What is important is spreading this complexity amongst classes and methods.

The Cyclomatic Complexity (CC) of a method is considered to be quite high when it is above 10. We had one method with CC 12. This method can be seen in Figure 4. We feel it is not a problem, since most of the complexity comes from the catch clauses. They do not make the code much harder to read, nor should the functionality of the method be split into smaller parts just because a lot of Exceptions might be thrown. We made some minor adjustments to other classes to lower the complexity some more. The result is visible in the blue line in Figure 1.

Figure 4: The loadConfig method of the CassandraConfig class has a cyclomatic complexity of 12. This is caused by one if-statement and six catch clauses (6 × 2 = 12).

3.1.3 Duplication

Duplicate pieces of code usually mean the code should be put into a method and shared amongst the methods/classes that have duplicate code. We have two classes that have quite a big piece of duplicate code: GetAllUserInformation and GetAllSimlikes.

Both classes extend the AbstractSimlikeServlet. This class does the authentication and authorization for most servlets. For authorization, the method requiredPermissions is required. However, the two classes do authorization based on individual permissions instead of the standard role-based authorization. Therefore the methods return an empty list of permissions, resulting in duplicate code.

This problem can be solved by making two subclasses of AbstractSimlikeServlet, one requiring role-based authorization and one which does not. However, GetAllUserInformation will be removed when we refactor GetUserInformation (see Section 3.4). Since this would only leave GetAllSimlikes, the duplication will be gone and a subclass would be overkill.

3.1.4 Dependencies

A high number of dependencies means the piece of code is used in many places in the application. This is bad for maintenance, since a changed interface might lead to many changes. It also makes the system more complex. We will discuss the cyclic dependencies and the packages with the highest number of dependencies.

• MessageController and MessageMapController
These classes are in the same package, but depend on each other. This is caused by poor separation of responsibilities. The MessageMapController should only be concerned with the maps where a message can be stored, not with the messages themselves. However, at the moment it contains methods to store and retrieve messages from maps as well. We will move these methods to the MessageController. This way the MessageController will depend on the MessageMapController, but not the other way around.

• SimlikeRealm and SimtokenController
The SimlikeRealm needs information of Simtokens to authenticate a Subject. Therefore the dependency from the realm to the controller is correct. However, the SimtokenController has a method that returns an AuthenticationInfo object. This method is only called by the SimlikeRealm and gets most parameters from the SimlikeRealm. This dependency should not exist and it is easily refactored by renaming the method to getId and making it return the id of the Simtoken. The SimlikeRealm can then create the AuthenticationInfo object itself.

• Simtoken and SimlikeRealm
Simtokens and the SimlikeRealm are tightly coupled. We need a reference in the SimlikeRealm which says it is possible to authenticate to the realm with a token of class Simtoken. This dependency is forced by Shiro (see the Realm.supports method). However, the Simtoken.getPrincipal method returns a reference to the SimlikeRealm’s class name. This dependency is not necessary and it could be removed by changing the Simtoken.getPrincipal method to return something other than a PrincipalCollection. This would also make it possible to authenticate to other realms with the Simtoken. It is recommended to extend the UsernamePasswordToken of Shiro in all token classes.

• UserController and ProfilePrivacySetting
For individual permission checking, as opposed to the role-based authorization of Shiro, we use ProfilePrivacySettings. These settings are retrieved from the database and ask the database for information. This means the ProfilePrivacySettings are tightly coupled to the database controllers. This can and should be avoided. There should be a class that retrieves the privacy settings and verifies if the caller is authorized. This would solve the cyclic dependency and make the code easier to understand.

• Servlets and Permissions
Servlets use role-based authorization to check if the caller is authorized. For each servlet a permission constant is made. The servlets need to check whether the caller has the appropriate permission, so we cannot avoid the cyclic dependencies between the servlets and the Permissions class.

• High number of dependencies
The only package with a high number of dependencies is the exception package. It is quite normal for exceptions to be used by a lot of classes, so it needs no refactoring.

3.1.5 Test coverage

Automated tests are important for the maintainability and analyzability of the system. Most classes can be covered with JUnit test cases, especially since we use Mockito to mock objects. We started off well with over 75% coverage, but because of approaching deadlines the testing got pushed back. This resulted in an overall coverage of only 54%. Although we have done plenty of ‘monkey testing’, it is no replacement for automated unit tests, which should definitely be created before new functionality is implemented.

Also, the template files we use for testing should be replaced by corresponding objects. We already have Message and Simlike objects and this approach should also be used for Users. It would make changes to the JSON format easy to maintain and make the code easier to read.

3.2 PMD

PMD checks code for style problems, common programming mistakes and code metrics. Due to its integration in our development environment, Eclipse, it was easy to monitor possible problems.

We identified the following problems thanks to PMD:

• We still have a lot of quickly constructed, unused and unneeded test/dummy classes. They should be deleted.

3.3 Checkstyle

Checkstyle checks the code for style problems and some simple metrics like class fan-out, duplication and unit size. In the latter two checks Checkstyle is outperformed by Sonar.

We identified the following problems thanks to Checkstyle:

• The UserController class has a big data abstraction coupling and fan-out. The cause of this is the fact that it does too much work. Its function should be to store/retrieve/delete user related data from the database, but it also contains methods that modify Facebook data (we have a FacebookDataController for that). It also accesses the user’s privacy settings, which could be done in a class specifically created for that purpose.

• The SimlikeController class has a big fan-out. The primary function of the SimlikeController is to store/retrieve/delete data corresponding to the likes of a user. It currently also converts Facebook JSON formatted likes to our representation of likes. This should be done by a specialized class in the business logic layer, so it will be easily maintainable when the Facebook API changes. Making this refactoring will solve the high fan-out.

3.4 Other refactoring

After checking all Sonar, PMD and Checkstyle violations we discovered quite a few possible code improvements. We will list them here for future developers:

• Input checking
In the token classes we stored an array directly in the constructor. This makes it possible for other classes to change the state of our object. It should be copied first. For now the API is closed and the servlets only pass safe values. This means the data will come from a trusted source, but for future changes it is important that input is copied if it is mutable and comes from an untrusted source.

• Constants
Especially in the UserController we need more private constants. This increases maintainability.

• GetUserInformation and OneToOneMatching
These servlets were made with the intention of replacing the GetBasicUserInformation, GetAllUserInformation, SimpleOneToOneMatching and ExtendedOneToOneMatching servlets. The other servlets should actually be deleted and their logic placed in the GetUserInformation and OneToOneMatching servlets.

• FacebookDataController, JabberDataController, UserExternalDataController
The UserExternalDataController is database implementation specific and should therefore be removed. The FacebookDataController and the JabberDataController should only contain methods specific to the storage/retrieval/deletion of their source specific data.

• Simlike
The Simlike field entryDate should actually be used. It can be retrieved, but is not used in any other classes.

11 • Simtoken The simtoken should be set to expire after a certain period of time. This way we limit the possible damage if someone steals someone elses Simlike ID and simtoken combination.
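The input-checking and simtoken recommendations above can be illustrated with a minimal sketch. The class name SimToken, its fields and the millisecond-based lifetime are assumptions made for this example; the actual token classes may be structured differently.

```java
import java.util.Arrays;

/** Hypothetical token class illustrating defensive copying and expiry. */
final class SimToken {
    private final byte[] value;
    private final long expiresAtMillis;

    SimToken(byte[] value, long issuedAtMillis, long lifetimeMillis) {
        // Copy the mutable array so later changes by the caller
        // cannot alter the state of this object.
        this.value = Arrays.copyOf(value, value.length);
        this.expiresAtMillis = issuedAtMillis + lifetimeMillis;
    }

    /** Returns a copy, so callers cannot modify the internal array. */
    byte[] getValue() {
        return Arrays.copyOf(value, value.length);
    }

    /** The token counts as expired once the given time reaches the expiry time. */
    boolean isExpired(long nowMillis) {
        return nowMillis >= expiresAtMillis;
    }
}
```

A servlet could then reject any request whose token reports isExpired(System.currentTimeMillis()), which bounds the period in which a stolen Simlike ID and simtoken combination can be abused.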

4 Facebook Application - JavaScript

The code for the Facebook application was written in JavaScript. We used Sonar to analyze our code. The results and maintainability rating of our JavaScript code are displayed graphically in Figure 5. We scored shockingly low on rule compliance, but a quick inspection showed that this was due to the inability of Sonar to process the object-oriented JavaScript that we had spread out over several files. For example, the Simlike object is stored on the same server as the API and the Simlike.Gui.FbApp object is stored on the same server as the Facebook application. This is done to prevent cross-site scripting errors. However, Sonar was unable to find the object Simlike in Simlike.Gui.FbApp and therefore did not progress beyond any defined scopes (i.e. with-statements). Some valid violations were still found and we will discuss them in the following section.

4.1 Rule violations

The following violations were found and need to be addressed:

• Checking identity
Instead of using the normal ‘Java’ style comparison operators (e.g. ==, !=) we should use the identity operators (e.g. ===, !==), which also check that both operands are of the same type.

• Wrapping for loops
The body of a for-in loop could be wrapped in an if statement to avoid processing unwanted properties of an array. In our case this is not required, but it is good to keep this in mind for future implementations as well.

• Code style
Although we paid great attention to a clean and uniform code style in the Java code, we did not do the same in the JavaScript code. Improvements include the line length, semicolons after function declarations, use of whitespace and the wrapping of immediately executed functions in parentheses.

Figure 5: The results of the SIG plugin for Sonar for the JavaScript code of the Facebook application.

5 Discussion

Although there are plenty of possible improvements, we are quite content with the quality of our code. We scored 3.5 stars on the SIG maintainability ranking and this can be boosted to 4.5 by increasing test coverage. Although we have no time to improve the code now, we have given quite extensive and detailed recommendations. Most important is the creation of test cases for most of the uncovered code. The JavaScript code could not be analyzed as well as we would have liked, because Sonar could not deal with the object-oriented approach spread over several files. However, apart from the absence of unit tests, we scored quite well according to the SIG plugin for Sonar. Also, the lack of any tools to test the MOBL code was quite disappointing.

Study report

Simlike platform

May 16, 2011, Delft

Commissioned by Nerval Limited, United Kingdom, in cooperation with the Delft University of Technology, Netherlands.

Bachelor students: Joris Albeda (1514172) Jeroen Dijkhuizen (1521950) Joey Ezechiëls (1338994) Volker Lanting (1513273)

Preface

Nerval Limited, an internet start-up, is building a new social media platform. As a bachelor project at the TU Delft the authors of this document will be developing a first version of this platform, called Simlike.

Before the development can get started, a preliminary study is performed. This report contains the results of that study. A small study has been performed before, see [001]. The former study resulted in an enumeration of existing tools for potential use during the development. In this report, the tools and techniques that were found are compared to each other based on the requirements of Simlike, which can be found in [REQ]. This report is intended to complement the previous preliminary study [001].


Table of contents

Summary
Introduction
Constraints
Hardware platform
Web platform
Database platform
Algorithm platform
Conclusion
References


Summary

This document is a study on how to lay a solid architectural foundation for Simlike. The foundation consists of:
● the hardware platform used to run software on,
● a web platform (e.g. web servers),
● a database platform to store data and
● an algorithm platform to process the data.

For the hardware platform a Virtual Private Server (VPS) solution will be used. This will support the web platform and database platform. The VPS solution will be outsourced for economic reasons. The Elastic Compute Cloud (EC2) service from Amazon Web Services (AWS) is the best choice for hosting the VPSs.

The web platform will run on top of AWS Elastic Beanstalk, a scalable and easy to maintain solution. As a result of this choice, Java will be used as server-side web development language.

Apache Cassandra will be used as the data store for the database platform. It was chosen for its performance and its completely distributed nature: it has no single point of failure.

DECLARED CLASSIFIED BY NERVAL LIMITED


Introduction

This report describes the study of available techniques and tools for the development of a new social media platform called Simlike. The study can be used to review and justify the decisions that are made during the development process.

Simlike will be visible to the public as three services: a mobile application, a Facebook application and a website. These services can communicate using an Application Programming Interface (API). Some services will have to be performed in the background due to their computationally intensive nature. Together these components constitute Simlike; they are visualised in Figure 1.

This document is a study on how to lay a solid architectural foundation for Simlike. The foundation consists of:
● the hardware platform used to run software on,
● a web platform (e.g. web servers),
● a database platform to store data and
● an algorithm platform to process the data.

Figure 1. Overview of Simlike and its architectural foundation.


The remainder of this document is organised as follows: first, evaluation criteria are discussed in Section Constraints. The tools and techniques will be compared based on the constraints that arise from the requirements for Simlike.

Next, the components of the foundation of Simlike are discussed and the relevant tools and techniques compared. In Section Hardware platform the required hardware is discussed. In Section Web platform the options for hosting the web applications are compared. In Section Database platform several alternatives for scalable data storage are discussed. The techniques that will be used for the required algorithms are discussed in Section Algorithm platform.

In section Conclusion the decisions made in this study are summarized and discussed.


Constraints

To compare the tools and techniques that can be used for the development of Simlike, some constraints have to be identified to compare them by. The most demanding constraints come from the following requirements, which are all “Must Haves” [REQ]:

● O2 Security
● O4 Activity Monitoring
● O6 Responsive
● O7 Reliable
● O8 Scalable
● O10 Expandable

Let us examine these requirements one by one to understand their implications. O2 Security demands that the data is stored and handled securely, both physically and virtually. This means the hardware must meet certain security standards which should prevent unauthorised personnel from accessing the hardware. At the same time, the virtual component of security demands secure data storage and communication. This implies that there must be some form of regulation and shielding against unauthorised access from the network (the internet).

O4 Activity Monitoring of which users are offline and online demands real-time processing of information. Users who are offline must not be perceived as online. This also corresponds with requirement O6 Responsive, which asks for a real-time experience.

O7 Reliable compels the architecture to be robust and tolerant to failures. Preferably there should be no Single Point Of Failure (SPOF). It also demands a high degree of availability of the platform. If the platform were to be inaccessible, this would be deemed unacceptable.

Though it will start on a smaller scale, the architecture must be designed to be able to handle millions of users. O8 Scalable requests that the designed architecture can handle this kind of growth, while keeping the costs proportional to the number of users and without having to redesign the architecture in the process.

Finally, according to O10 Expandable, the architecture must make it possible to add more features in the future.

There are a few other considerations which are evaluated as well. Even though they have not been explicitly expressed by the client, they constitute a best practice. It is desirable not to be bound to one particular supplier or vendor; such a dependency is called a vendor lock-in. By preventing a vendor lock-in, suppliers and vendors have less power over the client. Potentially this could give the client significant leverage in negotiations with vendors and suppliers. It also allows for more flexible purchasing strategies.


Another consideration comes from an economic perspective. Sometimes systems must be placed together. When large amounts of data are processed by multiple systems, data traffic costs can be eliminated by placing these systems in the same data centre. However, the requirement O7 Reliable dictates that not all systems are in the same data centre, as this would constitute a SPOF (e.g. in case of fire, flood etc.).


Hardware platform

The Hardware platform consists of the hardware required to host the Web and Database platforms. The research done in [001] suggested using Google App Engine (GAE), Amazon Web Services (AWS) or dedicated servers as part of a (virtual) hardware platform. At this stage, GAE can already be ruled out. GAE violates constraint O10 Expandable, because it currently allows only Java and Python as programming languages. Moreover, it offers Google Datastore as the only feasible data storage platform. It might be possible to set up a custom data storage platform in a separate environment, but communicating with it from GAE introduces higher latencies and extra data traffic costs, since it would have to be located outside Google’s data centres. Opting for GAE would also mean accepting a vendor lock-in: it would be difficult to switch later on. Theoretically it could be possible to move away from GAE, but the alternatives are not mature enough to support a production environment [002]. Moving away from GAE effectively sacrifices all the scalability offered by GAE, making it impractical (this violates O8 Scalable as well). In other words, using GAE would leave very little room for change later on.

The choice between AWS and a dedicated server remains, which is actually a choice between a Virtual Private Server (VPS) and a dedicated server. Let us describe the differences between these two in more detail (other vendors also offer VPS solutions, which will be examined in a moment).

The first major difference is the effort required to add, replace or move a server. This can be related to O7 Reliable. For the system to be reliable, a failing server must be replaced as soon as possible. The same is true when the platform is under heavy load. When that happens, more servers must be added to ensure the system remains responsive (O8 Scalable). Another way for VPSs to cope with heavy load is to load balance VPSs over the available hardware. This is beneficial to O6 Responsive. In case of hardware failure, backup servers must kick in to keep the platform accessible. These actions are relatively easier, faster and cheaper for a VPS than for a dedicated server [011].

Another difference is how resources are spent. Dedicated servers have less overhead; the software is directly running on the hardware. A VPS uses an abstraction layer between the OS and the hardware, which has an overhead of a few per cent. Typically (but not necessarily) a VPS shares a host with other VPSs, introducing overhead due to context switching (this is detrimental for O6 Responsive), although a minimum amount of resources is always guaranteed. On the other hand, it is common for a server not to be fully utilising all the available resources. This makes it possible for VPSs to achieve a higher system density per hardware, despite the added overhead. If the servers are not constantly using all the resources, choosing a VPS is cheaper and more scalable, since less hardware is needed for more servers (O8 Scalable).


Criteria         Virtual Private Server (VPS)   Dedicated Server
O6 Responsive    +/-                            +
O7 Reliable      +                              +/-
O8 Scalable      ++                             +/-
Costs            +                              +/-
Result           +                              +/-

Table 1. Grading of hardware platforms with regards to requirements

As a hardware platform VPSs are more attractive than dedicated servers. Note, however, that dedicated servers are also a viable option. At some point, when the application has grown to a significant user base and there always is a minimum load on the system, it might become interesting (from an economic perspective) to utilise a mix of VPSs and dedicated servers. In the beginning this would not be economically attractive.

For now the choice is to stick with VPSs. This presents two new choices: to develop a VPS platform in-house or to outsource the VPS platform to a third party. Consider this decision in the context of the client: an internet start-up business with limited resources. Furthermore, the solution should be up and running within 3 months. The main advantage of outsourcing the VPSs for the client is that no personnel are required for the maintenance of the hardware, saving costs on training and personnel. Secondly, no investment for expensive hardware would be required. Outsourcing VPSs puts some pressure on O2 Security: who will be able to physically access the servers? This has to be carefully considered when choosing a provider, as well as when designing the software. Another important aspect is the vendor lock-in. It is desirable to be able to move the VPSs to another provider. The in-house VPS platform could be designed to prevent a vendor lock-in. When outsourcing, it depends on the provider if this is possible. Other providers already have massive infrastructure able to withstand cyber-attacks [006]. This kind of infrastructure will be absent in the event of developing an in-house platform, making this option less reliable.


Criteria          In-house VPS platform   Outsource VPS platform
O2 Security       +                       +/-
O7 Reliable       -                       +
Cost              --                      ++
Vendor lock-in    +                       +/-
Result            +/-                     +

Table 2. Developing an in-house VPS platform versus outsourcing a VPS platform.

The final decision is to outsource the VPS platform. The cost aspect of developing an in-house VPS platform makes this option undesirable. Two aspects remain important when choosing a vendor: O2 Security and vendor lock-in. The client suggested AWS since they have a positive experience with this platform. For the sake of an informed decision, a similar vendor will be compared to AWS, namely Rackspace Cloud (RC). RC is examined because AWS and RC are the current market leaders for scalable VPS hosting.

The first difference is found in the vendor lock-in aspect. AWS does not offer VPS export functionality [003]. But AWS does offer enhanced export functionality for data that sits on AWS [004]. One can send them a physical storage device on which they will store the data and send it back. The system will potentially store terabytes of data, making this an attractive way of exporting data when necessary. Rackspace does not offer any export functionality at all [005]. In the case of RC, a switch of vendor would require a reinstall of all servers and a significant data traffic bill.

Looking at O7 Reliable, there are quite a few notable details. Both AWS and RC offer a load balancing service. It is unclear if RC’s service offers load balancing over multiple data centres and at what price. AWS does offer load balancing over multiple data centres out-of-the-box, but they must be located in the same region [008]. This limitation can be overcome by using a DNS service which is aware of the user’s geographical location (and sends a visitor to the nearest data centre). AWS also offers the choice in which regions you will place your servers. Currently it offers hosting in five regions: US East (Northern Virginia), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo). Within each region there are multiple availability zones, which are isolated from each other. The Amazon EC2 FAQ says the following about availability zones [007]:

“Q: How isolated are Availability Zones from one another?


Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.”

RC has data centres divided over five regions: US (Texas), US (Virginia), US (Illinois), EU (United Kingdom) and Asia (Hong Kong) [009]. The Asia region does not have a redundant data centre available. AWS regions all have redundant data centres, meaning that it is possible to have extra protection against potential downtime in a region (when correctly configured). AWS also has better dispersion of its data centres in the US which is good for general responsiveness. By using a geographical location aware DNS service, the latency can be brought down significantly for visitors.

AWS’s load balancing works well together with a feature called Auto Scaling [010]. Amazon can monitor the load of a VPS. Once the load exceeds a certain threshold for a specified amount of time, it automatically starts extra VPSs and balances the load. This also works the other way around: when the load decreases, the extra VPSs are shut down again. This is on a pay-what-you-use policy: you only pay for the actual resources consumed. This is the most interesting property of AWS. It allows for very high spikes in the load, at a fraction of the cost required otherwise. RC does not offer Auto Scaling.
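The threshold behaviour described above can be made concrete with a small simulation. This is a sketch of the general idea only, not the AWS Auto Scaling API: the class ScalingPolicy, its parameters and the use of consecutive load samples to model "a specified amount of time" are assumptions made for illustration.

```java
/** Hypothetical scaling policy; it only mimics the behaviour described in the text. */
final class ScalingPolicy {
    private final double upperThreshold; // add a server above this load
    private final double lowerThreshold; // remove a server below this load
    private final int requiredSamples;   // consecutive samples needed before acting
    private int highCount = 0;
    private int lowCount = 0;
    private int servers;

    ScalingPolicy(double upperThreshold, double lowerThreshold,
                  int requiredSamples, int initialServers) {
        this.upperThreshold = upperThreshold;
        this.lowerThreshold = lowerThreshold;
        this.requiredSamples = requiredSamples;
        this.servers = initialServers;
    }

    /** Feeds one load measurement and returns the resulting number of servers. */
    int observe(double load) {
        // A sample that breaks a streak resets the corresponding counter.
        highCount = load > upperThreshold ? highCount + 1 : 0;
        lowCount = load < lowerThreshold ? lowCount + 1 : 0;
        if (highCount >= requiredSamples) {
            servers++;
            highCount = 0;
        } else if (lowCount >= requiredSamples && servers > 1) {
            servers--; // never scale below one server
            lowCount = 0;
        }
        return servers;
    }
}
```

For example, new ScalingPolicy(0.8, 0.3, 3, 1) adds a second server after three consecutive samples above 80% load and removes it again after three consecutive samples below 30%.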

Another interesting economical aspect is that data transfer within the same data centre is free for both providers. However, no information for RC can be found about how VPSs are placed in different data centres. With AWS, you can choose the locations yourself, or let them automatically be determined. AWS offers a discounted data traffic price for systems within the same region. Normal data traffic costs are $ 0.08 in / $ 0.18 out per GB for RC and $ 0.10 in / $ 0.15 out per GB for AWS. It is expected that the outgoing traffic will outweigh the incoming traffic, making AWS more attractive cost-wise. Outgoing data traffic costs become even more attractive for AWS as the traffic grows. The more outgoing traffic, the less cost per GB (lowest possible cost is $ 0.08 per GB).
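To see how the normal traffic prices quoted above compare, the following sketch computes the monthly bill for a given traffic volume. The per-GB prices come from the text; the class and method names and the example volumes are hypothetical, and the volume discounts and intra-region discounts mentioned above are deliberately ignored.

```java
/** Hypothetical helper comparing flat data traffic prices (no volume discounts). */
final class TrafficCost {
    // Prices in US cents per GB, as quoted in the text.
    static final long RC_IN = 8;
    static final long RC_OUT = 18;
    static final long AWS_IN = 10;
    static final long AWS_OUT = 15;

    /** Returns the cost in cents; integer cents avoid floating point rounding. */
    static long costInCents(long gbIn, long gbOut, long centsPerGbIn, long centsPerGbOut) {
        return gbIn * centsPerGbIn + gbOut * centsPerGbOut;
    }
}
```

For an assumed 100 GB of incoming and 1000 GB of outgoing traffic, costInCents gives 18800 cents ($188) for RC and 16000 cents ($160) for AWS. At these flat prices the two providers break even when outgoing traffic equals two thirds of incoming traffic (8i + 18o = 10i + 15o gives 3o = 2i), so AWS is cheaper whenever outgoing traffic exceeds that share.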

Rackspace has taken measures to ensure the security of its hosting services [009]. On the physical level, they require anyone going into the data centres to authenticate themselves using key cards and biometrics. People who successfully authenticate themselves are allowed escorted access to the data centre. In addition, there is both internal and external 24/7 surveillance in the data centres. Furthermore, every RC data centre employee is background-checked thoroughly and multiple times before being hired.

Amazon employs a number of the same physical security measures as RC: extensive employee background checks, video surveillance and intrusion detection systems. The company also requires authorized personnel to pass 2-factor authentication a minimum of 2 times in order to get to the data centre floors. In addition, all non-personnel wishing to access one of the facilities must present both a valid business need and identification, and are only allowed escorted access to the facilities. The data centre locations are generally kept on a need-to-know basis.

AWS offers a whole set of services besides the VPS service called Amazon Elastic Compute Cloud (Amazon EC2). These services are designed to integrate nicely with Amazon EC2 and are all on a pay-for-what-you-use basis. They allow for a high degree of modular expandability. The services that would be of interest to the design of the system are:
● Amazon Elastic MapReduce. Amazon offers out-of-the-box Hadoop MapReduce functionality. It is possible to run a MapReduce job on a cluster of servers, with control over how many servers run the job. There is no need to configure the infrastructure, as Amazon takes care of this. This service is potentially of interest for the Algorithm platform.
● Amazon CloudFront. This is a content delivery service, which allows for low latency and high speed data transfer. This could be of use for the Web platform.
● Amazon SimpleDB. A key-value data store which requires no database administration (this part is done automatically by Amazon). It scales very well for large applications, making it a candidate for the Database platform.
● AWS Elastic Beanstalk. A distribution system for web applications, which can handle capacity provisioning, load balancing, auto-scaling, and application health monitoring automatically [016]. This may be of interest for the Web platform.
● Amazon Route 53. A scalable DNS service, capable of load balancing requests to the data centres with the best network conditions for the incoming request.
● Amazon Simple Storage Service and Amazon Elastic Block Store. Both are scalable persistent storage platforms, possibly of use to the Database and Web platforms.

Rackspace also offers an extra service called Cloud Files. This service offers scalable persistent storage and worldwide content delivery. Cloud Files also works on a pay-for-what-you-use basis. This is the only extra service offered by Rackspace.


Criteria          Amazon Web Services (AWS)   Rackspace Cloud (RC)
O2 Security       ++                          +
O6 Responsive     +                           +/-
O7 Reliable       +                           -
O8 Scalable       ++                          +/-
O10 Expandable    +                           +/-
Costs             ++                          -
Vendor lock-in    +/-                         -
Result            +                           +/-

Table 3. Comparing two candidates for outsourcing the VPS platform.

AWS scores better than RC on every aspect; hence AWS will be the Hardware platform on which the Web and Database platforms will be hosted.


Web platform

The Web platform is the data interface to the outside world. The Website and Facebook Application will be hosted on this platform. The Web platform utilises Amazon Web Services (AWS) as the underlying hardware. There are two options in the AWS toolbox that may fit the determined criteria: Elastic Compute Cloud (EC2) combined with Auto Scaling and Load Balancing, or AWS Elastic Beanstalk (AEB).

EC2 is an architecture which runs VPSs. It is designed to be as easy to operate as possible. VPSs can be created, rebooted and stopped in an instant. Furthermore, you only pay for the resources you use. The client already has positive experience with this platform. Auto Scaling is a service for EC2 which automatically adds more VPSs as the load increases. When the load is low, VPSs may be automatically shut down to save resources. Another service, called Load Balancing, distributes the load equally over the running VPSs.

AEB is specifically designed to host Java web applications. It automatically handles the amount of resources required to serve all requests to the application, as well as load balancing. AEB consists of the same products as the previous option: EC2, Auto Scaling and Load Balancing. The difference is that AEB can manage these services automatically. At the same time, it allows you to take back the control over these services as you please (even access to the underlying VPS is granted).

AEB and EC2 utilise the same AWS services, which means that they should perform equally well with regards to the O6 Responsive, O7 Reliable, and O8 Scalable criteria. AEB is still officially in beta, casting some doubt about its reliability for production use. Both options support encrypted communication (SSL).

AEB has one big advantage over EC2: it is much easier to use. It offers Eclipse integration by means of the AWS Toolkit plugin. This Eclipse plugin allows for easy application version updates and rollbacks on live environments, application health monitoring, easy application portability and easy application management. Another aspect of AEB is the imposed usage of Java and Apache Tomcat as the hosting platform. This can be seen as a disadvantage because of the resulting vendor lock-in. It also restricts the possibilities for O10 Expandable.

When opting for EC2, the VPS has to be configured manually. This gives the most flexibility and has less dependency on third parties, but it has higher costs in the form of configuring and maintaining the VPS and all software on it. AEB initially has a higher cost than EC2, because it requires running a load balancer while this might not be necessary from the start. However, Amazon offers a free usage tier for load balancing, which smooths out this difference in the first year [012]. Another consideration is the hosting of the Database platform. Initially this could be done on the same VPS, but not when opting for AEB, since it is not aware of any database system [013]. This means extra costs for the Database platform. However, once the platform grows large, the Database and Web platforms will live on separate hardware anyhow.

Criteria          AWS Elastic Beanstalk (AEB)   Elastic Compute Cloud (EC2)
O2 Security       +/-                           +/-
O6 Responsive     +/-                           +/-
O7 Reliable       -                             +/-
O8 Scalable       +/-                           +/-
O10 Expandable    -                             +/-
Costs             -                             -
Vendor lock-in    -                             +/-
Ease of use       ++                            +/-
Result            +/-                           +/-

Table 4. Comparing AWS Elastic Beanstalk to Elastic Compute Cloud.

AEB and EC2 are evenly matched. The decision boils down to a trade-off between costs, maintainability and vendor lock-in. The time that could be saved by choosing AEB could be spent on implementing extra features. In return, the Web Application and Facebook Application will be bound to Apache Tomcat and Java. Java is one of the potential programming languages selected in a former study, so this is not a major setback [001]. Choosing AEB could be more expensive in the beginning, but the client is willing to spend more money in return for more features at an earlier stage. At this point AEB is the most attractive solution; a final decision will be made later on.


Database platform

The Database platform consists of the technology used to store and look up information in a way that complies with the earlier mentioned constraints. The database will potentially store terabytes of data and searching through this data is the core business of Simlike. It is vital that this can happen in a fast, scalable and reliable manner.

Several questions arise when choosing such a storage platform. What file system can be used? How is data consistency achieved across different machines? In which location(s) is the data stored? How about redundancy of data? How is low latency achieved? How tolerant is the solution to failures? Does it scale well? How easy is it to maintain the platform?

The first part of the solution can be found in distributed database management systems (DDBMS). These DDBMS are designed to scale well over multiple machines. Another advantage is that every machine can access all the data through a single interface. This simplifies the maintenance of the systems and also simplifies the interface to the underlying complex technology.

In a small study performed in the beginning of the development, these candidates were selected for a storage platform: Amazon SimpleDB, Apache Cassandra, Apache HBase and Google Datastore [001]. These are all non-relational databases. This choice was made because of the huge amounts of data that have to be stored eventually. Relational databases do not scale well at these proportions.

Google Datastore can already be ruled out. This technology only works together with Google App Engine which was already ruled out in the Hardware platform. Amazon SimpleDB is an interesting option because it requires no database administration at all. Apache Cassandra and Apache HBase are very interesting because they are specifically designed for huge amounts of data. They are both a part of a larger collection of open-source software for reliable, scalable, distributed computing. Hence they could form a solid foundation for the Algorithms platform as well.

A graph database is another storage technology that came up during in-depth research. This kind of database is particularly suited to store data structures like a “social graph”. It provides fast access times for graph traversal. A possible candidate is Neo4j, an open source graph database written in Java.

Neo4j offers a graph database under the GPL license. This means it can be used by a web application without any licensing problems. When the user base expands and data and traffic increase, the Neo4j database can be scaled using High Availability. This, however, is licensed under AGPL, forcing the product to become open source. For 2000 USD a month, a commercial license with 24/7 support by phone can be bought, avoiding the ‘forced open source’ problem.


The four alternatives (SimpleDB, Cassandra, HBase and Neo4j) will now be discussed, based on the constraints for our application.
● Security - Amazon takes care of the security when SimpleDB is used, and for HBase a ‘secure version’ is available with support for security. Both Cassandra and Neo4j have no documentation on their security, so the user will have to set it up on his own. This is obviously a big pro for SimpleDB and HBase.
● Activity monitoring - All alternatives have a way of storing information about a user, so activity monitoring is possible on all platforms, but no special support is available.
● Responsive - Both SimpleDB and Cassandra index the data, which makes the data easier to find and these databases therefore more responsive. Neo4j is a graph database, meaning that it is implemented to make efficient (quick) graph traversal possible. Although HBase has its own file system, it does not index the data, making it possibly less responsive when large amounts of data are present.
● Reliable - All alternatives have the possibility to store data in several places, making reliability possible.
● Scalable - SimpleDB has a limit on how big it can get. Neo4j needs a commercial license to be scalable and HBase has a name node, which can become a problem for scalability. Overall Cassandra seems to be the best choice for scalability.
● Expandable - Both Cassandra and HBase have a lot of other projects that can be used to expand them (e.g. MapReduce, Apache Mahout). Neo4j and SimpleDB do not have this support and, therefore, are not as easy to expand.
● Costs - Neo4j and SimpleDB are not free. SimpleDB will cost an extra server and Neo4j needs a commercial license for our application. HBase and Cassandra are free, and there are easy-install images that reduce the costs of installation and configuration of Cassandra instances on the Amazon Web Services. Therefore Cassandra is the best choice as far as costs are concerned.
● Vendor lock-in - To make Neo4j scalable a commercial license from Neotechnology is required, creating vendor lock-in. For Amazon SimpleDB, the Amazon Web Services need to be used. Both Cassandra and HBase are open source and can be installed on other servers as well.


Criteria                 Amazon SimpleDB   Apache Cassandra   Apache HBase   Neo4j
O2 Security              +                 -                  +/-            -
O4 Activity Monitoring   +/-               +/-                +/-            +/-
O6 Responsive            +                 +                  +/-            +
O7 Reliable              +                 +                  +              +
O8 Scalable              +/-               ++                 +              +/-
O10 Expandable           +/-               +                  +              +/-
Costs                    +/-               ++                 +              +/-
Vendor lock-in           -                 +                  +              -
Result                   +/-               +                  +              +/-

Table 5. Comparison of database platforms.
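
The report does not state how the "Result" row of Table 5 was derived. As a rough cross-check, the ratings can be turned into numbers and summed per platform. The sketch below is only an illustration: the class name PlatformScore, the numeric mapping (++ = 2, + = 1, +/- = 0, - = -1) and the equal weighting of all criteria are assumptions, not part of the original study.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the mapping of symbols to numbers and the equal
// weighting of criteria are assumptions, not taken from the report.
class PlatformScore {

    // Convert a rating symbol from Table 5 into a number (assumed mapping).
    static int value(String symbol) {
        switch (symbol) {
            case "++":  return 2;
            case "+":   return 1;
            case "+/-": return 0;
            case "-":   return -1;
            default: throw new IllegalArgumentException("Unknown rating: " + symbol);
        }
    }

    // Sum the numeric values of all ratings of one platform.
    static int total(List<String> ratings) {
        int sum = 0;
        for (String rating : ratings) {
            sum += value(rating);
        }
        return sum;
    }

    // Ratings per platform, in the row order of Table 5:
    // Security, Activity Monitoring, Responsive, Reliable,
    // Scalable, Expandable, Costs, Vendor lock-in.
    static Map<String, List<String>> table5() {
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("Amazon SimpleDB",
                Arrays.asList("+", "+/-", "+", "+", "+/-", "+/-", "+/-", "-"));
        table.put("Apache Cassandra",
                Arrays.asList("-", "+/-", "+", "+", "++", "+", "++", "+"));
        table.put("Apache HBase",
                Arrays.asList("+/-", "+/-", "+/-", "+", "+", "+", "+", "+"));
        table.put("Neo4j",
                Arrays.asList("-", "+/-", "+", "+", "+/-", "+/-", "+/-", "-"));
        return table;
    }

    public static void main(String[] args) {
        // Print the unweighted total for every platform.
        for (Map.Entry<String, List<String>> entry : table5().entrySet()) {
            System.out.println(entry.getKey() + ": " + total(entry.getValue()));
        }
    }
}
```

Under these assumptions Cassandra scores highest (7), followed by HBase (5), SimpleDB (2) and Neo4j (0), which agrees with the ordering suggested by the "Result" row. Different weights, for example a heavier weight on security, could change the gaps between the platforms.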

Of the three 'standard' NoSQL databases (SimpleDB, Cassandra and HBase), SimpleDB is less scalable than the other two and does not have real-time MapReduce support. Cassandra indexes data, which makes it easier to find data in a large data set. Therefore Cassandra is the most suitable 'standard' NoSQL database for our situation.

The advantage of a graph-based NoSQL database like Neo4j over a database like Cassandra is its fast graph traversal. However, it is harder to set up on Amazon Web Services, and it is expensive to expand and to make reliable. Also, the algorithm that was developed so far depends on parallel execution to reach real-time speeds when the data set is large. Neo4j does not offer any parallel programming support like MapReduce.

The prime candidate is Cassandra. Cassandra supports Hadoop MapReduce, which makes it interesting to combine the two technologies. Apache HBase carries the risk of a single point of failure (SPOF): the Hadoop name node. If this node fails, the entire Hadoop system becomes unavailable. Choosing Apache Cassandra instead of Apache HBase avoids this problem. Furthermore, Cassandra has much better real-time performance than HBase, which is another point in its favour.
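
The single-point-of-failure argument can be illustrated with a small sketch. It models a ring of equal nodes in which every key is stored on several neighbouring nodes, so that losing one node does not make the data unreachable. This is a deliberately simplified model and not Cassandra's actual implementation: the class ReplicaRing, the token range 0 to 99, the hash function and the node names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Simplified model of a peer-to-peer ring with replication.
// Hypothetical names and token range; not Cassandra's real code.
class ReplicaRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>(); // token -> node name
    private final int replicationFactor;

    ReplicaRing(int replicationFactor) {
        this.replicationFactor = replicationFactor;
    }

    // Place a node on the ring at the given token (0..99 in this sketch).
    void addNode(String name, int token) {
        ring.put(token, name);
    }

    // Walk clockwise from the key's token and collect the first
    // 'replicationFactor' distinct nodes; these hold the replicas.
    List<String> replicasFor(String key) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        int token = Math.floorMod(key.hashCode(), 100);
        List<String> clockwise = new ArrayList<>(ring.tailMap(token, true).values());
        clockwise.addAll(ring.headMap(token, false).values());
        for (String node : clockwise) {
            if (replicas.size() == replicationFactor) {
                break;
            }
            if (!replicas.contains(node)) {
                replicas.add(node);
            }
        }
        return replicas;
    }

    // A key stays readable as long as at least one of its replicas is alive.
    boolean isReadable(String key, Set<String> failedNodes) {
        for (String node : replicasFor(key)) {
            if (!failedNodes.contains(node)) {
                return true;
            }
        }
        return false;
    }
}
```

With four nodes and a replication factor of three, any single node can fail and every key remains readable, because no node plays a special coordinating role. In a design with one central name node, the failure of that one node is enough to make all data unreachable.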


Algorithm platform

DECLARED CLASSIFIED BY NERVAL LIMITED




Conclusion

For the hardware platform, a VPS solution will be used for both the web platform and the database platform. The VPS solution will be outsourced for economic reasons. The Elastic Compute Cloud (EC2) service from Amazon Web Services (AWS) is the best choice for hosting the VPSs.

The web platform will run on top of AWS Elastic Beanstalk, a scalable and easy-to-maintain solution. As a result of this choice, Java will be used as the server-side web development language.

Apache Cassandra will be used as the data store for the database platform. It was chosen for its performance and its completely distributed nature: it has no single point of failure.

DECLARED CLASSIFIED BY NERVAL LIMITED


References

[PVA] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Plan van Aanpak”, April 26, 2011.
[REQ] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Requirements analysis”, April 25, 2011.
[001] Albeda, Dijkhuizen, Ezechiëls, Lanting, “Oriëntatieverslag”, April 26, 2011.
[002] Nick Johnson, “Announcing BDBDatastore, a replacement datastore for App Engine”, http://blog.notdot.net/2009/04/Announcing-BDBDatastore-a-replacement-datastore-for-App-Engine, accessed May 3, 2011.
[003] Cloudave, “AWS Planning To Add VM Export?”, http://www.cloudave.com/8986/aws-planning-to-add-vm-export/, accessed May 4, 2011.
[004] Amazon, “AWS Import/Export”, http://aws.amazon.com/importexport/, accessed May 4, 2011.
[005] Rackspace, “Cloud Servers and Cloud Computing FAQ from Rackspace Cloud Hosting”, http://www.rackspace.com/cloud/cloud_hosting_products/servers/faq/, accessed May 4, 2011.
[006] Lori M. Kaufman, “Data Security in the World of Cloud Computing”, IEEE Security and Privacy, vol. 7, no. 4, pp. 61-64, July/Aug. 2009, doi:10.1109/MSP.2009.87.
[007] Amazon Web Services, “Amazon EC2 FAQs”, http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availability_Zones_from_one_another, accessed May 4, 2011.
[008] Amazon Web Services, “Elastic Load Balancing”, http://aws.amazon.com/elasticloadbalancing/, accessed May 4, 2011.
[009] Rackspace, “Rackspace Hosting has Worldwide Data Center Network to serve you”, http://www.rackspace.com/whyrackspace/network/datacenters.php, accessed May 4, 2011.
[010] Amazon Web Services, “Auto Scaling”, http://aws.amazon.com/autoscaling/, accessed May 4, 2011.
[011] Nick Reese, “VPS vs Dedicated Servers: Making an Informed Decision”, Art of Blog, http://www.artofblog.com/vps-vs-dedicated-servers/, accessed May 9, 2011.
[012] Amazon, “AWS Free Usage Tier”, http://aws.amazon.com/free/, accessed May 9, 2011.
[013] Athir Nuaimi, “What Amazon's Elastic Beanstalk Can & Can't Do”, http://www.rndguy.ca/2011/01/19/what-amazons-elastic-beanstalk-can-cant-do/, accessed May 9, 2011.
[014] Hadoop, “Welcome to Hive!”, http://hive.apache.org/, accessed May 10, 2011.
[015] DataStax, “Installing the Brisk AMI on Amazon EC2”, http://www.datastax.com/docs/0.8/brisk/install_brisk_ami, accessed May 10, 2011.
[016] Amazon, “AWS Elastic Beanstalk”, http://aws.amazon.com/elasticbeanstalk/, accessed May 9, 2011.
