
A Collaborative Internet Archive for Personal and Social Use

Ensuring File Availability and User Friendliness Through a Peer-to-peer Internet Archiving System

Tonje Røyeng

Thesis submitted for the degree of Master in Programming and System Architecture, 60 credits

Department of Informatics
Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO

Spring 2020

A Collaborative Internet Archive for Personal and Social Use

Ensuring File Availability and User Friendliness Through a Peer-to-peer Internet Archiving System

Tonje Røyeng

© 2020 Tonje Røyeng

A Collaborative Internet Archive for Personal and Social Use http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

Abstract

The nature of the Internet is ephemeral. Hyperlinks break, technologies are rendered obsolete, information is changed, deleted or lost, and we are left unable to access the same information we could only a little while ago. One solution to this is to archive permanent copies of websites, whether for their cultural and historical importance or for personal use. This archiving effort can ensure that the content does not change or disappear over time, and adds a persistence of data that the Internet itself lacks. The goal of this project is to implement a prototype for a peer-to-peer system for personal archiving of Internet content that is both reliable and user friendly. This thesis explores the theoretical aspects of peer-to-peer and Internet archiving systems, and gives an overview of the systems that already exist. This examination acts as the foundation of our system design, where we seek to combine the good qualities of existing Internet archiving systems with the robustness of a peer-to-peer system. The result is a peer-to-peer Internet archiving system that uses a distributed hash table to structure the network, and an application that allows the user to interact with the system through a graphical user interface. The experimental results show that the system performs as expected. It is possible to archive and view sites through a graphical user interface, and to fetch the archived files through their file ID. The files are duplicated across a network of peers, and they are republished every hour to maintain the number of copies. The system is also acceptably performant and user friendly, judged against research on attention spans and established design principles. The result is a system that combines the qualities of peer-to-peer and Internet archiving systems to create a collaborative Internet archiving system for personal and social use.

Acknowledgements

First of all, I would like to thank my supervisors, Prof. Eric Jul and Oleks Shturmov, for their guidance throughout this project, their constructive criticism, and for letting me shape my own project. Their experience and help along the way have been invaluable. Secondly, I want to thank my friends, both the ones I have met during my studies, and anyone else who has provided me with much-needed distractions and support along the way. A special thanks to everyone who was willing to participate in my user test. Finally, I want to acknowledge the support of my family, in particular my mum, for always listening to me and encouraging me, and my partner, who has been a shoulder to lean on, a helpful discussion partner, and a very patient household member in these last few months.

Contents

I Preliminaries 1

1 Introduction 2
   1.1 Research Question ...... 2
   1.2 Goal ...... 2
   1.3 Approach ...... 3
   1.4 Design and Implementation ...... 4
   1.5 Evaluation ...... 5
      1.5.1 Results ...... 5
   1.6 Conclusion ...... 7
   1.7 Work Done ...... 8
   1.8 Limitations ...... 8
   1.9 Outline ...... 9

2 Background 11
   2.1 Introduction to Background ...... 11
   2.2 Peer-to-peer Systems ...... 12
      2.2.1 Decentralised and Distributed ...... 12
      2.2.2 Characteristics ...... 13
      2.2.3 Distributed Hash Table (DHT) ...... 14
      2.2.4 Taxonomy ...... 15
      2.2.5 Summary of Peer-to-peer Systems ...... 31
   2.3 Internet Archiving ...... 31
      2.3.1 Saving the Internet ...... 31
      2.3.2 Issues in Internet Archiving ...... 32
      2.3.3 Taxonomy ...... 35
      2.3.4 Summary of Internet Archiving Systems ...... 46
   2.4 Sharing in a Cultural Context ...... 46
      2.4.1 Pirating ...... 46
      2.4.2 Social Media ...... 47
      2.4.3 Summary of Sharing in a Cultural Context ...... 47
   2.5 Summary of Background ...... 47

II Project 49

3 Analysis 50
   3.1 Problem ...... 50

      3.1.1 Core Issues ...... 50
      3.1.2 Summary of Problem ...... 53
   3.2 Solution ...... 53
      3.2.1 Required Functionality ...... 53
      3.2.2 Evaluation ...... 54
      3.2.3 Summary of Solution ...... 55
   3.3 Conclusion of Analysis ...... 56

4 Design 57
   4.1 Core functionality ...... 57
      4.1.1 Peer Communication ...... 57
      4.1.2 Archiving Sites ...... 58
      4.1.3 Fetching Files From DHT ...... 58
      4.1.4 Sharing ...... 58
   4.2 Structured Peer-to-peer System ...... 58
      4.2.1 Kademlia ...... 59
      4.2.2 Summary of Structure ...... 59
   4.3 Files ...... 60
      4.3.1 Local saving ...... 60
      4.3.2 File Duplication ...... 60
      4.3.3 File Location ...... 61
      4.3.4 Summary of Files ...... 61
   4.4 Graphical User Interface ...... 61
      4.4.1 Clarity ...... 62
      4.4.2 Consistency ...... 63
      4.4.3 Simplicity ...... 64
      4.4.4 Summary of Graphical User Interface ...... 64
   4.5 Trade-offs ...... 65
   4.6 Summary of Design ...... 65

5 Implementation 67
   5.1 Two-tier Architecture ...... 67
   5.2 Running the Application ...... 68
      5.2.1 Application ...... 68
      5.2.2 CLI ...... 69
   5.3 Back-end ...... 70
      5.3.1 File Structure ...... 70
      5.3.2 Peer Handling ...... 71
      5.3.3 File Handling ...... 73
      5.3.4 Summary of Back-end ...... 75
   5.4 Front-end ...... 75
      5.4.1 File Structure ...... 75
      5.4.2 Functionality ...... 76
      5.4.3 Visual Design ...... 79
      5.4.4 Summary of Front-end ...... 82
   5.5 Summary of Implementation ...... 83

6 Evaluation 84
   6.1 Implementation Tests ...... 84
      6.1.1 File Structure Overview ...... 84
      6.1.2 Functionality ...... 85
      6.1.3 Reliability ...... 86
      6.1.4 Performance ...... 90
      6.1.5 Summary of Implementation Tests ...... 93
   6.2 Graphical User Interface Evaluation ...... 94
      6.2.1 Survey ...... 94
      6.2.2 Analysis ...... 96
      6.2.3 Conclusion of Graphical User Interface Evaluation ...... 99
   6.3 Conclusion of Evaluation ...... 99

III Conclusion and future work 102

7 Conclusion 103
   7.1 Summary ...... 103
      7.1.1 Results ...... 104
   7.2 Limitations ...... 105
   7.3 Perspective ...... 105
   7.4 Future Work ...... 106

Appendices 113

A Output From Functionality Test 114

B Questionnaire 118

List of Tables

2.1 Comparison of peer-to-peer systems ...... 30
2.2 Comparison of Internet archiving systems ...... 45

6.1 Time it takes to archive file ...... 92
6.2 Time it takes to store file in DHT ...... 92
6.3 Time it takes to fetch file ...... 93

List of Figures

2.1 Difference between centralised and decentralised systems ...... 12
2.2 Launch screen of uTorrent Web ...... 24
2.3 Download screen of uTorrent Web ...... 24
2.4 -gtk GUI ...... 25
2.5 IPFS status menu ...... 26
2.6 IPFS desktop GUI ...... 27
2.7 GUI ...... 28
2.8 ArchiveBox file overview ...... 39
2.9 Webrecorder landing page ...... 40
2.10 Webrecorder collection page ...... 41
2.11 Webrecorder manage collection page ...... 42
2.12 Pocket landing page ...... 42
2.13 Pocket action menu for article ...... 43
2.14 Article saved to Pocket, viewed in web app ...... 43
2.15 Internet Archive landing page ...... 44
2.16 Wayback Machine landing page ...... 45

4.1 Example system state ...... 58
4.2 Example node state, using marked node from figure 4.1 ...... 59
4.3 First wireframe of system design ...... 62
4.4 Colours used ...... 64

5.1 Structure of back-end scripts ...... 72
5.2 Program flow when saving a new site to archive ...... 74
5.3 Front-end application home page ...... 77
5.4 Front-end application display page ...... 78
5.5 File menu ...... 78
5.6 Share pop-up ...... 79
5.7 Share pop-up, copied ID ...... 79
5.8 Delete pop-up ...... 80
5.9 Loading icon ...... 80
5.10 Site archived icon ...... 81
5.11 Invalid URL error message ...... 81
5.12 Site not found in network ...... 82
5.13 Closeup of delete pop-up ...... 82

6.1 Static reliability test results ...... 89
6.2 Dynamic reliability test results ...... 90

6.3 Example of screenshots in questionnaire ...... 95
6.4 Examples of share and delete icons used in other systems ...... 97

Part I

Preliminaries

Chapter 1

Introduction

This thesis seeks to address the lack of persistence of data on the Internet, like the many Internet archiving systems before it. It aims to do this through the use of peer-to-peer technology to add a layer of reliability that client-server solutions lack, while still being user friendly and intended for personal use. There are many good solutions for Internet archiving out there that are user friendly, personal and have a sharing function. However, all the systems we have examined that meet all these criteria are dependent on central servers to archive content. Our goal is to make it possible for users to archive websites with a high probability that the files will not be changed or deleted over time, as content on the Internet is prone to be. To ensure that the files stay available, and are not dependent on one server, we have chosen to create a peer-to-peer file archiving system. This way, the files will be replicated across many machines, which allows the users to access them even when a number of the machines are offline. The decentralised and distributed nature of the system is at the core of this project and is the main difference between our system and other Internet archiving systems.

1.1 Research Question

How can we create a collaborative peer-to-peer Internet archiving system for personal use that is reliable, user friendly and performant?

1.2 Goal

The goal of this thesis is to design and implement a prototype for a peer-to-peer Internet archiving system intended for both personal and social use. The background for this system comes from multiple different areas, but it is primarily rooted in the robust and collaborative character of peer-to-peer systems and the ephemeral nature of the Internet. The former dictates the structure of the system, and the latter embeds the system in the tradition of Internet archiving, an endeavour of cultural and individual importance. As the title of this thesis suggests, the system is intended for both personal and

social use. For this system, this means that the archival of Internet sites is personal, but that the user has the option to share their archived sites with other users of the system. This adds a social aspect to the user experience, which reminds the users that the system is collaborative. In terms of functionality, the prototype should allow the user to archive websites and fetch the archived files from the network of peers. The system should be both reliable and user friendly, meaning that files should be available at all times and that the system should be easy to use for anyone who has some experience using computers. Traditional client-server architecture suffers from relying on a single server to keep track of any files that are needed, which becomes a point of vulnerability. Peer-to-peer systems seek to combat this by utilising every participating node as a server, without relying on one single node for the system to survive. File availability is achieved through the duplication of files across several nodes, which makes sure that it is possible to fetch a given file from multiple locations. Adding an overlaying structure such as a distributed hash table can make it possible to efficiently locate a copy of a file. User friendliness will be addressed in two ways: through the user interface and the system’s performance. The user interface should be based on established design guidelines to ensure that it is easy to use for as many users as possible, and the system should be performant so as to keep the user’s attention for as long as they use the system.

1.3 Approach

This project is a work of software engineering, and will primarily be conducted through an analysis of existing systems, the development of a software prototype and an experimental evaluation of the prototype. As such, the work is structured much like traditional software engineering projects, which are split into four main categories of work: analysis, design, code and testing. However, the result of the project is not a finished piece of software that is ready to be deployed, but rather a prototype that can be seen as the first iteration of a more iterative development process [1, pp. 36–38]. For any software development process, it is important to understand the context of the system, because the resulting piece of software will not exist in a vacuum and may benefit from the years of knowledge already available in the field [1, p. 249]. Additionally, an understanding of the users and the context of the system may be an aid in the visual design process, as different users have different experiences and different views on what "user friendly" means [1, p. 406]. Therefore, this thesis starts by surveying which systems and solutions already exist by conducting two systematic comparisons, one of several peer-to-peer systems and one of Internet archiving systems. In both cases, a set of categories for comparison was chosen that are relevant to the project in one way or another. An analysis of these comparisons revealed some core aspects to focus on in our system, as we discovered what works well and what does not, both in

peer-to-peer and Internet archiving systems. This analysis, in turn, aided in the design of the system. Following the implementation of the system, experimental testing was conducted, in order both to ensure that the goals for the system were met, and to uncover any issues that would need to be addressed in a later version of the system. The automated tests primarily addressed three areas: system functionality, reliability and performance. Additionally, the user interface was examined through analysis and a small survey conducted with test users.

1.4 Design and Implementation

The system design is split into two main parts: the system structure and the graphical user interface. The network of peers is organised using the Kademlia [2, 3] distributed hash table. Using a distributed hash table allows the peers to communicate in a coordinated and efficient way, and Kademlia offers a fast and reliable algorithm for inter-peer communication that performs routing within O(log n) steps. The system is a file system where it is necessary to keep track of all files at any time, and where the user should always be able to access their files, which is made possible through the distributed hash table. Additionally, files will automatically be saved to the user’s device, so that they can retrieve the files fast, and without an Internet connection. The user can also share their files, by sharing the file ID with another user.

For the graphical user interface, we based our design on the findings from our analyses of existing systems. For the analyses, we chose four established design principles for user friendliness and examined each system with these in mind. These principles were clarity, consistency, simplicity and responsiveness, chosen because they cover some of the most important aspects of what makes a graphical user interface user friendly. The findings were used to guide our design, to make sure that the design was user friendly from the start.

The technical prototype was implemented in Node.js, using a two-tier architecture with a clear separation between the back-end and front-end. The back-end was implemented using the Express library for Node.js to create an API for the front-end to communicate with. The back-end handles everything that has to do with the peer network and the files, using libp2p, a JavaScript implementation of the Kademlia distributed hash table, for file distribution and routing. The front-end is a React app that displays the archived files in the browser and allows the user to archive new sites, as well as share and delete existing archived sites. In the interest of not spending too much time and effort on the files themselves, any site is simply archived as a screenshot.
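To make this setup concrete, the listing below is a minimal sketch of how such a peer could be created and used with the 2020-era js-libp2p API (v0.28.x). The module names, configuration options and the use of contentRouting for DHT records are assumptions based on that API rather than the actual prototype code, and later libp2p releases have changed these interfaces:

    // Sketch only: 2020-era js-libp2p (v0.28.x) with the Kademlia DHT enabled.
    // Module names and options are assumptions and differ in later releases.
    const Libp2p = require('libp2p')
    const TCP = require('libp2p-tcp')
    const { NOISE } = require('libp2p-noise')
    const MPLEX = require('libp2p-mplex')
    const KadDHT = require('libp2p-kad-dht')

    async function createPeer () {
      const node = await Libp2p.create({
        addresses: { listen: ['/ip4/0.0.0.0/tcp/0'] },
        modules: {
          transport: [TCP],
          connEncryption: [NOISE],
          streamMuxer: [MPLEX],
          dht: KadDHT
        },
        config: { dht: { enabled: true } }
      })
      await node.start()
      return node
    }

    // Storing and fetching an archived file's bytes under its file ID.
    // A real deployment also needs record validators for custom key formats.
    async function storeFile (node, fileId, bytes) {
      await node.contentRouting.put(Buffer.from(fileId), bytes)
    }

    async function fetchFile (node, fileId) {
      return node.contentRouting.get(Buffer.from(fileId))
    }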

1.5 Evaluation

The system was evaluated according to three evaluation metrics: reliability, user friendliness and performance, which all address some of the core issues we have outlined:

Reliability Reliability is very important because the user should be able to access their files at any time. The decision to make a decentralised system where files are stored on multiple machines was made in part because of the vulnerability of centralised systems, and, conversely, the reliability of decentralised ones. While it is impossible to guarantee file availability in any system, the duplication of files across multiple sources can ensure that the file is available even if several of the sources fail.

User friendliness Another aspect of the system that we consider important is the user friendliness. This involves both the visual appearance of the system and its performance, the latter of which will be discussed on its own. User friendliness is important to ensure that the users do not become frustrated or confused when using the system.

Performance Performance can be viewed as a subcategory of user friendliness because systems that take too long to respond can cause the user to get bored and therefore unwilling to use the system. Therefore, it is important to evaluate the system’s ability to respond to any user input by timing the main user actions.

The system was evaluated through automated tests and an examination of the user friendliness. The automated tests can be divided into three main categories: functionality, reliability and performance. The functionality test ensured that it is possible to perform the actions that have been described in the design of the system. Reliability was tested through two different tests that simulate a network of unreliable peers and continually check the file availability. Finally, the performance was tested by performing the main user actions multiple times and finding the average and median run times. The graphical user interface of the system was examined both by analysing the user interface according to the principles that will be outlined in the comparison of existing systems, and by publishing a small survey to get real user input.
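As an illustration of how such timings can be gathered, the following minimal Node.js harness runs an asynchronous user action repeatedly and reports the average and median run time. The harness and the names in it (including the archiveSite example) are illustrative, not taken from the actual test code:

    // Sketch of a timing harness: run an async action `runs` times and
    // report the average and median duration in milliseconds.
    async function timeAction (action, runs = 20) {
      const times = []
      for (let i = 0; i < runs; i++) {
        const start = process.hrtime.bigint()
        await action()
        const end = process.hrtime.bigint()
        times.push(Number(end - start) / 1e6)
      }
      times.sort((a, b) => a - b)
      const average = times.reduce((sum, t) => sum + t, 0) / times.length
      const median = times[Math.floor(times.length / 2)]
      return { average, median }
    }

    // Example usage with a hypothetical user action:
    // timeAction(() => archiveSite('https://example.com')).then(console.log)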

1.5.1 Results

In terms of functionality, the system meets its goals. This means that the user can archive a site, which is then added to a network of peers through a distributed hash table. The front-end application allows the user to archive, view and delete sites, as well as access a share ID that can be shared with other users, allowing them to fetch a copy of the same file. The functionality is tested by calling the back-end functions that are used by the application, and tests both the functionality of the distributed hash

table and the file archiving. The results show that this works as intended, except for occasional errors that occurred particularly during the more computationally heavy tests.

Reliability was tested by simulating a small network of peers using Docker containers, which allowed us to disconnect and connect peers on the fly. The intention of this was to simulate a real network state with unreliable peers, without having to distribute the peers across different machines. We ran two separate, yet similar, reliability tests. Both tests created ten peers, had one peer archive multiple sites, and then periodically checked how many copies were available for the various archived sites. The difference between the tests was the actual network simulation: one disconnected and reconnected peers to the network, and the other stopped peers and created new ones. The results show that the system is reliable to an extent. As we decided to use a pre-existing implementation for the distributed hash table, it was difficult to determine whether values were republished to new peers. The tests ran for ten hours, to give the system ample time to republish values, which it should do each hour, according to the Kademlia algorithm. Using libp2p alone, this seemed not to be the case, so we implemented a rudimentary republishing function that runs every hour, but leaves the republishing up to the peer that initially archived the file (a sketch of such a loop is given at the end of this subsection). While this is not an ideal solution for a final system, results show that it works as intended, rendering the system reliable according to our slightly limited tests.

The performance was tested primarily through the two main user actions: archiving a site and retrieving it from the network. To ensure that the tests reflected a more realistic system state, these tests were run on two machines connected through the system. Additionally, the archiving of a site was separated into two different actions to test the file handling and peer handling separately. The results from the tests show that archiving a site takes on average about 4 seconds, and that storing and retrieving it from the network takes less than 1 second. Archiving a site is the most time-consuming user action, and while this means that the user might lose their immediate focus on the task, it does not take so long that they grow bored and abandon it entirely [4, p. 135]. Because of this slightly longer response time, the graphical user interface includes a loading animation, to indicate to the user that the system is working. Storing and fetching a file from the DHT is fast, and well below the 1-second limit for keeping the user’s attention. The system, in terms of performance, is therefore considered to be user friendly.

In a more developed version of the system, it would be necessary to examine whether the tools currently used to archive files are the best fit. They are well suited to capture screenshots of websites, but as a later version of the system would aim to archive more information from a site, it would be necessary to archive the files in some other way. This would, in turn, call for a reexamination of the performance of the system. As it stands now, the performance is satisfactory.

To ensure that the graphical user interface is user friendly, we analysed existing peer-to-peer and Internet archiving systems with regards to

established design principles, and based our system design on the findings from this. We also conducted a short analysis of the final design according to the same principles. Given the time constraints and scope of the project, it was decided not to conduct extensive user testing, as this would take a lot of time and effort, and would ultimately be difficult to organise under the social distancing rules in force during the Coronavirus pandemic, when this part of the project was conducted. As a result, we only got input from a small number of test users through a survey, which nonetheless proved valuable and gave the analysis a stronger foundation. The conclusion is that the graphical user interface is user friendly, as it only has a limited set of possible user actions, and these are easy to understand and based on established norms and principles for web design.
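The rudimentary hourly republishing mentioned earlier in this subsection could look roughly like the following sketch, in which the peer that originally archived a file periodically re-puts its records into the DHT. Here, dht.put and localArchive are hypothetical helpers standing in for the actual implementation:

    // Sketch: the archiving peer re-puts each of its records every hour
    // so that copies lost to departed peers are recreated on current ones.
    // `dht` and `localArchive` are assumed helpers, not the actual names.
    function startRepublishing (dht, localArchive) {
      const ONE_HOUR = 60 * 60 * 1000
      return setInterval(async () => {
        for (const { fileId, bytes } of localArchive.entries()) {
          try {
            await dht.put(fileId, bytes) // re-replicates to currently connected peers
          } catch (err) {
            console.error(`republish failed for ${fileId}:`, err.message)
          }
        }
      }, ONE_HOUR)
    }

As noted above, this leaves republishing entirely to the original archiver; a full Kademlia implementation would instead have every storing peer republish its records.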

1.6 Conclusion

The goal of this thesis was to design and implement a prototype for a peer-to-peer Internet archiving system, rooted in the robust and collaborative nature of peer-to-peer systems. Our system is primarily designed for personal use, as intended, but it also offers the possibility to share file IDs with other users, allowing them to fetch a copy of the file from the system. As a peer-to-peer system is dependent on a certain level of collaboration between the users to function, this adds a social aspect to the system that reminds the users that they, as individual users, benefit from contributing to the whole. A thorough examination of existing peer-to-peer and Internet archiving systems revealed that many systems meet some, but not all, of our criteria. This examination was therefore used to guide the design of our system, drawing inspiration from the solutions that are already available.

Our three main evaluation criteria were reliability, performance and user friendliness. One of the driving factors for this project was the unreliable nature of client-server structures, and therefore of the Internet itself, which is also the motivation behind most Internet archiving systems. Reliability, therefore, means that a file should remain available in the system indefinitely, and is one of the main arguments for choosing a peer-to-peer structure for the system. Creating a peer-to-peer system, and using a distributed hash table for efficient and organised communication between peers, makes it possible to replicate files across a network of machines, eliminating the single point of vulnerability that is present in client-server structures. Our test results show that this duplication of files ensures that a file remains available even if multiple machines go down.

Performance and user friendliness go hand in hand, as they both make sure that the system is easy and enjoyable to use. Good performance means that the user actions do not take too long to perform, which in turn keeps the user’s attention on the system without boring them. In addition, designing the graphical user interface in such a way that it is easy to use makes the system accessible to a larger number of users, and ensures that the experience of using the system does not cause frustration in any way. The test results show that the system is performant to a

satisfactory degree and that the graphical user interface is user friendly, taking into consideration the constraints of the project.

The main contribution of this project is the way it combines the robust and collaborative nature of peer-to-peer systems with the user friendliness and usability of personal Internet archiving systems. While the reliability of the final prototype was not as good as the project set out to achieve, the project still illustrates the advantages of using a peer-to-peer system with a distributed hash table in the creation of a decentralised and distributed Internet archiving system.

1.7 Work Done

One topic that was initially explored in the background section, and researched for the design and implementation of the system, was the nature of the archived content. This specifically entails whether or not the copy of the website should look and function in the same way as the original, or if it should merely capture content like text, images and videos without preserving the structure of the original website. In particular, this involved research into WARC files, a file format specifically designed for archiving websites, and how to handle these. It would have taken a lot of time and effort to use these more complex files in the final prototype. Therefore, it was ultimately decided that the appearance of the files was not directly tied to any of the goals for the prototype, and this research was all but removed from the final thesis.

1.8 Limitations

The system is a prototype, and as such does not have all the functionality that a final version of the system should have. As mentioned in the previous section, one aspect that was greatly compromised was the archived files. All the examined Internet archiving systems fetch either the whole site or all the text and images that are on the site to be archived. After examining which libraries were available for Node.js, it was decided to use screenshots instead of actual text, images and styling. This was easier to implement, and as it did not directly address any of the core issues we had identified, we decided it was enough for the initial prototype. Smaller compromises were also made, such as the decision not to set up a proper database, and not allowing the user to decide whether they want a local copy of a given file.

As mentioned earlier, using a pre-existing distributed hash table implementation has had both advantages and disadvantages. On the one hand, it has saved us a lot of time and effort, but it has also left one of our most prominent requirements, namely reliability, out of our control. The result of this is that it was difficult both to test the reliability and to draw any definite conclusions as to whether this criterion was met to a satisfactory degree. Another limitation that ties into this was the inability to test with a large-scale network of peers. A lack of time and access to a large

number of remote machines meant that none of the tests were conducted in a realistic environment, which makes it difficult to say for certain whether the evaluation criteria were met.

The evaluation of the graphical user interface also ended up being very limited. It was never intended to be extensive, as user friendliness was only one of several evaluation criteria. However, as explained in section 1.5.1, it ended up both limited and somewhat rushed. As such, the examination of the user friendliness might be biased, both because it is difficult to be impartial in an analysis of something of one’s own creation, and because the test users were few and homogeneous. This will be examined in more detail in the evaluation itself. Another limitation to the user friendliness is the way that the system is run by the user. The prototype requires knowledge of how to use a terminal and package managers, which means that, as it stands right now, the installation and setup process is not user friendly, and the discussion of user friendliness is therefore limited to the graphical user interface itself.

1.9 Outline

The thesis is divided into three overarching parts, with one or several chapters each. The following list gives a short overview of the various parts and chapters:

Part I: Preliminaries

Chapter 2: Background examines both peer-to-peer and Internet archiving systems in detail, and provides a systematic comparison of multiple specific systems within each category. It will also briefly address the cultural context of our project.

Part II: Project

Chapter 3: Analysis analyses the findings from the comparisons, and outlines the problem that the project is solving. It also suggests some core functionality and how to evaluate the system once it is completed.

Chapter 4: Design presents the core functionality of the system, the technical system design and the graphical user interface. Kademlia, the distributed hash table algorithm used in the prototype, is described in detail, as well as other details regarding the files that will be saved in the system. Finally, the graphical user interface is outlined with a low-fidelity prototype, and a discussion of the design according to established design principles.

Chapter 5: Implementation gives an overview of the technical implementation of the system. First, the general structure is presented, before a description of how to run the system is given. Then, the back-end and front-end are described in detail, including file structure, implementation choices and important components.

Chapter 6: Evaluation presents both the tests that were written to evaluate the system and the results from these. The functionality, and by extension the evaluation criteria, were all tested with automated tests, while the user friendliness was examined with a survey and an analysis of the graphical user interface according to established design principles.

Part III: Conclusion and future work

Chapter 7: Conclusion summarises the project, including the back- ground, design, implementation and evaluation, as well as the limitations. The chapter also reflects on the system in a broader perspective and outlines future work.

Chapter 2

Background

2.1 Introduction to Background

This chapter will be the foundation on which we build our system. The topics examined and discussed in this chapter are all important for understanding the background and motivation for the system proposed in this thesis. Throughout the chapter, the historical, technical and cultural background of the project is examined. Additionally, both peer-to-peer and Internet archiving systems are thoroughly examined and compared, which will lay the basis for our system design.

The first section looks at peer-to-peer systems, first by examining the characteristics of such systems, before moving on to a comparison of five different systems. This comparison reveals the advantages and disadvantages of various types of peer-to-peer systems. It also highlights that a decentralised and distributed system is a good structure for a file system, particularly one that includes file sharing. Users in a peer-to-peer network can communicate and transfer files to each other without having to go through a central server. Using a distributed hash table can ensure that this inter-peer communication is efficient and structured.

Following this, we take a closer look at Internet archiving systems. Similarly to the previous section, this is done by looking at their history and characteristics and by comparing five different systems. Most existing Internet archiving systems are centralised and do not offer the advantages of peer-to-peer systems that we wish to utilise in our system. However, as the comparison shows, there are good solutions out there that are intended for personal use, have a sharing function and are user friendly, all of which are important aspects of our system.

The cultural and historical importance of Internet archiving is discussed in the section on Internet archiving, as this is an integral part of the very existence of such systems. However, there is another cultural aspect of our project that is not explored in any of the other sections, namely sharing. Therefore, the final section briefly examines sharing in a cultural context by looking at torrenting systems and social media.

2.2 Peer-to-peer Systems

As the project is centred around making a peer-to-peer system, this section will explore the characteristics and background of such systems, and give an overview of similar systems and related work that already exists. First, the section looks at the decentralised and distributed nature of peer-to-peer systems, before giving a brief overview of some of their characteristics and a more detailed description of distributed hash tables. Following this, five examples of peer-to-peer systems are discussed in a systematic comparison, where they are compared according to a number of categories that tie directly into this project.

2.2.1 Decentralised and Distributed

Figure 2.1: Difference between centralised and decentralised systems

Before introducing peer-to-peer systems in detail, this section will look at the decentralised and distributed nature of such systems, and how this separates them from other computer networks. Generally speaking, a computer network is distributed by nature, because it runs on multiple machines, as opposed to systems that only execute on a single machine. However, there is an important distinction between decentralised and centralised computer networks that will be explored in this section. Distribution, as mentioned, is concerned with the location of the entities of the system, whereas centralisation and decentralisation are a matter of control. If the control of the system is located in one place, the system is centralised (see figure 2.1). A typical example of this is client-server systems, where one server works as the communication point for all the

clients. It is, however, possible for a centralised system to be distributed. An example of this is cloud services, which potentially utilise multiple machines to save the users’ files, while still maintaining control of the entire system in one place. Decentralised systems, on the other hand, have multiple points of control and do not rely on one central entity to maintain control of the entire system (see figure 2.1). This is especially useful to reduce the possibility of a system crash being caused by one machine failing (see 2.2.1). However, it can be difficult to create one cohesive decentralised system without adding too much communication overhead in order to maintain a synchronised system through distributed consensus. It is, therefore, necessary to ensure that protocols and algorithms are established to ensure good communication and synchronisation. Centralisation, then, is a matter of control, whereas distribution concerns the location of system entities. As mentioned, distributed systems can be both centralised and decentralised, and peer-to-peer systems are usually both distributed and decentralised, though there is at least one example of a centralised, but distributed, peer-to-peer system, which we will come back to in section 2.2.4.

Single Point of Failure

As mentioned, one of the main arguments for decentralised systems is that there is no single point of failure. A single point of failure is one machine or central server that all requests have to go through, making it essential to a system’s functionality. In traditional client-server systems, the server will be one of those points. Such a point can pose a major problem because if it fails, it will cripple the entire system. One solution to combat single points of failure is redundancy, meaning that the system contains a certain number of copies of whatever asset needs to be available to users. For a centralised system, this may mean that there are multiple servers to minimise the likelihood of a complete system failure. In a peer-to-peer system, this entails both the fact that the system is decentralised and that shared resources are replicated across the available peers. For a peer-to-peer file-sharing system, this means that the files need to be replicated across multiple computers to ensure access even if a number of them crash. This is necessary because the users’ own computers act as the nodes, and it is to be expected that they will go down with a relatively high frequency (see 2.2.4).

2.2.2 Characteristics

A peer-to-peer (P2P) system is a decentralised network system where every peer (node) in the network is considered equal, and the goal is to share resources such as storage space and bandwidth without relying on a central server [5, pp. 424–425]. One of the biggest differences between peer-to-peer systems and client-server systems is that the peers communicate directly with each other in the peer-to-peer system, as opposed to having

one central point of control. This approach to network computing avoids certain bottlenecks, as discussed above, and utilises the computing power that exists on the users’ own machines. With these underlying ideas, several different types of P2P systems have emerged, with variations in structure and the level of centralisation. Structured P2P systems have a so-called Distributed Hash Table (described in 2.2.3) which links content and IP address. Unstructured P2P systems have no correlation between content and IP address, which may result in large network loads whenever a peer searches for content, depending on the level of centralisation. P2P systems having varying degrees of centralisation may sound at odds with their definition, but it is a compromise that has been made in many systems to ensure that a search will yield results within a reasonable time frame. There are three overarching types of unstructured P2P systems, depending on their level of centralisation: centralised, pure and hybrid.

Centralised These P2P systems rely on a central server to look up information about the location of content, while the storage and transfer of content still occur between peers. This system structure is great for easy and fast lookup, but problematic because the server is a single point of failure, and needs a lot of storage space to account for growth [6, p. 38].

Pure Pure P2P systems do not have any type of centralisation, and all peers connect directly to a number of other peers. To connect to the system or locate a file, the network is simply flooded with messages, to ensure a high probability of getting to the right place eventually. This results in a lot of potentially long-distance messages, which may result in a slow network due to high bandwidth consumption [6, p. 49].

Hybrid These systems do not have one central server, but instead add an additional layer of peers. These peers act as servers for a limited number of nodes [6, p. 49]. This is done in an attempt to minimise the message load while avoiding having one single point of failure.

2.2.3 Distributed Hash Table (DHT)

Unstructured P2P systems can suffer from either too much search traffic or vulnerable components. Researchers have attempted to solve these problems by designing systems that use distributed hash tables to look up content. A distributed hash table (DHT) is a hash table where the content is saved on a specific peer according to its address. The hash table itself matches the content’s hash (key) to the actual content (value) and then saves the (key, value) pair on the appropriate peer. To be able to find out where to save and fetch content from, DHT systems utilise a routing algorithm to make consistent saving and lookup possible. There are various approaches to these algorithms, as can be seen for example in the Pastry [7] and Tapestry [8] algorithms. The Pastry system

seeks to utilise the technical aspects of P2P systems, like decentralised control and self-organisation, and provides a content location and routing system through an overlay network of connected nodes; it will be examined more closely in the next section (2.2.4). Tapestry is also an overlay network that provides a distributed hash table and routing mechanisms to create a self-organising and fault-tolerant P2P system. Both these systems avoid the large amount of traffic that flooding the network with messages causes, by using routing tables and passing messages through the system based on node IDs.
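Kademlia, the DHT used later in this thesis, takes a related approach but measures closeness between IDs with the XOR metric. The following sketch illustrates the general idea of mapping a key to the nodes responsible for storing it; it is an illustration only, not code from any of these systems, and IDs are shortened to small integers for readability:

    // XOR distance between two IDs; in a real DHT, IDs are 160-bit values.
    function xorDistance (a, b) {
      return a ^ b
    }

    // A key is stored on the k nodes whose IDs are closest to it.
    function closestNodes (key, nodeIds, k) {
      return [...nodeIds]
        .sort((x, y) => xorDistance(key, x) - xorDistance(key, y))
        .slice(0, k)
    }

    // Example: key 13 in a network of five nodes, replicated on 2 of them.
    console.log(closestNodes(13, [2, 7, 9, 12, 15], 2)) // [12, 15]

Because every node can compute these distances locally, a lookup can always be forwarded to a node that is strictly closer to the key, which is what bounds routing to a logarithmic number of steps.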

2.2.4 Taxonomy

For P2P systems, we will examine five different file-sharing systems that are of either historical or cultural relevance to this project. We wanted the selection to be diverse to showcase the pros and cons of various types of P2P systems. The goal of this section is to give a brief overview of the systems and highlight some key aspects of them to provide grounds for a discussion and comparison. The findings from this section will then aid in the design of our own system. This section starts by giving a brief introduction to the systems, before moving on to the key aspects, which will be examined one by one. We have chosen to look at the systems’ level of centralisation, whether they are structured or unstructured, and how they approach file sharing and file availability, as well as their user friendliness and performance. The level of centralisation and file availability address the benefits of creating a decentralised file system; file sharing examines the systems as communities for sharing; and user friendliness and performance both address the user experience. The structure of the system is a matter of implementation and is discussed in relation to this.

Systems

Gnutella
Gnutella is a large P2P network that used to be one of the most popular file-sharing services on the Internet, next to similar systems like FastTrack [9, p. 21]. The first iteration of the system had no overlaying structure and relied on a method called “flooding” to get messages to the right place [6, p. 43]. Whenever a peer requested a file, this request would be sent to every peer it was connected to. They would, in turn, forward the message to every peer they were connected to until the file was located. Later iterations of Gnutella added a new type of peer called super peers (named ultrapeers in Gnutella), which keep track of the peers that are connected to them, and are in charge of routing messages to the right place [6, p. 37]. To connect to the Gnutella network, the user has to download a Gnutella client. Before its discontinuation in 2010, the most popular Gnutella client was LimeWire, which allowed users to search for and download files [10].

Napster
This system was one of the first to popularise the sharing of media content like music in a peer-to-peer manner [9, p. 19]. The company itself was bought out and is still active, and Napster is currently the name of a music streaming service, but the discussion in this thesis will revolve around the inoperative P2P file-sharing software. The Napster system was made up of a central server and a number of peers. The server kept track of all files available in the system, and which peers had copies of each file, while the peers themselves could upload files and request files. Peers had to communicate with the server every time they uploaded or requested a file, but the server itself only contained information; file storing and file transfer were left up to the peers. To participate in the service, users had to download the Napster client, which provided a user interface to keep track of the user’s files and downloads.

PAST (Pastry)
PAST is a P2P storage system based on the Pastry location and routing scheme [7]. Pastry is a DHT where each node is assigned a random and unique node identifier. Each node keeps track of two node sets, a leaf set and a neighbourhood set, as well as a routing table. The leaf set contains the nodes that are numerically closest to the given node by nodeId, and the neighbourhood set contains the nodes that are closest to the given node by a proximity metric decided by the system that uses Pastry. The routing table is used in routing messages and ensures that the message is always being brought closer to its destination with each routing step [7, pp. 331–332]. When a file is inserted into the PAST system, a fileId is assigned to it and it is routed to the n nodes whose nodeIds are numerically closest to the 128 most significant bits of the fileId [11, p. 76]. This file replication, coupled with the routing algorithm provided by Pastry, means that the PAST system can ensure file availability and scalability in a P2P file system.
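This placement rule can be illustrated with a short sketch: a file is replicated on the k nodes whose IDs are numerically closest to the fileId. Real PAST compares 160-bit identifiers (using the 128 most significant bits of the fileId); plain BigInts are used here purely to show the idea:

    // Sketch of PAST-style replica placement with simplified numeric IDs.
    function replicaNodes (fileId, nodeIds, k) {
      const distance = (nodeId) =>
        fileId > nodeId ? fileId - nodeId : nodeId - fileId
      return [...nodeIds]
        .sort((a, b) => (distance(a) < distance(b) ? -1 : 1))
        .slice(0, k)
    }

    // Example: replicate fileId 90 on the 3 numerically closest nodes.
    console.log(replicaNodes(90n, [10n, 70n, 85n, 120n, 200n], 3)) // [85n, 70n, 120n]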

BitTorrent
BitTorrent is the largest P2P network in the world [9, p. 21; 12; 13, p. 205]. BitTorrent is a file-sharing P2P communication protocol, which splits files into little pieces and allows the user to download different pieces from different places, making it very effective in the transfer of large files, such as video files. To use BitTorrent, the user needs to download a client like uTorrent1 or Vuze2, which implements the BitTorrent protocol. The initial version of the protocol was dependent on so-called BitTorrent trackers to function. BitTorrent trackers are servers that keep track of where copies of a file are located and which peers are currently available [14, p. 368]. This information is relayed to any peer that requests a file. Information about trackers and file metadata can be found in torrent files, which have to be downloaded from a website that keeps an index of torrent files. To eliminate the need for trackers, several

1 https://www.utorrent.com/
2 https://www.vuze.com/

BitTorrent clients have also implemented a DHT, which allows for “trackerless” torrents. According to the BitTorrent protocol, a user will simultaneously download and upload a file, which will make it faster and easier for other users to access it. When the download is finished, the user can choose to continue uploading the file [14, p. 368].
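The piece-splitting idea at the heart of the protocol can be sketched in a few lines: a file is cut into fixed-size pieces and each piece is hashed, so that pieces downloaded from different peers can be verified independently. This shows the general principle only, not the actual .torrent metadata format:

    // Sketch: split a file into fixed-size pieces and hash each one.
    // BitTorrent uses SHA-1 piece hashes to verify downloaded pieces.
    const crypto = require('crypto')

    function pieceHashes (fileBuffer, pieceLength = 256 * 1024) {
      const hashes = []
      for (let offset = 0; offset < fileBuffer.length; offset += pieceLength) {
        const piece = fileBuffer.slice(offset, offset + pieceLength)
        hashes.push(crypto.createHash('sha1').update(piece).digest('hex'))
      }
      return hashes
    }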

IPFS
The InterPlanetary File System (IPFS), first released in 2015, is a P2P file-sharing system with the goal of making the entire Internet a distributed P2P system [15]. The creators of IPFS argue that the current client-server structure of the Internet is vulnerable and that websites’ availability suffers from being dependent on whatever server they are located on. They suggest using content-addressing, as opposed to addressing by location. This means that the system will locate a website by its content rather than its physical location. It also makes it possible for the same content to be stored in multiple different locations; on request from the user, the system will locate the one nearest to them [16]. IPFS uses libp2p, a DHT implementation for message routing and content location [17], as the basis for its network. To use IPFS, a user has to download the desktop app. There is also a browser extension available to make use of the system easier.
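Content-addressing can be illustrated in a few lines of Node.js: the lookup key is derived from the bytes themselves, so identical content has the same address no matter where it is stored. IPFS actually uses multihash-based content identifiers (CIDs); a bare SHA-256 digest is used here purely to show the principle:

    // Sketch: derive a lookup key from content rather than from location.
    const crypto = require('crypto')

    function contentAddress (bytes) {
      return crypto.createHash('sha256').update(bytes).digest('hex')
    }

    // Identical content always maps to the same address...
    console.log(contentAddress(Buffer.from('hello')) ===
                contentAddress(Buffer.from('hello'))) // true
    // ...while any change in content moves it to a different address.
    console.log(contentAddress(Buffer.from('hello!')) ===
                contentAddress(Buffer.from('hello'))) // false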

Centralised

While it could be said that a centralised system is not a P2P system at all, this distinction between centralised and decentralised is kept in the comparison to accommodate Napster. Centralised, in this context, does not mean that the entire system is centralised, but that there is a central entity that the entire system depends on. Napster, therefore, is centralised and distributed, and is classified as a P2P system because the peers communicate directly with each other to request and transfer files. As mentioned, Napster was one of the first systems of its kind, and it is important to recognise this type of system in the emergence of P2P systems, both to highlight the problems that this system structure solves and to highlight the issues that can arise in such a system. P2P systems can be hard to scale because it is difficult to organise a large number of peers, which will be touched upon in section 2.2.4. Having a central server, as Napster does, eliminates the need for a routing algorithm that ensures that every message reaches its destination within a reasonable amount of time. This makes it very easy for peers to connect to the system, request a file and contact other peers, as the central server keeps track of all files and peers. However, this server will also be a single point of failure, and will be subject to congestion should it receive too many requests at one time. It is possible, and necessary for systems with many users, to do some load balancing. This entails using multiple servers to distribute network traffic to prevent congestion. However, acquiring and maintaining a large number of servers is both expensive and time-consuming. So, while centralised P2P systems can make it easy to keep track of

the peers, their files, and the communication between them, it ultimately does not utilise the aspects of P2P systems that make them desirable to many in the first place. They lack decentralised control and have a single point of failure that the peers cannot function without. Napster’s role in popularising the often illegal distribution of music over a P2P network will be further discussed in section 2.4.1.

Decentralised

This category encompasses every system that does not have one central unit of control, but is administered by the network of peers. This could be either hybrid systems, where some peers have more administrative roles than others, or pure peer-to-peer systems, where nodes communicate directly with each other on equal ground. Gnutella, PAST, BitTorrent and IPFS are all decentralised systems, even though they could be said to have varying degrees of decentralisation. One definitive advantage of decentralised systems is the lack of one central unit that the entire system depends on. These units are a vulnerability for the entire system, as discussed in section 2.2.1. This also means that the system does not need centralised control, as it can govern and organise itself. In reality, it is difficult for a system to be entirely self-governed, and many of these systems use different measures to ensure that peers are able to locate the files they want.

Gnutella implemented so-called super peers to improve the system’s performance and scalability, as discussed later in this section. This added a hierarchy to the nodes in the network, and left all peers in one of two categories: super peers or leaf nodes [6, p. 49]. Leaf nodes only communicate with super peers, which act as servers for all their connected leaf nodes. Any messages go through the super peers, which keep track of their leaf nodes and which files they have. To ensure that each super peer does not act as a single point of failure for a portion of the system, each leaf node is connected to a number of super peers [5, p. 447]. This way, the leaf nodes are not dependent on one super peer to be able to participate in the system.

Even though BitTorrent is no longer dependent on trackers to function properly, it still needs index sites, which are sites that keep track of large amounts of torrent files. Examples of this are , RARBG and TorrentDownloads [18]. In addition, to be able to join the DHT offered by some BitTorrent clients, the joining peer needs to know about at least one other peer in the DHT, which makes it necessary to have some kind of bootstrap server. Both these index sites and bootstrap servers are vulnerabilities that could cause the system harm, should they be compromised. However, seeing as there are many sites available, and BitTorrent clients each have their own DHT and can connect to it through other peers they are communicating with, the chance of this being a critical vulnerability to the BitTorrent network is relatively small [19, p. 2222].

IPFS and PAST both use a DHT to keep track of and route messages

between peers. This eliminates the need to implement super peers, as every peer plays an equal part in the work of keeping an updated record of peers. In both of these systems, peers communicate directly with each other and keep track of the peers that are closest to them in routing tables. As peers leave and join the network, the peers update their routing tables according to which peers are available to them. Any communication between peers is routed from peer to peer in a manner that ensures that the message is always brought closer to the target peer. All the systems mentioned above are P2P systems where the peers communicate directly with each other. Even so, they usually cannot operate entirely on their own, and utilise a number of solutions for efficient communication, file locating and peer initialisation. Implementing super peers is one solution that introduces a hierarchy of peers to limit the information most of the peers need to keep track of, and increases the efficiency of the communication. In the case of torrenting services, the peers do not need to know which other peers are available to them until they want to download a given file. Therefore, it makes sense for the files to keep track of a number of peers that have the file, and to locate other peers after the download has begun. Finally, using a DHT allows a system to be more or less fully decentralised and self-organised, as peers continually keep track of which peers are available to them.

Structured/Unstructured

In this category, structured means that the network of peers uses some sort of overlaying structure to communicate, like a DHT. Unstructured systems are systems that do not systematically route messages through the network but use other techniques to ensure that messages get where they are supposed to. One example of a structured system is PAST, which uses a DHT to organise the network so that routing messages will be both fast and consume as few resources as possible. This approach is beneficial because it allows the system to be self-organised, and guarantees that messages will reach their destination in a limited number of steps, as discussed in section 2.2.4. The downside of this approach is that initialisation and upkeep of the system will be more costly; each time a peer joins or leaves, routing tables across the network have to be updated. DHT algorithms attempt to make this as efficient as possible, but it is still a source of overhead for structured systems that use DHTs.

Napster and BitTorrent are both interesting cases that fall in between structured and unstructured. Napster could be said to be structured, as it has a central server that keeps track of files and peers, but it does not organise the peers in any way, which is why it is also defined as unstructured. BitTorrent, on the other hand, used to be unstructured and has since added a DHT, which works together with the traditional trackers, to improve the system. This has allowed the peers themselves to keep track of a given file’s available peers, without relying on a server for this [20].

The earliest iteration of Gnutella, in particular, suffered from being unstructured, as flooding peers with messages potentially takes up a lot of bandwidth and time, especially if the peer the message is intended for is not close to the peer who sent it. This becomes an even bigger problem as the size of the system grows, and messages have to travel through a lot of peers to get to their destinations. The implementation of super peers improved the Gnutella system’s scalability and its ability to efficiently route messages.

Structured systems, therefore, are fast and reliable in terms of routing, but they require more upkeep to ensure that the structure holds even after multiple nodes leave and join the network. Unstructured systems are faster to set up and require less upkeep, but they suffer in terms of bandwidth and routing.

File Availability File availability is one of the core issues of our system, which will be discussed further in section 3.1.1. It entails whether or not the system can guarantee the availability of a file, and what the limitations of P2P file sharing systems are in this respect. This issue is at the very core of P2P file systems, as their decentralised and distributed nature relies on the users' machines to store files. Unlike server farms, which exist for the purpose of always being on and available, personal computers are constantly being turned on and off. This means that these types of systems are, at their core, unreliable, and measures need to be taken to ensure file availability.

As mentioned, the users of the system provide both the network and the storage facilities in a P2P system, and these users are connected with a wide variety of machines, in terms of availability, storage space and computational power. One solution to the issue of file availability is to make sure that any file is duplicated a certain number of times throughout the network. To make this work in a real system, it is also necessary to keep an updated record of any replicas that are dynamically created as machines come and go. Usually, a P2P file sharing system will rely on file duplication to keep files available even when multiple peers disconnect from the system.

Napster, Gnutella and BitTorrent are all reliant on the users to have copies of the files. These systems will only have as many copies of a file as there are users who have decided to upload it, and if none of those users are online, the file is not available to other users. None of these systems perform file duplication to keep a file available, but users of BitTorrent will simultaneously download and upload a file. A peer who is downloading a file is called a leecher, and a peer who is uploading a file is called a seeder. BitTorrent peers use a tit-for-tat strategy, which means that each peer tries to optimise its download speed by punishing peers who only leech [21], as sketched below. BitTorrent does this, in part, to ensure the availability of files. However, as all these systems rely on users to keep a file in the system, they cannot guarantee file availability, which often results in less popular files being rendered unavailable [22, 23].
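The following Python sketch illustrates the tit-for-tat idea in simplified form; it is not BitTorrent's actual choking algorithm. The slot count, the rate bookkeeping and the peer names are assumptions made for illustration.

```python
import random

def choose_unchoked(download_rates: dict[str, float], regular_slots: int = 3) -> set[str]:
    """Simplified tit-for-tat unchoking.

    `download_rates` maps peer IDs to the rate at which each peer has
    recently uploaded to us. Peers who only leech (rate 0) tend to lose
    their upload slots; one extra slot goes to a random remaining peer
    (an "optimistic unchoke") so newcomers get a chance to contribute.
    """
    best_first = sorted(download_rates, key=download_rates.get, reverse=True)
    unchoked = set(best_first[:regular_slots])  # reward the best contributors
    remaining = [p for p in download_rates if p not in unchoked]
    if remaining:
        unchoked.add(random.choice(remaining))  # optimistic unchoke
    return unchoked

rates = {"alice": 120.0, "bob": 45.5, "carol": 0.0, "dave": 88.2, "eve": 0.0}
print(choose_unchoked(rates))  # the two pure leechers compete for one slot
```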

PAST utilises file duplication to ensure file availability, and the files are strategically placed on peers according to their hash. This makes it easier to locate a file, as it is possible to use the routing table to find a peer who has it. PAST will also maintain a given number of copies, as long as there are enough peers to store copies on.

In a P2P system where file availability over time is a core element, file duplication is necessary to be able to guarantee availability. Many of the systems in the comparison make no attempt at guaranteeing file availability, because they mainly exist as file sharing systems rather than file storage systems. The result is that the most popular files remain available and exist across many peers, while less popular files will be difficult to come by, and may disappear over time. A file system that seeks to maintain copies of the users' files cannot follow the same approach, because the files belong to users who most likely want to keep everything they archive. One solution to this is file duplication combined with systemised placement according to file hash, which makes the files easy to locate.
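A maintenance pass of this kind could look like the following Python sketch. It is a minimal sketch under stated assumptions, not PAST's actual algorithm: the replication factor, the interval, and the count_replicas and store_on operations are hypothetical stand-ins for what a real DHT layer would provide.

```python
import time

REPLICATION_FACTOR = 4      # desired number of copies per file (assumed value)
REPUBLISH_INTERVAL = 3600   # seconds between maintenance passes (assumed: hourly)

def maintain_replicas(stored_file_ids, live_peer_ids, count_replicas, store_on):
    """One maintenance pass: top up the replica count of every stored file."""
    for file_id in stored_file_ids:
        missing = REPLICATION_FACTOR - count_replicas(file_id)
        if missing <= 0:
            continue  # enough copies already exist
        # Place the missing copies on the peers whose IDs are closest to the
        # file ID, mirroring the strategy of storing files by hash so that
        # the routing table can later be used to find them.
        closest_first = sorted(live_peer_ids, key=lambda peer: peer ^ file_id)
        for peer in closest_first[:missing]:
            store_on(peer, file_id)

def republish_loop(node):
    # Run forever, once per interval, so copies lost to departing
    # peers are continually replaced.
    while True:
        maintain_replicas(node.files, node.peers, node.count_replicas, node.store_on)
        time.sleep(REPUBLISH_INTERVAL)
```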

File Sharing Historically, many P2P systems have been centred around file sharing, and all the systems in this comparison have ties to this category. As this project seeks to create a system that is both personal and collaborative through sharing, this is an interesting category to explore. This section will look at how the various systems approach file sharing, and how these approaches differ.

Gnutella, Napster and BitTorrent are all examples of file-sharing P2P systems that have been, perhaps wrongfully, classified as "illegal" file-sharing systems. This is because they have been widely used for the illegal distribution of files, which will be discussed further in section 2.4.1, but neither the clients, networks nor protocols they use are in any way illegal. Due to the vast amount of media content like music, tv-shows and movies available within these systems, they have also been, and still are, immensely popular and widely used. Napster reached millions of users [24, p. 339] within months of its inception, Gnutella is still alive to this day3, and a study from 2018 shows that BitTorrent at the time made up 4.10% of the global Internet traffic [12]. What these systems have in common is that they have been used by anyone who wants to transfer or download files, and have been popularised by the presence of readily available media content.

PAST is a file system that has not been implemented as anything other than a research project, but it was intended to be a global file storage utility. Sharing was never the main purpose of the PAST system, whose intention was to share bandwidth and storage space to be able to archive files, both to be able to store larger files and to ensure backup [11]. Still, sharing a file was possible through the distribution of a file's fileId, even if this was not a central aspect of the system. This system is, therefore, a file storage system for individual users that seeks to utilise the collaborative benefits of a P2P system.

IPFS is not so much a file sharing system as it is a website sharing system. Much like PAST, IPFS wishes to use the distributed and collaborative aspects of P2P systems to make the Internet more reliable and robust. One of the main arguments for the IPFS system is that it is possible to access a file at any time because it is not located on one particular server. The way the web is designed today, there is no simple way of achieving this, as all files on the Internet are located by their address, which corresponds to their physical location on a server. However, if multiple machines had copies of a file, it would be possible to access it, even if some of them were unavailable.

The most widespread use of P2P in file sharing, then, is the distribution of media like music, tv-shows, movies and books. These so-called "pirating" systems have been branded as illegal, though their use and functionality are not necessarily so. Other systems, like PAST, are research projects that were created with the intention of implementing a large-scale file storage utility based on P2P technology. Finally, IPFS has taken the advantages of P2P technology and applied them to the endeavour of making the Internet itself into a P2P system, to be able to access websites more efficiently and prevent them from disappearing.

3 http://gtk-gnutella.sourceforge.net/en/

User Friendliness User friendliness, in this thesis, entails whether or not the system's user interface (UI) is easy and intuitive to use. This matters because we want our system to be usable by anyone who uses the Internet and wishes to be able to access the content they consume, even if it is changed, removed or lost. Most of the systems in this comparison are available through a graphical user interface (GUI).

To define user friendliness, it is necessary to keep the intended user group in mind, as various groups of users have various needs and will view computer applications differently. Variations will also appear within one user group, but limiting the user group makes it easier to envision which prerequisites the user has for using the system. The target group for our system is people between the ages of 20 and 50, who have experience using both computers and existing bookmarking services. This could either be a bookmarking system provided by a browser, or a more advanced system like the ones described in section 2.3.

The UI of the various systems will be evaluated according to four design principles from Galitz [25], namely clarity, consistency, simplicity, and responsiveness. As many of the principles in Galitz's book are tightly interconnected, we chose to limit the comparison to these four. These principles also appear in many other books on user interface design and user friendliness, and the following list examines each principle and outlines what will be emphasised in the analysis.

Clarity This design principle is tightly connected with simplicity, but has been kept as its own category to focus more on individual elements and vocabulary rather than the overall structure of the UI. Clarity, as described by Galitz [25, p. 46], means that: "Metaphors, or analogies, should be realistic and simple. Interface words and text should be simple, unambiguous, and free of computer jargon." In particular, this category will look at the use of metaphors such as icons, and the vocabulary used in the UI.

Consistency Consistency is a design principle that is often brought up in writings about usability [25, pp. 48–49; 26, pp. 174–175; 1, pp. 404–405], and focuses on consistency in design and functionality. This means that any visual elements that have similar functions should look similar, that the use of fonts and colours should be the same across the UI, and that any user action should yield the same result every time. The users develop a mental model of how a system works as they use it, and this should be supported by the UI with predictable behaviour and visual design.

Simplicity This design principle is also one that has been emphasised often and repeatedly [25, pp. 56–57; 26, pp. 170–172; 27, p. 45], and as the name suggests, it highlights the importance of keeping user interfaces simple. To achieve this, the designer is encouraged to use well-known icons, words and UI controls, and to keep the UI well structured and free of clutter. This category will explore the overall complexity of the UI, looking at how many different elements are present and how these are organised.

Responsiveness This category mainly concerns the performance of the systems, and will therefore not be examined in detail until the next section, which specifically addresses this aspect of the systems in the comparison.

The user friendliness of Gnutella and BitTorrent depends on the client, as both networks are available through a variety of different clients. For the purpose of this comparison, Gnutella will be examined through one of the few clients still available, gtk-gnutella4 1.1.15 for Windows [28], and BitTorrent through its most popular client, uTorrent5 [29], specifically uTorrent Web 1.8.7 for macOS. IPFS will be examined through its desktop application, version 0.11.4 for macOS. Napster, as it is no longer available as an application, will be examined through a screenshot. As PAST is a research project, and not readily available for public use, it is exempt from the comparison. In addition, as none of the systems in the comparison are Internet archiving systems, they will be analysed based on their visual appearance only. The functionality of the systems is very different from the one we will be making in this project, and it was therefore deemed unnecessary to spend time and effort on this.

4 http://gtk-gnutella.sourceforge.net/en/
5 https://www.utorrent.com/

Figure 2.2: Launch screen of uTorrent Web

Figure 2.3: Download screen of uTorrent Web

BitTorrent uTorrent is now primarily available as a web app, called uTorrent Web. It still has to be downloaded and installed on the user's computer, but the UI is only available through a web browser.6 The uTorrent Web UI that meets the user upon launch, as seen in figure 2.2, is simple and straightforward. It allows the user to either search for a torrent, add a torrent manually, or view a video tutorial. There is also a small menu for user profile and settings in the top right corner. When the user adds a torrent, a different page with a list of the torrents is shown. Figure 2.3 shows this page, with one torrent in the list, which is opened to show the details.

In terms of clarity and simplicity, the uTorrent Web application has narrowed its functionality down to the bare essentials and displays no unnecessary information. The landing page, in particular, has eliminated anything that is not essential to the task at hand, adding a torrent, which results in a very simple design that makes it clear to the user what the possible actions are. For any users who are unfamiliar with the application, there is also a tutorial available, but this is not presented unprompted. As for the clarity of icons and vocabulary, the application uses only a few icons, and overall the vocabulary is clear and conveys the intention of every element in the UI. As for icons that are unaccompanied by text, the application has a small menu in the top right corner, as well as a small menu for each torrent file, both visible in figure 2.3.

uTorrent also has clear visual consistency, as can be seen through its use of the same colours and fonts throughout the design. The main colour scheme is a dark grey background with light grey accents, white text and a bright green colour to highlight certain elements. A blue colour is also used to highlight the download progress for a torrent.

6 This is the case for macOS after Catalina; Windows users can still download and use the desktop application.

Figure 2.4: gtk-gnutella GUI

Gnutella gtk-gnutella, as shown in figure 2.4, is a desktop app that automatically connects the user to the network and allows them to search for files. The search field is located at the top of the screen and includes an input field, options to change the media type to search for, and two configuration parameters. On the left side of the screen, there is a panel showing the user's searches, and multiple animated bars showing uploads, downloads and network traffic. The main portion of the screen is made up of the connected machines in the network, but this section of the GUI can be changed to show a number of different things, like search, downloads and statistics, as can be seen in the header for the section. At the bottom of the GUI, there are options to customise the network connections or disconnect from the network.

The app uses a lot of descriptive words in its GUI, which is to say that any menus or options are quite verbose and leave little doubt as to their function. Some might require domain knowledge to understand, like "Søkeovervåker" (Search Monitor), but in general, the clarity of the vocabulary is good. However, there are a lot of different options, menus and sections in the GUI, which affects both its clarity and simplicity. Additionally, everything is designed to look rather uniform, so no parts of the UI are highlighted and differentiated from the others, making the visual hierarchy entirely dependent on the elements' location on the screen. The design is simple in that one thing dominates the GUI, but there is still a lot of information on the fringes of the design that battles for the user's attention.

Visually, the design is consistent in that it has a simple colour scheme throughout. The background is mainly white with black text, while the header and sidebar have a grey background. Some bright colours are used to highlight certain elements, particularly the green used to indicate the traffic speed, and the red used by the "What's New?" button in the top right corner and the "Koble fra" (Disconnect) button at the bottom of the screen. This use of contrasting colours is not overwhelming, and successfully grabs the user's attention, though the moving green bars may also be a source of distraction, as they are very bright and take up a good portion of the sidebar.

Figure 2.5: IPFS status menu

IPFS IPFS will be analysed through its desktop application, which is made up of a status menu icon, as seen in figure 2.5, and a desktop app, as seen in figure 2.6. Upon launch, IPFS only appears as the status menu icon, which indicates that the system is running, and gives a short list of options for the system. The clarity of this menu is good, as it uses descriptive names, and it is both simple and concise. Clicking either "Status", "Files" or "Peers" will open the desktop app.

Figure 2.6: IPFS desktop GUI

The desktop app GUI is divided into three sections: the sidebar, the header and the main work area. The sidebar has five different menu items and is accentuated by a dark blue colour. The header has a light blue background to separate it from the rest of the design, and only contains a text input field with an accompanying button, and a menu with two icons. Everything else is displayed in the main work area, which in figure 2.6 shows the peer connection status, and the traffic both over time and currently.

The clarity of the design is good, and it is very simple and straightforward. Throughout the app, domain-specific vocabulary such as "peer" and "QmHash" is used, which might lessen the clarity of the system for some users. However, due to the nature of the system, it would be difficult to eradicate this completely, and it is likely that the users of IPFS either have knowledge of the domain, or the ability to acquire it. To this end, the question mark icon in the top-right header menu provides step-by-step explanations of the different elements of the main work area. In terms of simplicity, the design uses a lot of white space and differently coloured backgrounds to differentiate between the various elements. There are also few menu items in the sidebar, and the main work area only shows the essential information, with the option to show more advanced information.

The colour scheme of the GUI is consistent, using mainly shades of blue and teal. The sidebar, highlighted menu item and header are all shades of the same blue colour, while the IPFS logo, the "in" traffic in the graphics, and the icon menu in the top right corner are all in shades of teal. Two colours deviate from this: the blue colour used for links in the box with information about the peer, and the orange accent colour used for "out" traffic in the graphics. The blue colour corresponds well to the blue that is standard for HTML links and is likely used to indicate that these are clickable links, which they are. The orange colour is used in contrast to the blue and green colours in the rest of the GUI, and works well to differentiate the two directions of traffic in the graphics. Clicking the items in the sidebar menu only changes the main work area, while the sidebar and header stay the same, which adds a layer of overall consistency to the GUI.

Figure 2.7: Napster GUI

Napster As the initial description of Napster mentioned, the Napster software is no longer available, but it was a desktop app, as pictured in figure 2.7 (screenshot from [30]). Napster's GUI was split into three main sections: a header, the main work area and a footer. The header contains a menu for the different functionality offered by the program, and the footer shows information about the network and files. In figure 2.7, the "Transfer" menu item is selected, and the main work area shows files that the user is downloading or uploading, as well as two options to clear the finished files or cancel a download or upload.

In terms of clarity, this design does well, as it uses a lot of descriptive words that clearly define the difference between the various menu options, file information types and any other information given. This verbose approach might make it more difficult to quickly scan through the elements of the GUI, but it leaves no doubt as to the intention of the various elements. Like gtk-gnutella, the clarity of Napster's GUI might suffer from how uniform both the colours and the elements are, as there is nothing to really differentiate the various parts from each other aside from borders.

However, the Napster design is less cluttered and keeps the elements on one page to a minimum, which makes it easier to navigate. Overall, both the clarity and the simplicity of the GUI are good.

The consistency of the GUI is good and adheres to the look and feel of the Windows 95 operating system. It has a simple colour scheme, where the background and buttons are all the same shade of grey, and the main work area is emphasised with a white background. All the text is black, and two colours are used to emphasise and differentiate between the download and upload progress.

Performance This category examines the performance of the systems, that is, how fast they perform the actions the user requests. More specifically, this section will investigate the systems' ability to route messages through the network, which for file systems centres on the ability to locate a file. The performance of the systems will, in the context of this project, also be seen as a subcategory of user friendliness, as the two are closely related. If a system takes too long to respond to a user request, the user will get bored and lose interest [4, pp. 135–137, 31], which is not desirable. Therefore, the design principle responsiveness encompasses performance.

As long as a file is available in the system, Napster can guarantee that a lookup will be fast and successful, with an O(1) lookup time, as it only requires one message to the central server. The same goes for BitTorrent with the use of trackers, because a torrent file contains a direct link to its tracker. This results in good performance when it comes to file location for both systems, but it also introduces a single point of failure that will suffer under heavy traffic, as discussed in previous sections.

PAST, through the use of Pastry, performs message routing in fewer than ⌈log_{2^b} n⌉ steps on average (where b is a configuration parameter, usually with value 4) [11, p. 78]. This way, the system can guarantee that a message is always brought closer to its destination by systematically routing it through the network of peers. Other DHTs like Chord [32] or Kademlia [2] are also able to perform routing in O(log n) steps. The latter is used by both the BitTorrent DHTs and IPFS.

Gnutella's initial solution to message routing was very inefficient (see section 2.2.4) because it would simply flood the network with messages, hoping to reach the right peer within a limited number of hops. The number of hops was limited by a time to live (TTL) flag that was decreased with every hop. If the TTL reached zero, or the message reached a node it had already visited, the message was not passed on. The result was that, in the worst case, a query in the Gnutella network never reached its destination and returned no result, even if the peer it was trying to reach was available. While the performance of the network was greatly improved by the implementation of super peers, the system is still limited by the super peers using the same flooding techniques as the earlier protocol [33, p. 85].
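This flooding behaviour, and why it can fail, can be illustrated with a small Python sketch. The Node structure is hypothetical, and the default TTL of 7 is an assumption based on common Gnutella client defaults; the point is that the query dies when the TTL runs out, whether or not a matching peer exists further away.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A hypothetical Gnutella-like peer: some shared files and some neighbours."""
    shared_files: list[str]
    neighbours: list["Node"] = field(default_factory=list)

def flood_query(node: Node, query: str, ttl: int = 7, seen: set[int] | None = None) -> list[str]:
    # Forward the query to every neighbour, decrementing the TTL at each hop.
    # A message whose TTL reaches zero, or that revisits a node, is dropped,
    # so a file held by a peer too many hops away is never found even though
    # that peer is online -- the worst case described above.
    seen = set() if seen is None else seen
    if ttl == 0 or id(node) in seen:
        return []
    seen.add(id(node))
    hits = [f for f in node.shared_files if query in f]
    for neighbour in node.neighbours:
        hits += flood_query(neighbour, query, ttl - 1, seen)
    return hits

# A three-node chain: querying from `a` with TTL 2 never reaches `c`'s file.
c = Node(["rare-song.mp3"])
b = Node([], [c])
a = Node([], [b])
print(flood_query(a, "rare-song", ttl=2))  # [] -- the file exists but is out of reach
```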

As this category shows, the systems can be divided into three groups when it comes to performance. There are the systems that have a constant lookup time, as they only require one message to a central server or tracker, like Napster and BitTorrent. The systems that use a DHT, in our case PAST and IPFS, have a logarithmic lookup time. Finally, there are systems that have no guarantee that a lookup will even be successful, such as the early iteration of Gnutella, which makes for poor worst-case performance. All these solutions have advantages and disadvantages. Using a central server is efficient, but the server becomes a vulnerable point for the system. A DHT makes message and content routing fast, but requires all the peers to keep track of the peers close to them, which takes time and effort.

Summary of Taxonomy This taxonomy has looked at five different file-sharing P2P systems in light of a number of categories that are relevant to this project. Table 2.1 shows the results of the comparison, where a checkmark indicates that the system meets the criteria for the category, a cross means that it does not, and a tilde means the criteria are met to some degree. This section has looked at categories that highlight the advantages and disadvantages of certain system structures and implementation choices that can have an impact on the design of our own system.

                   Gnutella   Napster   PAST   BitTorrent   IPFS
Decentralised         ✓          ✗       ✓         ✓          ✓
Structured            ✗          ∼       ✓         ∼          ✓
File availability     ∼          ∼       ✓         ∼          ✓
File sharing          ✓          ✓       ✓         ✓          ∼
User friendliness     ∼          ✓       /         ✓          ✓
Performance           ✗          ✓       ✓         ✓          ✓

Table 2.1: Comparison of peer-to-peer systems

The section started out by looking at the systems' level of centralisation, which revealed that using a DHT can make it possible to create a self-governing P2P system. While the time and effort it takes to set up and maintain a DHT might be greater than with other system structures, it eliminates the need to keep track of peers and files in any other way. Additionally, in a system where file availability is key, this type of structure can also make it easier to keep track of which files are available in the system, and to maintain a certain number of copies of each file to make sure they do not disappear over time. As sharing is a core element of this project, this section also briefly looked at file sharing as one of the more popular uses of P2P systems today. Finally, this section analysed the user friendliness of the systems, taking into consideration both their visual design and performance. The former showed the importance of having a simple GUI that does not overwhelm or confuse the user, and the latter showed that using a DHT is a good way to ensure both good performance and reliability.

2.2.5 Summary of Peer-to-peer Systems

This section has looked at P2P systems in detail, by examining both their background and characteristics, and by comparing a variety of P2P systems that already exist. The section started by outlining the differences between P2P systems and traditional centralised systems, before delving into more detail on the characteristics of these types of systems. In the final, and largest, part, five different P2P systems were compared to highlight the differences and similarities between them. This comparison, along with a similar comparison of Internet archiving systems, will act as the basis for our own system.

2.3 Internet Archiving

While the structure of the system in this project is P2P, its functionality is that of an Internet archiving system. Therefore, this section will look at the history and culture surrounding Internet archiving, as well as three central issues: tampering, funding and licensing. Then, like the last section, five different Internet archiving systems will be examined according to categories that are relevant to this project.

2.3.1 Saving the Internet

Within the field of digital archiving, there is a long-lived tradition of archiving Internet content in various forms. One of the IPFS system's main arguments for its redefined web is the fact that websites are short-lived and that the history of humanity is being deleted daily as old websites become obsolete and unavailable. The fact that there is such a vast amount of media and information on the Internet, combined with the knowledge that the nature of this medium is ephemeral and uncertain, has spawned a multitude of Internet archiving systems. One such system is The Internet Archive, which is perhaps the most widely known organisation dealing with the archiving of a variety of online artefacts, including websites, books and media [34, 35]. There has also been an effort, through the Memento project [36], to collect all the Internet archives in one place to make it easier to access archived content without having to check multiple sources. Memento is available as a web browser extension that lets the user go back in time to view previous versions of the site they are currently visiting, if there exists at least one archived copy of the site.

Another system that specialises in archiving various information from a browser, including history, bookmarks and websites, is ArchiveBox [37]. As opposed to The Internet Archive's public and cultural goal for archiving digital content, ArchiveBox is meant for personal use, to be able to revisit web content the way it was at a certain point in time. However, the underlying principles are the same: to save local copies of online content that is prone to change or disappear over time. Systems like ArchiveBox, which are intended for personal use, can be viewed as advanced bookmarking services. Like the bookmarking services that are available in web browsers, they allow the user to save a website for later consumption. In addition to this, Internet archiving services often make the content available offline, and ensure both that the content remains unchanged and that it will be available even if the original website changes or disappears. The list of similar systems is long, containing all kinds of archiving services for small- or large-scale archiving of web content [38–42]. A selection of these systems will be examined more closely in section 2.3.3.

A web article from 2018 highlights the importance of web archiving and lists a variety of reasons why it matters [43]. Weigle focuses on the cultural importance of such systems, and on how the Internet is an integral part of our history and embedded in our culture. This effort to record online history is also evident through nationwide efforts to archive online content, usually directed by national libraries. In Norway, the national library has an initiative called Nettarkivet, an Internet archive dedicated to collecting websites registered under the Norwegian domain .no, as well as any other sites that are of cultural relevance to Norway [44]. This illustrates that the preservation of Internet content in the interest of cultural importance is not merely an endeavour for the particularly interested, but also a matter of national concern.

2.3.2 Issues in Internet Archiving

This section looks at the issues of tampering, funding and licensing, and how they relate to Internet archiving systems. Tampering examines how to prevent the archived files from being edited, to achieve the persistence of data that the Internet lacks. Funding looks at the various ways that Internet archiving initiatives are funded, and how a P2P structure can help alleviate some of the costs. Finally, licensing discusses the copyright issues that might arise when permanent copies of websites are created, and how to make sure that the operations of an Internet archiving system are legal.

Tampering One issue to take into consideration is tampering. Tampering, in this context, means changing the archived files in any way, either by editing the contents or by corrupting the file. As one of the main arguments for Internet archiving is that saving copies of websites saves them from being tampered with at a later time, preventing this is important.

One solution to tampering that is used by IPFS, among others, is content addressing. As mentioned in section 2.2.4, this entails addressing by content rather than location. An address, in this context, is a name referring to the point that makes it possible to access an entity in a system [45, p. 238]. Traditionally, the Internet uses location addressing, which means locating a website by the physical location of the site's host, through its IP address. This separates the addressing of the site from its actual content, which allows a website to change over time without having to also change its IP address. Through content addressing, a much tighter link between content and its address is created. For IPFS, and most other DHTs, this means assigning a unique hash value to each file, calculated from its contents. This hash value then acts as both identifier and address for the content, because the identifier is used to locate the content [45, p. 246]. The result is that any change to the content of the file will result in a new hash value. This works as a measure against tampering because it is easy to check whether a file has been tampered with, by computing the hash again and checking it against the existing hash.
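As a concrete illustration, the following Python sketch derives a file ID from the content itself and uses it to detect tampering. The choice of SHA-256 is an assumption made for illustration; IPFS, for instance, wraps its digests in a multihash format, which is omitted here.

```python
import hashlib

def content_address(data: bytes) -> str:
    # The identifier is derived from the content itself (SHA-256 here,
    # as a simplifying assumption).
    return hashlib.sha256(data).hexdigest()

def is_untampered(data: bytes, expected_id: str) -> bool:
    # Recompute the hash and compare: any edit to the archived file,
    # however small, yields a different hash value.
    return content_address(data) == expected_id

snapshot = b"<html>an archived page, exactly as it appeared</html>"
file_id = content_address(snapshot)

assert is_untampered(snapshot, file_id)
assert not is_untampered(snapshot + b"<!-- edited later -->", file_id)
```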

Funding Another issue that needs to be addressed by anyone who wishes to make an Internet archiving system is funding. This section examines the issue by looking at three different types of funding, before discussing why a P2P system is a good solution for avoiding the brunt of the cost. The three types of funding examined in this section are private donations, state funding, and profit.

While many of the initiatives for Internet archiving are open-source and non-profit, and so do not have any employees, the storage space needed to archive potentially large files is costly, and this, as well as other operational costs, needs to be funded. Consequently, many of these systems are dependent on donations to keep the operation running. Large systems like The Internet Archive need large sums of money to be maintained, and receive donations from various foundations and funds [46]. They also accept donations from individual users, which is often also the case for smaller systems, like ArchiveBox.

The other type of initiative that does not earn any money is the state-funded project, like Nettarkivet. These projects are funded by the government, and state employees undertake the task of maintaining the system, which means that their operations are not dependent on outside sources.

The last type is for-profit corporations that offer paid services to the user. To use these products, the user either has to pay outright, or they are offered limited functionality that can be extended by paying.

All these types of organisations need money to fund storage space, which is not necessary in a P2P system, as the users themselves provide the storage space. This leaves the cost of creating and maintaining the system, which can be covered by the collaborative nature of open-source development. This is the case for ipwb, a P2P Internet archive that will be discussed in further detail in section 2.3.3. An issue that arises in P2P systems is that you also need users to get storage space, and creating incentives that will convince users to use the system is not always easy, which will also be discussed further in section 2.3.3.

There are multiple ways that an Internet archiving system can be funded. This section has looked at three different types and offered a P2P structure as a cheaper alternative. Internet archiving systems are generally either funded by other organisations or their users, funded by the state, or run as for-profit corporations whose services the user has to pay for. An alternative to all of these is a P2P system, where the users provide the storage space, which works well in systems where there is no need for a lot of central storage.

Licensing One final issue that requires attention is licensing. Internet archiving systems create permanent copies of websites, and wherever there is copying of artefacts, there is a question of copyright. Content like articles, images and videos that are posted to websites, and even the websites themselves, are created by people who may not wish for their content to be copied without their permission. Internet archiving is special in this sense, because most of the time the website is copied and archived exactly as it appears, including any credit attributed to the creators. Yet it is an issue that has been addressed by Internet archives, and it needs to be considered in the creation of such a system.

One way of tackling this issue is to not share any of the websites with the public. Many library initiatives, like Nettarkivet, do not publicly release their archived content, and so avoid any licensing issues by keeping a closed archive in the name of cultural and national interest. This is, however, not a solution that will work for Internet archives that seek to make the history of the Internet publicly available, or that allow users to archive the content they want for later consumption. Both library initiatives and archives like The Internet Archive generally fall under fair use, which exempts them from copyright claims, as their archiving efforts are in the name of research and public benefit [47].

Many archives, like The Internet Archive, will allow users to archive any website as long as the website does not prevent it, and will remove any websites on request from the creator. This also goes for Pocket, a personal archiving service that will be further discussed in the next section. Their terms of service state that "By posting, sharing or saving any videos, articles or content, you represent that doing so does not infringe any third party's copyrights, trademarks, privacy rights or other intellectual property or legal rights of any kind." [48] This "lazy" approach places the responsibility on the website creators and the services' own users. Either the website creators have to make it impossible to copy the site, or the users have to make sure that they are not violating any terms of service when creating the copy.

The legality of Internet archiving, and copyright issues on the Internet in general, is not an easy problem to solve [49]. In a personal archiving system, there is no need for the content to be available to all the users at one time, and like the closed national Internet archives, keeping the files private can help with copyright issues. Encrypting the files in the system, so that they are only available to those who have the key to access them, would mean that the copies are made privately and that they remain private throughout their lifespan.
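A private archive along these lines could encrypt each snapshot before it leaves the user's machine, as in the Python sketch below. The use of the third-party cryptography library's Fernet scheme (authenticated symmetric encryption) is an assumption made for illustration; nothing in this chapter prescribes a particular cipher.

```python
# Requires the third-party `cryptography` package (pip install cryptography).
from cryptography.fernet import Fernet

def archive_privately(snapshot: bytes) -> tuple[bytes, bytes]:
    # Encrypt the snapshot before handing it to other peers for storage.
    # The key never leaves the archiving user unless they choose to share it.
    key = Fernet.generate_key()
    return key, Fernet(key).encrypt(snapshot)

def read_archived(key: bytes, ciphertext: bytes) -> bytes:
    # Only holders of the key can recover the archived page, so the copies
    # stay private even though they physically live on other users' machines.
    return Fernet(key).decrypt(ciphertext)

key, blob = archive_privately(b"<html>my saved article</html>")
assert read_archived(key, blob) == b"<html>my saved article</html>"
```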

Summary of Issues in Internet Archiving This section has looked at Internet archiving in light of three specific issues, namely tampering, funding and licensing. Tampering looked at how to prevent files from being edited after archival and suggested content addressing as a possible solution to this end. Funding outlined the various ways that Internet archiving might be funded, like donations, state funding or by offering paid services and then looked at how a P2P system structure can help lessen the cost of the system. Licensing examined copyright issues that might arise in a P2P system and highlighted that this might not be an issue in a system where the files are not made publicly available and suggested encryption as a way of keeping the files private.

2.3.3 Taxonomy

In the same way that we compared and examined key features of P2P systems, we will also look at five different Internet archiving systems that are relevant to our system. These systems all specialise in the archiving of Internet content, usually whole websites, either for cultural reasons or as an advanced bookmarking service for individuals. The goal of this section is to examine and compare a variety of Internet archiving systems to see the advantages and disadvantages of how they approach Internet archiving, as a way to guide the design of our system.

As with the comparison in section 2.2.4, this section starts with a short introduction to the systems that are compared in the taxonomy, before moving on to the actual categories of comparison. For Internet archiving systems, five categories that tie directly into our project will be explored, either in the form of issues that we seek to address, or implementation choices for the prototype. The systems' level of centralisation is addressed in the P2P category, their personal and collaborative nature is examined in the categories Personal and Sharing, and user friendliness is examined as before, through an analysis of their UI. The final category, Local Saving, looks at an implementation choice that has to be made.

Systems

Webrecorder.io The Webrecorder.io (Webrecorder)7 system allows the user to capture browsable copies of websites, which makes it possible to save, revisit and share content that might disappear in the future [39]. The service requires the user to sign up for an account to be able to create a permanent collection of websites. The service is also available as a desktop app. It is possible to use the service without creating a user, but this requires the user to download the desktop app and save the websites locally on their own computer. There is also another desktop app available, called Webrecorder Player, which allows the user to revisit local copies of archived sites offline.

7 As of June 12, 2020, the name of Webrecorder has changed to Conifer, and its visual design is different, but the functionality of the system remains the same.

ArchiveBox This is another service intended for personal use, much like an advanced bookmarking service. ArchiveBox has support for saving multiple types of information from the browser, including history, bookmarks and websites [37]. This service does not require the user to register, and it is only available through a command-line interface.

Pocket While Webrecorder and ArchiveBox both focus on saving websites specifically, Pocket does not aim to save browsable websites, but rather snippets from them, like articles, videos and stories [50]. This content is available on any device that has Pocket installed, and the user can read it whenever and wherever they wish. Pocket requires the user to sign up to use the service, which is available in the browser, or as an app for the phone or tablet.

The Internet Archive This project, as mentioned, aims to be a digital library containing various types of digital content, such as websites, books, audio, video, images and software. All the content that is archived here is freely available to the public. This system is not intended for personal use, as it has a much larger, cultural goal in mind. However, it is possible for users to utilise the Wayback Machine [51] to archive sites that they want to preserve, to visit old versions of a website, or to view archived sites that are no longer available.

ipwb InterPlanetary Wayback (ipwb) [41] seeks to archive WARC files in the IPFS network and provide the means to replay these using pywb [40]. In their paper [52], the authors point out that projects like The Internet Archive depend on the organisation to ensure that files stay available, which is not the case when a P2P network is used to save and distribute files. To use ipwb, a user has to download and install the system and interact with it through a command-line interface.

Personal This category entails whether the system is intended for personal use, like an advanced bookmarking service, or whether the goal is to archive the entire Internet. As this project is centred around making an application for personal use, yet seeks inspiration from the importance of Internet archiving in general, this category was included in the comparison. Webrecorder, Pocket and ArchiveBox are all intended for personal use, while The Internet Archive and ipwb both seek to archive the Internet for cultural and historic reasons, while also allowing private users to utilise their capabilities for their own purposes.

One difference between personal and impersonal systems is that the impersonal systems generally seek to archive as much of the Internet as possible, and, in theory, all of it. While this goal is difficult to realise, as the Internet continues to grow and change every day, it is still evident that these systems aim to save a lot of data. This data has to be saved somewhere, which means that Internet archives like these necessarily need a lot of storage space. While personal systems also potentially need a lot of storage space with a large user base, the storage space per user is significantly smaller. Additionally, the files do not need to be publicly available in a system intended for personal use, and thus it becomes easier to avoid licensing issues.

As the system in this project is a P2P system, it is essential that the users see the benefit of participating, since a P2P system depends on having active peers in order to persist. A system meant for personal use has the benefit of gathering one user's files in one place, while still keeping the collaborative nature of the system present through the sharing function. That the system is tailored towards the user's experience, rather than some higher goal, might help incentivise some users to participate in the network.

Local Saving Local saving means that a user has to save the files on their own computer, as opposed to a server provided by the service. The only system that requires the user to save files locally is ArchiveBox. Webrecorder, Pocket and The Internet Archive all provide servers for the content that is archived. This largely eliminates the need to save the content locally, but it adds a vulnerability in that the user is dependent on the central servers of these systems. Pocket and Webrecorder both allow the user to view the content offline, which requires the user to download it.

The dependency on central servers outside of one's control requires trust from the user. The content will become unavailable to the user if the server fails, and if it is compromised, the content may be lost. Using a P2P structure like ipwb's to save the files can allow the user to share storage space with other users, and to contribute as much as they are capable of. It does, however, require that the user gives up some of their storage space to share with others. Some users may not be willing to do this, which will be discussed further in the next section. With a file duplication algorithm, the system will also be able to guarantee file availability as long as a certain number of peers is available.

Sharing Sharing is an essential part of this project and means that the user should have the option to share the content they save, typically through a share button. Having a sharing function separates the system from purely personal use, and reminds the user that it is possible to share the content they consume. Through sharing, the use of the system extends beyond the individual user and opens up a collaborative environment where users share information, knowledge, and entertainment.

Webrecorder, Pocket and The Internet Archive all have the option to share a website you have saved. Webrecorder has a share button, with support for sharing on Twitter and Facebook, in addition to providing a link that the user can share however they want, and a snippet of HTML code if the user would like to embed the captured website. Pocket offers a range of different possibilities: copying a link, or sharing within the service or outside it. Within the service, it is possible to share with a friend or to recommend the content publicly. In terms of social media sites, it supports Facebook, Twitter, LinkedIn, Reddit, Tumblr and Buffer. Finally, The Internet Archive has two share buttons, Facebook and Twitter, and any archived site is also available through its URL, which the user can choose to share.

As mentioned in section 2.3.2, it can be difficult to convince users to use a P2P system, as they have to give up storage space and bandwidth to participate. A system which is only intended for personal use might suffer from this, because the users are not incentivised to collaborate with other users. However, implementing sharing as a core function, and facilitating social use of the system, can act as encouragement for the users to share resources as well as content. One example of this is BitTorrent, which is a system that depends on its users' participation in order to function. BitTorrent, as mentioned in section 2.2.4, separates its users into two groups: seeders and leechers. BitTorrent users are incentivised both by the punishment they receive for not contributing, and by the sense of community. It is obvious that a file will not be available if nobody aids in the seeding of it, and so users are willing to use their bandwidth and storage space to keep the system usable.

P2P As the end goal of this project is to create a P2P system, this category was included in the comparison to examine whether or not this is a common system structure for Internet archiving systems. As it is, only one of the systems we have chosen to examine is a P2P system, namely ipwb.

One definite advantage of making a decentralised system is the lack of a central server, as discussed in section 2.2.1, and this also goes for Internet archiving systems. As long as the control and storage of the system are centralised and beyond the users' control, there is no guarantee that the files will remain available. Creating a distributed and decentralised Internet archive can make it possible to ensure a high degree of file availability, without leaving the users dependent on one central server, or on their own ability to create and store safety copies. Another aspect that has already been addressed in section 2.3.2 is that creating the archiving system as a P2P system can eliminate the need to fund storage space, as long as the users are willing to use the system.

User Friendliness This category will examine the user friendliness of the Internet archiving systems in almost the same way as for the P2P systems. This means that the systems' UIs will be examined, to determine whether or not they are user friendly with respect to our target group. The Internet archiving systems are available either through a GUI or a command-line interface (CLI). As several of them have both an online service and an optional desktop or mobile application, we will primarily examine the online service. Like last time, the UIs will be examined in relation to the design principles from [25], specifically clarity, consistency, simplicity, and responsiveness. Last time, responsiveness was moved into a category of its own, as the functionality of the systems was exempt from the comparison. This time, however, the functionality of the systems bears an important resemblance to our own project, and will therefore be briefly examined, along with the responsiveness.

ArchiveBox and ipwb are both dependent on a CLI. A command-line interface typically requires the user to have previous knowledge of using CLIs, it offers little to no visual cues for how to use it, and it may have difficult syntax [25, p. 14]. The result is that these systems may be frustrating to use, and may even discourage inexperienced users from ever using them. These systems, therefore, are categorised as not user friendly.

Figure 2.8: ArchiveBox file overview

ArchiveBox ArchiveBox does, however, have a UI that lets the user browse their archived sites, as seen in figure 2.8, which will be briefly discussed according to the principles. The functionality of this GUI is rather limited, and so both its clarity and its simplicity are quite good. It also uses descriptive words that emphasise what the various parts of the user's archived sites are. In terms of consistency, the main colour scheme is consistent, with a black background and red header. However, there are both yellow and green links, with no real distinction between them, even if the different colours suggest that there should be one, which makes the GUI somewhat inconsistent.

Webrecorder The Webrecorder landing page, as shown in figure 2.9, is what greets the user when they log in to the service. The most prominent feature is the New Capture option, which allows the user to archive a new site. Each field in this form has a descriptive name or placeholder text to guide the user.

Figure 2.9: Webrecorder landing page

The clarity of this design is good, but the landing page also contains some unclear elements. Below the New Capture option, the user's collections are displayed. Clicking the Default Collection headline leads to the page shown in figure 2.10, which does not show any of the sites that have been archived, neither in the "Overview" nor the "Browse All" tab. However, the Manage Collection button leads to the page shown in figure 2.11, which does show the archived sites. Here, the intention of the words is unclear, as the user might assume that the archived sites are located under the collection overview.

The design uses the same colours and only one font, which makes it visually consistent. It also uses the same header throughout the system, which adds a sense of uniformity. Due to the confusing wording on the landing page, the Webrecorder design is not as simple as it could have been. Archived sites do not appear on the collection overview page unless they have been added to a List, which is an option on the Manage Collection page. Another inconsistency in the design is the words "capture" and "session". The New Capture feature and the New Session button that is visible under each collection both do the same thing: archive a website from its URL. However, if the user browses the website as it is captured, all the links they visit will also be captured in the same session. This functionality may be very helpful, but unless you read the user guide, it is difficult to understand that this is happening, and the two different names can add to the confusion.

In terms of responsiveness, Webrecorder does well. Archiving a site takes about one second, the site shows the progress as it goes, and the user can start browsing the archived site before it is fully captured. If the user clicks any links on the site, the new URL will also be captured. Opening an archived file takes the same amount of time as capturing it, around one second.

Figure 2.10: Webrecorder collection page

Overall, the Webrecorder UI is reasonably user friendly. The design is consistent in its use of colours and fonts, and it responds quickly to user input. While the form that lets the user capture a new site is clear and easy to use, there are some misleading links that can make it difficult to locate the archived sites.

Pocket The Pocket landing page, shown in figure 2.12, mainly consists of the user's archived sites, in addition to three prominent menus. The menus are clearly separated by location and appearance, which makes it apparent that they serve separate purposes. The language used is straightforward and consistent, and guides the user to understand the possible actions. One aspect of the Pocket service that may negatively affect its clarity is the prominent use of icons. This can be seen in the top right of figure 2.12, and in figure 2.13, which shows the possible actions for an opened article. For experienced users of the Internet, the meaning of these icons may seem obvious, but this is not necessarily always the case. These icons are based on conventions, like using a star for favouriting and a trashcan for deleting, but they are also dependent on the users knowing these conventions. Pocket solves this by using the same icons in the menu on the left-hand side of the landing page, which ties the icons to concrete actions, as well as by showing text on hover for the icons that do not have accompanying text.

The Pocket design has three main bright colours: pink, teal and orange. These are used sparingly and consistently to highlight menu elements and buttons, which gives the design a simple, yet aesthetic feel. While the header bar changes between the landing page and the article view, as seen in figure 2.14, it stays in the same place and adds a feeling of continuity within the service. The design also uses only one font, which adds to the consistency. Any user actions also perform as expected, and yield the same result each time.

Figure 2.11: Webrecorder manage collection page

Figure 2.12: Pocket landing page

As mentioned, the most prominent feature on the landing page is the archived sites, which makes sense, as this is what the user is there for. There are no elements that compete with the user’s attention, and the menus are placed at the top and left side of the page, which creates a clear hierarchy. Pocket also has a browser extension one of its most common actions, which allows the user to archive the site they are currently browsing with one click. This means that the user does not have to copy the URL of the site, and only needs to open the Pocket service when they intend to revisit the sites they have archived. There is an option to archive sites within the service, but this has been moved to the top-right menu and requires two clicks and an URL to complete. One thing that could potentially cause confusion in the Pocket design is the difference between archiving and deleting a file. The icons for these two are very similar, and the difference

Figure 2.13: Pocket action menu for article

Figure 2.14: Article saved to Pocket, viewed in web app

Archiving a site to Pocket is very fast and takes only a few milliseconds, both when archiving manually and when using the browser extension. Opening an archived site takes a little longer, but still less than one second. The Pocket UI is, according to our chosen metrics, user friendly. Its design is consistent, simple and performant. Emphasis is placed on the archived sites in the UI, and archival is made easy through a browser extension. The extensive use of icons may seem confusing to some users, but the icons follow conventions and have explanatory text on hover if needed.

The Internet Archive

For The Internet Archive, we will examine both the landing page of the archive itself and the landing page of the Wayback Machine. The archive's landing page, as shown in figure 2.15, has multiple menus, search fields and other elements. The result is that the page appears cluttered, and it can be difficult to know where to look for what you are interested in. Because it has multiple menus, and multiple sections below each other, the Internet Archive landing page has a somewhat confusing hierarchy. The same things that affect the site's clarity also have an impact on its simplicity. The Wayback Machine search bar is prominently featured early in the site hierarchy and has a helpful headline that informs the user of its function. However, for inexperienced users, it may not be obvious what the difference between the Wayback Machine and the Internet Archive is, or if there even is one. One downside to the functionality of the Wayback Machine is that it can be difficult to find one's way back to a particular instance of an archived site without its exact URL.

Figure 2.15: Internet Archive landing page

The Internet Archive uses nearly every colour, but it does so systematically. Each colour represents one of the six main categories of content that the archive provides. For example, web is yellow and books are orange. The colours used in the top collections that are visible at the bottom of figure 2.15 are not related to these. The Wayback Machine, as seen in figure 2.16, has a different colour scheme. Aside from the logos of the archive and the Wayback Machine, only one font, in different styles, is used on the site. In terms of consistency, there are some good elements, like the categorising by colour and the font use, but the rest of the colours do not necessarily match up with this, and so the design can appear a bit lacklustre. An interesting feature that may not be obvious at first sight is that the top menu bar does not navigate to different sub-sites of the Internet Archive; instead, it changes the content of the dark grey field that in figure 2.15 shows the Wayback Machine search field. While this is perhaps unexpected behaviour from a site menu, it does not disrupt the use of the site, and can, in fact, make it easier to browse the contents of the site without having to move to a new sub-site each time. Capturing a site with the Wayback Machine takes a few seconds, and opening a captured site is even faster. According to our metrics, the Internet Archive UI is somewhat user friendly, but has some aspects that affect its clarity, simplicity and consistency. However, it is still possible, and fairly easy, to navigate. Finding the archived sites can be a bit tricky, but seeing as the Wayback Machine is only part of the archive's large variety of archived content, it is understandable that this is not the main feature of the site.

Figure 2.16: Wayback Machine landing page

Summary of Taxonomy

Like last time, this section has compared five different Internet archiving systems according to five categories that tie directly into this project. Table 2.2 shows the taxonomy with the same symbols as last time. As this section has looked at systems that are similar to our own in terms of functionality, it has addressed categories that are more directly related to the functionality of the systems than those used for the P2P systems. More specifically, it has looked at the balance between a personal and a collaborative system, how local saving can impact the system and its users, what problems a P2P structure can solve, and user friendliness. One important discovery is that the majority of the systems in the comparison have some kind of sharing function implemented. Even if this is not an integral part of the system's functionality, it is still present.

                    Webrecorder  ArchiveBox  Pocket  The Internet Archive  ipwb
Personal                 X           X          X              X             X
Local saving             X           X          ∼              X             ∼
Sharing                  X           X          X              X             X
P2P                      X           X          X              X             X
User friendliness        X           X          X              ∼             X

Table 2.2: Comparison of Internet archiving systems

One fact is made evident from the systems that have been examined in this chapter; there is no one system that is both decentralised, made for personal and collaborative use, and user friendly. Webrecorder and Pocket, in particular, resemble the system in this project, but both are centralised

systems. However, as they have much of the same functionality from a user perspective, they will be sources of inspiration in the rest of the project, particularly when it comes to the GUI.

2.3.4 Summary of Internet Archiving Systems

As the prototype in this project is to be an Internet archiving system, this section has explored this type of system in detail. It started off by looking at the initiatives for Internet archiving that are prevalent both through organisations like The Internet Archive, national initiatives like Nettarkivet, and smaller systems intended for personal use like ArchiveBox. Then, the section went on to outline three central issues that should be addressed when making an Internet archiving system, namely tampering, funding and licensing. Finally, five different Internet archiving systems were compared according to five categories that are relevant to this project and will be important when we outline the core issues of our system.

2.4 Sharing in a Cultural Context

The discussion on Internet archiving systems is, as stated in section 2.3.1, rooted in the cultural importance of the Internet. As an extension of this, this section will briefly address the cultural context of file-sharing systems, to place this project in a context that illustrates the need for, and importance of, such a system. It will focus on two types of file-sharing systems: P2P torrenting systems and social media.

2.4.1 Pirating

One of the most widespread uses of P2P file-sharing technology is the distribution of media like songs, movies and TV shows through file-sharing services like Napster, Gnutella and BitTorrent [53]. While the content shared on these services is not necessarily illegal, this is often the case, which has led to multiple lawsuits against the services. This is also the reason why the use of such systems is commonly called "pirating". The history of peer-to-peer file-sharing started with Napster in 1999. This service became very popular before it was shut down in 2001 [54, p. 54][55, 56] after a long period of legal issues and outrage from the music industry. In the wake of Napster's rise and fall, multiple similar services emerged, many of which are still up and running today. Despite a huge number of lawsuits against services like BitTorrent [57], these peer-to-peer systems are seemingly impossible to take down. Their decentralised nature makes it hard to pin down a single party to blame, and now, 20 years after Napster's inception, pirating is still a widely popular use of peer-to-peer systems. It is evident through their immense popularity that pirating services play an important part in our digital culture. Through these services, consumers of digital media have created huge collaborative communities for

sharing and distribution of files, with very little central governance. Uricchio [58] relates these communities to other, similar online communities for news sharing and open-source software. He argues that participation in these communities entails a form of cultural citizenship that has the potential to rival the traditional political citizenship outside of the Internet. The impact of these communities is, he claims, that they will change the way we think about culture and consume media.

2.4.2 Social Media

While torrenting systems are perhaps the best example of popular and widely used peer-to-peer file-sharing systems, they pale in comparison with social media when it comes to visibility in our society and culture as a sharing platform. Social media like Twitter and Facebook have huge numbers of daily active users [59, 60], and a vast amount of content is shared daily through these channels. A study from 2011 [61] defines sharing as one of the fundamental building blocks of social media and emphasises that people on social media are connected by the shared object, which differs depending on the platform on which it is shared. For example, on YouTube, the users are connected by shared videos, and on LinkedIn, they are connected by shared career updates. The cultural significance of social media and sharing is undeniable and has been a field of study for years [62–64]. One thing is certain: a lot of people participate in social media networks, and consequently in the sharing culture that these networks cultivate. This highlights the importance of sharing in an online network and may explain why almost every single Internet archiving system in the comparison in the last section has a sharing function. An interesting takeaway from this is the feeling of community that sharing facilitates in a social network.

2.4.3 Summary of Sharing in a Cultural Context

As these two kinds of sharing platforms illustrate, there is a prevalent culture of online sharing in our society. Whether it is the illegal distribution and downloading of music, sharing pictures of cats on Instagram, or sharing news that you deem important, it is impossible to deny that sharing plays an important role in our day-to-day lives, especially online.

2.5 Summary of Background

This chapter has examined three different topics that are essential to understand where a system such as the one proposed in this thesis comes from. First, P2P systems were examined through an outline of their characteristics before moving on to a comparison of various types of P2P systems. By creating a decentralised and distributed system, it is possible to combat the weaknesses of centralised systems, such as having a single point of failure. P2P systems are also commonly used as file systems, and

more specifically, file-sharing systems. This is because they allow users to communicate and transfer data directly between each other, which cuts out the need for a central server. It is also possible to make P2P systems performant through the use of DHTs, as examined. Internet archiving systems were also examined, by looking at their history and three issues that need to be addressed in any system that seeks to archive the web. As with P2P systems, we also compared several Internet archiving systems to see what functionality is already offered, and what they may be lacking that our proposed system can address. Most of the systems in the comparison were centralised systems, which means that the functionality of the system, and therefore the users, are dependent on central servers. For some of the systems, Internet archiving is a matter of cultural interest, while others are personalised solutions that allow the user to archive what matters the most to them. What most of them have in common is that they allow the user to share whatever they save on a multitude of platforms. Finally, we examined these types of systems in a cultural context. One of the main arguments for Internet archiving on a large scale is that the Internet has become an integral part of our history and culture, and that this makes it necessary to preserve the huge amounts of information that are available on the Internet. The final section of this chapter looked at sharing as an important aspect of our social lives in the digital age. These points anchor the project in the current cultural context and show the relevance of such a system today.

Part II

Project

Chapter 3

Analysis

Before moving on to the actual design and implementation of the system, we will tie the theoretical discussion and comparisons of the previous chapter directly to our project, through an analysis of the problem and central issues. This chapter also contains an overview of the proposed system, including core elements and evaluation criteria.

3.1 Problem

The creation of this system, like the Internet archiving systems before it, is a direct response to the temporary and changeable nature of hyperlinks and websites. The goal of this project is to address this by adding to the existing tools that allow the user to save and share content from websites, with a focus on file availability, user friendliness and sharing. The system proposed in this thesis bears a strong resemblance to some of the Internet archiving systems that we examined in the previous chapter, Webrecorder and Pocket in particular. Its main difference from these two systems is that it is a P2P system, and it seeks to utilise the aspects of existing P2P systems that make them reliable and robust in the creation of a file system. It also differs from the one system in our comparison that spans both categories, ipwb, in that it is intended for personal use and aims to be more user friendly. The following section will examine the four core issues in more detail.

3.1.1 Core Issues

Most of the categories explored in the previous chapter can be divided into four overarching themes: centralisation versus decentralisation, user friendliness, personalisation and sharing. As these four will be the foundation on which we build our system, we will address each of them in detail, and suggest design and implementation choices that directly address the issue.

Decentralisation

The problem with centralised solutions like Webrecorder and Pocket is that they are dependent on central servers, and the user has no control over the files themselves unless they choose to download them to their own computer. Local saving is also what ArchiveBox is based on, which certainly offers the user more control. If needed, the user can also save copies of the file on hard drives or other computers to decrease the chance that the file is somehow lost. The downside to this is that the user has to save multiple copies of the file manually and take responsibility for the safety of those copies. An important aspect of file archiving, then, is the ability to ensure that the information is available whenever, and wherever, it is needed. Accessing a file in a network where the nodes used to save said file are highly unreliable is not easy. The users of the system provide both the network and the storage facilities and are connected to the system with a wide variety of machines, in terms of availability, storage space and computational power. By using an archiving system that automatically generates copies of a file, and maintains a certain number of copies distributed across a network of machines, users can be sure that the content they save will not disappear over time. To make this work in a real system, it is also necessary to keep an updated record of any replicas that are dynamically created as machines come and go. File availability is tightly linked with machine availability, and while many of the same problems and solutions can apply to both of these issues, it is appropriate to separate the two, as they need to be approached in different, though interlinked, ways. As with file availability, it is the unreliable nature of the network nodes that can cause failures in the system, and it is important to consider this unreliability an integral part of the system. Using a DHT to organise the system is one way to ensure that machines leaving and joining the network are handled appropriately. Our system will be decentralised to tackle the issues of centralised systems and will use a DHT to keep track of active peers. It will also duplicate any file that is archived in the system across several peers to ensure that it remains available.

User friendliness

The intention is for the system to be used by anyone who has an interest in consuming and sharing information from websites. For this to be possible, the system must be user friendly and understandable for someone who does not have a Computer Science background. The target audience is, therefore, people who have experience using computers and existing bookmarking systems, and who frequently use the Internet to find information and read the news. While it is expected that the users have experience using the Internet and various digital applications, the application still needs to be simple and intuitive to use. To achieve this, it is important to take into consideration

how the user interface looks, and to pay particular attention to how it guides the user through the interactions. This can be done, for example, by limiting options and paying attention to the ratio and visual hierarchy of elements in the application [65]. It became evident through the examination of both P2P and Internet archiving systems' user friendliness that simple solutions, which make it clear what the options are without overwhelming the user, are better in terms of user friendliness. Drawing inspiration from these, as well as the metrics used in the discussion, can ensure that the GUI follows design guidelines that can be evaluated by actual users. Another important aspect of user friendliness is performance. As mentioned in section 2.2.4, good performance is necessary if users are to continue using a system. Users have short attention spans, and therefore do not have the patience to wait for a system that takes too long to perform their desired actions. In addition to adding structure to the system, using a DHT will also provide reliable message routing that can be performed in O(log n) steps. It is also important that any file duplication and upkeep does not slow the system down or hinder user interaction. Established design principles will be used in the design of the UI of the system, to ensure that the design is simple and does not overwhelm the user. We will also utilise a DHT to make sure that message routing can be performed quickly.

Personalisation

One of the categories we examined in the comparison of Internet archiving systems was whether they were intended for personal use or not. Our system is personal, which means that it does not seek to archive the entire Internet, but merely the parts that the individual users wish to save. Like Pocket, Webrecorder and ArchiveBox, the system will be aimed towards individual users. This means that each user has an archive where they can add and remove files as they wish. Nobody else should be able to access their archive unless they choose to share files with another user. The result of this is that users do not have to give up a lot of storage space, at least not beyond what is reasonable. If the system mainly aims to save articles containing text and a limited number of images, the files that are generated will generally be small. Both content addressing and encryption, as discussed in section 2.3.1, can ensure that the files in the system are only accessible to the people who archived them. Taking these kinds of measures can also make the system seem trustworthy to the users. The users should also be able to trust that the files they save will be duplicated and saved on other machines, which can provide safety and incentivise users to participate in the system.

Sharing

Sharing, as discussed in section 2.3.3, can act as a motivator for users to participate in the system by adding a social factor as well as personal gain

to the merits of the system. Adding a social factor to the system can reinforce the feeling that the users are part of a community, rather than a group of users who operate individually. This feeling can make the users more willing to participate in the network and give up their resources, because they will be reminded of the communal benefit of participating. Most of the Internet archiving systems we examined have some sort of sharing function, which shows that sharing is an integral part of the way we view and consume content on the Internet. As our system is not a social network intended for communication between end-users, it makes sense for the sharing function to be implemented similarly to the Internet archiving systems that already exist. This means that sharing will be made possible through a sharing button, which will be a central part of the action menu for a file.

3.1.2 Summary of Problem

In this section, we have looked at decentralisation, user friendliness, personalisation and sharing in relation to our system, and discussed how these may be reflected in the implementation. All of these categories are reflected in the systems from the previous chapter's comparison, but none have all four. In particular, the decentralised and distributed nature of a P2P system is what separates our system from the others, especially the ones that are very similar in functionality. By drawing inspiration from the existing systems, we have examined the four core issues in more detail and suggested design and implementation choices that may aid in the creation of our system.

3.2 Solution

The goal of this project is to create a distributed Internet archive that allows users of the World Wide Web to save persistent copies of the websites they visit and store these across a P2P network. The result of this project is a prototype for such an archiving and sharing system, as well as a theoretical discussion that ties this prototype to existing systems and research. Content, in this thesis, typically entails online articles containing text and images. This section briefly outlines the core functionality of the system and describes how this functionality should be evaluated following the implementation.

3.2.1 Required Functionality

To limit the scope of the project, we have chosen two aspects of the functionality that must be implemented to address all the core issues described in the previous section.

Save and Share Internet Content

By this, it is meant that the user should be able to save content from a website through the prototype and share this with others. This experience should be fast and user friendly, and the files should remain available even if multiple peers go down. This functionality mainly involves the system as it appears to the user, and will be represented in the finished prototype.

File Duplication Across a Network of Peers

File duplication is necessary to ensure file availability. This means that the prototype system should maintain a given number of copies of each file at all times, by keeping track of which peers are alive. The system will have to detect when peers leave and arrive to ensure that a file is duplicated as needed.

3.2.2 Evaluation

The evaluation of the system will happen in several ways. First and foremost, saving and sharing content should work as intended through a GUI, which can be tested through the prototype of the system. We have also chosen three metrics, which will evaluate the core functionality in various ways. These will be described in further detail below.

Reliability

As the system is going to be a file archiving and sharing system, an important factor is reliability. This means that a user should be able to access any file they wish, at any time. Focusing on this metric will ensure that the core functionality of the system works the way it should, meaning that users can access and share the files that they have archived. To evaluate whether the system is reliable, we will simulate the life cycle of the system by stress-testing it. This will entail killing and reviving peers randomly over time to check that the number of duplicated files stays the same, even in an unreliable environment. The results from this will be presented as a graph showing file availability over time.
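To make the idea concrete, the following is a minimal sketch of such a stress test in Node.js; it is an illustration of the approach, not the actual test code, and the peer count and replica parameter are assumed values.

// Hypothetical stress-test simulation (illustrative parameters, not the real test).
const TOTAL_PEERS = 50;
const REPLICAS = 20; // copies stored when the file is first archived

const alive = Array.from({ length: TOTAL_PEERS }, () => true);
const holdsCopy = Array.from({ length: TOTAL_PEERS }, (_, i) => i < REPLICAS);

for (let step = 0; step < 10; step++) {
  // Kill or revive one random peer per step.
  const i = Math.floor(Math.random() * TOTAL_PEERS);
  alive[i] = !alive[i];

  // Count replicas that are still reachable on live peers.
  const available = holdsCopy.filter((holds, j) => holds && alive[j]).length;
  console.log(`step ${step}: ${available} live replicas`);
}

A real test would additionally let the system's republishing mechanism create new copies on live peers, so that the replica count recovers between steps rather than only decaying.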

User Friendliness

User friendliness will be examined in two ways: by evaluating the performance of the system, and through an examination of the GUI. Performance will be addressed in detail in the next section. As mentioned in section 3.1.1, the design of the system will rely heavily on established design principles and the findings from the analysis of similar systems. Thus, the approach to visual design will primarily be based on experience and research. This approach takes less time and effort than design methods that involve a lot of user participation, which is a good fit for this project, as the user friendliness of the GUI is just one of several important metrics.

However, as this approach in practice cannot guarantee a user friendly result, it will be necessary to also get some user input. Therefore, to evaluate the user friendliness of the GUI, we will both conduct a small survey to gain user feedback and perform the same analysis on our system that we did for each of the systems compared in chapter 2. Understanding the users and their needs is an important part of any software development process [66]. In a large-scale software development process, this would typically entail the involvement of users before implementation, but considering the scope of this particular project, it was deemed sufficient to involve some users in the evaluation. This feedback could then be valuable in the further development of the prototype.

Performance

Performance is an important aspect of user friendliness and will be evaluated by measuring the time it takes to perform specific actions. Jakob Nielsen [4, p. 135] has studied human response times, and groups them into three categories: 0.1 seconds, 1 second and 10 seconds. Nielsen claims that as long as the response time is under 0.1 seconds, the user will feel like the system reacts instantly. 1 second marks the limit of the user's uninterrupted flow of thought: if the delay is 1 second, the user will notice it, but not lose focus. Anything above 1 second will likely cause the user's attention to divert, but as long as the response time is under 10 seconds, they will not lose their focus completely. For delays between 1 and 10 seconds, it is important to show the user that the system is working, for example with a message that it is loading. For longer delays, an estimated time remaining should also be shown. The bottom line is: the shorter the response time, the more interactive the system feels, which in turn is more likely to keep the user's attention. The evaluation of the response time will therefore mostly involve timing of user actions, but also of actions that the system performs in the background that could potentially slow down the rest of the system. The following are examples of actions, and of how to measure their response time (a minimal timing sketch follows the list):

• Archive content from a website, measured from click.

• Open a file that has been shared, measured from click.

• Duplicate a file across peers, measured from when the file is archived.
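As a minimal sketch of how such measurements could be taken, assuming nothing about the project's actual test code, a small Node.js helper can time any asynchronous action:

// Hypothetical timing helper; the action passed in is a stand-in
// for e.g. archiving a site or opening a shared file.
async function timeAction(label, action) {
  const start = process.hrtime.bigint();
  await action();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: ${elapsedMs.toFixed(1)} ms`);
}

// Example usage with a dummy 120 ms action:
timeAction('archive site', () => new Promise((resolve) => setTimeout(resolve, 120)));

Comparing the measured times against Nielsen's 0.1, 1 and 10 second thresholds then gives a direct verdict on each action.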

3.2.3 Summary of Solution

This section has given a brief overview of the core functionality of the system, and how this is to be evaluated once the prototype is finished. The prototype should be an application that allows the user to archive websites and share the archived files with other users. The files they archive should also be duplicated and maintained across a network of peers. Through an evaluation of the system that directly addresses its reliability, user

friendliness and performance, it will be possible to determine whether the system meets these goals.

3.3 Conclusion of Analysis

This chapter has tied the discussion from the previous chapter to our project, by presenting the problem it seeks to solve before moving on to the proposed solution. The creation of our project is tied to several issues that are all addressed in other systems, but not all together. Our goal is to draw inspiration from these existing systems to create a decentralised Internet archiving system that is reliable, user friendly and performant. The system should be designed for both personal and social use, in that the users can archive the websites that they want, and share these with their peers. By creating a decentralised P2P system that creates multiple duplicates of each archived file, users of the system should be confident that the content they save will not change or disappear over time.

Chapter 4

Design

This chapter contains a detailed description of the design of the system, from the system structure to the graphical user interface. The system will be a structured P2P system, using the Kademlia DHT to keep track of the system state and content routing. Any file that a user archives will be saved to their device, as well as duplicated throughout the network according to the Kademlia algorithm. The main user interface will be an in-browser file archive, which in time should be implemented as a browser extension. The chapter starts with a brief outline of the core functionality of the system. Then, an overview of the DHT that will be the basis for the peer network is given, before moving on to the files, discussing how the files are to be duplicated and maintained in the system. Following this, the visual design is shown through a wireframe and discussed according to design principles. Lastly, any trade-offs that have been made in the design process are summarised.

4.1 Core functionality

In the previous chapter, the core functionality was limited to two main categories, namely the saving and sharing of Internet content, and file duplication across a network of peers. These two categories can both be broken down even further into a set of specific actions that the system needs to be capable of performing. This section will give an overview of some specific actions that should be made possible through the implementation of the system.

4.1.1 Peer Communication

Once a peer is connected to the network, it should be able to communicate with other peers according to the Kademlia DHT. This peer communication will ultimately be used to save and fetch files.

4.1.2 Archiving Sites

It should be possible to archive sites, meaning that the user of the system should be able to provide a website URL and retain a permanent copy of this site. This copy should then be distributed in the network of peers according to the Kademlia algorithm.

4.1.3 Fetching Files From DHT

The users of the system should be able to fetch their files from the DHT. Fetching files is tightly connected with sharing, as these two essentially describe the same action: fetching a file from the DHT using its hash. Therefore, there may not be any difference between the implementation of these two in the back-end of the system.

4.1.4 Sharing

To share a file with a user, the key of the file is copied and shared with the person through another channel, such as a messaging service, a social media platform, or a written note. The user can then search for the file with its key, which allows the system to locate a copy of the file in the network and provide the user with this. This functionality will be reflected in the UI of the system, as the user needs to access a file's hash to be able to share it.

4.2 Structured Peer-to-peer System

Figure 4.1: Example system state

As discussed in section 2.2.4, it can be difficult to efficiently route messages in unstructured P2P systems. Our system relies on being able to quickly route messages and locate files to meet its requirements, which is why we have decided to make a structured P2P system using a DHT. This also makes it easier to use an algorithm for file duplication that stores files on nodes in a way that makes it possible to locate them using the DHT's routing algorithm. The system needs to be able to locate and maintain copies of

Figure 4.2: Example node state, using marked node from figure 4.1

files that the users archive, and cannot rely on file popularity like torrenting systems do, or on unreliable searches like Gnutella's.

4.2.1 Kademlia

The system will use the Kademlia [2, 3] DHT to organise nodes and route messages. Peers in a network that uses Kademlia are organised in a binary tree structure, where each leaf node of the tree represents a peer. Figure 4.1 shows an example of a system state. Each node in the network is assigned a unique 160-bit node ID. To ensure that routing is done in O(log n) steps, each node knows at least one node in each subtree of the whole tree of nodes. This makes it possible for the node to perform successive queries to get closer and closer to the target node. To store information in the system, Kademlia stores ⟨key, value⟩ pairs on nodes with IDs close to the key, where closeness is calculated using an XOR metric. The keys, like node IDs, are 160-bit identifiers. The XOR metric calculates the distance between two 160-bit identifiers as their bitwise exclusive or, interpreted as an integer. To route messages, each node contains a routing table, as shown in figure 4.2. This table is a binary tree that consists of several lists called k-buckets, one for each 0 ≤ i < 160. Each k-bucket is a list of routing addresses, containing the IP address, UDP port and node ID for nodes in the network whose distance from the node is between 2^i and 2^(i+1). The k-buckets are sorted by the time last seen, placing recently seen nodes at the tail and least recently seen nodes at the head. k is the maximum size of the buckets and is set as a system-wide parameter, chosen such that any k nodes are unlikely to fail close to each other in time. Each node starts with one k-bucket, and dynamically expands the table as needed. Kademlia also provides file duplication and upkeep, which will be outlined in section 4.3.2.
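As an illustration of the XOR metric (a sketch, not the Kademlia implementation used later in the prototype), the distance between two IDs can be computed by XORing them byte by byte and reading the result as an integer:

const a = Buffer.from('aa'.repeat(20), 'hex'); // two 20-byte (160-bit) example IDs
const b = Buffer.from('ab'.repeat(20), 'hex');

// Kademlia distance: bitwise XOR of the IDs, interpreted as an integer.
function xorDistance(idA, idB) {
  const out = Buffer.alloc(idA.length);
  for (let i = 0; i < idA.length; i++) out[i] = idA[i] ^ idB[i];
  return BigInt('0x' + out.toString('hex'));
}

console.log(xorDistance(a, b)); // a smaller value means closer in ID space

A useful property of this metric is that it is symmetric: the distance from a to b equals the distance from b to a, which lets nodes learn useful routing information from the queries they receive.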

4.2.2 Summary of Structure

To be able to efficiently route messages and keep track of where files are located in the system, we have decided to create a structured P2P system based on the Kademlia DHT. The peers in the system will then be organised in a binary tree structure. Information is stored in the system in key-value

pairs on multiple nodes, depending on the key's closeness to the node ID according to the XOR metric. Each node knows at least one other node in each subtree of the system binary tree, kept in a routing table consisting of lists called k-buckets.

4.3 Files

As the system at its core is a file archival system, the files themselves play an important part. This section will address a few design choices that have to do specifically with files. One of them, local saving, was among the categories discussed in section 2.3.3, and will now be related to our system. The remaining subsections will look at file duplication, monitoring and location according to the Kademlia DHT.

4.3.1 Local saving

One of the categories that were examined in the comparison of Internet archiving systems (section 2.3.3) was local saving. This category did not directly address one of the core issues, but rather represented an implementation choice. For our system, it is not really a choice: each user needs to save the files they archive themselves, as well as copies of other users' files, locally. In a P2P file system, this is necessarily the case, as there is no central server to depend on for storage. As files are replicated across several devices that do not belong to the owner of the file, it is important to ensure that other users cannot access or tamper with files that do not belong to them. To this end, we will use content addressing, meaning that the key for a given file will be calculated based on the contents of the file. This will be done by using a hash function to calculate a key that has a high probability of being unique. Any file that has been tampered with will produce a different hash, and no longer match the key. The system is a network that requires Internet access to function properly, and it will therefore be assumed that the users have Internet access while using it. However, offline access may be desirable to some users, because it allows them to access the archived files whenever they want. Offline access will also make retrieval of the files fast, as it is likely that the users will access the same files multiple times on the same device. Because of this, any file that is archived in the system will also be saved on the user's device by default.
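The principle behind content addressing can be illustrated in a few lines of Node.js; SHA-1 is shown here only because its 160-bit output matches Kademlia's ID length, and the hash function actually used by the prototype may differ:

const crypto = require('crypto');

// The key is derived from the file's contents, so two identical files
// always get the same key, and a tampered copy no longer matches it.
function contentKey(fileContents) {
  return crypto.createHash('sha1').update(fileContents).digest('hex');
}

console.log(contentKey('<html>an archived page</html>'));
console.log(contentKey('<html>a tampered page</html>')); // a completely different key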

4.3.2 File Duplication

As mentioned, the Kademlia DHT uses the XOR metric to store information in the system. To duplicate a file, the node will first find the k nodes closest to the key, before sending STORE messages to each of those nodes. This is done through the FIND_NODE and STORE operations. FIND_NODE uses a recursive algorithm to find the k closest nodes to a given node ID or key.

The number of file copies in the initial prototype will be determined by the example k parameter suggested by Maymounkov and Mazières [2], namely 20. Initial experimental results may lead to an adjustment of this parameter if it becomes apparent that more or fewer copies are needed. Maintenance of the key-value pairs in the network will be done according to the Kademlia algorithm. This is done by republishing the key-value pairs once an hour, to account for peers leaving and joining. If a node receives a STORE message for a given key, it will not republish the key-value pair within the next hour, to ensure that only one node republishes the pair every hour.
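A minimal sketch of this republishing rule, with a stand-in function in place of the real FIND_NODE and STORE operations, could look as follows:

const REPUBLISH_INTERVAL_MS = 60 * 60 * 1000; // one hour
const storedPairs = new Map();       // key -> value held by this node
const lastStoreReceived = new Map(); // key -> timestamp of last incoming STORE

// Called (by the network layer, not shown) when another node sends a STORE.
function onStoreReceived(key, value) {
  storedPairs.set(key, value);
  lastStoreReceived.set(key, Date.now());
}

// Stand-in for FIND_NODE followed by STORE to the k closest nodes.
function sendStoreToClosestNodes(key, value) {
  console.log(`republishing ${key}`);
}

function republish() {
  const now = Date.now();
  for (const [key, value] of storedPairs) {
    const seen = lastStoreReceived.get(key) || 0;
    // Skip pairs that received a STORE within the last hour, so that
    // only one node republishes a given pair each hour.
    if (now - seen >= REPUBLISH_INTERVAL_MS) sendStoreToClosestNodes(key, value);
  }
}

setInterval(republish, REPUBLISH_INTERVAL_MS);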

4.3.3 File Location

File location is done through the FIND_VALUE operation, which behaves much like FIND_NODE, except that it returns the stored value and terminates as soon as the value is found.

4.3.4 Summary of Files

This section has outlined the design choices that relate specifically to the files. Namely, the files should be saved locally on the user's machine, and for the file duplication to work as intended, it is also necessary to save other users' files. The files will be addressed by their contents by using a hash function, which can help prevent tampering. In the future, encrypting the files is also a measure that should be taken to this end. The files will be duplicated and located according to the Kademlia DHT, which uses the XOR metric to store files on nodes that have IDs close to the file ID in ID space.

4.4 Graphical User Interface

This section gives an overview of the visual design of the system, and which decisions have been made about the design along the way. It does this by presenting an initial low-fidelity prototype of the system in the form of a wireframe, which shows the layout of the system, as well as any colours and other visual elements that are part of the GUI. To make sure that the system is user friendly according to the design principles that we examined in the comparisons of existing systems, this section also addresses each of the principles in relation to our design. The user interface is an in-browser application for browsing and accessing the archived files. Figure 4.3 shows the first wireframe for the system, made using plain HTML and CSS. The wireframe shows the main page of the application, displaying the archived files in a grid. Each file is represented by a preview image, its title and original URL, and has an action menu in the bottom right corner. Aside from the files, the most notable part of the design is the header, which contains a menu and a form consisting of a text field and a button. The menu items are primarily

placeholders in this version of the design, and all apart from one have been removed in the final design.

Figure 4.3: First wireframe of system design

Of the existing systems that we examined, our design most closely resembles Pocket, and has been inspired by Pocket's grid display of the archived sites. One reason that this type of design was chosen is that it gives a very visual representation of the archived sites, and allows the user to quickly scan through their files using both the preview images and the file titles. This gives the user multiple mental hooks to see and recognise the files by [26, p. 260], as opposed to just using text, like Webrecorder and ArchiveBox. In addition to being easily scannable, a grid like this adds ample white space in between the files, which increases readability [25, pp. 751–752]. The following sections look at clarity, consistency and simplicity in turn. These principles, and any design choices that have been made along the way, will be explained and tied directly to the wireframe.

4.4.1 Clarity

The prototype will have very limited functionality, and so there are no extensive menus or vocabulary that the users need to understand to be able to use it. Because of this, the clarity is likely to be good, as there are few causes for confusion. However, there are two aspects of the design that should be addressed concerning clarity, namely the use of icons and the vocabulary. Like Pocket, we use some icons in our design, but we have decided to keep these to a minimum, so as not to confuse the user. To keep the intention of these icons clear, we have chosen icons that are used widely across the Internet. The icons were fetched from Google's Material Design icons 1, which are part of a collection of user interface components based

1https://material.io/resources/icons/

on Google's Material Design guidelines 2. The icons themselves, therefore, are based on established norms and researched design principles. In the initial design, we used a download icon, and the trashcan icon to delete a file. However, in the final design, the download icon was swapped for the network icon to represent sharing. Another aspect of clarity is the vocabulary that is used in the system. There should be no doubt as to what the various text elements refer to, and this vocabulary should also be consistent throughout the design. Once again, our system has the advantage of being a limited prototype, as this means that each piece of application-specific vocabulary only appears once or twice in the GUI. There are, however, two important vocabulary choices that have been made. One of these was changed between the wireframe and the final design, namely what to call the user files. In the wireframe, they are called "files", and in the final design they are called "sites". This change was made to separate the archived sites from files as the user knows them from their computer's file system. "Sites" was chosen because it describes what the files are, while still using vocabulary that ought to be known to the users. The second vocabulary choice concerns what to call the hash value that is used to share a file. Ultimately, "Share ID" was chosen, as this term is descriptive and uses the word "share", which should be familiar to most users. However, like all design choices, the vocabulary should be examined through user testing and may be subject to change in the future.

4.4.2 Consistency

To achieve consistency, the GUI contains a limited number of colours and fonts throughout the design. Galitz [25] argues that one should be consistent with colour use and that the number of colours should be limited to four or five at most. He also emphasises that one should design for monochrome first, meaning that the design should not rely upon colours to make sense. Figure 4.4 shows which colours have been used in the design. We have mainly chosen to use grayscale, but added one vibrant colour to highlight elements such as menu items and buttons. This aids in drawing attention to these elements, and can help the user get oriented. Inspired by most websites, the Internet archiving systems that were examined, and the recommendation in [25], we have decided to have a white background and black text. A dark grey colour is used for the menu header to highlight it, marking it as something separate from the main portion of the page. The header should also look the same throughout the application, even on different pages, to create a feeling of uniformity and therefore consistency. The same grey colour is used for the original site URL to lessen its visibility, in order to highlight the title link, which should open the file in the app. A light grey colour is also used for borders on the top and bottom of each file element. We use only one font throughout the design. Using too many fonts in a design can cause it to appear confusing and inconsistent, just like using too many colours will.

2https://material.io/

Figure 4.4: Colours used

To differentiate between the different text elements of the system, we have used different font sizes, colours and weights, which help structure the design without appearing inconsistent [25, p. 164].

4.4.3 Simplicity

The files will be the most prominent part of the in-browser archive design, as this part of the system is mainly intended for revisiting and sharing archived sites. The files take up most of the screen space and are displayed in a grid, with three files in each row. Along with a preview image of the saved site, each file element also shows the title, URL and user menu for the file. This way, no unnecessary information is displayed, allowing the user to navigate their files without having to process anything superfluous. The site does not have any overarching menus, as the prototype only displays the user's files. The user menu for each file only consists of the two actions that are necessary to comply with the system design. This menu appears simple on account of it having few elements, and because the intention of each element is clear, as discussed in section 4.4.1.

4.4.4 Summary of Graphical User Interface

This section has given a short presentation of a low-fidelity wireframe of the GUI and discussed the various design choices in relation to the design principles that were examined in the comparison of existing systems. The GUI is split into two main parts: a header and a grid view of the user's files. To achieve clarity, consistency and simplicity, it was decided to use a limited set of icons, colours, fonts and menus, and to use ample white space in the design. The files themselves are presented with a mix of graphical and textual elements, which gives the user multiple frames of reference when browsing their archive. Because the prototype's functionality is very limited compared to other similar systems, it is important to keep in mind that it may be easier to achieve the goals set by the design principles.

4.5 Trade-offs

This section outlines the major trade-offs that have been made in the design process of the system. The decision to create a structured P2P system was made to be able to reliably and efficiently locate archived files. This means that the system continually needs to keep track of available peers, and of how many copies of a file are available. Systems that use a DHT spend more time and effort on this upkeep than systems with alternative solutions, which may affect the response time of the system. However, in a system where it is necessary to keep track of all files, and where efficient file location is important, the benefits of using a DHT outweigh the time it takes to maintain it. Additional measures, such as local saving of files, can also make up for the overhead created by the system upkeep. Using a centralised server to keep track of files could have been another way to locate files efficiently, but this would have created exactly the vulnerability that we seek to avoid by creating a P2P system. In the same vein, it would have been possible to create a centralised system for Internet archiving, but we have chosen to create a decentralised and distributed system. The vulnerable nature of the client-server structure of websites is one of the reasons that we are creating a web archiving service in the first place. Offering a solution that uses a more robust architecture for file archiving has been one of the driving forces of this project. Besides, multiple centralised solutions already exist that offer the same, or mostly the same, functionality as our system, as seen in section 2.3.3. One challenge when creating a P2P system is that it demands more from the users, as they are expected to provide storage and networking resources instead of just passively using them, as they would with a centralised solution. To many users, a system that has a central server may seem like the easier solution, as they do not have to worry about giving up their own machine's storage space. For a P2P system, as previously mentioned, it is therefore important to incentivise the users to make this sacrifice for the benefit of every user in the system. Sharing is, therefore, an integral part of our system design.

4.6 Summary of Design

In short, the core functionality of the system is archiving sites, storing and fetching the archived sites from the DHT, and sharing the archived sites using the hash value. The system will be a structured P2P system, using the Kademlia DHT for system structure and content routing. The files themselves will be saved locally on the users' devices, and in the network in key-value pairs on nodes whose IDs are close to the file key according to the XOR metric. The system will use content addressing, meaning that the key for a file will be a hash value calculated from its content, which can help prevent tampering. The GUI will be an in-browser application that allows the user to view

their archived files, and archive new sites. The visual design, as shown through a wireframe, displays the user's files in a grid, and has a header with a form field that lets the user archive a site, using the site's URL. Three design principles, clarity, consistency and simplicity, acted as a guide for the visual design and led to the final design having a limited number of icons, colours, fonts and menus. Achieving these goals was also made easier by the fact that the system has limited functionality at this stage.

Chapter 5

Implementation

This chapter gives a detailed description of the implementation of the system, including its overall structure, the code and the visual design. The prototype was implemented with Node.js and React, with a clear two-tier architecture. Node.js and its Express library were used to create the back-end and an API for the front-end to communicate with, and the back-end handles everything that involves files and the peer network. The front-end is a React app that allows the user to interact with the system in the browser, by displaying the user's files and making it possible to archive new sites. First, this chapter gives a brief explanation of the system structure, before giving a breakdown of how to run the application, either with the front-end app or as a CLI. Following this, there is a detailed description of both the back-end and the front-end, including an overview of key files for both. The section detailing the back-end looks at the peer handling and file handling in turn, whereas the section on the front-end explains the functionality and then the visual design of the React app.

5.1 Two-tier Architecture

In terms of structure, the system has a two-tier architecture, giving it a clear separation between the back-end and front-end. The back-end takes care of routing, duplication of files and sharing, while the front-end is an in-browser app that lets the user interact with the system. One advantage of a two-tier architecture is that it makes the code more modular, and therefore easier to write, maintain and test. Modular code, where each component has a limited set of responsibilities, or even just one responsibility at function level, ensures that no part of the code gets overly complicated. It should be possible to add a visual feature to the front-end application without having to change the back-end code, and vice versa. This way, the front-end app does not have to perform any file or DHT operations, and the back-end app does not have to react to user input. The front-end communicates with the back-end through an API with a limited set of functions. Another advantage of separating the front-end and back-end is that it

makes it much easier to scale the system to other uses. The same back-end can be moved to remote servers, or be used to create a mobile application with a different front-end. In the same vein, this separation also makes it easier to test the system, as it is possible to use and test all the functionality with automated tests through the back-end, without having to go through the front-end.
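As an illustration of this separation, the following is a hedged sketch of how a front-end could call such an API; the /saveFile path and the response shape are assumptions for illustration, not necessarily the project's actual endpoints:

// Hypothetical front-end call to the back-end API.
async function archiveSite(url) {
  const response = await fetch('/saveFile', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url }), // the back-end captures and stores the site
  });
  return response.json(); // a content object for the React app to display
}

The front-end only ever sees this small surface; everything involving files and the DHT stays behind it.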

5.2 Running the Application

To run the application, it is necessary to have Node.js 1 and the Yarn 2 package manager installed. Node.js is a runtime environment that makes it possible to run JavaScript code both on the client- and server-side of an application. A package manager, like Yarn, takes care of installing, updating and removing packages that are used in the code, and allows the user to create pre-defined scripts to run the application or any tests related to it. The project can be found at https://github.uio.no/tonjro/distarchive. The following bash commands will clone the project and install any dependencies:

$ git clone git@github.uio.no:tonjro/distarchive.git
$ cd distarchive/
$ yarn install-api && yarn install-client

There are two separate ways to run the system; one is to run the application that uses the front-end, and the other is to run a command-line interface (CLI). These will be described separately in the following sections. As the full application that includes the front-end is dependent on using specific localhost ports to connect the front-end and back-end, it was decided that it would be easier not to connect this application to other peers, but to create a separate interface to manually test the DHT and peer network.

5.2.1 Application

To run the full application, it is necessary to open two terminals and execute each of the following commands in a separate terminal:

$ yarn api

$ yarn client

The API terminal will display the HTTP status code for any API calls made by the client. The client will be available on http://localhost:3000/. Through this, it is possible to view and use the front-end application, where the user can save and view files as described in the previous chapter. While running the application in this manner creates and stores the files in a DHT, it is currently not possible to spawn any more peers and

1https://nodejs.org/en/
2https://yarnpkg.com/

connect them to the network. This mode, therefore, is primarily intended for the development and testing of the front-end.

5.2.2 CLI

The CLI is implemented as a command loop, where the user is presented with five choices and is allowed to interact with the system through the terminal. It is possible to run the CLI in two different ways: locally in multiple terminals on one machine, or in a network of machines. In either case, it is necessary to run the following command in one terminal or on one machine:

$ yarn cli -f

This is because the Kademlia implementation we have used, which will be described in section 5.3.2, does not support connecting to a network without knowing about at least one peer in the network. For a final version of the system, this would likely be solved with one or more bootstrap servers, like torrenting services use, that keep an updated record of peers through which new peers can connect to the network. As a temporary solution during the development of the prototype, the initial peer is always created using the same peer ID and port, so that other peers can easily connect to the network. To run the application locally, the remaining peers are automatically connected through the local IP address, and therefore it is only necessary to run the following command in as many terminals as desired:

$ yarn cli

However, if the CLI is executed in a network of remote machines, it is necessary to add the IP address, including the IP version and port, of the initial peer to the command. The IP address can be found in the log after running the initial peer, which will look something like this:

> peer started. listening on addresses:
> /ip4/127.0.0.1/tcp/10333 QmcrQZ6RJdpYuGvZqD5QEHAv6qX4BrQLJLQPQUrTrzdcgm
> /ip4/192.168.1.252/tcp/10333 QmcrQZ6RJdpYuGvZqD5QEHAv6qX4BrQLJLQPQUrTrzdcgm

Here, /ip4/192.168.1.252/tcp/10333 is the address needed to connect to the initial peer, and the following command can be run from any number of remote machines:

$ yarn cli --ip /ip4/192.168.1.252/tcp/10333

It is possible to use any of the peers already connected to the network to connect a new peer, but then both the address and the ID need to be provided:

$ yarn cli --ip /ip4/192.168.1.252/tcp/10333 --id QmcrQZ6RJdpYuGvZqD5QEHAv6qX4BrQLJLQPQUrTrzdcgm

The CLI makes it possible to ping other peers using their peer ID, store and locate files, and find the number of providers for a given file. Files

Files are saved using the URL of the website and located using the hash that is displayed once the file is saved. Upon closing, the program removes all local files and cleans up the database.

5.3 Back-end

The back-end was implemented with Node.js, using the Express 3 framework to create an API for the front-end to communicate with. Node.js was chosen because of the existing JavaScript implementation of Kademlia, which will be described in the next section, and because it pairs well with a JavaScript front-end. Node.js allows us to build both the back-end and front-end on a JavaScript foundation, which makes seamless communication between the various parts of the system easier. This section gives a detailed description of the back-end scripts. It starts with a brief overview of the file structure, before examining the peer handling and file handling in turn.

5.3.1 File Structure

The back-end code is located in the api/ folder. Most of these folders and files are essential to the back-end of our application and are described in detail below. Some are left out; these are files and folders that were automatically generated when initialising the Express app and have no significant role in our system specifically.

bin/ The file in this folder, www, is an executable that is used by Yarn to start the API. This file is also automatically generated by Express, but we have made one notable addition to it: when the API is started, a peer is created and a global peer variable is set, making the same peer accessible across the so-called routes described below.

public/ This folder contains everything that is accessible to the people connecting to the application, which in our case is only the individual user. The important folder here is images/, which contains all the images presented in the front-end app, both previews and full-size versions.

routes/ In Express, the routers 4 are what connect the front-end to the back-end. When the front-end communicates with the API, it uses one of these routes to make HTTP requests. In our system, the communication happens through one of four routes (a sketch of one such router is given after this overview):

delete.js This router removes a given file from local storage. The file is not removed from the DHT, as there is no option to do this in the libp2p API (see section 5.3.2).

3 https://expressjs.com/
4 https://expressjs.com/en/guide/routing.html

fetchFile.js Fetching a file means using the hash of a file to fetch it from the DHT. A JavaScript object named content (see section 5.3.3) is returned to the front-end app to be displayed.

index.js This router is called each time the front-end application is loaded, and fetches any existing files from the database.

saveFile.js Saving a file means using its URL to capture the site and save it to the database and the DHT. As in fetchFile.js, an object is returned to the front-end app.

src/ The source folder contains all the back-end code, and the database. All the files in this folder will be described in detail in sections 5.3.2 and 5.3.3, and their structure and functions are presented in figure 5.1. The following list gives a brief overview of the folders and files.

db/ The database, which consists of a JSON file.

fileHandler/ The FileHandler takes care of all file handling and communication with the database. All file handling functions are defined in the index.js file.

peerHandler/ The PeerHandler handles everything that involves the peers, including peer creation and communication between peers, as well as storing and fetching files in the DHT. All peer handling functions are defined in the index.js file, while the p2p/index.js file contains the module used to create peers (see section 5.3.2).

index.js In order to ensure that the routers and test files do not rely on using the FileHandler and PeerHandler directly, their functionality has been abstracted up one level into this general Handler. This makes the system more modular, as each of these scripts can evolve and change without the rest of the API having to be updated.

test/ This folder contains all test files, which are explained in chapter 6.

app.js This file ties the whole API together. It is also automatically generated, aside from the addition of the routers and a static folder that gives the front-end app access to the images.
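To illustrate how the routes connect to the rest of the back-end, the following is a hedged sketch of what a router like saveFile.js might look like; the endpoint path, the Handler method name and the response shape are assumptions based on the description above, not the thesis' exact code.

const express = require('express');
const router = express.Router();

// The general Handler described under src/ above (assumed import path).
const handler = require('../src');

// POST with a URL in the body: archive the site and return the
// resulting content object to the front-end.
router.post('/', async (req, res) => {
  try {
    const content = await handler.saveFile(req.body.url);
    res.json(content);
  } catch (err) {
    res.status(500).json({ error: 'Unable to save site' });
  }
});

module.exports = router;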

5.3.2 Peer Handling

The DHT was created using libp2p 5 [17], the same technology that IPFS is built on. libp2p offers an updated library for building P2P applications with JavaScript and Node.js, based on the Kademlia DHT. Using a pre-existing implementation of a DHT with a built-in content routing interface makes the implementation process easier and allows us to focus on the system as a whole.

5 https://github.com/libp2p/js-libp2p

Figure 5.1: Structure of back-end scripts

To be able to create libp2p peers, we have created a class, DistArchPeer. This class extends the Libp2p class, defines how to communicate with other peers, and enables the DHT. All handling of peers, as well as of the DHT, is implemented in the PeerHandler. Peers are created with the createPeer function. The first peer to be initialised is used as a bootstrap peer, and always has the same peer ID, as defined by the peer-id.json file in the peerHandler/ folder. This is because other peers need to know of at least one other peer in order to connect to the network, as previously mentioned. Any peer created after the initial peer connects to the network by dialling the initial peer to establish a connection, before performing a lookup for the peers closest to itself and dialling all of these. If the initial peer goes down, the network continues to run, and new peers can connect if they know the address and ID of an existing peer, as described in section 5.2.2.
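The following is a rough sketch of how such a peer might be created and connected, based on the js-libp2p API as it looked around 2020; the module choices and option names are assumptions, and the actual DistArchPeer class may differ in its details.

const Libp2p = require('libp2p');
const TCP = require('libp2p-tcp');
const MPLEX = require('libp2p-mplex');
const { NOISE } = require('libp2p-noise');
const KadDHT = require('libp2p-kad-dht');

async function createPeer(bootstrapAddr) {
  const peer = await Libp2p.create({
    addresses: { listen: ['/ip4/0.0.0.0/tcp/0'] },
    modules: {
      transport: [TCP],
      streamMuxer: [MPLEX],
      connEncryption: [NOISE],
      dht: KadDHT,
    },
    config: { dht: { enabled: true } }, // enable the Kademlia DHT
  });
  await peer.start();
  if (bootstrapAddr) {
    // Join the network by dialling the known bootstrap peer.
    await peer.dial(bootstrapAddr);
  }
  return peer;
}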

The main purpose of the peer handler is to store and fetch data from the DHT. The storeData function takes the file content and creates a hash, converts both of these to buffers, and stores them in the DHT as a key-value pair. The hash is returned from the function. To find data, the findData function is called with the file hash, and the buffer located in the DHT is returned. Additionally, the peer handler has a getNumberOfProviders function that finds the number of providers of a file, given its hash as a buffer. It is only possible to provide the key argument to this function as a buffer, which is why the hash is converted into a buffer when storing data. This function, as well as the ping function, is mostly intended for testing purposes. The peer handler also has a republish function that was added following the initial reliability tests. The purpose of this function is to republish key-value pairs every hour; it is discussed further in chapter 6.

The hash function is implemented in the peer handler using the Node.js Crypto library 6, which provides cryptographic features such as hashing functions. The hash is based on the content received from the file handler, but the peerId of the local peer is also added to ensure that the hashes are unique, even if two peers archive the same site at the exact same time. The hash is represented in the rest of the system as a hex string, which can be used to fetch content from the DHT.

In summary, the PeerHandler and the accompanying DistArchPeer module handle everything that has to do with the peer network and the DHT, using the libp2p library. All peers are automatically connected to the first bootstrap peer, before locating other peers. Besides the creation of peers, the peer handler has two main functions to put and get data from the DHT. It also has a ping function and, for testing purposes, a function to fetch the number of duplicates of a given file.
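A minimal sketch of the two main functions follows, assuming the content routing interface that js-libp2p exposed at the time (contentRouting.put/get) and a SHA-256 digest; the function names follow the description above, but the exact code may differ.

const crypto = require('crypto');

async function storeData(peer, content) {
  // Mix the local peer ID into the hash so that two peers archiving
  // the same site at the same time still produce unique keys.
  const hash = crypto
    .createHash('sha256')
    .update(JSON.stringify(content) + peer.peerId.toB58String())
    .digest('hex');
  // Both key and value must be buffers before they go into the DHT.
  await peer.contentRouting.put(
    Buffer.from(hash),
    Buffer.from(JSON.stringify(content))
  );
  return hash; // hex string used in the rest of the system
}

async function findData(peer, hash) {
  const value = await peer.contentRouting.get(Buffer.from(hash));
  return JSON.parse(value.toString());
}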

5.3.3 File Handling

The file handling consists mainly of fetching content from websites and saving content to the local database. It also offers options to fetch all local files and to convert a value fetched from the DHT into file data. The websites are captured using the Node.js Puppeteer 7 library. When the user provides a URL, Puppeteer launches a Chromium browser, visits the given website and captures a screenshot. Instead of attempting to create a browsable rendition of the site, it was decided to use a screenshot to represent the website, both to narrow the scope of the project and because this functionality does not directly tie into our main focus. The images are saved as PNG files.
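The capture step can be sketched as follows using Puppeteer's standard API; the file naming and the returned fields are illustrative assumptions based on the description above.

const puppeteer = require('puppeteer');

async function captureSite(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until the network is mostly idle so the page has rendered.
  await page.goto(url, { waitUntil: 'networkidle2' });
  const title = await page.title();
  // Timestamped file name, matching the paths seen in the content
  // object below (assumed convention).
  const path = `public/images/${new Date().toISOString().slice(0, 19)}.png`;
  await page.screenshot({ path });
  await browser.close();
  return { title, url, site: path };
}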

6 https://nodejs.org/api/crypto.html
7 https://github.com/puppeteer/puppeteer

Figure 5.2: Program flow when saving a new site to archive

Figure 5.2 shows the data flow when a user saves a new site to the archive. Based on the site URL, the system fetches the title and a screenshot of the site, and returns the title, the URL, and the file paths to the screenshot and a smaller preview image to the Handler. The Handler, in turn, sends the data, with each image encoded as a buffer, to the peer handler. The peer handler appends the peer ID to the data, calculates the hash and adds the file to the DHT. Finally, the hash is returned to the Handler, which adds it to the content before the information is sent to the front-end app via the router, where the preview image is fetched from the API and displayed along with the title and URL of the site. The content object that is finally returned to the front-end app looks like this:

{
  title: "Example title",
  url: "www.example.com",
  preview: "images/previews/2020-04-13T14:07:36.png",
  site: "images/2020-04-13T14:07:36.png",
  hash: "0e21ce51e269bd1923740cb85d56543a7cd5a1f28308ffd3b22c703",
}

As stated in chapter 4, any file that is saved by the user is stored locally on the user's device, in addition to in the DHT. In this small prototype version of the system, this is solved by saving data about the files to a JSON 8 file, and the site images to a local folder. Setting up a database was considered, but in light of the size and scope of the project, it was decided to go with a simpler solution that works well with small quantities of data.

8 https://www.json.org/json-en.html

The JSON file format is commonly used to save and distribute information on the web and is easy for humans to read. In a more extensive version of the system, this choice would need to be revisited. Additionally, in a more extensive version, the user should be able to remove some or all files from their device without deleting them from the system. It should also be possible to turn off the automatic local copy.

5.3.4 Summary of Back-end

The back-end was implemented in Node.js and functions as an API for the front-end to communicate with. In addition, the back-end contains the scripts that deal with peer and file handling. Peer handling and the DHT were implemented using libp2p, a pre-existing Node.js library that supports the creation of P2P systems. The peer handler takes care of creating peers and connecting them to the network, as well as storing and fetching content from the DHT. File handling involves both the capturing of websites and any communication with the local database. In order to create a uniform interface for communication with the API, a general Handler was created that mediates between the API and the peer and file handlers.

5.4 Front-end

The front-end of the system is a React 9 app that displays the user's files in the browser. React is a JavaScript library for building front-end applications that is particularly suited to creating single-page apps. A React app is made up of components that each have a state, which can be updated dynamically as the user uses the app. In our particular case, the app allows the user to interact with the system through a limited set of actions: a site can be archived using either its URL or its hash, and once a site is archived, the user can view the file, see its hash or delete the file. This section looks at the front-end app in detail and outlines both its functionality and its design. It starts by giving an overview of the files that make up the app, before moving on to a detailed description of the functionality and, finally, a presentation and explanation of the visual design.

5.4.1 File Structure

All the front-end code is located in the client/ folder. Unlike the back-end, most of the contents of this folder were automatically generated, and the only files that make up the main part of the front-end application are the following, all located in the src/ folder.

App.js This file contains all the functionality provided by the application, as well as the HTML code that is rendered. The App component that is exported from this file is responsible for all API calls, for displaying files and for handling user input.

9 https://reactjs.org/

components/ In addition to the main App component, there are three smaller components that are used to handle and display the files. These are all located in this folder.

CurrentFile.js This component is used to display a file that the user has chosen, which for this prototype version means displaying the file's title and URL, a user action menu and the full-size screenshot.

File.js This component is used to display the archived files on the main page of the application. For each file, this entails showing a preview image and, like the CurrentFile component, the file's title, URL and an action menu.

FileMenu.js Both the CurrentFile and File components have the same action menu, which allows the user to see and copy a file's ID or delete the file. This functionality is separated into its own component, which both of the other components use.

styles/style.css All the styling for the application is located in this file.

5.4.2 Functionality

The application has two different pages: the home page, shown in figure 5.3, and the display page, shown in figure 5.4. The former is the landing page for the app and displays the user's archived sites in a grid, showing, as mentioned, a preview image, file title, URL and action menu for each file. If the user clicks one of the files, by clicking either the title or the preview image, the display page opens. Here, the title, URL and action menu are prominently displayed at the top of the page, above the full-size screenshot. The header contains a link to the main page and a form field that allows the user to archive sites or fetch files from the DHT. Throughout this section, these components, and the functionality they make possible, are described in more detail.

As mentioned, a React app has a state that is continually updated as the user interacts with the system. Our app consists of one main component, the App, and three smaller components: File, CurrentFile and FileMenu. Each of these has a number of state variables that keep track of the data that is dynamically displayed in the app, such as the files, the text field and any error messages. These are updated whenever the user interacts with any element on the site, allowing the app to react accordingly.

Header

The header contains two main parts: a link to the home page and a form field where it is possible to archive a site through its URL or retrieve a file from the DHT with its ID. As can be seen in the first wireframe for the design in figure 4.3, the link to the home page was initially intended to be part of a menu, but as we limited the functionality of the system, the rest of the links were discarded, and the remaining link was made to redirect to the home page.

Figure 5.3: Front-end application home page

However, the underlying menu structure has been kept in the code. The most important part of the header is the form, consisting of a text field and a button. If the user writes anything in the text field, a state variable is updated on every change to the field. Clicking the button checks whether the given input is valid in the handleSubmit function, before either archiving the site or fetching the file. Both the archiving and fetching functions, saveURL and fetchFile respectively, perform an API call with the input, parse the response and check whether the action was successful. If the API call succeeded, the new file is saved and displayed; if not, an error message is displayed. The files are all saved as File components in an array that is displayed on the home page.
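As an illustration, the archiving call might look like the following hedged sketch; the endpoint path and the two callbacks are assumptions standing in for the component's state updates.

async function saveURL(url, onSuccess, onError) {
  try {
    const res = await fetch('/saveFile', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ url }),
    });
    if (!res.ok) throw new Error('save failed');
    // The content object (title, url, preview, site, hash) is added
    // to the array of File components and displayed.
    onSuccess(await res.json());
  } catch (err) {
    onError('Unable to save site, please try again');
  }
}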

Home Page

The main portion of the home page is used to display the user's archived files in a grid. Each displayed file contains a preview image fetched from the API, the title of the file, which is also a link that opens the file, the original URL of the file, and a file menu. Clicking either the file title or image will open the display page, and clicking the original URL will open the original website in a new tab. When the user clicks the file title or image, the openFile function is triggered, which updates the current file, a state variable whose purpose is to keep track of the information about the selected file. In addition, the click changes the route of the page from the home page to the display page, which shows the current file.

Figure 5.4: Front-end application display page

File Menu

At the bottom of each file element (see figure 5.5), there is a file menu consisting of two icons. The share icon, on the left, triggers a pop-up right above the icon, as pictured in figure 5.6. Because the share ID is fairly long, it is displayed in a fixed-size, read-only input field, which allows the user to scroll to see the entire ID if needed. There is also a button that lets the user copy the whole ID to their clipboard. The delete icon, on the right, also triggers a pop-up, but this one covers the whole page, as shown in figure 5.8. If the user clicks the "Cancel" button, or anywhere in the greyed-out area outside of the pop-up, the pop-up closes. If they click the "Delete" button, the deleteFile function is triggered, which removes the file from the list of files and makes an asynchronous API call to remove the file from the database. Additionally, this function reloads the site. This was done to prevent problems with the display page, which will be explained below.

Figure 5.5: File menu
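A hedged sketch of the deleteFile flow described above; the endpoint and the state-updating callback are assumptions.

async function deleteFile(hash, files, setFiles) {
  // Remove the file from the displayed list immediately.
  setFiles(files.filter((file) => file.hash !== hash));
  // Ask the API to remove the local copy (the DHT copy remains).
  await fetch(`/delete/${hash}`, { method: 'DELETE' });
  // Reload to avoid a stale display page, as described above.
  window.location.reload();
}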

Display Page

The main purpose of the display page (figure 5.4) is to show the file, and therefore there is no extra functionality here that is not already available on the home page, aside from a link to return to the home page.

Figure 5.6: Share pop-up

Figure 5.7: Share pop-up, copied ID

There is a link to the original site, which opens in a new tab, and the same file menu as described above. If the user deletes the file while on this page, the site redirects to the home page, showing the remaining files. The user can still archive sites while on this page, and the header will show all the informational messages, but the page will not change until the user exits to the home page. If the user reloads the page, the site redirects to the home page. This was done because the state is reset when the site updates, which means that the current file variable would no longer be available.

5.4.3 Visual Design

The final visual design of the application closely resembles the wireframe presented in chapter 4 in look and feel. The colour scheme, layout and general components of the site are all the same, while features such as pop-ups and error messages were added during the development of the front-end app. This section addresses the various components in turn, and discusses any changes or additions to the design.

Header

As discussed in the previous section, one of the major changes to the header is the change from a menu to a single link to the home page. The design itself was, however, not changed, both in the interest of keeping the menu structure and because the remaining item serves well as an informational heading as well as a link. The thick underline was intended as an indication of which menu item was selected, but ended up serving to emphasise the heading. As a result of these changes, the header at the top of the display page has been removed in the final design. Otherwise, both the text field that makes it possible to add a URL and the save button have been enlarged and styled with rounder edges. The placeholder text inside the text field has been changed to "URL/Share ID to save" to inform the user that a site can be archived from a URL or by its share ID. The mention of the share ID is also intended to hint to the user what the share ID can be used for.

Figure 5.8: Delete pop-up

Figure 5.9: Loading icon

Status Messages

One of the more important additions to the visual design is the set of messages displayed to the user beside the header text field. These messages are intended to show the user that the system is working and to inform them of any successes and errors along the way. When the user archives a site, the system works for a few seconds, and therefore a loading animation, shown in figure 5.9, is displayed to indicate that the system is working. As mentioned in section 3.2.2, any action that takes more than 1 and less than 10 seconds to complete should be coupled with an indicator such as this, so that the user understands that the system is working and does not get distracted or frustrated. Once the site is archived successfully, a checkmark icon, shown in figure 5.10, appears in the same place as the loading animation to show that the archival was a success. This is coupled with the new file appearing in the grid and is only meant to act as an extra confirmation.

There are also multiple error messages that may appear. If the user provides something that is neither a URL nor a hash, "Invalid URL" (figure 5.11) is displayed. If the user provides a valid hex value that is not a valid file hash, "Site not found in the network" (figure 5.12) is displayed, and if they provide a hash that belongs to a file already in their archive, "Site already saved" is displayed. The final error message appears if the user provides a valid URL but the API is unable to archive it, in which case "Unable to save site, please try again" is displayed. With all of these messages, the intention was to make the text short but informative, telling the user what went wrong and, in some cases, prompting further action, such as asking them to try again.

Figure 5.10: Site archived icon

Figure 5.11: Invalid URL error message

Home Page

Each of the file elements in the grid on the home page is marked by an upper and lower border in a light grey colour. This border was added to clearly separate the files and to mark the boundaries of the grid elements. The white space between the title and URL of the file and its file menu makes room for longer titles, ensuring that the elements stay the same height regardless of the size of their contents. The border also helps to group the items in the element together [25, p. 140]. To highlight the difference between the title link and the URL link, the title is both larger and a different colour. Additionally, hovering over the original URL underlines the link, further separating it from the title and image links, which are only signified by the cursor changing.

File Menu

The design choices regarding the file menu have remained the same since the first wireframe, even if one of the icons has been swapped out. In the first wireframe, a download icon was used, but this has since been replaced with a share icon. The download icon was chosen both as a placeholder and under the assumption that more icons would be added later. This was, however, not the case, and as the functionality of the prototype was limited and the sharing function prioritised, this change was made. As described above, clicking the share icon triggers a pop-up, as displayed in figure 5.6. This type of pop-up was chosen because it provides a quick and easy way to display what the user needs without changing the environment they are interacting with [25, p. 414]. In the pop-up, it is specified that the hash is a share ID. Using the same word in the pop-up as in the text field in the header creates a connection between the two, and hints at what the intention of a share ID is and how it is used. As mentioned in the previous section, the share ID is displayed in a scrollable text field, because there is no need to read the entire ID in order to copy it. If the user clicks the "Copy" button, the button disappears and an informative text that reads "Copied!" is displayed, as shown in figure 5.7.

Figure 5.12: Site not found in network

This indicates to the user that the text has been copied and that they can move on. In a later version of the system, this pop-up would likely also contain shortcuts to social media, making it possible to share the ID in fewer steps.

Figure 5.13: Closeup of delete pop-up

Unlike the share pop-up, the delete pop-up (figure 5.8) covers the entire page and breaks the flow of the user journey to a much larger degree. This is intentional, and a common feature of delete operations across software and web development [25, p. 572], because of the destructive and often permanent nature of delete actions. Therefore, a confirmation message is displayed (figure 5.13), forcing the user to confirm that they want to delete the file. The delete button in the pop-up is also coloured red, a colour commonly used to warn the user, causing them to think twice before they click [25, pp. 706–707].

Display Page

The display page is fairly simple and does not show any information that is not available on the front page, aside from the larger screenshot. As a result, the design is also simple, with all the elements of the page centred and displayed on one line each. The only new element on this page is the arrow button that takes the user back to the front page. This was added at the very top of the page, making it easy to see and its meaning easy to grasp. The look and functionality of the file menu are exactly the same here as on the front page, which adds a level of consistency.

5.4.4 Summary of Front-end

The front-end was implemented using React and is an in-browser app that displays the user's files and allows the user to interact with the system, either to archive sites, or to share or delete existing files. Visually, the design has not changed much since the initial wireframe, but some new features have been added and minor changes have been made. The files on the front page are displayed in a grid, with a preview image, the title, the original URL and a file menu for each file.

Opening a file will open the display page, where the same information and a larger image of the site are displayed. One of the more notable changes in the visual design is the addition of indicators, such as a loading animation and error messages, that inform the user of what is happening in the system and let them know when actions succeed or fail. Functionality has also been added to the file menu, which uses two pop-ups to let the user either copy a file's share ID or delete the file.

5.5 Summary of Implementation

This chapter has outlined the implementation of the system, starting with the overall structure before moving on to a more detailed description of the code and the visual design. The goal was to give an overview of the system prototype, which was implemented with a two-tier architecture using Node.js and React. The final prototype is an application that runs a graphical user interface in the browser, allowing the user to interact with the system. Communication between the front-end and back-end happens through an API, which allows the back-end to perform any actions specified by the user. All peer and file handling is managed by the back-end, separated into two scripts in order to keep a clear separation between the different parts of the functionality. These are interconnected through a general Handler, creating one interface for communicating with the back-end that can be used both by the API and by the automated tests. The GUI is presented through the front-end app, whose responsibility is handling user input and updating accordingly. This entails both displaying the user's files with accompanying actions, such as sharing and deleting, and handling the site archival. Together, the back-end and front-end make up the final prototype, which allows the user to archive, view, share and delete sites as specified in chapter 4.

Chapter 6

Evaluation

In this chapter, both the core functionality and the evaluation metrics presented in section 3.2.2 are addressed through a variety of tests. The tests are separated into automated implementation tests and user tests, which together cover both the functionality and the graphical user interface of the system. The purpose of the implementation tests was to ensure that the core functionality works as described in section 4.1, and to see whether, and how well, the system meets the evaluation metrics reliability and performance. User friendliness was evaluated by distributing a questionnaire to a small number of test users, and by analysing the GUI with the answers from the test and the design principles clarity, consistency and simplicity in mind.

6.1 Implementation Tests

Through automated tests, this section addresses two of the evaluation metrics introduced in chapter 3, as well as the core functionality of the system as outlined in chapter 4. The following sections look at functionality, reliability and performance in turn, describing the tests, how to run them, and the results. Through this, it is possible to determine whether the system meets the requirements, and to which degree. Functionality was tested by creating four peers within the same script and repeatedly exercising the core functionality: peer communication, site archiving and fetching files from the DHT. The reliability tests used Docker containers to create multiple peers and simulated an unreliable network to examine file availability over time. Performance was tested by timing the main user actions of the system: site archival and fetching from the DHT.

6.1.1 File Structure Overview

All automated tests are located in the api/test/ folder. There are three main categories of tests: functionality, reliability and performance. Each of the main test files contains all the tests of its type.

Functionality

functionality-test.js This test ensures that the core functionality works as intended, by performing the main user actions multiple times.

Reliability

reliability.py This file contains two reliability tests that run in similar, yet different ways. Both use Docker containers and a Docker network to run ten peers simulating an unreliable network of peers. This is done either by disconnecting and reconnecting peers to the network, or by stopping them entirely and creating new containers.

setup.js This is a setup file intended to run in a Docker container. The script starts a peer; for the initial peer it also archives files and then continuously checks the average number of copies of the files.

Performance

performance.js The performance tests time the execution of the two main user actions: archiving a site and retrieving the file from the DHT. This requires the tester to run the test file in at least two terminals or on at least two machines to recreate a more realistic use situation.

urls.txt This file contains 527 URLs fetched from several Norwegian digital news sites in March 2020. All the test programs use this list when archiving sites.

6.1.2 Functionality

The functionality tests were created to make sure that the core functionality works. The following are considered core aspects of the functionality:

• Inter-peer communication

• Saving a file to the DHT

• Retrieving a file from the DHT

This test program creates four peers on one machine and runs all the tests locally. Using libp2p, it is possible to create and connect multiple peers within the same program without having to open multiple terminals manually. The peers will be available at different ports and communicate the same way that peers spawned in multiple terminals would. The result is, therefore, the same, but easier to implement, and fully automatic. One consequence of this is that every action is performed sequentially, which is unlike a real use situation. As these tests are only supposed to test functionality, this does not matter. Emulating a more realistic use situation will be a core aspect of the reliability tests, and is therefore not covered here.

Initially in the test program, the peers ping each other to test that communication between peers is possible, especially between the peers that were connected through the bootstrap peer. Following this, two of the peers test that archiving a file is possible by each archiving five different sites, using random URLs. The archiving peer then checks that the file has been duplicated across all the connected peers. Finally, two peers check that they can fetch files from the DHT using the file's hash, by each fetching five files.
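The shape of the test program can be sketched as follows, reusing the assumed helpers from section 5.3.2; the details of the real functionality-test.js may differ, and the bootstrap address handling is simplified.

const assert = require('assert');

async function functionalityTest(urls) {
  // Four peers on one machine; the first acts as the bootstrap peer.
  const bootstrap = await createPeer();
  const peers = [bootstrap];
  for (let i = 0; i < 3; i++) {
    peers.push(await createPeer(bootstrap.multiaddrs[0]));
  }

  // 1. Inter-peer communication: ping between non-bootstrap peers.
  await peers[1].ping(peers[2].peerId);

  // 2. Archiving: store five files and remember their hashes.
  const hashes = [];
  for (const url of urls.slice(0, 5)) {
    hashes.push(await storeData(peers[1], { url }));
  }

  // 3. Retrieval: fetch each file from the DHT through another peer.
  for (const hash of hashes) {
    const content = await findData(peers[3], hash);
    assert.ok(content.url, 'file should be retrievable by its hash');
  }
}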

Running the Tests

This test can be run by executing the following command:

$ yarn func

Results

The results from an example run of this test are shown in appendix A. As the test output shows, the peers are able to communicate, archive files and retrieve them, and the files are duplicated across the network of peers. However, in this particular run, the archiving of a file fails in one of the ten times it is attempted. This is an error that appeared semi-frequently in the automated tests, especially in the more complex tests run for reliability and performance, but it never occurred while manually testing the system via the front-end application. While the wording of this error does not reveal much about why it occurs, it often coincides with various timeout errors, either in the Puppeteer code or while adding a file to the DHT, as in this particular case. This, coupled with the fact that the errors tend to occur during computationally heavy tasks, such as running multiple peers at once on one machine, seems to indicate that the error is caused by a lack of processing power at the time of execution. The program has enough peers to add the value to, as it successfully adds files to the other peers both before and after the error occurs. Because of this, and because the error rarely occurs when using the front-end app, no significant effort was made to fix it in the current prototype of the system. As a result, it can be argued that the system's core functionality works, but not necessarily as well as it should.

6.1.3 Reliability

The reliability tests were implemented using Docker 1 and Python 2 scripts. Docker is a tool that allows for the creation, deployment and use of software through so-called containers. A container is a bundle containing all the software and dependencies needed to run a program. In this, containers operate similarly to virtual machines, but unlike virtual machines, they share the host operating system kernel, which makes them both faster and more lightweight [67]. Python is a high-level programming language with a simple and straightforward syntax that makes code fast and easy to both write and read.

1 https://www.docker.com/
2 https://www.python.org/

There is also a Python library 3 available that allows the developer to run Docker commands from within a Python program, making it easy to keep track of and work with several containers at once. To simulate a P2P network of separate but connected machines, we used a Docker bridge network 4, which allows the containers connected to it to communicate in isolation from other containers. A bridge network also makes it possible to disconnect and connect running containers on the fly, making it a good fit for simulating an unreliable P2P network.

Reliability was tested in two similar, yet different ways. The first test disconnected and reconnected peers to the bridge network, while the second removed containers entirely and added new ones to the network. Each approach is described in more detail in the following sections. These tests seek to examine file availability in a simulation of a real system state. This means that peers should be unreliable and come and go at random intervals, much like the peers in a real P2P network. As such, there is no need for any peer other than the initial one to perform any actions. The intention was to use these peers to create an unreliable network, and having them archive and fetch sites would merely add unnecessary complexity. Therefore, the only peer that performs any actions beyond connecting to the network is the initial peer, as described below.

Both tests spawn ten Docker containers and attach them to the pre-made bridge network distarch. In the Python script, the network and an array with the containers are fetched using the Docker library. The first container in the array is used as the initial peer and runs the setup file first. Every other peer connects to this peer and runs the same setup file. Because the tests run every peer on the same machine, it was not possible to run a larger and more realistic number of peers at one time. The result is a network of peers that is not realistic compared to an imagined real use situation, where potentially hundreds of peers would be connected, but for an initial test it was enough to show whether or not the files remain available in an unreliable network over time.

In both tests, peers leave and join the network at random intervals of between 20 minutes and one hour. Twenty minutes was chosen as the lower bound because it was important that not every peer left the network too fast. This way, at least one peer containing a given file should remain alive to successfully replicate the file at the hourly republish. Similarly, it was also important that some peers left during the tests, and so the upper bound of one hour was chosen to make sure that there were some changes to the network at least every hour.

setup.js

This script runs the peers continuously while the test program takes care of disconnecting and connecting peers. Upon running the script, a peer is created and connected to the network.

3 https://docker-py.readthedocs.io/en/stable/index.html
4 https://docs.docker.com/network/bridge/

The initial peer, after a timeout of ten minutes to allow all the peers to connect, archives 20 files and saves their hashes in an array. To be able to run this script with a larger number of peers, it was decided to exclude the file handling from these tests. This decision was made because running Chromium in the Docker images used too much memory and caused the tests to crash. Therefore, only the URL is archived in the DHT, which does not have any impact on reliability: the important part of the reliability tests is to see that the object stored in the DHT is replicated and maintained. Following the archival, the script runs for five hours, printing the average number of copies of the files every fifteen minutes. This is logged to the console, which can then be accessed from the test program.

Running the Tests

To run the tests, it is necessary to have Docker installed and to build a Docker image of the system. Additionally, both tests require a Docker network named distarch. Building the image and creating the network can be done with the following commands:

$ docker build --tag distarchive:v.0.1 .
$ docker network create distarch

Following this, it is possible to run each of the tests, as described in the sections below.

Static Peers

The goal of this test is to ensure that files stay available on peers, even if the peers leave the network and rejoin at a later time. Static peers, in this context, means that the same set of peers is used throughout the test. To simulate an unreliable network, a random peer is disconnected from, or reconnected to, the network at random intervals, as described above. The following command will run the test:

$ yarn reliability-static

Results

As can be seen in figure 6.1, some of the files that were previously available on a peer remain so. However, the files are not republished to match the number of peers along the way. This test therefore shows that files can still be accessible from peers that have been disconnected and reconnected to the network, but that this is not always the case. As the graph shows, the average number of file copies rises and sinks in step with the number of peers, which may be because the containers that run the peers never stop; they merely disconnect from the network. When a container is reconnected to the network, the same peer process is still running, and the other peers in the network can keep communicating with it the same way they did before it disconnected. As for reliability, this test shows that a certain number of copies of a file is maintained in the DHT over time, but that the number of copies may not match the number of peers.

Figure 6.1: Static reliability test results

Dynamic Peers

As the test using static peers does not check whether the values in the DHT are republished to new peers every hour, the goal of this test was to make sure that this is the case. This test uses dynamic peers, meaning that it removes random peers entirely from the system and adds new ones throughout the test. This is once again done at random intervals, as described. The test can be run with the following command:

$ yarn reliability-dynamic

Results

The initial run of this test revealed that the values in the DHT did not seem to be republished every hour, an unfortunate side effect of using a pre-existing implementation of the DHT with incomplete documentation. This discovery led to an update of the code to include a simple function in the PeerHandler that runs as long as a peer is alive, and republishes the peer's values once every hour. While this solution is not optimal and does not operate as part of the DHT itself, it provided the basic functionality needed to achieve a higher level of reliability. As shown in figure 6.2, the average number of file duplicates increases to match the number of peers as more peers connect. However, it must be noted that this functionality, like the errors discussed in section 6.1.2, seemed to occasionally suffer under the more computationally expensive operations of the reliability tests.

Figure 6.2: Dynamic reliability test results

While the result of these tests is that the system is not as reliable as the project goals outline, it does highlight a few important issues. One of the most important is that although there are many advantages to using a pre-existing implementation of a system feature, this comes at the cost of control and a deeper understanding of the system. Additionally, the system does not perform well when the processor is under stress, which would need to be addressed in any further development. All in all, this reliability test showed that the system was not reliable; this was then amended with a solution that increased the reliability, but the test still revealed a larger weakness in the system as a whole.
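The workaround can be sketched as a simple timer in the PeerHandler; names are assumed, and the real function may differ.

const REPUBLISH_INTERVAL = 60 * 60 * 1000; // one hour, per Kademlia

function startRepublishing(peer, publishedEntries) {
  // publishedEntries: Map from key buffer to value buffer for every
  // key-value pair this peer has put into the DHT.
  return setInterval(async () => {
    for (const [key, value] of publishedEntries) {
      try {
        await peer.contentRouting.put(key, value);
      } catch (err) {
        console.error('republish failed:', err.message);
      }
    }
  }, REPUBLISH_INTERVAL);
}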

6.1.4 Performance

The performance tests were implemented in JavaScript, with several functions to time the various user actions and print the results. Two user actions were timed: archiving a site and fetching it from the DHT. To separate the peer handling from the file handling, the archiving of a site was split into two separate functions, one to capture the site and one to store the file in the DHT. This was done because archiving is the most time-consuming user action, and as it is performed in two steps, it was useful to learn which step consumed the most time. In the tests, each action is performed and timed 20 times, before the average and median are calculated and printed along with the shortest and longest times.

To make the user actions easy to test, they were not measured from click, but rather by calling the back-end functions that perform the given action from a separate test program. As a result, there is an additional delay of a few milliseconds when performing the user actions in the front-end app, caused by the communication between the front-end and back-end. This delay, however, is so short that it does not have any significant impact on the results. Additionally, all tests were performed in the same way, meaning that it is possible to predict what the actual response time from click would be. For the site archiving and storing there is also a short delay added by storing the data needed to perform the remaining actions later in the test.

Initially, the plan was to run the performance tests on multiple remote machines to capture a realistic delay in peer communication. However, as it was difficult to set up a larger network of machines, the performance tests were executed with two separate machines, which gives a more realistic communication delay than running the tests on one machine would.
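Each measurement can be sketched as below: run the action 20 times, collect the durations, and report the statistics shown in the tables that follow (the helper name is an assumption).

async function timeAction(label, action, runs = 20) {
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    await action();
    // Convert nanoseconds to seconds.
    times.push(Number(process.hrtime.bigint() - start) / 1e9);
  }
  times.sort((a, b) => a - b);
  const average = times.reduce((sum, t) => sum + t, 0) / runs;
  // Median of an even number of runs: mean of the two middle values.
  const median = (times[runs / 2 - 1] + times[runs / 2]) / 2;
  console.log(label, {
    shortest: times[0],
    longest: times[runs - 1],
    average,
    median,
  });
}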

Running the Test

This test requires at least two terminals or machines to run, to ensure that the files are copied to more than just the local machine. Like the CLI, this test can be run either locally on one machine with multiple terminals, or across multiple machines. To run locally, open as many terminals as desired and run the following commands.

In the first terminal:

$ yarn perf -f

To use more than two terminals, run the following command in as many terminals as desired (this step can be skipped):

$ yarn perf

In the last terminal, which will perform the timing:

$ yarn perf -l

To run the test on remote machines, the commands are largely the same, except that the IP address of the initial peer must be added. On the first machine:

$ yarn perf -f

which will print something like this:

> peer started. listening on addresses:
> /ip4/127.0.0.1/tcp/10333 QmcrQZ6RJdpYuGvZqD5QEHAv6qX4BrQLJLQPQUrTrzdcgm
> /ip4/192.168.1.251/tcp/10333 QmcrQZ6RJdpYuGvZqD5QEHAv6qX4BrQLJLQPQUrTrzdcgm

Then, using the IP address, in this case /ip4/192.168.1.251/tcp/10333, run the following command on any number of remote machines (this step can also be skipped):

$ yarn perf --ip /ip4/192.168.1.251/tcp/10333

On the last machine, the following command will run the tests:

$ yarn perf -l --ip /ip4/192.168.1.251/tcp/10333

Archiving File From URL

The most time-consuming user action, as shown in table 6.1, is archiving a site.

Shortest  1.162 s
Longest   8.641 s
Average   4.1823 s
Median    4.3975 s

Table 6.1: Time it takes to archive a file

With an average and median run time of about 4 seconds, this is above the 1 second limit for keeping the user focused directly on the task without thinking about other things [4, p. 135]. However, it is also well below the 10 second limit for keeping the user focused on the task overall. While there is some variation, with the shortest run time being about 1 second and the longest close to 9 seconds, no run falls outside the 1-10 second interval. The fact that the run time for file saving resides in this interval has implications for the visual design: it is recommended that the user gets some visual feedback if the delay is longer than 1 second, to indicate that the system is working. To that end, we have implemented a spinning wheel animation that is visible while a new URL is being archived, as discussed in section 5.4.3.

Currently, the prototype uses Puppeteer and Chromium to fetch a screenshot of the site, which is the most time-consuming part of the archival. Storing the file in the DHT does not take long, as discussed below. Seeing as the choice to use Puppeteer was made with the temporary solution of fetching a screenshot in mind, and because the delay is not too long, no attempt was made to alter the code to use a faster solution. In a more developed version of the system, different tools for site archival would likely need to be examined, which would require a new performance test. As this functionality is not one of the main concerns of this project, the performance of the site archiving as it stands was deemed satisfactory.

Saving File to DHT

Shortest  0.097 s
Longest   0.839 s
Average   0.31595 s
Median    0.3005 s

Table 6.2: Time it takes to store file in DHT

Unlike the site capturing, the DHT and its functionality are among the main concerns of this project, and so it was decided to measure the DHT-specific action of storing a file on its own. As can be seen in table 6.2, this action takes less than a second, about 0.3 seconds on average. This time is well below the 1 second limit for keeping the user focused, and this action is therefore considered performant.

Fetching File From DHT

Shortest  0.05 s
Longest   0.379 s
Average   0.1618 s
Median    0.154 s

Table 6.3: Time it takes to fetch file

The second DHT action, fetching a file, performs as well as, if not better than, storing. As shown in table 6.3, the average time for fetching a file from the DHT is about 0.16 seconds. This, together with the storing results above, shows that content routing with a DHT is a highly efficient way to replicate, store and fetch files in a network of peers.

6.1.5 Summary of Implementation Tests

This section has described the automated tests that were created to test the system prototype, and their results. The tests primarily addressed three different areas of the implementation: the core functionality, the system's reliability, and its performance. In the functionality tests, multiple peers were spawned in the same script, before peer communication, site archiving, duplication and fetching were tested repeatedly. The result of this test is that all the core functionality works well, except for the occasional failure of the site archival, likely caused by a timeout due to the heavy workload on the processor.

The reliability tests used Docker containers and a Docker network to create multiple containers that could easily be disconnected from and reconnected to the network, simulating an unreliable network of peers. The initial reliability tests revealed a weakness in the DHT, namely that its values were not republished every hour as the Kademlia algorithm specifies. To remedy this, a function was implemented in the peer handler to republish any files archived by a peer every hour. Both reliability tests show that this works as intended.

Finally, the performance tests timed the main user actions: site archival and fetching from the DHT. The archiving was split in two to separate the performance of the file handler and the peer handler. The results show that capturing the screenshot of the site is the most time-consuming action, taking about 4 seconds on average, which has been compensated for by adding a loading icon to the GUI. However, as this action rarely took more than 10 seconds, and the functionality of the file handler would inevitably have to be changed in a later version of the system, this was deemed acceptable. On the other hand, both saving to and fetching from the DHT took less than one second on average, which shows that the DHT is performant.


6.2 Graphical User Interface Evaluation

User friendliness, in this project, was evaluated through both performance and the visual design. Performance, and therefore the design principle responsiveness, has already been discussed above, and this section examines the GUI in light of the remaining principles. The GUI was evaluated in two ways: first by conducting a few tests with real users through a questionnaire, and then by giving a brief analysis of the system in relation to the design principles. The answers from the questionnaire were used in the analysis to highlight both strengths and weaknesses in the design.

6.2.1 Survey

To get some user input, a small survey with an online questionnaire was used. Questionnaires are often used to gather quantitative data, distributed to a large number of users so that statistics can be generated from the responses, but they can also be used to gather qualitative data [68]. For this project, qualitative data were desirable to get detailed feedback and more insight into the users' thought processes. Therefore, the questions in the questionnaire were designed like semi-closed interview questions, where the interviewee is encouraged to elaborate rather than give short answers that are easy to measure. While user testing was never intended to be a large part of this project, the original plan was to conduct a focus group with actual users; as the Coronavirus pandemic occurred towards the end of the project, however, planning any type of in-person user testing proved difficult. It was also decided to avoid any tests that would require the user to download, install and run the application on their own machine, both because it was desirable to test with users who have no experience running software in this way, and because sticking to one test saved time and effort. The result is that the user test was rather limited. Nevertheless, it provided some user input on the visual design, which was very valuable to the analysis.

Questionnaire

The online questionnaire is available in its entirety in appendix B. It starts with a short introduction to the project and the purpose of the survey, as well as an explanation of the questions and a disclaimer that all answers will be deleted at the end of the project. There are five different sections in the questionnaire. The first four each start with one or more screenshots (see figure 6.3) of one part of the GUI, followed by questions that specifically address this part. Some questions ask the user to describe what they would do to perform a certain action, while others ask about their expectations of the system's behaviour.

The final section allows the user to freely write any additional comments they might have.

Figure 6.3: Example of screenshots in questionnaire

The questionnaire was distributed to five test users between the ages of 20 and 27, all acquaintances of the author. Two of the users had a computer science background, and the other three did not. Ideally, test users should be representative of, and as diverse as possible within, the target group [68, pp. 18–19]. However, finding and selecting suitable test users takes both time and effort, and so for the scope and time frame of this project, we decided to keep the user test, and therefore the number of test users, small. One advantage of using acquaintances as test users is that they are easy to reach and more likely to respond. One disadvantage is that there is a higher chance of the answers being influenced by the test users' relation to the author, which must be taken into account when analysing them. Despite this, the purpose of the survey was not necessarily to get objective measurements, but to gather some thoughts on the user friendliness of the prototype's GUI, and our test users were sufficient for this purpose.

Some test users reported as feedback on the questionnaire that they felt there were right and wrong answers to the questions. While this is difficult to avoid with a questionnaire like this, because every part of the GUI is designed with a specific purpose in mind, it does highlight the fact that a questionnaire is not the best way to gather qualitative user input. The test format and the phrasing of the questions can make the users feel like they are taking a test and should answer "correctly" rather than say what they think, which is easier to avoid in a real-life setting where users are allowed to interact with the system more freely.

Because of this, the users were encouraged to be honest, so as to uncover which parts of the design were problematic.

One challenge of using screenshots to evaluate the system is that the users did not get to see any of the visual feedback present in the GUI. Perhaps the biggest issue here is that the error messages were not evaluated at all, as these are an intricate part of the user actions. The user's understanding of them is interconnected with how they interact with the system, and so they were left out of the questionnaire entirely. For that reason, they are not examined in the analysis, even though they are an important part of the vocabulary, and therefore of the clarity of the GUI.

Results

Overall, the feedback on the GUI was positive, and the test users seemed to understand how to use it and what the intention of the various elements was. Most importantly, all of the test users' descriptions of how they would archive and share a site matched the actual functionality of the system. When asked to describe in detail what they would expect to happen when clicking the icons, the test users also seemed to expect the pop-ups, particularly the one that asks the user to confirm before deleting a file. This shows that this behaviour is expected, and may indicate that pop-ups are a good solution for actions such as these. Some of the test users expressed confusion about the share icon and the term "Share ID", as well as some of the links, which will be discussed in the next section.

6.2.2 Analysis

This section examines the GUI in relation to the same design principles that were addressed both in the comparisons of similar systems and in section 4.4. As before, responsiveness, and therefore performance, is excluded because it has already been explored in the previous section. The remaining principles, clarity, consistency and simplicity, are analysed in connection with the data from the questionnaire. While all these principles have been addressed in section 4.4, it is interesting to examine whether the assumptions made in the design section reflect what the test users think. It should also be noted that we are analysing our own work, which increases the likelihood of bias affecting the analysis.

Clarity

In our analyses, clarity mainly addressed vocabulary and symbols such as icons, as well as site hierarchy and menus. As the functionality of the prototype in this project is rather limited and does not have any elements or menus competing for the user's attention, the latter two are not relevant here. However, both the icons and the vocabulary should be addressed, as well as some of the links on the home page. As discussed in section 4.4.1, the icons used for the file menu are fetched from Google's Material Design icons, and have been designed for exactly this purpose, as their names "Share" and "Delete" suggest. The use of these icons for this purpose is also widespread across the Internet and can be seen in some of the other systems in the comparison. For example, the share icon is used by Webrecorder, and the delete icon by Pocket (see figure 6.4). One test user remarked that they did not know what the share icon meant because they had never interacted with one before. However, they wrote that "I personally don't know what the network icon means, but by clicking on it I would instantly understand it." This indicates that while there might not be a universal understanding of the icon, its function is made evident through the design.

Figure 6.4: Examples of share and delete icons used in other systems: (a) Webrecorder, (b) Pocket

Choosing the right vocabulary is not easy, as users all have different mental models and may not share the same understanding of what a word means. It was decided to use the word "sites" to refer to the archived sites, as shown through the "My sites" header/link in the GUI. However, the test users also used words such as "archived sites", "articles" and "pages" in their answers, which implies that there is some discrepancy in the test users' vocabulary. This discrepancy should be examined further to find the word that best describes the user files, and that word should then be used consistently. One definite pain point in the system is the term "Share ID": in response to the question "Where do you think you might find a file's "Share ID"?", one user simply wrote "I don't know what that is." This indicates that the term either needs some explanation or should be renamed. Another aspect that evoked different answers from the test users was the function of the "My sites" link and the original URL provided on each file element. Most users expected that the "My sites" link would lead back to the home page, but two users answered differently. They wrote that they would expect a list of the archived sites to show up, and one of them even specified that they would expect the list to be a drop-down menu. As for the original URL, the question "What would you expect to happen if you clicked the URL provided underneath each site?" revealed that only one user expected this to take them to the original website. One user answered that they would expect either the archived site to open or to be taken to the original site, and the remaining users all thought the link would lead to the archived site. One user also pointed out that "It wouldn't be my first instinct to click on the URL unless it was underlined or looked like a hyperlink", which suggests that the intention of this link is not clear. Altogether, the clarity of the design is good, and most of the test users understood the use of icons and vocabulary. However, there were some exceptions that should be taken into consideration. As mentioned, it would be beneficial to conduct further tests where the users are allowed to interact with the system. The screenshots presented to the test users through the questionnaire have no on-hover visual feedback, such as the changing cursor or hyperlink underlines, that may help indicate to the users what the intention of the various elements is. In particular, the icons, links and error messages should be examined further in more extensive user tests.

Consistency

The consistency of the design, in terms of colours, fonts and structure, has not changed much since the initial wireframe. There were also no questions in the questionnaire that directly address consistency in the same way as clarity, and so this section reflects on the final design compared to the wireframe. The same colours have been used, with the addition of red to mark error messages and the delete button in the pop-up. While this colour deviates from the main colour scheme, it was added to grab the user's attention and differentiate these elements from the rest of the design [25, pp. 706–707, 27, p. 81]. The font has not been changed and remains the same across the various parts of the GUI. Colour, capital letters and text decoration such as underlines have been used to emphasise, or downplay, the various pieces of text.

Simplicity

Simplicity, like consistency, was not directly addressed in the questionnaire, and will therefore also mainly be related to the discussion in section 4.4.3. The files are still the most prominent part of the GUI, and the amount of information for each file is the same as in the wireframe. One question in the questionnaire that does address this is "Is there any information you would wish to be displayed here that isn't already?", which was asked about the display page specifically. The display page was not prototyped in the same manner as the home page, but as mentioned in section 5.4.3, the design was kept simple, and no new information is displayed here that was not already available on the home page. In response to the question, four of the test users wrote that they did not miss any information and that the design was both straightforward and succinct. One test user wished for a timestamp on the display page showing when the site was archived, which should be considered in a more developed version of the system. Overall, the simplicity of the design is good, which seems to be reflected by the test users in the few answers that relate to this design principle. One test user also remarked in the additional comments that "I like that the design is simple. It makes it easier to navigate."

6.2.3 Conclusion of Graphical User Interface Evaluation

This section has looked at the GUI of the prototype in light of the design principles clarity, consistency and simplicity. First, a small survey was conducted, consisting of a questionnaire that was distributed to a small number of test users. The results from this questionnaire were briefly outlined, followed by a more thorough analysis of the design principles, both in relation to the initial wireframe for the system and the answers from the questionnaire. The questionnaire results show that the test users were positive about the GUI and that they largely understood how to use the system and what the icons, words and elements meant. There was some confusion about the share icon and the use of the term "Share ID", as well as a few discrepancies in the understanding of the various links in the GUI, all of which affect the clarity of the GUI. However, the majority of the users understood the intention of the various elements, and their expectations reflected the assumptions made in sections 4.4 and 5.4.3, which indicates that the clarity of the prototype's GUI is good. Consistency and simplicity were not examined as closely in the questionnaire, and since the design has not changed much since the initial wireframe, the conclusion is that the GUI is both consistent and simple. Yet, it must be emphasised that the limited functionality of the prototype is one reason that it satisfies all the design principles, and that it would therefore be necessary to continually keep the principles in mind in further development of the system.

6.3 Conclusion of Evaluation

This chapter has given an overview of the tests that were conducted to evaluate the system and the results from these. The system’s core functionality and the metrics reliability and performance were evaluated with automated tests, and the user friendliness of the GUI was evaluated with a short questionnaire that acted as the basis for analysis in light of the design principles clarity, consistency and simplicity. The following list gives an overview of the tests that have been performed and what the results were.

Functionality The purpose of the functionality tests was to check if the core functionality, as outlined in section 4.1, works as described. This test ran four peers locally and tested peer communication through pings, site archiving and fetching. As the results show, all of this works as intended, with one minor weakness in the site archiving that is likely due to a lack of processing power as the test runs multiple peers on one machine.

Reliability Reliability was tested by simulating an unreliable network of peers and checking file availability over time. Two different simulations were run: one where peers were disconnected and reconnected to the network, and one where peers were killed and new ones were added to the network. The results from these tests were somewhat mixed and revealed that the DHT implementation used in the system did not republish values every hour as it should. As a result, a rudimentary republish function was implemented in the peer handler. The final results show that the number of file copies rises and falls in tune with the number of peers, but that there may not be as many copies of a file as there are peers. Altogether, the tests show that the system is somewhat reliable and that this is an aspect that needs further development and testing.
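To make the republishing behaviour concrete, the following is a minimal sketch of what such an hourly republish loop in the peer handler could look like. It is written in Node.js to match the prototype, but the identifiers (localFiles, dht.put) are illustrative assumptions, not the thesis's actual code.

// Hypothetical sketch of an hourly republish loop; localFiles and the
// dht.put signature are assumptions made for illustration.
const REPUBLISH_INTERVAL_MS = 60 * 60 * 1000 // one hour

function startRepublishing (dht, localFiles) {
  return setInterval(async () => {
    // Re-put every locally archived file so copies lost to churn are restored.
    for (const { key, value } of localFiles()) {
      try {
        await dht.put(key, value)
      } catch (err) {
        // A failed put is simply retried on the next hourly cycle.
        console.error('republish failed:', err)
      }
    }
  }, REPUBLISH_INTERVAL_MS)
}

Leaving republishing to a timer in each peer is a blunt instrument compared to the coordinated republishing Kademlia specifies, which is why the text above calls it a rudimentary solution.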

Performance Performance was tested by measuring the two main user actions: archiving a site and fetching a file from the DHT. The former was further split in two to time the file handling and the peer handling separately. The results show that both DHT actions complete in less than 1 second, and the DHT is therefore performant. Archiving a site takes longer, with an average run time of about 4 seconds, which has been compensated for in the design with a loading animation. As this is still below the 10-second limit for keeping the user's attention, and because this functionality is still incomplete, this was deemed acceptable, and the system is therefore considered performant.
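Measurements of this kind can be gathered with a few lines of Node.js; the sketch below shows one way the two user actions could be timed. The archiveSite and fetchFile names are placeholders for the prototype's actual functions, not a reproduction of its test code.

// A minimal timing harness using Node's perf_hooks module.
const { performance } = require('perf_hooks')

async function timeAction (label, action) {
  const start = performance.now()
  await action()
  console.log(label + ': ' + Math.round(performance.now() - start) + ' ms')
}

// Usage (placeholder function names):
// await timeAction('archive site', () => archiveSite(url))
// await timeAction('fetch file', () => fetchFile(fileId))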

User friendliness To evaluate the user friendliness, a short questionnaire was distributed to a small number of users, and the answers were then used to analyse the GUI of the system. The results show that the test users understood how to use the system, and the feedback given was largely positive. Some things were unclear to some users, such as the intention of some of the links and the vocabulary used, which indicates that the GUI should be evaluated more thoroughly. Additionally, none of the more interactive elements of the GUI, like the loading animation and the error messages, were evaluated, and these are all aspects of the design that could affect the usability. Overall, the results from the test and analysis show that the GUI is user friendly, but that more testing is necessary.

As these results show, the system largely meets its goals. The remainder of this section examines the evaluation metrics in detail to uncover which parts of the system work well and which do not work to a satisfactory degree. Each of the evaluation metrics is addressed in turn and discussed in light of the test results.

The shortcomings in terms of reliability are one of the system's biggest weaknesses. With the added functionality to republish values every hour, the system meets the criteria, but as the reliability tests show, this is not a stable, long-term solution. This project shows that a DHT is a good fit for a P2P file system, but the results from the reliability tests show that using a pre-existing implementation of a DHT may come with some critical consequences. In a more developed version of the system, the peer handling has to be addressed, either by implementing or finding a more suitable DHT.

Performance-wise, it is evident from the tests that the DHT is performant, which underlines that this is a good structure for a P2P file system. However, the performance tests in this evaluation do not take into account that the machines in the system may, and most likely would, be located across greater distances, and that there would be many more machines in the network. Both of these factors may have an impact on the performance, as the messages have to travel further and file routing takes more steps. The result of our performance tests is therefore that the DHT is performant to a degree, but additional tests with more machines are necessary before this can be concluded with confidence. The performance of the site archiving is another shortcoming of the system. It was, however, deemed acceptable, both because it did not exceed 10 seconds, and because this functionality would have to be revised in a more developed version of the system anyway. This waiting time, like other aspects of the GUI that demand user interaction to evaluate, was not evaluated in the user tests, which it probably should have been to back up the claim that the waiting time is acceptable.

While the system can be deemed user friendly, with support in both the user tests and the design principles, there are still some aspects that were not evaluated and that may impact the user friendliness. As previously mentioned, this mainly entails the parts of the GUI that are interwoven with user interaction, like the waiting times, error messages and visual cues such as hover effects. The error messages are particularly important here, as they can be a cause of frustration with the system, especially if they are not clear enough. Some even argue that error messages should be eradicated [66, pp. 644–648], which means that this aspect of the GUI should be examined closer.

Part III

Conclusion and future work

Chapter 7

Conclusion

7.1 Summary

The goal of this thesis was to suggest and implement a new Internet archiving system, based on P2P technology and intended for both personal and social use, that is reliable, performant and user friendly. The project is rooted in several disciplines, primarily the collaborative and robust nature of P2P systems, the cultural significance of Internet archiving, social use of a system through sharing, and the importance of user friendliness. The system is influenced by a long tradition of Internet archiving, and like many of the existing systems, it was motivated by the unreliable nature of the Internet and the client-server structure it is based on. As a solution to this, we decided to create a P2P system, which eliminates the need for a central server and can ensure that files stay available through replication, even in an unreliable network of peers. In this project, we have examined and compared both P2P and Internet archiving systems, both to anchor our project in existing systems and to get an idea of what works and what does not in similar systems. Through this, it became evident that many good solutions already exist, but that none of them combines the core elements of our proposed system, namely reliability through file duplication and user friendliness, both in terms of performance and the GUI. These comparisons and the subsequent analysis became the basis for our system design, influencing both the system structure and its GUI. The network of peers in our system is organised using a pre-existing implementation of the Kademlia DHT, which takes care of peer communication and content routing. The application itself has a two-tier architecture, with a clear separation between the back-end and front-end. The back-end is implemented with Node.js, using the Express library to create an API for the front-end to communicate with. Additionally, there is a file handler, which takes care of archiving and fetching local files, and a peer handler, which is in charge of adding and fetching content from the DHT. The front-end is implemented with React, and displays the user files in the browser, allowing the user to archive sites and view, delete and share their existing files.
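To illustrate the two-tier structure, the following is a minimal sketch of the kind of Express API the back-end could expose. The route names and the fileHandler/peerHandler call signatures are assumptions made for illustration; they do not reproduce the prototype's actual code.

const express = require('express')
// Hypothetical handler modules standing in for the prototype's file and
// peer handlers.
const fileHandler = require('./fileHandler')
const peerHandler = require('./peerHandler')

const app = express()
app.use(express.json())

// Archive a site: the file handler captures and stores it locally,
// and the peer handler replicates it into the DHT.
app.post('/archive', async (req, res) => {
  const file = await fileHandler.archive(req.body.url)
  await peerHandler.store(file.id, file.contents)
  res.json({ id: file.id })
})

// Fetch an archived file by its ID, via the DHT if it is not local.
app.get('/files/:id', async (req, res) => {
  res.json(await peerHandler.fetch(req.params.id))
})

app.listen(3000)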

7.1.1 Results

Our main evaluation metrics were reliability, performance and user friendliness.

Reliability Experimental results show that files remain available to an extent, even in an unreliable simulated environment where peers continuously leave and join the system. However, it appeared that the Kademlia implementation that we used does not actually republish values every hour, as expected. This was therefore implemented in the peer handler, which leaves the republishing of values up to the individual peer, rather than the DHT itself. The result is that the system is perhaps not as reliable as first intended, even if the test results show that files are republished as they are supposed to be. Another limitation of the reliability tests was the inability to test this in a real environment with a larger number of peers.

Performance The performance was evaluated by timing the two main user actions: archiving a site and fetching files from the DHT. Site archiving was split in two to separate the file handling from the peer handling. Therefore, one test shows how long it takes to capture the site and save the file locally, while the other shows how long storing the file in the DHT takes. The results show that archiving a site takes on average about 4 seconds, which is slow enough that it was necessary to add a loading animation to the GUI, to indicate to the user that the system is working. However, it is not so slow that the user will lose their focus completely, and so it is still acceptable. Both storing and fetching a file from the DHT take on average less than 1 second and are therefore sufficiently performant.

User Friendliness The user friendliness was not evaluated as closely by actual users as initially intended, and the evaluation of the GUI was mainly based on the design principles by which we analysed existing systems. However, a few test users answered a short survey to provide some user insight. The feedback from these tests and the result of our analysis show that the system is user friendly. Given more time, it would be preferable to perform more extensive user testing to receive more feedback, but for an initial prototype, the results are satisfactory.

Final Conclusion

The prototype we have created throughout this project illustrates that it is possible to create a P2P Internet archiving system that is reliable to a degree, user friendly, and facilitates sharing. Most importantly, the project shows how P2P technology and personal Internet archiving systems can be combined, and which benefits can come from doing so. The goal of the project was met through our prototype, and it is a good starting point for a more developed system that can facilitate a community based on stable, archive-based content sharing, where any user can be certain that the copies they save, share and receive will not change over time or disappear.

7.2 Limitations

Throughout the previous part of the thesis, there were mentions of shortcomings that should be addressed in a more developed version of the system. Perhaps the most obvious of these is the decision to use screenshots, instead of text, images and styling, to portray the archived sites. Other examples include the decision not to use a proper database, not allowing the user to choose how much storage space to give up or whether to archive a file locally, and the lack of file encryption. However, as the system implemented in this thesis is a prototype whose intention was to reach specific goals, these compromises were made along the way to save time and effort and to keep the focus of the project on the goals. One decision that has had a major impact on both the development and the final results of the project was the decision to use a pre-existing implementation of a DHT. Using libp2p to implement the peer handler made the implementation much easier and allowed us to quickly set up the system, but it turned out to have a large impact on the reliability of the system. The reliability tests revealed that stored values were never republished, so the number of copies of a given file was not maintained, which resulted in poor reliability and had to be remedied with a function in the peer handler. Another limitation was the scope and execution of the usability tests. The original plan was to perform user tests in person with a focus group, but as the final part of the project coincided with the Coronavirus pandemic, it ended up being difficult to perform tests where the users interacted with the system. The latter was also made difficult by the fact that the system has to be run using a terminal and a package manager, and not all the test users had prior experience with running programs this way. Therefore, the user friendliness tests were limited and probably biased, both due to the few, homogeneous test users and the fact that we were evaluating our own work.

7.3 Perspective

Our project concludes that it is possible to create a P2P Internet archiving system for personal and social use that is reliable to a degree, performant and user friendly. Such a system is also necessarily rooted in the history of both P2P and Internet archiving systems, and in this particular case, it is also directly inspired by existing systems from both categories. This thesis has been about creating the system and examining whether it works, but it is also necessary to reflect on the advantages and disadvantages of the system in a broader perspective, which is the goal of this section.

One advantage of a P2P structure that has not been emphasised in this thesis is the added layer of privacy that comes with a system that has decentralised control. Archiving your files on a central server, or even on distributed servers controlled by one organisation, requires trust not only in the organisation's ability to keep the servers online, but also in its ability to keep the files safe from tampering, hackers and the organisation itself. Given that the files are encrypted, and that there are other anti-tampering measures in place such as content addressing, a P2P system can add security that may be desirable to users.

The system was originally intended to be implemented as a browser extension, and though the prototype from this project ended up as a more rudimentary version of the system, this is still considered a good end goal. One of the test users even remarked on this in the additional comments section of the questionnaire: "I think this would be a handy tool to have as an extension on my Chrome." Implementing the system as a browser extension would also make it possible to archive sites on the fly as the user visits them, by implementing a widget for the browser menu bar.

While there are many advantages to such a system, it is difficult to predict whether it would be used by a substantial number of users. This thesis has determined that it is possible to create the system, but whether it is a system that users want is a question that would have to be answered before any further development. Understanding what the users need is an important first step in any software development process, and it is a question that this thesis does not answer. It must be determined whether users distrust the existing client-server Internet archiving systems, and thus whether there is a space to be filled by a P2P system. One prediction that can be made is that the potential users of the system may be people who already know what P2P technology is and understand its benefits over centralised client-server solutions. Other users, who have no prior knowledge of P2P systems, may be intimidated by the unfamiliar structure and seek other, more traditional solutions instead. To that end, it is important to consider the target group, as user friendliness throughout the entire process of using the system, including onboarding, is very important if the system is also aimed at more inexperienced users. However, it may be desirable to narrow the target group down and create a system for those already familiar with the technology.
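As a brief illustration of the content-addressing point above: the file hashes printed in appendix A are 64 hexadecimal characters, consistent with SHA-256. Assuming the ID is a hash of the archived file's contents, a fetched file can be verified against the ID it was requested by, as in the sketch below; this is an illustration of the principle, not the prototype's actual verification code.

const crypto = require('crypto')

// If the file ID is the SHA-256 digest of the file contents, any
// modification of the contents changes the ID and is detectable.
function fileId (contents) {
  return crypto.createHash('sha256').update(contents).digest('hex')
}

function verify (id, contents) {
  return fileId(contents) === id
}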

7.4 Future Work

The prototype that is the ultimate result of this project is fairly rudimentary, and as the evaluation shows, it would benefit from further testing and implementation changes. The first step in this process would be extensive user testing, both in terms of the system's user friendliness and concerning whether this is a system that users want, and which users might be interested in it, as discussed above. In terms of user friendliness, the system does well according to our evaluation criteria, but it still has a long way to go when it comes to accessibility and responsiveness. Both of these are important to consider when creating software and should be addressed in a more developed version of the system. After this, some important updates to the functionality of the system are file encryption, a new DHT implementation and making the system a browser extension. File encryption addresses both the tampering and licensing issues discussed in this thesis and the privacy aspect mentioned in the previous section. Any copyright issues with the archiving and sharing of Internet content also need to be addressed in greater detail. A new DHT implementation could improve the reliability, and the network connection code could be changed to something less dependent on a single initial peer. Finally, implementing the system as a browser extension could improve its usability and make it possible to test the system in a more realistic use situation.
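As a starting point for the file encryption mentioned above, a minimal sketch using Node's built-in crypto module is given below. It assumes an authenticated cipher (AES-256-GCM) and leaves key management, which would be a design question of its own, entirely open.

const crypto = require('crypto')

// Encrypt file contents with AES-256-GCM; key must be a 32-byte Buffer.
// The authentication tag makes tampering with the ciphertext detectable
// on decryption.
function encrypt (key, plaintext) {
  const iv = crypto.randomBytes(12)
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv)
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()])
  return { iv, ciphertext, tag: cipher.getAuthTag() }
}

function decrypt (key, { iv, ciphertext, tag }) {
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, iv)
  decipher.setAuthTag(tag)
  return Buffer.concat([decipher.update(ciphertext), decipher.final()])
}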

Bibliography

[1] Roger S Pressman. Software engineering: a practitioner's approach. Palgrave Macmillan, 2005.
[2] Petar Maymounkov and David Mazieres. 'Kademlia: A peer-to-peer information system based on the xor metric'. In: International Workshop on Peer-to-Peer Systems. Springer. 2002, pp. 53–65.
[3] Kevin Leffew. A Brief Overview of Kademlia, and its use in various decentralized platforms. 15th Feb. 2019. URL: https://web.archive.org/web/20191206104320/https://medium.com/coinmonks/a-brief-overview-of-kademlia-and-its-use-in-various-decentralized-platforms-da08a7f72b8f (visited on 06/12/2019).
[4] Jakob Nielsen. Usability engineering. 1st ed. Elsevier, 1994.
[5] George Coulouris et al. Distributed systems: concepts and design. 5th ed. Pearson Education, 2012.
[6] Jörg Eberspächer and Rüdiger Schollmeier. 'First and Second Generation of Peer-to-Peer Systems'. In: Peer-to-Peer Systems and Applications. Ed. by Ralf Steinmetz and Klaus Wehrle. 1st ed. Springer-Verlag Berlin Heidelberg, 2005, pp. 35–56.
[7] Antony Rowstron and Peter Druschel. 'Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems'. In: IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing. Ed. by Rachid Guerraoui. Springer. Berlin, Heidelberg, 2001, pp. 329–350.
[8] Ben Yanbin Zhao, John Kubiatowicz, Anthony D Joseph et al. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. 2001.
[9] Jörg Eberspächer and Rüdiger Schollmeier. 'Past and Future'. In: Peer-to-Peer Systems and Applications. Ed. by Ralf Steinmetz and Klaus Wehrle. 1st ed. Springer-Verlag Berlin Heidelberg, 2005, pp. 17–22.
[10] Jennifer Zahn. The life and death of LimeWire. 4th Nov. 2010. URL: https://web.archive.org/web/20191008120111/https://marquettewire.org/3777966/tribune/marquee/the-life-and-death-of-limewire-mr1-se2-je3/ (visited on 08/10/2019).
[11] Peter Druschel and Antony Rowstron. 'PAST: A large-scale, persistent peer-to-peer storage utility'. In: Proceedings Eighth Workshop on Hot Topics in Operating Systems. IEEE. 2001, pp. 75–80.
[12] Ernesto Van der Sar. Netflix Dominates Internet Traffic Worldwide, BitTorrent Ranks Fifth. 17th Nov. 2018. URL: https://web.archive.org/web/20190930121343/https://torrentfreak.com/netflix-dominates-internet-traffic-worldwide-bittorrent-ranks-fifth-181116/ (visited on 30/09/2019).
[13] Johan Pouwelse et al. 'The bittorrent p2p file-sharing system: Measurements and analysis'. In: International Workshop on Peer-to-Peer Systems. Springer. 2005, pp. 205–216.
[14] Dongyu Qiu and Rayadurgam Srikant. 'Modeling and performance analysis of BitTorrent-like peer-to-peer networks'. In: ACM SIGCOMM Computer Communication Review. Vol. 34. 4. ACM. 2004, pp. 367–378.
[15] The web of tomorrow needs IPFS today. URL: https://web.archive.org/web/20191014102207/https://ipfs.io/ (visited on 14/10/2019).
[16] What is IPFS? URL: https://web.archive.org/web/20191014102912/https://docs.ipfs.io/introduction/overview/ (visited on 14/10/2019).
[17] What is libp2p? URL: https://web.archive.org/web/20191014102947/https://docs.libp2p.io/introduction/what-is-libp2p/ (visited on 14/10/2019).
[18] Paul Gil. Top Torrent Sites. 4th Oct. 2019. URL: https://web.archive.org/web/20191008132916/https://www.lifewire.com/top-torrent-sites-alternatives-to-kat-2483512 (visited on 08/10/2019).
[19] Giovanni Neglia et al. 'Availability in BitTorrent Systems'. In: IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications. IEEE. 2007, pp. 2216–2224.
[20] Chris Hoffman. How Does BitTorrent Work? 13th Apr. 2018. URL: https://web.archive.org/web/20191007100715/https://www.howtogeek.com/141257/htg-explains-how-does-bittorrent-work/ (visited on 07/10/2019).
[21] Bram Cohen. 'Incentives build robustness in BitTorrent'. In: Workshop on Economics of Peer-to-Peer systems. Vol. 6. 2003, pp. 68–72.
[22] Michael Piatek et al. 'Do incentives build robustness in BitTorrent'. In: Proc. of NSDI. Vol. 7. 4. 2007.
[23] Anirudh Ramachandran, Atish Das Sarma and Nick Feamster. 'BitStore: An incentive-compatible solution for blocked downloads in BitTorrent'. In: NetEcon+IBC. 2007.
[24] Tom McCourt and Patrick Burkart. 'When creators, corporations and consumers collide: Napster and the development of on-line music distribution'. In: Media, Culture & Society 25.3 (2003), pp. 333–350.
[25] Wilbert O Galitz. The essential guide to user interface design: an introduction to GUI design principles and techniques. 3rd ed. John Wiley & Sons, 2007.
[26] Debbie Stone et al. User interface design and evaluation. Elsevier, 2005.
[27] Michael O Leavitt, Ben Shneiderman et al. 'Research-Based Web Design & Usability Guidelines'. In: Background and Methodology (2006).
[28] Bradley Mitchell. Gnutella P2P Free File Sharing and Download Network. 16th Oct. 2019. URL: https://web.archive.org/web/20191029112912/https://www.lifewire.com/definition-of-gnutella-818024 (visited on 29/10/2019).
[29] Ernesto Van der Sar. uTorrent is the Most Used BitTorrent Client By Far. 5th Apr. 2020. URL: https://web.archive.org/web/20200610150441/https://torrentfreak.com/utorrent-is-the-most-used-bittorrent-client-by-far-200405/ (visited on 10/06/2020).
[30] Mark Scanlon. 'Study of Peer-to-Peer Network Based Cybercrime Investigation: Application on Botnet Technologies'. PhD thesis. University College Dublin, Oct. 2013. URL: https://www.researchgate.net/figure/Screenshot-of-Napster-Downloads-can-be-seen-at-the-top-with-uploads-at-the-bottom_fig6_321745039.
[31] Fiona Fui-Hoon Nah. 'A study on tolerable waiting time: how long are web users willing to wait?' In: Behaviour & Information Technology 23.3 (2004), pp. 153–163.
[32] Ion Stoica et al. 'Chord: A scalable peer-to-peer lookup service for internet applications'. In: ACM SIGCOMM Computer Communication Review. Vol. 31. 4. ACM, 2001, pp. 149–160.
[33] Eng Keong Lua et al. 'A survey and comparison of peer-to-peer overlay network schemes.' In: IEEE Communications Surveys and Tutorials 7.1-4 (2005), pp. 72–93.
[34] Eli Edwards. 'Ephemeral to enduring: the Internet Archive and its role in preserving digital media'. In: Information Technology and Libraries 23.1 (2004), pp. 3–8.
[35] About the Internet Archive. URL: https://web.archive.org/web/20191014102144/https://archive.org/about/ (visited on 14/10/2019).
[36] About the Memento Project. 23rd Jan. 2015. URL: https://web.archive.org/web/20191104112035/http://mementoweb.org/about/ (visited on 04/11/2019).
[37] Archivebox. URL: https://web.archive.org/web/20191014102106/https://archivebox.io/ (visited on 14/10/2019).
[38] About Archive-It. URL: https://web.archive.org/web/20191014102124/https://archive-it.org/ (visited on 14/10/2019).
[39] Webrecorder About. URL: https://web.archive.org/web/20191014103144/http://webrecorder.io/_faq (visited on 14/10/2019).
[40] Ilya Kreymer et al. Webrecorder pywb 2.2. 10th Apr. 2019. URL: https://web.archive.org/web/20191014112751/https://github.com/webrecorder/pywb (visited on 14/10/2019).
[41] Mat Kelly and Alam Sawood. InterPlanetary Wayback (ipwb). 4th Mar. 2019. URL: https://web.archive.org/web/20191014112816/https://github.com/oduwsdl/ipwb (visited on 14/10/2019).
[42] Jonathan Poltak Samosir et al. WorldBrain's Memex. 27th May 2019. URL: https://web.archive.org/web/20191014112839/https://github.com/WorldBrain/Memex (visited on 14/10/2019).
[43] Michele C. Weigle. On the importance of web archiving. 19th Sept. 2018. URL: https://web.archive.org/web/20191014112932/https://items.ssrc.org/parameters/on-the-importance-of-web-archiving/ (visited on 14/10/2019).
[44] Nettarkivering. URL: https://web.archive.org/web/20191014114252/https://www.nb.no/samlingen/nettarkivet/nettarkivering/ (visited on 14/10/2019).
[45] Maarten van Steen and Andrew S. Tanenbaum. Distributed Systems. 3rd ed. Maarten van Steen, 2017.
[46] Credits: Thank You from The Internet Archive. URL: https://web.archive.org/web/20200613084347/https://archive.org/about/credits.php (visited on 13/06/2020).
[47] Jocelyn Mackie. Overview of Website Copyright Law. 3rd Oct. 2017. URL: https://web.archive.org/web/20191113095305/https://www.termsfeed.com/blog/website-copyright-law/ (visited on 13/11/2019).
[48] Terms of Service. 12th Feb. 2018. URL: https://web.archive.org/web/20191113092223/https://getpocket.com/tos (visited on 13/11/2019).
[49] National Research Council et al. The digital dilemma: Intellectual property in the information age. National Academies Press, 2000.
[50] About Pocket. URL: https://web.archive.org/web/20191014103324/https://getpocket.com/about (visited on 14/10/2019).
[51] Wayback Machine General Information. 23rd Aug. 2018. URL: https://web.archive.org/web/20191014124947/https://help.archive.org/hc/en-us/articles/360004716091-Wayback-Machine-General-Information (visited on 14/10/2019).
[52] Mat Kelly et al. 'InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives'. In: Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries. Hamburg, Germany, June 2016, pp. 411–416.
[53] Todd Spangler. Global Piracy in 2017: TV and Music Illegal Activity Rose, While Film Declined. 21st Mar. 2018. URL: https://web.archive.org/web/20191014114123/https://variety.com/2018/digital/news/piracy-global-2017-tv-music-film-illegal-streaming-1202731243/ (visited on 14/10/2019).
[54] James Allen-Robertson. Digital Culture Industry. 1st ed. Palgrave Macmillan, 2013.
[55] Guy Douglas. 'Copyright and Peer-To-Peer Music File Sharing: The Napster Case and the Argument Against Legislative Reform'. In: Murdoch University Electronic Journal of Law 11.1 (Mar. 2004). URL: http://www.murdoch.edu.au/elaw/issues/v11n1/douglas111.html.
[56] Raymond Shih Ray Ku. 'The Creative Destruction of Copyright: Napster and the New Economics of Digital Technology'. In: The University of Chicago Law Review 69.1 (2002), pp. 263–324. URL: http://www.jstor.org/stable/1600355.
[57] Jemima Kiss. BitTorrent: Copyright lawyers' favourite target reaches 200,000 lawsuits. 9th Aug. 2011. URL: https://web.archive.org/web/20131204002125/http://www.theguardian.com/technology/pda/2011/aug/09/bittorrent-piracy (visited on 03/06/2019).
[58] William Uricchio. Cultural citizenship in the age of P2P networks. na, 2004.
[59] Dan Noyes. Top 10 Twitter Statistics. Apr. 2019. URL: https://web.archive.org/web/20190527095718/https://zephoria.com/twitter-statistics-top-ten/ (visited on 27/05/2019).
[60] Dan Noyes. The Top 20 Valuable Facebook Statistics. May 2019. URL: https://web.archive.org/web/20190527054614/https://zephoria.com/top-15-valuable-facebook-statistics/ (visited on 27/05/2019).
[61] Jan H Kietzmann et al. 'Social media? Get serious! Understanding the functional building blocks of social media'. In: Business Horizons 54.3 (2011), pp. 241–251.
[62] Stefan Stieglitz and Linh Dang-Xuan. 'Emotions and information diffusion in social media—sentiment of microblogs and sharing behavior'. In: Journal of Management Information Systems 29.4 (2013), pp. 217–248.
[63] Chei Sian Lee and Long Ma. 'News sharing in social media: The effect of gratifications and prior experience'. In: Computers in Human Behavior 28.2 (2012), pp. 331–339.
[64] Alexandra Weilenmann, Thomas Hillman and Beata Jungselius. 'Instagram at the museum: communicating the museum experience through social photo sharing'. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM. 2013, pp. 1843–1852.
[65] Peep Laja. 8 Web Design Principles to Know in 2019. 17th Apr. 2019. URL: https://web.archive.org/web/20191014114345/https://conversionxl.com/blog/universal-web-design-principles/ (visited on 14/10/2019).
[66] Alan Cooper et al. About face: the essentials of interaction design. 4th ed. John Wiley & Sons, 2014.
[67] Roderick Bauer. What's the Diff: VMs vs Containers. 28th June 2018. (Visited on 14/06/2020).
[68] Bill Gillham. Developing a questionnaire. A&C Black, 2008.
[69] Ralf Steinmetz and Klaus Wehrle, eds. Peer-to-Peer Systems and Applications. 1st ed. Springer-Verlag Berlin Heidelberg, 2005.

Appendices

Appendix A

Output From Functionality Test

peer started. listening on addresses:
/ip4/127.0.0.1/tcp/10333 QmcrQZ6RJdpYuGvZqD5QEHAv6qX4BrQLJLQPQUrTrzdcgm
/ip4/192.168.1.251/tcp/10333 QmcrQZ6RJdpYuGvZqD5QEHAv6qX4BrQLJLQPQUrTrzdcgm
peer started. listening on addresses:
/ip4/127.0.0.1/tcp/65093 QmeFxniYqpGypKTdERnmJbhhf5YJwwJHKuXt7mW6g9TwmP
/ip4/192.168.1.251/tcp/65093 QmeFxniYqpGypKTdERnmJbhhf5YJwwJHKuXt7mW6g9TwmP
peer started. listening on addresses:
/ip4/127.0.0.1/tcp/65094 QmY7LDtu7Tt8NE9qjR3ME5SdMpdHX4dmjr9SAwo9Ujb6Qc
/ip4/192.168.1.251/tcp/65094 QmY7LDtu7Tt8NE9qjR3ME5SdMpdHX4dmjr9SAwo9Ujb6Qc
peer started. listening on addresses:
/ip4/127.0.0.1/tcp/65095 QmTRvFWfbi7U5jknxrUAmWqQBgkr7oMCLzxxpb9YQc7d1t
/ip4/192.168.1.251/tcp/65095 QmTRvFWfbi7U5jknxrUAmWqQBgkr7oMCLzxxpb9YQc7d1t
testing ping:
> peer2 pings peer3
> peer2 pinged peer3 in 1ms
> peer3 pings peer4
> peer3 pinged peer4 in 1ms
> peer4 pings peer2
> peer4 pinged peer2 in 1ms
testing store data:
> peer4 saves file
> file url: https://www.vg.no/rampelys/i/AdM07r/herodes-falsk-om-jahn-teigen-vi-var-aldri-uvenner
> file saved with hash e5c00281e0a803b156a331f74176b84c910af53aae4947ff23d0ac0b471e51f7
> file saved on 4 machines
testing store data:
> peer4 saves file
> file url: https://www.vg.no/nyheter/innenriks/i/mR4VBl/trine-skei-grande-gaar-av-trekker-seg-som-partileder-og-statsraad-og-tar-ikke-gjenvalg
> file saved with hash d4215fe9bdac0fa5819c50afcedfeeb4c97a66c42412d37f2dddf3f80021ebb2
> file saved on 4 machines
testing store data:
> peer4 saves file
> file url: https://www.nrk.no/sport/tomme-tribuner-pa-norge-serbia.--gradighet-i-koronaens-tid.-1.14938017
> file saved with hash dcaa217f2af270c1037e0576919f7eef8494475cf5503c8c966b4244fe5753d1
> file saved on 4 machines
testing store data:
> peer4 saves file
> file url: https://www.nettavisen.no/stort-sexpress-og-mye-nakenbilder-blant-ungdommer-under-16-ar-mener-politiet-vi-ma-bare-innse-at-ungene-starter-med-seksualitet-mye-tidligere-enn-for/s/5-21-709733
> file saved with hash 9b4a6174d9f91876ade8f21efb9e11f27c659a8ab5dee0b094eda6c42db7186d
> file saved on 4 machines
testing store data:
> peer4 saves file
> file url: https://www.tv2.no/nyheter/11288429/
> file saved with hash e8d5465f945a4c4c672ba6c2998b9d778cc67d11ff6855a9a95688bb772876d0
> file saved on 4 machines
testing store data:
> peer2 saves file
> file url: https://www.aftenposten.no/amagasinet/i/Wb1G6K/Du-drikker-nesten-garantert-bobler-fra-feil-glass
> file saved with hash 6b4c0a759337ff0065adaea2d35165a04e35adfdb03fe7eae83a846bd5c985db
> file saved on 4 machines
testing store data:
> peer2 saves file
> file url: https://www.godt.no/artikkel/24799163/restaurantanmeldelse-av-signalen
> file saved with hash 703decc949055530e54fee2a5325c0bb0b5e4c91806db646c2a0f3b2dd7764a6
> file saved on 4 machines
testing store data:
> peer2 saves file
> file url: https://www.vg.no/nyheter/innenriks/i/OpJexl/slik-soner-norges-farligste-ungdommer
Error: Failed to put value to enough peers: 0/3
    at Object.put (/Users/tonjeroyeng/Documents/Project/distarchive/api/node_modules/libp2p-kad-dht/src/content-fetching/index.js:126:31)
    at async Object.storeData (/Users/tonjeroyeng/Documents/Project/distarchive/api/src/javascripts/peerHandler/index.js:98:3)
    at async Object.saveFile (/Users/tonjeroyeng/Documents/Project/distarchive/api/src/javascripts/index.js:18:18)
    at async /Users/tonjeroyeng/Documents/Project/distarchive/api/test/functionality-test.js:49:15 {
  code: 'ERR_NOT_ENOUGH_PUT_PEERS'
}
> error saving file
testing store data:
> peer2 saves file
> file url: https://www.nrk.no/norge/siste-nytt-om-koronaviruset-i-norge-1.14938353
> file saved with hash 21b837b738f2430951588864ba71cfa5956fde9a0bb513221c11a25fd68b10ad
> file saved on 4 machines
testing store data:
> peer2 saves file
> file url: https://www.nrk.no/mr/barn-pa-kristiansund-sjukehus-mogleg-koronasmitta-1.14939900
> file saved with hash 960036a683a54f98dc3cf1a4cc25c92a778adecae44841025d772f0a9cc050c6
> file saved on 4 machines
testing find file:
> peer4 fetches file with hash
> hash : 960036a683a54f98dc3cf1a4cc25c92a778adecae44841025d772f0a9cc050c6
> found file with title: Barn paa Kristiansund sjukehus mogleg koronasmitta - NRK More og Romsdal - Lokale nyheter, TV og radio
testing find file:
> peer4 fetches file with hash
> hash : 9b4a6174d9f91876ade8f21efb9e11f27c659a8ab5dee0b094eda6c42db7186d
> found file with title: Nettavisen - Stort sexpress og mye nakenbilder blant ungdommer under 16 aar, mener politiet: - Vi maa bare innse at ungene starter med seksualitet mye tidligere enn for
testing find file:
> peer4 fetches file with hash
> hash : 960036a683a54f98dc3cf1a4cc25c92a778adecae44841025d772f0a9cc050c6
> found file with title: Barn paa Kristiansund sjukehus mogleg koronasmitta - NRK More og Romsdal - Lokale nyheter, TV og radio
testing find file:
> peer4 fetches file with hash
> hash : 703decc949055530e54fee2a5325c0bb0b5e4c91806db646c2a0f3b2dd7764a6
> found file with title: Restaurantanmeldelse av Signalen: En fergetur verdig - Godt.no
testing find file:
> peer4 fetches file with hash
> hash : e8d5465f945a4c4c672ba6c2998b9d778cc67d11ff6855a9a95688bb772876d0
> found file with title: Avinor vurderer aa stenge norske flyplasser
testing find file:
> peer3 fetches file with hash
> hash : 960036a683a54f98dc3cf1a4cc25c92a778adecae44841025d772f0a9cc050c6
> found file with title: Barn paa Kristiansund sjukehus mogleg koronasmitta - NRK More og Romsdal - Lokale nyheter, TV og radio
testing find file:
> peer3 fetches file with hash
> hash : 960036a683a54f98dc3cf1a4cc25c92a778adecae44841025d772f0a9cc050c6
> found file with title: Barn paa Kristiansund sjukehus mogleg koronasmitta - NRK More og Romsdal - Lokale nyheter, TV og radio
testing find file:
> peer3 fetches file with hash
> hash : dcaa217f2af270c1037e0576919f7eef8494475cf5503c8c966b4244fe5753d1
> found file with title: Tomme tribuner paa Norge-Serbia. Graadighet i koronaens tid. - NRK Sport - Sportsnyheter, resultater og sendeplan
testing find file:
> peer3 fetches file with hash
> hash : 21b837b738f2430951588864ba71cfa5956fde9a0bb513221c11a25fd68b10ad
> found file with title: Siste nytt om koronaviruset i Norge - NRK Norge - Oversikt over nyheter fra ulike deler av landet
testing find file:
> peer3 fetches file with hash
> hash : 6b4c0a759337ff0065adaea2d35165a04e35adfdb03fe7eae83a846bd5c985db
> found file with title: Du drikker nesten garantert bobler fra feil glass
cleanup, removing local files

Appendix B

Questionnaire

Distributed Internet Archive Evaluation

The purpose of this survey is to gather user input on the visual design of the system I have made in my master thesis to evaluate the user friendliness. The system is an application in the browser that lets you archive permanent copies of web sites (particularly articles), like a cross between a bookmarking service and the Internet archive. The sites will be saved as they are at the time, and will not be subject to change, even if the original site changes.

Each part of the survey starts with a screenshot, followed by questions about how you would use the system, and what your expectations of the system would be.

All answers are anonymous and will be deleted at the end of this project, by September 2020.

It is okay to answer the survey in Norwegian.

Main page
This is the main page of the system, which displays the archived sites.

Close-up of header form

Close-up of site item

1. How would you archive a site to the system? Describe where you would click and what information you think you would need to provide.

2. What would you expect to happen if you clicked "My sites" in the header? It is the text in the top left of the header, if the image is blurry.

3. Where would you click to open the file "How to Think Without Googling - Forge"?

4. What would you expect to happen if you clicked the URL provided underneath each site?

5. Where do you think you might find a file's "Share ID"?

File menu
This is a close-up of the file menu for each file.

6. What do you think the left icon communicates (network icon)?

7. Describe in detail what you would expect to happen when clicking the network icon.

8. What do you think the right icon communicates (trashcan icon)?

9. Describe in detail what you would expect to happen when clicking the trashcan icon.

Share pop-up
This pop-up shows when you click the share icon.

Share file

10. How would you share a file with someone? Describe where you would click and what you would do.

Display page
When you open a site, this page is displayed, where the site is available along with any information about it.

11. What would you expect to happen if you clicked "My sites" on this page?

12. What would you expect to happen if you clicked the arrow at the top?

13. Is there any information you would wish to be displayed here that isn't already?

Additional comments

14. Write any additional comments on the design here. Any feedback is much appreciated.


