
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS STOCKHOLM, SWEDEN 2020

Automatic fingerprinting of websites Using clustering and multiple bag-of-words models

Automatisk fingeravtryckning av hemsidor Med användning av klustring och flera ordvektormodeller

ALFRED BERG

NORTON LAMBERG

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH

Automatic fingerprinting of websites

Using clustering and multiple bag-of-words models

Automatisk fingeravtryckning av hemsidor

Med användning av klustring och flera ordvektormodeller

Alfred Berg Norton Lamberg

Examensarbete inom Datateknik, Grundnivå, 15 hp Handledare: Shahid Raza Examinator: Ibrahim Orhan TRITA-CBH-GRU-2020:060

KTH Skolan för kemi, bioteknologi och hälsa 141 52 Huddinge, Sverige

Abstract

Fingerprinting a website is the process of identifying which technologies a website uses, such as its web applications and JavaScript frameworks. Current fingerprinting methods use manually created fingerprints for each technology they look for. These fingerprints consist of multiple text strings that are matched against an HTTP response from a website. Creating these fingerprints for each technology can be time-consuming, which limits which technologies fingerprints can be built for. This thesis presents a potential solution by utilizing unsupervised machine learning techniques to cluster websites by their web applications and JavaScript frameworks in use, without requiring manually created fingerprints. Our solution uses multiple bag-of-words models combined with the dimensionality reduction technique t-SNE and the clustering algorithm OPTICS. Results show that some technologies, for example Drupal, achieve a precision of 0.731 and a recall of 0.485 without any training data. These results lead to the conclusion that the proposed solution could plausibly be used to cluster websites by the web applications and JavaScript frameworks they use. However, further work is needed to increase the precision and recall of the results.

Keywords Clustering, fingerprinting, OPTICS, t-SNE, headless browser, bag-of-words, unsupervised machine learning

Sammanfattning

Att ta fingeravtryck av en hemsida innebär att identifiera vilka teknologier som en hemsida använder, såsom dess webbapplikationer och JavaScript-ramverk. Nuvarande metoder för att göra fingeravtryckningar av hemsidor använder sig av manuellt skapade fingeravtryck för varje teknologi som de letar efter. Dessa fingeravtryck består av flera textsträngar som matchas mot HTTP-svar från hemsidor. Att skapa fingeravtryck kan vara en tidskrävande process vilket begränsar vilka teknologier som fingeravtryck kan skapas för. Den här rapporten presenterar en potentiell lösning genom att utnyttja oövervakade maskininlärningstekniker för att klustra hemsidor efter vilka webbapplikationer och JavaScript-ramverk som används, utan att manuellt skapa fingeravtryck. Detta uppnås genom att använda flera ordvektormodeller tillsammans med dimensionalitetreducerings-tekniken t-SNE och klustringsalgoritmen OPTICS. Resultatet visar att vissa teknologier, till exempel Drupal, får en precision på 0,731 och en recall på 0,485 utan någon träningsdata. Detta leder till slutsatsen att den föreslagna lösningen möjligtvis kan användas för att klustra hemsidor efter de webbapplikationer och JavaScript-ramverk som används. Men mera arbete behövs för att öka precision och recall av resultaten.

Nyckelord Klustring, fingeravtryckning, OPTICS, t-SNE, huvudlös webbläsare, ordvektor, oövervakad maskininlärning

Acknowledgment

We would like to give a special thanks to Tom Hudson for offering technical counsel during the writing of this thesis, and for providing the basis of the data collection tool that we could continue to develop upon.

Table of contents

1 Introduction
  1.1 Problem statement
  1.2 Goal of the project
  1.3 Scope of the project and limitations
2 Theory and background
  2.1 Fingerprinting
  2.2 Headless browser
  2.3 JavaScript window object
  2.4 HTML document
  2.5 Supervised and unsupervised learning
  2.6 Clustering
    2.6.1 Partitional clustering algorithms
    2.6.2 Hierarchical clustering algorithms
    2.6.3 Density-based clustering
    2.6.4 Clustering performance evaluation
  2.7 Dimensionality Reduction and sample size
    2.7.1 Sample Size
    2.7.2 Curse of dimensionality
    2.7.3 SVD and truncated SVD
    2.7.4 t-SNE
  2.8 Feature extraction
    2.8.1 Bag-of-Words
    2.8.2 N-Gram
  2.9 Related Works
3 Methodology
  3.1 Supervised learning or unsupervised learning
  3.2 Data collection
  3.3 Feature extraction
    3.3.1 Dimensionality Reduction
    3.3.2 Clustering algorithm
  3.4 Alternative method
  3.5 Architecture
    3.5.1 Hardware
    3.5.2 Software
4 Results
  4.1 Labeling the data for comparison
  4.2 Results of our method
    4.2.1 Dataset
    4.2.2 K-Means clustering
    4.2.3 OPTICS clustering
  4.3 Observations
    4.3.1 Truncated SVD
    4.3.2 Wappalyzer false negative
    4.3.3 Empty and almost empty data from sites
  4.4 Runtime
  4.5 Evaluation results
    4.5.1 Wordpress
    4.5.2 jQuery
    4.5.3 Drupal
    4.5.4 ASP.NET
    4.5.5 AddThis
5 Discussion
  5.1 Clustering results
  5.2 Accuracy and reliability of evaluation
    5.2.1 Dataset inconsistencies
    5.2.2 Significance of the size of dataset
  5.3 Societal, ethical, environmental and economical impact
6 Conclusion
References
Image references
Appendix 1
Appendix 2
Appendix 3


1 Introduction

Today, there is a vast number of websites, using a great variety of technologies. The ability to identify and categorize properties associated with these technologies is useful in several fields. One example of a property is the technology stack in use by a website, such as which websites use Wordpress or certain frameworks. The process of taking such a property and making a unique identifier based on it is called fingerprinting. One field where this could be useful is security, where fingerprinting could help find groups of web applications utilizing similar technologies. Knowing the technologies in use by a company is useful if a weakness is found within one of these technologies, as this information can be used to decide which web applications should be prioritized for patching. Similarly, finding out how widespread certain technologies are on the market is useful for market research.

1.1 Problem statement

Categorizing the technologies used by websites via fingerprinting is nothing new. However, in comparison to manual fingerprinting methods such as the ones used by Wappalyzer [1], an automated method could allow a larger part of the web to be fingerprinted while potentially reducing the time required to perform the fingerprinting. There is currently no readily available tool that fingerprints websites based on the web servers and web applications in use without using previously defined fingerprints for each technology.

Thanks to the development of clustering algorithms, it is possible to take unlabeled data and group it into clusters. These clustering algorithms can handle large amounts of data and group them based on their likeness to one another. Clustering could possibly be used to create an automated method for fingerprinting websites that groups web servers and web applications without needing training data about any technologies. This would also open up the possibility of fingerprinting new or internal frameworks, such as when a company or service utilizes its own internal technologies, as well as speeding up the fingerprinting process by removing the need to build fingerprints for each technology.

1.2 Goal of the project

The goal of this thesis is to determine if it is possible to create a tool that can, without using previously defined fingerprints for each technology, fingerprint websites based on the technologies used. The types of technologies in focus are different web applications and web servers. The precision, recall, and silhouette coefficient of said fingerprints are determined, and their usefulness is compared to pre-existing fingerprinting technologies.


In order to achieve this, a tool needs to be created that can gather the relevant data from websites that the fingerprints are to be based on. After that, another tool has to be built that groups the websites, based on the collected data, by the web server and web application in use. Our results are compared with existing methods through a test on unlabeled data. A large number of websites are chosen to run the tool on, which tests the tool's ability to operate on live data, but has the drawback that the ground truth of what the technology stack looks like is not known.

The results from our fingerprinting method need to be compared with pre-existing fingerprinting tools. The metrics used to compare the clustering to pre-existing tools are precision and recall. Additionally, the silhouette coefficient is used to determine how well separated the clusters are for the compatible clustering algorithms. To assist in the evaluation, the results are spot-checked.

1.3 Scope of the project and limitations

Fingerprinting in this thesis refers to the method of taking a list of websites and determining parts of the technology stack behind them by analyzing the responses to HTTP requests. Pre-existing tools such as Wappalyzer [1] do this by searching for specific strings in the responses, such as “href='/wp-content” as an indication that Wordpress might be used. These strings are built up by people manually analyzing pages that run the technology that the fingerprint is being built for.

This thesis focuses on ways to automatically group websites by the technology used, instead of creating specific fingerprinting strings like Wappalyzer does for each technology. Additionally, the focus is on grouping the websites by the web server in use (such as Apache, Nginx, or Tomcat) and the web application in use (such as Wordpress, Magento, Drupal, etc.). Applications that are not served over HTTPS on port 443 are not considered.

Another limitation is that the focus is on the root path of the sites chosen to be fingerprinted. No crawling of paths or domains is involved in this thesis. This means that applications on the path www.example.com/ are fingerprinted, but applications on paths like www.example.com/blog are not. There is also a limit on how many different automatic fingerprinting techniques can reasonably be built, tested, and evaluated in the time span allotted for this thesis.


2 Theory and background

This chapter presents the theoretical framework for the background of the problem and the proposed solution. It also presents related works within fingerprinting and clustering.

2.1 Fingerprinting

In a broad sense, fingerprinting refers to creating a unique identifier for different items. If the items are the same, the fingerprint will also be the same [2]. In this thesis, fingerprinting refers to identifying the technology in use by websites. Two sites will have the same fingerprint if they both use the same technologies. There is a lack of readily available academic papers that discuss fingerprinting web applications and web servers based on the HTTP responses they return. However, there are open source tools such as Wappalyzer [1] and WhatWeb [3] that can achieve this.

Wappalyzer uses a large JSON file to define the fingerprints for the applications that it supports. In Figure 1, the Wappalyzer [1] JSON object containing the fingerprint for the Magento web application can be seen. This fingerprint is based on multiple different properties of the returned response from the web server. One such property is the name of the cookie. If the name of the cookie contains “frontend” then it is a sign that the site might be running Magento. All the supported properties are [1]:

● The cookie names and/or values
● Pattern matching (regex) against the HTML returned
● The favicon of the application (usually seen as the picture on each tab in the browser)
● Method names from the JavaScript code on the page
● Pattern matching for the JavaScript URLs on the page
● The value of the HTML meta tag
● Pattern matching against the URLs in the HTML page
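To make the structure of such a fingerprint concrete, below is a schematic sketch of the kind of entry Wappalyzer combines for one technology. It is expressed as a Python dict for illustration; the field names and patterns are approximations we made up, not copied from Wappalyzer's actual apps.json.

```python
# Schematic illustration (not Wappalyzer's exact schema) of a per-technology
# fingerprint entry combining several of the properties listed above.
magento_fingerprint = {
    "Magento": {
        "cookies": {"frontend": ""},           # cookie name to look for
        "js": {"Mage": ""},                    # JavaScript global added by the app
        "html": "js/blank\\.html",             # regex matched against the HTML body
        "scripts": "js/mage",                  # pattern for JavaScript URLs
        "meta": {"generator": "Magento"},      # HTML meta tag value
    }
}
```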


Figure 1: Wappalyzer's fingerprint for the Magento web application. Picture taken from Wappalyzer's GitHub.

2.2 Headless browser

A headless browser is like a normal browser such as Chrome or Firefox, but without any GUI (graphical user interface) to interact with [4]. Instead of a GUI, it is possible to interact with it by writing code, for example via the Golang framework chromedp [5]. An example of a program that makes use of a headless browser is Wappalyzer [1]. A headless browser makes it possible to get the full functionality of a browser, such as executing JavaScript and making XHR requests when visiting websites. This functionality is useful when, for example, testing or automating tasks on modern web applications that make heavy use of JavaScript [6]. An alternative to a headless browser is a raw HTTP client, such as the tool curl [7], which sends a single HTTP request and outputs the response. Tools like this are generally faster, but the evaluation of JavaScript and the additional requests made by the page are lost.
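As a minimal illustration of the concept, the sketch below drives a headless Chrome instance from Python with Selenium and reads the HTML after JavaScript has executed. This is only an illustration of the idea; the tool built for this thesis used Go and chromedp instead (see section 3.5.2).

```python
# A minimal sketch of a headless browser, assuming Selenium and a local
# Chrome installation. Not the thesis's actual Go/chromedp tool.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")      # run without a GUI
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")
rendered_html = driver.page_source      # HTML after JavaScript has run
print(len(rendered_html))
driver.quit()
```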


2.3 JavaScript window object

Figure 2: A few of the available methods in the JavaScript window object.

The JavaScript object “window”, as seen in Figure 2, is a global object that exists on all pages that the browser visits. It contains many standard methods, variables, and constructors [8]. One of these methods is the window.alert() method, which creates a popup alert box in the browser. It is possible for scripts running on a website to extend the window object with new variables or methods. If all of the standard methods and variable names are removed from the set available on a website, only the custom methods and variables added by the site would be left. The window object could possibly be used when fingerprinting websites and is further discussed in section 3.2 “Data collection”.

2.4 HTML document

Figure 3: A simplified example of an HTML document.

An HTTP request to a web server can be answered with an HTML document in the body of the response, such as the one seen in Figure 3. The browser then parses the HTML and creates a Document Object Model (DOM) that JavaScript can interact with and edit. HTML5 and the DOM are defined by the standards created by the “Web Hypertext Application Technology Working Group” [9]. These standards specify that any HTML element can have the unique identifier (ID) attribute, that the ID value must be unique for the whole page, and that an element cannot have multiple ID attributes. One use for the id attribute is for JavaScript to be able to query the document for a specific element with a particular id. The class attribute is similar to the ID attribute in that it can be on any element, but the class value does not have to be unique for the document. The values of the class attribute are space-separated, which means that the h1 element in Figure 3 has the values “fancy-title” and “main-title”. The class values can be used by JavaScript to query all elements that have a specific class value and style them in a specific way. Both the id and class attributes could be interesting when fingerprinting sites since they are well used by many frameworks such as Bootstrap [10].
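As a small illustration of how these attributes can be harvested, the sketch below extracts id and class values from an HTML fragment like the one in Figure 3, using only the Python standard library. The class name AttributeCollector is ours, invented for the example.

```python
# A minimal sketch of extracting id and class attribute values from HTML.
from html.parser import HTMLParser

class AttributeCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ids, self.classes = [], []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)
            elif name == "class" and value:
                self.classes.extend(value.split())  # class values are space-separated

collector = AttributeCollector()
collector.feed('<h1 id="text" class="fancy-title main-title">Hello</h1>')
print(collector.ids)      # ['text']
print(collector.classes)  # ['fancy-title', 'main-title']
```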

2.5 Supervised and unsupervised learning

Supervised and unsupervised learning are two different types of machine learning. They are used to learn and predict the relations between data. “An Introduction to Statistical Learning” by Gareth James et al. [11] describes supervised learning as requiring some form of training data that consists of labeled data. This labeled data is then used as a reference point for the supervised learning algorithm to determine what kinds of categories exist, and what data belongs in which category. This training data makes the supervised learning method able to put new data into one of the previously seen categories.

In contrast to supervised learning, unsupervised learning does not utilize any training data set. Unsupervised learning algorithms instead attempt to observe relations between data, independently of previous categorization or labeling. One field where unsupervised learning is commonly used is clustering, where relations between data are observed and the data is subsequently grouped based on these observations.

2.6 Clustering

Clustering is the method of taking in multiple objects that each consist of a numeric vector, and then grouping the objects that are most similar to each other into clusters, with regard to the distance between the vectors [11]. This distance can be measured using Euclidean distances, pairwise distances between data points, or other similar metrics. The created clusters can then be utilized to gain further insight into the data. There are multiple clustering algorithms available, and they belong to two main groups: partitional and hierarchical. One of the main features of clustering algorithms that this thesis aims to use is their ability to work with unlabeled data.

2.6.1 Partitional clustering algorithms

Partitional clustering algorithms require the analyst to predefine a targeted number of clusters to be created in order to get started, generally denoted as K [11]. These algorithms then take a data set of N data points, where N is equal to or greater than K, and create K clusters, where each cluster contains at least one data point and each data point belongs to exactly one cluster.

There exists a subset of partitional clustering algorithms, known as fuzzy partitioning. In fuzzy partitioning algorithms, data points can belong to more than one cluster [12], but this thesis does not cover fuzzy partitioning.

2.6.1.1 K-means algorithm

The partitional clustering algorithm K-means works by initially assigning all data points to precisely one cluster. The K in K-means denotes the total number of clusters that will be formed.

In “An Introduction to Statistical Learning with Applications in R” by Gareth James et al. [11], the K-means algorithm is described as an algorithm that works by first defining a number K of centroids, a centroid being the center of a cluster. These centroids are at first randomly distributed, and each data point is allocated to a cluster with the goal of keeping each cluster as small as possible. This process is done iteratively, with each iteration attempting to distribute the centroids in such a way that the clusters shrink in size. This is repeated until there is no change in the centroids' positions, indicating that a local optimum based on the initial randomly distributed centroids has been reached.

The inherent drawback of this method is that the end result depends upon the initial random distribution of the centroids, and a certain set of random distributions may be unable to find the global optimum, no matter the number of iterations [11]. This problem can be alleviated by running the method several times with different starting distributions, which is a feasible solution as one of the main benefits of K-means clustering is its speed [13] [14].
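A minimal sketch of this rerun strategy with scikit-learn (the library used later in this thesis) follows; n_init reruns the algorithm with different random centroids and keeps the best result. The data is synthetic.

```python
# K-means with several random initializations on synthetic two-cluster data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# n_init=10 runs the algorithm ten times and keeps the best clustering.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.inertia_)
```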

2.6.2 Hierarchical clustering algorithms Hierarchical clustering does not rely on a predefined K value, but instead creates a dendrogram, a tree-based representation of the data set [11]. Dendrograms show the relations between data points at all levels without first limiting the results to a particular number of clusters. In Figure 4, three depictions of the same dendrogram can be seen. Each depiction shows a different cut at a different height resulting in different clusterings, where the colors of the data points represent different clusters.


Figure 4: A dendrogram showing the clustering process of a hierarchical algorithm. Picture from “An Introduction to Statistical Learning”.

Clusters that are conjoined lower in the graph tend to be more similar, while clusters created by conjoining higher up in the graph may be only vaguely similar or, in extreme cases, have no similarities at all. In a dendrogram, only vertical proximity dictates the likeness of data points; their horizontal proximity is arbitrary for determining likeness.

The hierarchical clustering algorithm achieves these results by first defining a distance metric. The algorithm treats each data point as its own cluster, and then iteratively joins the two most similar clusters until the entire graph consists of a single cluster [11]. The dendrogram makes it possible to determine the number of clusters after the algorithm has finished; the same dendrogram can yield anywhere between one cluster and as many clusters as there are data points.

The inherent problem with hierarchical clustering comes from the ambiguity of determining what number of clusters accurately reflects the ground truth. Another problem is that, depending on the chosen feature and an incorrectly chosen number of clusters, the results may end up nonsensical. An example of this is attempting to cut a binary dissimilarity into three distinct clusters.

2.6.3 Density-based clustering

Martin Ester et al. [15] describe density-based clustering algorithms as algorithms that create clusters by grouping areas with a dense concentration of data points, while classifying data points in sparse areas as noise or outliers. A visual example, as seen in Figure 5, depicts the density of data points in a separate 3-dimensional graph. The density-based algorithm creates a “cut” through the height of the graph to extract a number of separate islands, or clusters. Only the data above the cutting point is considered when deciding where to form a cluster. When the decision to create a cluster has been made, nearby data is grouped into the cluster, even if that data is below the cutting point. The density level required merely decides where to create a cluster, not exactly which data points belong to it.

Figure 5: A visual representation of density-based clustering. The plots on the left show the result of the clustering, and the 3-dimensional graph on the right shows how the cutting point's placement affects the clustering process. Picture from “Density-based clustering”.

2.6.3.1 DBSCAN and OPTICS

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm created by Martin Ester et al. [16]. The algorithm creates clusters by grouping together dense groups of data points, and uses three different kinds of labels for the data points: core points, border points, and noise. For a point to be classified as a core point, it has to have N data points within the distance D. If a data point does not fulfill those criteria but is within distance D of a core point, it is classified as a border point. If none of these apply, the data point is classified as noise. Both N and D are user-provided parameters to the algorithm.

Each core point then creates a cluster; if a core point is within distance D of another core point, their clusters are merged into one. All border points within the distance D of the formed cluster are then added to the cluster. All points not in a cluster after this are considered noise [16].

Figure 6 shows a dataset on which DBSCAN is run with a minimum neighbor count (N) of three. The three center data points surrounded by solid circles become core points, as they all have three or more data points within their surrounding solid circles. The two data points surrounded by dashed lines become border points, as they are within the distance D of a core point but do not have three neighboring points of their own to be considered core points. The final data point, surrounded by a dotted line, is classified as noise. This process results in a single cluster containing all data points except for the noise data point.

Figure 6: How DBSCAN decides between core points and border points with a minimum neighbor count of 3; the core points are surrounded by solid lines, the border points by dashed lines, and the single noise data point by a dotted line. Picture from “Study of Protein Interfaces with Clustering”.

One of the issues with density-based clustering, such as DBSCAN, is that clusters can have varying density. If only one density is used for the whole algorithm, clusters can be missed [17]. For example, with DBSCAN, if any particular core data point neighbors another core data point, those core data points will be merged into a single cluster. Because of this, no distinction between the two core data points can be made as they are combined into a single cluster, whereas a solution closer to the ground truth could have been two separate clusters.


Looking at Figure 7, there are two potential distinct clusters in the red and blue circles, while the black ellipse shows another potential clustering. The black cluster is formed due to the connecting data points in between the red and blue circles, whereas both the red and blue circles would have made clusters closer to the ground truth. Jörg Sander and Hans-Peter Kriegel, who helped develop DBSCAN, later created, together with others, a new algorithm built upon DBSCAN called OPTICS (Ordering Points To Identify the Clustering Structure) [17]. OPTICS aims to solve the varying density issue by retaining the core distance of objects and allowing for the creation of subclusters based on the density of core data points.

Figure 7: How two potential distinct clusters can be grouped together by linked core data points. The black ellipse shows one potential clustering where the linking data points qualify the entirety as a single cluster. The blue and red ellipses show another potential clustering.

One drawback of the OPTICS algorithm is that in some of its implementations the time complexity, as a function of the number of data points n, can be [18]:

$O(n^2)$ (1)

This time complexity limits the usability of the algorithm on large datasets, as the runtime scales quadratically with the number of data points.
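For reference, this is how OPTICS can be run with scikit-learn (the library used later in this thesis); a minimal sketch on synthetic data with two clusters of different densities, where min_samples plays the role of the neighbor count N described above.

```python
# OPTICS on synthetic data containing clusters of different densities.
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 0.3, (80, 2)),   # dense cluster
    rng.normal(4, 1.0, (80, 2)),   # sparse cluster
])

labels = OPTICS(min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks points classified as noise
```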

2.6.4 Clustering performance evaluation

The variables used to compare different clustering algorithms in this thesis are their precision, recall, and, when applicable, the silhouette coefficient. Both precision and recall require knowledge of the ground truth to evaluate the clustering. TP stands for true positive, meaning that the clustering algorithm puts the sample in the correct cluster. FP stands for false positive and means that the algorithm incorrectly identified a sample as another type. FN stands for false negative and means that the sample should have been in a particular cluster, but is not. Precision is used to determine how accurate an algorithm is with regard to how many false positives it generates. Precision is defined as follows [11]:

$\mathrm{precision} = \dfrac{TP}{TP + FP}$ (2)

Recall operates similarly, but instead of determining accuracy based on how many false positives are created, it determines accuracy based on the number of false negatives generated. Recall follows a similar formula to precision, but replaces the false positives with false negatives, i.e. samples that the clustering algorithm incorrectly identified as not belonging to a specific cluster [11]:

$\mathrm{recall} = \dfrac{TP}{TP + FN}$ (3)

Lastly, the silhouette coefficient [19] can be used to determine how consistent the results of a partitional clustering algorithm are. A perfect score of 1 is achieved if all of the clusters are well separated and there are no points between the generated clusters, while the lowest score of -1 indicates that the clustering algorithm has mislabeled points. The ground truth does not need to be known to calculate the silhouette coefficient.
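A minimal sketch of computing all three metrics with scikit-learn on toy data follows; the labels are made up for illustration. Note that precision and recall need a ground-truth labeling, while the silhouette coefficient only needs the data and the cluster assignment.

```python
# Precision, recall, and silhouette coefficient on toy data.
import numpy as np
from sklearn.metrics import precision_score, recall_score, silhouette_score

y_true = [1, 1, 1, 0, 0, 0]          # ground truth: in the cluster or not
y_pred = [1, 1, 0, 1, 0, 0]          # clustering result, one FP and one FN
print(precision_score(y_true, y_pred))   # TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))      # TP/(TP+FN) = 2/3

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels = [0, 0, 0, 1, 1, 1]
print(silhouette_score(X, labels))       # close to 1: well separated clusters
```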

2.7 Dimensionality Reduction and sample size

Care has to be taken when extracting the features to cluster on. If the features extracted contain too much noise, it can lead to overfitting [11], where two similar samples may end up in entirely separate clusters. There is also the problem of the curse of dimensionality [20], where the volume of the space needed to represent the data rises exponentially with the number of dimensions.

This section briefly touches on the ambiguity in determining the minimum sample size and explains the curse of dimensionality. It also describes the dimensionality reduction technique truncated SVD and the visualization technique t-SNE.

2.7.1 Sample Size

The sample size used when clustering can have an impact on the results [21], but there are no hard and fast rules for determining the sample size. In a report written by Sara Dolnicar [22], she mentions, “There are no rules-of-thumb about the sample size needed for cluster analysis.” Nonetheless, in her results she concludes that with a low sample size it can be hard to find any clusters, especially in high dimensional data.


2.7.2 Curse of dimensionality

The term curse of dimensionality was coined by Richard Bellman [20] and refers to phenomena in higher dimensional datasets that are not present in lower dimensions. The curse of dimensionality can manifest itself in different ways; for example, having more feature dimensions than data points can lead to the Euclidean distances between data points becoming too similar [23]. Since the Euclidean distance is generally what is clustered on, this can make it difficult to create clusters, as the distances between all samples are too similar to one another. This issue can also lead to overfitting, which generally results in worse overall accuracy as clusters and generalizations become too tailored to the initial data, causing new data points that should reasonably fit in a cluster to be excluded. A visual example of this can be seen in Figure 8, where the green line represents the boundary an overfitted method would use to determine what belongs to a cluster. Meanwhile, the black line represents the boundary made by a non-overfitted method, which is preferable since it leads to a better generalization.

Figure 8: A visual representation of how overfitting can lead to unexpected results. The red and blue points represent the training data. The green line, which is the result of an overfitted model, follows the training data precisely, while the black line shows a better generalization that is expected to work better with new data. Picture by Ignacio Icke.

2.7.3 SVD and truncated SVD

An SVD (singular value decomposition) is a factorization of a matrix containing either real or complex numbers [24]. Truncated SVD is an approximation that keeps only the components corresponding to the largest values of one of the factors, a diagonal matrix of non-negative real numbers (the singular values), and discards the rest. Truncated SVD can as such be used to perform a dimensionality reduction by reducing the number of columns of the original matrix.
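Written out, for an m-by-n matrix A the factorization and its rank-k truncation are (a standard formulation of the technique, not quoted from [24]):

```latex
A = U \Sigma V^{T}, \qquad A \approx A_k = U_k \Sigma_k V_k^{T}
```

Here U_k and V_k keep only the first k columns of U and V, Σ_k keeps the k largest singular values, and the rows of U_k Σ_k give each sample a k-dimensional representation.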

2.7.4 t-SNE

To make the results distinguishable by human eyes, some technique that visualizes high dimensional data is needed; one such technique is t-SNE (t-Distributed Stochastic Neighbor Embedding), developed by Laurens van der Maaten et al. [25].

t-SNE is a variation of SNE (Stochastic Neighbor Embedding). It works by taking the high dimensional Euclidean distances between different data points and converting them into conditional probabilities that represent the similarities between these data points in a lower dimension. First, it creates a probability distribution over pairs of high-dimensional objects, ensuring that similar data points have a higher probability of being picked than dissimilar data points. Then it defines a similar distribution over a lower dimensional map and reduces the divergence between the two probability distributions.

One of the main benefits of t-SNE over SNE and other techniques is that it is partially designed to resolve the so-called “crowding problem” that exists in SNE and many other reduction techniques [25]. This problem stems from the fact that a lower-dimensional representation cannot always exactly represent a higher-dimensional manifold, which can force very dissimilar clusters of data points into positions very far away from each other in a low-dimensional representation. This causes the second probability distribution to essentially crush together all data points near the center of the low-dimensional space, potentially resulting in a loss of global identity between the created clusters. By resolving this crowding problem, t-SNE tends to create clearer, easier to read representations compared to many other reduction techniques. It does this without placing different data centers so far away from one another as to limit readability, while still keeping them distinct.

Scikit-learn recommends using another dimensionality reduction method [26], such as PCA, to first reduce the data to a number of dimensions better suited for t-SNE, for example 50, as was used by van der Maaten [27]. t-SNE is best at representing data in a 2- or 3-dimensional space. It is also a very processor-heavy technique and can take hours where other methods such as PCA can finish in minutes, or even seconds [26]. The technique can also produce different results depending on the initialization, and as such multiple restarts with different initializations are recommended.
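A minimal sketch of this two-stage pipeline with scikit-learn, on random stand-in data, is shown below; TruncatedSVD is used for the first reduction here since it also suits sparse matrices.

```python
# Reduce to ~50 dimensions first, then map to 2 dimensions with t-SNE.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

X = np.random.default_rng(2).random((500, 1000))  # stand-in feature matrix

X_50 = TruncatedSVD(n_components=50).fit_transform(X)
X_2 = TSNE(n_components=2, perplexity=30, init="random").fit_transform(X_50)
print(X_2.shape)  # (500, 2), ready for plotting
```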

There are, however, drawbacks with t-SNE. One of them is that random noise might look like clusters in t-SNE [28]. The parameters for t-SNE are perplexity, epsilon, and the number of steps. Changing these parameters can result in significant changes to the results. It is recommended to have a large steps value to ensure t-SNE reaches a stable state [28].

2.8 Feature extraction

As the input to the clustering algorithm is a vector of numbers, there needs to be a step before the clustering process where the data is processed and turned into a vector. This step is called feature extraction. In this section, various ways to extract these features are described.

2.8.1 Bag-of-Words

Figure 9: A visual representation of how a bag-of-words model extracts how often terms are used.

Bag-of-words is a model that turns text into a vector by looking for multiple character sequences in the sample and counting their occurrences. The output of the bag-of-words model for a single document is a vector. In Figure 9, the input document “The quick brown fox jumps over the lazy dog” can be seen being put through a bag-of-words model. The bag-of-words vocabulary contains the words “the”, “fox”, and “cat”. After letting the model process the text, it creates the vector [2,1,0], indicating that “the” was used twice and “fox” once, whereas “cat” was not used at all in the text.

Bag-of-words is a common method [14] [29] to extract features. A risk of using the bag-of-words model is that the dimensionality can quickly get very high if each occurrence of a word is added to the vocabulary of the model. A solution for this is to define a minimum document frequency (MIN_DF). If a feature does not exist in enough of the input text documents, with regard to the MIN_DF value, it will not be used in the vocabulary. This selection process leads to the problem that the features left out will not be taken into account in the clustering. Another drawback is that the order of appearance of the features in the document is not taken into account.
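A minimal sketch of a bag-of-words model with a minimum document frequency, using scikit-learn's CountVectorizer, follows; tokens that appear in fewer than min_df documents are dropped from the vocabulary.

```python
# Bag-of-words with a minimum document frequency.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the quick brown fox jumps over the lazy dog",
    "the fox",
    "the cat",
]
vectorizer = CountVectorizer(min_df=2)       # keep tokens seen in >= 2 documents
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())    # ['fox' 'the']
print(X.toarray())                           # per-document counts
```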

2.8.2 N-Gram

N-grams, also called shingles, are a way to break down words or sentences into smaller parts [30]. The N in N-gram denotes the length of the new parts. N-grams can, among other things, be based on characters or words. For example, if the word “error” is broken down into character-based 3-grams, it becomes the parts “err”, “rro”, and “ror”. N-grams can be used to recover some of the information that the bag-of-words model loses. For example, a bag-of-words model that uses whole words as features would treat “wp-content” and “wp-data” as different features, while a character-based 3-gram would be able to catch the “wp-” string being used in both. However, a potential drawback of character-based N-grams is that they can add noise, since the same 3-gram can occur in entirely different words.
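Character 3-grams can be produced with the same vectorizer class used above; a minimal sketch:

```python
# Character-based 3-grams: "error" yields "err", "rro", "ror".
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(["error"])
print(vectorizer.get_feature_names_out())  # ['err' 'ror' 'rro'] (alphabetical)
```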

2.9 Related Works

In a study conducted by Kai Yang et al. [31], device detection was used to identify and fingerprint IoT devices. The detection was done by sending query packets to remote hosts and analyzing the responses, comparing devices by their IP and TCP/UDP packet headers as well as 20 other common protocols. They utilized web crawlers to crawl commercial websites in order to obtain device features. The input to the model was converted into binary values in the preprocessing step. These binary values were then used by a neural network to generate device fingerprints, which provide class labels for the different IoT devices in three different categories: the device type, the vendor, and the type of product. With this method, they were able to achieve 94.7% accuracy when generating fingerprints based on only the IoT device types, and 91.4% accuracy when generating fingerprints based on the type, the vendor, and the type of product.

Lotta Järvstråt’s master thesis on a functionality classification filter for websites [29] takes the HTML of each website and extracts the URLs and content. The websites were then classified according to their function, such as being a blog, a news site, or a forum. Multinomial logistic regression with Lasso was used as a means to reduce the number of variables. Infrequent terms in the data set were removed to reduce noise in the results and make the process less computationally heavy. The thesis showed a potential 99.61% accuracy in classifying the function of a website using these methods. This method was then compared to another method using the topic model Latent Dirichlet Allocation (LDA). LDA was used to reduce the number of variables into a smaller number of topics, but this method on its own only achieved a best-case accuracy of 97.62%. A point was made that overfitting could be a reason for inaccurate results: when the method was rerun with over three months between feature extraction and fetching the data, it resulted in an accuracy of 90.72%, which shows that overfitting likely was an issue. Combining both methods, using LDA together with a multinomial logistic regression classifier, resulted in an impressive 99.70% accuracy.

These results show that these techniques can be applied to a wide variety of base data, from values in an IP header to entire site contents. Feature extraction can then be run on the collected dataset to prepare it for further calculations, in order to categorize the samples into groups based on their likeness.



3 Methodology

This chapter explains how the project was carried out. First, a literature study was conducted to discover what methods and models exist for fingerprinting and clustering, and how they could be used to solve the problem presented in this thesis. This literature study consisted of reviewing related works, research, and previous studies. Using the information gathered from the literature study, a prototype was built, which was then compared against the existing fingerprinting tool Wappalyzer [1]. Figure 10 shows an overview of the implementation and evaluation of our prototype, from deciding on the dataset used to the evaluation of the clustering results. An alternative method, as well as the architecture used, is also described in this chapter.

Figure 10: An overview of our method. Each box represents a step in the process. Clustering includes both dimensionality reduction and creating clusters.

3.1 Supervised learning or unsupervised learning Supervised learning and unsupervised learning are two categories of machine learning. A supervised approach would require a labeled training dataset. The training dataset could be collected by using existing fingerprinting technologies such as Wappalyzer [1]. The main issue with the supervised approach is that each technology would have to be in the training dataset and be correctly labeled. Getting a training dataset consisting of all technologies on the web is unachievable due to the diverse and quickly changing nature of the web. An unsupervised approach, and in particular clustering, makes it possible to group technologies without the need to have a training dataset containing each technology. The drawback of an unsupervised approach is that there is no guarantee that the collected data will split into the expected groups. Since a supervised approach did not fit the goal of this thesis, an unsupervised approach was picked.

3.2 Data collection

The dataset of web servers used in this project was the Tranco list [32], which contains the 1 million most popular websites globally. The Tranco list is specifically made to be a dataset for internet-wide research and is hardened against manipulation. One of the benefits of working with the Tranco list, over creating a specific dataset by for example picking random websites through randomized IP addresses or looking at a specific span of IP addresses, is that the Tranco list contains real websites that are popular and in use. Picking random IP addresses might result in the web server not serving a page if no hostname is given. Another benefit is that the Tranco list contains a rather varied set of web servers from different areas of the world, which makes it a diverse dataset. The Tranco list also makes it easy to reproduce our results, since it is possible to download the same list used if the date or id is given. For our purposes, only web servers using HTTPS on port 443 were used; sites not using HTTPS were discarded. The reason was that many sites using HTTP on port 80 redirected to an HTTPS version of the same site, which meant that using both HTTPS and HTTP would result in some pages being collected twice.

In order to be able to perform any clustering, data first has to be collected from the list of web servers. To collect the data, a headless Chrome browser was used. The feature types collected when the headless Chrome browser visits a site were (a sketch of this collection step follows the list):

● The values from the HTML class attribute.
● The values from the HTML id attribute.
● The names of the methods and variables in the JavaScript window object.
● The names of the cookies used by the site.
● All of the additional requests, for example images and API calls.
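The sketch below illustrates collecting four of these feature types from a rendered page. It is a hedged Python/Selenium approximation; the actual tool in this thesis was written in Go on top of chromedp, and capturing the additional requests (the fifth feature type) is omitted here since it requires browser performance logs.

```python
# A sketch (not the thesis's Go/chromedp tool) of collecting id values,
# class values, window-object property names, and cookie names.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")

# id and class attribute values from the rendered DOM
ids = driver.execute_script(
    "return [...document.querySelectorAll('[id]')].map(e => e.id)")
classes = driver.execute_script(
    "return [...document.querySelectorAll('[class]')].flatMap(e => [...e.classList])")

# names of the properties on the window object; subtracting a baseline
# recorded on an empty page would leave only the site-specific ones
window_props = driver.execute_script("return Object.getOwnPropertyNames(window)")

# cookie names visible to the browser
cookie_names = [c["name"] for c in driver.get_cookies()]

driver.quit()
print(len(ids), len(classes), len(window_props), cookie_names)
```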

These feature types were picked from looking at how the Wappalyzer [1] fingerprints were built and from comparing a few sites manually to find what differentiated them.

The feature types collected were then put in separate files in a folder for each site. The reason for using a headless browser to collect these values instead of a raw HTTP client was that the headless browser runs all of the JavaScript that the page uses. The JavaScript on the page can add content to the DOM and make additional requests; an HTTP client would not be able to gather this data without also evaluating the JavaScript. The headless browser follows both HTTP header redirects and JavaScript redirects from the page. If the website uses redirects, the data is collected on the final site at the end of the redirects.

In Figure 11, an overview of how our dataset was structured can be seen. Each folder represents one site from the Tranco list. In each folder, there were five files that contain the collected data from our headless browser tool. The “wappalyzer.data” file contains the fingerprints that Wappalyzer collected from the site and is further explained in chapter 4 “Results”.


Figure 11: An overview of the structure of our dataset. Each folder contains its own copies of the files to the right.

In Figure 3, an HTML document can be seen where the extracted list of IDs would be “text”, and the extracted list of class values would be “fancy-title” and “main-title”. This data would then be put in the files “htmlID.data” and “htmlClass.data” from Figure 11, respectively.

Figure 3: A simplified example of an HTML document.

The files “requests.data”, “htmlWindow.data”, and “cookieNames.data” are different from the other files, since they do not contain data directly extracted from the HTML document. The file “requests.data” contains all the URLs of the requests that the browser makes when visiting the page. These URLs can include paths for JavaScript files, image files, or POST requests. An example of a URL from the file could be “https://example.com/images/image.jpg”. Each URL was tokenized by removing the domain and splitting on the ‘/’ and ‘?’ characters. This tokenization results in the tokens “images” and “image.jpg” from the previous example. The file “htmlWindow.data” contains the variables and methods extracted from the JavaScript window object, as explained in section 2.3 “JavaScript window object”. The “cookieNames.data” file contains all the names of the cookies that can be extracted with JavaScript from the page.
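A minimal sketch of this URL tokenization:

```python
# Drop the domain, then split the remainder on '/' and '?'.
import re
from urllib.parse import urlparse

def tokenize_url(url):
    parsed = urlparse(url)
    rest = parsed.path
    if parsed.query:
        rest += "?" + parsed.query
    return [token for token in re.split(r"[/?]", rest) if token]

print(tokenize_url("https://example.com/images/image.jpg"))
# ['images', 'image.jpg']
```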


3.3 Feature extraction

To create the vector for each website, which is needed for clustering, feature extraction was utilized. To create this vector for our model, a combination of 5 bag-of-words models was used. The bag-of-words approach was picked due to its relative simplicity and because it is a well-used and documented method. It is also deterministic, which makes replicating and testing this method easier, as it is predictable. One drawback of the bag-of-words model is that it loses the ordering of the features from the input document. This drawback is not applicable in this thesis, since the ordering of the features used here does not matter. Whole words were picked as features for the bag-of-words models. All the files from the data collection in our dataset first had to be tokenized and then processed to decide the vocabulary for the bag-of-words models. The selection process deciding the vocabulary of a bag-of-words model was based on how many of the documents contained each specific feature. For example, if the id “text” can be found on five sites and the minimum document frequency (MIN_DF) is ten sites, it will not be included in the model.

The result of the tokenization was then processed again to extract the feature vector for each document by using the vocabulary. An alternative to using whole words as features could have been to use n-grams, for example character-based 3-grams such as “aaa”, “aab”, and so on. 3-grams would have the advantage of removing the vocabulary-building step, since there are only 46656 possible 3-grams over the alphanumeric characters. However, since we used a total of 5 bag-of-words models, the total dimensionality would be five times higher, at 233280. This many dimensions would likely run into the issues explained in section 2.7.2 “Curse of dimensionality”, and 3-grams were therefore not picked. As it was not possible to add each new unique token from the tokenization step into the vocabulary of the bag-of-words model, a selection of which tokens to add had to be made. The reason for not being able to extend the vocabulary for each new feature was that the dimensionality of the final vector would get too big; had this many dimensions been used, it would also lead to the issues of the curse of dimensionality. This part of the method was similar to how Lotta Järvstråt [29] picked the vocabulary for her bag-of-words model. However, a difference from her work is that our thesis uses multiple bag-of-words models and combines the results.

23 | 3 Methodology

Each bag-of-words model has a separate vocabulary from the tokenization step. In Figure 12, a visualization of how the feature vector for “example.com” is created can be seen. It can be seen that example.com contains the classes “fancy-title” and “main-title” and the id “text” after the tokenization step. The class data was then passed to the bag-of-words model for classes. Each bag-of-words model contains a vocabulary, but while the class value “fancy-title” exists on “example.com”, it is not in the vocabulary for the class model, since not enough other sites had a class attribute with the value “fancy-title”. The class data was then turned into the vector 0,1,0, since “main-title” was in the vocabulary and it exists once on the page example.com. This whole process is then repeated for the other bag-of-words models and data. The final vector for “example.com” was the combined vectors from each bag-of-words model. In this thesis, in order to make one feature comparable to another, each feature was scaled to a standard deviation of 1, while the results of the bag-of-words models were centered around 0.

Figure 12: An example of how the features from https://example.com are turned into a feature vector with two bag-of-words models.
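A minimal sketch of combining several bag-of-words models into one vector per site and scaling the features with scikit-learn follows. The toy documents echo the example above; with_mean=False is used here to keep the sparse matrices sparse, a slight deviation from the centering around 0 described above.

```python
# Combine two bag-of-words models into one feature vector per site, then
# scale each feature to unit standard deviation.
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

class_docs = ["main-title fancy-title", "main-title", "navbar"]  # one doc per site
id_docs = ["text", "text header", "footer"]

class_vec = CountVectorizer(min_df=2).fit_transform(class_docs)
id_vec = CountVectorizer(min_df=2).fit_transform(id_docs)

X = sp.hstack([class_vec, id_vec])                     # one combined vector per site
X_scaled = StandardScaler(with_mean=False).fit_transform(X)
print(X_scaled.shape)
```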


3.3.1 Dimensionality Reduction

After having obtained the feature vectors, they still need to be subjected to dimensionality reduction. For our model, we used truncated SVD followed by t-SNE to bring down the dimensionality. The reason for the dimensionality reduction was that the dimensionality from the bag-of-words models was still too large to cluster on, with over 14000 dimensions.

The framework used to perform both truncated SVD and t-SNE was scikit-learn [33]. The reason that the vectors were first subjected to truncated SVD was that t-SNE works better with an already lower number of dimensions, as t-SNE is a slower algorithm. Scikit-learn recommends around 50 dimensions in order to speed up the process as well as to help reduce noise [34]. Truncated SVD was chosen over other alternatives due to its unsupervised nature, its simplicity, and its ease of setup. After truncated SVD had been used to bring down the dimensions, t-SNE was used to perform data visualization in order to make the results more comprehensible. t-SNE was picked for its ability to deal with the crowding problem. While the t-SNE algorithm loses some information compared to, for example, truncated SVD, solving the crowding problem was deemed worth the loss of information, and in some cases the loss of information can even be beneficial: when data points have a very large Euclidean distance from each other, t-SNE tends to create representations where the global distance between these groups of data points is reduced, making the results easier to cluster.

3.3.2 Clustering algorithm Once the feature vectors have been built for the sites, a clustering algorithm has to be picked that groups the most similar sites into clusters. This thesis focuses on the two clustering algorithms, K-means and OPTICS, both of which are explained in section 2.6 “Clustering”.

The K-means algorithm was chosen due to its speed and simplicity. However, an issue for this thesis is that the number of centroids must be specified when using K-means, which could not be done because the collected data was unlabeled and the correct number of clusters was therefore not known. This was solved by incrementing the K value, comparing the silhouette coefficients, and picking the K with the largest silhouette coefficient value.

Something that was taken into consideration when using the K-means algorithm was the initialization values of the centroids. It is possible for one initialization to reach a local optimum with clustering results that are far away from the global optimum. For this reason, the K-means algorithm was run multiple times with different initial values for the centroids.
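A minimal sketch of this K selection by silhouette sweep, on synthetic data with three true clusters, follows.

```python
# Sweep K and keep the value with the highest silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(i * 5, 1, (50, 2)) for i in range(3)])

best_k, best_score = None, -1.0
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))  # expected: 3 clusters
```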


The OPTICS algorithm was picked due to its ability to deal with clusters of uneven sizes, and because there is no need to specify the number of clusters, only the minimum cluster size [17]. The OPTICS algorithm can also handle clusters that are not circular. OPTICS was picked over DBSCAN since DBSCAN has issues separating smaller subclusters close to other clusters, due to the single density value that DBSCAN uses. The parameters N and D had to be chosen, where N is the number of neighboring data points required for a point to be considered a core point, and D is the distance within which a data point counts as neighboring a core point. N was chosen as the smallest MIN_DF in use, since our method likely cannot cluster any technology that occurs on fewer than MIN_DF sites. D was chosen by running OPTICS multiple times with different values of D and picking the best result.

A hierarchical clustering algorithm was not picked due to the issue of determining the number of clusters. This issue was difficult to solve as it was not known how many technologies there are in the dataset being used. A hierarchical algorithm would also not offer any tangible benefit over other clustering algorithms for our method.

3.4 Alternative method

This section describes a theoretical alternative method for data collection and feature extraction. This alternative method does not rely on the HTTP body, but on the HTTP headers instead. It does this by sending various malformed requests to the site using a raw HTTP client instead of a headless browser. The reason for using a raw HTTP client is the ease of generating malformed requests in any specific configuration desired. An example of a malformed HTTP request is one where the method, usually GET, POST, or another method defined in the HTTP standard [35], is changed to the non-standard HTTP method “AAA”. Two other examples of malformed requests are sending a nonexistent HTTP version in the request, such as “GET / HTTP/99”, or requests with unusually long paths of thousands of characters, like “GET /AAAAAAAAAAAAAA... HTTP/1.1”. The hypothesis is that different web servers and frameworks will handle these malformed requests differently and give different kinds of responses. Data from the responses to each request would then be saved; this could be data such as the HTTP status code, the header names used, and the content length. The collected data would then be clustered using the already mentioned clustering and dimensionality reduction methods.
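A minimal sketch of sending one such malformed request over a raw TLS socket and recording the status line follows; as discussed below, this should only be run against hosts whose owners have given explicit permission.

```python
# Send a request with the non-standard method "AAA" and read the status line.
import socket
import ssl

def malformed_request(host):
    raw = f"AAA / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    ctx = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(raw.encode())
            response = b""
            while chunk := tls.recv(4096):
                response += chunk
    status_line, _, _ = response.partition(b"\r\n")
    return status_line.decode(errors="replace")

print(malformed_request("example.com"))  # e.g. "HTTP/1.1 501 Not Implemented"
```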

To use the malformed requests method, a different dataset than the Tranco list should be used. The dataset should consist of websites belonging to companies that have given explicit permission for security tests to be performed on them. The reason behind this is that this method could be considered intrusive due to the malformed requests that the application might not be expecting, even though it is unlikely that the malformed requests in and of themselves would uncover any vulnerabilities. Due to this ethical factor and time constraints, it was decided not to pursue this method.

3.5 Architecture

This section gives an overview of the architecture of the tools that were built, including the hardware they were run on and the software used.

3.5.1 Hardware

Two different machines were used, as can be seen in Figure 13. One was a server in Google Cloud [36] of the type n1-standard-4 with 15GB of RAM and three vCPUs with a base frequency of 2.3 GHz each. The other machine was a desktop computer with 16GB of RAM and one CPU with a 3.40 GHz base frequency.

Figure 13: An overview of the hardware and how the dataset was combined.


The Google Cloud machine, located in Iowa, USA, was used to collect the Wappalyzer data and the headless browser data. The desktop computer was used for feature extraction, dimensionality reduction, and clustering of the dataset.

3.5.2 Software

The tools in this project were written in Python [37] and Golang [38]. Wappalyzer [1] was used to collect the fingerprints that our results were compared to.

The program that uses the headless browser was written in Golang. The Golang program uses Chromedp [5] to control the Chrome headless browser. The base of this program was already built by Tom Hudson [39], but it was extended in this project with the functionality to collect the data needed to cluster the websites.

The program for feature extraction and clustering was written in Python. Scikit-learn [33] was a large part of the Python script and was used for its pre-built clustering algorithms and data pre-processing tools. The plots were made with the Matplotlib [40] library.


4 Results

A few different metrics are used to evaluate the results of our method and to make comparisons possible. This chapter begins with an overview of the results, followed by observations made from the clustering, and finally an evaluation of our results compared to Wappalyzer.

4.1 Labeling the data for comparison

The comparison of our results is made against the pre-existing fingerprinting tool Wappalyzer [1]. Wappalyzer is run on the exact same sites as the ones that are clustered by our tool. Since the ground truth is not known, the Wappalyzer results were used as the basis for comparison instead. There is also the possibility that the Wappalyzer fingerprints have false positives and false negatives. Spot checking is carried out on the results both to detect flaws in our method and to detect potential false positives and false negatives. It is also important to note that Wappalyzer can fingerprint multiple technologies for one site; for example, a Wordpress site can also run jQuery.

4.2 Results of our method

Figure 14 shows the distribution of the sites when plotted in two dimensions by first reducing the data to 50 dimensions with truncated SVD and then reducing the 50 dimensions to two with t-SNE. The color of each dot is determined from the Wappalyzer results. The dots that are colored blue run Wordpress, green Drupal, and cyan ASP.NET. The red dots are sites not running any of the aforementioned technologies according to Wappalyzer. Since Wappalyzer can have multiple fingerprints per site, it is important to note that the colors are applied in the order blue, green, and then cyan. This means that if, according to Wappalyzer, a site runs both Wordpress and ASP.NET, the color will be blue for Wordpress.
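A minimal sketch of this two-stage reduction with scikit-learn, assuming the combined and scaled bag-of-words matrix is available as X (the random_state values are assumptions, as they are not stated here):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# X: the combined bag-of-words feature matrix (sparse, ~14 246 columns)
X50 = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

# t-SNE then reduces the 50 dimensions to the 2 used for clustering and plotting
X2 = TSNE(n_components=2, random_state=0).fit_transform(X50)
```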

It can clearly be seen that each of the technologies has one main cluster where that technology has a high density, but it can also be seen that each technology has outliers outside the main cluster. These outliers are most notable for the cyan ASP.NET and the blue Wordpress. It is especially noticeable for ASP.NET, which is quite spread out but has a small main cluster of high ASP.NET density at the lower center of the image.

Figure 14: The result of the dimensionality reduction using truncated SVD and t-SNE. Each point is a site from the Tranco list. The colors come from the Wappalyzer fingerprint results; blue Wordpress, green Drupal, cyan ASP.NET.

4.2.1 Dataset

The list of sites that our method was run on comes from the Tranco list [32]. Due to the long runtime of our method, the total number of sites the clustering method was run on needed to be limited. The process of gathering the websites from the Tranco list was allowed to run overnight, with half of the web servers being picked from the top of the Tranco list and the other half from the end of the list. The Tranco list was fetched on the 14th of April 2020¹. These web servers were then trimmed down to remove all plain-HTTP results, as a lot of the HTTP servers redirected to an HTTPS version, resulting in a lot of duplicate results. Furthermore, web servers with invalid or expired SSL certificates were removed, which resulted in a total of 23 457 different sites.

¹ https://tranco-list.eu/list/XWGN/1000000

Different cutoff points for which features should be included were set for each bag-of-words model. If a feature occurs in fewer sites than the minimum document frequency (MIN_DF), it is not included in the model. The MIN_DF for each model was set so that the total number of features in each bag-of-words model fell within the range of 400-6000. The reason for doing this was to make sure that there was not a large bias towards any particular bag-of-words model. The MIN_DF value and the number of features for each bag-of-words model can be seen in Table 1. The total number of features used is 14 246. A sketch of how one such model can be built follows Table 1.

Table 1: Each column represents a bag-of-words model. The first row describes the minimum document frequency for each model and the second row how many features there are in each model.

                     Cookie   Class   Id     Path   Window

MIN_DF               30       90      30     30     70

Number of features   444      3240    2734   2660   5168
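A minimal sketch of how one such bag-of-words model could be built and the five models combined, assuming each site's extracted tokens (class values, id values, etc.) have been joined into one string per site; the scaling of the individual models (mentioned in section 4.4) is left out, and the use of binary counts is an assumption:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

min_dfs = {"cookie": 30, "class": 90, "id": 30, "path": 30, "window": 70}

def bow(docs, min_df):
    # Features seen on fewer than min_df sites are dropped from the vocabulary
    return CountVectorizer(min_df=min_df, binary=True).fit_transform(docs)

# docs_cookie, docs_class, ... : one token string per site (assumed prepared)
X = hstack([bow(docs_cookie, min_dfs["cookie"]),
            bow(docs_class, min_dfs["class"]),
            bow(docs_id, min_dfs["id"]),
            bow(docs_path, min_dfs["path"]),
            bow(docs_window, min_dfs["window"])])   # ~14 246 columns in total
```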

4.2.2 K-means clustering

Figure 15: A plot of the silhouette score for different K values when using K-means on our result. The X-axis shows the K values and the Y-axis the silhouette score.

Picking the number of centroids (K) for the K-means algorithm proved to be difficult. The reason for this can be seen in Figure 15, which plots the silhouette score on the Y-axis for K values between 2 and 200 on the X-axis. The silhouette coefficient can range from -1 to 1, where a higher value represents better separated clusters. There was no significant difference in the silhouette coefficient as K changed. Due to this, it was not possible to pick any meaningful K value that separated the data into clusters. Figure 16 shows how the K-means algorithm clustered our data using a K of 42, which had the highest silhouette coefficient at 0.53. In Figure 16 it can be seen that clusters of data points that a human would classify as one cluster are split into several clusters.
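A minimal sketch of such a silhouette sweep, assuming the two-dimensional embedding is available as X2:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 201):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X2)
    scores[k] = silhouette_score(X2, labels)

best_k = max(scores, key=scores.get)   # 42 in our runs, with a score of 0.53
```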

Figure 16: An image of how K-means divided our results into clusters with a K (number of clusters) of 42. Each color is a separate cluster.

4.2.3 OPTICS clustering

Compared to the K-means clustering algorithm, OPTICS performed better on the vectors from our method. It can be seen in Figure 17 that the data points are generally well separated into clusters by the OPTICS algorithm. The light red points are classified as noise that, according to OPTICS, does not belong to any cluster. OPTICS was run with "cluster_method" set to "dbscan", an "eps" (D) of 1.75, and "min_samples" (N) set to our smallest MIN_DF (35). These settings resulted in OPTICS forming 67 different clusters.
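A minimal sketch of this final clustering run, assuming the two-dimensional embedding X2; the parameter values are the ones reported above:

```python
from sklearn.cluster import OPTICS

labels = OPTICS(min_samples=35, cluster_method="dbscan",
                eps=1.75).fit_predict(X2)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # 67 clusters
n_noise = (labels == -1).sum()        # points classified as noise (light red)
```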

Figure 17: An image of how OPTICS divided the results into clusters. Each color except light red is a separate cluster. The points colored light red are classified as noise. In total, there are 67 clusters.

4.3 Observations

This section highlights some observations that were made when spot checking the results of our method.

4.3.1 Truncated SVD

Figure 18 shows our dataset after having been reduced to two dimensions by the dimensionality reduction algorithm truncated SVD. There are noticeably some extreme outliers on the right side of the graph where X is around 500. These are 242 sites that either are the homepage of the Google search engine or Blogger, or sites that redirect to the search engine or Blogger. This is an example of overfitting and is further discussed in chapter 5 “Discussion”.

Figure 18: The results of reducing our dataset to two dimensions with truncated SVD. Note the outliers on the bottom right side.

4.3.2 Wappalyzer false negative

Figure 19 is a zoomed in image of the largest cluster predominantly containing Wordpress sites. Note that all of the red points inside this cluster are sites on which Wappalyzer did not fingerprint Wordpress. A spot check of 10 randomly chosen red points was carried out. The spot check included running Wappalyzer against the sites again, and manually visiting and inspecting the sites. The spot check showed that 8 of the 10 sites were running Wordpress, while 2 of the sites were not. One of these sites waited about 10 seconds before redirecting the browser to another site.

Figure 19: A zoomed in image of the Wordpress cluster. According to Wappalyzer, blue points are sites running Wordpress, while red points are not running Wordpress.

4.3.3 Empty and almost empty data from sites

In Figure 20 and Figure 21, zoomed in pictures of two clusters can be seen. The sites in Figure 20 have in common that the headless browser failed to extract any information from them. There are a total of 2011 sites in this cluster. Figure 21 contains the sites where only a small amount of data was extracted (e.g., only one cookie name and no other data) and contains 633 sites. Chapter 5 “Discussion” discusses why these clusters might have formed and potential ways to avoid creating them.

Figure 20: Zoomed in picture of the round cluster of points where the headless browser failed to collect any data. It is on the top right side of the original dimensionality reduction results. There are 2011 points inside the round cluster.

Figure 21: Zoomed in picture of the round cluster of sites where very little data was collected. It is on the right side of the original dimensionality reduction result. There are a total of 633 sites in this cluster.

4.4 Runtime

A test of the runtime of our method was conducted. The test was run ten times with the data from the Chrome headless browser loaded into the machine's RAM. The test included the feature extraction for all bag-of-words models, scaling the models, combining the models, running the dimensionality reduction, and then clustering. From the tests, the mean runtime was calculated to be 496.48 seconds and the median to be 489.30 seconds. The full results can be seen in Appendix 1.
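A minimal sketch of how such a timing run could be taken, assuming a function pipeline() that performs the steps listed above over the in-RAM data (the function name is an assumption):

```python
import statistics
import time

timings = []
for _ in range(10):
    start = time.perf_counter()
    pipeline()                                   # feature extraction ... clustering
    timings.append(time.perf_counter() - start)

print(statistics.mean(timings), statistics.median(timings))
```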

4.5 Evaluation results

In total, Wappalyzer fingerprinted 596 different technologies among the 23 457 sites of our dataset. A list of all of these technologies and their occurrences can be found in Appendix 2. Since our largest MIN_DF value is 90, it is reasonable to assume that the technologies that occur more than 90 times are more reliably fingerprinted, as they have the most features available. In total, 164 (27.7%) of the technologies occurred more than 90 times. Images of how each of these 164 technologies is located in our clustering result can be found in Appendix 3. A selection of these results is presented in this section. This selection was also used when calculating the precision and recall metrics. The clusters from the OPTICS algorithm were used when calculating the precision and recall scores, and the labels were collected as detailed in section 4.1 “Labeling the data for comparison”. Due to time constraints, the precision and recall could only be calculated for some technologies. The technologies that had their precision and recall calculated were manually chosen, and for each chosen technology, the main clusters were picked manually.
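A minimal sketch of how precision and recall could be computed for one technology, assuming the OPTICS labels, a boolean array marking which sites Wappalyzer fingerprinted with the technology, and the manually picked main cluster ids (all variable names are assumptions):

```python
import numpy as np

def precision_recall(labels, runs_tech, main_clusters):
    in_main = np.isin(labels, main_clusters)   # sites inside the chosen cluster(s)
    tp = (in_main & runs_tech).sum()           # cluster members running the tech
    precision = tp / in_main.sum()
    recall = tp / runs_tech.sum()
    return precision, recall
```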

4.5.1 Wordpress

Wordpress is the most common content management system [41] found by Wappalyzer in our dataset. Wappalyzer fingerprinted Wordpress on 3481 sites in our dataset. In Figure 22, it can be seen that our clustering method formed one main cluster with a high density of Wordpress in it, but multiple smaller clusters of Wordpress can still be found spread around outside of the main cluster. There are also some sites spread out in other clusters. The recall score in Table 2 reflects this by being quite low, while the precision score for the main cluster is high, which means most of the sites in the main cluster run Wordpress.

Figure 22: Image of where Wappalyzer fingerprinted Wordpress as blue points. Red points are sites where Wappalyzer did not fingerprint Wordpress.

Table 2: The precision and recall score for Wordpress of our clustering.

Precision 0.832

Recall 0.320

4.5.2 jQuery

jQuery, a JavaScript library [42], is the most popular technology that Wappalyzer fingerprinted. Wappalyzer fingerprinted jQuery on 13893 sites in our dataset. In Figure 23, it can be seen that our method did not form any particular jQuery cluster. jQuery is quite uniformly spread out over the entire clustering, with the exception of a cluster on the lower left side that contains no sites fingerprinted as running jQuery. Since there is no distinct main cluster, there is no way to determine which cluster to check the precision and recall of. This result is discussed further in chapter 5 “Discussion”.

Figure 23: Image of where Wappalyzer fingerprinted jQuery as blue points. Red points are sites where Wappalyzer did not fingerprint jQuery.

4.5.3 Drupal

Drupal is another popular content management system [43], similar to Wordpress. Wappalyzer fingerprinted Drupal on 1208 sites in our dataset. In Figure 24, it can clearly be seen that most of the Drupal sites are in one cluster. But it can also be seen that there is another cluster, with a low density of Drupal sites, close to the main cluster. This second cluster and the main cluster are considered by OPTICS to be one single cluster since they are within close distance of one another, which lowers the precision score seen in Table 3.

Figure 24: Image of where Wappalyzer fingerprinted Drupal as blue points. Red points are sites where Wappalyzer did not fingerprint Drupal.

Table 3: The precision and recall score for Drupal of our clustering.

Precision 0.731

Recall 0.485

4.5.4 ASP.NET

ASP.NET is a framework for building web applications [44]. Wappalyzer fingerprinted ASP.NET on 1553 sites in our dataset. It can be seen in Figure 25 that ASP.NET is quite spread out in our clustering, but there is one cluster in the bottom middle part of the picture with a high density of sites running ASP.NET. As the technology is very spread out, the recall is low, as can be seen in Table 4. The precision of ASP.NET is the highest of the technologies tested, at 0.873.

Figure 25: Image of where Wappalyzer fingerprinted ASP.NET as blue points. Red points are sites where Wappalyzer did not fingerprint ASP.NET.

Table 4: The precision and recall score for ASP.NET of our clustering.

Precision 0.873

Recall 0.109

4.5.5 AddThis

AddThis is a JavaScript tool for social sharing [45]. Wappalyzer fingerprinted it on 655 sites in our dataset. Out of the technologies for which precision and recall were calculated, AddThis had the best overall clustering with our method. In Figure 26, it can be seen that only a small number of sites lie outside of the main cluster, and 84.6% of the sites in the main cluster run AddThis. Both the high recall and precision can be seen in Table 5.

Figure 26: Image of where Wappalyzer fingerprinted AddThis as blue points. Red points are sites where Wappalyzer did not fingerprint AddThis.

Table 5: The precision and recall score for AddThis of our clustering.

Precision 0.846

Recall 0.528


5 Discussion

This chapter discusses how successful the results were and what can be learned from them. It also discusses the decisions made when choosing methods, what other decisions could have been made instead, and the limiting factors that forced some of these decisions. Finally, the extent of this thesis's contribution on a societal, environmental, ethical, and economic level, as well as possible future work, is discussed.

5.1 Clustering results

Figure 16: An image of how K-means divided our results into clusters with a K (number of clusters) of 42. Each color is a separate cluster.

Figure 17: An image of how OPTICS divided the results into clusters. Each color except light red is a separate cluster. The points colored light red are classified as noise. In total there are 67 clusters.

When looking at the results, it is clear that the different clustering algorithms produce results of different quality. When comparing the K-means clustering in Figure 16 and the OPTICS clustering in Figure 17, the K-means clustering creates clearer borders between its clusters but splits data points that seem to belong to a single cluster into several smaller ones. Meanwhile, the OPTICS algorithm creates distinct, well separated clusters, but leaves out some nearby data points and classifies them as noise.

However, since K-means requires the number of clusters to be defined before it is run, combined with the difficulty of determining the number of clusters, K-means ended up being poorly suited for our specific method. Since K-means works best if the data is spherical and grouped in equally sized groups, it is understandable that K-means does not produce optimal results from the data generated by truncated SVD and t-SNE seen in Figure 14, which is neither spherical nor grouped into clusters of equal size.

Figure 14: The result of the dimensionality reduction using truncated SVD and t-SNE. Each point is a site from the Tranco list. The colors come from the Wappalyzer fingerprint results; blue Wordpress, green Drupal, cyan ASP.NET.

Meanwhile, the OPTICS algorithm creates clusters that are more distinct from one another, with very few clusters bordering one another without visible empty space between their border data points. These clear distinctions between clusters arise since OPTICS can handle clusters with varying densities. OPTICS manages this by only creating core points in dense regions and disregarding data points in sparse areas when creating clusters. This ensures that similar data points are grouped together into coherent clusters. Similarly, in Figure 17, we can see the light red data points that surround most of the clusters, which are defined as noise. These data points are not close enough to any cluster and are not numerous enough to create a cluster of their own. These noise data points could still potentially use the same technologies as nearby clusters. However, raising the distance parameter for OPTICS can result in clusters with different technologies being merged into one cluster, as happened with the Drupal result. As such, a distance parameter between these two extremes needs to be chosen, as one distance cannot perfectly satisfy all cases.

5.2 Accuracy and reliability of evaluation

Our results show that our method works for fingerprinting certain web applications and JavaScript frameworks. The goal of this thesis was to fingerprint web servers and web applications, but instead, our method succeeded in fingerprinting web applications and JavaScript frameworks. There are, however, cases where it fails, such as with the popular jQuery framework. This failure is likely due to the jQuery framework being so prevalent that no conclusive cluster can be formed around it, as over half of all sites in the dataset run jQuery. Our method can also only pick up on a few features indicating that jQuery is in use, such as the “$” attribute in the window object. There is room for improvement for both the precision and the recall of the results.

Section 4.5 “Evaluation results” shows that our model, in general, has a much higher precision than recall. The low recall means that our method fails to put all the occurrences of a technology into one cluster. The generally high precision shows that, for the technologies that precision and recall were calculated for, the sites in each technology's main cluster are predominantly running that technology.

Our model relies heavily on the content of the HTML page and the JavaScript files that are included. This dependency means that technologies that do not change the HTML page content, such as web servers, are not fingerprinted. An alternative approach that might be able to fingerprint web servers is proposed in section 3.4 “Alternative method”. This alternative approach would send malformed requests to web servers, which would likely generate different responses depending on the web server and configuration in use. These responses could then be used to cluster the sites. This alternative method was not explored in this thesis due to ethical issues, as the malformed requests could be seen as malicious. This ethical issue arises since the web servers and applications might crash when receiving malformed requests. Because of this, sending these malformed requests to sites that have not allowed tests of this nature to be run on them could be considered unethical. Another reason this alternative method was not explored was time constraints, as the main method still had more avenues to explore and improve, such as a larger sample size for the spot checking.

A theory as to why the results for Wordpress and ASP.NET are quite spread out, while each still has a main cluster, is that both technologies are configurable but come with default settings. This could mean that the main clusters contain the sites with few settings changed from the defaults, while the outliers run the technology with settings changed from the defaults. Due to time constraints, this theory could not be tested in this thesis.

One particular potential for improvement can be seen in the result of section 4.3.1 “Truncated SVD”, where our dataset was reduced down to two dimensions with truncated SVD. The results in section 4.3.1 show that extreme outliers exist, such as the sites redirecting to Google and Blogger. A theory as to why this happened is that more sites than the largest MIN_DF of our bag-of-words models redirected to the same site. Because of this, all the features in use on these sites were picked up and added to the bag-of-words vocabularies. All of these new features resulted in our model overfitting for these cases. A potential way to avoid this kind of overfitting could be to run truncated SVD, form clusters of the outliers, and then run truncated SVD again without the previous clusters of outliers, as sketched below. Another possibility is to raise the MIN_DF value, but this would result in our model not being able to pick up the technologies and frameworks that occur fewer times in the dataset than MIN_DF, as their features would not be picked up by the bag-of-words models.
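A minimal sketch of the proposed two-pass idea, using a simple coordinate threshold in place of clustering the outliers (both the threshold approach and the cutoff value are assumptions):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# X: the combined bag-of-words matrix in CSR format (assumed available)
X2 = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Assumed cutoff: the extreme outliers sit around X ~ 500, far from the rest
idx = np.flatnonzero(np.abs(X2[:, 0]) < 100)
X2_clean = TruncatedSVD(n_components=2, random_state=0).fit_transform(X[idx])
```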

5.2.1 Dataset inconsistencies

Inconsistencies were observed in both the dataset from our headless Chrome data collection program and the Wappalyzer data.

With the Wappalyzer data, it was discovered that Wappalyzer had sometimes failed to accurately fingerprint the running technology. An example of this can be seen in section 4.3.2 “Wappalyzer false negative”, where sites running Wordpress were not fingerprinted as running Wordpress by the first run of Wappalyzer, but manually inspecting the sites and rerunning Wappalyzer showed that they are in fact running Wordpress. A reason why this could happen is that the request to the site timed out for Wappalyzer, while our data collection succeeded. Both Wappalyzer and our data collection program were set to have a 10-second timeout, but they were run about 24 hours apart, which could also affect the results. It is possible that this issue is widespread over the data collected with Wappalyzer and does not only apply to the Wordpress sites. If so, it would mean that the real precision of our method might be higher, while the recall could be lower.

Similarly, it can be seen in section 4.3.3 “Empty and almost empty data from sites” that our data collection program fails to collect any data from some sites; this is likely due to the requests to those sites timing out. In Figure 27, the Wappalyzer results for Wordpress are added in blue to the picture of the empty cluster from section 4.3.3. It can be seen in Figure 27 that Wappalyzer fingerprinted some of these sites as running Wordpress. These inconsistencies indicate that some sites time out for our data collection program but not for Wappalyzer, and sometimes the other way around.

Figure 27: Image from section 4.3.3 “Empty and almost empty data from sites” of the cluster where our data collection returned no data. The colors come from Wappalyzer: sites running Wordpress are blue, while sites not running Wordpress are red.

These timeouts could come either from some sites sometimes taking longer than 10 seconds to load, or from some bottleneck while collecting the data. It could also be a combination of both. To rule out the bottleneck theory, both Wappalyzer and the data collection program would have to be run again on a more powerful machine and network, or the rate of requests being sent would have to be slowed down. Since this potential problem was discovered near the thesis's deadline, there was not enough time to rerun the data collection and Wappalyzer and then reevaluate the results to rule out the bottleneck theory. Another possibility for improvement would be to run both Wappalyzer and our data collection at the same time to remove the time difference between them.

5.2.2 Significance of the size of the dataset

As mentioned in section 2.7.1 “Sample Size”, determining a minimum sample size is no exact science, with no static rule of thumb existing. Having a larger dataset could potentially give more precise results, but it would come at the cost of processing speed. With a mean runtime of 496.48 seconds and a limited time frame allotted to the thesis, a larger sample size would have come at the cost of other areas, such as fewer technologies being tested for precision and recall. A larger dataset would likely also mean more technologies being used among the sites in the dataset, which would lead to the dimensionality getting even higher. A way to counteract this would be to raise the MIN_DF value in our method.

A potential avenue of exploration could be to use our method on a smaller dataset, for example, all subdomains of a company. The smaller dataset could lead to our method being able to cluster internal frameworks in use by the company. If this is the case, it would be a large advantage of our method over a tool like Wappalyzer, since it is not feasible for Wappalyzer to build fingerprints for technologies only in use by one or a few companies, due to the manual effort needed to build a fingerprint. This avenue was not explored, as targeting a company with our method would give results that could not be published in this thesis without permission from said company. Instead, our method was used on the internet-wide public Tranco list to avoid targeting any specific company.

5.3 Societal, ethical, environmental and economic impact

There are few environmental areas that a thesis like this could impact. Similarly, the impact of this thesis on a societal level is hard to quantify. However, on an ethical level, tests like these, where several requests are sent out to scope out the technologies in use on a website, can be viewed as the potential beginning of an attack. The ethical aspect would especially be a problem with the alternative malformed requests method. This issue was alleviated in this thesis by using a diverse dataset of popular websites to avoid targeting any particular company, and by not performing the malformed requests method. Fingerprinting could potentially be used for nefarious purposes; for instance, if a weakness in a specific technology were found, knowing which other sites use that technology could be used to attack them. Similarly, fingerprinting can also be used to help find the sites that need to be patched.

From an economic viewpoint, creating a dedicated tool to fingerprint websites without using previously defined fingerprints could save work hours, which could increase the number of technologies that can be fingerprinted. For market research, being able to quickly get an overview of the market landscape can be useful, specifically in regard to what technologies are commonly used and by whom. This helps give insight into which technologies are worth focusing on in terms of further development, and potentially whether new technologies are appearing and taking over parts of the market.

6 Conclusion

Overall, the thesis achieved partial success, as the core goal was met: a tool was created that can fingerprint websites based on the technologies used, without using previously defined fingerprints for each technology. It did, however, achieve lower precision and recall than hoped for, particularly in regard to recall.

The method developed uses multiple bag-of-words models with different vocabularies for id values, class values, paths, cookie names, and the non-default values from the JavaScript window object. The vectors from the bag-of-words models are then combined, followed by a dimensionality reduction consisting of truncated SVD and t-SNE. The result of the dimensionality reduction is then clustered using K-means or OPTICS.

The results indicate that using a density-based clustering algorithm such as OPTICS yielded better results than using a partitional clustering algorithm such as K-means for creating clusters on unlabeled data. Another finding was that on an average home workstation, the runtimes are relatively high when performing the entire process of data collection, feature extraction, dimensionality reduction, and clustering. The results are promising, but further development to improve the precision and recall is needed.

There are still many research and development opportunities for automatic fingerprinting of websites and web applications. One potential question that could be evaluated is whether malformed requests can be used in the data collection and feature extraction steps to get results from, for example, web servers that do not modify the HTML of the site. Other potential areas for improvement are testing more feature extraction techniques, testing other clustering algorithms and dimensionality reduction techniques to compare their effectiveness, and increasing the sample size to evaluate the effects of a larger dataset.


References

[1] Wappalyzer. Wappalyzer [Internet]. 2010. [cited 2020 Apr 13]. Available from: https://github.com/AliasIO/wappalyzer

[2] Broder A.Z., Some applications of Rabin’s fingerprinting method. Sequences II, New York: Springer; 1993

[3] Horton A, Coles B. WhatWeb - Next generation web scanner [Internet]. 2010 [cited 2020 Apr 13]. Available from: https://github.com/urbanadventurer/WhatWeb

[4] Bidelman E. Getting Started with Headless Chrome [Internet].Google; 2017 [updated 2019-01-14; cited 2020 May 12] Available from: https://developers.google.com/web/updates/2017/04/headless-chrome

[5] Shaw K. A faster, simpler way to drive browsers supporting the Chrome DevTools Protocol [Internet]. Chromedp; 2017 [updated 2020-05-6; cited 2020 May 12] Available from: https://github.com/chromedp/chromedp

[6] Page Automation with PhantomJS [Internet]. Ariya Hidayat; 2020 [cited 2020 May 13] Available from: https://phantomjs.org/page-automation.html

[7] CURL [Internet]. Daniel Stenberg; 1999 [updated 2020-05-13; cited 2020 May 13] Available from: https://github.com/curl/curl

[8] Window - Web APIs [Internet]. Mozilla; c 2020 [updated 2020-02-22; cited 2020 May 13] Available from: https://developer.mozilla.org/en-US/docs/Web/API/Window

[9] HTML Living Standard [Internet]. Whatwg; c 2020 [updated 2020-04-13; cited 2020 May 13] Available from: https://html.spec.whatwg.org/

[10] Build fast, responsive sites with Bootstrap [Internet]. Bootstrap; [date unknown] [cited 2020 May 13] Available from: https://getbootstrap.com/

[11] James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. corr. at 8. printing. New York: Springer; 2013

[12] Jang JSR, Sun CT, Mizutani E. Neuro-Fuzzy and soft computing A computational approach to learning and machine intelligence. Upper Saddle River: Prentice Hall; 1997

[13] Hinz J. Clustering the Web : Comparing Clustering Methods in Swedish [Internet] [Dissertation]. 2013. Available from: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-95228

[14] Ranby E. A comparison of clustering techniques for short social text [Internet] [Dissertation]. 2016. Available from: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-196735

[15] Kriegel H.P., Kröger P, Sander J, Zimek A. Density-based clustering. WIREs Data Mining Knowl Discov. 2011; volume 1 (issue 3):231-240

[16] Ester M, Kriegel H.P., Sander J, Xu X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. AAAI. 1996; 2: 226-231

[17] Ankerst M, Breunig M.M., Kriegel H.P., Sander J. OPTICS: Ordering Points To Identify the Clustering Structure. ACM Sigmod Record. 1999; 28(2):49-60

[18] Scikit-learn. sklearn.cluster.OPTICS [Internet]. Scikit-learn. c 2019 [cited 2020 May 7]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html

[19] Rousseeuw P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987; 20:53-65

[20] Bellman R. Dynamic programming. 6 ed. Princeton:Princeton university press; 1972

[21] Siddiqui K. Heuristics for Sample Size Determination in Multivariate Statistical Techniques. WASJ. 2013;27(2):285-287

[22] Dolnicar S. A Review of Unquestioned Standards in Using Cluster Analysis for Data-Driven Market Segmentation. Melbourne: ANZMAC 2002; 2002

[23] Steinbach M. Ertöz L Kumar V. The Challenges of Clustering High Dimensional Data. In: Wille L, editor. New Directions in Statistical Physics. New York: Springer-Verlag Berlin Heidelberg; 2004. 273-309

[24] Manning C.D, Raghavan P, Schütze H. Introduction to Information Retrieval. 1 ed. Cambridge: Cambridge University Press; 2009

[25] Maaten L, Hinton Geoffrey. Visualizing Data using t-SNE. JMLR. 2008; 9: 2579-2605

[26] Scikit-learn. t-distributed Stochastic Neighbor Embedding (t-SNE) [Internet] ©2019 [cited 2020 Apr 8] Available from: https://scikit-learn.org/stable/modules/manifold.html#t-sne

[27] Maaten L. Accelerating t-SNE using Tree-Based Algorithms. JMLR; 2014. 15:1- 21

[28] Wattenberg M. Viégas F. Johnson I. How to Use t-SNE Effectively. Distill. 2016; Available from: http://doi.org/10.23915/distill.00002

[29] Järvstråt L. Functionality Classification Filter for Websites [Internet] [Dissertation]. 2013. Available from: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-93702

[30] Broder A, Glassman S, Manasse M, Zweig G. Syntactic clustering of the Web. Computer Networks and ISDN Systems. 1997; 29(8–13):1157-66

[31] Yang K. Li Q. Sun L Towards automatic fingerprinting of IoT devices in the cyberspace. Computer Networks. 2019; 148:318-27

[32] Le Pochat V, Van Goethem T, Tajalizadehkhoob S, Korczyński M, Joosen W. TRANCO: A Research-Oriented Top Sites Ranking Hardened Against Manipulation. San Diego, CA, USA; Network and Distributed System Security Symposium (NDSS); 2019. 26. Available from: https://dx.doi.org/10.14722/ndss.2019.23386

[33] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O et al. Scikit-learn: Machine Learning in Python [Internet] JMLR; 2011. 12(85);2825−2830 [cited 2020 May 13] Available from: http://jmlr.csail.mit.edu/papers/volume12/pedregosa11a/pedregosa11a.pdf

[34] Scikit-learn. sklearn.manifold.TSNE [Internet] ©2019 [cited 2020 May 13] Available from: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

[35] Fielding R, Reschke J. Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. IETF RFC. 2014;7231 Available from: https://tools.ietf.org/html/rfc7231

[36] Google. Google Cloud [Internet] 2020 [cited 2020 May 13] Available from: https://cloud.google.com/

[37] Python. Python [Internet] 2001 [cited 2020 May 13] Available from: https://www.python.org/

[38] Golang. Go [Internet] 2012 [cited 2020 May 13] Available from: https://golang.org/

[39] Chan J. Meet the team: Tom Hudson - Collaboration is the way forward [Internet] Detectify AB; 2020 [cited 2020 May 13] Available from: https://blog.detectify.com/2020/04/15/meet-the-team-tom-hudson-collaboration-is-the-way-forward/

[40] Hunter J. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 2007;9(3) 90-5

[41] Wordpress. Wordpress [Internet] 2003 [cited 2020 May 13] Available from: https://wordpress.org/

[42] jQuery. jQuery [Internet] ©2020 [cited 2020 May 13] Available from: https://jquery.com/

[43] Drupal. Drupal [Internet] 2001 [cited 2020 May 13] Available from: https://www.drupal.org/

[44] Microsoft. ASP.NET [Internet] ©2020 [cited 2020 May 13] Available from: https://dotnet.microsoft.com/apps/aspnet

[45] AddThis. AddThis [Internet] ©2020 [cited 2020 May 13] Available from: https://www.addthis.com/

Image references

Figure 1: Wappalyzer. Wappalyzer [Internet]. 2010. [cited 2020 Apr 13]. Available from: https://github.com/AliasIO/wappalyzer/blob/master/src/apps.

Figure 4: James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. corr. at 8. printing. New York: Springer; 2013. Figure 10.9, p 392

Figure 5: Kriegel H.P., Kröger P, Sander J, Zimek A. Density-based clustering. WIREs Data Mining Knowl Discov. 2011; volume 1 (issue 3):231-240. Figure 1, p 232

Figure 6: Bergqvist J. Study of Protein Interfaces with Clustering [master's thesis]. Linköping: Linköping University; 2018 [cited 2020 May 13]. Available from: http://liu.diva-portal.org/smash/get/diva2:1260491/FULLTEXT01.pdf

Figure 8: Icke I. Overfitting [Figure]. 2008 [cited 2020 May 14]. Available from: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg (CC BY-SA 4.0)


Appendix

Appendix 1

Runtime of our model in seconds. Described in more detail in section 4.4:

519.8347115516663

508.6881308555603

467.43011951446533

478.28732109069824

472.72682905197144

472.19122409820557

545.0504710674286

529.9625012874603

500.3084943294525

470.3107054233551


Appendix 2

Technologies found by Wappalyzer in the list of 23 457 sites (the list of sites is described in section 4.2.1).

The number represents how many times the technology appeared, followed by the name of the technology.

1 Adobe GoLive 1 Scientific 1 A-Frame 1 Seravo 1 Akka HTTP 1 Shopware 1 AMP Plugin 1 sIFR 1 Analysys Ark 1 Simple Analytics 1 Aurelia 1 SMF 1 BoldGrid 1 Stackla 1 Bolt 1 Sulu 1 1 TwistedWeb 1 Chevereto 1 TypePad 1 ClickFunnels 1 1 cPanel 1 Virgool 1 CS Cart 1 VuePress 1 decimal.js 1 W3Counter 1 DedeCMS 1 Whooshkaa 1 Discourse 1 Yahoo! Ecommerce 1 ef.js 1 Yahoo! Tag Manager 1 EPiServer 1 Zend 1 Epoch 2 Adobe ColdFusion 1 FAST ESP 2 AngularDart 1 Flarum 2 ApexPages 1 Freespee 2 Asciinema 1 Gallery 2 Atlassian Jira 1 Gauges 2 Bablic 1 GoDaddy 2 Backdrop 1 Grav 2 Bootstrap Table 1 GX WebManager 2 CDN77 1 Hello Bar 2 CherryPy 1 HTTP/2 2 ClickHeat 1 ImpressPages 2 1 iWeb 2 Countly 1 Jahia DX 2 Dart 1 JavaScript Infovis Toolkit 2 Doxygen 1 JS Charts 2 Fat-Free Framework 1 JSEcoin 2 Gridsome 1 LiveAgent 2 ikiwiki 1 LiveStreet CMS 2 IntenseDebate 1 mini_httpd 2 Jalios 1 MoinMoin 2 1 Moon 2 Lithium 1 MyBB 2 math.js 1 Odoo 2 1 Paths.js 2 mod_python 1 PhotoShelter 2 MongoDB 1 phpwind 2 Moodle 1 PostgreSQL 2 1 RoundCube 2 Mura CMS 1 Sapper

2 Netsuite 3 Sails.js 2 Octopress 3 Sphinx 2 Olark 3 Twitter Flight 2 OpenBSD httpd 3 Vignette 2 Open Web Analytics 3 WP-Statistics 2 Oracle 3 Xajax 2 Oracle 4 Atlassian Jira Issue Collector 2 PrestaShop 4 Bonfire 2 Salesforce 4 Eloqua 2 shine.js 4 Flat UI 2 Slimbox 4 Gentoo 2 Slimbox 2 4 IBM WebSphere Portal 2 Supersized 4 JBoss Application Server 2 SUSE 4 JBoss Web 2 4 K2 2 Typecho 4 Livefyre 2 VTEX 4 Mobify 2 Websocket 4 Mobirise 2 Weebly 4 MODX 2 Wink 4 Oracle Commerce Cloud 2 X-Cart 4 Orchard CMS 2 XOOPS 4 osCommerce 2 XpressEngine 4 Powergap 2 Zanox 4 2 Zimbra 4 Reveal.js 3 ArcGIS API for JavaScript 4 Starlet 3 Business 4 ThinkPHP 3 Commerce Server 4 TornadoServer 3 DatoCMS 4 Transifex 3 Docusaurus 4 Weglot 3 Fedora 4 YUI Doc 3 Get Satisfaction 5 ADPLAN 3 GoCache 5 Adverticum 3 Google Charts 5 Bigcommerce 3 govCMS 5 3 GrowingIO 5 Contensis 3 jQuery Sparklines 5 Go 3 Kampyle 5 Hexo 3 Less 5 Highstock 3 Mermaid 5 imperia CMS 3 Mint 5 KeyCDN 3 mod_fastcgi 5 Pimcore 3 nopCommerce 5 Pygments 3 October CMS 5 Rickshaw 3 OpenCms 5 ShinyStat 3 Open Journal Systems 5 Solve Media 3 Percussion 5 3 phpBB 5 Tumblr 3 Ramda 6 Blogger

6 DokuWiki 10 Sensors Data 6 Ghost 11 Act-On 6 ip-label 11 Material 6 Medium 11 Ant Design 6 NVD3 11 DoubleClick Campaign Manager 6 OpenLayers (DCM) 6 Oracle Recommendations On 11 Microsoft Word Demand 11 Moment Timezone 6 SPIP 11 Now 6 styled-components 11 SyntaxHighlighter 6 Tilda 11 UserVoice 6 Titan 12 CodeMirror 6 Yieldlab 12 7 CFML 12 KineticJS 7 Day.js 12 7 DM Polopoly 12 Marked 7 eZ Publish 12 Open AdStream 7 Freshmarketer 12 three.js 7 Froala Editor 13 basket.js 7 IBM DataPower 13 FreeBSD 7 Immutable.js 13 Liquid Web 7 13 Mautic 7 MaxCDN 13 Po.st 7 mod_dav 13 SDL Tridion 7 PDF.js 14 Bold Chat 7 14 Braintree 7 RD Station 14 CakePHP 7 ShareThis 14 GOV.UK Frontend 7 SilverStripe 14 Sqreen 7 Wix 15 BEM 7 15 DNN 8 15 Kendo UI 8 Bulma 15 Woopra 8 IPB 16 Ahoy 8 Oracle HTTP Server 16 Cufon 8 Play 16 EasyEngine 8 Riot 16 Jekyll 8 Scala 16 9 Awesomplete 16 RightJS 9 Concrete5 16 Semantic-ui 9 F5 BigIP 17 AMP 9 FrontPage 17 9 Prefix-Free 17 mod_jk 9 Quill 17 OpenGSE 9 Section.io 17 Reddit 9 Tencent Waterproof Wall 18 AT Internet XiTi 10 Cross 18 Discuz! X 10 ExtJS 18 Dynatrace 10 Methode 18 Freshchat

18 Kestrel 33 SoundManager 18 mod_wsgi 34 OpenX 18 XRegExp 35 EdgeCast 19 Adzerk 35 LivePerson 19 AlloyUI 35 Materialize CSS 19 amCharts 36 Linkedin 19 Fireblade 36 Lite 19 Marionette.js 36 XenForo 19 RxJS 37 Inspectlet 19 Webtrends 37 scrollreveal 20 Plone 38 Kentico CMS 21 AdOcean 38 Liferay 22 mod_ssl 38 Navegg 22 Signal 38 Pure CSS 23 Bloomreach 38 .svg 23 Chorus 40 List.js 23 SiteGround 40 24 Ember.js 40 Platform.sh 24 Glyphicons 40 Rubicon Project 24 Squiz Matrix 40 script.aculo.us 24 Usabilla 41 Brightspot 25 JavaServer Pages 41 Flywheel 25 NextGEN Gallery 41 Google Code Prettify 25 TinyMCE 41 jQuery-pjax 26 Element UI 41 Oracle Commerce 26 MediaWiki 41 particles.js 27 Apollo 42 Yandex.Direct 27 Carbon Ads 43 IBM WebSphere Commerce 27 DataLife Engine 43 Zipkin 27 Oracle Dynamic Monitoring Service 44 27 SumoMe 44 MobX 27 Webtrekk 44 Smart Ad Server 27 WP 45 Hybris 28 AdRiver 46 SweetAlert2 28 Craft CMS 47 MathJax 28 ExpressionEngine 47 PayPal 28 Squarespace 48 Raphael 29 Datadome 48 29 VigLink 49 Zendesk Chat 30 Riskified 50 Magento 30 vBulletin 50 Shopify 31 1C-Bitrix 51 31 mod_perl 51 Socket.io 31 Swiftype 52 Hogan.js 32 HeadJS 54 PubMatic 32 Hugo 54 Twitter typeahead.js 32 Sitefinity 55 SweetAlert 32 Tawk.to 55 Webflow 32 WordPress Super Cache 56 Bounce Exchange

58 Amazon ECS 95 BugSnag 58 Arc Publishing 97 Salesforce Commerce Cloud 58 BuySellAds 99 Zepto 59 Dojo 102 Highcharts 59 Servlet 104 Ruxit 59 LiveChat 105 DataTables 59 UIKit 105 prettyPhoto 60 Enhanced 108 CodeIgniter eCommerce 109 Gatsby 60 INFOnline 111 Nuxt.js 60 parcel 111 61 jQuery Mobile 112 Contentful 61 Mouse Flow 116 DoubleClick Ad Exchange (AdX) 61 RackCache 117 WooCommerce 61 Sizmek 118 62 GitHub Pages 119 Elementor 62 Microsoft SharePoint 120 Fastly 65 Ionicons 121 Azure CDN 66 121 DreamWeaver 66 Microsoft HTTPAPI 124 All in One SEO Pack 67 Heap 127 Pardot 67 Plesk 130 Prototype 68 Cowboy 130 Sucuri 68 Erlang 135 Fingerprintjs 69 CKEditor 138 Intercom 69 Unbounce 139 Clipboard.js 70 Highlight.js 139 Red Hat 71 Statcounter 139 Revslider 73 FlexSlider 139 YUI 74 Docker 141 75 Angular 141 Mixpanel 75 .js 144 76 Algolia 147 Ensighten 76 Clicky 147 Zone.js 76 Leaflet 155 Parse.ly 77 MooTools 156 Flickity 79 156 Outbrain 80 Netlify 159 Mustache 81 D3 160 82 AT Internet Analyzer 165 Envoy 83 Gravity Forms 167 Tengine 83 Prism 168 AddToAny 84 TYPO3 CMS 169 Automattic 87 Kinsta 169 Gravatar 87 W3 Total Cache 171 Gemius 88 Amplitude 171 WordPress VIP 91 LiteSpeed 175 Sitecore 94 Litespeed Cache 185 AdRoll 94 Pinterest 187 Disqus

188 Stripe 456 animate. 189 Google PageSpeed 462 Hammer.js 194 Python 497 Vue.js 195 MediaElement.js 502 Google Plus 195 Yandex.Metrika 515 Sentry 201 Chart.js 528 Optimizely 205 CentOS 529 Chartbeat 205 VideoJS 531 218 Azure 532 Akamai 219 SWFObject 532 FancyBox 226 Amazon S3 554 Ruby 231 Next.js 579 RequireJS 239 587 Quantcast 247 Google Web Server 594 Polyfill 250 Liveinternet 622 Criteo 262 Marketo 633 Adobe DTM 264 HubSpot 655 AddThis 269 TrackJs 656 273 Incapsula 694 Node.js 276 AppNexus 711 OWL Carousel 283 Hotjar 722 Swiper Slider 284 MariaDB 727 ZURB Foundation 284 Pantheon 743 Underscore.js 296 Acquia Cloud 864 296 Percona 877 Moment.js 298 Visual Website Optimizer 947 Babel 326 MailChimp 981 Yoast SEO 332 Crazy Egg 1002 Slick 333 Backbone.js 1021 Twitter 338 Google Cloud 1022 Twitter Emoji (Twemoji) 346 1052 YouTube 351 Amazon ELB 1166 Prebid 353 Select2 1208 Drupal 354 GSAP 1268 comScore 365 Tealium 1405 Amazon Cloudfront 367 Segment 1408 reCAPTCHA 372 OpenSSL 1487 React 380 Amazon EC2 1553 Microsoft ASP.NET 389 SiteCatalyst 1609 Google AdSense 395 WP Engine 1746 IIS 410 Typekit 1748 New Relic 413 Express 1754 Windows Server 430 Lightbox 1877 Java 437 AngularJS 1998 DoubleClick for Publishers (DFP) 439 Handlebars 2123 443 Matomo 2137 Amazon Web Services 448 Lua 2144 448 OpenResty 2924 jQuery Migrate 452 Adobe Experience Manager 2995 jQuery UI

3006 Modernizr 3096 Font Awesome 3481 WordPress 3583 MySQL 3899 Apache 3985 Facebook 4186 Bootstrap 5828 Google Font API 6523 Nginx 7305 PHP 7920 CloudFlare 9481 Google Analytics 10827 Google Tag Manager 13893 jQuery

Appendix 3

The graphs generated for each technology that had 90 or more occurrences according to Wappalyzer. The blue data points represent the sites that used the technology.

