Automatic Fingerprinting of Websites
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Automatic fingerprinting of websites
Using clustering and multiple bag-of-words models

Swedish title: Automatisk fingeravtryckning av hemsidor. Med användning av klustring och flera ordvektormodeller

ALFRED BERG
NORTON LAMBERG

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH

Supervisor: Shahid Raza
Examiner: Ibrahim Orhan
TRITA-CBH-GRU-2020:060
KTH School of Engineering Sciences in Chemistry, Biotechnology and Health, 141 52 Huddinge, Sweden

Abstract

Fingerprinting a website is the process of identifying which technologies the website uses, such as its web applications and JavaScript frameworks. Current fingerprinting methods rely on manually created fingerprints for each technology they look for. These fingerprints consist of multiple text strings that are matched against an HTTP response from a website. Creating a fingerprint for each technology can be time-consuming, which limits the set of technologies that fingerprints can be built for. This thesis presents a potential solution that uses unsupervised machine learning techniques to cluster websites by the web applications and JavaScript frameworks they use, without requiring manually created fingerprints. Our solution uses multiple bag-of-words models combined with the dimensionality reduction technique t-SNE and the clustering algorithm OPTICS. Results show that some technologies, for example Drupal, achieve a precision of 0.731 and a recall of 0.485 without any training data. These results lead to the conclusion that the proposed solution could plausibly be used to cluster websites by the web applications and JavaScript frameworks in use.
However, further work is needed to increase the precision and recall of the results.

Keywords

Clustering, fingerprinting, OPTICS, t-SNE, headless browser, bag-of-words, unsupervised machine learning

Sammanfattning (Summary)

Fingerprinting a website means identifying which technologies the website uses, such as its web applications and JavaScript frameworks. Current methods for fingerprinting websites rely on manually created fingerprints for each technology they look for. These fingerprints consist of multiple text strings that are matched against HTTP responses from websites. Creating fingerprints can be a time-consuming process, which limits the set of technologies that fingerprints can be created for. This report presents a potential solution that uses unsupervised machine learning techniques to cluster websites by the web applications and JavaScript frameworks they use, without manually creating fingerprints. This is achieved by using multiple bag-of-words models together with the dimensionality reduction technique t-SNE and the clustering algorithm OPTICS. The results show that some technologies, for example Drupal, achieve a precision of 0.731 and a recall of 0.485 without any training data. This leads to the conclusion that the proposed solution could possibly be used to cluster websites by the web applications and JavaScript frameworks in use. However, more work is needed to increase the precision and recall of the results.

Nyckelord (Keywords)

Clustering, fingerprinting, OPTICS, t-SNE, headless browser, bag-of-words, unsupervised machine learning

Acknowledgment

We would like to give special thanks to Tom Hudson for offering technical counsel during the writing of this thesis and for providing the basis of the data collection tool, which we continued to develop.
Table of contents

1 Introduction
  1.1 Problem statement
  1.2 Goal of the project
  1.3 Scope of the project and limitations
2 Theory and background
  2.1 Fingerprinting
  2.2 Headless browser
  2.3 JavaScript window object
  2.4 HTML document
  2.5 Supervised and unsupervised learning
  2.6 Clustering
    2.6.1 Partitional clustering algorithms
    2.6.2 Hierarchical clustering algorithms
    2.6.3 Density-based clustering
    2.6.4 Clustering performance evaluation
  2.7 Dimensionality Reduction and sample size
    2.7.1 Sample Size
    2.7.2 Curse of dimensionality
    2.7.3 SVD and truncated SVD
    2.7.4 t-SNE
  2.8 Feature extraction
    2.8.1 Bag-of-Words
    2.8.2 N-Gram
  2.9 Related Works
3 Methodology
  3.1 Supervised learning or unsupervised learning
  3.2 Data collection
  3.3 Feature extraction
    3.3.1 Dimensionality Reduction
    3.3.2 Clustering algorithm
  3.4 Alternative method
  3.5 Architecture
    3.5.1 Hardware
    3.5.2 Software
4 Results
  4.1 Labeling the data for comparison
  4.2 Results of our method
    4.2.1 Dataset
    4.2.2 K-Means clustering
    4.2.3 OPTICS clustering
  4.3 Observations
    4.3.1 Truncated SVD
    4.3.2 Wappalyzer false negative
    4.3.3 Empty and almost empty data from sites
  4.4 Runtime
  4.5 Evaluation results
    4.5.1 Wordpress
    4.5.2 jQuery
    4.5.3 Drupal
    4.5.4 ASP.NET
    4.5.5 AddThis
5 Discussion
  5.1 Clustering results