JAISCR, 2020, Vol. 10, No. 4, pp. 243 – 253 10.2478/jaiscr-2020-0016

BROWSER FINGERPRINT CODING METHODS INCREASING THE EFFECTIVENESS OF USER IDENTIFICATION IN THE WEB TRAFFIC

1, 2 3 Marcin Gabryel ∗, Konrad Grzanek , Yoichi Hayashi

1Department of Computer Engineering, Czestochowa University of Technology, al. Armii Krajowej 36, 42-200 Cze¸stochowa, Poland

2Information Technology Institute, University of Social Sciences, 90 - 113 Lodz Clark University, Worcester, MA 01610, USA

3Department of Computer Science, Meiji University, Japan

∗E-mail: [email protected]

Submitted: 14th October 2019; Accepted: 29th April 2020

Abstract

Web-based browser fingerprint (or device fingerprint) is a tool used to identify and track user activity in web traffic. It is also used to identify computers that are abusing online advertising and also to prevent credit card fraud. A device fingerprint is created by extract- ing multiple parameter values from a browser API (e.g. type or browser version). The acquired parameter values are then used to create a hash using the hash func- tion. The disadvantage of using this method is too high susceptibility to small, normally occurring changes (e.g. when changing the browser version number or screen resolution). Minor changes in the input values generate a completely different fingerprint hash, mak- ing it impossible to find similar ones in the database. On the other hand, omitting these unstable values when creating a hash, significantly limits the ability of the fingerprint to distinguish between devices. This weak point is commonly exploited by fraudsters who knowingly evade this form of protection by deliberately changing the value of device pa- rameters. The paper presents methods that significantly limit this type of activity. New algorithms for coding and comparing fingerprints are presented, in which the values of parameters with low stability and low entropy are especially taken into account. The fin- gerprint generation methods are based on popular Minhash, the LSH, and autoencoder methods. The effectiveness of coding and comparing each of the presented methods was also examined in comparison with the currently used hash generation method. Authentic data of the devices and browsers of users visiting 186 different websites were collected for the research. Keywords: browser fingerprint, device fingerprint, LSH algorithm, autoencoder

1 Introduction By collecting information about the user, the pages they visit, or their planned purchases, it is possi- The process of identifying the users and their ble to develop a useful profile, for example, when behavior on the Internet is rather commonplace. 244 Marcin Gabryel, Konrad Grzanek, Yoichi Hayashi communicating advertising content to them. The once the browser version is changed, the same user mechanism for user identification is that of storing will be recognized as two different persons. a unique identifier in a cookie on their device [1]. The aim of this paper is to propose new meth- This type of method is widely criticized for the pos- ods of encoding fingerprint browser features and sibility of privacy violation. This problem has been a method of comparing them, taking into account noticed by the European Parliament and for several changes in the values of these features. The paper years now, information about the use of cookies has presents a research project showing the dynamics of had to be placed on websites [2]. However, despite changes taking place in the fingerprint during sub- the obvious need for privacy, unambiguous identi- sequent visits of a given user to the website. The fication of the device or the browser is extremely features will be divided into a group of stable pa- helpful in ensuring security and preventing abusive rameters and a group of unstable ones. This knowl- practices on the Internet. Numerous fraudulent ac- edge will be used to develop two new methods of tivities include credit card payment scams [3], ar- comparing fingerprints: tificial generation of Internet traffic, the so-called abusive traffic [4], generating click fraud [5, 6] or – a new method of hash generation using the Min- automating the collection of web site contents [7]. Hash technique compatible with the LSH algo- Based on various reports [8, 9], losses due to fraud rithm, in online advertising alone can range from 6.5 to 19 billion dollars. Fraud prevention is primarily about – a dedicated neural network with an auto-encoder identifying the perpetrator and blocking their ac- structure to encode the abovementioned two tions. Unfortunately, the HTTP protocol used on groups of fingerprint features. the Internet only allows the user to be identified by The encoding effectiveness of both methods the imperfect cookie mechanism. It is very easy will be experimentally tested using the data ob- to get around it by removing or ignoring cookies. tained over a period of 3 months from 186 different Other methods, such as trying to eliminate fraud by websites. blocking IP numbers of devices, do not give the ex- pected results. One public IP number can be used This paper is organized as follows. Section by multiple users simultaneously. An IP number 2 briefly discusses the related work highlighting can also be assigned dynamically. Connections via the proposed methods: issues connected with the proxy servers or VPN are also common. Hence, browser fingerprint, its practical applications and the best solution is to identify the user with the so- the possibilities offered by neural networks with the called browser fingerprint [10]. autoencoder structure for data encoding. Then, in Section 3, the definition of the browser fingerprint A browser fingerprint is a set of information is given, the characteristics of the features selected about a given browser, device, operating system and for its creation are assessed and their changes over environmental and location settings of the user [11]. time are analyzed. Section 4 discusses the proposed Due to a great variety of this information, finger- algorithms for comparing browser fingerprints. The print can be treated as a unique identifier. Unfortu- research tests showing their effectiveness are pre- nately, the values of the parameters collected may sented in Section 5. Section 6 concludes the paper change over time. This happens when updating the and offers suggestions for future work. browser or operating system version, resizing the screen, installing a new plugin, etc. Once a finger- print has been obtained, on average, it remains valid 2 Related works for several days [12]. That is why developing a way of comparing fingerprints taking into account the In paper [13], it was noticed for the first time changes occurring in them poses a rather demand- that there is a high likelihood of identifying the user ing challenge. Using hash functions to encode and through using various parameters extracted from search for a fingerprint is no longer relevant, as hash the browser being used. However, the scale of the functions generate completely different hash values research at that time remained rather limited. Only for minor input changes, which is particularly un- in the next study [10], where the research was con- desirable in the case of fingerprinting. For example, ducted on a larger number of users, was it shown Marcin Gabryel, Konrad Grzanek, Yoichi Hayashi BROWSER FINGERPRINT CODING METHODS . . . 245 communicating advertising content to them. The once the browser version is changed, the same user that fingerprinting can become a unique identifier. Methods of detecting fraud on the Internet with mechanism for user identification is that of storing will be recognized as two different persons. Works on browser or device fingerprinting usually the use of deep learning neural networks are also a unique identifier in a cookie on their device [1]. The aim of this paper is to propose new meth- contain comprehensive information on specific fea- one of the subjects of many academic papers. In This type of method is widely criticized for the pos- ods of encoding fingerprint browser features and tures comprising the fingerprint [3, 10, 11, 12, 14]. paper [21], the authors present a method of detect- sibility of privacy violation. This problem has been a method of comparing them, taking into account The studies also focus on the analysis of stability ing credit card fraud. The main contribution of their noticed by the European Parliament and for several changes in the values of these features. The paper of particular parameters [3, 12], examining differ- work is the development of a fraud detection sys- years now, information about the use of cookies has presents a research project showing the dynamics of ent types of browsers [11, 12] or applied security tem that employs a deep learning architecture to- had to be placed on websites [2]. However, despite changes taking place in the fingerprint during sub- features limiting the possibility of obtaining finger- gether with an advanced feature engineering pro- the obvious need for privacy, unambiguous identi- sequent visits of a given user to the website. The print features [12]. There are also analyses of creat- cess based on homogeneity-oriented behavior anal- fication of the device or the browser is extremely features will be divided into a group of stable pa- ing unique fingerprints over the years [15]. Despite ysis. In [22], an ensemble neural network adapted helpful in ensuring security and preventing abusive rameters and a group of unstable ones. This knowl- many recent changes in browsers, browser finger- as a hacking detection system to protect the com- practices on the Internet. Numerous fraudulent ac- edge will be used to develop two new methods of printing is still effective in user identification. The puter system against cyber-attacks is presented. The tivities include credit card payment scams [3], ar- comparing fingerprints: possibilities of creating cross-browser fingerprints presented ensemble neural network consists of an tificial generation of Internet traffic, the so-called when one user uses several browsers on one device autoencoder, a deep belief neural network, a deep abusive traffic [4], generating click fraud [5, 6] or – a new method of hash generation using the Min- are also investigated [16]. neural network and an extreme learning machine. automating the collection of web site contents [7]. Hash technique compatible with the LSH algo- The subject of practical use of browser finger- The system’s task is to monitor the activity within Based on various reports [8, 9], losses due to fraud rithm, printing appears in the literature in several con- a network of connected computers so as to analyze in online advertising alone can range from 6.5 to 19 the activity of intrusive patterns. In [23], an attempt – a dedicated neural network with an auto-encoder texts including [10, 14], bot and fraud billion dollars. Fraud prevention is primarily about prevention [7] and augmented authentication [3]. was made to detect fraud in biometric systems. To identifying the perpetrator and blocking their ac- structure to encode the abovementioned two detect this type of fraud, the authors propose a novel groups of fingerprint features. companies commonly use this tions. Unfortunately, the HTTP protocol used on technique to detect bots and unusual activity on method for fingerprint spoofing detection using the the Internet only allows the user to be identified by Deep Boltzmann Machines (DBM) for the extrac- The encoding effectiveness of both methods websites [17, 18]. In [7], it is shown that fin- the imperfect cookie mechanism. It is very easy tion of high-level features from images. will be experimentally tested using the data ob- gerprinting is a good method of detecting crawler to get around it by removing or ignoring cookies. tained over a period of 3 months from 186 different robots, but at the same time it can be bypassed with A special kind of neural networks are autoen- Other methods, such as trying to eliminate fraud by websites. little effort. The paper [19] presents the capabilities coders. They make it possible to use their deepest blocking IP numbers of devices, do not give the ex- offered by browser fingerprinting that can be used layer to encode input data. An example here is the This paper is organized as follows. Section pected results. One public IP number can be used in order to verify the software and hardware stack so-called semantic hashing published in [24], where 2 briefly discusses the related work highlighting by multiple users simultaneously. An IP number of a mobile or desktop . The presented system this technique of efficient information retrieval is the proposed methods: issues connected with the can also be assigned dynamically. Connections via can, for example, distinguish between traffic sent by presented. A document is fed to the input of a neu- browser fingerprint, its practical applications and proxy servers or VPN are also common. Hence, an original smartphone running an original browser ral network which generates a small binary vector. the possibilities offered by neural networks with the the best solution is to identify the user with the so- from an emulator or desktop client deceptively sim- Two similar documents will have two identical or autoencoder structure for data encoding. Then, in called browser fingerprint [10]. ulating the same configuration. very similar hashes. By indexing a given document Section 3, the definition of the browser fingerprint A browser fingerprint is a set of information with this hash, it is possible to find other similar is given, the characteristics of the features selected A number of publications address issues related about a given browser, device, operating system and documents almost immediately – one only should for its creation are assessed and their changes over to the abuse of online advertising and e-commerce. environmental and location settings of the user [11]. calculate the hash of a given document and search time are analyzed. Section 4 discusses the proposed They mainly concern the problem of click frauds Due to a great variety of this information, finger- for documents containing the same hash (or hashes algorithms for comparing browser fingerprints. The and credit card payments. Paper [4] proposes two print can be treated as a unique identifier. Unfortu- that differ from each other by from one to two bits). research tests showing their effectiveness are pre- novel inference techniques which can isolate click nately, the values of the parameters collected may sented in Section 5. Section 6 concludes the paper fraud attacks. One of them detects patterns of click change over time. This happens when updating the and offers suggestions for future work. reuse within an ad network clickstream and the sec- 3 Generating and analyzing the browser or operating system version, resizing the ond method, the bait-click defense, leverages the screen, installing a new plugin, etc. Once a finger- vantage point of an ad network to inject a pattern browser fingerprint parameters print has been obtained, on average, it remains valid 2 Related works of bait clicks into a user’s device. Further on, the 3.1 Definition for several days [12]. That is why developing a way authors in [20] deal with the problem of detecting of comparing fingerprints taking into account the In paper [13], it was noticed for the first time Internet merchant fraud. Goods or services offered A browser or device fingerprint is a set of data changes occurring in them poses a rather demand- that there is a high likelihood of identifying the user and sold at cheap rates, but never shipped is a sim- related to the user device. It contains information ing challenge. Using hash functions to encode and through using various parameters extracted from ple example of this type of fraud. The authors sug- about the hardware, operating system and browser, search for a fingerprint is no longer relevant, as hash the browser being used. However, the scale of the gest a framework to detect such fraudulent sellers and its configuration [11]. The information is col- functions generate completely different hash values research at that time remained rather limited. Only with the help of the support vector machine ap- lected only directly from the browser of a user by for minor input changes, which is particularly un- in the next study [10], where the research was con- proach. Javascript and by a web server. The user remains desirable in the case of fingerprinting. For example, ducted on a larger number of users, was it shown 246 Marcin Gabryel, Konrad Grzanek, Yoichi Hayashi unaware that they are being identified, as the use of For b = 2 it is the Shannon entropy and the re- browser fingerprinting leaves no trace. sult is in bits. The fingerprint’s features, along For a quick search and comparison, the col- with the calculated entropy, are presented in Ta- lected data set is given onto the input of a hash algo- ble 1. The data came from 131,326 users rithm. The hash is an alphanumeric string of fixed who made 365,209 visits to different websites length characters which becomes a unique identi- over a period of three months. The table does fier of a given browser [25]. This kind of finger- not include parameters with the entropy below print does not work in the case of mass-produced 0.1. Some parameters, such as screen_id and devices with limited configuration and upgrade pos- User-agent , were broken down into individual sibilities, such as smartphones. In the case of one elements. For screen_id it is width , height , model of the device, the collected data is identical, available_width and available_height . In the generating the same fingerprints. Another problem case of User-agent the whole sequence was di- is the instability of some features. For example, the vided into elements starting with prefix ua_ . software or operating system versions are regularly updated and then generate new fingerprints, too.

3.2 Parameters extracted from the browser As mentioned in Subsection 3.1, the data nec- essary for the browser fingerprint are extracted di- rectly from the browser. To cunduct the research under this paper the following features were se- lected:

– the features of the device (including device memory, color depth, logic cores, touch support, Figure 1. Changes in the browser fingerprints screen parameters, audio parameters) recorded daily in repeat visitors. – operating system parameters (i.e., its version, The same dataset was used to investigate the list of fonts, time zone) changes in the values of browser fingerprints. Fig- ure 3.2 presents a graph showing the number of – features of the browser (including its version, list times that the website was accessed again over the of plug-ins, language list, User-agent header, ad- following days. The graph also shows the number block information, database information, Web of users with a change in the value of at least one Storage mechanisms, screen resolution avail- fingerprint feature. The graph shows that on the sec- able, window resolution available, Do Not Track ond day about 12,000 users returned to the website, header) and among them as many as 2,000 showed a change – graphics card information (canvas fingerprint, in at least one feature. The last day of the research WebGL renderer) recorded a return of approximately 1,000 users who had accessed the website on the first day of the To calculate the level of identifying information project. All the repeat visitors displayed changes in each of the fingerprint features mentioned above, in at least one fingerprint feature. Figure 3.2 shows the measure of entropy is used. The higher the en- the percentage of visitors returning to the website tropy is, the more unique and identifiable a finger- over the following days. It also shows the percent- print. Let H be the entropy, X - a discrete random age of users visiting the website on that day when variable with possible values x1,...,xn and P(X) – each of them had a change in at least one feature. It a probability mass function. The entropy follows can be seen on day 20 that 56% of the repeat visi- this formula tors had already had a change of at least one feature. After 40 days, the changes had occurred in 80% of H(X)= P(x )log P(x ). (1) ∑ i b i the investigated users. Table 1 shows for each fea- − i Marcin Gabryel, Konrad Grzanek, Yoichi Hayashi BROWSER FINGERPRINT CODING METHODS . . . 247 unaware that they are being identified, as the use of For b = 2 it is the Shannon entropy and the re- ture the percentage of the users who had shown the 4 Browser fingerprint encoding browser fingerprinting leaves no trace. sult is in bits. The fingerprint’s features, along changes. This means that the data can be divided methods For a quick search and comparison, the col- with the calculated entropy, are presented in Ta- into a stable data group and an unstable data group. ble 1. The data came from 131,326 users 4.1 Hashing lected data set is given onto the input of a hash algo- Table 1. Obtainable device fingerprint features rithm. The hash is an alphanumeric string of fixed who made 365,209 visits to different websites over a period of three months. The table does Hashing refers to the process of generating a length characters which becomes a unique identi- fixed-size output from an input of a variable size. fier of a given browser [25]. This kind of finger- not include parameters with the entropy below Feature No. of No. of Group 0.1. Some parameters, such as screen_id and bits of changes This is done through the use of mathematical for- print does not work in the case of mass-produced entropy in % mulas known as hash functions (implemented as devices with limited configuration and upgrade pos- User-agent , were broken down into individual device_memory 1.66 0.0 1 elements. For screen_id it is width , height , do_not_track_val_id 0.21 0.1 1 hashing algorithms). A conventional hash function sibilities, such as smartphones. In the case of one fonts 1.36 0.4 1 available_width and available_height . In the should have collision resistance. This is a feature model of the device, the collected data is identical, audio_params_id 0.50 1.2 2 that prevents the same hash value from being ob- generating the same fingerprints. Another problem case of User-agent the whole sequence was di- webgl_vendor_id 1.66 0.0 1 vided into elements starting with prefix ua_ . webgl_renderer_id 5.35 0.1 1 tained from two different input sets. These two is the instability of some features. For example, the logic_cores 1.18 0.0 1 properties of the hash function, a fixed hash length software or operating system versions are regularly platform 1.48 0.0 1 timezone 0.15 0.0 1 and collision resistance, enable a quick search of updated and then generate new fingerprints, too. app_version 10.97 64.6 2 large data sets. Instead of comparing dozens of pa- touch_enabled 0.68 0.0 1 max_touch_points 0.84 0.0 1 rameters for identical values, only hash values are 3.2 Parameters extracted from the screen_id 4.78 1.7 - compared with each other. The introduction out- width 2.60 1.2 2 browser height 4.32 1.1 2 lines the disadvantages of using this solution for As mentioned in Subsection 3.1, the data nec- av_width 2.63 1.4 2 searching and comparing browser fingerprints. In av_height 4.58 1.4 2 the following Subsections, new methods for obtain- essary for the browser fingerprint are extracted di- adblock_enabled 0.26 0.6 1 rectly from the browser. To cunduct the research canvas_2d_fingerprint 5.62 12.9 2 ing hashes and their use in searching for similar browser_plugins_hash 0.88 0.0 1 browser fingerprints are proposed. under this paper the following features were se- user-agent 10.98 64.7 - lected: br_version 4.26 62.9 2 os_version 2.8 4.9 2 4.2 The LSH algorithm app_version 10.97 64.4 2 – the features of the device (including device platform 1.48 0.0 1 The main task of the Locality-Sensitive Hash- memory, color depth, logic cores, touch support, Figure 1. Changes in the browser fingerprints ua_device_brand_name 2.56 0.0 1 ua_device_model 6.51 0.0 1 ing (LSH) algorithm is to quickly compare docu- screen parameters, audio parameters) recorded daily in repeat visitors. ua_client_name 1.87 0.0 1 ments in terms of their contents. The documents do ua_client_version 4.76 63.3 2 – operating system parameters (i.e., its version, The same dataset was used to investigate the ua_client_type 0.39 0.0 1 not need to be identical, as it is in the case of the list of fonts, time zone) changes in the values of browser fingerprints. Fig- ua_device_type 1.08 0.0 1 hash function, because the proposed method is not ua_device_brand 2.56 0.0 1 ure 3.2 presents a graph showing the number of ua_device_code 6.64 0.0 1 resistant to minor changes in the document. The – features of the browser (including its version, list times that the website was accessed again over the ua_os_name 0.78 0.0 1 LSH algorithm consists of three steps: ua_os_version 2.90 5.1 2 of plug-ins, language list, User-agent header, ad- following days. The graph also shows the number block information, database information, Web ua_preferred_client_name 2.32 0.2 1 of users with a change in the value of at least one ua_preferred_client_version 5.05 59.3 2 – transforming the document into a set of char- Storage mechanisms, screen resolution avail- fingerprint feature. The graph shows that on the sec- ua_preferred_client_type 0.31 0.2 1 acters of length k (the shingling method, also able, window resolution available, Do Not Track ond day about 12,000 users returned to the website, known as k-shingles or k-grams method), header) and among them as many as 2,000 showed a change – compressing the shingles set using the "Min- in at least one feature. The last day of the research – graphics card information (canvas fingerprint, Hashing" method, so that the similarity of the recorded a return of approximately 1,000 users who WebGL renderer) base sets of documents in their compressed ver- had accessed the website on the first day of the sions can still be checked, To calculate the level of identifying information project. All the repeat visitors displayed changes in each of the fingerprint features mentioned above, in at least one fingerprint feature. Figure 3.2 shows – the LSH algorithm, which allows us to find the the measure of entropy is used. The higher the en- the percentage of visitors returning to the website most similar pairs of documents or all pairs that tropy is, the more unique and identifiable a finger- over the following days. It also shows the percent- are above some lower bound in similarity. print. Let H be the entropy, X - a discrete random age of users visiting the website on that day when variable with possible values x1,...,xn and P(X) – each of them had a change in at least one feature. It Shingling is an effective method of representing a probability mass function. The entropy follows can be seen on day 20 that 56% of the repeat visi- a document as a set. To generate the set, we need to this formula tors had already had a change of at least one feature. select short phrases or sentences from the document After 40 days, the changes had occurred in 80% of – the so-called shingles. This causes documents to H(X)= P(x )log P(x ). (1) Figure 2. Percentage of changes in the browser ∑ i b i the investigated users. Table 1 shows for each fea- − i fingerprints recorded daily in repeat visitors. have many common elements in their sets even if 248 Marcin Gabryel, Konrad Grzanek, Yoichi Hayashi the sentences appear in documents in a different or- bucket under at least one of the hash functions, are der. false negatives. A detailed description of particular The next step consists in creating the so-called parts of the LSH algorithm can be found in many characteristic , where the columns contain works including [26] and [27]. sets of shingles of individual documents, and the In the proposed algorithm the i-th fingerprint consecutive lines correspond to individual shingles. parameters fi need to be divided into the stable In the matrix cells at the intersection of row i and fsi and unstable ones fni (see Section 3.2) and column j, there is value 1 in the case of the i-th f = fs, fn . A MinHash algorithm starts operat- i { i i} shingle in the j-th document. ing for each set of parameters, which will generate In the MinHash algorithm a so-called SIG sig- two signature matrixes SIGsk, j and SIGnk, j. The nature matrix is created with the dimensions m n, algorithm starts the search process twice, i.e. for × where each of the m documents corresponds to n the stable and unstable parameters. The following signatures. The matrix is calculated by performing steps of the algorithm are presented in Algorithm 2. random and independent n permutations of m rows Its operation results in returning a set of fingerprints of the characteristic matrix. The MinHash value for fqn similar to fingerprint fq. the column of the j-th document is the number of The LSH algorithm for searching for the first row (in the order resulting from the permu- Algorithm 2 similar browser fingerprints in relation to parameter tations), for which this column has value 1. These stability. calculations are time-consuming, therefore instead of selecting random n row permutations, random n Initial procedure: hash functions h ,h ,...h are selected. The signa- 1 2 n 1. Prepare two groups of parameters: stable ones ture matrix is built taking into account each row in fs = fs1,..., fsm and unstable ones fn = the given order. Let SIG , be an element of the sig- { } k j f ,..., f , where m – number of finger- nature matrix for the k-th hash function and column { n1 nm} prints. j of document d j. The next steps in generating the signatures matrix are shown in Algorithm 1. 2. According to Algorithm 1 determine two sig-

nature matrices SIGsk, js and SIGnk, jn (for sta- Algorithm 1 Algorithm for generating the signa- ble and unstable parameters, respectively) for tures matrix. two sets ds and dn, k = 1,...,m, js = 1,...,ns, jn = 1,...,nn, ns, nn – number of signatures for 1. Initially, set SIGk, j to ∞ for all values of k and j. stable and unstable parameters.

2. For each i-th row of the characteristic matrix re- 3. Determine the similarity thresholds ts and tn for peat points 2 and 3. stable and unstable parameters, respectively.

3. Calculate h1( j),h2( j),...,hn( j). To find fingerprints fqn similar to fingerprint fq the following steps need to be carried out: 4. For each column j, check if there is 1 in row i. If yes, then for each k = 1,2,...,n, set SIGk, j = 1. Start the LSH algorithm using the stable param- min(SIGk, j,hk( j)). eters fs to find similar candidate pairs fqs

The idea of the LSH algorithm allows check- fqs = LSH(SIGs( fq),SIGs( fs),ts), ing the similarity of two elements. As a result of its operation, information is returned whether the where: SIGs( fq), SIGs( fs) – signature matrix pair forms a so-called "candidate pair", i.e. whether values for stable parameters obtained for the pa- their similarity is greater than a specified threshold rameters of fingerprints fq and fs. t (similarity threshold). Any pair that hashed to the 2. Having found fingerprints fqs do one more same bucket is considered as a “candidate pair”. search for a similar fingerprint, but this time us- Dissimilar pairs that do hash to the same bucket ing unstable parameters fn are false positives. On the other hand, those pairs, which despite being similar, do not hash to the same fqn = LSH(SIGn( fq),SIGn( fqs),tn), Marcin Gabryel, Konrad Grzanek, Yoichi Hayashi BROWSER FINGERPRINT CODING METHODS . . . 249

the sentences appear in documents in a different or- bucket under at least one of the hash functions, are where: SIGn( fq), SIGn( fqs) – signature matrix whole training set der. false negatives. A detailed description of particular values for unstable parameters obtained for the ˙ parts of the LSH algorithm can be found in many parameters of fingerprints fq and fn. θ,θ = argmaxL(x,y). (6) The next step consists in creating the so-called θ,θ˙ characteristic matrix, where the columns contain works including [26] and [27]. 3. Return obtained similar fingerprints fqn. sets of shingles of individual documents, and the In the proposed algorithm the i-th fingerprint The autoencoder presented above is discussed consecutive lines correspond to individual shingles. parameters fi need to be divided into the stable in relation to continuous data x. In the case of cate- In the matrix cells at the intersection of row i and fsi and unstable ones fni (see Section 3.2) and 4.3 Deep learning methods – autoencoders gorical data, one-hot encoded data are given into the column j, there is value 1 in the case of the i-th fi = fsi, fni . A MinHash algorithm starts operat- input. In the case of one variable x of the categori- { } An autoencoder is a neural network with at shingle in the j-th document. ing for each set of parameters, which will generate least one hidden layer. The input and output have cal type, input dimension m is equal to the number SIGs SIGn In the MinHash algorithm a so-called SIG sig- two signature matrixes k, j and k, j. The the same size. The autoencoder is trained in such of categories. Each category is numbered with con- nature matrix is created with the dimensions m n, algorithm starts the search process twice, i.e. for a way that the values given onto its input are to secutive natural numbers. Single vector xi is filled × where each of the m documents corresponds to n the stable and unstable parameters. The following be copied onto its output. The encoder aims to with zeros and contains a single value 1 in place j, signatures. The matrix is calculated by performing steps of the algorithm are presented in Algorithm 2. compress data to a low-dimensional representation, where j is the category number. For this type of random and independent n permutations of m rows Its operation results in returning a set of fingerprints while the decoder aims to reconstruct the input data data function s( ) in formula (3) will take the value f f · of the characteristic matrix. The MinHash value for qn similar to fingerprint q. from the low-dimensional representation generated of softmax [29], where for each of the outputs k the column of the j-th document is the number of by the encoder [28]. t Algorithm 2 The LSH algorithm for searching for e k the first row (in the order resulting from the permu- m s(t)k = m t , (7) similar browser fingerprints in relation to parameter The learning set is X = x1,x2,...,xN R ∑ e j tations), for which this column has value 1. These { }∈ j=1 stability. where xi is the m-dimensional feature vector, N – calculations are time-consuming, therefore instead where t Rm, and t is j-th element of vector t. For Initial procedure: the number of samples. The encoder maps input j of selecting random n row permutations, random n n ∈ vector xi onto the hidden representation hi R us- the softmax function, the loss function L(xi,yi) is hash functions h1,h2,...hn are selected. The signa- ∈ 1. Prepare two groups of parameters: stable ones ing function fθ as in calculated by using the categorical cross-entropy ture matrix is built taking into account each row in fs = fs1,..., fsm and unstable ones fn = m the given order. Let SIGk, j be an element of the sig- { } hi = fθ(xi)=s(Wxi + b), (2) fn1,..., fnm , where m – number of finger- L(xi,yi)= ∑ xijlog(yij), (8) nature matrix for the k-th hash function and column { } − = prints. m n j 1 j of document d . The next steps in generating the where W R × is a set of weights, n is the num- j ∈ signatures matrix are shown in Algorithm 1. ber of units in the hidden layer h, b Rn is a bias where m is the number of outputs (number of cate- 2. According to Algorithm 1 determine two sig- ∈ nature matrices SIGs and SIGn (for sta- vector, θ is set W,b , and s( ) is the adopted ac- gories). k, js k, jn { } · Algorithm 1 Algorithm for generating the signa- ble and unstable parameters, respectively) for tivation function (sigmoid function) determined by For encoding stable and unstable fingerprint tures matrix. the following formula two sets ds and dn, k = 1,...,m, js = 1,...,ns, features, there are two hashes required. The au- jn = 1,...,nn, ns, nn – number of signatures for 1 toencoder will, therefore, consist of two pairs of ( )= . 1. Initially, set SIGk, j to ∞ for all values of k and j. stable and unstable parameters. s t t (3) encoder-decoders for stable and unstable parame- 1 + exp− ters, respectively. The autoencoder will thus have 2. For each i-th row of the characteristic matrix re- 3. Determine the similarity thresholds ts and tn for The decoder maps back values hi obtained in the stable and unstable parameters, respectively. two hidden layers, h1 and h2, whose outputs will re- peat points 2 and 3. hidden layer onto the output vector y Rm accord- i ∈ turn the values of the fingerprint hashes given onto ing to the following formula 3. Calculate h1( j),h2( j),...,hn( j). To find fingerprints fqn similar to fingerprint fq the the network input. The diagram of such an autoen- coder is shown in Figure 3. following steps need to be carried out: = ( )= ( ˙ + ˙), 4. For each column j, check if there is 1 in row i. yi gθ˙ xi s Whi b (4) If yes, then for each k = 1,2,...,n, set SIGk, j = 1. Start the LSH algorithm using the stable param- n m m where W˙ R × are weights, b˙ R is a bias vec- (SIG ,h ( j)) ∈ ∈ min k, j k . eters fs to find similar candidate pairs fqs and θ˙ = W˙ ,b˙ . { } The idea of the LSH algorithm allows check- fqs = LSH(SIGs( fq),SIGs( fs),ts), The learning of the autoencoder consists in min- ing the similarity of two elements. As a result of imizing the difference between input xi and output its operation, information is returned whether the where: SIGs( fq), SIGs( fs) – signature matrix yi. To this end a loss function is calculated as shown values for stable parameters obtained for the pa- pair forms a so-called "candidate pair", i.e. whether in the following formula their similarity is greater than a specified threshold rameters of fingerprints fq and fs. 2 2 t (similarity threshold). Any pair that hashed to the L(xi,yi)= xi yi = xi s(Ws˙ (Wxi +b)+b˙) . 2. Having found fingerprints fqs do one more ∥ − ∥ ∥ − ∥ same bucket is considered as a “candidate pair”. (5) search for a similar fingerprint, but this time us- Figure 3. Dedicated structure of autoencoders. Dissimilar pairs that do hash to the same bucket The aim of the learning is thus finding optimum val- ing unstable parameters fn are false positives. On the other hand, those pairs, ues of parameters θ and θ˙ facilitating the minimiz- There is a training set x x ,x Rm, where i ∈ 1 j 2k ∈ which despite being similar, do not hash to the same f = LSH(SIGn( f ),SIGn( f ),t ), ing of the error between the input and output for the x Rm1 , x Rm2 – stable and unstable input qn q qs n 1 j ∈ 2k ∈ 250 Marcin Gabryel, Konrad Grzanek, Yoichi Hayashi

data, m1 – the number of stable data, m2 – the num- times in one user within the three months were se- ber of unstable data, m = m1 + m2. The values of lected. The number of the data filtered in this way the hidden layers will then be amounted to ca. 213,000. About 33,000 unique users were identified. The data containing finger-

h1 j = fθ1 (x1 j)=s(W1xsj+ b1), (9) prints were then divided in the 80/20 ratio into train- ing and test groups. Most of the data were used to and initiate the MinHash algorithm or to learn the neural h2 j = fθ2 (x1 j)=s(W2xnk + b2), (10) network. The fingerprints from the test group were where: W Rm1 n1 , W Rm2 n2 – weights of the used to check the effectiveness of the search. 1 ∈ × 2 ∈ × hidden layers, n1, n2 is the number of units in the Precision and recall were used to evaluate the n n hidden layer h1 j and h2k, b1 R 1 , b2 R 2 – bias effectiveness of the tests [30]. Precision is the ratio ∈ ∈ vectors, θ1 = W1,b1 and θ2 = W2,b2 . The out- of the number of correctly classified data to the total { } m { } put value yi y1 j,y2k R of the autoencoder number of irrelevant and relevant data classified ∈{ }∈ will require certain calculations tp precision = , (14) tp+ fp y = g (x )=s(W˙ h + b˙ ) 1 j θ˙1 1 j 1 1 j 1 (11) and recall is the ratio between the number of data and that are correctly classified to the total number of y = g (x )=s(W˙ h + b˙ ). 2k θ˙2 2k 2 2 j 2 (12) positive data n m n m where: W˙ 1 R 1× 1 , W˙ 2 R 2× 2 are the weights, tp m ∈ m ∈ recall = , (15) b˙1 R 1 , b˙2 R 2 are the bias vectors and θ˙1 = + ∈ ∈ tp fn W˙ ,b˙ and θ˙ = W˙ ,b˙ . For this autoencoder { 1 1} 2 { 2 2} the loss function will be expressed by the following where tp – true positive, fp– false positive, fn – formula false negative and they can be derived from a con- fusion matrix [30]. The parameter which combines L(x1i,y1i)+L(x2i,y2i) the above two parameters is F1 score that it is the L(xi,yi)= 2 harmonic mean of precision and recall 2 2 x1i y1i + x2i y2i = ∥ − ∥ ∥ − ∥ 2 precision recall 2 F1 score = · · . (16) x s(W˙ h + b˙ ) 2 + x s(W˙ h + b˙ ) 2 precision + recall = ∥ 1i − 1 1 j 1 ∥ ∥ 2i − 2 2 j 2 ∥ . 2 The first of the conducted experiments consists in (13) testing the effectiveness of Algorithm 2. The ini- tiating part of this algorithm requires that the data After the learning process has been completed, only that constitute the browser fingerprint are divided encoders are used for further research. They are into two groups, i.e. stable and unstable (according treated as hash functions. The inputs of the en- to the results obtained in Section 3.2). The initial coders are given fingerprint feature values (as one- values of the algorithm parameters must then be de- hot encode data). The output values obtained by the termined: the number of signatures for encoding ns, two layers h1 and h2 are rounded to integers. nn and threshold values ts, tn. The selection of these values requires a number of tests with different 5 Experiments combinations of these parameters. The following values have been adopted for each parameter of the For this research on browser fingerprints cod- algorithm: n , n 2,4,8,16,24,32,64,96,128 , s n ∈{ } ing methods, authentic data were collected during t = 1, t 1,0.9,0.8,0.7,0.6,0.5 . The best val- s n ∈{ } visits to 186 different types of websites including ues were obtained for ns = 128. Table 2 presents online shops, companies offering various services a summary of the obtained results. The following and financial institutions, including banks. For 3 rows show the results for different tn values, while months, the total number of registered clicks totaled the columns show the results for different nn values. ca. 45 million. For the sake of the research, the data The best was a set of parameters ns = 128, nn = 4, for which the fingerprints parameters changed5 to 9 ts = 1 and tn = 0.5 lub tn = 0.6. Here the value Marcin Gabryel, Konrad Grzanek, Yoichi Hayashi BROWSER FINGERPRINT CODING METHODS . . . 251

data, m1 – the number of stable data, m2 – the num- times in one user within the three months were se- F1 = 0.35 for precision = 0.34 and recall = 0.36 The best results of the two coding and finger- ber of unstable data, m = m1 + m2. The values of lected. The number of the data filtered in this way was obtained. print benchmarking methods presented in this pa- the hidden layers will then be amounted to ca. 213,000. About 33,000 unique per are presented in Table 4. The results are com- users were identified. The data containing finger- Table 2. The F1 score obtained for ns = 128 and pared with the commonly used hashing method (see ts = 1. h1 j = fθ1 (x1 j)=s(W1xsj+ b1), (9) prints were then divided in the 80/20 ratio into train- Section 4.1). Two experiments using hash function ing and test groups. Most of the data were used to SHA1 were conducted. In the first case, stable and and initiate the MinHash algorithm or to learn the neural nn unstable features were given onto the hash function tn 2 4 8 16 24 h2 j = fθ2 (x1 j)=s(W2xnk + b2), (10) network. The fingerprints from the test group were input. In the second case only stable features were 1 0.298 0.297 0.281 0.286 0.287 m1 n1 m2 n2 used to check the effectiveness of the search. 0.9 0.298 0.297 0.281 0.285 0.279 given. The new methods proposed in the paper give where: W1 R × , W2 R × – weights of the ∈ ∈ 0.8 0.298 0.297 0.282 0.278 0.276 much better results than commonly used hashing hidden layers, n1, n2 is the number of units in the Precision and recall were used to evaluate the 0.7 0.298 0.297 0.289 0.284 0.286 n1 n2 methods. hidden layer h1 j and h2k, b1 R , b2 R – bias effectiveness of the tests [30]. Precision is the ratio 0.6 0.298 0.350 0.317 0.293 0.300 ∈ ∈ 0.5 0.321 0.350 0.345 0.327 0.334 vectors, θ1 = W1,b1 and θ2 = W2,b2 . The out- of the number of correctly classified data to the total { } m { } tn 32 48 64 96 128 Table 4. Comparison of the best results obtained. put value yi y1 j,y2k R of the autoencoder number of irrelevant and relevant data classified ∈{ }∈ 1 0.289 0.290 0.290 0.290 0.290 will require certain calculations tp 0.9 0.283 0.282 0.284 0.283 0.283 Method The best F1 score precision = , (14) 0.8 0.276 0.275 0.275 0.274 0.274 hashing stable and unstable param- 0.276 tp+ fp 0.7 0.285 0.277 0.273 0.272 0.270 eters (see 4.1) y = g (x )=s(W˙ h + b˙ ) 1 j θ˙1 1 j 1 1 j 1 (11) 0.6 0.285 0.285 0.281 0.279 0.273 hashing stable parameters 0.202 and recall is the ratio between the number of data 0.5 0.305 0.305 0.306 0.285 0.287 MinHash algorithms with LSH (see 0.350 and 4.2) that are correctly classified to the total number of autoencoders (see 4.3) 0.377 y = g (x )=s(W˙ h + b˙ ). The next experiment concerned the use of a 2k θ˙2 2k 2 2 j 2 (12) positive data n m n m neural network with a dedicated structure consist- where: W˙ 1 R 1× 1 , W˙ 2 R 2× 2 are the weights, tp m ∈ m ∈ recall = , (15) ing of two autoencoders (see Section 4.3). Several 6 Conclusion b˙1 R 1 , b˙2 R 2 are the bias vectors and θ˙1 = + ∈ ∈ tp fn possible combinations of neural network hyperpa- W˙ 1,b˙1 and θ˙2 = W˙ 2,b˙2 . For this autoencoder The paper presents two new methods of encod- { } { } where tp – true positive, fp– false positive, fn – rameters were tested. Each time a different number the loss function will be expressed by the following ing and comparing browser fingerprints: an algo- false negative and they can be derived from a con- of encoder and decoder layers, units in each layer formula rithm using MinHash encoding (cooperating) with fusion matrix [30]. The parameter which combines and the size of the deepest coding layers h1 and h2. the LSH and a dedicated structure of a neural net- L(x1i,y1i)+L(x2i,y2i) the above two parameters is F1 score that it is the were selected. The optimization parameters were L(xi,yi)= work consisting of two encoders. Both methods 2 harmonic mean of precision and recall assumed as follows: categorical cross-entropy as 2 2 the loss function, stochastic gradient descent opti- use the results of a previously conducted analysis x1i y1i + x2i y2i = ∥ − ∥ ∥ − ∥ 2 precision recall mizer, number of epochs – 250 and batch size 32. of changes in fingerprint characteristics over time. 2 F1 score = · · . (16) 2 2 precision + recall The F1 score values for different network structures This allowed selecting two groups of features: sta- x1i s(W˙ 1h1 j + b˙1) + x2i s(W˙ 2h2 j + b˙2) = ∥ − ∥ ∥ − ∥ . are presented in Table 3. The subsequent rows show ble and unstable ones. The presented experimental 2 The first of the conducted experiments consists in the examined structures of neural networks, speci- results showed that the effectiveness of searching (13) testing the effectiveness of Algorithm 2. The ini- fied number of units for encoding layers h and h , for similar fingerprints is much greater if the differ- tiating part of this algorithm requires that the data 1 2 After the learning process has been completed, only and the F1 score value obtained for the test data. ence between these two groups is taken into account that constitute the browser fingerprint are divided encoders are used for further research. They are The best results were achieved by network No. 4 when encoding. into two groups, i.e. stable and unstable (according treated as hash functions. The inputs of the en- considering the obtained F1 score and network size. A great advantage which the proposed meth- to the results obtained in Section 3.2). The initial coders are given fingerprint feature values (as one- ods offer is the possibility to easily save the results values of the algorithm parameters must then be de- . The F1 score results obtained for hot encode data). The output values obtained by the Table 3 of encoding in the database. This is possible both termined: the number of signatures for encoding ns, different structures of autoencoders. two layers h1 and h2 are rounded to integers. with the hashes obtained by MinHash algorithms nn and threshold values ts, tn. The selection of these No. Type Structure No. F1 and with the hashes obtained by the h and h au- values requires a number of tests with different 1 2 of the of score toencoder layers. This gives the possibility to put combinations of these parameters. The following pa- units 5 Experiments ram- of both algorithms into practice. Both methods can values have been adopted for each parameter of the eter layer also be used when creating device fingerprints for For this research on browser fingerprints cod- algorithm: ns, nn 2,4,8,16,24,32,64,96,128 , group h ∈{ } 1 512-256-64-256-512 64 different browsers used by one user [16]. In this ing methods, authentic data were collected during ts = 1, tn 1,0.9,0.8,0.7,0.6,0.5 . The best val- 1 0.327 ∈{ } 2 512-256-4-256-512 4 case, it will be necessary to perform an appropri- visits to 186 different types of websites including ues were obtained for ns = 128. Table 2 presents 1 512-256-32-256-512 64 2 0.303 ate analysis of the changes occurring in the finger- online shops, companies offering various services a summary of the obtained results. The following 2 512-256-8-256-512 4 1 640-368-128-368-640 128 print feature values beforehand and create appropri- and financial institutions, including banks. For 3 rows show the results for different tn values, while 3 0.361 2 256-128-8-128-256 8 ate groups of stable and unstable features. months, the total number of registered clicks totaled the columns show the results for different n values. 1 640-368-96-368-640 96 n 4 0.377 2 256-128-8-128-256 8 ca. 45 million. For the sake of the research, the data The best was a set of parameters ns = 128, nn = 4, Browser fingerprinting is a tool that helps to re- 1 256-128-32-128-256 32 5 0.298 for which the fingerprints parameters changed5 to 9 ts = 1 and tn = 0.5 lub tn = 0.6. Here the value 2 256-128-8-128-256 8 duce the number of fraudulent activities on the In- 252 Marcin Gabryel, Konrad Grzanek, Yoichi Hayashi ternet. The new algorithms proposed in this paper [12] Kobusinska, A., Pawluczuk, K., Brzezinski, J. may increase the level of detecting online fraudu- (2018). Big Data fingerprinting information analyt- lent activities being carried out by some users by ics for sustainability. Future Generation Computer identifying them more accurately and efficiently. Systems, 86, 1321-1337. [13] Mayer J R. 2009. Any person... a pamphleteer”: Internet Anonymity in the Age of Web 2.0. Un- References dergraduate Senior Thesis, Princeton University (2009). [1] Kristol D.M., HTTP cookies: Standards, privacy, and politics, ACM Trans. Internet Techn. 1 (2) [14] Steven E. and Arvind N. 2016. Online Tracking: A (2001) 151–198. 1-million-site Measurement and Analysis. In Pro- ceedings of the 2016 ACM SIGSAC Conference [2] Low C., Cookie law explained, 2016. on-line on Computer and Communications Security (CCS https://www.cookielaw.org/the-cookie-law/ (re- ’16). ACM, New York, NY, USA, 1388–1401. trieved:03/2020). [15] Gómez-Boix, A., Laperdrix, P., Baudry, B. (2018, [3] Alaca, F., Van Oorschot, P. C. (2016, December). April). Hiding in the crowd: an analysis of the ef- Device fingerprinting for augmenting web authen- fectiveness of browser fingerprinting at large scale. tication: classification and analysis of methods. In In Proceedings of the 2018 world wide web con- Proceedings of the 32nd Annual Conference on ference (pp. 309-318). Computer Security Applications (pp. 289-301). [16] Cao, Y., Li, S., Wijmans, E. (2017, March). (Cross- [4] Nagaraja, S., Shah, R. (2019, May). Clicktok: click ) Browser Fingerprinting via OS and Hardware fraud detection using traffic analysis. In Proceed- Level Features. In NDSS. ings of the 12th Conference on Security and Pri- [17] 2020. The Evolution of Hi-Def Fingerprint- vacy in Wireless and Mobile Networks (pp. 105- ing in Bot Mitigation - Distil Networks. 116). https://resources.distilnetworks.com/all-blog- [5] Mouawi, R., Elhajj, I.H., Chehab, A. et al. Crowd- posts/device-fingerprinting-solution-bot- sourcing for click fraud detection. EURASIP J. on mitigation Info. Security 2019, 11 (2019) [18] 2020. Device Tracking Add-on for [6] Dave, V., Guha, S., Zhang, Y. (2012, August). minFraud Services - MaxMind. Measuring and fingerprinting click-spam in ad net- https://dev.maxmind.com/minfraud/device/ works. In Proceedings of the ACM SIGCOMM [19] Bursztein, E., Malyshev, A., Pietraszek, T., 2012 conference on Applications, technologies, ar- Thomas, K. (2016, October). Picasso: Lightweight chitectures, and protocols for computer communi- device class fingerprinting for web clients. In Pro- cation (pp. 175-186). ceedings of the 6th Workshop on Security and Pri- vacy in Smartphones and Mobile Devices (pp. 93- [7] Vastel, A., Rudametkin, W., Rouvoy, R., Blanc, 102). X. (2020, February). FP-Crawlers: Studying the Resilience of Browser Fingerprinting to Block [20] Renjith, S. (2018). Detection of Fraudulent Sellers Crawlers. In NDSS Workshop on Measurements, in Online Marketplaces using Support Vector Ma- Attacks, and Defenses for the Web (MADWeb’20). chine Approach. arXiv preprint arXiv:1805.00464. [8] 2019. https://www.emarketer.com/content/digital- [21] Zhang, X., Han, Y., Xu, W., Wang, Q. (2019). ad-fraud-2019 HOBA: A novel feature engineering methodology for credit card fraud detection with a deep learning [9] Barker S. ,"Future Digital Advertising, Artificial architecture. Information Sciences. Intelligence & Advertising Fraud 2019-2023", Ju- niper Research, 2019 [22] Ludwig, S. A. (2019). Applying a neural network ensemble to intrusion detection. Journal of Arti- [10] Eckersley P., How unique is your ? ficial Intelligence and Soft Computing Research, in: Privacy Enhancing Technologies, 10th Interna- 9(3), 177-188. tional Symposium, PETS 2010, Berlin, Germany, July 21-23, 2010. Proceedings, 2010, pp. 1–18 [23] de Souza, G. B., da Silva Santos, D. F., Pires, R. G., Marana, A. N., Papa, J. P. (2019). Deep features [11] Laperdrix, P., Bielova, N., Baudry, B., Avoine, G. extraction for robust fingerprint spoofing attack de- (2019). Browser Fingerprinting: A survey. arXiv tection. Journal of Artificial Intelligence and Soft preprint arXiv:1905.01051. Computing Research, 9(1), 41-49. Marcin Gabryel, Konrad Grzanek, Yoichi Hayashi BROWSER FINGERPRINT CODING METHODS . . . 253 ternet. The new algorithms proposed in this paper [12] Kobusinska, A., Pawluczuk, K., Brzezinski, J. [24] Salakhutdinov, R., Hinton, G. (2009). Seman- in LSH Method to Find Similar Documents. Jour- may increase the level of detecting online fraudu- (2018). Big Data fingerprinting information analyt- tic hashing. International Journal of Approximate nal of Basic and Applied Research, 3, 466-472. lent activities being carried out by some users by ics for sustainability. Future Generation Computer Reasoning, 50(7), 969-978. Systems, 86, 1321-1337. [28] Goodfellow, Ian; Bengio, Yoshua; Courville, identifying them more accurately and efficiently. [25] 2020. FingerprintJS. Fraud detection API. Aaron (2016). Deep Learning. MIT Press [13] Mayer J R. 2009. Any person... a pamphleteer”: https://fingerprintjs.com/ Internet Anonymity in the Age of Web 2.0. Un- [26] Leskovec J., Rajaraman A., Ullman J.D.: Mining [29] Bengio Y., Learning deep architectures for ai References dergraduate Senior Thesis, Princeton University of Massive Datasets, Cambridge University Press, Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1– (2009). 2014 127, Jan. 2009. [1] Kristol D.M., HTTP cookies: Standards, privacy, and politics, ACM Trans. Internet Techn. 1 (2) [14] Steven E. and Arvind N. 2016. Online Tracking: A [27] Azgomi, H., Mahjur, A. (2013). A Solution for [30] Olson, D.L., Delen, D.: Advanced Data Mining (2001) 151–198. 1-million-site Measurement and Analysis. In Pro- Calculating the False Positive and False Negative Techniques, 1st edn. Springer, Heidelberg (2008). ceedings of the 2016 ACM SIGSAC Conference [2] Low C., Cookie law explained, 2016. on-line on Computer and Communications Security (CCS https://www.cookielaw.org/the-cookie-law/ (re- ’16). ACM, New York, NY, USA, 1388–1401. Marcin Gabryel earned his Ph.D. Prof. Yoichi Hayashi received the trieved:03/2020). degree in computer science at Cze- Dr. Eng. degree in systems engineer- [15] Gómez-Boix, A., Laperdrix, P., Baudry, B. (2018, stochowa University of Technology, ing from Tokyo University of Science, [3] Alaca, F., Van Oorschot, P. C. (2016, December). April). Hiding in the crowd: an analysis of the ef- Poland, in 2007. He is an assistant pro- Tokyo in 1984. In 1986, he joined the Device fingerprinting for augmenting web authen- fectiveness of browser fingerprinting at large scale. fessor in the Department of Computer Computer Science Department of Iba- tication: classification and analysis of methods. In In Proceedings of the 2018 world wide web con- Engineering at Częstochowa Universi- raki University, Japan, as an Assistant Proceedings of the 32nd Annual Conference on ference (pp. 309-318). ty of Technology. His research focuses Professor. Since 1996, he has been on developing new methods in compu- a Full Professor at Computer Science Computer Security Applications (pp. 289-301). [16] Cao, Y., Li, S., Wijmans, E. (2017, March). (Cross- tational intelligence and data mining. Department, Meiji University, Tokyo. [4] Nagaraja, S., Shah, R. (2019, May). Clicktok: click ) Browser Fingerprinting via OS and Hardware He has published over 50 research papers. His present re- He was a visiting professor at the University of Alabama at fraud detection using traffic analysis. In Proceed- Level Features. In NDSS. search interests include deep learning architectures and their Birmingham and University of Canterbury (New Zealand). applications in databases and security. He authored over 230 published computer science papers. His ings of the 12th Conference on Security and Pri- [17] 2020. The Evolution of Hi-Def Fingerprint- current research interests include explainable AI, deep learn- vacy in Wireless and Mobile Networks (pp. 105- ing in Bot Mitigation - Distil Networks. Konrad Grzanek, scientist, pro- ing, rule extraction, high-performance classifiers, and medi- 116). https://resources.distilnetworks.com/all-blog- grammer and lecturer. Graduate of cal informatics and medical imaging. He has been the Action posts/device-fingerprinting-solution-bot- the Technical University of Łódź Editor of Neural Networks and the Associate Editor of IEEE [5] Mouawi, R., Elhajj, I.H., Chehab, A. et al. Crowd- (FTIMS). Assistant professor at the Trans. Fuzzy Systems. He has served as Editor-in-Chief and mitigation sourcing for click fraud detection. EURASIP J. on Social Academy of Sciences. He holds Associate Editor, Editorial Board Member, Review Board Info. Security 2019, 11 (2019) [18] 2020. Device Tracking Add-on for a Ph.D. from Częstochowa University Member, Guest Editor and Reviewer in 60 academic jour- of Technology (CUT). His research nals, and was involved in the work for the European Research [6] Dave, V., Guha, S., Zhang, Y. (2012, August). minFraud Services - MaxMind. https://dev.maxmind.com/minfraud/device/ interests focus on programming lan- Council Executive Agency, National Sciences and Engineer- Measuring and fingerprinting click-spam in ad net- guages, software quality, software de- ing Research Council of Canada (NSERC). He has been works. In Proceedings of the ACM SIGCOMM [19] Bursztein, E., Malyshev, A., Pietraszek, T., velopment processes, and artificial intelligence, in particular a senior member of the IEEE since 2000. 2012 conference on Applications, technologies, ar- Thomas, K. (2016, October). Picasso: Lightweight on combining machine learning methods with static software chitectures, and protocols for computer communi- device class fingerprinting for web clients. In Pro- analysis. As a programmer, he is an advocate and promoter cation (pp. 175-186). ceedings of the 6th Workshop on Security and Pri- of the functional programming style. Author of over 30 pub- lications related to various problems of computer science and vacy in Smartphones and Mobile Devices (pp. 93- [7] Vastel, A., Rudametkin, W., Rouvoy, R., Blanc, software engineering. 102). X. (2020, February). FP-Crawlers: Studying the Resilience of Browser Fingerprinting to Block [20] Renjith, S. (2018). Detection of Fraudulent Sellers Crawlers. In NDSS Workshop on Measurements, in Online Marketplaces using Support Vector Ma- Attacks, and Defenses for the Web (MADWeb’20). chine Approach. arXiv preprint arXiv:1805.00464. [8] 2019. https://www.emarketer.com/content/digital- [21] Zhang, X., Han, Y., Xu, W., Wang, Q. (2019). ad-fraud-2019 HOBA: A novel feature engineering methodology for credit card fraud detection with a deep learning [9] Barker S. ,"Future Digital Advertising, Artificial architecture. Information Sciences. Intelligence & Advertising Fraud 2019-2023", Ju- niper Research, 2019 [22] Ludwig, S. A. (2019). Applying a neural network ensemble to intrusion detection. Journal of Arti- [10] Eckersley P., How unique is your web browser? ficial Intelligence and Soft Computing Research, in: Privacy Enhancing Technologies, 10th Interna- 9(3), 177-188. tional Symposium, PETS 2010, Berlin, Germany, July 21-23, 2010. Proceedings, 2010, pp. 1–18 [23] de Souza, G. B., da Silva Santos, D. F., Pires, R. G., Marana, A. N., Papa, J. P. (2019). Deep features [11] Laperdrix, P., Bielova, N., Baudry, B., Avoine, G. extraction for robust fingerprint spoofing attack de- (2019). Browser Fingerprinting: A survey. arXiv tection. Journal of Artificial Intelligence and Soft preprint arXiv:1905.01051. Computing Research, 9(1), 41-49.