Loss Functions for Clustering in Multi-instance Learning

Marek Dědič1,2, Tomáš Pevný3, Lukáš Bajer2, Martin Holeňa4

1 Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Trojanova 13, Prague, Czech Republic
2 Cisco Systems, Inc., Karlovo náměstí 10, Prague, Czech Republic
3 Faculty of Electrical Engineering, Czech Technical University in Prague, Karlovo náměstí 13, Prague, Czech Republic
4 Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Multi-instance learning belongs to one of recently fast developing areas of machine learning. It is a supervised learning method and this paper reports research into its unsupervised counterpart, multi-instance clustering. Whereas traditional clustering clusters points, multi-instance clustering clusters bags, i.e. multisets of points or of other kinds of objects. The paper focuses on the problem of loss functions for clustering. Three sophisticated loss functions used for clustering of points, contrastive predictive coding, triplet loss and magnet loss, are elaborated for multi-instance clustering. Finally, they are compared on 18 benchmark datasets, as well as on a real-world dataset.

1 Introduction

Multi-instance learning (MIL) belongs to recently fast developing areas of machine learning. Though it is a supervised learning method, this paper addresses its application to unsupervised learning – clustering. Whereas traditional clustering is a clustering of points, multi-instance clustering clusters multisets of points, also known as bags. Such a grouping of the points into bags is considered a property of the problem at hand and therefore sourced from the input data.

While there have been some previous attempts at multi-instance clustering, they use pairwise relations between all points in all bags, quickly becoming unwieldy and computationally infeasible. Our work uses multi-instance learning as a general toolkit for learning representations of bags and then clusters the bags using those representations. This builds on previous works by the authors and enables clustering of arbitrary data-structures by expressing them as hierarchies of multi-instance problems and then using the internal structure to better solve the problems at hand.

Multi-instance clustering is evaluated in Section 4 in the application domain of computer and network security. While this is only one of many possible applications of MIL due to its general expressive power for structured data, it is the domain of choice for the authors. Here, MIL is used to represent user activity as a bag of network connections for each user in a fixed time window. For clustering, this enables detection of compromised users based on their complex behaviours. Clustering opens a new window of opportunity here by e.g. grouping servers with similar behaviour together, allowing a human analyst to boost their work significantly by deciding about whole clusters of servers instead of each one individually.

A key prerequisite for the application of multi-instance learning to clustering is that loss functions for clustering, originally proposed for points, are adapted for bags. In this paper, such an adaptation is outlined for three sophisticated loss functions: contrastive predictive coding, triplet loss and magnet loss.

Of the three proposed methods, triplet loss and magnet loss utilize some labels as part of the training process and so are not truly unsupervised. As such, the authors expect them to outperform the method based on contrastive predictive coding, which on the other hand is the most promising and innovative approach from the theoretical point of view.

In the next section, basic properties of multi-instance learning and clustering are briefly introduced. The adaptation of the three considered loss functions is explained in Section 3. A comprehensive comparison of them is then presented in Section 4.

2 Multi-instance Learning and Clustering

Multi-instance learning (MIL) was first described by [7]. In its original form, it was developed and used for supervised learning. Some prior art also exists for unsupervised learning, such as [5, 31], however, it uses pairwise instance distances as a basis for clustering, which doesn’t properly utilize the inherent structure of the data and quickly becomes computationally infeasible.

The MIL paradigm is a type of representation learning on data which has some internal structure. Therefore, it views a sample as a bag (i.e. a multiset) of an arbitrary number of objects. The basic elements of MIL are samples from a space X and their corresponding labels from a space Y (a space of classes). Compared to usual supervised learning, MIL replaces individual instances with bags of instances from the space X such that every instance in X belongs to at least one bag from the bag-space B.

[7] provides an example of a multi-instance problem where each bag represents a key chain with some keys (instances). To solve the problem of finding which key opens a particular lock, a “proxy” MIL problem is presented – determining which key chain opens that particular lock. This line of thinking leads to the pivotal definition of the label of a bag as being positive iff the bag contains at least one positive instance. In later works such as [23], this interpretation of MIL is abandoned in favor of a more general one. An instance is no longer viewed as having a meaning in and of itself, but only in the context of its bag. The notion of an instance-level label is dropped because in this interpretation, the bag is the atomic unit of interest. To this end, the embedded-space paradigm, described in Section 2.2, is used.
2.1 A probabilistic formulation of multi-instance learning

A probabilistic way of describing multi-instance learning was first introduced in [23] and builds on the previous work [19].

For the space X, let (X, A) be a measurable space, where A is a σ-algebra on X. Let P^X denote the set of all probability measures on (X, A). A bag B is viewed as a random sample with replacement of a random variable on X governed by a particular probability distribution p_B ∈ P^X, that is

\[ B = \{ x_i \mid x_i \sim p_B,\; i \in \{1, \dots, m\} \} \quad \text{where } m \in \mathbb{N}. \tag{1} \]
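To make formula (1) concrete, the following minimal sketch draws one bag as m independent samples from an assumed instance distribution p_B; the choice of a two-dimensional Gaussian and the variable names are illustrative only and not taken from the paper's implementation.

```julia
using Distributions

# Hypothetical instance-level distribution p_B of one bag (2-D standard normal).
p_B = MvNormal(zeros(2), [1.0 0.0; 0.0 1.0])

# A bag is a multiset of m instances drawn i.i.d. from p_B, cf. equation (1).
m = 5
bag = [rand(p_B) for _ in 1:m]   # vector of m instance vectors
```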

2.2 Embedded-space paradigm for solving multi-instance problems

While it is possible to use several approaches to solving multi-instance problems, in this work, the embedded-space paradigm was used. For an overview of the other paradigms, see [6].

In the embedded-space paradigm, labels are only defined on the level of bags. In order for these bag labels to be learned, an embedding function of the form φ : B → X̄ must be defined, where X̄ is a latent space, which may or may not be identical to X. Using this function, each bag can be represented by an object φ(B) ∈ X̄, which makes it possible to use any off-the-shelf machine learning algorithm acting on X̄. Among the simplest embedding functions are, e.g. element-wise minimum, maximum, and mean. A more complicated embedding function may for example apply a neural network to each instance of the bag and subsequently pool the instances using one of the aforementioned functions.
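As a minimal illustration of the simplest embedding functions mentioned above, the sketch below embeds a bag, stored as a matrix whose columns are instances, by element-wise minimum, maximum and mean; the function names are illustrative only.

```julia
using Statistics

# B is a d × m matrix whose m columns are the instances of one bag.
embed_min(B)  = vec(minimum(B, dims=2))
embed_max(B)  = vec(maximum(B, dims=2))
embed_mean(B) = vec(mean(B, dims=2))

bag = rand(3, 5)    # a toy bag of 5 three-dimensional instances
embed_mean(bag)     # a single vector representing the bag in the latent space X̄
```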

Specifically to the application presented in Section 4, [22] present an approach to learning to classify HTTP traffic by utilizing sets of URLs. Neural networks are used to transform both instance-level and bag-level representations. The model is as follows: a neural network f_I : X_1 → X_2 is used to transform instances, followed by an aggregation function

\[ g : \mathcal{B}_{X_2} \to \bar{X}_1. \]

Finally, a second deep neural network f_B : X̄_1 → X̄_2 is used to transform the representation of each bag. Combining these functions gives the embedding function

\[ \varphi(B) = f_B\!\left( g\!\left( \{ f_I(x) \mid x \in B \} \right) \right). \]

The functions f_I and f_B are realized by a deep neural network using ReLU as its activation function. The aggregation g is realized as an element-wise mean or maximum of all the vectors.

The structure of the data is highly exploited using the embedded-space paradigm. The approach uses multiple layers of MIL nested in each other – the instances of a bag do not necessarily need to be feature vectors, but can be bags in themselves. The HTTP traffic of a particular client is therefore represented as a bag of all second-level domains the client has exchanged datagrams with. Each second-level domain is then represented as a bag of individual URLs which the client has connected to. Individual URLs are then split into 3 parts, domain, path and query, and each part is represented as a bag of tokens, which can then be broken down even further. In the end, the model consists of 5 nested MIL problems.

A benefit of MIL is also its easy use for explainability and interpretability. The authors of [22] present a way to extract indicators of compromise and explain the decision using the learned MIL model.
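The following is a minimal Flux sketch of the embedding φ(B) = f_B(g({f_I(x) | x ∈ B})) described above, assuming plain feature-vector instances and the concatenated mean/maximum aggregation later used in Section 4.2; the dimensions and the name embed_bag are illustrative only, and Mill.jl (used in Section 4.2) packages the same idea for arbitrarily nested bags.

```julia
using Flux, Statistics

d = 20                                                   # assumed instance dimension
f_I = Chain(Dense(d => 30, relu), Dense(30 => 30, relu)) # instance-level network
f_B = Chain(Dense(60 => 30, relu), Dense(30 => 30, relu))# bag-level network

# g: concatenated element-wise mean and maximum over the instances of a bag
g(Z) = vcat(mean(Z, dims=2), maximum(Z, dims=2))

# A bag is a d × m matrix whose columns are its m instances.
embed_bag(B) = vec(f_B(g(f_I(B))))                       # φ(B) = f_B(g({f_I(x) | x ∈ B}))
```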

2.3 Clustering

Clustering is a prime example of a problem typically associated with unsupervised learning. The problem at hand, however, is not one of clustering ordinary number vectors in a linear space. Instead, a clustering of objects represented by bags is explored, as that is the problem solved for the datasets introduced later in Section 4. While [28] present a clustering of bags using a modified Hausdorff distance on bags and [17] present a clustering using maximum mean discrepancy, a different approach to clustering of bags is used in this work. Among the main reasons for this choice is the prohibitively high computational complexity of the previously mentioned approaches on large datasets and the possibility of utilizing the representations previously introduced in [22].

An approach based on the embedded-space paradigm for MIL was chosen. In order to utilize the structure of the data, a MIL model is used to represent each bag in the latent space X̄. This presents the issue of how to train the embedding function φ, because its learning in standard MIL is supervised. The embedding function φ can actually be written as φ = φ(B, θ), where B ∈ B and θ are the parameters of the embedding, typically learned during the training phase. In the context of clustering, these parameters still need to be learned over some kind of training. This itself presents another challenge though – off-the-shelf clustering algorithms typically work in some constant Hilbert space, whereas the latent space needs to change over the learning period. As is the case for MIL itself, end-to-end learning is used. A cluster-loss function L_C : B_X̄ → ℝ is chosen and used to express the actual loss for the embedding model φ and its parameters θ as

\[ L(\varphi, \theta) = L_C\!\left( \{ \varphi(B, \theta) \mid B \in \mathcal{B} \} \right). \tag{2} \]

If the cluster-loss L_C is chosen correctly, minimizing L over the learning period will yield a latent space X̄ in which the bags are already naturally clustered according to the design of the cluster-loss function. Applying any off-the-shelf clustering algorithm on X̄ will then give good results. How to choose the cluster-loss function L_C is the focus of Section 3.
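A minimal sketch of this end-to-end setup is given below, assuming the f_I, f_B and embed_bag from the previous sketch and some cluster-loss function (e.g. one of those from Section 3); it trains full-batch for brevity, whereas Section 4.2 uses mini-batches, and it follows the standard implicit-parameter Flux training pattern with the ADAM optimizer mentioned there.

```julia
using Flux

# bags: a vector of d × mᵢ matrices; cluster_loss: maps a set of embeddings to ℝ.
function train_embedding!(bags, cluster_loss; steps = 100)
    θ = Flux.params(f_I, f_B)          # parameters θ of the embedding φ(·, θ)
    opt = ADAM()
    for _ in 1:steps
        gs = gradient(θ) do
            cluster_loss([embed_bag(B) for B in bags])   # L(φ, θ), cf. equation (2)
        end
        Flux.Optimise.update!(opt, θ, gs)
    end
end
```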

2.4 Clustering evaluation metrics

In our research, several clustering evaluation metrics have been used. The primary metrics are based on the Silhouette coefficient, the secondary ones on the kNN algorithm. The metrics are somewhat different to the ones traditionally used in evaluating clustering methods as all the datasets used in this work are originally classification datasets and therefore have classes available. Each class can then be viewed as a cluster target and the learned clustering evaluated against these targets.

The first two metrics measure the homogeneity of clusters (that is the property of instances of one cluster being close to one another) and their separation (that is the property of instances of different clusters being far apart), see [9]. To measure the non-homogeneity of a cluster, the average distance between items in a cluster is measured and averaged over all clusters, giving the following for items x_j in clusters C_i:

\[ \mathrm{nonhomo}(C_1, \dots, C_n) = \frac{1}{n} \sum_{i=1}^{n} \sum_{x_j \in C_i} \lVert x_j - \mu(C_i) \rVert, \quad \text{where} \quad \mu(C_i) = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j, \]

with |C_i| being the number of instances in the bag C_i. To measure the separation of clusters, the average distance between cluster centres was taken:

\[ \mathrm{sep}(C_1, \dots, C_n) = \frac{2}{n(n-1)} \sum_{\substack{i, j \in \hat{n} \\ i < j}} \lVert \mu(C_i) - \mu(C_j) \rVert. \]

The secondary metric is the accuracy of a kNN classifier built on the embedding: high-quality embeddings would need only a small amount of seed data points to reach relatively high accuracy. For this measure, the kNN classifier was run thrice and the results averaged to compensate for high dependence on the particular seed points chosen.
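The two measures above translate directly into plain Julia, assuming each cluster is given as a vector of embedded bags (points in X̄); the function names mirror the formulas and are not part of any library.

```julia
using Statistics, LinearAlgebra

centre(C) = mean(C)                              # μ(Cᵢ) for a cluster of vectors

# Non-homogeneity: distances to the cluster centre, summed per cluster and
# averaged over the n clusters.
nonhomo(clusters) =
    sum(sum(norm(x - centre(C)) for x in C) for C in clusters) / length(clusters)

# Separation: average distance between the centres of all pairs of clusters.
function sep(clusters)
    μ = centre.(clusters)
    n = length(μ)
    2 / (n * (n - 1)) * sum(norm(μ[i] - μ[j]) for i in 1:n for j in i+1:n)
end
```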

3 Investigated Loss Functions

In Section 2.3, a method for learning a representation φ was described. In order to learn the parameters of the representation, a clustering-loss function is required in the form

\[ L_C : \mathcal{B}_{\bar{X}} \to \mathbb{R}. \]

In this section, three ways of constructing such a clustering-loss function are explored. First, an unsupervised method constructing a clustering loss-function is presented, followed by two supervised methods.

3.1 Contrastive Predictive Coding

Contrastive predictive coding (CPC) is a technique first introduced by [21]. The method builds the model on the ideas from predictive coding [8]. CPC represents time series by modelling future data from the past. To this end, the model learns high-level representations of the data and discards noise. The further in the future the model predicts, the less shared information is available and thus the global structure needs to be inferred better.

The core concept taken from CPC is a loss function called InfoNCE in the prior art. Given a sequence of values x_t, an autoregressive model produces a context c_t from all x_i up to the point t. For a training set X, predicted future value x_{t+k}, the current context c_t, and a model f_k, the InfoNCE loss is defined as

\[ -\,\mathbb{E}_X \left[ \log f_k(x_{t+k}, c_t) - \log \sum_{x_j \in X} f_k(x_j, c_t) \right]. \]

Optimizing this value will result in f_k modelling the density ratio p(x_{t+k} | c_t) / p(x_{t+k}), see [21].
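A per-example sketch of the InfoNCE objective follows, assuming a model f_k that returns a positive score for a candidate future value given the context; in practice the expectation is estimated by averaging this quantity over a mini-batch, and the function name and arguments are illustrative.

```julia
# x_pos: the true future value x_{t+k}; c: the context c_t;
# candidates: the training set X, i.e. x_pos together with the negative samples.
function infonce(f_k, x_pos, c, candidates)
    -(log(f_k(x_pos, c)) - log(sum(f_k(x, c) for x in candidates)))
end

# Example of a hypothetical score on embedding vectors:
# f_k(x, c) = exp(dot(x, W_k * c)) for some learned matrix W_k.
```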

3.3 Magnet Loss

Magnet loss is an enhancement of the previously described triplet loss, first introduced by [26]. Whereas the triplet loss function essentially goes over all triplets of a data point, its target neighbour and a point from a different class and optimizes their distances, magnet loss maintains an explicit model of the distributions of different classes in the representation space. To capture the distributions, the model uses simple clustering models for each class. Differently to triplet loss, for which the target neighbourhood is set a priori and is based on similarity in the original input space, magnet loss uses similarity in the space of representations and updates it periodically. Magnet loss also pursues local separation instead of global – representations of different classes should be separated, but can be interleaved.

Let {(x_i, y_i)}_{i=1}^n be a training dataset where x_i ∈ ℝ^d and y_i are one of C discrete class labels. A general parameterized map f(·; Θ) embeds the inputs into a representation space:

\[ r_i = f(x_i; \Theta). \]

For each class c, K cluster assignments are obtained with the K-means++ algorithm [18, 2], where K ∈ ℕ is a hyper-parameter of the method:

\[ I_1^c, \dots, I_K^c = \operatorname*{arg\,min}_{I_1^c, \dots, I_K^c} \sum_{k=1}^{K} \sum_{r \in I_k^c} \left\lVert r - \frac{1}{|I_k^c|} \sum_{s \in I_k^c} s \right\rVert^2. \]

The centre of each cluster is denoted µ_k^c:

\[ \mu_k^c = \frac{1}{|I_k^c|} \sum_{r \in I_k^c} r. \]

For each input, the centre of the cluster in which its representation falls is denoted as

\[ \mu_i = \operatorname*{arg\,min}_{\mu_k^c} \lVert r_i - \mu_k^c \rVert \]

and the squared distance of its representation from the centre of the cluster is denoted as

\[ N_i = \lVert r_i - \mu_i \rVert^2. \]

The variance of the distance of the representations from their respective centres can then be calculated as

\[ \sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} N_i. \]

For r_i, the dissimilarity to all the inputs of different classes is calculated as

\[ M_i = \sum_{c \neq y_i} \sum_{k=1}^{K} \exp\!\left( -\frac{1}{2\sigma^2} \lVert r_i - \mu_k^c \rVert^2 \right). \]

The magnet loss is then

\[ L_{\mathrm{magnet}} = \frac{1}{n} \sum_{i=1}^{n} \max\!\left( 0,\; 1 + \log \frac{\exp\!\left( -\frac{1}{2\sigma^2} N_i - \alpha \right)}{M_i} \right), \tag{7} \]

where α ∈ ℝ is a hyper-parameter of the method. Since the method standardizes both of the terms of the innermost fraction by σ², α is the desired cluster separation gap expressed in the units of the variance.

The magnet loss is taken almost verbatim from the original article. The only change needed is a different way to obtain the representation space by taking

\[ r_i = \varphi(B_i). \]

In this usage, y_i is the class label assigned to the bag B_i (making magnet loss a supervised method) and n is the number of bags in the dataset.

Due to the choice of the hyper-parameter K being non-obvious for most problems, alternative clustering methods that would not need hyper-parameter tuning were explored. One such method, self-tuning spectral clustering [30], was implemented as a replacement for the K-means algorithm used in the original method. Section 4.3 presents a comparison of the performance of these two methods.

Since magnet loss is an enhancement of triplet loss, it is reasonable to assume it would perform the same or better than triplet loss. One drawback of magnet loss is the higher number of hyper-parameters which need to be tuned in order for the method to work correctly. This problem is partially solved by the introduction of self-tuning spectral clustering [30], however, that is outside the scope of this paper.
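A sketch of evaluating the magnet loss for a set of bag representations is given below, assuming the per-class cluster centres have already been obtained (e.g. with K-means++ via a clustering package); it follows equation (7) and the definitions above, and the function itself is illustrative rather than the authors' implementation.

```julia
using LinearAlgebra

# helper: the element of xs minimizing f
argminby(f, xs) = xs[argmin(map(f, xs))]

# r: vector of representations rᵢ = φ(Bᵢ); y: class label of each bag;
# centres: Dict mapping each class c to the vector of its K cluster centres.
function magnet_loss(r, y, centres; α = 0.5)
    n = length(r)
    μ = [argminby(c -> norm(r[i] - c), centres[y[i]]) for i in 1:n]  # own-class centre
    N = [norm(r[i] - μ[i])^2 for i in 1:n]                           # Nᵢ
    σ2 = sum(N) / (n - 1)                                            # σ²
    M = [sum(exp(-norm(r[i] - c)^2 / (2 * σ2))
             for (cls, cs) in centres if cls != y[i] for c in cs)    # Mᵢ
         for i in 1:n]
    sum(max(0, 1 + log(exp(-N[i] / (2 * σ2) - α) / M[i])) for i in 1:n) / n
end
```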

4 Experimental Comparison

The three loss functions presented in Section 3 were experimentally evaluated on 18 publicly available datasets and on a corporate dataset from the domain of computer security. The 2 metrics presented in Section 2.4 were used, i.e. the Silhouette coefficient and the accuracy of a kNN classifier built on the embedding. Both of these metrics were tracked over the learning period to also evaluate how the models are learning in addition to their final performance.
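A brute-force sketch of the kNN-accuracy metric on the learned embeddings follows, assuming a small labelled set of seed embeddings; a dedicated package such as NearestNeighbors.jl could replace the exhaustive search, and the names used here are illustrative.

```julia
using LinearAlgebra, Statistics, StatsBase

# Predict the label of one embedding by majority vote among its k nearest
# labelled seed embeddings (exhaustive search, for illustration only).
function knn_predict(x, seeds, seed_labels; k = 3)
    order = sortperm([norm(x - s) for s in seeds])[1:k]
    mode(seed_labels[order])
end

knn_accuracy(X, labels, seeds, seed_labels; k = 3) =
    mean(knn_predict(x, seeds, seed_labels; k = k) == y for (x, y) in zip(X, labels))
```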

4.1 Datasets

The evaluation was done on a set of 18 publicly available datasets, listed in Table 1. All the datasets as used were also made public².

²https://github.com/marekdedic/MilDatasets.jl

Source    Datasets
[4]       BrownCreeper, WinterWren
[1]       Elephant, Fox, Tiger
[7]       Musk1, Musk2
[27]      Mutagenesis1, Mutagenesis2
[33]      Newsgroups1, Newsgroups2, Newsgroups3
[24, 25]  Protein
[15]      UCSBBreastCancer
[32]      Web1, Web2, Web3, Web4

Table 1: The 18 publicly available datasets used.

The models were also evaluated on a proprietary dataset provided by Cisco Cognitive Intelligence, consisting of records of network connections from clients (e.g. user computers or mobile devices) to some on-line services. The dataset represents HTTP traffic of more than 100 companies. Two datasets were collected, each spanning 1 day of traffic. The training data was traffic from 2019-11-18, while the data used for testing was collected the following day, 2019-11-19. For each connection, a proprietary classification system based on [14] provided labels, classifying the connections either as legitimate or malicious (connected to malware activity). The data was sampled to include 90 % of negative bags and 10 % of positive bags. For each connection, 20 connection features were used, as well as a MIL model of the server URL, which is visualized in Figure 1.

4.2 Experimental Design

The models for evaluation were implemented in the Julia programming language [3] using the Flux.jl framework for machine learning [13] and the Mill.jl framework for multi-instance learning³.

³https://github.com/pevnak/Mill.jl

The particular architecture of the models was chosen based upon previous experience with using neural networks in a multi-instance setting [22]. Several architectures were tested and the best one selected for the experiments. The embedding φ was realised by a MIL neural network consisting of 2 per-instance layers of 30 neurons, followed by aggregation formed by concatenating element-wise mean and element-wise maximum of all instances in a bag, followed by 2 per-bag layers of 30 neurons. All the neurons used the ReLU activation function [12]. Layer weights were initialized using Glorot initialization [11], bias vectors were initialized to zeros. ADAM [16] was used as the optimization method.

For each of the datasets, 80 % of bags were randomly chosen as the training data, with the rest being testing data. The models were trained using 100 mini-batches of size 50.

In order to provide some baseline against which the models could be compared (as there is no prior art for this problem), two other models were introduced. A non-machine-learning model was introduced as a baseline, which all models should surpass. This model implements the embedding φ as an element-wise mean of all instances of a bag. This is a natural embedding of a bag as its expected value. This model is referred to as the mean model in the rest of this work. A classification model has been introduced as a target to beat. This model was realised by a MIL neural network identical to the previously described one, with a final layer of 2 output neurons with an identity activation function added. The accuracy of the model has been evaluated by selecting the optimal threshold on its output. This model mirrors the model used in the proposed methods, but replaces the clustering with simple classification. This model is referred to as the classification model in the rest of this work.

Some of the three proposed clustering-losses have some hyper-parameters which need to be tuned. A range of values was tried for each hyper-parameter in order to select the best configuration for each method on each of the datasets. For L_triplet, the values c ∈ {0.01, 0.1, 1, 10, 100} have been tested. For L_magnet, the values K ∈ {2, 3, 8, 16}, α ∈ {0, 0.1, 0.5} and the cluster index update period in {5, 10, 25, 70} have been tested.

4.3 Comparison of Results

Table 2 lists the accuracy of a kNN classifier built on the embedding for all of the methods and all of the datasets. Figure 2 shows the value of the Silhouette coefficient and the accuracy on 3 selected datasets. As can be seen, CPC is the worst performing of the methods overall, and, more significantly, shows no clear improvement over the learning period for both the Silhouette coefficient and the kNN classifier accuracy. Between triplet loss and magnet loss, magnet loss is somewhat better in the classification accuracy, whereas triplet loss is somewhat better in the Silhouette coefficient, indicating better separation of individual clusters. Of note is also the differing behaviour for various datasets, for example the clear dominance of triplet loss w.r.t. the Silhouette coefficient on the Web3 dataset. This shows that none of the methods is a clear winner in all scenarios and the best one needs to be carefully selected.

4.4 Statistical analysis of the results

The null hypothesis that all three methods are the same as measured by the accuracy can be rejected at a 5 % level of significance by the Friedman two-way analysis of variance by ranks [10]. Triplet and magnet losses are better than CPC at a 5 % level of significance by the post-hoc Nemenyi pairwise test [20]. However, the null hypothesis that magnet loss and triplet loss are the same cannot be rejected at the 5 % significance level by this test. This further supports the differing results between these two methods as mentioned in the previous subsection.
racy, whereas triplet loss is somewhat better in the Silhou- The particular architecture of the models was chosen ette coefficient, indicating better separation of individual based upon previous experience with using neural net- clusters. Of note is also the differing behaviour for var- works in multi-instance setting [22]. Several architec- ious datasets, for example the clear dominance of triplet tures were tested and the best one selected for the exper- loss w.r.t. the Silhouette coefficient on the Web3 dataset. iments. The embedding φ was realised by a MIL neural This shows that none of the methods is a clear winner in all network consisting of 2 per-instance layers of 30 neurons, scenarios and the best one needs to be carefully selected. followed by aggregation formed by concatenating element- wise mean and element-wise maximum of all instances in a bag, followed by 2 per-bag layers of 30 neurons. All the 4.4 of the results neurons used the ReLU activation function [12]. Layer The null hypothesis that all three methods are the same as weights were initialized using Glorot initialization [11], measured by the accuracy can be rejected at a 5 % level bias vectors were initialized to zeros. ADAM [16] was used of significance by the Friedman two-way of vari- as the optimization method. ance by ranks [10]. Triplet and magnet losses are better For each of the datasets, 80 % of bags were randomly than CPC at a 5 % level of significance by the post-hoc chosen as the training data, with the rest being testing data. Nemenyi pairwise test [20]. However, the null hypothesis The models were trained using 100 mini-batches of size of that magnet loss and triplet loss are the same cannot be re- 50. jected at the 5 % significance level by this test. This further In order to provide some baseline against which the supports the differing results between these two methods as models could be compared (as there is no prior art for mentioned in the previous subsection. this problem), two other models were introduced. A non-machine-learning model was introduced as a baseline, which all models should surpass. This model implements 4.5 HTTP dataset the embedding φ as an element-wise mean of all instances Table 3 compares the three methods on the proprietary 3https://github.com/pevnak/Mill.jl dataset from Cisco Cognitive Intelligence. Here, magnet designed token features learned features of URL parts learned features of an URL learned features of a SLD

[Figure 1: Hierarchical model of a URL. The example URL http://evil.com/path/file?key=val&get=exploit.js is split into its domain (evil.com), path (path/file) and query (key=val&get=exploit.js) parts, each represented as a bag of tokens with designed token features. These are aggregated by the models φD, φP and φQ into learned features of the URL parts, by φU into learned features of the URL, and by φSLD into learned features of a second-level domain (SLD). The vector c1, . . . , cm represents the connection features.]

Dataset             CPC    Triplet loss  Magnet loss
BrownCreeper        0.900  0.882         0.900
Elephant            0.575  0.900         0.825
Fox                 0.400  0.675         0.725
Musk1               0.667  0.889         0.889
Musk2               0.750  0.950         0.950
Mutagenesis1        0.658  0.816         0.912
Mutagenesis2        0.708  1.000         1.000
Newsgroups1         0.533  0.950         1.000
Newsgroups2         0.600  0.900         0.950
Newsgroups3         0.683  0.700         0.850
Protein             0.820  0.923         0.897
Tiger               0.642  0.825         0.825
UCSBBreastCancer    0.333  0.667         1.000
Web1                0.667  0.733         0.800
Web2                0.689  0.800         0.733
Web3                0.733  0.800         1.000
Web4                0.622  0.800         0.930
WinterWren          0.936  0.955         0.936

Table 2: The accuracy of the final models for the publicly available datasets.

Variant     CPC    Triplet loss  Magnet loss
2 classes   0.920  0.910         0.930
20 classes  0.893  0.868         0.904

Table 3: The accuracy of the final models for the corporate datasets.

5 Conclusion

In our work, the underexplored research area of multi-instance clustering was investigated. Three methods for clustering of bags were introduced, of which one is unsupervised (CPC) and two are supervised (Triplet loss and Magnet loss). For each of the methods, the prior art it builds on was presented, along with its modification for the purposes of multi-instance clustering. All three methods were theoretically and experimentally evaluated and compared. The experiments were conducted first on publicly available datasets in a reproducible fashion. Following that, a corporate dataset of network security data was used as it is the intended application domain for this work.

Comparing the methods on the publicly available datasets shows the method based on contrastive predictive coding to perform the worst, with the other two having no statistically significant difference between them. Similar results were also obtained on the corporate dataset of HTTP traffic, albeit none of the results were as good as anticipated. The method based on contrastive predictive coding performed poorly on both the publicly available datasets and the corporate one, however, the comparison might not be fair as the CPC method is unsupervised, whereas the other two can utilize labels on the training data, giving them a strong advantage.

The initial expectation of CPC being outperformed proved to be true. Even as such, the result is interesting and has a value of its own in providing a baseline and a comparison for these and future methods.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-21409S and partially supported by the GAČR grant 18-18080S.

[Figure 2: Accuracy and the Silhouette coefficient on three selected datasets. Panels: (a) Musk2, Silhouette coefficient; (b) Musk2, accuracy; (c) UCSBBreastCancer, Silhouette coefficient; (d) UCSBBreastCancer, accuracy; (e) Web3, Silhouette coefficient; (f) Web3, accuracy. Each panel plots the metric against the learning step for CPC, triplet loss and magnet loss, with the mean model and the classification model as baselines.]

References

[1] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. “Support Vector Machines for Multiple-Instance Learning”. In: Advances in Neural Information Processing Systems 15 (2002), pp. 561–568.
[2] David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. Technical Report 2006-13. Stanford InfoLab, June 2006, pp. 1027–1035. url: http://ilpubs.stanford.edu:8090/778/.
[3] J. Bezanson et al. “Julia: A Fresh Approach to Numerical Computing”. In: SIAM Review 59.1 (Jan. 2017), pp. 65–98. doi: 10.1137/141000671.
[4] Forrest Briggs, Xiaoli Fern, and Raviv Raich. “Rank-loss support instance machines for MIML instance annotation”. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Aug. 2012. doi: 10.1145/2339530.2339616.
[5] Ying Chen and Ou Wu. “Contextual Hausdorff dissimilarity for multi-instance clustering”. In: Fuzzy Systems and Knowledge Discovery (FSKD). Sichuan, China: IEEE Computer Society Press, May 2012, pp. 870–873. doi: 10.1109/FSKD.2012.6233889.
[6] Marek Dědič. “Optimalization of distances for multi-instance clustering”. MA thesis. Prague: Czech Technical University in Prague, Jan. 2020.
[7] Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. “Solving the multiple instance problem with axis-parallel rectangles”. In: Artificial Intelligence 89.1 (Jan. 1997), pp. 31–71. doi: 10.1016/S0004-3702(96)00034-3.
[8] P. Elias. “Predictive coding–I”. In: IRE Transactions on Information Theory 1.1 (Mar. 1955), pp. 16–24. doi: 10.1109/TIT.1955.1055126.
[9] Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. 4th ed. Taylor & Francis, 2001.
[10] Milton Friedman. “The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance”. In: Journal of the American Statistical Association 32.200 (Dec. 1937), pp. 675–701. doi: 10.1080/01621459.1937.10503522.
[11] Xavier Glorot and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Mar. 2010, pp. 249–256. url: http://proceedings.mlr.press/v9/glorot10a.html.
[12] Richard H. R. Hahnloser et al. “Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit”. In: Nature 405.6789 (June 2000), pp. 947–951. doi: 10.1038/35016072.
[13] Mike Innes. “Flux: Elegant machine learning with Julia”. In: Journal of Open Source Software (2018). doi: 10.21105/joss.00602.
[14] Ján Jusko. “Graph-based Detection of Malicious Network Communities”. PhD thesis. Prague: Czech Technical University in Prague, 2017. url: https://dspace.cvut.cz/handle/10467/73702.
[15] Melih Kandemir, Chong Zhang, and Fred A. Hamprecht. “Empowering Multiple Instance Histopathology Cancer Diagnosis by Cell Graphs”. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2014, pp. 228–235. doi: 10.1007/978-3-319-10470-6_29.
[16] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: arXiv:1412.6980 [cs] (Jan. 2017). url: http://arxiv.org/abs/1412.6980.
[17] J. Kohout and T. Pevný. “Network Traffic Fingerprinting Based on Approximated Kernel Two-Sample Test”. In: IEEE Transactions on Information Forensics and Security 13.3 (Mar. 2018), pp. 788–801. doi: 10.1109/TIFS.2017.2768018.
[18] J. MacQueen. “Some methods for classification and analysis of multivariate observations”. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. Berkeley, California: University of California Press, 1967, pp. 281–297.
[19] Krikamol Muandet et al. “Learning from Distributions via Support Measure Machines”. In: Advances in Neural Information Processing Systems. 2012, pp. 10–18. url: http://papers.nips.cc/paper/4825-learning-from-distributions-via-support-measure-machines.
[20] PB Nemenyi. “Distribution-free multiple comparisons (doctoral dissertation, Princeton University, 1963)”. In: Dissertation Abstracts International 25.2 (1963), p. 1233.
[21] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. “Representation Learning with Contrastive Predictive Coding”. In: arXiv:1807.03748 [cs, stat] (Jan. 2019). url: http://arxiv.org/abs/1807.03748.
[22] Tomas Pevny and Marek Dedic. “Nested Multiple Instance Learning in Modelling of HTTP network traffic”. In: arXiv:2002.04059 [cs] (Feb. 2020). url: http://arxiv.org/abs/2002.04059.
[23] Tomáš Pevný and Petr Somol. “Using Neural Network Formalism to Solve Multiple-Instance Problems”. In: Advances in Neural Networks – ISNN 2017. Springer, Cham, June 2017, pp. 135–142. doi: 10.1007/978-3-319-59072-1_17.
[24] Soumya Ray and Mark Craven. “Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text”. In: BMC Bioinformatics 6.1 (May 2005), S18. doi: 10.1186/1471-2105-6-S1-S18.
[25] Soumya Ray and Mark Craven. “Supervised versus multiple instance learning: An empirical comparison”. In: Proceedings of the 22nd International Conference on Machine Learning. 2005, pp. 697–704. doi: 10.1145/1102351.1102439.
[26] Oren Rippel et al. “Metric Learning with Adaptive Density Discrimination”. In: arXiv:1511.05939 [cs, stat] (Mar. 2016). url: http://arxiv.org/abs/1511.05939.
[27] A. Srinivasan, Stephen Muggleton, and Robert King. “Comparing the use of background knowledge by inductive logic programming systems”. In: Proceedings of the 5th International Workshop on Inductive Logic Programming. 1995, pp. 199–230.
[28] Jun Wang and Jean-Daniel Zucker. “Solving Multiple-Instance Problem: A Lazy Learning Approach”. In: Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann, 2000, pp. 1119–1125. url: http://cogprints.org/2124/.
[29] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. “Distance Metric Learning for Large Margin Nearest Neighbor Classification”. In: Advances in Neural Information Processing Systems 18. MIT Press, 2006, pp. 1473–1480.
[30] Lihi Zelnik-Manor and Pietro Perona. “Self-Tuning Spectral Clustering”. In: Advances in Neural Information Processing Systems 17. MIT Press, 2005, pp. 1601–1608.
[31] Min-Ling Zhang and Zhi-Hua Zhou. “Multi-instance clustering with applications to multi-instance prediction”. In: Applied Intelligence 31.1 (Aug. 2009), pp. 47–68. doi: 10.1007/s10489-007-0111-x.
[32] Zhi-Hua Zhou, Kai Jiang, and Ming Li. “Multi-Instance Learning Based Web Mining”. In: Applied Intelligence 22.2 (Mar. 2005), pp. 135–147. doi: 10.1007/s10489-005-5602-z.
[33] Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li. “Multi-Instance Learning by Treating Instances As Non-I.I.D. Samples”. In: Proceedings of the 26th International Conference on Machine Learning, ICML 2009. doi: 10.1145/1553374.1553534.