Measuring Similarity

Dr. Arijit Laha (Senior Member: ACM, IEEE; Email: [email protected]; Homepage: https://sites.google.com/site/arijitlaha/)

Introduction

(Pedestrian Data Science Series, Article #1. The motivation for writing this white paper, hopefully the first of a series, depending on its reception and utilization, is to introduce our young data scientists to some subtleties of the art, and of course science, they have taken up to practice.)

The ability to recognize objects and their relationships is at the core of intelligent behavior. This, in turn, depends on one's ability to perceive similarity or dissimilarity between objects, be they physical or abstract ones. Hence, if we are interested in making computers behave with any degree of intelligence, we have to write programs that can work with relevant representations of objects and with means to compute their similarities or lack thereof, i.e., dissimilarity (obviously, they are two faces of the same coin).

Let us examine the two emphasized phrases of the paragraph above. They are the most crucial and fundamental issues in building any computer-based system showing an iota of intelligent behavior. (If you represent a 'chair', to your computer program for identifying objects to sit on, as just something that has four legs, then the computer is very likely to advise you to sit on your house-puppy!) Essentially,

• we need to work with a representation of objects of interest containing adequate information relevant for the problem at hand. For example, the information needed about a chair for distinguishing it from a table is quite different from the information needed if we want to distinguish between an easy chair and a work chair;

• once we have a proper representation of the objects, we need to incorporate it into a relevant mathematical framework which will enable us to compute a measure of similarity or dissimilarity between objects.

There are quite a number of available measures for each type of commonly used representation; we can find a good listing here. Unfortunately, the measures within each type have many differences, some obvious and some subtle. Thus, we need to choose and evaluate them in the context of our problems. A nice survey of many of these measures is available on the internet. Advanced readers can (and are urged to) directly go there and read it and its likes. In this white paper we shall discuss objects represented by a set of their attributes and the computation of distances between pairs of them as points in a feature space. The computed distance is interpreted as a measure of dissimilarity, and thus the inverse of similarity - in the sense that the less the distance, the more the similarity and vice-versa. We shall also find a measure, the cosine similarity, that can be directly interpreted as similarity. Don't worry my friends, I shall take you there very gently.

Working with Object data

As far as organization of the data is concerned, there are three major categories:

• Object data: Objects are represented by an ordered set of their attributes/characteristics/features. We shall use the term "feature" henceforth. These are the data which we store and perceive as one record/instance per row in a file/table (in traditional statistics these are also known as cross-sectional data);

• Sequence data: Data elements correspond to a particular order, temporal (time series) or spatial (letters/words appearing in a text); and

• Relational/relationship data: The data captures various relationships among the objects - we often call them "graph data".

Remember, there can be many situations while solving real-life problems when several of these types of data may be used, as well as converted into one another. Nevertheless, here we concentrate on object data only.

Attributes

Features in object data can be numerical, Boolean as well as categorical. Again, let us consider some features of a chair:

• Height of the chair: numerical;

• Area of the seat: numerical;

• Has armrest: Boolean - either has or not (1/0)

• Reclinable: Boolean - yes/no

• Color of the chair: categorical

• Number of legs: ? - can be 4, 3, 1 (not 2, I guess). Is it meaningful to consider it numeric, so that we can do all kinds of mathematical and/or statistical jugglery, or should we consider it categorical - 4-legged, etc.?

For example, my chair is 22 inch in height, with seat area 400 sq. inch and armrests, is fixed-back, red, with 4 legs. Hence, MyChair = (22, 400, 1, 0, 4L), a 5-tuple, is the representation of my chair. In statistical literature such data are often called multivariate data.

Numerical features allow us to apply a vast array of mathematical and statistical tools in order to work with them. Thus, in most cases we try to frame the problems in terms of numerical features (even when the raw data is something different, such as text). In the majority of cases, Boolean features can also be treated as numerical with values 0 and 1. Categorical features are slightly more difficult to deal with within the same framework, since we cannot work out a concept of similarity among their values; e.g., it is somewhat absurd to say that the color green is more similar to black than to blue or vice-versa. Nevertheless, we shall see later that categorical features can also be transformed. Thus, without restricting applicability, hereafter we shall consider all the features of an object to be numerical. Now, let us set up some basic nomenclature.

• Let X denote a data set with n instances of object data;

• Let $x_i = (x_{i1}, x_{i2}, \cdots, x_{ip}) \in X$ be the $i$-th object with $p$ attributes/features with values $x_{i1}, x_{i2}, \cdots, x_{ip}$ respectively (notice the order of subscripts: the first subscript corresponds to an object, the second to an attribute); and

• To keep things simple, let us posit that $\forall x_i \in X$ the types of features available are the same. In other words, each object instance is represented by a $p$-tuple consisting of the values of the same set of attributes in the same order.

Figure 1: The Iris data, partial listing, sorted by sepal width. (Courtesy: Wikipedia)

To understand this properly, let us consider one of the most well known (and classic!) data sets, the Iris data set. The data set comprises 50 sample observations from each of three species of Iris flowers. The measured features are the lengths and widths of the sepals and petals of the flowers. Part of the data is shown in Figure 1. Each row in the data represents measurements of one flower, with the species of the flower noted in the fifth column. Thus, this data has four numerical features, i.e., p = 4, and 3 × 50 = 150 instances (rows), i.e., n = 150. The inclusion of 'species' in the fifth column makes it a labeled data set, where each instance is labeled with a category or class. For the time being we shall ignore these labels in our discussion.
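For readers who like to follow along in code, here is a minimal sketch (assuming Python with scikit-learn installed; the variable names are mine) that loads the Iris data and confirms the values of n and p:

```python
# A minimal sketch, assuming scikit-learn is available. It loads the classic
# Iris data set and confirms n = 150 instances and p = 4 numerical features,
# ignoring the class labels for now.
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data               # the n x p matrix of feature values
print(X.shape)              # (150, 4): n = 150 objects, p = 4 features
print(iris.feature_names)   # sepal/petal lengths and widths (cm)
```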

It is very important to remember that an object can potentially have many (often an extremely large number of) attributes/features. For example, consider the Iris flowers. Other than the four collected in the Iris data set, there are many other possible attributes, ranging from simple ones like color to very complex ones like the genome sequence (think of others in between). It is clear that we cannot use all possible attributes. So, which ones do we use? This is one of the big questions almost always faced by data scientists in real-world problems (in research labs one has the luxury of using well-known data sets). The problem is essentially that of determining the relevance and accessibility/availability of the features with respect to the problem at hand. Addressing this issue constitutes big and vital parts of data science known as feature selection and feature extraction, collectively called feature engineering. While feature engineering is very much problem-specific (and thus out of the current scope), let us look into some of the guiding principles here.

Relevance. Usually we would like to use discerning or discriminating features. These are the features which can be useful for computationally identifying similar and dissimilar objects in the sense relevant for the problem. For example, consider the species identification problem for Iris flowers. The attribute 'color' is not very discriminating (all Irises are violet), but genome data can be very useful. On the other hand, if the problem is to distinguish between Irises and Roses, color can be one of the very useful features. Let us try to understand the above in simpler terms. Again, consider the problem of Iris species identification. If the objects, i.e., Iris flowers, are represented with a set of discriminating features, we can expect that, in general, the similarity computed using the features is greater between two flowers from the same species than otherwise. Let us add another feature to the representation. This will affect the similarities. Now, if the similarity between flowers from the same species is increased and/or the similarity between flowers from different species is decreased after the introduction of the new feature, then the new feature is a discriminating one. The existence of non-discriminating/irrelevant/redundant features in the data evidently does not contribute to the solutions, but can often worsen the situation. Their presence can be thought of as noise in the data, which essentially decreases the signal-to-noise ratio in the information content of the data. This makes the task of extracting knowledge/information (signal) from the data more difficult.

Pragmatics. Even after we have some idea of which features are useful, we have to consider the cost as well as the feasibility of collecting and/or processing them. Let us again consider the problem of Iris species identification. While the genome data can be extremely good at discriminating species, it is costly to collect as well as to process computationally. So, if we have the freedom to dictate the data collection strategy, we shall have to consider the cost-benefit tradeoff for selecting the features to be collected/measured. In a lot of real-world work we don't have the above freedom. Instead we have to work with already available data, which is mostly collected for some other purpose (transactional, operational, etc.), with no plan of subjecting it to data science techniques. Such scenarios are known as data repurposing. Here our challenge is to identify the relevant features in the data for use in solving the data science problem. In the worst case, we might face a situation where there are not enough useful features in the data for solving the problem to a satisfactory level.

Objects in feature space

The Euclidean space

Till now we have been discussing a single object and its representation. Now, let us think about working with a number of objects (at the least, similarity refers to a pair of objects!). Here the concepts drawn from the study of vector spaces become the most invaluable tool. To understand how to use them, let us start by recalling our school days when we learnt what is called Euclidean or plane geometry. There we studied properties of various geometric objects, like straight lines, parallel lines, triangles, circles, etc., based on a number of axioms or postulates. But, think carefully: at that time we did not have any business with coordinates of the points. That came later under the heading of co-ordinate geometry (a bit of nasty stuff for most of us). Well, the person responsible for complicating our innocent(?) school days was René Descartes, inventor of, among a lot of other stuff, the Cartesian coordinate system shown in Figure 2. As depicted in Figure 2, the Cartesian coordinate system is a rectilinear coordinate system, whose axes are mutually perpendicular (in formal mathematical terms, orthogonal) and linear (i.e., straight lines). The introduction of the Cartesian coordinate system endows each point in the Euclidean plane with a distinct address, called its coordinate, a pair of numeric values. This created a bridge between geometry and algebra, the marriage producing the offspring called analytic geometry. One of the minor consequences of this is that we can plot and thus visualize algebraic equations, e.g., $x^2 + y^2 = 1$, a circle centered at the origin with radius 1!

Figure 2: Left: Illustration of a Cartesian coordinate plane. Four points are marked and labeled with their coordinates: (2,3) in green, (−3,1) in red, (−1.5,−2.5) in blue, and the origin (0,0) in purple. Right: Cartesian coordinate system in 3 dimensions. Each point is described with a 3-tuple of coordinate values corresponding to its position along the x, y, and z axes respectively. (Courtesy: Wikipedia)

A Euclidean plane equipped with a Cartesian coordinate system becomes a Euclidean space (notice that I have used 'a', not 'the', in the previous sentence). Here the term "space" is not a common-sense or colloquial term (as in 'a spacious home'); rather, it refers to an algebraic-geometric entity. Yes, we are venturing into slightly dangerous territory, but have courage, we need not go too deep into it. For our purpose it will suffice to understand that a "space" is a set of distinctly identifiable (by their coordinates) points with certain properties holding among them. These properties are the ones which essentially distinguish one type of space from another. The points to remember:

• There are many types of spaces other than Euclidean spaces (yes, there are quite a few Euclidean spaces, even with non-Cartesian coordinates), called (obviously!) non-Euclidean spaces; and

• There are many coordinate systems other than the Cartesian, the non-Cartesian ones.

Thankfully, here we shall work only with real-valued (i.e., the coordinate values are all real numbers) Euclidean space and its generalizations, equipped with Cartesian coordinate systems. Let us have an intuitive understanding of how such a space relates to our object data. Let us, for the time being, consider that the Iris data has only two features, namely, petal length and petal width. If we conceive the petal length as the x value and the petal width as the y value, then we can associate an Iris flower with a point p = (x, y) in the two-dimensional space.

Figure 3: Two features, petal length and petal width, of the Iris data (colors indicate class labels).

If we continue to do the same with all the flowers in the data set, we end up with a plot like Figure 3. This kind of plotting is bread and butter for us, we call them, yes, you already know, the scatter plots. Here two attribute values are mapped along two axes of the Cartesian coordinate system, so that every object, i.e., flower has its own representative point in this Euclidean space.
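As a small illustrative sketch (assuming matplotlib and scikit-learn are available; the column indices follow scikit-learn's feature ordering), a plot like Figure 3 can be produced as follows:

```python
# A small sketch reproducing a plot like Figure 3: each flower becomes a point
# whose coordinates are its petal length (x) and petal width (y).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # third feature: petal length (cm)
petal_width = iris.data[:, 3]    # fourth feature: petal width (cm)

plt.scatter(petal_length, petal_width, c=iris.target)  # colors = class labels
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
```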

Feature space

We typically use scatter plots for visualization of the data. However, by representing objects (as sets of feature values) as points in a Euclidean space we also achieve something very interesting and extremely important. Essentially, we transform, or abstract, our objects into algebraic entities, i.e., points in a Euclidean space. Now we can exploit the properties of Euclidean space to measure similarities/dissimilarities between pairs of objects. Thus, our objects become entities of a very rich mathematical framework and become amenable to study using tools from a number of mathematical disciplines. In data science, and particularly machine learning terminology, we call such a space a feature space. A point in a feature space representing an object described in terms of an ordered set of numerical attribute values, $x_i = (x_{i1}, x_{i2}, \cdots, x_{in}) \in X$, is called a feature vector. The dimension $n$ of a feature space is the same as the number of features used to describe the objects in the data set (note that hereafter $n$ denotes the dimension of the feature space rather than the number of instances). For example, in the Iris data set the flowers are described with four features. Thus, a feature space for the Iris data will have 4 dimensions (i.e., $n = 4$).

Gathering tools and intuitions

To work with the objects in feature space, we need to get our hands on relevant mathematical tools. Also, our physical perception is limited to three spatial dimensions (and of course, one for time). Thus, we can visually perceive the distribution of objects in two dimensions as shown in Figure 3. We can also visualize in 3D, but such visualization needs to take some other factors, including orientation relative to the observer, into consideration. However, most often our data sets have even more features. Thus relying on visual representation for understanding them is quite impossible. Therefore, we have to equip ourselves with some mathematical intuitions.

Taking it easy

To begin with, let us start with a 2D space, where mathematics and visual perception can go hand-in-hand. Let us consider two points in (2D) Euclidean space, $p = (x_1, y_1)$ and $q = (x_2, y_2)$. From our school days we know:

Figure 4: Distance between points p and q.

• The distance between these two points (by applying the Pythagorean theorem as shown in Figure 4) is

$$d(p, q) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2},  \quad (1)$$

known as the Euclidean distance;

• The distance of a point, say p, from the origin of the coordinate system, i.e., (0, 0), is

$$d(p, (0, 0)) = d(p) = \sqrt{x_1^2 + y_1^2};  \quad (2)$$

• From geometry: given three points, they form a unique triangle, and the sum of the lengths of any two sides of the triangle is always greater than or equal to the length of the third side, i.e., given points p, q and r,

$$d(p, q) + d(q, r) \geq d(p, r).  \quad (3)$$

This is known as the triangle inequality.

Clearly, if the points represent objects, the distance between them can be interpreted as a measure of dissimilarity between the corresponding objects. Thus, we are able to calculate dissimilarity, which is the inverse of similarity (more dissimilarity => less similarity and vice-versa). Distance-based similarity is one of the most important and popular of all similarity measures.
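A minimal sketch of these three facts in plain Python follows; the sample points are simply the ones labeled in Figure 2, and the function name is my own:

```python
# A minimal sketch of Equations 1-3: the Euclidean distance between two 2D
# points, the distance from the origin, and a check of the triangle inequality.
import math

def euclidean_2d(p, q):
    """Distance between points p = (x1, y1) and q = (x2, y2), Eqn. 1."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

p, q, r = (2.0, 3.0), (-3.0, 1.0), (-1.5, -2.5)   # the points from Figure 2
print(euclidean_2d(p, q))                          # dissimilarity of p and q
print(euclidean_2d(p, (0.0, 0.0)))                 # distance from origin, Eqn. 2
print(euclidean_2d(p, q) + euclidean_2d(q, r) >= euclidean_2d(p, r))  # Eqn. 3: True
```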

Figure 5: Dot product of vectors $\vec{p}$ and $\vec{q}$.

There is another way of computing similarities directly in Euclidean space, based on the measurement of the angle between the lines from the origin to the points. To understand that, we shall need to re-orient ourselves slightly. We have to think of the points in Euclidean space as vectors with (1) magnitude equal to the distance of the point from the origin; and (2) direction from the origin to the point. Traditionally a vector from point A to B is denoted as $\overrightarrow{AB}$. However, since here the starting point is always the origin, for brevity we denote the vector from the origin to p simply as $\vec{p}$, and thus the point $p = (x_1, y_1)$ can be written as

$$\vec{p} = x_1\hat{i} + y_1\hat{j},$$

where $\hat{i}$ and $\hat{j}$ are unit vectors along the x and y axes respectively. With the vector interpretation of the points, we can apply vector operations on them. The dot product of vectors $\vec{p}$ and $\vec{q}$ is defined as

$$\vec{p} \cdot \vec{q} = x_1 x_2 + y_1 y_2.  \quad (4)$$

Given the above definition of the dot product, the distance of point p from the origin can be expressed as

$$d(p) = \sqrt{x_1^2 + y_1^2} = \sqrt{x_1 x_1 + y_1 y_1} = \sqrt{\vec{p} \cdot \vec{p}} = \|\vec{p}\|,  \quad (5)$$

where $\|\vec{p}\|$ is interpreted as the length of the vector $\vec{p}$ and called its norm. Again, the distance between two points can also be computed as the norm of the difference of the vectors representing the points (remember the parallelogram rule of vector addition/subtraction from school!) as shown below:

$$d(p, q) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} = \sqrt{(x_1 - x_2)(x_1 - x_2) + (y_1 - y_2)(y_1 - y_2)} = \sqrt{(\vec{p} - \vec{q}) \cdot (\vec{p} - \vec{q})} = \|\vec{p} - \vec{q}\|.  \quad (6)$$
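The equivalence in Equations 5 and 6 is easy to check numerically; the following sketch (assuming numpy, with illustrative points) does exactly that:

```python
# A small numpy sketch of Equations 5 and 6: the norm as the square root of a
# dot product, and the distance between two points as the norm of their
# difference vector. The points are illustrative only.
import numpy as np

p = np.array([2.0, 3.0])
q = np.array([-3.0, 1.0])

norm_p = np.sqrt(p.dot(p))             # Eqn. 5: ||p|| = sqrt(p . p)
dist_pq = np.sqrt((p - q).dot(p - q))  # Eqn. 6: d(p, q) = ||p - q||

print(np.isclose(norm_p, np.linalg.norm(p)))       # True
print(np.isclose(dist_pq, np.linalg.norm(p - q)))  # True
```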

Remember that earlier we mentioned that Euclidean space is an algebraic-geometric space. Here the concept of distance comes from geometry, while the concepts of vector, dot product, norm, etc. come from algebra, specifically vector algebra. In Equations 5 and 6 we have shown how geometric distances relate to norms of vectors. (As seen here, the concepts of norm and distance are equivalent. This equivalence has very deep significance: it is what enables us to apply the so-called "kernel trick", making possible the implementation of various kernel-based methods in machine learning.) Finally, we shall show how the geometric concept of the angle between two vectors can be computed algebraically. The dot product of two vectors (Eqn. 4) can also be computed as

$$\vec{p} \cdot \vec{q} = \|\vec{p}\| \|\vec{q}\| \cos\theta,  \quad (7)$$

where θ is the angle between the vectors $\vec{p}$ and $\vec{q}$ (see Figure 5). It is intuitively clear that the more similar two objects are, the smaller the angle θ between their vectors will be, and accordingly the higher

will be the value of the corresponding cos θ. Thus the value of cos θ can be used as a similarity measure between objects, computed as

$$\cos\theta = \frac{\vec{p} \cdot \vec{q}}{\|\vec{p}\| \|\vec{q}\|},  \quad (8)$$

and is called the cosine similarity. The cosine similarity can take values in [-1, 1], corresponding to angles 180° ≥ θ ≥ 0°. Sometimes we would like to restrict it to [0, 1]. This will happen naturally if all the attribute values are positive. Otherwise, we might want to apply a normalization like

$$S(\vec{p}, \vec{q}) = \frac{1}{2}\left(\frac{\vec{p} \cdot \vec{q}}{\|\vec{p}\| \|\vec{q}\|} + 1\right).$$
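A small sketch of Eqn. 8 and of the rescaling above (assuming numpy; the helper names are mine, and the example pairs are the ones discussed in the next paragraph):

```python
# A minimal sketch of cosine similarity (Eqn. 8) and its [0, 1] rescaling.
import numpy as np

def cosine_similarity(p, q):
    return p.dot(q) / (np.linalg.norm(p) * np.linalg.norm(q))

def rescaled_similarity(p, q):
    """Map cosine similarity from [-1, 1] to [0, 1]."""
    return 0.5 * (cosine_similarity(p, q) + 1.0)

a, b = np.array([7.0, 4.0]), np.array([14.0, 8.0])
print(cosine_similarity(a, b))     # 1.0: collinear, maximally similar
print(np.linalg.norm(a - b))       # yet their Euclidean distance is not 0

c, d = np.array([0.0, 5.0]), np.array([5.0, 0.0])
print(cosine_similarity(c, d))     # 0.0: orthogonal vectors
```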

By mapping our objects (with two attributes) to points in Euclidean space, we have discovered two ways of measuring/estimating similarity: (1) as the inverse or opposite of the Euclidean distance and (2) the cosine similarity. However, we must keep in mind that they represent different perspectives on similarity. For example, if the cosine similarity of two objects is 1, that is maximum similarity. But then their Euclidean distance may not be zero. For example, try the objects (7,4) and (14,8). The fact is, if the proportion of each attribute is the same, then they become collinear with angle 0 between them, and their cosine similarity is one irrespective of the difference of their lengths/norms. Also, consider the objects (0,5) and (5,0) - they have cosine similarity 0. (The reason behind this discrepancy lies in the fact that the Euclidean distance is a measure of the minimum length to be covered between two points in the space, while cosine similarity is more a measure of collinearity (or orthogonality) of two vectors.)

Diving deeper

Now, what will happen if the number of attributes is more than 2? We cannot effectively visualize them anymore, but they can still be abstracted as points in higher dimensional spaces, which are generalizations of the Euclidean spaces we just studied. Let us try to understand them. From a mathematical standpoint, feature spaces are vector spaces, more accurately normed vector spaces. Many of them, actually the most interesting ones, fall in the subcategory of inner product spaces. See Figure 6 for the hierarchy of mathematical spaces. Here are some intuitive pointers about the most relevant properties of feature spaces, irrespective of their dimensions:

Figure 6: Hierarchy of mathematical spaces. (Courtesy: Wikipedia)

• The dimensions of the feature spaces we are interested in are always finite (clearly we are not interested in dealing with objects having an infinite number of features);

• All feature spaces are metric spaces. Thus, the distance between any two points is measurable/defined;

• Feature spaces are actually normed vector spaces. Hence, there is a norm associated with every vector, which can be interpreted as the length of the vector. In case the vector is a feature vector (when we play more intensely with the feature space, we shall be working with many other types of vectors, e.g., weight vectors representing a separating plane, other than the feature vectors representing the data set), we can intuitively understand the norm as the distance of the data point from the origin (see Equation 2). Also, note that for a given feature space, we may have the choice of using more than one norm; and

• If the feature space is an inner product space, then there will be a defined structure in the space called the inner product, which can be considered a generalization of the 'dot product' in Euclidean space (Eqn. 4). An inner product always induces a corresponding norm, as shown in Equation 5.

Formally, an inner product space is a vector space with an additional structure called an inner product that associates each pair of vectors with a scalar. Also, an inner product always induces a norm, which assigns a positive value to each vector in the vector space and can be interpreted as the length or size of the vector. However, mind it: while there is always a norm corresponding to (formally, induced by) an inner product, there are many norms which cannot be induced by a valid inner product. (At this point it may seem that we are, and shall continue to be, unnecessarily talking at length about inner products while their only use here is to compute the cosine similarity. However, we are looking forward to more interesting stuff to come later. One of them is kernel methods, among the most sophisticated of ML techniques (yes, that includes SVM). The existence of inner products in relevant spaces is the enabler of these methods.)

Getting the vocabulary right

Working with (high dimensional) generalizations of Euclidean space requires a new, or rather an enhanced, set of vocabulary. For example, with a, say, 50-dimensional space you can no longer refer to the coordinate axes as x, y, z and so on; you will exhaust the alphabet. So, we name the axes of an $n$-dimensional space as $x_1, x_2, \ldots, x_n$. Also, we denote a vector (or point) in such a space as

$$\mathbf{x} = (x_1, x_2, \ldots, x_n),$$

where $x_i$ is the value of the $i$-th component or coordinate. Also, notice that here we denote the vector by boldface $\mathbf{x}$ instead of $\vec{x}$, as done earlier. In mathematical terminology, the set of all real numbers is denoted by $\mathbb{R}$. Thus, the $n$-dimensional inner product space, as a set of real-valued points in $n$ dimensions, is denoted as $\mathbb{R}^n$.

The metric and norm

For our purpose, it will suffice to say that a metric $d$ is a function defined on the set $\mathbb{R}^n$ as

$$d : \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}.$$

Or, in common language, $d$ takes two $n$-dimensional real vectors and produces a scalar, i.e., a real number. Also, for any $\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathbb{R}^n$:

1. d(x, y) ≥ 0, i.e., d is always non-negative;

2. d(x, y) = 0 ⇔ x = y, i.e., x and y are indiscernible (indistinguishable);

3. d(x, y) = d(y, x), i.e., d is a symmetric function; and

4. d(x, z) ≤ d(x, y) + d(y, z) - this is a generalization of the triangle inequality we saw earlier in Eqn. 3.

Similarly, within our context, we can define a norm as a function

$$\| \cdot \| : \mathbb{R}^n \rightarrow \mathbb{R},$$

such that for $\mathbf{x} \in \mathbb{R}^n$ the following hold:

1. $\|\mathbf{x}\| > 0$, unless all components of x are 0; in that case, x is called a zero vector, i.e., a vector of length zero, and is denoted as 0;

2. Multiplication of the vector x with a scalar $\alpha \in \mathbb{R}$ changes the length of the vector, but not its direction, i.e., $\|\alpha\mathbf{x}\| = |\alpha| \, \|\mathbf{x}\|$;

3. Taking norms as distances, the distance from point A to C via point B is never shorter than going directly from A to C. This is (clearly) a variation of the triangle inequality, expressed as

$$\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|, \quad \text{for } \mathbf{x}, \mathbf{y} \in \mathbb{R}^n.$$

Naturally, the Euclidean distance, generalized to $n$ dimensions, is a valid (and in fact one of the most popular) metric in the following form:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}.  \quad (9)$$

We can also define the Euclidean norm as follows:

$$\|\mathbf{x}\| = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}.  \quad (10)$$
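A sketch of Eqns. 9 and 10 for arbitrary n (assuming numpy; the two 4-dimensional vectors are merely illustrative, Iris-like measurements):

```python
# A sketch of the n-dimensional Euclidean distance (Eqn. 9) and norm (Eqn. 10).
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))   # Eqn. 9

def euclidean_norm(x):
    return np.sqrt(np.sum(x ** 2))          # Eqn. 10

x = np.array([5.1, 3.5, 1.4, 0.2])   # illustrative 4-dimensional feature vectors
y = np.array([6.3, 3.3, 6.0, 2.5])
print(euclidean_distance(x, y))
print(euclidean_norm(x))
print(np.isclose(euclidean_distance(x, y), np.linalg.norm(x - y)))  # True
```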

The Inner Product

The inner product of two real $n$-vectors is a scalar. It is a generalization of the 'dot' product in Euclidean space. Thus, we shall use the '·' notation to indicate the inner product in general. Every inner product induces an associated norm. For our purpose it will suffice to understand that they are related in the following way:

$$\|\mathbf{x}\| = \sqrt{\mathbf{x} \cdot \mathbf{x}}.  \quad (11)$$

The above holds for any inner product and the corresponding norm induced by it. Now, in generalized Euclidean space, one of the most popular (and interesting) inner products is the direct generalization of the dot product defined in Eqn. 4, as follows:

$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \ldots + x_n y_n.  \quad (12)$$

We can use Eqn. 11 to induce the corresponding norm as follows:

$$\|\mathbf{x}\| = \sqrt{\mathbf{x} \cdot \mathbf{x}} = \sqrt{\sum_{i=1}^{n} x_i^2},$$

which is (no surprise!) just the Euclidean norm defined in Eqn. 10. Finally, the inner product of x and y in terms of the angle between them is

$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\| \|\mathbf{y}\| \cos\theta  \quad (13)$$

$$\Rightarrow \cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} \quad \text{(the cosine similarity)}.  \quad (14)$$

Computing with elements of a vector space often requires application of the concepts of linear algebra. Thus, it is convenient to express vectors in matrix notation as follows:

$$\mathbf{x} = (x_1, x_2, \ldots, x_n) = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = |x\rangle \text{ (as a column matrix)} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^T = \langle x|^T \text{ (as a transposed row matrix)}.  \quad (15)$$

$\langle x|$ and $|x\rangle$ are known as the bra and ket notation of vectors (dividing the word 'bracket' by removing the 'c' in the middle). They are very popular in quantum mechanics, but are also sometimes used in the ML literature. Using these notations, the inner product is expressed as:

$$\mathbf{x}^T \mathbf{y} = \langle x, y \rangle = \sum_{i=1}^{n} x_i y_i.$$
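A quick numerical sanity check of Eqns. 11 and 12 in matrix notation (assuming numpy; the vectors are arbitrary):

```python
# A small sketch: the inner product as a sum of component-wise products,
# written equivalently as the matrix product x^T y, and the norm it induces.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = np.sum(x * y)      # Eqn. 12: sum_i x_i y_i
inner_as_matmul = x @ y    # the same value, written as the matrix product x^T y
norm_x = np.sqrt(x @ x)    # Eqn. 11: ||x|| = sqrt(x . x)

print(inner, inner_as_matmul)                  # 32.0 32.0
print(np.isclose(norm_x, np.linalg.norm(x)))   # True
```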

More distance measures and norms

We are already familiar with the Euclidean norm. However, this norm is one member (possibly the most prominent, from the ML perspective) of a family of norms and distances called the $L_p$ norms, which is defined for $p \geq 1$, $p \in \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^n$ as

$$\|\mathbf{x}\|_p = (|x_1|^p + |x_2|^p + \cdots + |x_n|^p)^{1/p} = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}.  \quad (16)$$

This is also known as the Minkowski norm/distance. Clearly, this can be used to define any number of distance measures in $\mathbb{R}^n$. The most commonly used members of the family are:

$$L_1: \|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i| \quad \text{(the Manhattan or city-block norm)},  \quad (17)$$

$$L_2: \|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} \quad \text{(the Euclidean norm)},  \quad (18)$$

$$L_\infty: \|\mathbf{x}\|_\infty = \max_i |x_i| \quad \text{(the Chebyshev norm)}.  \quad (19)$$

Figure 7 depicts the unit circles for $L_1$, $L_2$ and $L_\infty$ in 2D space. Observe that not all three are circles as we usually understand them. 'Unit circle' is a term in ML that denotes the contour of the points in space at distance 1 from the origin for a particular distance measure. Naturally, they are different for different distance measures, as we can see in Figure 7.
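The $L_p$ family is directly available via numpy's norm routine; the following sketch (the test vector is arbitrary, the helper function is mine) computes the $L_1$, $L_2$, $L_\infty$ and a general Minkowski norm:

```python
# A sketch of the L_p family (Eqn. 16) and its special cases, using
# numpy.linalg.norm with the `ord` argument.
import numpy as np

x = np.array([3.0, -4.0])

l1 = np.linalg.norm(x, ord=1)         # Manhattan norm: |3| + |-4| = 7
l2 = np.linalg.norm(x, ord=2)         # Euclidean norm: sqrt(9 + 16) = 5
linf = np.linalg.norm(x, ord=np.inf)  # Chebyshev norm: max(|3|, |-4|) = 4

def minkowski_norm(x, p):
    """General L_p norm of Eqn. 16, for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(l1, l2, linf)
print(minkowski_norm(x, 3))           # the L_3 norm lies between L_inf and L_2
```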

Figure 7: Unit circles: contours of unit distance from the origin for $L_1$, $L_2$ and $L_\infty$.

Figure 8: Result of K-means clustering of the Iris data for the features petal length and petal width, using the Euclidean distance (left) and the Manhattan distance (right).

Figure 8 depicts the results of clustering the Iris data using two features. As we can observe, the results are identical. But we must remember that the data set is fairly simple, with only two dimensions and 150 points. In more complicated cases the choice of norm or distance measure may impact the quality of the solution. Readers should experiment with them in algorithms, e.g., K-means, on some complicated/realistic data sets in order to develop a proper understanding. Finally, we used 2D space for visual illustration. However, the reader should develop an intuitive understanding of higher dimensional spaces, where we have to consider unit spheres (in 3D) and hyperspheres (in >3D). As mentioned earlier, the Euclidean distance measure is the most popular of them. It has the advantage of being a quadratic form. As a result, it is an analytic (differentiable everywhere in its domain) function and suitable for use in gradient-search algorithms in ML. Also, we can modify the distance measure by incorporating more information about the data distribution. Here we shall study two such modifications.

Weighted Euclidean distance

Sometimes we want to give different degrees of importance to different dimensions of the feature space, to reflect the importance of the features. The easy way of achieving this is to associate a weight value with each dimension. Thus, for an assignment of weight values $w_i$, $1 \leq i \leq n$, we can compute the weighted Euclidean distance as follows:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}.  \quad (20)$$

It can be easily seen that it can be induced by the inner product

$$\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{n} w_i x_i y_i.  \quad (21)$$

The determination of the $w_i$'s is problem-specific. However, one of the most commonly used weight assignments comes out of data standardization or normalization processes. In real-life problems the spreads of values for different features can often vary substantially; say, one feature ranges over 0-1 while another over 100-1000000. In such cases the value of the second feature dominates the distance computation. This is usually undesirable. The normalization process is used to bring all the features within the same/similar range of values. The validity of such a process lies in the fact that the relative positions of the data points are preserved under linear transformations (i.e., translation, rotation and scaling). In other words, in a feature space S, if

• $d(\mathbf{x}, \mathbf{y}) > d(\mathbf{x}, \mathbf{z})$, then in a space $S'$ obtained by application of any combination of linear transformations of the points in $S$, we will have $d(\mathbf{x}', \mathbf{y}') > d(\mathbf{x}', \mathbf{z}')$; and

• The ratio of distances will remain unchanged, i.e., $\frac{d(\mathbf{x}, \mathbf{y})}{d(\mathbf{x}, \mathbf{z})} = \frac{d(\mathbf{x}', \mathbf{y}')}{d(\mathbf{x}', \mathbf{z}')}$.

One of the popular normalization processes involves applying the transformation

$$x_i' = \frac{x_i - \bar{x}_i}{s_i}, \quad 1 \leq i \leq n,$$

where $\bar{x}_i$ is the mean value of the $i$-th feature and $s_i$ is its standard deviation. This transformation makes all features have mean 0 and standard deviation 1. It is often referred to as the Z-transformation, and the transformed values are called Z-values or Z-scores. This normalization can be achieved computationally by simply using the weighted Euclidean distance with $w_i = \frac{1}{s_i^2}$. (Reader, please check it.)
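The check suggested above can be done numerically; here is one hedged sketch on synthetic data (assuming numpy; the toy ranges are made up):

```python
# A quick numerical check: the Euclidean distance between z-scored vectors
# equals the weighted Euclidean distance of Eqn. 20 with w_i = 1 / s_i^2
# computed on the raw vectors.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 500.0], scale=[1.0, 200.0], size=(100, 2))  # toy data

mean, std = X.mean(axis=0), X.std(axis=0)
Z = (X - mean) / std                        # z-scores

def weighted_euclidean(x, y, w):
    return np.sqrt(np.sum(w * (x - y) ** 2))   # Eqn. 20

w = 1.0 / std ** 2
d_weighted = weighted_euclidean(X[0], X[1], w)
d_zscore = np.linalg.norm(Z[0] - Z[1])

print(np.isclose(d_weighted, d_zscore))     # True
```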

Mahalanobis distance

In normalization using the Z-transform one uses only the squared standard deviations, i.e., the variances. Variances are the diagonal values of the covariance matrix of the data. If we want to take into account the covariances of the data as well, we shall need to incorporate the off-diagonal elements of the covariance matrix Σ into the distance calculation. Doing so gives us the Mahalanobis distance

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{[\mathbf{x} - \mathbf{y}]^T \Sigma^{-1} [\mathbf{x} - \mathbf{y}]},  \quad (22)$$

induced by the inner product

$$\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \Sigma^{-1} \mathbf{y}.  \quad (23)$$

It can be easily seen that the weighted Euclidean distance with $w_i = \frac{1}{s_i^2}$ is a special case of the Mahalanobis distance where the covariance matrix is diagonal. Figure 9 depicts the contours of points equidistant from the origin for the special and general Mahalanobis distances. It can be observed that in both cases the contours are elliptical, with their half-axes equal to the standard deviations $s_i$. For the

Figure 9: Unit circles for the special and general Mahalanobis distances.

special one, the axes of the ellipses are parallel to the coordinate axes. However, the general one does not have this restriction; it can be oriented in any direction, with axes aligning with the principal axes (or eigenvectors) of the covariance matrix Σ.
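A minimal sketch of Eqn. 22 on synthetic correlated data follows (assuming numpy; the numbers are illustrative, and scipy's scipy.spatial.distance.mahalanobis, which takes the inverse covariance as an argument, could be used instead):

```python
# A sketch of the Mahalanobis distance (Eqn. 22) using an estimated
# covariance matrix of synthetic, correlated 2D data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[4.0, 3.0], [3.0, 9.0]], size=500)  # toy data

cov = np.cov(X, rowvar=False)        # estimated covariance matrix Sigma
cov_inv = np.linalg.inv(cov)

def mahalanobis(x, y, cov_inv):
    diff = x - y
    return np.sqrt(diff @ cov_inv @ diff)   # Eqn. 22

print(mahalanobis(X[0], X[1], cov_inv))
print(np.linalg.norm(X[0] - X[1]))          # compare with the plain Euclidean distance
```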

Looks alike, but not quite ...

There are situations where the data we work with has the appearance of feature vectors. However, whether they really are feature vectors may not be very clear. One of the pointers may be whether each component of a feature vector represents an attribute of the object. Even so, it might be difficult to determine, and thus so is the applicability of the feature space formalism.

Keep doing reality check

The idea of feature space is an extremely powerful one. But we should not be blinded by it and shoehorn anything that looks like a numerical vector into the feature space framework. Yes, you can do the computations, but the results must be meaningful in the context of the problem at hand. We should not get into situations like 5 bananas + 7 houses = 12 ??! We should always remember that the mathematical semantics must match the problem semantics in the real world. To understand this, let us study an example.

Example case

Rating data captures users'/consumers' likes/dislikes of items in terms of the ratings they give to the items. We are quite familiar with them from online retail stores, video subscriptions, product reviews, etc. Each of these ratings can be expressed as a 3-tuple (user, item, rating). Rating data sets are used for building collaborative-filtering based recommender systems. Here we are not interested in recommender systems per se; there is enough material, including the one linked above, available on the topic. However, we are interested in the computation of similarity between users based on their likes/dislikes of items, and/or between items based on their attractiveness to the users.

These similarities are at the core of the working of recommender systems. But what kind of similarity measure should we use here? Let us have a deeper look in order to find out.

Working with rating data

Let U, I and R be the sets of users, items and ratings respectively. We shall denote the rating given by user u ∈ U for item i ∈ I as $r_{ui} \in R$. Then the full sets of ratings given by users u and v can be expressed as the sets

$$r_u = \{r_{ui} \mid r_{ui} \in R\} \quad \text{and} \quad r_v = \{r_{vi} \mid r_{vi} \in R\}$$

respectively. Since the rating values are usually from a numeric rating scale, at first glance we can impose an order on the elements of the $r_u$'s and treat them as numerical feature vectors with components comprised of the ratings for the items. Voila! We can use the Euclidean or some other distance/similarity measure discussed earlier. But wait... not all users rate all items (consider a big e-tail portal: they have tens of thousands of products; how many of them might a particular customer buy and/or review/rate?), nor do they rate the same set of items. So, how do we assign dimensions of the feature space to represent all the $r_u$'s? It may not sound like a big problem: we can have a feature space with dimension equal to the total number of items, i.e., |I| (that may be a few tens of thousands, which has its accompanying problem called the curse of dimensionality, but let us ignore that for the time being) and pad the unavailable ratings in the feature vector with a special value. For example, if the rating scale is 1-5, we put 0 for the items for which the user has not given a rating. Well and good, the computational problem is solved. Now, let us see what that means mathematically and in reality. Mathematically, 0 is just a number among the real numbers (i.e., −0.00...01 < 0 < 0.00...01), with no special meaning other than that. So, it will be treated on the same footing as the other valid rating values. In reality, the user may not even have known that such an item exists in the universe! So it is not an accurate (rather, far from accurate) representation of the user's likes/dislikes, which is what we are interested in. If we go to the other extreme and create the feature space only for items rated by all users, there may not be any, or at most very few, such items. That again defeats the purpose thoroughly. We can take a middle ground by considering pairs of users at a time. We can compute the Euclidean distance between them in the feature space representing ratings for the items they both rated. In that case comparisons like d(u, v) > d(u, w) are meaningless unless both pairs of users have the same common set of rated items. In other words, to be meaningful they have to be computed in the same feature space. But that is not likely to happen in general. Thus, we can see that the use of the feature space itself as a computational framework may not be valid in this case. So, what do we do? There are many other similarity measures not based on the idea of feature space. We have to explore them.
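To make the discussion concrete, here is a hedged sketch with made-up ratings (the user/item names are hypothetical) contrasting zero-padding over all items with restricting the comparison to co-rated items:

```python
# A sketch of the issue discussed above. Zero-padding a full item vector treats
# "never seen" as the lowest possible opinion, whereas restricting the
# comparison to co-rated items keeps the semantics honest, at the cost of
# distances living in different feature spaces for different user pairs.
import numpy as np

ratings_u = {"i1": 5, "i2": 3, "i4": 4}           # hypothetical users and items
ratings_v = {"i1": 4, "i3": 2, "i4": 5}

all_items = sorted(set(ratings_u) | set(ratings_v))
u_padded = np.array([ratings_u.get(i, 0) for i in all_items], dtype=float)
v_padded = np.array([ratings_v.get(i, 0) for i in all_items], dtype=float)
print(np.linalg.norm(u_padded - v_padded))         # distance dominated by the padding

common = sorted(set(ratings_u) & set(ratings_v))   # items rated by both users
u_common = np.array([ratings_u[i] for i in common], dtype=float)
v_common = np.array([ratings_v[i] for i in common], dtype=float)
print(common, np.linalg.norm(u_common - v_common))
```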

Understanding what we want

To choose a suitable similarity measure, we have to understand the true nature of the similarity we are looking for. Here, we have to ask ourselves what we really mean by "similar customers" in the present problem context. We are not interested in their physical similarity or similarity in academic achievement! We are interested in the similarity of their likes/dislikes over a set of items, and need to compute a meaningful aggregate measure of that similarity over the set. For example, two users' likes are similar over a set of items {i, j, k} if both like i more than j but a little less than k. We can visualize the ratings given by a user in a graph like Figure 10. It is intuitively clear that this user's ratings will be similar to the ratings of another user whose rating curve shows a similar pattern of variation over the same item set. So, if we can find means of measuring the similarity of this variation pattern, we can use that as a similarity measure. Clearly, here we are looking for some kind of measure of correlation between the variation patterns.

Figure 10: Ratings given by a user for items 1-6.

Correlation based similarity measures

In order to apply correlation based techniques we can consider the user ratings ru and rv as sequences of values corresponding to the ratings given by them for a set of common items.

Pearson product-moment correlation coefficient

The Pearson coefficient is a measure of linear dependence between two variables. Given the sequences of values of two variables $X = \{x_1, x_2, ..., x_n\}$ and $Y = \{y_1, y_2, ..., y_n\}$, the Pearson correlation coefficient is defined

Figure 11: Two rating sequences (users u and v) with Pearson correlation coefficient r = 1.0.

as:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},  \quad (24)$$

where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y respectively. r takes a value between -1 and +1 (inclusive), with +1 and -1 indicating total positive and negative correlation respectively, while 0 signifies no correlation. A slightly easier to implement formula is

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}}.  \quad (25)$$
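A sketch of Eqn. 24 for two rating sequences over the same items (assuming numpy; the ratings are made up, and scipy.stats.pearsonr would give the same value):

```python
# A minimal sketch of the Pearson correlation coefficient (Eqn. 24) applied to
# two hypothetical rating sequences over the same six items.
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / (np.sqrt(np.sum(xc ** 2)) * np.sqrt(np.sum(yc ** 2)))

r_u = [4, 2, 5, 3, 1, 4]        # hypothetical ratings of user u for items 1-6
r_v = [5, 3, 5, 4, 2, 5]        # user v follows a similar up/down pattern

print(pearson(r_u, r_v))        # close to 1: very similar tastes
print(pearson(r_u, r_v[::-1]))  # reversing the pattern drops the correlation
```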

It can be easily seen that the similarity of the ratings given by users u and v can be measured by computing the Pearson correlation coefficient between the sequences $r_u$ and $r_v$ for the items common to them. Figure 11 depicts a case of two rating sequences for which r = 1, which can be readily interpreted as full similarity according to the sense we have developed above. (Dear reader, I would like to draw your attention to Equations 8 and 24. Do you observe any similarity between them? Well, as a clue, think about the cosine similarity of the vectors $(\mathbf{x} - \bar{x}\mathbf{1})$ and $(\mathbf{y} - \bar{y}\mathbf{1})$, where $\mathbf{1}$ is a unit column vector. This is known as the adjusted or augmented cosine similarity between x and y. Please work it out and see.)

Spearman rank correlation coefficient

The Spearman rank correlation coefficient is the same as the Pearson correlation coefficient between two ranked variables. In other words, instead of computing with the actual values of the variables, the ranks of the respective variables are used for computing the Pearson coefficient. A ranking of the items in a set is an ordering among the items which allows us to pairwise compare any two items in order to determine

whether one is ranked higher or lower than, or equal to, the other. For a set of numerical values, one of the commonest rankings can be their sorting order, with tied positions equalized. Thus, the correlation between two sequences of numerical values x and y is computed by first converting x and y into the corresponding sequences of ranking values $r^x$ and $r^y$ and then computing the Spearman coefficient ρ as:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},  \quad (26)$$

where $d_i = r^x_i - r^y_i$ is the difference between the ranks of the $i$-th element in the two sequences.

More stuff to look into

Some other useful similarity measures

There are many other kinds of objects which we often need to compare and for which we need similarity measures. Many of them cannot even have any meaningful numerical representation. Here are a few examples:

• The similarity between sets can be computed using Jaccard index and Tanimoto similarity;

• Distance between binary strings is computed using the Hamming distance;

• Distance between alphabet strings can be computed using the Levenshtein distance.

With the advent of kernel-based methods, particularly the "kernel trick", even objects of non-vectorial nature, e.g., graphs, sequence data, images, text, etc., can be brought into the purview of methods primarily invented for numerical vector data. I leave it to your own interest and effort to look into these.

Further reading on some basic but must-know stuff

• Origin (probable) of the term “Data Science"

• A Few Useful Things to Know About Machine Learning (Domingos, 2012)

• The key word in "Data Science" is not Data, it is Science

• Data Science and Prediction (Dhar, 2013)

Bibliography

Vasant Dhar. Data science and prediction. Commun. ACM, 56(12): 64–73, December 2013. ISSN 0001-0782. doi: 10.1145/2500499. URL http://doi.acm.org/10.1145/2500499.

Pedro Domingos. A few useful things to know about machine learning. Commun. ACM, 55(10):78–87, October 2012. ISSN 0001-0782. doi: 10.1145/2347736.2347755. URL http://doi.acm.org/10.1145/2347736.2347755.

Ciao!