Measuring Similarity

Dr. Arijit Laha (Senior Member: ACM, IEEE; Email: [email protected]; Homepage: https://sites.google.com/site/arijitlaha/)

Introduction

(Pedestrian Data Science Series, Article #1. The motivation for writing this white paper, hopefully the first of a series, depending on its reception and utilization, is to introduce our young data scientists to some subtleties of the art, and of course science, they have taken up to practice.)

The ability to recognize objects and their relationships is at the core of intelligent behavior. This, in turn, depends on one's ability to perceive similarity or dissimilarity between objects, be they physical or abstract ones. Hence, if we are interested in making computers behave with any degree of intelligence, we have to write programs that can work with relevant representations of objects and with means to compute their similarities or lack thereof, i.e., dissimilarity (obviously, they are two faces of the same coin).

Let us examine the two emphasized phrases of the paragraph above. They are the most crucial and fundamental issues in building any computer-based system showing an iota of intelligent behavior. (If you represent a 'chair', to your computer program for identifying objects to sit on, as just something that has four legs, then the computer is very likely to advise you to sit on your house-puppy!) Essentially,

• we need to work with a representation of objects of interest containing adequate information relevant for the problem at hand. For example, the information needed about a chair for distinguishing it from a table is quite different from the information needed if we want to distinguish between an easy chair and a work chair;

• once we have a proper representation of the objects, we need to incorporate it into a relevant mathematical framework which will enable us to compute a measure of similarity or dissimilarity between objects.

There are quite a number of available measures for each type of commonly used representation; we can find a good listing here. Unfortunately, the measures within each type have many differences, some obvious and some subtle. Thus, we need to choose and evaluate them in the context of our problems. A nice survey of many of these measures is available on the internet. Advanced readers can (and are urged to) directly go there and read it and its likes. In this white paper we shall discuss objects represented by a set of their attributes and the computation of distances between pairs of them as points in a feature space. The computed distance is interpreted as a measure of dissimilarity, and thus the inverse of similarity - in the sense that the less the distance, the more the similarity and vice-versa. We shall also find a measure, the cosine similarity, that can be directly interpreted as similarity. Don't worry my friends, I shall take you there very gently.

Working with Object data

As far as organization of the data is concerned, there are three major categories:

• Object data: Objects are represented by an ordered set of their attributes/characteristics/features. We shall use the term "feature" henceforth. These are the data which we store and perceive as one record/instance per row in a file/table (in traditional statistics these are also known as cross-sectional data);

• Sequence data: Data elements correspond to a particular order, temporal (time series) or spatial (letters/words appearing in a text); and

• Relational/relationship data: The data captures various relationships among the objects - we often call them "graph data".

Remember, there can be many situations while solving real-life problems when several of these types of data may be used, as well as converted into one another. Nevertheless, here we concentrate on object data only.

Attributes

Features in object data can be numerical, Boolean as well as categorical. Again, let us consider some features of a chair:

• Height of the chair: numerical;

• Area of the seat: numerical;

• Has armrest: Boolean - either has or not (1/0)

• Reclinable: Boolean - yes/no

• Color of the chair: categorical

• Number of legs: ? - can be 4, 3, 1 (not 2, I guess). Is it meaningful to consider it numeric, so that we can do all kinds of mathematical and/or statistical jugglery, or should we consider it categorical - 4-legged, etc.?

For example, my chair is 22 inch in height, with seat area 400 sq. inch and armrests, is fixed-back, red, with 4 legs. Hence, MyChair = (22, 400, 1, 0, 4L), a 5-tuple, is the representation of my chair. In statistical literature such data are often called multivariate data.

Numerical features allow us to apply a vast array of mathematical and statistical tools in order to work with them. Thus, in most cases we try to frame the problems in terms of numerical features (even when the raw data is something different, such as text). In the majority of cases, Boolean features can also be treated as numerical with values 0 and 1. Categorical features are slightly more difficult to deal with within the same framework, since we cannot work out a concept of similarity among their values; e.g., it is somewhat absurd to say that the color green is more similar to black than to blue or vice-versa. Nevertheless, we shall see later that categorical features can also be transformed. Thus, without restricting applicability, hereafter we shall consider all the features of an object to be numerical. Now, let us set up some basic nomenclature.

• Let X denote a data set with n instances of object data;

• Let $x_i = (x_{i1}, x_{i2}, \cdots, x_{ip}) \in X$ be the $i$-th object with $p$ attributes/features with values $x_{i1}, x_{i2}, \cdots, x_{ip}$ respectively (notice the order of subscripts: the first subscript corresponds to an object, the second to an attribute); and

• To keep things simple, let us posit that $\forall x_i \in X$ the types of features available are the same. In other words, each object instance is represented by a $p$-tuple consisting of the values of the same set of attributes in the same order.

Figure 1: The Iris data, partial listing, sorted by sepal width. (Courtesy: Wikipedia)

To understand this properly, let us consider one of the most well known (and classic!) data sets, the Iris data set. The data set comprises 50 sample observations from each of three species of Iris flowers. The measured features are the lengths and widths of the sepals and petals of the flowers. Part of the data is shown in Figure 1. Each row in the data represents measurements of one flower, with the species of the flower noted in the fifth column. Thus, this data has four numerical features, i.e., p = 4, and 3 × 50 = 150 instances (rows), i.e., n = 150. The inclusion of 'species' in the fifth column makes it a labeled data set, where each instance is labeled with a category or class. For the time being we shall ignore these labels in our discussion.
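For readers who like to follow along in code, here is a minimal sketch (assuming Python with scikit-learn installed; the variable names are mine) that loads the Iris data and confirms the values of n and p:

```python
# A minimal sketch, assuming scikit-learn is available. It loads the classic
# Iris data set and confirms n = 150 instances and p = 4 numerical features,
# ignoring the class labels for now.
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data               # the n x p matrix of feature values
print(X.shape)              # (150, 4): n = 150 objects, p = 4 features
print(iris.feature_names)   # sepal/petal lengths and widths (cm)
```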

It is very important to remember that an object can potentially have many (often an extremely large number of) attributes/features. For example, consider the Iris flowers. Other than the four collected in the Iris data set, there are many other possible attributes, ranging from simple ones like color to very complex ones like the genome sequence (think of others in between). It is clear that we cannot use all possible attributes. So, which ones do we use? This is one of the big questions almost always faced by data scientists in real-world problems (in research labs one has the luxury of using well-known data sets). The problem is essentially that of determining the relevance and accessibility/availability of the features with respect to the problem at hand. Addressing this issue constitutes big and vital parts of data science known as feature selection and feature extraction, collectively called feature engineering. While feature engineering is very much problem-specific (and thus out of the current scope), let us look into some of the guiding principles here.

Relevance. Usually we would like to use discerning or discriminating features. These are the features which can be useful for computationally identifying similar and dissimilar objects in the sense relevant for the problem. For example, consider the species identification problem for Iris flowers. The attribute 'color' is not very discriminating (all Irises are violet), but genome data can be very useful. On the other hand, if the problem is to distinguish between Irises and Roses, color can be one of the very useful features. Let us try to understand the above in simpler terms. Again, consider the problem of Iris species identification. If the objects, i.e., Iris flowers, are represented with a set of discriminating features, we can expect that, in general, the similarity computed using the features is greater between two flowers from the same species than otherwise. Let us add another feature to the representation. This will affect the similarities. Now, if the similarity between flowers from the same species is increased and/or the similarity between flowers from different species is decreased after the introduction of the new feature, then the new feature is a discriminating one. The existence of non-discriminating/irrelevant/redundant features in the data evidently does not contribute to the solutions, but can often worsen the situation. Their presence can be thought of as noise in the data, which essentially decreases the signal-to-noise ratio in the information content of the data. This makes the task of extracting knowledge/information (signal) from the data more difficult.

Pragmatics. Even after we have some idea of which features are useful, we have to consider the cost as well as the feasibility of collecting and/or processing them. Let us again consider the problem of Iris species identification. While the genome data can be extremely good at discriminating species, it is costly to collect as well as to process computationally. So, if we have the freedom to dictate the data collection strategy, we shall have to consider the cost-benefit tradeoff for selecting the features to be collected/measured. In a lot of real-world work we don't have the above freedom. Instead we have to work with already available data, which is mostly collected for some other purpose (transactional, operational, etc.), with no plan of subjecting it to data science techniques. Such scenarios are known as data repurposing. Here our challenge is to identify the relevant features in the data for use in solving the data science problem. In the worst case, we might face a situation where there are not enough useful features in the data for solving the problem to a satisfactory level.

Objects in feature space

The Euclidean space

Till now we have been discussing a single object and its representation. Now, let us think about working with a number of objects (at the least, similarity refers to a pair of objects!). Here the concepts drawn from the study of vector spaces become the most invaluable tool. To understand how to use them, let us start by recalling our school days when we learnt what is called Euclidean or plane geometry. There we studied properties of various geometric objects, like straight lines, parallel lines, triangles, circles, etc., based on a number of axioms or postulates. But, think carefully: at that time we did not have any business with coordinates of the points. That came later under the heading of co-ordinate geometry (a bit of nasty stuff for most of us). Well, the person responsible for complicating our innocent(?) school days was René Descartes, inventor of, among a lot of other stuff, the Cartesian coordinate system shown in Figure 2. As depicted in Figure 2, the Cartesian coordinate system is a rectilinear coordinate system, whose axes are mutually perpendicular (in formal mathematical terms, orthogonal) and linear (i.e., straight lines). The introduction of the Cartesian coordinate system endows each point in the Euclidean plane with a distinct address, called its coordinate, a pair of numeric values. This created a bridge between geometry and algebra, the marriage producing the offspring called analytic geometry. One of the minor consequences of this is that we can plot and thus visualize algebraic equations, e.g., $x^2 + y^2 = 1$, a circle centered at the origin with radius 1!

Figure 2: Left: Illustration of a Cartesian coordinate plane. Four points are marked and labeled with their coordinates: (2,3) in green, (−3,1) in red, (−1.5,−2.5) in blue, and the origin (0,0) in purple. Right: Cartesian coordinate system in 3 dimensions. Each point is described with a 3-tuple of coordinate values corresponding to its position along the x, y, and z axes respectively. (Courtesy: Wikipedia)

A Euclidean plane equipped with a Cartesian coordinate system becomes a Euclidean space (notice that I have used 'a', not 'the', in the previous sentence). Here the term "space" is not a common-sense or colloquial term (as in 'a spacious home'); rather, it refers to an algebraic-geometric entity. Yes, we are venturing into slightly dangerous territory, but have courage, we need not go too deep into it. For our purpose it will suffice to understand that a "space" is a set of distinctly identifiable (by their coordinates) points with certain properties holding among them. These properties are the ones which essentially distinguish one type of space from another. The points to remember:

• There are many types of spaces other than Euclidean spaces (yes, there are quite a few Euclidean spaces, even with non-Cartesian coordinates), called (obviously!) non-Euclidean spaces; and

• There are many coordinate systems other than the Cartesian, the non-Cartesian ones.

Thankfully, here we shall work only with real-valued (i.e., the coordinate values are all real numbers) Euclidean space and its generalizations, equipped with Cartesian coordinate systems. Let us have an intuitive understanding of how such a space relates to our object data. Let us, for the time being, consider that the Iris data has only two features, namely, petal length and petal width. If we conceive the petal length as the x value and the petal width as the y value, then we can associate an Iris flower with a point p = (x, y) in the two-dimensional space.

Figure 3: Two features, petal length and petal width, of the Iris data (colors indicate class labels).

If we continue to do the same with all the flowers in the data set, we end up with a plot like Figure 3. This kind of plotting is bread and butter for us, we call them, yes, you already know, the scatter plots. Here two attribute values are mapped along two axes of the Cartesian coordinate system, so that every object, i.e., flower has its own representative point in this Euclidean space.
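As a small illustrative sketch (assuming matplotlib and scikit-learn are available; the column indices follow scikit-learn's feature ordering), a plot like Figure 3 can be produced as follows:

```python
# A small sketch reproducing a plot like Figure 3: each flower becomes a point
# whose coordinates are its petal length (x) and petal width (y).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # third feature: petal length (cm)
petal_width = iris.data[:, 3]    # fourth feature: petal width (cm)

plt.scatter(petal_length, petal_width, c=iris.target)  # colors = class labels
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
```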

Feature space

We typically use scatter plots for visualization of the data. However, by representing objects (as sets of feature values) as points in a Euclidean space we also achieve something very interesting and extremely important. Essentially, we transform, or abstract, our objects into algebraic entities, i.e., points in a Euclidean space. Now we can exploit the properties of Euclidean space to measure similarities/dissimilarities between pairs of objects. Thus, our objects become entities of a very rich mathematical framework and become amenable to study using tools from a number of mathematical disciplines. In data science, and particularly machine learning terminology, we call such a space a feature space. A point in a feature space representing an object described in terms of an ordered set of numerical attribute values, $x_i = (x_{i1}, x_{i2}, \cdots, x_{in}) \in X$, is called a feature vector. The dimension $n$ of a feature space is the same as the number of features used to describe the objects in the data set (note that hereafter $n$ denotes the dimension of the feature space rather than the number of instances). For example, in the Iris data set the flowers are described with four features. Thus, a feature space for the Iris data will have 4 dimensions (i.e., $n = 4$).

Gathering tools and intuitions

To work with the objects in feature space, we need to get our hands on relevant mathematical tools. Also, our physical perception is limited to three spatial dimensions (and of course, one for time). Thus, we can visually perceive the distribution of objects in two dimensions as shown in Figure 3. We can also visualize in 3D, but such visualization needs to take some other factors, including orientation relative to the observer, into consideration. However, most often our data sets have even more features. Thus relying on visual representation for understanding them is quite impossible. Therefore, we have to equip ourselves with some mathematical intuitions.

Taking it easy

To begin with, let us start with a 2D space, where mathematics and visual perception can go hand-in-hand. Let us consider two points in (2D) Euclidean space, $p = (x_1, y_1)$ and $q = (x_2, y_2)$. From our school days we know:

Figure 4: Distance between points p and q.

• The distance between these two points (by applying the Pythagorean theorem as shown in Figure 4) is

$$d(p, q) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2},  \quad (1)$$

known as the Euclidean distance;

• The distance of a point, say p, from the origin of the coordinate system, i.e., (0, 0), is

$$d(p, (0, 0)) = d(p) = \sqrt{x_1^2 + y_1^2};  \quad (2)$$

• From geometry: given three points, they form a unique triangle, and the sum of the lengths of any two sides of the triangle is always greater than or equal to the length of the third side, i.e., given points p, q and r,

$$d(p, q) + d(q, r) \geq d(p, r).  \quad (3)$$

This is known as the triangle inequality.

Clearly, if the points represent objects, the distance between them can be interpreted as a measure of dissimilarity between the corresponding objects. Thus, we are able to calculate dissimilarity, which is the inverse of similarity (more dissimilarity => less similarity and vice-versa). Distance-based similarity is one of the most important and popular of all similarity measures.
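A minimal sketch of these three facts in plain Python follows; the sample points are simply the ones labeled in Figure 2, and the function name is my own:

```python
# A minimal sketch of Equations 1-3: the Euclidean distance between two 2D
# points, the distance from the origin, and a check of the triangle inequality.
import math

def euclidean_2d(p, q):
    """Distance between points p = (x1, y1) and q = (x2, y2), Eqn. 1."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

p, q, r = (2.0, 3.0), (-3.0, 1.0), (-1.5, -2.5)   # the points from Figure 2
print(euclidean_2d(p, q))                          # dissimilarity of p and q
print(euclidean_2d(p, (0.0, 0.0)))                 # distance from origin, Eqn. 2
print(euclidean_2d(p, q) + euclidean_2d(q, r) >= euclidean_2d(p, r))  # Eqn. 3: True
```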

Figure 5: Dot product of vectors $\vec{p}$ and $\vec{q}$.

There is another way of computing similarities directly in Euclidean space, based on the measurement of the angle between the lines from the origin to the points. To understand that, we shall need to re-orient ourselves slightly. We have to think of the points in Euclidean space as vectors with (1) magnitude equal to the distance of the point from the origin; and (2) direction from the origin to the point. Traditionally a vector from point A to B is denoted as $\overrightarrow{AB}$. However, since here the starting point is always the origin, for brevity we denote the vector from the origin to p simply as $\vec{p}$, and thus the point $p = (x_1, y_1)$ can be written as

$$\vec{p} = x_1\hat{i} + y_1\hat{j},$$

where $\hat{i}$ and $\hat{j}$ are unit vectors along the x and y axes respectively. With the vector interpretation of the points, we can apply vector operations on them. The dot product of vectors $\vec{p}$ and $\vec{q}$ is defined as

$$\vec{p} \cdot \vec{q} = x_1 x_2 + y_1 y_2.  \quad (4)$$

Given the above definition of the dot product, the distance of point p from the origin can be expressed as

$$d(p) = \sqrt{x_1^2 + y_1^2} = \sqrt{x_1 x_1 + y_1 y_1} = \sqrt{\vec{p} \cdot \vec{p}} = \|\vec{p}\|,  \quad (5)$$

where $\|\vec{p}\|$ is interpreted as the length of the vector $\vec{p}$ and called its norm. Again, the distance between two points can also be computed as the norm of the difference of the vectors representing the points (remember the parallelogram rule of vector addition/subtraction from school!) as shown below:

$$d(p, q) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} = \sqrt{(x_1 - x_2)(x_1 - x_2) + (y_1 - y_2)(y_1 - y_2)} = \sqrt{(\vec{p} - \vec{q}) \cdot (\vec{p} - \vec{q})} = \|\vec{p} - \vec{q}\|.  \quad (6)$$
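The equivalence in Equations 5 and 6 is easy to check numerically; the following sketch (assuming numpy, with illustrative points) does exactly that:

```python
# A small numpy sketch of Equations 5 and 6: the norm as the square root of a
# dot product, and the distance between two points as the norm of their
# difference vector. The points are illustrative only.
import numpy as np

p = np.array([2.0, 3.0])
q = np.array([-3.0, 1.0])

norm_p = np.sqrt(p.dot(p))             # Eqn. 5: ||p|| = sqrt(p . p)
dist_pq = np.sqrt((p - q).dot(p - q))  # Eqn. 6: d(p, q) = ||p - q||

print(np.isclose(norm_p, np.linalg.norm(p)))       # True
print(np.isclose(dist_pq, np.linalg.norm(p - q)))  # True
```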

Remember that earlier we mentioned that Euclidean space is an algebraic-geometric space. Here the concept of distance comes from geometry, while the concepts of vector, dot product, norm, etc. come from algebra, specifically vector algebra. In Equations 5 and 6 we have shown how geometric distances relate to norms of vectors. (As seen here, the concepts of norm and distance are equivalent. This equivalence has very deep significance: it is what enables us to apply the so-called "kernel trick", making possible the implementation of various kernel-based methods in machine learning.) Finally, we shall show how the geometric concept of the angle between two vectors can be computed algebraically. The dot product of two vectors (Eqn. 4) can also be computed as

$$\vec{p} \cdot \vec{q} = \|\vec{p}\| \|\vec{q}\| \cos\theta,  \quad (7)$$

where θ is the angle between the vectors $\vec{p}$ and $\vec{q}$ (see Figure 5). It is intuitively clear that the more similar two objects are, the smaller the angle θ between their vectors will be, and accordingly the higher

will be the value of the corresponding cos θ. Thus the value of cos θ can be used as a similarity measure between objects, computed as

$$\cos\theta = \frac{\vec{p} \cdot \vec{q}}{\|\vec{p}\| \|\vec{q}\|},  \quad (8)$$

and is called the cosine similarity. The cosine similarity can take values in [-1, 1], corresponding to angles 180° ≥ θ ≥ 0°. Sometimes we would like to restrict it to [0, 1]. This will happen naturally if all the attribute values are positive. Otherwise, we might want to apply a normalization like

$$S(\vec{p}, \vec{q}) = \frac{1}{2}\left(\frac{\vec{p} \cdot \vec{q}}{\|\vec{p}\| \|\vec{q}\|} + 1\right).$$
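A small sketch of Eqn. 8 and of the rescaling above (assuming numpy; the helper names are mine, and the example pairs are the ones discussed in the next paragraph):

```python
# A minimal sketch of cosine similarity (Eqn. 8) and its [0, 1] rescaling.
import numpy as np

def cosine_similarity(p, q):
    return p.dot(q) / (np.linalg.norm(p) * np.linalg.norm(q))

def rescaled_similarity(p, q):
    """Map cosine similarity from [-1, 1] to [0, 1]."""
    return 0.5 * (cosine_similarity(p, q) + 1.0)

a, b = np.array([7.0, 4.0]), np.array([14.0, 8.0])
print(cosine_similarity(a, b))     # 1.0: collinear, maximally similar
print(np.linalg.norm(a - b))       # yet their Euclidean distance is not 0

c, d = np.array([0.0, 5.0]), np.array([5.0, 0.0])
print(cosine_similarity(c, d))     # 0.0: orthogonal vectors
```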

By mapping our objects (with two attributes) to points in Euclidean space, we have discovered two ways of measuring/estimating similarity: (1) as the inverse or opposite of the Euclidean distance and (2) the cosine similarity. However, we must keep in mind that they represent different perspectives on similarity. For example, if the cosine similarity of two objects is 1, that is maximum similarity. But then their Euclidean distance may not be zero. For example, try the objects (7,4) and (14,8). The fact is, if the proportion of each attribute is the same, then they become collinear with angle 0 between them, and their cosine similarity is one irrespective of the difference of their lengths/norms. Also, consider the objects (0,5) and (5,0) - they have cosine similarity 0. (The reason behind this discrepancy lies in the fact that the Euclidean distance is a measure of the minimum length to be covered between two points in the space, while cosine similarity is more a measure of collinearity (or orthogonality) of two vectors.)

Diving deeper

Now, what will happen if the number of attributes is more than 2? We cannot effectively visualize them anymore, but they can still be abstracted as points in higher dimensional spaces, which are generalizations of the Euclidean spaces we just studied. Let us try to understand them. From a mathematical standpoint, feature spaces are vector spaces, more accurately normed vector spaces. Many of them, actually the most interesting ones, fall in the subcategory of inner product spaces. See Figure 6 for the hierarchy of mathematical spaces. Here are some intuitive pointers about the most relevant properties of feature spaces, irrespective of their dimensions:

Figure 6: Hierarchy of mathematical spaces. (Courtesy: Wikipedia)

• The dimensions of the feature spaces we are interested in are always finite (clearly we are not interested in dealing with objects having an infinite number of features);

• All feature spaces are metric spaces. Thus, the distance between any two points is measurable/defined;

• Feature spaces are actually normed vector spaces. Hence, there is a norm associated with every vector, which can be interpreted as the length of the vector. In case the vector is a feature vector (when we play more intensely with the feature space, we shall be working with many other types of vectors, e.g., weight vectors representing a separating plane, other than the feature vectors representing the data set), we can intuitively understand the norm as the distance of the data point from the origin (see Equation 2). Also, note that for a given feature space, we may have the choice of using more than one norm; and

• If the feature space is an inner product space, then there will be a defined structure in the space called the inner product, which can be considered a generalization of the 'dot product' in Euclidean space (Eqn. 4). An inner product always induces a corresponding norm, as shown in Equation 5.

Formally, an inner product space is a vector space with an additional structure called an inner product that associates each pair of vectors with a scalar. Also, an inner product always induces a norm, which assigns a positive value to each vector in the vector space and can be interpreted as the length or size of the vector. However, mind it: while there is always a norm corresponding to (formally, induced by) an inner product, there are many norms which cannot be induced by a valid inner product. (At this point it may seem that we are, and shall continue to be, unnecessarily talking at length about inner products while their only use here is to compute the cosine similarity. However, we are looking forward to more interesting stuff to come later. One of them is kernel methods, among the most sophisticated of ML techniques (yes, that includes SVM). The existence of inner products in relevant spaces is the enabler of these methods.)

Getting the vocabulary right

Working with (high dimensional) generalizations of Euclidean space requires a new, or rather an enhanced, set of vocabulary. For example, with a, say, 50-dimensional space you can no longer refer to the coordinate axes as x, y, z and so on; you will exhaust the alphabet. So, we name the axes of an $n$-dimensional space as $x_1, x_2, \ldots, x_n$. Also, we denote a vector (or point) in such a space as

$$\mathbf{x} = (x_1, x_2, \ldots, x_n),$$

where $x_i$ is the value of the $i$-th component or coordinate. Also, notice that here we denote the vector by boldface $\mathbf{x}$ instead of $\vec{x}$, as done earlier. In mathematical terminology, the set of all real numbers is denoted by $\mathbb{R}$. Thus, the $n$-dimensional inner product space, as a set of real-valued points in $n$ dimensions, is denoted as $\mathbb{R}^n$.

The metric and norm

For our purpose, it will suffice to say that a metric $d$ is a function defined on the set $\mathbb{R}^n$ as

$$d : \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}.$$

Or, in common language, $d$ takes two $n$-dimensional real vectors and produces a scalar, i.e., a real number. Also, for any $\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathbb{R}^n$:

1. d(x, y) ≥ 0, i.e., d is always non-negative;

2. d(x, y) = 0 ⇔ x = y, i.e., x and y are indiscernible (indistinguishable);

3. d(x, y) = d(y, x), i.e., d is a symmetric function; and

4. d(x, z) ≤ d(x, y) + d(y, z) - this is a generalization of the triangle inequality we saw earlier in Eqn. 3.

Similarly, within our context, we can define a norm as a function

$$\| \cdot \| : \mathbb{R}^n \rightarrow \mathbb{R},$$

such that for $\mathbf{x} \in \mathbb{R}^n$ the following hold:

1. $\|\mathbf{x}\| > 0$, unless all components of x are 0; in that case, x is called a zero vector, i.e., a vector of length zero, and is denoted as 0;

2. Multiplication of the vector x with a scalar $\alpha \in \mathbb{R}$ changes the length of the vector, but not its direction, i.e., $\|\alpha\mathbf{x}\| = |\alpha| \, \|\mathbf{x}\|$;

3. Taking norms as distances, the distance from point A to C via point B is never shorter than going directly from A to C. This is (clearly) a variation of the triangle inequality, expressed as

$$\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|, \quad \text{for } \mathbf{x}, \mathbf{y} \in \mathbb{R}^n.$$

Naturally, the Euclidean distance, generalized to $n$ dimensions, is a valid (and in fact one of the most popular) metric in the following form:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}.  \quad (9)$$

We can also define the Euclidean norm as follows:

$$\|\mathbf{x}\| = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}.  \quad (10)$$
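A sketch of Eqns. 9 and 10 for arbitrary n (assuming numpy; the two 4-dimensional vectors are merely illustrative, Iris-like measurements):

```python
# A sketch of the n-dimensional Euclidean distance (Eqn. 9) and norm (Eqn. 10).
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))   # Eqn. 9

def euclidean_norm(x):
    return np.sqrt(np.sum(x ** 2))          # Eqn. 10

x = np.array([5.1, 3.5, 1.4, 0.2])   # illustrative 4-dimensional feature vectors
y = np.array([6.3, 3.3, 6.0, 2.5])
print(euclidean_distance(x, y))
print(euclidean_norm(x))
print(np.isclose(euclidean_distance(x, y), np.linalg.norm(x - y)))  # True
```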

The Inner Product

The inner product of two real $n$-vectors is a scalar. It is a generalization of the 'dot' product in Euclidean space. Thus, we shall use the '·' notation to indicate the inner product in general. Every inner product induces an associated norm. For our purpose it will suffice to understand that they are related in the following way:

$$\|\mathbf{x}\| = \sqrt{\mathbf{x} \cdot \mathbf{x}}.  \quad (11)$$

The above holds for any inner product and the corresponding norm induced by it. Now, in generalized Euclidean space, one of the most popular (and interesting) inner products is the direct generalization of the dot product defined in Eqn. 4, as follows:

$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \ldots + x_n y_n.  \quad (12)$$

We can use Eqn. 11 to induce the corresponding norm as follows:

$$\|\mathbf{x}\| = \sqrt{\mathbf{x} \cdot \mathbf{x}} = \sqrt{\sum_{i=1}^{n} x_i^2},$$

which is (no surprise!) just the Euclidean norm defined in Eqn. 10. Finally, the inner product of x and y in terms of the angle between them is

$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\| \|\mathbf{y}\| \cos\theta  \quad (13)$$

$$\Rightarrow \cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} \quad \text{(the cosine similarity)}.  \quad (14)$$

Computing with elements of a vector space often requires application of the concepts of linear algebra. Thus, it is convenient to express vectors in matrix notation as follows:

$$\mathbf{x} = (x_1, x_2, \ldots, x_n) = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = |x\rangle \text{ (as a column matrix)} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^T = \langle x|^T \text{ (as a transposed row matrix)}.  \quad (15)$$

$\langle x|$ and $|x\rangle$ are known as the bra and ket notation of vectors (dividing the word 'bracket' by removing the 'c' in the middle). They are very popular in quantum mechanics, but are also sometimes used in the ML literature. Using these notations, the inner product is expressed as:

$$\mathbf{x}^T \mathbf{y} = \langle x, y \rangle = \sum_{i=1}^{n} x_i y_i.$$
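A quick numerical sanity check of Eqns. 11 and 12 in matrix notation (assuming numpy; the vectors are arbitrary):

```python
# A small sketch: the inner product as a sum of component-wise products,
# written equivalently as the matrix product x^T y, and the norm it induces.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = np.sum(x * y)      # Eqn. 12: sum_i x_i y_i
inner_as_matmul = x @ y    # the same value, written as the matrix product x^T y
norm_x = np.sqrt(x @ x)    # Eqn. 11: ||x|| = sqrt(x . x)

print(inner, inner_as_matmul)                  # 32.0 32.0
print(np.isclose(norm_x, np.linalg.norm(x)))   # True
```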

More distance measures and norms

We are already familiar with the Euclidean norm. However, this norm is one member (possibly the most prominent, from the ML perspective) of a family of norms and distances called the $L_p$ norms, which is defined for $p \geq 1$, $p \in \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^n$ as

$$\|\mathbf{x}\|_p = (|x_1|^p + |x_2|^p + \cdots + |x_n|^p)^{1/p} = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}.  \quad (16)$$

This is also known as the Minkowski norm/distance. Clearly, this can be used to define any number of distance measures in $\mathbb{R}^n$. The most commonly used members of the family are:

$$L_1: \|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i| \quad \text{(the Manhattan or city-block norm)},  \quad (17)$$

$$L_2: \|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} \quad \text{(the Euclidean norm)},  \quad (18)$$

$$L_\infty: \|\mathbf{x}\|_\infty = \max_i |x_i| \quad \text{(the Chebyshev norm)}.  \quad (19)$$

Figure 7 depicts the unit circles for $L_1$, $L_2$ and $L_\infty$ in 2D space. Observe that not all three are circles as we usually understand them. 'Unit circle' is a term in ML that denotes the contour of the points in space at distance 1 from the origin for a particular distance measure. Naturally, they are different for different distance measures, as we can see in Figure 7.
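The $L_p$ family is directly available via numpy's norm routine; the following sketch (the test vector is arbitrary, the helper function is mine) computes the $L_1$, $L_2$, $L_\infty$ and a general Minkowski norm:

```python
# A sketch of the L_p family (Eqn. 16) and its special cases, using
# numpy.linalg.norm with the `ord` argument.
import numpy as np

x = np.array([3.0, -4.0])

l1 = np.linalg.norm(x, ord=1)         # Manhattan norm: |3| + |-4| = 7
l2 = np.linalg.norm(x, ord=2)         # Euclidean norm: sqrt(9 + 16) = 5
linf = np.linalg.norm(x, ord=np.inf)  # Chebyshev norm: max(|3|, |-4|) = 4

def minkowski_norm(x, p):
    """General L_p norm of Eqn. 16, for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(l1, l2, linf)
print(minkowski_norm(x, 3))           # the L_3 norm lies between L_inf and L_2
```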

Figure 7: Unit circles: contours of unit distance from the origin for $L_1$, $L_2$ and $L_\infty$.

Figure 8: Result of K-means clustering of the Iris data for the features petal length and petal width, using the Euclidean distance (left) and the Manhattan distance (right).

Figure 8 depicts the results of clustering the Iris data using two features. As we can observe, the results are identical. But we must remember that the data set is fairly simple, with only two dimensions and 150 points. In more complicated cases the choice of norm or distance measure may impact the quality of the solution. Readers should experiment with them in algorithms, e.g., K-means, on some complicated/realistic data sets in order to develop a proper understanding. Finally, we used 2D space for visual illustration. However, the reader should develop an intuitive understanding of higher dimensional spaces, where we have to consider unit spheres (in 3D) and hyperspheres (in >3D). As mentioned earlier, the Euclidean distance measure is the most popular of them. It has the advantage of being a quadratic form. As a result, it is an analytic (differentiable everywhere in its domain) function and suitable for use in gradient-search algorithms in ML. Also, we can modify the distance measure by incorporating more information about the data distribution. Here we shall study two such modifications.

Weighted Euclidean distance

Sometimes we want to give different degrees of importance to different dimensions of the feature space, to reflect the importance of the features. The easy way of achieving this is to associate a weight value with each dimension. Thus, for an assignment of weight values $w_i$, $1 \leq i \leq n$, we can compute the weighted Euclidean distance as follows:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}.  \quad (20)$$

It can be easily seen that it can be induced by the inner product

$$\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{n} w_i x_i y_i.  \quad (21)$$

The determination of the $w_i$'s is problem-specific. However, one of the most commonly used weight assignments comes out of data standardization or normalization processes. In real-life problems the spreads of values for different features can often vary substantially; say, one feature ranges over 0-1 while another over 100-1000000. In such cases the value of the second feature dominates the distance computation. This is usually undesirable. The normalization process is used to bring all the features within the same/similar range of values. The validity of such a process lies in the fact that the relative positions of the data points are preserved under linear transformations (i.e., translation, rotation and scaling). In other words, in a feature space S, if

• $d(\mathbf{x}, \mathbf{y}) > d(\mathbf{x}, \mathbf{z})$, then in a space $S'$ obtained by application of any combination of linear transformations of the points in $S$, we will have $d(\mathbf{x}', \mathbf{y}') > d(\mathbf{x}', \mathbf{z}')$; and

• The ratio of distances will remain unchanged, i.e., $\frac{d(\mathbf{x}, \mathbf{y})}{d(\mathbf{x}, \mathbf{z})} = \frac{d(\mathbf{x}', \mathbf{y}')}{d(\mathbf{x}', \mathbf{z}')}$.

One of the popular normalization processes involves applying the transformation

$$x_i' = \frac{x_i - \bar{x}_i}{s_i}, \quad 1 \leq i \leq n,$$

where $\bar{x}_i$ is the mean value of the $i$-th feature and $s_i$ is its standard deviation. This transformation makes all features have mean 0 and standard deviation 1. It is often referred to as the Z-transformation, and the transformed values are called Z-values or Z-scores. This normalization can be achieved computationally by simply using the weighted Euclidean distance with $w_i = \frac{1}{s_i^2}$. (Reader, please check it.)
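The check suggested above can be done numerically; here is one hedged sketch on synthetic data (assuming numpy; the toy ranges are made up):

```python
# A quick numerical check: the Euclidean distance between z-scored vectors
# equals the weighted Euclidean distance of Eqn. 20 with w_i = 1 / s_i^2
# computed on the raw vectors.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 500.0], scale=[1.0, 200.0], size=(100, 2))  # toy data

mean, std = X.mean(axis=0), X.std(axis=0)
Z = (X - mean) / std                        # z-scores

def weighted_euclidean(x, y, w):
    return np.sqrt(np.sum(w * (x - y) ** 2))   # Eqn. 20

w = 1.0 / std ** 2
d_weighted = weighted_euclidean(X[0], X[1], w)
d_zscore = np.linalg.norm(Z[0] - Z[1])

print(np.isclose(d_weighted, d_zscore))     # True
```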

Mahalanobis distance

In normalization using the Z-transform one uses only the squared standard deviations, i.e., the variances. Variances are the diagonal values of the covariance matrix of the data. If we want to take into account the covariances of the data as well, we shall need to incorporate the off-diagonal elements of the covariance matrix Σ into the distance calculation. Doing so gives us the Mahalanobis distance

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{[\mathbf{x} - \mathbf{y}]^T \Sigma^{-1} [\mathbf{x} - \mathbf{y}]},  \quad (22)$$

induced by the inner product

$$\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \Sigma^{-1} \mathbf{y}.  \quad (23)$$

It can be easily seen that the weighted Euclidean distance with $w_i = \frac{1}{s_i^2}$ is a special case of the Mahalanobis distance where the covariance matrix is diagonal. Figure 9 depicts the contours of points equidistant from the origin for the special and general Mahalanobis distances. It can be observed that in both cases the contours are elliptical, with their half-axes equal to the standard deviations $s_i$. For the

Figure 9: Unit circles for the special and general Mahalanobis distances.

special one, the axes of the ellipses are parallel to the coordinate axes. However, the general one does not have this restriction; it can be oriented in any direction, with axes aligning with the principal axes (or eigenvectors) of the covariance matrix Σ.
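A minimal sketch of Eqn. 22 on synthetic correlated data follows (assuming numpy; the numbers are illustrative, and scipy's scipy.spatial.distance.mahalanobis, which takes the inverse covariance as an argument, could be used instead):

```python
# A sketch of the Mahalanobis distance (Eqn. 22) using an estimated
# covariance matrix of synthetic, correlated 2D data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[4.0, 3.0], [3.0, 9.0]], size=500)  # toy data

cov = np.cov(X, rowvar=False)        # estimated covariance matrix Sigma
cov_inv = np.linalg.inv(cov)

def mahalanobis(x, y, cov_inv):
    diff = x - y
    return np.sqrt(diff @ cov_inv @ diff)   # Eqn. 22

print(mahalanobis(X[0], X[1], cov_inv))
print(np.linalg.norm(X[0] - X[1]))          # compare with the plain Euclidean distance
```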

Looks alike, but not quite ...

There are situations where the data we work with has the appearance of feature vectors. However, whether they really are feature vectors may not be very clear. One of the pointers may be whether each component of a feature vector represents an attribute of the object. Even so, it might be difficult to determine, and thus so is the applicability of the feature space formalism.

Keep doing reality check

The idea of feature space is an extremely powerful one. But we should not be blinded by it and shoehorn anything that looks like a numerical vector into the feature space framework. Yes, you can do the computations, but the results must be meaningful in the context of the problem at hand. We should not get into situations like 5 bananas + 7 houses = 12 ??! We should always remember that the mathematical semantics must match the problem semantics in the real world. To understand this, let us study an example.

Example case

Rating data captures users'/consumers' likes/dislikes of items in terms of the ratings they give to the items. We are quite familiar with them from online retail stores, video subscriptions, product reviews, etc. Each of these ratings can be expressed as a 3-tuple (user, item, rating). Rating data sets are used for building collaborative-filtering based recommender systems. Here we are not interested in recommender systems per se; there is enough material, including the one linked above, available on the topic. However, we are interested in the computation of similarity between users based on their likes/dislikes of items, and/or between items based on their attractiveness to the users.

These similarities are at the core of the working of recommender systems. But what kind of similarity measure should we use here? Let us have a deeper look in order to find out.

Working with rating data

Let U, I and R be the sets of users, items and ratings respectively. We shall denote the rating given by user u ∈ U for item i ∈ I as $r_{ui} \in R$. Then the full sets of ratings given by users u and v can be expressed as the sets

$$r_u = \{r_{ui} \mid r_{ui} \in R\} \quad \text{and} \quad r_v = \{r_{vi} \mid r_{vi} \in R\}$$

respectively. Since the rating values are usually from a numeric rating scale, at first glance we can impose an order on the elements of the $r_u$'s and treat them as numerical feature vectors with components comprised of the ratings for the items. Voila! We can use the Euclidean or some other distance/similarity measure discussed earlier. But wait... not all users rate all items (consider a big e-tail portal: they have tens of thousands of products; how many of them might a particular customer buy and/or review/rate?), nor do they rate the same set of items. So, how do we assign dimensions of the feature space to represent all the $r_u$'s? It may not sound like a big problem: we can have a feature space with dimension equal to the total number of items, i.e., |I| (that may be a few tens of thousands, which has its accompanying problem called the curse of dimensionality, but let us ignore that for the time being) and pad the unavailable ratings in the feature vector with a special value. For example, if the rating scale is 1-5, we put 0 for the items for which the user has not given a rating. Well and good, the computational problem is solved. Now, let us see what that means mathematically and in reality. Mathematically, 0 is just a number among the real numbers (i.e., −0.00...01 < 0 < 0.00...01), with no special meaning other than that. So, it will be treated on the same footing as the other valid rating values. In reality, the user may not even have known that such an item exists in the universe! So it is not an accurate (rather, far from accurate) representation of the user's likes/dislikes, which is what we are interested in. If we go to the other extreme and create the feature space only for items rated by all users, there may not be any, or at most very few, such items. That again defeats the purpose thoroughly. We can take a middle ground by considering pairs of users at a time. We can compute the Euclidean distance between them in the feature space representing ratings for the items they both rated. In that case comparisons like d(u, v) > d(u, w) are meaningless unless both pairs of users have the same common set of rated items. In other words, to be meaningful they have to be computed in the same feature space. But that is not likely to happen in general. Thus, we can see that the use of the feature space itself as a computational framework may not be valid in this case. So, what do we do? There are many other similarity measures not based on the idea of feature space. We have to explore them.
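To make the discussion concrete, here is a hedged sketch with made-up ratings (the user/item names are hypothetical) contrasting zero-padding over all items with restricting the comparison to co-rated items:

```python
# A sketch of the issue discussed above. Zero-padding a full item vector treats
# "never seen" as the lowest possible opinion, whereas restricting the
# comparison to co-rated items keeps the semantics honest, at the cost of
# distances living in different feature spaces for different user pairs.
import numpy as np

ratings_u = {"i1": 5, "i2": 3, "i4": 4}           # hypothetical users and items
ratings_v = {"i1": 4, "i3": 2, "i4": 5}

all_items = sorted(set(ratings_u) | set(ratings_v))
u_padded = np.array([ratings_u.get(i, 0) for i in all_items], dtype=float)
v_padded = np.array([ratings_v.get(i, 0) for i in all_items], dtype=float)
print(np.linalg.norm(u_padded - v_padded))         # distance dominated by the padding

common = sorted(set(ratings_u) & set(ratings_v))   # items rated by both users
u_common = np.array([ratings_u[i] for i in common], dtype=float)
v_common = np.array([ratings_v[i] for i in common], dtype=float)
print(common, np.linalg.norm(u_common - v_common))
```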

Understanding what we want

To choose a suitable similarity measure, we have to understand the true nature of the similarity we are looking for. Here, we have to ask ourselves what we really mean by "similar customers" in the present problem context. We are not interested in their physical similarity or similarity in academic achievement! We are interested in the similarity of their likes/dislikes over a set of items, and need to compute a meaningful aggregate measure of that similarity over the set. For example, two users' likes are similar over a set of items {i, j, k} if both like i more than j but a little less than k. We can visualize the ratings given by a user in a graph like Figure 10. It is intuitively clear that this user's ratings will be similar to the ratings of another user whose rating curve shows a similar pattern of variation over the same item set. So, if we can find means of measuring the similarity of this variation pattern, we can use that as a similarity measure. Clearly, here we are looking for some kind of measure of correlation between the variation patterns.

Figure 10: Ratings given by a user for items 1-6.

Correlation based similarity measures

In order to apply correlation based techniques we can consider the user ratings ru and rv as sequences of values corresponding to the ratings given by them for a set of common items.

Pearson product-moment correlation coefficient

The Pearson coefficient is a measure of linear dependence between two variables. Given the sequences of values of two variables $X = \{x_1, x_2, ..., x_n\}$ and $Y = \{y_1, y_2, ..., y_n\}$, the Pearson correlation coefficient is defined

Figure 11: Two rating sequences (users u and v) with Pearson correlation coefficient r = 1.0.

as:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},  \quad (24)$$

where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y respectively. r takes a value between -1 and +1 (inclusive), with +1 and -1 indicating total positive and negative correlation respectively, while 0 signifies no correlation. A slightly easier to implement formula is

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}}.  \quad (25)$$
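A sketch of Eqn. 24 for two rating sequences over the same items (assuming numpy; the ratings are made up, and scipy.stats.pearsonr would give the same value):

```python
# A minimal sketch of the Pearson correlation coefficient (Eqn. 24) applied to
# two hypothetical rating sequences over the same six items.
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / (np.sqrt(np.sum(xc ** 2)) * np.sqrt(np.sum(yc ** 2)))

r_u = [4, 2, 5, 3, 1, 4]        # hypothetical ratings of user u for items 1-6
r_v = [5, 3, 5, 4, 2, 5]        # user v follows a similar up/down pattern

print(pearson(r_u, r_v))        # close to 1: very similar tastes
print(pearson(r_u, r_v[::-1]))  # reversing the pattern drops the correlation
```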

It can be easily seen that the similarity of the ratings given by users u and v can be measured by computing the Pearson correlation coefficient between the sequences $r_u$ and $r_v$ for the items common to them. Figure 11 depicts a case of two rating sequences for which r = 1, which can be readily interpreted as full similarity according to the sense we have developed above. (Dear reader, I would like to draw your attention to Equations 8 and 24. Do you observe any similarity between them? Well, as a clue, think about the cosine similarity of the vectors $(\mathbf{x} - \bar{x}\mathbf{1})$ and $(\mathbf{y} - \bar{y}\mathbf{1})$, where $\mathbf{1}$ is a unit column vector. This is known as the adjusted or augmented cosine similarity between x and y. Please work it out and see.)

Spearman rank correlation coefficient

The Spearman rank correlation coefficient is the same as the Pearson correlation coefficient between two ranked variables. In other words, instead of computing with the actual values of the variables, the ranks of the respective variables are used for computing the Pearson coefficient. A ranking of the items in a set is an ordering among the items which allows us to pairwise compare any two items in order to determine

whether one is ranked higher or lower than, or equal to, the other. For a set of numerical values, one of the commonest rankings can be their sorting order, with tied positions equalized. Thus, the correlation between two sequences of numerical values x and y is computed by first converting x and y into the corresponding sequences of ranking values $r^x$ and $r^y$ and then computing the Spearman coefficient ρ as:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},  \quad (26)$$

where $d_i = r^x_i - r^y_i$ is the difference between the ranks of the $i$-th element in the two sequences.

More stuff to look into

Some other useful similarity measures

There are many other kinds of objects which we often need to compare and for which we need similarity measures. Many of them cannot even have any meaningful numerical representation. Here are a few examples:

• The similarity between sets can be computed using Jaccard index and Tanimoto similarity;

• Distance between binary strings is computed using the Hamming distance;

• Distance between alphabet strings can be computed using the Levenshtein distance.

With the advent of kernel-based methods, particularly the "kernel trick", even objects of non-vectorial nature, e.g., graphs, sequence data, images, text, etc., can be brought into the purview of methods primarily invented for numerical vector data. I leave it to your own interest and effort to look into these.

Further reading on some basic but must-know stuff

• Origin (probable) of the term “Data Science"

• A Few Useful Things to Know About Machine Learning (Domingos, 2012)

• The key word in "Data Science" is not Data, it is Science

• Data Science and Prediction (Dhar, 2013)

Bibliography

Vasant Dhar. Data science and prediction. Commun. ACM, 56(12): 64–73, December 2013. ISSN 0001-0782. doi: 10.1145/2500499. URL http://doi.acm.org/10.1145/2500499.

Pedro Domingos. A few useful things to know about machine learning. Commun. ACM, 55(10):78–87, October 2012. ISSN 0001-0782. doi: 10.1145/2347736.2347755. URL http://doi.acm.org/10.1145/2347736.2347755.

Ciao!