Clustering, CS503 - Spring 2019, Narayanan C Krishnan ([email protected])

Supervised vs Unsupervised Learning
• Supervised learning – given $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$
  • Categorical output – classification
  • Continuous output – regression
• Unsupervised learning – given $\{\mathbf{x}_i\}_{i=1}^{N}$ alone, can we infer the structure of the data?
• Learning without a teacher

Why Unsupervised Learning?
• Unlabeled data is cheap
• Labeled data is expensive – cumbersome to collect
• Exploratory data analysis
• Preprocessing step for supervised learning algorithms
• Analysis of data in high-dimensional spaces

Cluster Analysis
• Discover groups such that samples within a group are more similar to each other than samples across groups

Applications of Clustering (1)
• Unsupervised image segmentation

Components of Clustering
• A dissimilarity (similarity) function – measures the distance/dissimilarity between examples
• A loss function – evaluates the clusters
• An algorithm that optimizes this loss function

Proximity Matrices
• The data is represented directly in terms of the proximities between pairs of objects
• Subjectively judged dissimilarities are seldom distances in the strict sense (they need not satisfy the properties of a distance measure)
• A common fix is to replace the proximity matrix $D$ by $(D + D^{T})/2$

Dissimilarity Based on Attributes (1)
• Data point $\mathbf{x}_i$ has $p$ real-valued features
• Euclidean distance between data points:
  $d(\mathbf{x}_i, \mathbf{x}_{i'}) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{i'j})^2}$
• The resulting clusters are invariant to rotation and translation, but not to scaling
• If features have different scales, standardize the data first

Dissimilarity Based on Attributes (2)
• Data point $\mathbf{x}_i$ has $p$ real-valued features
• Any $\mathcal{L}_q$ norm:
  $d(\mathbf{x}_i, \mathbf{x}_{i'}) = \left( \sum_{j=1}^{p} |x_{ij} - x_{i'j}|^{q} \right)^{1/q}$
• Cosine distance between data points:
  $d(\mathbf{x}_i, \mathbf{x}_{i'}) = \dfrac{\sum_{j=1}^{p} x_{ij}\, x_{i'j}}{\sqrt{\sum_{j=1}^{p} x_{ij}^2}\, \sqrt{\sum_{j=1}^{p} x_{i'j}^2}}$

Dissimilarity Based on Attributes (3)
• Data point $\mathbf{x}_i$ has $p$ features; attributes are ordinal
  • Grades – A, B, C, D
  • Answers to a survey question – strongly agree, agree, neutral, disagree
• Replace the $M$ ordinal values by the quantitative representations
  $\dfrac{i - 1/2}{M}, \quad i = 1, \ldots, M$

Dissimilarity Based on Attributes (4)
• Data point $\mathbf{x}_i$ has $p$ features; attributes are categorical, so the values of an attribute are unordered
• Define an explicit difference between each pair of values:
  $\begin{pmatrix} L_{11} & \cdots & L_{1M} \\ \vdots & \ddots & \vdots \\ L_{M1} & \cdots & L_{MM} \end{pmatrix}$
• Often $L_{rr'} = 0$ for identical values ($r = r'$) and $L_{rr'} = 1$ for different values ($r \neq r'$)

Loss Function for Clustering (1)
• Assign each observation to a cluster without regard to a probability model describing the data
• Let $K$ be the number of clusters and $k$ index the clusters
• Each observation is assigned to one and only one cluster
• View the assignment as a function $C(i) = k$
• Loss function (sketched in code below):
  $W(C) = \dfrac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(\mathbf{x}_i, \mathbf{x}_{i'})$
• It characterizes the extent to which observations assigned to the same cluster tend to be close to one another
• The within-cluster distance/scatter
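To make the loss concrete, here is a small NumPy sketch (the function names and the toy data are illustrative choices, not code from the course) that evaluates the within-cluster scatter $W(C)$ for a given assignment and a given dissimilarity function:

```python
import numpy as np

def squared_euclidean(xi, xj):
    """Squared Euclidean dissimilarity between two feature vectors."""
    diff = xi - xj
    return float(np.dot(diff, diff))

def within_cluster_scatter(X, assignments, dissimilarity=squared_euclidean):
    """W(C) = 1/2 * sum_k sum_{C(i)=k} sum_{C(i')=k} d(x_i, x_i')."""
    W = 0.0
    for k in np.unique(assignments):
        members = X[assignments == k]
        for xi in members:
            for xj in members:          # double-counts each pair, hence the 1/2 factor
                W += dissimilarity(xi, xj)
    return 0.5 * W

# Toy example: two well-separated blobs with the matching assignment.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(5, 2)), rng.normal(5.0, 0.5, size=(5, 2))])
C = np.array([0] * 5 + [1] * 5)
print(within_cluster_scatter(X, C))
```

Swapping `squared_euclidean` for any other dissimilarity (an $\mathcal{L}_q$ norm, a categorical difference matrix) leaves the loss computation untouched, which reflects the separation between dissimilarity, loss, and algorithm described in the Components of Clustering slide above.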
Loss Function for Clustering (2)
• Consider the total point scatter
  $T = \dfrac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} d_{ii'}$
• This can be split as
  $T = \dfrac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \left( \sum_{C(i')=k} d_{ii'} + \sum_{C(i') \neq k} d_{ii'} \right)$
  so that $T = W(C) + B(C)$

Loss Function for Clustering (3)
• The second term is
  $B(C) = \dfrac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i') \neq k} d_{ii'}$
• The between-cluster distance/scatter
• Since $T$ is constant for a given data set, minimizing $W(C)$ is equivalent to maximizing $B(C)$

Combinatorial Clustering
• Minimize $W$ over all possible assignments of $N$ data points to $K$ clusters
• Unfortunately this is feasible only for very small data sets
• The number of distinct assignments is
  $S(N, K) = \dfrac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^{N}$
• $S(10, 4) = 34{,}105$, while $S(19, 4) \approx 10^{10}$
• Exhaustive enumeration is therefore not a practical clustering algorithm

K-Means Clustering (1)
• The most popular iterative descent clustering method
• Suppose all variables/features are real-valued and we use squared Euclidean distance as the dissimilarity measure:
  $d(\mathbf{x}_i, \mathbf{x}_{i'}) = \|\mathbf{x}_i - \mathbf{x}_{i'}\|^{2}$
• The within-cluster scatter can then be written as
  $W(C) = \dfrac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \|\mathbf{x}_i - \mathbf{x}_{i'}\|^{2} = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^{2}$

K-Means Clustering (2)
• Find
  $C^{*} = \min_{C} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^{2}$
• Note that for any set $S$ of points
  $\bar{\mathbf{x}}_S = \operatorname{argmin}_{\mathbf{m}} \sum_{i \in S} \|\mathbf{x}_i - \mathbf{m}\|^{2}$
• So we can equivalently find
  $C^{*} = \min_{C,\, \{\mathbf{m}_k\}_{k=1}^{K}} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|\mathbf{x}_i - \mathbf{m}_k\|^{2}$

K-Means Clustering (3)
• Find the "optimal" solution using Expectation Maximization
• An iterative procedure consisting of two steps (see the sketch below):
  • Expectation step (E step) – fix the mean vectors $\{\mathbf{m}_k\}_{k=1}^{K}$ and find the optimal assignment $C^{*}$
  • Maximization step (M step) – fix the cluster assignments $C$ and find the optimal mean vectors $\{\mathbf{m}_k\}_{k=1}^{K}$
• Each step of this procedure reduces the loss function value
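A minimal sketch of this alternation, assuming squared Euclidean distance and initialization with randomly chosen data points (the `kmeans` function and its toy usage are my own illustration, not code from the course):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate the assignment (E) step and the mean-update (M) step."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]    # initialize with K data points
    assignments = np.full(len(X), -1)
    for _ in range(n_iters):
        # E step: assign each point to its nearest mean under squared Euclidean distance.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # shape (N, K)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break                                           # assignments stable: converged
        assignments = new_assignments
        # M step: recompute each mean as the centroid of its assigned points.
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:        # a fuller implementation would handle empty clusters
                means[k] = members.mean(axis=0)
    return assignments, means

# Toy usage on two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(6.0, 1.0, size=(50, 2))])
labels, centers = kmeans(X, K=2)
```

Each E step can only lower the loss for fixed means, and each M step can only lower it for fixed assignments, so the loop stops at a local minimum; different initializations can end in different local minima, which is one of the limitations listed below.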
K-Means Clustering Illustration (1)-(10)
• [Figure slides: blue point – Expectation step; red point – Maximization step]

How to Choose K?
• Similar to choosing $k$ in kNN
• The loss function generally decreases as $K$ grows, so it cannot be used on its own to pick $K$

Limitations of K-Means Clustering
• Hard assignments are susceptible to noise/outliers
• Assumes spherical (convex) clusters with a uniform prior on the clusters
• Clusters can change arbitrarily for different $K$ and different initializations

K-Medoids
• K-Means is suitable only when using Euclidean distance, is susceptible to outliers, and runs into trouble when the centroid of a cluster is not a valid data point
• K-Medoids generalizes K-Means to arbitrary distance measures: replace the mean calculation by a medoid calculation
• This ensures the cluster representative is a medoid – always a valid data point
• It increases computation, since we now have to find the medoid of each cluster

Soft K-Means as Gaussian Mixture Models (1)
• Probabilistic clusters
• Each cluster is associated with a Gaussian distribution $\mathcal{N}(\mu_k, \Sigma_k)$
• Each cluster also has a prior probability $\pi_k$
• The likelihood of a data point drawn from the $K$ clusters is then
  $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, where $\sum_{k=1}^{K} \pi_k = 1$

Soft K-Means as Gaussian Mixture Models (2)-(4)
• Given $N$ iid data points, the likelihood function is
  $p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Taking the log gives the log-likelihood
  $\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$

Soft K-Means as Gaussian Mixture Models (5)-(8)
• Latent variables: each data point $x_i$ is associated with a latent variable $z_i = (z_{i1}, \ldots, z_{iK})$,
  where $z_{ik} \in \{0, 1\}$, $\sum_{k=1}^{K} z_{ik} = 1$ and $p(z_{ik} = 1) = \pi_k$
• Given the complete data $X, Z$, we look at maximizing $p(X, Z \mid \pi_k, \mu_k, \Sigma_k)$
• Let the probability $p(z_{ik} = 1 \mid x_i)$ be denoted by $\gamma(z_{ik})$
• From Bayes' theorem,
  $\gamma(z_{ik}) = p(z_{ik} = 1 \mid x_i) = \dfrac{p(z_{ik} = 1)\, p(x_i \mid z_{ik} = 1)}{p(x_i)}$
• The marginal distribution is
  $p(x_i) = \sum_{z_i} p(x_i, z_i) = \sum_{k=1}^{K} p(z_{ik} = 1)\, p(x_i \mid z_{ik} = 1)$

Soft K-Means as Gaussian Mixture Models (9)
• Now $p(z_{ik} = 1) = \pi_k$ and $p(x_i \mid z_{ik} = 1) = \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Therefore
  $\gamma(z_{ik}) = p(z_{ik} = 1 \mid x_i) = \dfrac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$

Estimating the Mean $\mu_k$ (1)
• Begin with the log-likelihood function
  $\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Take the derivative with respect to $\mu_k$ and equate it to zero
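Setting this derivative to zero yields the standard EM mean update $\mu_k = \sum_i \gamma(z_{ik})\, x_i \,/\, \sum_i \gamma(z_{ik})$. As a closing illustration, here is a hedged NumPy/SciPy sketch of the responsibility computation $\gamma(z_{ik})$ derived above together with that mean update (the function names, the toy parameters, and the use of `scipy.stats.multivariate_normal` for the Gaussian density are my own choices, not material from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """E step: gamma(z_ik) = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)."""
    N, K = len(X), len(pis)
    gamma = np.zeros((N, K))
    for k in range(K):
        gamma[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # divide by the marginal p(x_i)
    return gamma

def update_means(X, gamma):
    """M-step mean update: mu_k = sum_i gamma(z_ik) x_i / sum_i gamma(z_ik)."""
    Nk = gamma.sum(axis=0)                      # effective number of points per component
    return (gamma.T @ X) / Nk[:, None]

# Toy usage with two components and guessed parameters (illustrative only).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)), rng.normal(4.0, 1.0, size=(30, 2))])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])
gamma = responsibilities(X, pis, mus, Sigmas)
new_mus = update_means(X, gamma)
```

Iterating these two steps (together with the analogous updates for $\pi_k$ and $\Sigma_k$) gives the soft counterpart of the hard E/M alternation used in K-means.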