Lecture 16-18

• Self-organized maps (SOM)
• Visualizing clustering results
• Clustering large databases
– BIRCH
– DBSCAN
– CLARANS

SOM Structure

• One-layer neural network.
• Neurons are usually arranged in a 2D grid.
• Neurons are completely connected to each other (connection weight is inversely proportional to distance).
• The number of inputs into each neuron is the number of attributes.
• A neuron performs no computation but keeps a model/memory for each attribute.

SOM - Functionality

• Input: a series of instances.
• Processing: each neuron competes to represent the instance.
• Output: a map that compresses the data and, for each neuron, a list of the instances it won.

Color Distance Map - 1

http://www-ece.rice.edu/~erzsebet/AIS99wsh/AISwshop99.html

Mapping

SOMs are a mapping from a higher- to a lower-dimensional space.

Other Types of Dimension Reduction

• Multi-dimensional scaling (MDS)

• Chernoff faces

Mapping that SOM Performs

• The SOM mapping is complex. We can say:
– It can be surjective.
– It is not injective and therefore not bijective.
– It preserves some topological features.
• "The property of topology preserving means that the mapping preserves the relative distance between the points. Points that are near each other in the input space are mapped to nearby map units in the SOM."

Training a SOM - 1

• Step 0: Initialize the models/memories/weights.
• Step 1: Provide an instance to each neuron.
• Step 2: Determine the neuron that best represents the instance (the winner).
• Step 3: Calculate the winner's neighborhood.
• Step 4: The neurons in the neighborhood adjust their models/memories.
The size of the neighborhood and the size of the adjustment decrease with time.

Training a SOM - 2
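The training steps above can be sketched in code. This is a minimal illustration, not the algorithm from the slides: the Gaussian neighborhood, the linear decay schedules, the 1-D grid, and all parameter values are assumptions.

```python
import numpy as np

def train_som(data, n_neurons=4, n_epochs=50, lr0=0.5, radius0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    # Step 0: initialize model/memory/weights (random; the slides do not say how)
    models = rng.random((n_neurons, data.shape[1]))
    for epoch in range(n_epochs):
        frac = epoch / n_epochs
        lr = lr0 * (1 - frac)                    # adjustment decreases with time
        radius = max(radius0 * (1 - frac), 0.5)  # neighborhood shrinks with time
        for x in data:                           # Step 1: present an instance
            # Step 2: the winner has the model closest in Euclidean distance
            c = np.argmin(np.linalg.norm(models - x, axis=1))
            # Step 3: neighborhood strength falls off with grid distance to c
            grid_dist = np.abs(np.arange(n_neurons) - c)
            h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            # Step 4: each neuron adjusts its model toward the instance
            models += lr * h[:, None] * (x - models)
    return models
```

With two well-separated groups of instances, neighboring neurons end up representing nearby instances, which is the topology preservation discussed earlier.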

The neuron that wins, c, has the model that is closest to the instance in Euclidean distance.

The neighborhood of a neuron depends on time and spatial location.

A neuron's model is updated depending on its distance to the winner and on the instance.
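The three equations referenced here were images that did not survive extraction. In the standard Kohonen formulation consistent with the surrounding captions (a reconstruction, not the original slides), they are:

```latex
% Winner c: the neuron whose model is closest to instance x(t)
c = \arg\min_i \lVert x(t) - m_i(t) \rVert

% Neighborhood function: depends on time and on grid locations r_c, r_i
h_{ci}(t) = \alpha(t)\,\exp\!\left(-\frac{\lVert r_c - r_i \rVert^2}{2\sigma^2(t)}\right)

% Model update: each neuron moves toward the instance, scaled by h_{ci}(t)
m_i(t+1) = m_i(t) + h_{ci}(t)\,\bigl[x(t) - m_i(t)\bigr]
```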

α(t) is the learning rate, which decreases monotonically with time t. r_i and r_c are the grid locations of neuron i and of the winner c, and || r_i – r_c || is the Euclidean distance between them.

Properties of SOM

• Takes a long time to train.
– Complexity can be O(ns²), where s is the number of neurons and n is the number of instances.
– Use batch training:
• Parse all the data.
• Collect the list of instances won by each neuron.
• For each neuron, calculate the centroid of its list.
• Update the neuron's memory/model using the centroid.
• Sensitive to the initial models.

Using a SOM
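The batch-training variant above can be sketched as follows (a minimal illustration; the function and variable names are my own):

```python
import numpy as np

def batch_epoch(models, data):
    # Parse all the data: find the winning neuron for every instance at once
    dists = np.linalg.norm(data[:, None, :] - models[None, :, :], axis=2)
    winners = np.argmin(dists, axis=1)
    new_models = models.copy()
    for j in range(len(models)):
        won = data[winners == j]              # list of instances won by neuron j
        if len(won) > 0:
            new_models[j] = won.mean(axis=0)  # update model with the centroid
    return new_models
```

Because every instance is processed before any model changes, the result no longer depends on presentation order, and each epoch is a single scan of the data.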

• For each neuron, determine its centroid.
• Need to determine k (user-discernable).
• Determining the quality of a SOM:
– Construct a B/W distance map.
– For each neuron, calculate the mean distance between its model/memory and those of its neighbors.
– Map a distance of 0 to white and a (normalized) distance of 1 to black.

Black and White Distance Map - 1

SOM – Example (1)
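The B/W distance-map construction above can be sketched as follows, assuming the neuron models are stored in a (rows, cols, attributes) array; 0 maps to white and 1 to black:

```python
import numpy as np

def distance_map(models_grid):
    """Mean distance between each neuron's model and its grid neighbors',
    normalized to [0, 1] (0 -> white, 1 -> black)."""
    rows, cols, _ = models_grid.shape
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            # 4-connected grid neighbors (an assumption; hexagonal grids differ)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(models_grid[r, c] - models_grid[rr, cc]))
            out[r, c] = np.mean(dists)
    m = out.max()
    return out / m if m > 0 else out
```

Dark bands on the map separate groups of similar neurons, which is how the number of clusters k becomes user-discernable.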

Map Nine Points From 2D to 1D (4 Neurons)

Point #   x          y
1         0.411678   0.812056
2         0.688305   0.077209
3         0.899551   0.553964
4         1.563111   0.781096
5         1.142188   0.479861
6         3.423893   2.60615
7         3.994179   2.268564
8         2.932121   3.846707
9         3.481125   3.739238

[Scatter plot of the nine points: points 1–5 cluster near the origin, points 6–9 near (3.5, 3).]

SOM – Example (2)

The winner c has the model with the shortest Euclidean distance to the instance.

m_c ← m_c + (i − m_c) / 2
m_c+1 ← m_c+1 + (i − m_c+1) / 4
m_c−1 ← m_c−1 + (i − m_c−1) / 4

The winner moves halfway toward the instance i; its immediate neighbors move a quarter of the way. No wrapping at the boundaries.

[Worked-example tables (SOM – Example (2) and (3)): for each presented instance, the four node models, the Euclidean distances from each model to the instance, and the updated models. The numeric rows were damaged in extraction and are omitted.]
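The update rule of the worked example (winner moves halfway toward the instance, immediate neighbors a quarter of the way, no wrapping) can be replayed in code. This is an illustrative sketch; the slide's own initial model values did not survive extraction.

```python
import numpy as np

def present(models, inst):
    """One presentation step of the worked example: winner moves halfway
    toward the instance, immediate neighbors a quarter of the way."""
    c = int(np.argmin(np.linalg.norm(models - inst, axis=1)))  # winner
    models[c] += (inst - models[c]) / 2
    for n in (c - 1, c + 1):             # no wrapping at the boundaries
        if 0 <= n < len(models):
            models[n] += (inst - models[n]) / 4
    return c
```

Presenting the nine points of the example to four nodes in order, one epoch of such steps produces the assignment summarized below.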

After 1 epoch: nodes 1 and 2 have won instances 1 through 5; node 4 has won instances 6 through 9. Relative distances are "preserved".

Clustering Large Databases

• Most clustering algorithms assume a large data structure which is memory resident.
• Clustering may be performed first on a sample of the database and then applied to the entire database.
• Algorithms
– BIRCH
– DBSCAN
– CURE

Desired Features for Large Databases

• One scan (or less) of the DB
• Online
• Suspendable, stoppable, resumable
• Incremental
• Works with limited main memory
• Different techniques to scan (e.g. )
• Process each tuple once

BIRCH

• Balanced Iterative Reducing and Clustering using Hierarchies
• Incremental, hierarchical, one scan
• Saves clustering information in a tree
• Each entry in the tree contains information about one cluster
• New nodes are inserted in the closest entry in the tree

Clustering Feature

• CF triple: (N, LS, SS)
– N: number of points in the cluster
– LS: sum of the points in the cluster
– SS: sum of the squares of the points in the cluster
• CF tree
– Balanced search tree
– A node has a CF triple for each child
– A leaf node represents a cluster and has a CF value for each subcluster in it
– A subcluster has a maximum diameter

BIRCH Algorithm

Improve Clusters
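The CF triple (N, LS, SS) described above supports BIRCH's incremental, one-scan behavior because CFs are additive. A minimal sketch (the CF-tree insertion and node-splitting logic is omitted):

```python
import numpy as np

class CF:
    """BIRCH clustering feature for one (sub)cluster: (N, LS, SS)."""
    def __init__(self, dim):
        self.n = 0                 # N: number of points
        self.ls = np.zeros(dim)    # LS: sum of the points
        self.ss = 0.0              # SS: sum of squared norms of the points
    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += float(np.dot(x, x))
    def merge(self, other):
        # Additivity: merging two subclusters is component-wise addition
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss
    def centroid(self):
        return self.ls / self.n
    def radius(self):
        # Root-mean-square distance of the points from the centroid,
        # computable from the triple alone (no need to store the points)
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - np.dot(c, c), 0.0)))
```

The maximum-diameter test used to decide whether a point fits in an existing leaf subcluster can likewise be computed entirely from the stored triples.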

DBSCAN

• Density-Based Spatial Clustering of Applications with Noise
• Noise will not affect the creation of clusters.
• Input
– MinPts: minimum number of points in a cluster
– Eps: for each point in a cluster there must be another point in the cluster less than this distance away

DBSCAN Density Concepts

• Eps-neighborhood: the points within Eps distance of a point.
• Core point: a point whose Eps-neighborhood is dense enough (contains at least MinPts points).
• Directly density-reachable: a point p is directly density-reachable from a point q if the distance between them is at most Eps and q is a core point.
• Density-reachable: a point is density-reachable from another point if there is a path from one to the other consisting only of core points.

Density Concepts

DBSCAN Algorithm

CURE

• Clustering Using REpresentatives
• Uses many points to represent a cluster instead of only one
• The representative points are well scattered

CURE Approach

CURE Algorithm

CURE for Large Databases

Comparison of Clustering Techniques
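The DBSCAN density concepts above assemble into the algorithm roughly as follows. This is a minimal illustrative sketch with quadratic-time neighborhood queries; a real implementation would use a spatial index.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise."""
    n = len(points)
    labels = [None] * n

    def region(i):
        # Eps-neighborhood: points within Eps distance (includes i itself)
        return [j for j in range(n)
                if np.linalg.norm(points[i] - points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:   # not a core point: tentatively noise
            labels[i] = -1
            continue
        cluster += 1               # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:               # grow the cluster via density-reachability
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = region(j)
            if len(nbrs) >= min_pts:      # j is core: expand through it
                queue.extend(nbrs)
    return labels
```

Noise points never seed a cluster, so they do not affect cluster creation, which is the property highlighted above.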