K-Means Clustering
Properties of clusters-
Property 1: All the data points in a cluster should be similar to each other.
Property 2: The data points from different clusters should be as different as possible.
(Figure in the source article compares two example clusterings, Case I and Case II; Case II, which satisfies both properties, is the better clustering.)
Algorithm-
- First we initialize k points, called means, randomly.
- We assign each item to its closest mean (by measuring the distance between points; in our case we can use cosine similarity) and update that mean's coordinates to the average of the items assigned to it so far.
- We repeat the process for a given number of iterations and at the end, we have our clusters (a minimal sketch of these steps follows).
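As a minimal sketch of these steps for text data (the sample documents and parameter values below are made up for illustration), L2-normalising TF-IDF vectors lets ordinary Euclidean K-means behave approximately like clustering by cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Hypothetical documents, only to make the sketch runnable.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about the market",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF matrix
X = normalize(X)                           # unit-length rows: Euclidean K-means ~ cosine clustering

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster id assigned to each document
```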
Stopping Criteria for K-Means Clustering
There are essentially three stopping criteria that can be adopted to stop the K-means algorithm:
- Centroids of newly formed clusters do not change
- Points remain in the same cluster
- Maximum number of iterations is reached
We can stop the algorithm if the centroids of newly formed clusters are not changing. Even after multiple iterations, if we are getting the same centroids for all the clusters, we can say that the algorithm is not learning any new pattern and it is a sign to stop the training.
Another clear sign that we should stop the training process is when the points remain in the same clusters even after training the algorithm for multiple iterations.
Finally, we can stop the training if the maximum number of iterations is reached. Suppose we have set the number of iterations to 100: the process will repeat for up to 100 iterations before stopping.
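Assuming scikit-learn's KMeans, the first and third criteria map directly onto its tol and max_iter parameters; a small sketch with illustrative values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(
    n_clusters=3,
    max_iter=100,  # hard cap on iterations (third criterion)
    tol=1e-4,      # stop once the centroids barely move between iterations (first criterion)
    n_init=10,
    random_state=0,
).fit(X)

print(km.n_iter_)  # iterations actually run before stopping
```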
https://www.geeksforgeeks.org/k-means-clustering-introduction/
Why does K-means fail?
Briefly, K-means performs poorly because its underlying assumptions about the shape of the clusters are not met; it is a parametric algorithm parameterized by the K cluster centroids, which act as the centers of Gaussian spheres. K-means performs best when clusters are (see the sketch after this list):
- “round” or spherical
- equally sized
- equally dense
- most dense in the center of the sphere
- not contaminated by noise/outliers
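A quick demonstration of this failure mode on deliberately non-spherical data (the two-moons toy set; the parameter choices are illustrative, not tuned):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clusters that are neither round nor densest at a centre.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A low score here reflects the violated "spherical, equally sized" assumptions.
print(adjusted_rand_score(y_true, y_pred))
```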
Implementation in Python:
https://github.com/mrinalyadav7-atom/Text_clustering-Numo-Uno-/tree/master/Kmeans
DBSCAN Clustering | Density-based clustering-
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Real-life data may contain irregularities, like –
i) Clusters can be of arbitrary shape.
ii) Data may contain noise.
K-means cannot handle either of these well.
Parameters:
The DBSCAN algorithm basically requires 2 parameters (a small usage sketch follows the list):
- eps: specifies how close points should be to each other to be considered part of a cluster. That is, if the distance between two points is less than or equal to this value (eps), these points are considered neighbours.
- minPoints: the minimum number of points to form a dense region. For example, if we set the minPoints parameter as 5, then we need at least 5 points to form a dense region.
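A minimal sketch, assuming scikit-learn's DBSCAN (where minPoints is called min_samples); the eps value below is just an illustrative guess for the toy data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# eps and min_samples correspond to the eps and minPoints parameters above.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(n_clusters, n_noise)
```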
Parameter estimation:
Parameter estimation is a challenge for every data mining task. To choose good parameters we need to understand how they are used and have at least some basic prior knowledge about the data set that will be used.
- eps: if the chosen eps value is too small, a large part of the data will not be clustered; it will be considered outliers because it doesn't satisfy the number of points needed to create a dense region. On the other hand, if the chosen value is too high, clusters will merge and the majority of objects will end up in the same cluster. eps should be chosen based on the distances in the data set (we can use a k-distance graph to find it, as sketched after this list), but in general small eps values are preferable.
- minPoints: as a general rule, a minimum minPoints can be derived from the number of dimensions (D) in the data set, as minPoints ≥ D + 1. Larger values are usually better for data sets with noise and will form more significant clusters. minPoints must be at least 3, but the larger the data set, the larger the minPoints value that should be chosen.
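A sketch of the k-distance graph heuristic mentioned above, assuming scikit-learn and matplotlib; k should be tied to the minPoints you plan to use:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

k = 5  # tie this to your intended minPoints
# Each query point is returned as its own nearest neighbour, so the last column is
# roughly the distance to the k-th closest point - close enough for the elbow heuristic.
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbour")
plt.show()  # the elbow of this curve suggests a candidate eps
```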
In this algorithm, we have 3 types of data points.
- Core Point: a point is a core point if it has more than minPoints points within distance eps of it.
- Border Point: a point which has fewer than minPoints points within eps but lies in the neighbourhood of a core point.
- Noise or outlier: a point which is neither a core point nor a border point.
DBSCAN algorithm –
- Find all the neighbouring points within eps of each point and identify the core points as those with more than minPoints neighbours.
- For each core point, if it is not already assigned to a cluster, create a new cluster.
- Find recursively all its density connected points and assign them to the same cluster as the core point.
Points a and b are said to be density connected if there exists a point c which has a sufficient number of points in its neighbourhood and both a and b are within eps distance of c. This is a chaining process: if b is a neighbour of c, c is a neighbour of d, and d is a neighbour of e, which in turn is a neighbour of a, then b is connected to a.
- Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise. (A from-scratch sketch of these steps follows.)
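To make the chaining concrete, here is a compact from-scratch sketch of these steps; it uses brute-force distances, so it is only for illustration on small data sets (scikit-learn's DBSCAN is the practical choice):

```python
import numpy as np

def dbscan(X, eps, min_points):
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.full(n, -1)  # -1 = noise until a cluster reaches the point
    # Brute-force neighbourhoods (each point counts itself as a neighbour).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = [np.where(dists[i] <= eps)[0] for i in range(n)]
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbours[i]) < min_points:
            continue  # not a core point; stays noise unless reached later
        labels[i] = cluster  # i is a core point: start a new cluster
        seeds = list(neighbours[i])
        while seeds:  # grow the cluster through density-connected points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border or core point reached by chaining
            if not visited[j]:
                visited[j] = True
                if len(neighbours[j]) >= min_points:
                    seeds.extend(neighbours[j])  # j is also core: keep chaining
        cluster += 1
    return labels
```

Calling dbscan(X, eps=0.2, min_points=5) on a small 2-D array returns one label per point, with -1 marking noise, mirroring the core/border/noise distinction described above.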
https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering
How HDBSCAN Works-
Regular DBSCAN is amazing at clustering data of varying shapes, but falls short when clustering data of varying density. HDBSCAN extends DBSCAN by converting it into a hierarchical clustering algorithm and then using a technique to extract a flat clustering based on the stability of clusters. We can break it out into a series of steps (a usage sketch follows the list):
- Transform the space according to the density/sparsity.
- Build the minimum spanning tree of the distance weighted graph.
- Construct a cluster hierarchy of connected components.
- Condense the cluster hierarchy based on minimum cluster size.
- Extract the stable clusters from the condensed tree.
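A minimal usage sketch, assuming the hdbscan library linked below (the blob data and min_cluster_size value are illustrative):

```python
import hdbscan
from sklearn.datasets import make_blobs

# Blobs of deliberately different densities, the case plain DBSCAN struggles with.
X, _ = make_blobs(n_samples=1000, centers=4,
                  cluster_std=[0.5, 1.0, 2.0, 0.8], random_state=0)

clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)  # -1 marks noise, as in DBSCAN
print(set(labels))
```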
Implementation in Python:
https://github.com/mrinalyadav7-atom/Text_clustering-Numo-Uno-
(Figures: density bars; density bars with DBSCAN applied; density bars with HDBSCAN applied.)
K-means vs HDBSCAN-
https://github.com/scikit-learn-contrib/hdbscan/issues/178
https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/
https://colab.research.google.com/drive/1GMcOAJb-pzm4g3osXYYiwlExUvFh4N47#scrollTo=A-ImumZxZkga
Comparing Python Clustering Algorithms-
https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html