Mrinal Yadav
4 min read · Jan 10, 2021


Photo by Gertrūda Valasevičiūtė on Unsplash

METHODS FOR DETERMINING OPTIMAL NUMBER OF CLUSTERS:

  1. ELBOW METHOD:

A fundamental step in any clustering algorithm such as k-means is to determine the optimal number of clusters, k, into which the data should be grouped. The Elbow Method is one of the most popular ways to choose this value of k.

From the above visualization, we can see that the optimal number of clusters should be around 3. But visualizing the data alone cannot always give the right answer, so we go through the following steps.

We first define the following:

  1. Distortion: the average of the squared distances from each sample to the center of the cluster it is assigned to. Typically, the Euclidean distance metric is used.
  2. Inertia: the sum of the squared distances from each sample to its closest cluster center.
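To make the two definitions concrete, here is a minimal pure-Python sketch. The function names and the toy points and centers are hypothetical, chosen only for illustration, and the distortion follows the definition given above (average of squared Euclidean distances).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def inertia_and_distortion(points, centers):
    """Return (inertia, distortion) for points against fixed cluster centers."""
    # Each sample contributes the squared distance to its closest center.
    closest = [min(squared_distance(p, c) for c in centers) for p in points]
    inertia = sum(closest)              # sum of squared distances
    distortion = inertia / len(points)  # average of squared distances
    return inertia, distortion


# Hypothetical toy data: two tight pairs of points and their two centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

inertia, distortion = inertia_and_distortion(points, centers)
print(inertia, distortion)
```

Under these definitions the two quantities differ only by the factor 1/n, so they produce the same elbow.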

We iterate k from 1 to 9 and, for each value of k, fit the clustering and calculate the resulting distortion and inertia.
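The loop described above can be sketched as follows. This is not the code from the article: it is a small self-contained k-means written for illustration, with a deterministic farthest-point seeding (an assumption made here so the run is reproducible) and a hypothetical toy dataset of three well-separated blobs.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns the final cluster centers."""
    centers = farthest_point_init(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Hypothetical toy data: three tight blobs of four points each.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5)]
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

inertias, distortions = [], []
for k in range(1, 10):
    centers = kmeans(points, k)
    inertia = sum(min(squared_distance(p, c) for c in centers) for p in points)
    inertias.append(inertia)
    distortions.append(inertia / len(points))
    print(k, round(inertia, 3), round(inertia / len(points), 3))
```

On this toy data the inertia falls sharply up to k = 3 and barely changes afterwards, which is the elbow shape the method looks for. In practice a library implementation of k-means would normally replace the hand-written one.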

Code for elbow method
Result for the code

Some notes:

To determine the optimal number of clusters, we select the value of k at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing roughly linearly. Thus, for the given data, we conclude that the optimal number of clusters is 3.
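The elbow is usually read off the plot by eye. As a rough aid, one common heuristic (an assumption here, not something the article's code does) is to pick the k with the largest discrete second difference of the curve. The inertia values below are hypothetical.

```python
def elbow_by_second_difference(ks, scores):
    """Return the k where the curve bends most sharply.

    The bend at position i is measured by the second difference
    scores[i-1] - 2*scores[i] + scores[i+1]; the two endpoints cannot be
    chosen. Assumes the values in ks are evenly spaced.
    """
    if len(ks) != len(scores) or len(ks) < 3:
        raise ValueError("need at least three (k, score) pairs")
    best_index = max(
        range(1, len(scores) - 1),
        key=lambda i: scores[i - 1] - 2 * scores[i] + scores[i + 1],
    )
    return ks[best_index]


# Hypothetical inertia values for k = 1..6, made up for illustration.
ks = [1, 2, 3, 4, 5, 6]
inertias = [540.0, 300.0, 30.0, 24.0, 19.0, 15.0]

best_k = elbow_by_second_difference(ks, inertias)
print(best_k)
```

A heuristic like this can be misled by noisy or smoothly decaying curves, so it is best treated as a sanity check on the visual reading.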

Implementation in Python:

https://github.com/mrinalyadav7-atom/Text_clustering-Numo-Uno-/tree/master/Kmeans

  2. SILHOUETTE INDEX:

Silhouette analysis refers to a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually.

Calculation of Silhouette Value –

If the Silhouette index value is high, the object is well-matched to its own cluster and poorly matched to neighbouring clusters. The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient is defined as –

S(i) = ( b(i) - a(i) ) / max{ a(i), b(i) }

Where,

  • a(i) is the average dissimilarity of the i-th object to all other objects in the same cluster
  • b(i) is the average dissimilarity of the i-th object to all objects in the closest cluster that it does not belong to
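The formula and the two definitions above can be sketched in pure Python as follows. This is an illustrative sketch, not the article's code: it assumes Euclidean distance as the dissimilarity, follows the common convention that a point alone in its cluster gets S(i) = 0, and uses hypothetical toy points and labels.

```python
import math


def silhouette_values(points, labels):
    """Return S(i) for every point, using Euclidean distance."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster gets S(i) = 0.
            values.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Hypothetical toy data: two pairs of nearby points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]

good = silhouette_values(points, [0, 0, 1, 1])  # sensible clustering
bad = silhouette_values(points, [0, 1, 1, 0])   # deliberately mixed up

print(sum(good) / len(good))
print(sum(bad) / len(bad))
```

Averaging S(i) over all points gives a single score per clustering, which is what is typically compared across different values of k.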

Code for silhouette score
Result of the code

Some notes:

Range of Silhouette Value –

Because the numerator can never exceed max{ a(i), b(i) } in absolute value, S(i) always lies in the range [-1, 1]:

  1. If the silhouette value is close to 1, the sample is well clustered and already assigned to a very appropriate cluster.
  2. If the silhouette value is close to 0, the sample lies about equally far from its own cluster and the nearest neighbouring cluster, so it could just as well be assigned to that neighbour. This indicates overlapping clusters.
  3. If the silhouette value is close to –1, the sample is closer on average to a neighbouring cluster than to its own, so it has probably been assigned to the wrong cluster.
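The three cases can be checked with a few lines of arithmetic. The helper name and the (a, b) pairs below are hypothetical values picked to land in each regime.

```python
def silhouette_from_ab(a, b):
    """S = (b - a) / max(a, b), with S = 0 when both distances are zero."""
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


# (a, b) pairs: tight and well separated, borderline, likely misassigned.
cases = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
scores = [silhouette_from_ab(a, b) for a, b in cases]

for (a, b), s in zip(cases, scores):
    print(a, b, s)
```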

Implementation:

https://github.com/mrinalyadav7-atom/Text_clustering-Numo-Uno-/tree/master/Kmeans
