Mrinal Yadav
4 min read · Dec 30, 2020



HIERARCHICAL AGGLOMERATIVE CLUSTERING:

Implementation code in python:

https://github.com/mrinalyadav7-atom/Text_clustering-Numo-Uno-/tree/master/Agglomerative

Also known as the bottom-up approach, hierarchical agglomerative clustering (HAC) produces a structure that is more informative than the unstructured set of clusters returned by flat clustering, and it does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively merge the closest pair of clusters until everything has been merged into a single cluster that contains all the data.

Agglomerative Hierarchical Clustering

We assign each point to an individual cluster in this technique. Suppose there are 4 data points. We will assign each of these points to a cluster and hence will have 4 clusters in the beginning:

Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a single cluster is left:

We are merging (or adding) the clusters at each step, right? Hence, this type of clustering is also known as additive hierarchical clustering.

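The merging procedure described above can be sketched with scikit-learn's `AgglomerativeClustering`. The four sample points and the choice of `n_clusters=2` are illustrative assumptions, not part of the original article:

```python
# Minimal sketch of agglomerative clustering with scikit-learn.
# The data points and n_clusters below are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Four points forming two obvious groups.
X = np.array([[1.0, 1.0], [1.5, 1.2], [8.0, 8.0], [8.2, 7.8]])

# Start with each point as its own cluster, merge until 2 remain.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # the two left points share one label, the two right points the other
```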

How should we Choose the Number of Clusters in Hierarchical Clustering?

To get the number of clusters for hierarchical clustering, we make use of an awesome concept called a Dendrogram.

A dendrogram is a tree-like diagram that records the sequences of merges or splits.

We have the samples of the dataset on the x-axis and the distance on the y-axis. Whenever two clusters are merged, we join them in the dendrogram, and the height of the join is the distance between those clusters.

Now, we can set a threshold distance and draw a horizontal line. (Generally, we try to set the threshold so that it cuts the tallest vertical line.)

The number of clusters is then the number of vertical lines intersected by the horizontal line drawn at the threshold.
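Cutting the dendrogram at a threshold can be sketched with SciPy's hierarchy functions. The sample data and the threshold `t=3.0` are illustrative assumptions:

```python
# Sketch: build the merge history and cut the dendrogram at a threshold.
# The data and the threshold value are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.2], [8.0, 8.0], [8.2, 7.8]])

# Each row of Z records one merge: the two clusters joined and their distance.
Z = linkage(X, method="average")
# scipy.cluster.hierarchy.dendrogram(Z) would plot the tree itself.

# Cut the tree at distance 3.0: every merge above that height is undone.
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)
```

Here the two nearby pairs merge at small distances, while the final merge happens far above 3.0, so the cut yields two clusters.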

Types of Linkages in Clustering

Hierarchical clustering involves either merging sub-clusters (individual data points in the first iteration) into larger clusters in a bottom-up manner, or dividing a larger cluster into smaller sub-clusters in a top-down manner. In both types of hierarchical clustering, the distance between two sub-clusters needs to be computed. The different types of linkages describe different approaches to measuring the distance between two sub-clusters of data points. The types of linkages are:

1. Single Linkage: For two clusters R and S, the single linkage returns the minimum distance between two points i and j such that i belongs to R and j belongs to S.

Single linkage

2. Complete Linkage: For two clusters R and S, the complete linkage returns the maximum distance between two points i and j such that i belongs to R and j belongs to S.

Complete linkage

3. Average Linkage: For two clusters R and S, we first compute the distance between every data point i in R and every data point j in S, and then take the arithmetic mean of these distances. Average linkage returns this mean, i.e. the sum of all pairwise distances divided by Nr · Ns, where:

Nr: number of data points in cluster R

Ns: number of data points in cluster S

Average linkage
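The three linkage criteria above can be computed directly with NumPy/SciPy. The two clusters R and S below are illustrative assumptions:

```python
# Sketch of the three linkage criteria, computed by hand.
# The two clusters R and S are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import cdist

R = np.array([[0.0, 0.0], [1.0, 0.0]])
S = np.array([[4.0, 0.0], [6.0, 0.0]])

D = cdist(R, S)        # Nr x Ns matrix of pairwise distances between R and S

single = D.min()       # single linkage: closest pair across the two clusters
complete = D.max()     # complete linkage: farthest pair across the two clusters
average = D.mean()     # average linkage: mean over all Nr * Ns pairwise distances

print(single, complete, average)  # 3.0 6.0 4.5
```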


Implementation in python:

https://colab.research.google.com/drive/1LyZc33p5TDjv7H-hAsrLkMp-48hGvEeM?usp=sharing
