

Properties of clusters:

Property 1: All the data points in a cluster should be similar to each other.

Property 2: The data points from different clusters should be as different as possible.
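The two properties can be checked numerically. The sketch below is only an illustration: the toy points, the cluster names and the choice of Euclidean distance are assumptions, not part of the original text. It compares the average distance between points in the same cluster with the average distance between points in different clusters.

```python
import math
from itertools import combinations

# Hypothetical toy data: two small groups of 2-D points.
clusters = {
    "a": [(0, 0), (0, 1), (1, 0)],
    "b": [(10, 10), (10, 11), (11, 10)],
}

def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster (Property 1)."""
    dists = [math.dist(p, q)
             for pts in clusters.values()
             for p, q in combinations(pts, 2)]
    return sum(dists) / len(dists)

def mean_inter_distance(clusters):
    """Average distance between pairs of points from different clusters (Property 2)."""
    dists = [math.dist(p, q)
             for pts_a, pts_b in combinations(clusters.values(), 2)
             for p in pts_a
             for q in pts_b]
    return sum(dists) / len(dists)

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"mean intra-cluster distance: {intra:.3f}")
print(f"mean inter-cluster distance: {inter:.3f}")
```

A good clustering should give a small intra-cluster value and a large inter-cluster value.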



Spectral Clustering and GMM

How does Spectral Clustering work?

In spectral clustering, the data points are treated as nodes of a graph. Thus, clustering is treated as a graph partitioning problem. The nodes are then mapped to a low-dimensional space that can be easily segregated to form clusters. An important point to note is that no assumption is made about the shape/form of the clusters.

What are the steps for Spectral Clustering?

Spectral clustering involves 3 steps:

1. Compute a similarity graph

2. Project the data onto a low-dimensional space

3. Create clusters
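The three steps can be sketched with NumPy. This is a minimal illustration under several assumptions that the text above does not state: a Gaussian (RBF) similarity with a hypothetical width `sigma`, the unnormalized graph Laplacian, and only two clusters, which are created by taking the sign of the second-smallest eigenvector instead of running k-means on the embedding.

```python
import numpy as np

def spectral_bipartition(X, sigma=2.0):
    """Split the rows of X into two clusters (labels 0 and 1)."""
    # Step 1: similarity graph. Assumed here: fully connected, Gaussian weights.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops

    # Step 2: low-dimensional projection via the unnormalized Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = eigenvectors[:, 1]  # eigenvector of the second-smallest eigenvalue

    # Step 3: create clusters. For two clusters the sign of this vector suffices.
    return (fiedler > 0).astype(int)

# Hypothetical toy data: two small groups of 2-D points.
X = np.array([[0, 0], [0, 1], [1, 0],
              [4, 4], [4, 5], [5, 4]], dtype=float)
labels = spectral_bipartition(X)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the sign of the eigenvector, so only the grouping itself is meaningful.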

Step 1 — Compute a similarity graph:



METHODS FOR DETERMINING OPTIMAL NUMBER OF CLUSTERS:

  1. ELBOW METHOD:

A fundamental step for many clustering algorithms is to determine the optimal number of clusters, k, into which the data should be grouped. The Elbow Method is one of the most popular methods to determine this optimal value of k.
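A common way to apply the elbow method is to run k-means for several values of k, record the within-cluster sum of squared distances (often called inertia), and look for the value of k after which the inertia stops dropping sharply. The sketch below assumes this formulation. The toy points, the plain-Python k-means and its deterministic farthest-point initialization are illustrative choices, not taken from the original article.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers

def kmeans_inertia(points, k, iters=50):
    """Run a basic k-means and return the within-cluster sum of squares."""
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(dist2(p, c) for c in centers) for p in points)

# Hypothetical toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this data the inertia falls steeply up to k = 3 and only slightly afterwards, so the "elbow" of the curve sits at k = 3.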



Support Vector Machines

  1. In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
  2. SVM + kernel SVM
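The idea of a wide gap between two categories can be sketched with a linear SVM trained by sub-gradient descent on the regularized hinge loss. This is only one simple way to fit such a model and covers the linear case, not the kernel SVM. The toy data, the function names and the hyperparameters `lam`, `lr` and `epochs` are assumptions made for illustration.

```python
import math

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))) by sub-gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularization term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point lies inside the gap or on the wrong side
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical toy data: two linearly separable classes labeled +1 and -1.
X = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
gap_width = 2 / math.sqrt(sum(wj ** 2 for wj in w))  # width of the gap, 2 / ||w||
print(predictions, round(gap_width, 2))
```

The regularization term keeps the weight vector short, which is what makes the gap between the two classes wide.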


HIERARCHICAL AGGLOMERATIVE CLUSTERING:

Implementation code in Python:

https://github.com/mrinalyadav7-atom/Text_clustering-Numo-Uno-/tree/master/Agglomerative

Hierarchical agglomerative clustering (HAC) is also known as the bottom-up approach. It produces a hierarchy of clusters, a structure that is more informative than the unstructured set of clusters returned by flat clustering. This clustering algorithm does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively merge (agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
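The bottom-up procedure can be sketched in a few lines of plain Python. This is a minimal illustration and not the code from the linked repository. It assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance, and the toy points are hypothetical.

```python
import math

def single_linkage(a, b, points):
    """Distance between clusters a and b: distance of their closest pair of members."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # every point starts as a singleton
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = single_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)] + [merged]
    return merges

# Hypothetical toy data: five points on a line.
points = [(1.0, 0.0), (2.0, 0.0), (6.0, 0.0), (7.5, 0.0), (20.0, 0.0)]
merges = agglomerate(points)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The merge history is the hierarchy: cutting it at a chosen distance yields a flat clustering, which is why the number of clusters does not have to be fixed in advance.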



AFFINITY PROPAGATION:

  • Affinity propagation is a clustering algorithm in which the complete data set is viewed as a network, with each data point being a node in the network.
  • The entire algorithm is based on iteratively finding how well one point is suited to be a representative of another point (i.e., how suited a particular point is to be an exemplar for another point, given information about other prospective representatives in the data).
  • Unlike clustering algorithms such as k-means or k-medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm.
  • A…


WORD EMBEDDINGS-NLP

What are Word Embeddings?

Word embeddings are texts converted into numbers, and the same text may have several different numerical representations.

Consider the sentence: "Word Embeddings are Word converted into numbers"

A dictionary may be the list of all unique words in the sentence. So, a dictionary may look like: ['Word', 'Embeddings', 'are', 'converted', 'into', 'numbers']

A vector representation of a word may be a one-hot encoded vector, where 1 stands for the position where the word exists and 0 is used everywhere else. The vector representation of "numbers" in this format, according to the above dictionary, is [0,0,0,0,0,1], and that of "converted" is [0,0,0,1,0,0].

This is just a…
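The one-hot encoding described above can be reproduced in a few lines. This is a minimal sketch: the helper name `one_hot` is hypothetical, and the dictionary is assumed to list the unique words in order of first appearance, split on whitespace.

```python
sentence = "Word Embeddings are Word converted into numbers"

# Unique words in order of first appearance (dict keys preserve insertion order).
vocabulary = list(dict.fromkeys(sentence.split()))
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector with 1 at the word's position in the dictionary and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

print(vocabulary)
print(one_hot("numbers"))
print(one_hot("converted"))
```

The vector length equals the dictionary size, so this representation grows with every new unique word.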



DBSCAN Clustering Algorithm

DBSCAN ALGORITHM:

  • Given a set of points (think of a two-dimensional space), DBSCAN groups together points that are close to each other, based on a distance measure (usually Euclidean distance) and a minimum number of points. It also marks the points that lie in low-density regions as outliers.
  • DBSCAN stands for density-based spatial clustering of applications with noise.
  • Clusters are dense regions in the data space, separated by regions of lower point density. The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. …
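The points above can be sketched as a small plain-Python DBSCAN. This is an illustrative version that follows the commonly described algorithm, not an optimized implementation. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum number of points, counting the point itself) and the toy data are assumptions.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # low-density region (may later become a border point)
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point has no neighbors within `eps`, so it is labeled as noise rather than being forced into a cluster.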

Mrinal Yadav

Data analytics and Naval architecture. Undergraduate at IIT Kharagpur. Loves to spend time with football.
