Introduction to Clustering

Clustering means grouping similar data points or samples into clusters that describe the general behavior of the samples. In machine learning, clustering is a very effective technique, and it is not hardware-intensive.

Before diving into algorithmic details, let's first see an example of how powerful clustering can be.

Consider the image below, where the x-axis represents the height of people and the y-axis represents their weight. This is sample data of people visiting a clothing shop.

After applying clustering, we can clearly see that small and average-sized people tend to visit more than large people. By applying simple clustering, a shopkeeper can increase sales just by stocking more clothes for small and mid-sized customers.

In this article we will cover three clustering algorithms:

  • K-means clustering
  • Hierarchical clustering
  • Density-based clustering
K-means Clustering

The algorithm works as follows (a minimal code sketch appears after the steps):

  1. Choose how many clusters you want to form, i.e., K clusters; here K is a hyperparameter.
  2. Select K random points from the dataset and mark them as the centroids for the first iteration.
  3. Calculate the Euclidean distance of each point from each centroid. You can also use Manhattan distance; which distance metric to use is a problem-specific decision.
  4. Assign each point to the cluster of the centroid it is closest to.
  5. Calculate the mean of all the points inside each cluster; that mean becomes the centroid for the next iteration.
  6. Repeat steps 3-5 until the clusters no longer change.
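
Here is a minimal NumPy sketch of these steps; the function name and its parameters are illustrative, not from any particular library:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means for an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 3-4: assign each point to its nearest centroid
        # (Euclidean distance; swap in Manhattan if the problem calls for it)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop when the centroids (and hence clusters) stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: cluster 2-D points into K=2 groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels)  # e.g. [0 0 1 1]
```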

A big disadvantage of K-means is that it geometrically prefers to form clusters of similar size and roughly spherical shape, which is not always what the problem calls for.

[Figure: Working of K-means]

Hierarchical Clustering

The geometric concept is:

  • Initially, every point is its own cluster.
  • Combine clusters based on some similarity or distance metric to form bigger clusters.
  • Repeat step two until the desired number of clusters is formed.

The most popular hierarchical clustering method is agglomerative clustering. The algorithm works as follows (a code sketch appears after the steps):

  1. Calculate the proximity matrix.
  2. Repeat:
    • Merge the two clusters with the minimum proximity value.
    • Update the proximity matrix.
  3. Until only one cluster remains (or the desired number of clusters is reached).
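
A minimal sketch using scikit-learn's AgglomerativeClustering, assuming scikit-learn is installed; the toy data is made up for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 2-D data: two loose groups of points
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 9], [8, 9]])

# Merge clusters bottom-up; linkage="single" merges the pair of
# clusters whose closest points have the minimum distance
model = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster label order may vary)
```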

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. The goal of this algorithm is to form clusters of dense points and to eliminate noise or outliers.

Working
The following parameters and terms need to be understood in order to run the algorithm properly:
  • epsilon => the radius of the neighborhood (hypersphere) in which we look for points
  • minPts => the minimum number of points that must be present inside the epsilon neighborhood
  • core point => a point whose epsilon neighborhood contains at least minPts points; clusters are formed around core points
  • border point => a point that is not a core point but lies inside the epsilon neighborhood of a core point
  • noise point => a point that is not part of any cluster
  • density edge => a direct edge connecting two points of the same cluster that lie within epsilon of each other
  • density-connected points => two points of the same cluster joined by a path of density edges

Of the above, epsilon and minPts are the most important hyperparameters of the algorithm, as they decide how well it works. The working of the algorithm is simple (a code sketch follows the steps):

  • For each point, find all the points inside its epsilon neighborhood; if there are at least minPts of them, mark the point as a core point.
  • Form a cluster around each core point; if its epsilon neighborhood overlaps an existing cluster, the point joins that cluster.
  • Repeat the above steps for all points until the clusters are fully formed; points that are not reachable from any core point are labeled as noise.
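
A minimal sketch using scikit-learn's DBSCAN implementation, assuming scikit-learn is installed; the toy data and parameter values are made up for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus one far-away outlier
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],
              [12.0, 0.0]])

# eps is the epsilon radius; min_samples is minPts
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; -1 marks noise points
```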

The impact of machine learning is getting bigger every day, so don't get left behind: make yourself aware of how these algorithms work, at both the mathematical and the intuitive level.
