There are many popular algorithms for dimensionality reduction. In this article we will cover one of the most popular of them: principal component analysis (PCA).
Definition: In simple terms, dimensionality reduction is the process of mapping a vector from an n-dimensional space to a k-dimensional space, where n > k, while losing as little information as possible. There are many reasons to do this, and one of the main ones is data visualization.
Let's motivate this with an example: a plot of hair length against hair color. Consider the image below.
Looking at the image above, we can clearly see that the variance is very high along hair length and very low along color intensity. A simple dimensionality reduction with minimal loss is therefore to drop hair color and keep only hair length. By doing this we reduce the data from 2-D to 1-D.
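As a rough illustration of this "drop the low-variance feature" idea, here is a minimal NumPy sketch. The data is synthetic and the names `hair_length` and `color_intensity` are made up for this example; they are not taken from the figure. Note that comparing raw variances only makes sense when the features are on comparable scales.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data: hair length varies a lot, color intensity barely varies.
hair_length = rng.normal(loc=30.0, scale=10.0, size=200)
color_intensity = rng.normal(loc=0.5, scale=0.05, size=200)
X = np.column_stack([hair_length, color_intensity])

# Variance of each feature (column).
variances = X.var(axis=0)

# Keep only the feature with the highest variance: 2-D becomes 1-D.
keep = int(np.argmax(variances))
X_reduced = X[:, [keep]]

print("variance per feature:", variances)
print("kept column:", keep, "reduced shape:", X_reduced.shape)
```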
Simply dropping features like this is not a sophisticated approach to dimensionality reduction, and it does not always work. To see why, take a look at the image below.
In this image we cannot drop either dimension, X or Y, because dropping either one would result in a large loss of information. We need a more sophisticated way of reducing the dimension.
If we could somehow rotate the x-axis and the y-axis of the above image by some angle, as in the image below, then we could drop the rotated Y axis for dimensionality reduction. This is the goal of PCA: to find the rotation of the axes that minimizes the loss of information.
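To make the rotation idea concrete, here is a small sketch under stated assumptions: the points are synthetic and deliberately spread along the diagonal, and the 45 degree angle is picked by hand because we built the data that way. PCA's job is to find such an angle automatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic points spread along the diagonal y = x, with small noise
# perpendicular to it. Neither X nor Y can be dropped cheaply.
t = rng.normal(scale=3.0, size=300)
noise = rng.normal(scale=0.3, size=300)
X = np.column_stack([t + noise, t - noise])

# Express the points in axes rotated by 45 degrees (angle chosen by hand).
theta = np.deg2rad(45.0)
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s],
              [s,  c]])
X_rot = X @ R

var_before = X.var(axis=0)
var_after = X_rot.var(axis=0)
print("variance per axis before rotation:", var_before)
print("variance per axis after rotation: ", var_after)

# After the rotation almost all variance sits on the first axis,
# so the second axis can be dropped with little loss.
X_1d = X_rot[:, [0]]
```

The rotation does not change the total variance. It only redistributes it between the axes, which is why dropping the second axis becomes cheap afterwards.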
So let's formulate our discussion from here.
Mathematically, what we want is to find the direction of the axis that causes the minimum loss of information, i.e. the axis onto which we project our points such that the variance of the projected points is maximized. Writing u for a unit vector along that axis and x_1, ..., x_n for the mean-centered data points, this means choosing u to maximize (1/n) * sum_i (u^T x_i)^2 subject to ||u|| = 1.
Now we have the objective function of PCA. All we need to do is optimize it using any suitable optimization algorithm.
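Here is one possible way to optimize this kind of objective numerically, shown only as a sketch. It assumes the objective is "maximize the variance of the points projected onto a unit vector u", uses synthetic data, and uses plain gradient ascent with renormalization; the step size and iteration count are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2-D data (an assumption for illustration).
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 2.0], [2.0, 2.0]],
                            size=500)
Xc = X - X.mean(axis=0)          # center the data
S = Xc.T @ Xc / len(Xc)          # covariance matrix

# Gradient ascent on f(u) = u^T S u, keeping u on the unit circle.
u = np.array([1.0, 0.0])
for _ in range(200):
    u = u + 0.1 * (2.0 * S @ u)  # gradient of u^T S u is 2 S u
    u = u / np.linalg.norm(u)    # project back onto ||u|| = 1

projected_variance = (Xc @ u).var()
top_eigenvalue = np.linalg.eigvalsh(S)[-1]   # eigvalsh sorts ascending

print("direction found:", u)
print("variance along it:", projected_variance)
print("largest eigenvalue of S:", top_eigenvalue)
```

The variance along the direction found matches the largest eigenvalue of the covariance matrix, which is the link to the eigenvalue-based formulation of PCA.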
There is also another formulation of PCA, which is very popular and is preferred in programming implementations of PCA. It goes as follows:
- Calculate the covariance matrix of the dataset X.
- Find the eigenvalues of the covariance matrix and the eigenvector corresponding to each eigenvalue.
- Choose the top k eigenvalues, where k is the number of dimensions you want to reduce the data to, together with their corresponding eigenvectors.
- Take the dot product of the dataset with the chosen eigenvectors.
- The result is the final, reduced dataset.
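The steps above can be sketched in NumPy roughly as follows. The function name `pca_eig` and the random test data are made up for this example, and centering the data before projecting is an assumption added here, since the steps do not mention it explicitly.

```python
import numpy as np

def pca_eig(X, k):
    """Reduce X (n_samples x n_features) to k dimensions via eigendecomposition."""
    # Assumption: center the data so projections have zero mean.
    Xc = X - X.mean(axis=0)

    # Step 1: covariance matrix of the dataset (features are columns).
    cov = np.cov(Xc, rowvar=False)

    # Step 2: eigenvalues and eigenvectors (eigh is meant for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 3: pick the k largest eigenvalues and their eigenvectors.
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]

    # Steps 4 and 5: dot product of the dataset with the chosen eigenvectors.
    return Xc @ components, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # synthetic 5-D data

Z, top_eigvals = pca_eig(X, k=2)
print("reduced shape:", Z.shape)
print("variance kept per component:", top_eigvals)
```

Each eigenvalue equals the variance of the data along its eigenvector, so the sum of the chosen eigenvalues divided by the sum of all eigenvalues gives the fraction of variance retained.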
Dimensionality reduction is still an active area of research in machine learning. Reducing the dimension generally results in some loss of information, so finding ways to minimize that loss is very important.
We will cover many more interesting topics in the future, so stay tuned.