The Curse of dimensionality refers to unavoidable random phenomenon which occur when the dimensionality of data is High. This has different effect in different domains ranging from numerical analysis , machine learning , databases.
A more mathematical definition of curse of dimensionality is “When the dimension of the data point becomes very high or alternatively the number of features to express a data point becomes very large”.
The high dimension is not a problem for machine learning but it affects other domains also like ,
- machine learning
- Distance metrics
- Nearest Neighbours
In all domains above machine learning and distance metrics are affected largely due to curse of dimensionality .There is a direct correlation between performance of a machine learning model vs dimensionality . When dimensionality of a data point increases the performance of the model decreases.This relation is not an inversely proportional as the decrease in performance is problem specific.
Let’s discuss the effect of dimensionality on each of the following
Most of the optimization techniques relies on the tweaking the weights corresponding to each feature and as the number of features increases the weights also increases. Optimizing each weight and finding derivative w.r.t to each weight is very computationally intensive task. This is a huge obstacle in optimizing any algorithm. Though many years of research modern techniques have been developed for optimizing algorithms which are able to reduce the effect of dimensionality.
The machine learning involves algorithms which learns from samples of data to predict a general well optimized opinion on each data point.This property machine learning algorithms makes it very dependent on the nature of the data and its features. If the dimensionality of data increases then to recognize data points correctly the algorithms need more samples for each feature . This effect of dimensionality directly impacts the performance of the model and its ability to learn from data. Due to the fact that machine learning algorithms are highly dependent on data ,this makes the designing of models more difficult as if data have any bias towards any feature then the model itself becomes biased.
Distance metrics are techniques to find distance or similarity between points in a sample space . Some of these techniques include Euclidean distance ,manhattan distance etc. These techniques are highly effected by high dimension data .Lets consider a example of hypersphere in d dimension with radius r then the volume of that hypersphere is given by ,
here we can observe if d is very high the volume of the sphere shrinks and it can shrink to such a point that it becomes pointless to be even considered i.e ,
A further development of this phenomenon is as follows. For any fixed n, it turns out that the minimum and the maximum distance between a random reference point Q and a list of n random data points Pi becomes neglegible.
Due to this effect the the distance function start loosing their usefulness and due to this phenomenon machine learning algorithms based on distance metric like Knn and clustering tends to perform drastically poor.
Nearest Neighbor Search
The effect of high dimensionality complicates the nearest neighbor search as we cannot reject a point based on single coordinate as to high dimension the points which are far in low dimension representation can come closer in high dimension representation.This must be noted that only dimension not affect the nearest neighbor search it can be affected by various reasons.
The curse of dimensionality is right now a unavoidable problem many techniques are proven to reduce its effect, eliminating this problem is a hot area of research in machine learning.