“How well your machine learning model predicts depends upon what data you use to train it.”
If the data itself is bad, don’t expect good results. The immediate question is, “How do you define bad data?” The answer is difficult because the quality of data depends upon many factors, but among them the choice of features used to create the data is the most important. The chosen features must align with the objective of the problem. Choosing features is not the kind of task where one should rely on intuition alone; feature selection is a set of statistical techniques used to find the relevant features.
The goal of this article is to answer the following questions.
- What is feature selection ?
- What are filter methods for feature selection ?
- What are wrapper methods for feature selection ?
- What are embedded methods for feature selection ?
What is feature selection ?
Feature selection refers to selecting the attributes or variables most relevant to a problem. It should be performed before feeding data to an algorithm for the following reasons:
- The higher the dimension of the data, the slower the training process.
- Some features may be completely irrelevant for predicting the output.
- Irrelevant features can hurt the accuracy of the model.
- In some cases, removing features can also help avoid overfitting.
What are filter methods for feature selection ?
Filter methods use statistical tests to measure the relevance of each feature with respect to the output. These tests are performed without considering the model’s learning algorithm; the features are selected purely on the basis of their statistical test scores.
Some of the most popular statistical measures are:
- Covariance: Measuring the covariance between pairs of features is often a good start, as it gives a fairly good idea of how the features behave with respect to each other and to the output.
- Pearson correlation: Covariance provides useful information about the relationship between variables, but its disadvantage is that it has no upper bound, which makes it difficult to judge how strongly two features are aligned. Pearson correlation fixes this by rescaling covariance to the range [-1, 1].
- LDA: Linear discriminant analysis selects features (or linear combinations of features) based on how well they separate two or more classes.
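The covariance-versus-correlation point above can be sketched with NumPy and pandas on synthetic data (the feature names and the data-generating setup here are illustrative, not from any real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic data: x1 drives the target, x2 tracks x1, x3 is pure noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=0.1, size=n)   # strongly related to x1
x3 = rng.normal(size=n)                        # unrelated noise feature
y = 3 * x1 + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "y": y})

# Covariance with the target: informative, but unbounded,
# so the raw scores are hard to compare across features
print(df.cov()["y"])

# Pearson correlation: rescales covariance to [-1, 1],
# making feature relevance directly comparable
print(df.corr()["y"])
```

Here `x1` and `x2` show correlations near 1 with `y`, while the noise feature `x3` stays near 0, which is exactly the signal a filter method would use to drop it.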
What are wrapper methods for feature selection ?
The feature selection techniques under this heading select features based on the algorithm you use for your model. The idea is simple: pick a single feature or a subset of features, feed it into the model, and record the accuracy; after trying a certain number of combinations, choose whichever subset performs best.
The techniques which come under this heading are:
- Forward feature selection: First choose the single feature that gives the best performance, then keep adding features on top of it, checking the performance after every addition. Stop when the performance no longer improves, or improves only insignificantly.
- Backward elimination: The opposite of forward feature selection: start with all features and eliminate them one by one as long as there is no significant drop in the performance of the model.
- Recursive feature elimination: This is not a very computationally friendly technique. At each stage it fits the model, identifies the worst feature, removes it, and repeats the process on the features left over. Once all features have been processed, it ranks them according to the order in which they were eliminated, and the top-ranked features are chosen.
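The three wrapper techniques above are available in scikit-learn. A minimal sketch on the built-in iris dataset, assuming scikit-learn >= 0.24 (which introduced `SequentialFeatureSelector`):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
est = LogisticRegression(max_iter=1000)

# Forward selection: greedily add features while cross-validated score improves
sfs = SequentialFeatureSelector(est, n_features_to_select=2, direction="forward")
sfs.fit(X, y)
print("forward selection keeps:", sfs.get_support())

# Backward elimination: start from all features and drop them one by one
sbs = SequentialFeatureSelector(est, n_features_to_select=2, direction="backward")
sbs.fit(X, y)
print("backward elimination keeps:", sbs.get_support())

# Recursive feature elimination: repeatedly refit and drop the lowest-weighted feature,
# producing a ranking (1 = selected, higher = eliminated earlier)
rfe = RFE(est, n_features_to_select=2)
rfe.fit(X, y)
print("RFE ranking:", rfe.ranking_)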
There is no all-in-one package that implements every one of these techniques, but packages such as scikit-learn and boruta cover most of them; the rest you can implement yourself.
What are embedded methods for feature selection ?
These methods combine the qualities of filter and wrapper methods: the feature selection is built into the training algorithm itself, so they are completely algorithm dependent. The most popular example of embedded feature selection is regularization, which adds a penalty to the objective function that shrinks the model’s coefficients toward zero. With L1 (Lasso) regularization, the coefficients of irrelevant features become exactly zero, which removes those features and helps avoid overfitting.
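A minimal sketch of L1 regularization acting as embedded feature selection, using scikit-learn’s `Lasso` and `SelectFromModel` on synthetic data (the data-generating setup is illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The L1 penalty shrinks coefficients; irrelevant ones are driven to (near) zero
lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", lasso.coef_)

# SelectFromModel turns the fitted coefficients into a feature mask
selector = SelectFromModel(lasso, prefit=True)
print("selected mask:", selector.get_support())
```

The selection happens as a side effect of training the model, which is what distinguishes embedded methods from the filter and wrapper approaches above.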