Feature selection

Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data is selected for application of a learning algorithm. Feature selection is necessary either because it is computationally infeasible to use all available features, or because of problems of estimation when only limited data samples (but a large number of features) are present. The latter problem is related to the so-called curse of dimensionality, which refers to the fact that the number of data samples required to estimate an arbitrary multivariate probability distribution increases exponentially as the number of dimensions in the data increases linearly.

Simple feature selection algorithms are ad hoc, but there are also more methodical approaches. From a theoretical perspective, it can be shown that optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality. If large numbers of features are available, this is impractical. For practical supervised learning algorithms, the search is therefore for a satisfactory set of features rather than an optimal one. Many popular approaches are greedy hill-climbing methods: a candidate subset of features is evaluated and then modified to see whether an improved subset can be found. Subsets can be evaluated in many ways; some metric is used to score individual features, and possibly combinations of features. Since exhaustive search is generally impractical, the algorithm stops at some point and selects the subset of features with the highest score under that metric. The stopping point varies by algorithm.
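
As a rough illustration, a greedy forward hill-climbing search might be sketched as follows in Python. The score function is a placeholder for whatever evaluation metric is chosen, and all_features is assumed to be a set of feature identifiers; the loop stops as soon as no single-feature addition improves the score.

 def greedy_forward_selection(all_features, score):
     """Greedy hill climbing: repeatedly add the single feature that most
     improves the score, stopping when no addition helps.
     all_features: set of feature identifiers (assumed).
     score: callable mapping a feature subset to a number (placeholder)."""
     selected = set()
     best_score = score(selected)
     improved = True
     while improved:
         improved = False
         for f in all_features - selected:
             candidate = selected | {f}
             candidate_score = score(candidate)
             if candidate_score > best_score:
                 best_score, best_subset = candidate_score, candidate
                 improved = True
         if improved:
             selected = best_subset
     return selected, best_score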

Two popular metrics for classification problems are correlation and mutual information. These metrics are computed between a candidate feature (or set of features) and the desired output category.
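
As an illustration, mutual information between a discrete candidate feature and the class label can be estimated from empirical counts. The sketch below is plain Python with illustrative names; it scores a single feature against the labels, and would be applied to each candidate feature in turn.

 from collections import Counter
 from math import log2

 def mutual_information(feature_values, labels):
     """Estimate I(X; Y) from paired samples of a discrete feature X
     and a class label Y, using empirical relative frequencies."""
     n = len(labels)
     joint_counts = Counter(zip(feature_values, labels))
     x_counts = Counter(feature_values)
     y_counts = Counter(labels)
     mi = 0.0
     for (x, y), c in joint_counts.items():
         p_joint = c / n
         # p_joint * n * n / (count_x * count_y) = p(x, y) / (p(x) * p(y))
         mi += p_joint * log2(p_joint * n * n / (x_counts[x] * y_counts[y]))
     return mi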

In statistics, the most popular form of feature selection is stepwise regression. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation; in statistics, some criterion is optimized. However, this leads to the inherent problem of nesting. More robust methods have been explored, such as Branch and Bound and Piecewise Linear Networks.
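
A sketch of the machine-learning variant, using cross-validated score as the stopping rule, might look like the following. The scikit-learn estimator, the data matrix X, the labels y, and the fold count are all placeholders; features are added greedily while the cross-validated score keeps improving.

 import numpy as np
 from sklearn.model_selection import cross_val_score

 def forward_stepwise(estimator, X, y, cv=5):
     """Forward stepwise selection: at each round, add the feature whose
     inclusion gives the largest cross-validated score; stop when no
     single addition improves the best score found so far."""
     remaining = list(range(X.shape[1]))
     selected = []
     best_score = -np.inf
     while remaining:
         scores = [(np.mean(cross_val_score(estimator, X[:, selected + [j]], y, cv=cv)), j)
                   for j in remaining]
         score, j = max(scores)
         if score <= best_score:
             break  # no remaining feature improves the CV score; stop
         best_score = score
         selected.append(j)
         remaining.remove(j)
     return selected, best_score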

Optimality criteria

There are a variety of optimality criteria that can be used for controlling feature selection. The oldest are Colin Mallows's Cp statistic and the Akaike information criterion (AIC). These both add a variable if its t statistic is bigger than \sqrt{2}. Other criteria are BIC (which uses a threshold of \sqrt{\log{n}}), MDL (which asymptotically uses \sqrt{\log{n}}, though some argue this asymptotic is not computed correctly), Bonferroni / RIC (which use \sqrt{2\log{p}}), and a variety of newer criteria motivated by FDR (which use something close to \sqrt{2\log{\frac{p}{q}}}).
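
As a worked illustration of how these criteria act as thresholds on a variable's t statistic, the following sketch compares a hypothetical t statistic against the thresholds above for assumed values of n (sample size) and p (number of candidate variables):

 from math import sqrt, log

 n, p = 1000, 50  # hypothetical sample size and number of candidate variables
 thresholds = {
     "AIC / Cp": sqrt(2),                 # ~1.41
     "BIC": sqrt(log(n)),                 # ~2.63 for n = 1000
     "Bonferroni / RIC": sqrt(2 * log(p)) # ~2.80 for p = 50
 }
 t_stat = 2.1  # t statistic of a candidate variable (illustrative)
 for name, thr in thresholds.items():
     print(f"{name}: add variable? {abs(t_stat) > thr} (threshold {thr:.2f})")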
