Feature selection
From Wikipedia, the free encyclopedia
Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique, commonly used in machine learning, of selecting a subset of relevant features for building robust learning models. When applied in the biology domain, the technique is also called discriminative gene selection, which detects influential genes based on DNA microarray data. By removing most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models by:
- Alleviating the effect of the curse of dimensionality.
- Enhancing generalization capability.
- Speeding up the learning process.
- Improving model interpretability.
Feature selection also helps people better understand their data by showing which features are important and how they are related to each other.
Introduction
Simple feature selection algorithms are ad hoc, but there are also more methodical approaches. From a theoretical perspective, it can be shown that optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality. If large numbers of features are available, this is impractical. For practical supervised learning algorithms, the search is for a satisfactory set of features instead of an optimal set.
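To give a feel for why exhaustive search becomes impractical, the following minimal sketch counts the candidate subsets an exhaustive search would have to evaluate. The function names are illustrative only and do not come from any particular library.

```python
from math import comb


def subsets_of_size(n_features: int, k: int) -> int:
    """Number of distinct feature subsets of cardinality k."""
    return comb(n_features, k)


def all_nonempty_subsets(n_features: int) -> int:
    """Number of non-empty feature subsets of any cardinality."""
    return 2 ** n_features - 1


# Each subset would need its own model fit and evaluation.
print(subsets_of_size(20, 5))        # 15504
print(all_nonempty_subsets(20))      # 1048575
print(subsets_of_size(1000, 10))     # grows far beyond what can be enumerated
```

Even with a fixed cardinality, the count grows combinatorially with the number of available features, which is why practical methods settle for a satisfactory subset.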
Many popular approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies it and checks whether the modified subset is better. Subsets can be evaluated in many ways: some metric is used to score the individual features, and possibly combinations of features. Since exhaustive search is generally impractical, the search halts at some stopping point, and the subset of features with the highest score under the metric is selected. The stopping criterion varies by algorithm.
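The greedy hill-climbing idea can be sketched as a forward search that keeps adding the single feature that most improves a score and stops when no addition helps. This is one possible variant, not a reference implementation: the scoring function, the feature names and the toy weights below are all hypothetical and stand in for whatever metric a real algorithm would use.

```python
from typing import Callable, FrozenSet, Sequence


def greedy_forward_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    max_features: int,
) -> FrozenSet[str]:
    """Hill climbing: repeatedly add the one feature that raises the score most."""
    selected: FrozenSet[str] = frozenset()
    best = score(selected)
    while len(selected) < max_features:
        candidates = [selected | {f} for f in features if f not in selected]
        if not candidates:
            break
        challenger = max(candidates, key=score)
        challenger_score = score(challenger)
        if challenger_score <= best:
            break  # stopping criterion: no single addition improves the score
        selected, best = challenger, challenger_score
    return selected


# Hypothetical scoring function: per-feature gain, a penalty for a redundant
# pair, and a small cost for every feature kept.
GAIN = {"age": 0.30, "income": 0.25, "zip": 0.02, "noise": -0.05}


def toy_score(subset: FrozenSet[str]) -> float:
    total = sum(GAIN[f] for f in subset)
    if {"age", "income"} <= subset:
        total -= 0.10
    return total - 0.03 * len(subset)


chosen = greedy_forward_selection(list(GAIN), toy_score, max_features=4)
print(sorted(chosen))  # ['age', 'income']
```

The search stops at a local optimum: it never reconsiders earlier choices, so the result is a satisfactory subset rather than a provably optimal one.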
Two popular metrics for classification problems are correlation and mutual information. These metrics are computed between a candidate feature (or set of features) and the desired output category.
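As an illustration of the two metrics, the sketch below computes the Pearson correlation and the mutual information (in bits, estimated from empirical counts) between a discrete feature and a class label. The data are made up for the example, and the helper names are not taken from any library.

```python
from collections import Counter
from math import log2, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)  # assumes neither sequence is constant


def mutual_information(x, y):
    """Mutual information in bits between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


# Made-up data: the label is 1 exactly when the feature is 0 or 2.
feature = [0, 1, 2, 0, 1, 2]
label = [1, 0, 1, 1, 0, 1]

print(pearson(feature, label))             # linear association only
print(mutual_information(feature, label))  # any statistical dependence
```

In this toy case the correlation is zero because the relationship is not linear, while the mutual information is positive because the label is fully determined by the feature. This is one reason the two metrics can rank features differently.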
In statistics, the most popular form of feature selection is stepwise regression, a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation; in statistics, some criterion is optimized instead. Because each round only adds or deletes a single feature and earlier decisions are never revisited, stepwise methods suffer from the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear networks.
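A minimal sketch of forward stepwise regression with cross-validation as the stopping rule is given below. It is written with the standard library only, so ordinary least squares is solved through the normal equations by Gauss-Jordan elimination; the function names, the fold assignment and the synthetic data are assumptions made for the example, not part of any standard procedure.

```python
from math import cos, sin


def solve(a, b):
    """Solve the linear system a x = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, y, cols):
    """Least-squares coefficients (intercept first) using only columns in cols."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def cv_error(rows, y, cols, folds=5):
    """Mean squared prediction error under k-fold cross-validation."""
    total = 0.0
    for f in range(folds):
        train = [i for i in range(len(rows)) if i % folds != f]
        held_out = [i for i in range(len(rows)) if i % folds == f]
        beta = fit([rows[i] for i in train], [y[i] for i in train], cols)
        for i in held_out:
            pred = beta[0] + sum(b * rows[i][c] for b, c in zip(beta[1:], cols))
            total += (y[i] - pred) ** 2
    return total / len(rows)


def forward_stepwise(rows, y, folds=5):
    """Add the best feature each round; stop when the CV error stops improving."""
    selected, best = [], cv_error(rows, y, [], folds)
    remaining = set(range(len(rows[0])))
    while remaining:
        err, col = min((cv_error(rows, y, selected + [c], folds), c) for c in remaining)
        if err >= best:
            break
        selected.append(col)
        remaining.discard(col)
        best = err
    return selected


# Synthetic data: only columns 0 and 2 actually drive the response.
rows = [[sin(i), cos(2 * i), sin(3 * i + 1), cos(5 * i)] for i in range(40)]
y = [3 * r[0] - 2 * r[2] + 0.05 * sin(7 * i) for i, r in enumerate(rows)]

selected = forward_stepwise(rows, y)
print(selected)  # columns in the order they were added
```

The order of `selected` shows the nesting: each later model contains every feature chosen before it, and a feature, once added, is never dropped.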
Optimality criteria
There are a variety of optimality criteria that can be used for controlling feature selection. The oldest are Mallows' Cp statistic and the Akaike information criterion (AIC). These add variables if the t statistic is bigger than $\sqrt{2}$.
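Other criteria are the Bayesian information criterion (BIC), which uses $\sqrt{\log n}$; minimum description length (MDL), which asymptotically uses $\sqrt{\log n}$, although some argue this asymptote is not computed correctly[citation needed]; Bonferroni / RIC, which use $\sqrt{2\log p}$; and a variety of new criteria motivated by the false discovery rate (FDR), which use something close to $\sqrt{2\log(p/q)}$. Here n is the number of observations, p the number of candidate features, and q the number of features already selected.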
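To compare how strict these criteria are, the sketch below evaluates the t-statistic thresholds usually associated with them: $\sqrt{2}$ for AIC and Cp, $\sqrt{\log n}$ for BIC, and $\sqrt{2\log p}$ for Bonferroni / RIC, where n is the number of observations and p the number of candidate features. The function name and the sample sizes are assumptions chosen for illustration.

```python
from math import log, sqrt


def t_thresholds(n_samples: int, n_candidates: int) -> dict:
    """Approximate |t| a feature must exceed to be added under each criterion."""
    return {
        "AIC / Cp": sqrt(2),
        "BIC": sqrt(log(n_samples)),
        "Bonferroni / RIC": sqrt(2 * log(n_candidates)),
    }


# Hypothetical problem size: 1000 observations, 50 candidate features.
for name, value in t_thresholds(n_samples=1000, n_candidates=50).items():
    print(f"{name:>17}: add a feature if |t| > {value:.3f}")
```

For this problem size the thresholds increase from AIC to BIC to RIC, so AIC admits the most features and RIC the fewest. The ordering of BIC and RIC depends on how n compares with p.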
minimum-Redundancy-Maximum-Relevance
Selecting the features that correlate most strongly with the classification variable has been called "maximum-relevance selection". Many heuristic algorithms can be used, such as sequential forward, backward, or floating selection.
On the other hand, features can be selected to be different from each other, while they still have high correlation to the classification variable. This scheme, called "minimum-Redundancy-Maximum-Relevance" selection (mRMR), has been found to be more powerful than the maximum relevance selection.
As a special case, statistical dependence between variables can be used instead of correlation, with mutual information quantifying the dependency. In this case, it has been shown that mRMR is an approximation to maximizing the dependency between the joint distribution of the selected features and the classification variable.
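The contrast between maximum-relevance and mRMR selection can be sketched with a small incremental search based on mutual information. The version below scores each candidate by its relevance minus its mean redundancy with the features already chosen (a "difference" form of the criterion). The helper names and the toy data are assumptions made for the example and are not taken from the mRMR program listed on this page.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def mutual_information(x: Sequence, y: Sequence) -> float:
    """Mutual information in bits between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def max_relevance(features: Dict[str, Sequence], target: Sequence, k: int) -> List[str]:
    """Pick the k features with the highest mutual information with the target."""
    relevance = {f: mutual_information(col, target) for f, col in features.items()}
    return sorted(relevance, key=relevance.get, reverse=True)[:k]


def mrmr(features: Dict[str, Sequence], target: Sequence, k: int) -> List[str]:
    """Greedy mRMR: maximize relevance minus mean redundancy with chosen features."""
    relevance = {f: mutual_information(col, target) for f, col in features.items()}
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(k, len(features)):

        def criterion(name: str) -> float:
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        remaining = [f for f in features if f not in selected]
        selected.append(max(remaining, key=criterion))
    return selected


# Toy data: "a_copy" duplicates "a"; "b" is less relevant but not redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 1, 0, 0, 1, 1, 1],
}

print(max_relevance(features, target, 2))  # ['a', 'a_copy']
print(mrmr(features, target, 2))           # ['a', 'b']
```

Maximum-relevance selection spends its second slot on a duplicate that adds no new information, whereas the redundancy term steers mRMR toward the less relevant but complementary feature.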
Methods incorporating Feature Selection
- Random forests (RF)
- Random multinomial logit (RMNL)
- Ridge regression
- Decision tree
- Many other machine learning methods applying a pruning step.
See also
- Dimensionality reduction
- Feature extraction
- Data mining
- YALE, a freely available open-source software for intelligent data analysis, knowledge discovery, data mining, machine learning, visualization, etc. featuring numerous feature generation and feature selection operators.
- Weka, a Java software package including a collection of machine learning algorithms for data mining tasks.
External links
- NIPS challenge 2003 (see also NIPS)
- Naive Bayes implementation with feature selection in Visual Basic (includes executable and source code)
- minimum-Redundancy-Maximum-Relevance (mRMR) feature selection program