3.5. Feature selection
3.5.1. Univariate feature selection
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. scikits.learn exposes feature selection routines as objects that implement the transform method. The k-best features can be selected based on:
- scikits.learn.feature_selection.univariate_selection.SelectKBest(score_func, k=10)
Filter : Select the k lowest p-values
or by setting a percentile of features to keep using
- scikits.learn.feature_selection.univariate_selection.SelectPercentile(score_func, percentile=10)
Filter : Select the best percentile of the p-values
or using common statistical quantities:
- scikits.learn.feature_selection.univariate_selection.SelectFpr(score_func, alpha=0.05)
Filter : Select the p-values below alpha
- scikits.learn.feature_selection.univariate_selection.SelectFdr(score_func, alpha=0.05)
Filter : Select the p-values corresponding to an estimated false discovery rate of alpha. This uses the Benjamini-Hochberg procedure.
- scikits.learn.feature_selection.univariate_selection.SelectFwe(score_func, alpha=0.05)
Filter : Select the p-values that stay below alpha after correction for multiple comparisons (family-wise error rate)
These objects take as input a scoring function that returns univariate p-values.
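The five filters above differ only in how they turn a list of per-feature p-values into a set of kept features. The following is a minimal sketch of those rules in plain Python, not the library's implementation. All function names are hypothetical, the rounding rule in the percentile filter is an assumption, and the family-wise error filter is assumed here to use a Bonferroni correction, which the text above does not state.

```python
# Illustrative helpers (hypothetical names, not the library's API).
# Each takes a list of per-feature p-values and returns the sorted
# indices of the features that the corresponding filter would keep.

def select_k_best(pvalues, k=10):
    """Keep the k features with the lowest p-values."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    return sorted(order[:k])


def select_percentile(pvalues, percentile=10):
    """Keep the given percentage of features with the lowest p-values."""
    # Assumption: round the count down, but keep at least one feature.
    k = max(1, len(pvalues) * percentile // 100)
    return select_k_best(pvalues, k)


def select_fpr(pvalues, alpha=0.05):
    """Keep features whose raw p-value is below alpha."""
    return [i for i, p in enumerate(pvalues) if p < alpha]


def select_fdr(pvalues, alpha=0.05):
    """Benjamini-Hochberg: keep the j smallest p-values, where j is the
    largest rank such that p_(j) <= alpha * j / n."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / n:
            cutoff = rank
    return sorted(order[:cutoff])


def select_fwe(pvalues, alpha=0.05):
    """Assumption: Bonferroni correction, i.e. a threshold of alpha / n."""
    n = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p < alpha / n]


# Made-up p-values for eight features, for demonstration only.
pvalues = [0.001, 0.04, 0.03, 0.2, 0.012, 0.6, 0.009, 0.8]
kept_k = select_k_best(pvalues, k=3)
kept_pct = select_percentile(pvalues, percentile=25)
kept_fpr = select_fpr(pvalues)
kept_fdr = select_fdr(pvalues)
kept_fwe = select_fwe(pvalues)
```

On this toy input the filters get progressively stricter: the raw threshold keeps five features, the false discovery rate rule keeps three, and the Bonferroni-style rule keeps one.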
Examples:
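As a rough illustration of how a selection object, a scoring function and the transform method fit together, here is a small stand-in written in plain Python. It is a sketch under stated assumptions, not the library's code: the class `KBestSelector`, its `support_` attribute and `toy_score_func` are hypothetical, and the toy scorer returns placeholder values that only mimic the ordering of p-values.

```python
class KBestSelector:
    """Hypothetical stand-in for a selection object with fit/transform."""

    def __init__(self, score_func, k=10):
        self.score_func = score_func
        self.k = k

    def fit(self, X, y):
        # score_func is assumed to return (scores, p-values), one per feature.
        _, pvalues = self.score_func(X, y)
        order = sorted(range(len(pvalues)), key=lambda j: pvalues[j])
        # Indices of the k features with the lowest p-values.
        self.support_ = sorted(order[: self.k])
        return self

    def transform(self, X):
        # Keep only the selected columns of every row.
        return [[row[j] for j in self.support_] for row in X]


def toy_score_func(X, y):
    """Toy scorer for binary labels: absolute difference of class means.

    The second return value is a placeholder that shrinks as the score
    grows. It is NOT a real p-value and is used only for this demo.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    pseudo_pvalues = [1.0 / (1.0 + s) for s in scores]
    return scores, pseudo_pvalues


# Made-up data: only the second column separates the two classes well.
X = [[0.1, 5.0, 1.0], [0.2, 4.0, 1.1], [0.1, 9.0, 0.9], [0.2, 10.0, 1.0]]
y = [0, 0, 1, 1]

selector = KBestSelector(toy_score_func, k=1).fit(X, y)
X_reduced = selector.transform(X)
```

Separating `fit` from `transform` is what lets the selection be learned on training data and then applied unchanged to new data before it reaches an estimator.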
3.5.1.1. Feature scoring functions
Warning
Be careful not to use a regression scoring function with a classification problem.
3.5.1.1.1. For classification
- scikits.learn.feature_selection.univariate_selection.f_classif(X, y)
Compute the ANOVA F-value for the provided sample.
Parameters : X : array of shape (n_samples, n_features)
the set of regressors that will be tested sequentially
y : array of shape (n_samples)
the target vector (class labels)
Returns : F : array of shape (n_features)
the set of F values
pval : array of shape (n_features)
the set of p-values
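To make the ANOVA F-value concrete, the sketch below computes the textbook one-way ANOVA statistic for each feature, grouping the samples by class label. It is an illustration in plain Python, not the library's implementation. The helper name `anova_f_values` is hypothetical, and the p-value step is left out because it requires the survival function of the F distribution.

```python
def anova_f_values(X, y):
    """One-way ANOVA F-value for every column of X, grouped by label y.

    X is a list of rows (n_samples x n_features), y a list of labels.
    Hypothetical helper for illustration; p-values are not computed.
    """
    n_samples, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    f_values = []
    for j in range(n_features):
        column = [row[j] for row in X]
        grand_mean = sum(column) / n_samples
        ss_between = 0.0  # variation of the class means around the grand mean
        ss_within = 0.0   # variation of the samples around their class mean
        for c in classes:
            group = [v for v, label in zip(column, y) if label == c]
            mean = sum(group) / len(group)
            ss_between += len(group) * (mean - grand_mean) ** 2
            ss_within += sum((v - mean) ** 2 for v in group)
        df_between = len(classes) - 1
        df_within = n_samples - len(classes)
        f_values.append((ss_between / df_between) / (ss_within / df_within))
    return f_values


# Made-up data: the first feature separates the classes, the second does not.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [7.0, 4.0], [8.0, 5.0], [9.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
F = anova_f_values(X, y)
```

A large F value means the class means of that feature differ by much more than the spread within each class, which is what makes the feature informative for classification.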
3.5.1.1.2. For regression
- scikits.learn.feature_selection.univariate_selection.f_regression(X, y, center=True)
Quick linear model for testing the effect of a single regressor, sequentially for many regressors. This is done in 3 steps:
1. the regressor of interest and the data are orthogonalized with respect to constant regressors
2. the cross correlation between data and regressors is computed
3. it is converted to an F score, then to a p-value
Parameters : X : array of shape (n_samples, n_features)
the set of regressors that will be tested sequentially
y : array of shape (n_samples)
the data vector (target values)
center : bool, default True
If true, X and y are centered
Returns : F : array of shape (n_features)
the set of F values
pval : array of shape (n_features)
the set of p-values
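The three steps described for f_regression can be sketched in plain Python as follows. This is an illustration under assumptions, not the library's code. The helper name `f_regression_scores` is hypothetical, and the degrees of freedom (n_samples - 2 when centering, n_samples - 1 otherwise) are assumed. The final conversion to a p-value is omitted because it needs the survival function of the F distribution.

```python
import math


def f_regression_scores(X, y, center=True):
    """F score of each column of X against y (hypothetical helper).

    X is a list of rows (n_samples x n_features), y a list of floats.
    """
    n_samples, n_features = len(X), len(X[0])
    y = list(y)
    # Step 1: orthogonalize with respect to the constant regressor,
    # which amounts to removing the mean.
    if center:
        y_mean = sum(y) / n_samples
        y = [v - y_mean for v in y]
    y_norm = math.sqrt(sum(v * v for v in y))
    # Assumed degrees of freedom: one is used by the slope and,
    # when centering, one more by the intercept.
    dof = n_samples - 2 if center else n_samples - 1

    f_values = []
    for j in range(n_features):
        x = [row[j] for row in X]
        if center:
            x_mean = sum(x) / n_samples
            x = [v - x_mean for v in x]
        x_norm = math.sqrt(sum(v * v for v in x))
        # Step 2: cross correlation between the regressor and the data.
        corr = sum(a * b for a, b in zip(x, y)) / (x_norm * y_norm)
        # Step 3: convert the correlation to an F score.
        f_values.append(corr ** 2 / (1.0 - corr ** 2) * dof)
    return f_values


# Made-up data: the first regressor correlates with y (r = 0.8),
# the second is uncorrelated with y after centering.
X = [[1.0, 3.0], [2.0, 0.0], [3.0, 2.0], [4.0, 3.0], [5.0, 2.0]]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
f_values = f_regression_scores(X, y)
```

Because each regressor is tested on its own, the score ignores interactions between features. A perfectly correlated regressor would make the denominator zero, a case this sketch does not handle.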