Optimal Feature Selection Learners
OptimalFeatureSelection provides many learners for training optimal feature selection models, which we describe on this page along with a guide to their parameters.
The Feature Selection Problem
The goal of feature selection is to find a small set of the features in the data that best predict the outcome. Mathematically, the problem being solved is:
\[\min_{\mathbf{w},b} \sum_{i = 1}^n \left[\ell (y_i, \mathbf{w}^T \mathbf{x}_i + b)\right] + \frac{1}{2\gamma} {\| \mathbf{w} \|}^2_2 ~~~ \text{s.t.} ~~~ {\| \mathbf{w} \|}_0 \leq k\]
We are trying to fit a linear model $(\mathbf{w}, b)$ to the training data. The first term measures the error made by the model on the training data under a specified scoring function $\ell$. The second term regularizes the model to prevent overfitting and make it robust to perturbations in the data. The constraint restricts the model to selecting at most $k$ of the features in the data; the coefficients for all other features are zero.
Depending on the choice of scoring criterion $\ell$ used in the first term, this problem represents many classical problems in machine learning:
- classification:
- the entropy criterion leads to logistic regression
- the L1 hinge loss criterion leads to support vector machines
- the L2 hinge loss criterion leads to L2 support vector machines
- regression:
- the mean-squared error criterion leads to linear regression
- the L1 hinge loss criterion leads to L1 support vector regression
- the L2 hinge loss criterion leads to L2 support vector regression
Note that in the mean-squared error scenario for regression, if we relax the constraint to use the 1-norm instead of the 0-norm, we recover the elastic net, and additionally dropping the regularization term recovers the lasso.
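For concreteness, a sketch of this relaxed problem under the mean-squared error criterion, which is the constrained formulation of the elastic net (here $t$ bounds the 1-norm of the coefficients):
\[\min_{\mathbf{w},b} \sum_{i = 1}^n \left(y_i - \mathbf{w}^T \mathbf{x}_i - b\right)^2 + \frac{1}{2\gamma} {\| \mathbf{w} \|}^2_2 ~~~ \text{s.t.} ~~~ {\| \mathbf{w} \|}_1 \leq t\]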
Solving the Feature Selection Problem
OptimalFeatureSelection provides two approaches for solving the above problem.
Exact Approach
The exact approach solves the feature selection problem to exact optimality using a mixed-integer optimization solver. This provides a certifiably-optimal solution to the problem, but typically requires a commercial mixed-integer optimization solver like Gurobi or CPLEX to solve in a reasonable amount of time.
Relaxation Approach
The relaxation approach solves the feature selection problem using a relaxation of the problem that enables a specialized solution algorithm. This approach is significantly faster than the exact approach, and typically finds the same optimal solution, albeit without a proof of optimality. This approach does not require a mixed-integer optimization solver.
Shared Parameters
All of the learners provided by OptimalFeatureSelection are OptimalFeatureSelectionLearners. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptimalFeatureSelection algorithm.
Note that the shared parameter normalize_X has no effect for OptimalFeatureSelectionLearners.
sparsity
sparsity controls the maximum number of features allowed in the fitted model (the parameter $k$ in the problem formulation). Each numeric feature or categoric level assigned a non-zero weight will count towards this sparsity restriction. This parameter must always be explicitly set or tuned. See RelativeParameterInput for the different ways to specify this value.
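For illustration, sparsity can be fixed when constructing the learner, or tuned over a range using IAI.GridSearch (X and y here denote a hypothetical feature matrix and target):
# Fix the sparsity directly
lnr = IAI.OptimalFeatureSelectionRegressor(sparsity=5)

# Or tune it over a range of values
grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionRegressor(),
    sparsity=2:10,
)
IAI.fit!(grid, X, y)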
gamma
gamma controls the degree of regularization applied to the problem (the parameter $\gamma$ in the problem formulation). The default value is :auto, which uses a value of $1/\sqrt{n}$ that gives good results in most cases. You can supply or tune this value manually by passing numeric values if you would like more control over the regularization, but we have found empirically that this has little effect on performance.
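For example, a minimal sketch of supplying a fixed value rather than the :auto default (it could equally be included in a GridSearch as shown above):
lnr = IAI.OptimalFeatureSelectionRegressor(sparsity=5, gamma=0.01)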
relaxation
relaxation controls which approach will be used to solve the problem. The default is true, meaning the relaxation approach will be used. If set to false, the exact approach will be used instead.
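For example, a minimal sketch of selecting each approach:
# Default: relaxation approach, no mixed-integer optimization solver required
lnr_relax = IAI.OptimalFeatureSelectionClassifier(sparsity=5)

# Exact approach: requires a mixed-integer optimization solver (see solver below)
lnr_exact = IAI.OptimalFeatureSelectionClassifier(sparsity=5, relaxation=false)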
refit_learner
refit_learner can be used to automatically refit a new model on the subset of features selected by the Optimal Feature Selection learner. As an example, you can use GLMNetCVRegressor or GLMNetCVClassifier to refit an L1-regularized model that uses the selected features, which can sometimes improve the model performance in very noisy data.
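For example, a sketch of refitting with a glmnet-based learner on the selected features (the constructor arguments here are illustrative):
lnr = IAI.OptimalFeatureSelectionRegressor(
    sparsity=10,
    refit_learner=IAI.GLMNetCVRegressor(),
)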
solver
solver controls the mixed-integer optimization solver used to solve the problem when using the exact approach (i.e. relaxation=false). You need to pass a solver that supports lazy constraint callbacks, such as Gurobi or CPLEX. Refer to the JuMP documentation for information on further customization of solver behavior, such as using optimizer_with_attributes to supply additional parameters to the solver.
For example, you could specify Gurobi with a 60 second time limit using:
using Gurobi
using JuMP
lnr = IAI.OptimalFeatureSelectionClassifier(
relaxation=false,
solver=optimizer_with_attributes(
Gurobi.Optimizer, "TimeLimit" => 60
),
)
Similarly, you could specify CPLEX with a 60 second time limit using:
using CPLEX
using JuMP
lnr = IAI.OptimalFeatureSelectionClassifier(
relaxation=false,
solver=optimizer_with_attributes(
CPLEX.Optimizer, "CPX_PARAM_TILIM" => 60
),
)
We do not recommend using the exact approach with the open-source GLPK solver, as it is simply not fast enough to make any progress solving these models.
solver_disable_gc
solver_disable_gc is a Bool defaulting to false. If set to true, the Julia garbage collector will be disabled while the mixed-integer optimization solver is running, which may be necessary in some cases.
coordinated_sparsity
coordinated_sparsity is a Bool defaulting to false that must be set to true to enable coordinated-sparsity fitting across multiple clusters of data. Refer to the documentation on coordinated-sparsity fitting for more information.
coordinated_sparsity_scale_per_cluster
coordinated_sparsity_scale_per_cluster is a Bool controlling the data scaling behavior during coordinated-sparsity fitting. If set to true, the problem in each cluster will be scaled independently of all other clusters. The default value is false, which applies the same common scaling to all clusters.
Classification Learners
The OptimalFeatureSelectionClassifier is used for conducting optimal feature selection on classification problems. There are no additional parameters beyond the shared parameters. The permitted values for criterion correspond to the classification criteria described above (entropy, L1 hinge loss, and L2 hinge loss).
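As a minimal end-to-end sketch (the data here is synthetic and purely illustrative):
using DataFrames

X = DataFrame(x1=randn(200), x2=randn(200), x3=randn(200))
y = [x > 0 ? "A" : "B" for x in X.x1]

lnr = IAI.OptimalFeatureSelectionClassifier(sparsity=2)
IAI.fit!(lnr, X, y)
IAI.predict(lnr, X)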
Regression Learners
The OptimalFeatureSelectionRegressor is used for conducting optimal feature selection on regression problems. The permitted values for criterion correspond to the regression criteria described above (mean-squared error, L1 hinge loss, and L2 hinge loss).
In addition to the shared parameters, these learners also support the shared regression parameters.
Similar to normalize_X, the shared regression parameter normalize_y has no effect for OptimalFeatureSelectionRegressors.
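As a minimal end-to-end sketch for regression (again with synthetic, purely illustrative data):
using DataFrames

X = DataFrame(randn(200, 5), :auto)   # columns x1, ..., x5
y = X.x1 .+ 0.5 .* X.x2 .+ 0.1 .* randn(200)

lnr = IAI.OptimalFeatureSelectionRegressor(sparsity=2)
IAI.fit!(lnr, X, y)
IAI.score(lnr, X, y)   # evaluate the fitted model (in-sample here, for illustration)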