Optimal Feature Selection Learners

OptimalFeatureSelection provides many learners for training optimal feature selection models, which we describe on this page along with a guide to their parameters.

The Feature Selection Problem

The goal of feature selection is to find a small set of the features in the data that best predict the outcome. Mathematically, the problem being solved is:

\[\min_{\mathbf{w},b} \sum_{i = 1}^n \left[\ell (y_i, \mathbf{w}^T \mathbf{x}_i + b)\right] + \frac{1}{2\gamma} {\| \mathbf{w} \|}^2_2 ~~~ \text{s.t.} ~~~ {\| \mathbf{w} \|}_0 \leq k\]

We are trying to fit a linear model $(\mathbf{w}, b)$ to the training data. The first term calculates the error made by the model when applied to the training data for a specified scoring function $\ell$. The second term is a regularization term added to prevent overfitting and make the model robust to perturbations in the data. The constraint restricts the model to selecting at most $k$ of the features in the data - the coefficients for all other features are zero.

Depending on the choice of scoring criterion $\ell$ used in the first term, this problem represents many classical problems in machine learning:

classification:
- the entropy criterion leads to logistic regression
- the L1 hinge loss criterion leads to support vector machines
- the L2 hinge loss criterion leads to L2 support vector machines
regression:
- the mean-squared error criterion leads to linear regression
- the L1 hinge loss criterion leads to L1 support vector regression
- the L2 hinge loss criterion leads to L2 support vector regression

Note that in the mean-squared error scenario for regression, if we relax the constraint to use the 1-norm instead of the 0-norm, we recover the elastic net and lasso.

Solving the Feature Selection Problem

OptimalFeatureSelection provides two approaches for solving the above problem.

Exact Approach

The exact approach solves the feature selection problem to exact optimality using a mixed-integer optimization solver. This provides a certifiably-optimal solution to the problem, but typically requires a commercial mixed-integer optimization solver like Gurobi or CPLEX to solve in a reasonable amount of time.

Relaxation Approach

The relaxation approach solves the feature selection problem using a relaxation of the problem that enables a specialized solution algorithm. This approach is significantly faster than the exact approach, and typically finds the same optimal solution, albeit without a proof of optimality. This approach does not require a mixed-integer optimization solver.

Shared Parameters

All of the learners provided by OptimalFeatureSelection are OptimalFeatureSelectionLearners. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptimalFeatureSelection algorithm.

Note that the shared parameter normalize_X has no effect for OptimalFeatureSelectionLearners.

`sparsity`

sparsity controls the maximum number of features allowed in the fitted model (the parameter $k$ in the problem formulation). Each numeric feature or categoric level assigned a non-zero weight will count towards this sparsity restriction. This parameter must always be explicitly set or tuned. See RelativeParameterInput for the different ways to specify this value.

`gamma`

gamma controls the degree of regularization applied to the problem (the parameter $\gamma$ in the problem formulation). The default value is :auto, which uses a value of $1/\sqrt{n}$ that gives good results in most cases. It is possible to supply and/or tune this value manually by passing numeric values if you would like more control over the regularization, but we have found empirically that this has little effect on performance.

`relaxation`

relaxation controls which approach will be used to solve the problem. The default is true, meaning the relaxation approach will be used. If set to false, the exact approach will be used instead.

`refit_learner`

refit_learner can be used to automatically refit a new model on the subset of features selected by the Optimal Feature Selection Learner. As an example, you can use GLMNetCVRegressor or GLMNetCVClassifier to refit an L1-regularized model that uses the selected features, which can sometimes improve the model performance in very noisy data.

`solver`

solver controls the mixed-integer optimization solver used to solve the problem when using the exact approach (i.e. relaxation=false). You need to pass a solver that supports lazy constraint callbacks:

Gurobi.Optimizer via Gurobi.jl
CPLEX.Optimizer via CPLEX.jl

Refer to the JuMP documentation for information on further customization of solver behavior, such as using optimizer_with_attributes to supply additional parameters to the solver.

For example, you could specify Gurobi with a 60 second time limit using:

using Gurobi
using JuMP
lnr = IAI.OptimalFeatureSelectionClassifier(
    relaxation=false,
    solver=optimizer_with_attributes(
        Gurobi.Optimizer, "TimeLimit" => 60
    ),
)

Similarly, you could specify CPLEX with a 60 second time limit using:

using CPLEX
using JuMP
lnr = IAI.OptimalFeatureSelectionClassifier(
    relaxation=false,
    solver=optimizer_with_attributes(
        CPLEX.Optimizer, "CPX_PARAM_TILIM" => 60
    ),
)

We do not recommend trying to use the exact mode with the open-source GLPK solver, as it is simply not fast enough to make any progress solving the models.

`solver_disable_gc`

solver_disable_gc is a Bool defaulting to false. If set to true, the Julia garbage collector will be disabled while the mixed-integer optimization solver is running, which may be necessary in some cases.

`coordinated_sparsity`

coordinated_sparsity is a Bool defaulting to false that must be set to true to enable coordinated-sparsity fitting across multiple clusters of data. Refer to the documentation on coordinated-sparsity fitting for more information.

`coordinated_sparsity_scale_per_cluster`

coordinated_sparsity_scale_per_cluster is a Bool controlling the data scaling behavior during coordinated-sparsity fitting. If set to true, the problem in each cluster will be scaled independently of all other clusters. The default value is false, which applies the same common scaling to all clusters.

Classification Learners

The OptimalFeatureSelectionClassifier is used for conducting optimal feature selection on classification problems. There are no additional parameters beyond the shared parameters. The following values for criterion are permitted:

Regression Learners

The OptimalFeatureSelectionRegressor is used for conducting optimal feature selection on regression problems. The following values for criterion are permitted:

In addition to the shared parameters, these learners also support the shared regression parameters.

Similar to normalize_X, the shared regression parameter normalize_y has no effect for OptimalFeatureSelectionRegressors.