Optimal Feature Selection Learners

OptimalFeatureSelection provides many learners for training optimal feature selection models, which we describe on this page along with a guide to their parameters.

The Feature Selection Problem

The goal of feature selection is to find a small set of the features in the data that best predict the outcome. Mathematically, the problem being solved is:

\[\min_{\mathbf{w},b} \sum_{i = 1}^n \left[\ell (y_i, \mathbf{w}^T \mathbf{x}_i + b)\right] + \frac{1}{2\gamma} {\| \mathbf{w} \|}^2_2 ~~~ \text{s.t.} ~~~ {\| \mathbf{w} \|}_0 \leq k\]

We are trying to fit a linear model $(\mathbf{w}, b)$ to the training data. The first term calculates the error made by the model when applied to the training data for a specified scoring function $\ell$. The second term is a regularization term added to prevent overfitting and make the model robust to perturbations in the data. The constraint restricts the model to selecting at most $k$ of the features in the data; the coefficients for all other features are zero.
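To make the formulation concrete, the objective can be evaluated directly in the mean-squared-error case. The following is a minimal numpy sketch with made-up data; the values and variable names are purely illustrative and not part of OptimalFeatureSelection:

```python
import numpy as np

# Toy data: n = 4 samples, p = 3 features (illustrative values only)
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])
y = np.array([3.0, 2.0, 2.0, 4.0])

def objective(w, b, gamma):
    """Regularized training error with squared error as the scoring function ell."""
    loss = np.sum((y - (X @ w + b)) ** 2)  # sum_i ell(y_i, w^T x_i + b)
    reg = np.dot(w, w) / (2 * gamma)       # (1 / 2 gamma) * ||w||_2^2
    return loss + reg

# A candidate model that satisfies the sparsity constraint with k = 2:
# only two coefficients are non-zero, so ||w||_0 = 2 <= k
w = np.array([1.0, 0.0, 1.0])
b = 0.0
assert np.count_nonzero(w) <= 2

print(objective(w, b, gamma=1.0))  # 3.0 training error + 1.0 regularization = 4.0
```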

Depending on the choice of scoring criterion $\ell$ used in the first term, this problem represents many classical problems in machine learning:

  • classification:
    • the entropy criterion leads to logistic regression
    • the L1 hinge loss criterion leads to support vector machines
    • the L2 hinge loss criterion leads to L2 support vector machines
  • regression:
    • the mean-squared error criterion leads to linear regression

Note that in the mean-squared error scenario for regression, if we relax the constraint to use the 1-norm instead of the 0-norm, we recover the lasso, and, in combination with the 2-norm regularization term, the elastic net.
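For intuition, the 0-norm-constrained problem in the mean-squared-error case can be solved exactly on small examples by enumerating every subset of $k$ features. This brute-force sketch is purely illustrative (and exponential in the number of features); it is not the algorithm OptimalFeatureSelection uses:

```python
import itertools
import numpy as np

def best_subset(X, y, k, gamma=1.0):
    """Exhaustive search over all k-subsets of features (exact but exponential)."""
    n, p = X.shape
    best_obj, best_S = np.inf, None
    for S in itertools.combinations(range(p), k):
        Xs = np.column_stack([X[:, S], np.ones(n)])  # selected columns plus intercept
        # Ridge-regularized least squares restricted to the selected features;
        # the intercept b is not penalized.
        A = Xs.T @ Xs + np.diag([1 / (2 * gamma)] * k + [0.0])
        coef = np.linalg.solve(A, Xs.T @ y)
        resid = y - Xs @ coef
        obj = resid @ resid + (coef[:k] @ coef[:k]) / (2 * gamma)
        if obj < best_obj:
            best_obj, best_S = obj, S
    return best_S

# Synthetic data where only features 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 2 * X[:, 0] - 3 * X[:, 3] + 0.01 * rng.normal(size=50)

print(best_subset(X, y, k=2))  # recovers the planted support (0, 3)
```

The exact approach described below replaces this enumeration with mixed-integer optimization, which scales far beyond what brute force can handle.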

Solving the Feature Selection Problem

OptimalFeatureSelection provides two approaches for solving the above problem.

Exact Approach

The exact approach solves the feature selection problem to exact optimality using a mixed-integer optimization solver. This provides a certifiably-optimal solution to the problem, but typically requires a commercial mixed-integer optimization solver like Gurobi or CPLEX to solve in a reasonable amount of time.

Relaxation Approach

The relaxation approach solves the feature selection problem using a relaxation of the problem that enables a specialized solution algorithm. This approach is significantly faster than the exact approach, and typically finds the same optimal solution, albeit without a proof of optimality. This approach does not require a mixed-integer optimization solver.

Shared Parameters

All of the learners provided by OptimalFeatureSelection are OptimalFeatureSelectionLearners. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptimalFeatureSelection algorithm.

Note that the shared parameter normalize_X has no effect for OptimalFeatureSelectionLearners.


sparsity controls the maximum number of features allowed in the fitted model (the parameter $k$ in the problem formulation). Each numeric feature or categoric level assigned a non-zero weight will count towards this sparsity restriction. This parameter must always be explicitly set or tuned. See RelativeParameterInput for the different ways to specify this value.
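To illustrate how categoric features consume the sparsity budget, suppose a categoric feature is expanded into one indicator column per level (a hypothetical one-hot encoding for illustration only). Every level with a non-zero weight then counts as one selected feature:

```python
import numpy as np

# Hypothetical data: one categoric feature with three levels plus one numeric feature
levels = np.array(["a", "b", "c"])
color = np.array(["a", "c", "b", "a"])
onehot = (color[:, None] == levels).astype(float)  # 4 x 3 indicator columns
numeric = np.array([[1.0], [2.0], [0.5], [3.0]])
X = np.hstack([onehot, numeric])                   # 4 candidate coefficients in total

# A model with non-zero weights on two levels and the numeric feature
# consumes 3 units of the sparsity budget, not 2
w = np.array([0.7, 0.0, -0.2, 1.5])
print(np.count_nonzero(w))  # 3
```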


gamma controls the degree of regularization applied to the problem (the parameter $\gamma$ in the problem formulation). The default value is :auto, which uses a value of $1/\sqrt{n}$ that gives good results in most cases. You can also supply or tune numeric values manually if you would like more control over the regularization.
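As a quick check of the default, the :auto value is simply $1/\sqrt{n}$ for $n$ training samples; for example (the sample size here is made up):

```python
import math

n = 10000  # hypothetical number of training samples
gamma_auto = 1 / math.sqrt(n)  # the :auto default described above
print(gamma_auto)  # 0.01
```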


relaxation controls which approach will be used to solve the problem. The default is true, meaning the relaxation approach will be used. If set to false, the exact approach will be used instead.


solver controls the mixed-integer optimization solver used to solve the problem when using the exact approach (i.e. relaxation=false). You need to pass a solver that supports lazy constraint callbacks, such as Gurobi or CPLEX.

You can configure the solver behavior further by passing the appropriate parameters to the solver constructor.

For example, you could specify Gurobi with a 60 second time limit using:

using JuMP, Gurobi
lnr = IAI.OptimalFeatureSelectionClassifier(
    relaxation=false,
    solver=optimizer_with_attributes(Gurobi.Optimizer, "TimeLimit" => 60),
)

Similarly, you could specify CPLEX with a 60 second time limit using:

using JuMP, CPLEX
lnr = IAI.OptimalFeatureSelectionClassifier(
    relaxation=false,
    solver=optimizer_with_attributes(CPLEX.Optimizer, "CPX_PARAM_TILIM" => 60),
)

We do not recommend using the exact approach with the open-source GLPK solver, as it is not fast enough to make meaningful progress on these problems.

Classification Learners

The OptimalFeatureSelectionClassifier is used for conducting optimal feature selection on classification problems. There are no additional parameters beyond the shared parameters.

Regression Learners

The OptimalFeatureSelectionRegressor is used for conducting optimal feature selection on regression problems. In addition to the shared parameters, these learners also support the shared regression parameters.

Similar to normalize_X, the shared regression parameter normalize_y has no effect for OptimalFeatureSelectionRegressors.