# Optimal Feature Selection Learners

OptimalFeatureSelection provides many learners for training optimal feature selection models, which we describe on this page along with a guide to their parameters.

## The Feature Selection Problem

The goal of feature selection is to find a small set of the features in the data that best predict the outcome. Mathematically, the problem being solved is:

$$\min_{\mathbf{w},b} \sum_{i = 1}^n \left[\ell (y_i, \mathbf{w}^T \mathbf{x}_i + b)\right] + \frac{1}{2\gamma} {\| \mathbf{w} \|}^2_2 ~~~ \text{s.t.} ~~~ {\| \mathbf{w} \|}_0 \leq k$$

We are trying to fit a linear model $(\mathbf{w}, b)$ to the training data. The first term calculates the error made by the model when applied to the training data for a specified scoring function $\ell$. The second term is a regularization term added to prevent overfitting and make the model robust to perturbations in the data. The constraint restricts the model to selecting at most $k$ of the features in the data - the coefficients for all other features are zero.

Depending on the choice of scoring criterion $\ell$ used in the first term, this problem represents many classical problems in machine learning:

- classification:
  - the entropy criterion leads to logistic regression
  - the L1 hinge loss criterion leads to support vector machines
  - the L2 hinge loss criterion leads to L2 support vector machines
- regression:
  - the mean squared error criterion leads to linear regression

Note that in the mean-squared error scenario for regression, if we relax the constraint to use the 1-norm instead of the 0-norm, we recover the elastic net and lasso.
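For intuition, in the mean squared error case the relaxed counterpart of the cardinality constraint uses the 1-norm in place of the 0-norm; together with the 2-norm regularization term, this is exactly the elastic net (and the lasso if the 2-norm term is dropped):

$$\min_{\mathbf{w},b} \sum_{i = 1}^n \left(y_i - \mathbf{w}^T \mathbf{x}_i - b\right)^2 + \frac{1}{2\gamma} {\| \mathbf{w} \|}^2_2 ~~~ \text{s.t.} ~~~ {\| \mathbf{w} \|}_1 \leq k$$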

### Solving the Feature Selection Problem

OptimalFeatureSelection provides two approaches for solving the above problem.

#### Exact Approach

The exact approach solves the feature selection problem to provable optimality using a mixed-integer optimization solver. This provides a certifiably optimal solution to the problem, but typically requires a commercial mixed-integer optimization solver like Gurobi or CPLEX to solve in a reasonable amount of time.

#### Relaxation Approach

The relaxation approach solves the feature selection problem using a relaxation of the problem that enables a specialized solution algorithm. This approach is significantly faster than the exact approach, and typically finds the same optimal solution, albeit without a proof of optimality. This approach does not require a mixed-integer optimization solver.

## Shared Parameters

All of the learners provided by OptimalFeatureSelection are `OptimalFeatureSelectionLearner`s. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptimalFeatureSelection algorithm.

Note that the shared parameter `normalize_X` has no effect for `OptimalFeatureSelectionLearner`s.

#### sparsity

`sparsity` controls the maximum number of features allowed in the fitted model (the parameter $k$ in the problem formulation). Each numeric feature or categoric level assigned a non-zero weight counts towards this sparsity restriction. This parameter must always be explicitly set or tuned. See `RelativeParameterInput` for the different ways to specify this value.
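As an illustrative sketch of tuning this parameter (assuming the IAI interface with `IAI.GridSearch` and `IAI.fit!`, and placeholder training data `X` and `y` that are not part of this page):

```julia
using IAI  # requires the IAI system image to be installed

# Tune the sparsity k over the range 1..10 with a grid search
# (X and y are placeholder training data)
grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionRegressor(),
    sparsity=1:10,
)
IAI.fit!(grid, X, y)
```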

#### gamma

`gamma` controls the degree of regularization applied to the problem (the parameter $\gamma$ in the problem formulation). The default value is `:auto`, which uses a value of $1/\sqrt{n}$ that gives good results in most cases. If you would like more control over the regularization, you can supply or tune this value manually by passing numeric values, but we have found empirically that this has little effect on performance.
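For example (a sketch assuming the same constructor used elsewhere on this page), you could override the default with a fixed regularization level:

```julia
using IAI  # requires the IAI system image to be installed

# Replace the default :auto regularization with a manual value
lnr = IAI.OptimalFeatureSelectionRegressor(gamma=0.01)
```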

#### relaxation

`relaxation` controls which approach will be used to solve the problem. The default is `true`, meaning the relaxation approach will be used. If set to `false`, the exact approach will be used instead.

#### refit_l1

`refit_l1` controls whether to refit an L1-regularized model on the selected subset of features. This can sometimes improve model performance on very noisy data. The default value is `false`.

#### solver

`solver` controls the mixed-integer optimization solver used to solve the problem when using the exact approach (i.e. `relaxation=false`). You need to pass a solver that supports lazy constraint callbacks:

- `Gurobi.Optimizer` via Gurobi.jl
- `CPLEX.Optimizer` via CPLEX.jl

Refer to the JuMP documentation for information on further customization of solver behavior, such as using `optimizer_with_attributes` to supply additional parameters to the solver.

For example, you could specify Gurobi with a 60-second time limit using:

```julia
using Gurobi
using JuMP
lnr = IAI.OptimalFeatureSelectionClassifier(
    relaxation=false,
    solver=optimizer_with_attributes(
        Gurobi.Optimizer, "TimeLimit" => 60
    ),
)
```

Similarly, you could specify CPLEX with a 60-second time limit using:

```julia
using CPLEX
using JuMP
lnr = IAI.OptimalFeatureSelectionClassifier(
    relaxation=false,
    solver=optimizer_with_attributes(
        CPLEX.Optimizer, "CPX_PARAM_TILIM" => 60
    ),
)
```

We do not recommend using the exact approach with the open-source GLPK solver, as it is not fast enough to make meaningful progress on these models.

## Classification Learners

The `OptimalFeatureSelectionClassifier` is used for conducting optimal feature selection on classification problems. There are no additional parameters beyond the shared parameters. The permitted values for `criterion` correspond to the classification criteria described above: entropy, L1 hinge loss, and L2 hinge loss.

## Regression Learners

The `OptimalFeatureSelectionRegressor` is used for conducting optimal feature selection on regression problems. The permitted values for `criterion` correspond to the regression criteria described above, such as mean squared error.

In addition to the shared parameters, these learners also support the shared regression parameters.

Similar to `normalize_X`, the shared regression parameter `normalize_y` has no effect for `OptimalFeatureSelectionRegressor`s.
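Putting the pieces together, a minimal end-to-end sketch (assuming the standard IAI learner API of `IAI.fit!` and `IAI.predict`, with placeholder training data `X` and `y` that are not part of this page) might look like:

```julia
using IAI  # requires the IAI system image to be installed

# Select at most 5 features using the default relaxation approach
lnr = IAI.OptimalFeatureSelectionRegressor(sparsity=5)
IAI.fit!(lnr, X, y)          # X and y are placeholder training data
preds = IAI.predict(lnr, X)  # predictions from the sparse linear model
```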