# Optimal Feature Selection Learners

OptimalFeatureSelection provides many learners for training optimal feature selection models, which we describe on this page along with a guide to their parameters.

## The Feature Selection Problem

The goal of feature selection is to find a small set of the features in the data that best predict the outcome. Mathematically, the problem being solved is:

\[\min_{\mathbf{w},b} \sum_{i = 1}^n \left[\ell (y_i, \mathbf{w}^T \mathbf{x}_i + b)\right] + \frac{1}{2\gamma} {\| \mathbf{w} \|}^2_2 ~~~ \text{s.t.} ~~~ {\| \mathbf{w} \|}_0 \leq k\]

We are trying to fit a linear model $(\mathbf{w}, b)$ to the training data. The first term calculates the error made by the model when applied to the training data for a specified scoring criterion $\ell$. The second term is a regularization term added to prevent overfitting and make the model robust to perturbations in the data. The constraint restricts the model to selecting at most $k$ of the features in the data; the coefficients for all other features are zero.
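
As a rough illustration of how the three pieces of the formulation fit together, the following sketch evaluates the objective and checks the sparsity constraint for a candidate model. It is not part of OptimalFeatureSelection: the helper names `ofs_objective` and `is_feasible` are hypothetical, and the loss is assumed to be the plain sum of squared errors.

```julia
# Regularized objective for a candidate linear model (w, b).
# Assumption: the loss ℓ is the squared error, summed over all samples.
function ofs_objective(X, y, w, b; gamma)
    residuals = y .- (X * w .+ b)          # prediction error on each sample
    loss = sum(abs2, residuals)            # first term: training error
    penalty = sum(abs2, w) / (2 * gamma)   # second term: (1 / 2γ) * ||w||_2^2
    return loss + penalty
end

# Sparsity constraint ||w||_0 <= k: count the non-zero coefficients.
is_feasible(w, k) = count(!iszero, w) <= k

X = [1.0 0.0 2.0; 0.0 1.0 1.0; 1.0 1.0 0.0]
y = [1.0, 2.0, 3.0]
w = [1.0, 2.0, 0.0]   # uses two of the three features
b = 0.0

objective = ofs_objective(X, y, w, b; gamma=0.5)
feasible_k2 = is_feasible(w, 2)
feasible_k1 = is_feasible(w, 1)
```

Here the model fits the data exactly, so the objective consists only of the regularization term, and the model satisfies the constraint for $k = 2$ but not for $k = 1$.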

Depending on the choice of scoring criterion $\ell$ used in the first term, this problem represents many classical problems in machine learning:

- classification:
  - the entropy criterion leads to logistic regression
  - the L1 hinge loss criterion leads to support vector machines
  - the L2 hinge loss criterion leads to L2 support vector machines
- regression:
  - the mean-squared error criterion leads to linear regression
  - the L1 hinge loss criterion leads to L1 support vector regression
  - the L2 hinge loss criterion leads to L2 support vector regression

Note that in the mean-squared error scenario for regression, if we relax the constraint to use the 1-norm instead of the 0-norm, we recover the elastic net and lasso.
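
For reference, one way to write this relaxed problem is shown below. The budget $t$ on the 1-norm is a symbol introduced here for illustration, and the squared-error term is written without any scaling constant:

\[\min_{\mathbf{w},b} \sum_{i = 1}^n \left(y_i - \mathbf{w}^T \mathbf{x}_i - b\right)^2 + \frac{1}{2\gamma} {\| \mathbf{w} \|}^2_2 ~~~ \text{s.t.} ~~~ {\| \mathbf{w} \|}_1 \leq t\]

With both the 2-norm penalty and the 1-norm constraint present, this is a constrained form of the elastic net. As $\gamma \to \infty$ the 2-norm penalty vanishes and the problem reduces to the lasso.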

### Solving the Feature Selection Problem

OptimalFeatureSelection provides two approaches for solving the above problem.

#### Exact Approach

The exact approach solves the feature selection problem to exact optimality using a mixed-integer optimization solver. This provides a certifiably-optimal solution to the problem, but typically requires a commercial mixed-integer optimization solver like Gurobi or CPLEX to solve in a reasonable amount of time.
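
To make the notion of exact optimality concrete, the sketch below finds the best subset on a tiny problem by enumerating every candidate subset. This is not how OptimalFeatureSelection works internally (it uses a mixed-integer optimization solver), and enumeration is only viable for a handful of features. The function names are hypothetical, the loss is assumed to be the sum of squared errors, and the intercept $b$ is omitted for brevity.

```julia
using LinearAlgebra

# Best regularized squared-error objective using only the features in `support`.
function best_objective(X, y, support, gamma)
    Xs = X[:, support]
    lambda = 1 / (2 * gamma)
    # Closed-form ridge solution on the selected columns
    w = (Xs' * Xs + lambda * I) \ (Xs' * y)
    return sum(abs2, y - Xs * w) + lambda * sum(abs2, w)
end

# Enumerate every non-empty support with at most k features and keep the best.
function brute_force_selection(X, y, k; gamma)
    p = size(X, 2)
    best_support, best_value = Int[], Inf
    for mask in 1:(2^p - 1)
        support = [j for j in 1:p if (mask >> (j - 1)) & 1 == 1]
        length(support) <= k || continue
        value = best_objective(X, y, support, gamma)
        if value < best_value
            best_support, best_value = support, value
        end
    end
    return best_support, best_value
end

# Three mutually orthogonal features; y depends only on features 2 and 3.
X = [1.0 1.0 1.0; 1.0 -1.0 1.0; 1.0 1.0 -1.0; 1.0 -1.0 -1.0]
y = 2 .* X[:, 2] .- 3 .* X[:, 3]

support, value = brute_force_selection(X, y, 2; gamma=0.5)
```

With a sparsity limit of 2, the enumeration selects features 2 and 3. The number of candidate subsets grows combinatorially with the number of features, which is why a dedicated solver is needed beyond toy examples.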

#### Relaxation Approach

The relaxation approach solves the feature selection problem using a relaxation of the problem that enables a specialized solution algorithm. This approach is significantly faster than the exact approach, and typically finds the same optimal solution, albeit without a proof of optimality. This approach does not require a mixed-integer optimization solver.

## Shared Parameters

All of the learners provided by OptimalFeatureSelection are `OptimalFeatureSelectionLearner`s. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptimalFeatureSelection algorithm.

Note that the shared parameter `normalize_X` has no effect for `OptimalFeatureSelectionLearner`s.

`sparsity`

`sparsity` controls the maximum number of features allowed in the fitted model (the parameter $k$ in the problem formulation). Each numeric feature or categoric level assigned a non-zero weight will count towards this `sparsity` restriction. This parameter must always be explicitly set or tuned. See `RelativeParameterInput` for the different ways to specify this value.
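
The two options, setting the value explicitly or tuning it, might look like the following sketch. It assumes a licensed IAI installation with the `IAI` module already loaded, as in the other examples on this page. The synthetic data and the candidate range `1:5` are illustrative choices, not recommendations.

```julia
using Random

# Synthetic regression data (illustrative only): 3 of the 10 features matter.
rng = MersenneTwister(1)
X = randn(rng, 200, 10)
y = X[:, 1] .- 2 .* X[:, 4] .+ 0.5 .* X[:, 7] .+ 0.1 .* randn(rng, 200)

# Option 1: set the sparsity explicitly.
lnr = IAI.OptimalFeatureSelectionRegressor(sparsity=3)
IAI.fit!(lnr, X, y)

# Option 2: tune the sparsity over a range of candidate values.
grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionRegressor(),
    sparsity=1:5,
)
IAI.fit!(grid, X, y)
```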

`gamma`

`gamma` controls the degree of regularization applied to the problem (the parameter $\gamma$ in the problem formulation). The default value is `:auto`, which uses a value of $1/\sqrt{n}$ that gives good results in most cases. It is possible to supply and/or tune this value manually by passing numeric values if you would like more control over the regularization, but we have found empirically that this has little effect on performance.
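
As a quick arithmetic check of what the `:auto` setting implies, the sketch below computes $1/\sqrt{n}$ and the resulting weight $1/(2\gamma)$ on the regularization term. The helper names are hypothetical, and $n$ is taken to be the number of training samples, matching the sum over $i = 1, \dots, n$ in the problem formulation.

```julia
# Automatic regularization value described above: gamma = 1 / sqrt(n).
auto_gamma(n::Integer) = 1 / sqrt(n)

# Weight that multiplies ||w||_2^2 in the objective: 1 / (2 * gamma).
penalty_weight(gamma::Real) = 1 / (2 * gamma)

n = 100
gamma = auto_gamma(n)
weight = penalty_weight(gamma)
```

For 100 samples this gives a `gamma` of 0.1 and a weight of 5 on the squared 2-norm of the coefficients. Smaller values of `gamma` therefore mean stronger regularization.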

`relaxation`

`relaxation` controls which approach will be used to solve the problem. The default is `true`, meaning the relaxation approach will be used. If set to `false`, the exact approach will be used instead.

`refit_learner`

`refit_learner` can be used to automatically refit a new model on the subset of features selected by the Optimal Feature Selection Learner. As an example, you can use `GLMNetCVRegressor` or `GLMNetCVClassifier` to refit an L1-regularized model that uses the selected features, which can sometimes improve the model performance in very noisy data.
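
A minimal configuration sketch for a regression problem is shown below. It assumes a licensed IAI installation with the `IAI` module loaded, and the sparsity value of 3 is an arbitrary illustrative choice.

```julia
# Select at most 3 features, then refit an L1-regularized model on them.
lnr = IAI.OptimalFeatureSelectionRegressor(
    sparsity=3,
    refit_learner=IAI.GLMNetCVRegressor(),
)
```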

`solver`

`solver` controls the mixed-integer optimization solver used to solve the problem when using the exact approach (i.e. `relaxation=false`). You need to pass a solver that supports lazy constraint callbacks.

Refer to the JuMP documentation for information on further customization of solver behavior, such as using `optimizer_with_attributes` to supply additional parameters to the solver.

For example, you could specify Gurobi with a 60 second time limit using:

```julia
using Gurobi
using JuMP

# Exact approach, with Gurobi limited to 60 seconds
lnr = IAI.OptimalFeatureSelectionClassifier(
    relaxation=false,
    solver=optimizer_with_attributes(
        Gurobi.Optimizer, "TimeLimit" => 60
    ),
)
```

Similarly, you could specify CPLEX with a 60 second time limit using:

```julia
using CPLEX
using JuMP

# Exact approach, with CPLEX limited to 60 seconds
lnr = IAI.OptimalFeatureSelectionClassifier(
    relaxation=false,
    solver=optimizer_with_attributes(
        CPLEX.Optimizer, "CPX_PARAM_TILIM" => 60
    ),
)
```

We do not recommend trying to use the exact mode with the open-source GLPK solver, as it is simply not fast enough to make any progress solving the models.

`solver_disable_gc`

`solver_disable_gc` is a `Bool` defaulting to `false`. If set to `true`, the Julia garbage collector will be disabled while the mixed-integer optimization solver is running, which may be necessary in some cases.

`coordinated_sparsity`

`coordinated_sparsity` is a `Bool` defaulting to `false` that must be set to `true` to enable coordinated-sparsity fitting across multiple clusters of data. Refer to the documentation on coordinated-sparsity fitting for more information.

`coordinated_sparsity_scale_per_cluster`

`coordinated_sparsity_scale_per_cluster` is a `Bool` controlling the data scaling behavior during coordinated-sparsity fitting. If set to `true`, the problem in each cluster will be scaled independently of all other clusters. The default value is `false`, which applies the same common scaling to all clusters.

## Classification Learners

The `OptimalFeatureSelectionClassifier` is used for conducting optimal feature selection on classification problems. There are no additional parameters beyond the shared parameters. The permitted values for `criterion` correspond to the classification criteria described in The Feature Selection Problem: entropy, L1 hinge loss, and L2 hinge loss.

## Regression Learners

The `OptimalFeatureSelectionRegressor` is used for conducting optimal feature selection on regression problems. The permitted values for `criterion` correspond to the regression criteria described in The Feature Selection Problem: mean-squared error, L1 hinge loss, and L2 hinge loss.

In addition to the shared parameters, these learners also support the shared regression parameters.

Similar to `normalize_X`, the shared regression parameter `normalize_y` has no effect for `OptimalFeatureSelectionRegressor`s.