# Using Optimal Feature Selection with Missing Data

Optimal Feature Selection does not directly support training or prediction when there is missing data. If this is the case, we recommend using one of the following approaches:

- Use OptImpute to impute the missing values, by using `fit_transform!` on the training data and `transform` on the testing data.
- Engineer additional features to encode the patterns of missingness in the data as described in Bertsimas et al. (2021), by using `fit_and_expand!` on the training data and `transform_and_expand` on the testing data.

On this page, we demonstrate how to use these approaches in a classification problem, but the same approach also applies to regression problems.

We first load in the hungarian-heart-disease dataset.

```
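using CSV, DataFrames
# "?" marks missing entries in the raw file; column names are supplied explicitly
df = CSV.read(
    "processed.hungarian.data", DataFrame,
    missingstring="?",
    header=[:age, :sex, :cp, :trestbps, :chol, :fbs, :restecg, :thalach, :exang,
            :oldpeak, :slope, :ca, :thal, :num])
# The last column (`num`) is the target; the remaining columns are features
X = df[:, 1:(end - 1)]
y = df[:, end]
X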
```

```
294×13 DataFrame
Row │ age sex cp trestbps chol fbs restecg thalach exang ⋯
│ Int64 Int64 Int64 Int64? Int64? Int64? Int64? Int64? Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 28 1 2 130 132 0 2 185 ⋯
2 │ 29 1 2 120 243 0 0 160
3 │ 29 1 2 140 missing 0 0 170
4 │ 30 0 1 170 237 0 1 170
5 │ 31 0 2 100 219 0 1 150 ⋯
6 │ 32 0 2 105 198 0 0 165
7 │ 32 1 2 110 225 0 0 184
8 │ 32 1 2 125 254 0 0 155
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
288 │ 50 1 4 140 341 0 1 125 ⋯
289 │ 52 1 4 140 266 0 0 134
290 │ 52 1 4 160 331 0 0 94
291 │ 54 0 3 130 294 0 1 100
292 │ 56 1 4 155 342 1 0 150 ⋯
293 │ 58 0 2 180 393 0 0 110
294 │ 65 1 4 130 275 0 1 115
5 columns and 279 rows omitted
```

We can see that `X` has some missing entries, so we cannot use Optimal Feature Selection on it directly.
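To get a sense of how much data is missing before choosing an approach, one option is to count the missing entries per column. The sketch below uses a toy named tuple of columns standing in for `X` (the `ca` values are made up for illustration); on the real data, the same `count(ismissing, col)` call could be applied to each column via `eachcol(X)`.

```julia
# Toy columns standing in for a few columns of X.
# The `ca` values are illustrative only, not taken from the dataset.
cols = (
    age  = [28, 29, 29, 30],
    chol = [132, 243, missing, 237],
    ca   = [missing, missing, missing, 0],
)

# Count missing entries in each column; mapping over a NamedTuple keeps the names
missing_counts = map(col -> count(ismissing, col), cols)

# Fraction of missing entries per column
missing_fractions = map(col -> count(ismissing, col) / length(col), cols)
```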

Let's first split the data into training and testing:

`(X_train, y_train), (X_test, y_test) = IAI.split_data(:classification, X, y, seed=4)`

## Approach 1: Optimal Imputation

Under the first approach, we use an Optimal Imputation learner (in this case `:opt_knn`) to fill in the missing values in `X_train` and `X_test`:

```
imputer = IAI.ImputationLearner(:opt_knn, random_seed=1)
X_train_imputed = IAI.fit_transform!(imputer, X_train)
X_test_imputed = IAI.transform(imputer, X_test)
```

Now that the training and testing data contain no missing values, we can proceed to train the Optimal Feature Selection learner, and evaluate it on the transformed testing data:

```
grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionClassifier(random_seed=1),
    sparsity=1:size(X, 2),
)
IAI.fit!(grid, X_train_imputed, y_train)
IAI.score(grid, X_test_imputed, y_test, criterion=:auc)
```

```
[ Warning: For full sparsity, use ridge regression for faster performance.
0.8872767857142855
```

## Approach 2: Engineer features to encode missingness pattern

The second approach is to engineer additional features to encode the pattern of missingness. To do so, we first create a simple imputer (such as a `ZeroImputationLearner` or `MeanImputationLearner`), then call `fit_and_expand!` on the training data to expand the data with additional features that encode the missingness, and finally call `transform_and_expand` on the testing data to apply the same transformation.

There are two types of feature expansion from which to choose, specified using the `type` keyword argument:

- `:finite` adds an indicator encoding the missingness for each feature.
- `:affine` creates a feature encoding the missingness relation between every pair of features, allowing the regression to adaptively adjust when certain features are missing but not others.
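To make the two expansion types concrete, here is a conceptual sketch in plain Julia on a toy matrix. It is not the IAI implementation: it assumes zero imputation, assumes that the `:affine` expansion contains the `:finite` indicators plus one term per ordered pair of distinct features, and copies the naming pattern from the feature names shown on this page. Under those assumptions, `p` features expand to `2p` columns for `:finite` and `p(p + 1)` columns for `:affine`, which for `p = 13` gives 26 and 182.

```julia
# Toy data with 3 features; `missing` marks unknown entries (illustrative values)
X = [1.0     missing 3.0;
     missing 5.0     6.0;
     7.0     8.0     missing]
feature_names = ["a", "b", "c"]

M  = ismissing.(X)          # missingness mask (true where the entry is missing)
X0 = coalesce.(X, 0.0)      # zero imputation

# :finite-style expansion: imputed features plus one missingness indicator each
X_finite = hcat(X0, Float64.(M))
names_finite = vcat(feature_names, feature_names .* "_is_missing")

# :affine-style expansion: additionally, for every ordered pair (j, k) with j != k,
# the imputed value of feature j multiplied by the indicator that k is missing
p = size(X, 2)
pair_idx = [(j, k) for j in 1:p for k in 1:p if j != k]
X_pairs = reduce(hcat, [X0[:, j] .* M[:, k] for (j, k) in pair_idx])
names_pairs = ["$(feature_names[j])_correction_for_$(feature_names[k])_missing"
               for (j, k) in pair_idx]

X_affine = hcat(X_finite, X_pairs)
names_affine = vcat(names_finite, names_pairs)
```

A linear model fitted on `X_affine` can therefore learn a separate adjustment to the coefficient of feature `j` for each other feature `k` that happens to be missing.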

Let's first try the `:finite` expansion method:

```
imputer = IAI.ImputationLearner(:zero, normalize_X=false)
X_train_finite = IAI.fit_and_expand!(imputer, X_train, type=:finite)
X_test_finite = IAI.transform_and_expand(imputer, X_test, type=:finite)
names(X_train_finite)
```

```
26-element Vector{String}:
"age"
"sex"
"cp"
"trestbps"
"chol"
"fbs"
"restecg"
"thalach"
"exang"
"oldpeak"
⋮
"chol_is_missing"
"fbs_is_missing"
"restecg_is_missing"
"thalach_is_missing"
"exang_is_missing"
"oldpeak_is_missing"
"slope_is_missing"
"ca_is_missing"
"thal_is_missing"
```

We can see that `X_train_finite` has 26 features, double the original 13, because a missingness indicator has been added for each feature. We can now train the Optimal Feature Selection learner on the expanded data:

```
grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionClassifier(random_seed=1),
    sparsity=1:size(X_train_finite, 2),
)
IAI.fit!(grid, X_train_finite, y_train)
IAI.get_learner(grid)
```

```
[ Warning: For full sparsity, use ridge regression for faster performance.
Fitted OptimalFeatureSelectionClassifier:
Constant: -3.89513
Weights:
cp: 0.614515
exang: 0.942181
fbs: 0.777978
oldpeak: 0.461175
restecg: -0.4196
restecg_is_missing: 3.66969
sex: 0.8913
slope: 0.432254
(Higher score indicates stronger prediction for class `1`)
```

We see that in addition to several of the original features, the missing indicator `restecg_is_missing` is also selected. This means that if the variable `restecg` is missing, a constant term of 3.66969 is added to the score, whereas if `restecg` is known, a coefficient of -0.4196 is applied to its value.
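As a small worked example of this interpretation, the function below computes the contribution of `restecg` to the model score in both cases. It is only a sketch: the two coefficients are copied from the fitted model printed above, the helper name `restecg_contribution` is hypothetical, and it relies on the zero imputation used here, under which a missing `restecg` contributes nothing through its own coefficient.

```julia
# Coefficients copied from the fitted model printed above
const w_restecg = -0.4196
const w_restecg_is_missing = 3.66969

# Contribution of restecg to the score under zero imputation plus an indicator:
# a missing value is imputed as 0, so only the indicator term remains.
function restecg_contribution(value)
    ismissing(value) ? w_restecg_is_missing : w_restecg * value
end

restecg_contribution(missing)  # 3.66969
restecg_contribution(2)        # -0.8392
```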

We can evaluate the model on the transformed and expanded test set:

`IAI.score(grid, X_test_finite, y_test, criterion=:auc)`

`0.8775111607142859`

Similarly, we can try the `:affine` expansion method:

```
imputer = IAI.ImputationLearner(:zero, normalize_X=false)
X_train_affine = IAI.fit_and_expand!(imputer, X_train, type=:affine)
X_test_affine = IAI.transform_and_expand(imputer, X_test, type=:affine)
names(X_train_affine)
```

```
182-element Vector{String}:
"age"
"sex"
"cp"
"trestbps"
"chol"
"fbs"
"restecg"
"thalach"
"exang"
"oldpeak"
⋮
"thal_correction_for_trestbps_missing"
"thal_correction_for_chol_missing"
"thal_correction_for_fbs_missing"
"thal_correction_for_restecg_missing"
"thal_correction_for_thalach_missing"
"thal_correction_for_exang_missing"
"thal_correction_for_oldpeak_missing"
"thal_correction_for_slope_missing"
"thal_correction_for_ca_missing"
```

We see that `X_train_affine` is expanded even further to 182 features, due to the pairwise adjustment terms for every combination of features. Training the Optimal Feature Selection learner gives us:

```
grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionClassifier(random_seed=1),
    sparsity=1:size(X_train_affine, 2),
)
IAI.fit!(grid, X_train_affine, y_train)
IAI.get_learner(grid)
```

```
[ Warning: For full sparsity, use ridge regression for faster performance.
Fitted OptimalFeatureSelectionClassifier:
Constant: -3.5853
Weights:
age_correction_for_ca_missing: -0.013717
age_correction_for_fbs_missing: -0.0267239
age_correction_for_restecg_missing: 0.0165299
chol: 0.00160183
chol_correction_for_restecg_missing: 0.00308185
cp: 0.349488
cp_correction_for_ca_missing: 0.239332
cp_correction_for_restecg_missing: 0.909147
cp_correction_for_thal_missing: 0.11822
exang: 0.416671
exang_correction_for_chol_missing: 1.24073
exang_correction_for_slope_missing: 0.6669
exang_correction_for_thal_missing: 0.315088
fbs_correction_for_thal_missing: 0.950344
oldpeak: 0.241003
oldpeak_correction_for_ca_missing: 0.183674
oldpeak_correction_for_slope_missing: 0.677676
restecg: -0.276532
restecg_correction_for_ca_missing: -0.276532
restecg_is_missing: 0.909147
sex: 0.41225
sex_correction_for_ca_missing: 0.301809
sex_correction_for_restecg_missing: 0.909147
sex_correction_for_thal_missing: 0.634863
slope: 0.275441
slope_correction_for_ca_missing: 0.200838
slope_correction_for_thal_missing: 0.237835
thal: 0.140411
thal_correction_for_ca_missing: 0.0752845
thalach: -0.00574081
thalach_correction_for_restecg_missing: 0.0066849
trestbps_correction_for_restecg_missing: 0.00649391
(Higher score indicates stronger prediction for class `1`)
```

We see that in addition to several of the original features, the missing indicator `restecg_is_missing` is selected, along with some of the pairwise adjustment terms.

To understand and interpret these adjustment terms, let us consider the feature `age_correction_for_ca_missing`, which is selected with a coefficient of -0.013717 in the model. This means that if the feature `ca` is missing and `age` is not missing, an additional coefficient of -0.013717 is applied to the value of `age`. In this way, the coefficient of `age` in the model is "corrected" due to the value of `ca` being missing. We can think of this as saying the model would prefer to use the `ca` feature if it is available, but if it is missing, it will compensate by adjusting the coefficients on other features in the data.
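The same reading applies to any feature with correction terms. As an illustration, the sketch below computes the effective coefficient applied to `cp` (chosen because it has both a base weight and several corrections) depending on which other features are missing. The weights are copied from the fitted model printed above; the function name `effective_cp_coefficient` and its keyword arguments are hypothetical, and the sketch assumes `cp` itself is observed.

```julia
# Weights involving cp, copied from the fitted model printed above
const w_cp = 0.349488
const w_cp_correction_for_ca_missing = 0.239332
const w_cp_correction_for_restecg_missing = 0.909147
const w_cp_correction_for_thal_missing = 0.11822

# Effective coefficient on cp: the base weight plus the correction for each
# feature that is missing in the given observation (Bool * Float64 is 0 or the weight)
function effective_cp_coefficient(; ca_missing=false, restecg_missing=false,
                                  thal_missing=false)
    w_cp +
        ca_missing * w_cp_correction_for_ca_missing +
        restecg_missing * w_cp_correction_for_restecg_missing +
        thal_missing * w_cp_correction_for_thal_missing
end

effective_cp_coefficient()                  # base weight only
effective_cp_coefficient(ca_missing=true)   # base weight plus the ca correction
```

With nothing missing the coefficient is the base weight 0.349488; when `ca` is missing it rises to roughly 0.58882, so the model leans more heavily on `cp` to compensate.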

We evaluate the model on the transformed and expanded test set:

`IAI.score(grid, X_test_affine, y_test, criterion=:auc)`

`0.8962053571428573`

We see that both expansion approaches achieve results comparable to those under Optimal Imputation (test AUC of about 0.878 for `:finite` and 0.896 for `:affine`, compared with 0.887 for Optimal Imputation). In practice, the best approach for a problem will depend heavily on the specific dataset and how the missing values were generated (e.g. random vs. structured missingness). We should also consider the computational requirements when choosing between these approaches: the Optimal Imputation approach requires training a separate imputation learner, while the expansion approaches add features to the dataset, which can increase the training time of the predictive model.