Using Optimal Feature Selection with Missing Data

Optimal Feature Selection does not directly support training or prediction on data that contains missing values. For such data, we recommend using one of the following approaches:

Use OptImpute to impute the missing values, by using fit_transform! on the training data and transform on the testing data.
Engineer additional features to encode the patterns of missingness in the data as described in Bertsimas et al. (2021), by using fit_and_expand! on the training data and transform_and_expand on the testing data.

On this page, we demonstrate how to use these approaches in a classification problem, but the same approaches also apply to regression problems.

We first load in the hungarian-heart-disease dataset.

using CSV, DataFrames
df = CSV.read(
    "processed.hungarian.data", DataFrame,
    missingstring="?",
    header=[:age, :sex, :cp, :trestbps, :chol, :fbs, :restecg, :thalach, :exang,
            :oldpeak, :slope, :ca, :thal, :num])
X = df[:, 1:(end - 1)]
y = df[:, end]
X

294×13 DataFrame
 Row │ age    sex    cp     trestbps  chol     fbs     restecg  thalach  exang ⋯
     │ Int64  Int64  Int64  Int64?    Int64?   Int64?  Int64?   Int64?   Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    28      1      2       130      132       0        2      185        ⋯
   2 │    29      1      2       120      243       0        0      160
   3 │    29      1      2       140  missing       0        0      170
   4 │    30      0      1       170      237       0        1      170
   5 │    31      0      2       100      219       0        1      150        ⋯
   6 │    32      0      2       105      198       0        0      165
   7 │    32      1      2       110      225       0        0      184
   8 │    32      1      2       125      254       0        0      155
  ⋮  │   ⋮      ⋮      ⋮       ⋮         ⋮       ⋮        ⋮        ⋮       ⋮   ⋱
 288 │    50      1      4       140      341       0        1      125        ⋯
 289 │    52      1      4       140      266       0        0      134
 290 │    52      1      4       160      331       0        0       94
 291 │    54      0      3       130      294       0        1      100
 292 │    56      1      4       155      342       1        0      150        ⋯
 293 │    58      0      2       180      393       0        0      110
 294 │    65      1      4       130      275       0        1      115
                                                  5 columns and 279 rows omitted

We can see that X has some missing entries (for example, chol in row 3), so we cannot use Optimal Feature Selection on it directly.
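
To see how much of each feature is missing, it can help to count the missing entries per column. The following is a minimal sketch on a hypothetical toy table (X_toy and its values are made up for illustration and are not part of the dataset above):

```julia
using DataFrames

# Hypothetical toy data standing in for X; `missing` marks absent entries.
X_toy = DataFrame(
    age = [28, 29, 30],
    chol = [132, missing, 237],
    fbs = [0, 0, missing],
)

# Count the missing entries in each column, keyed by column name.
n_missing = Dict(name => count(ismissing, X_toy[!, name]) for name in names(X_toy))
```

The same expression should work with X in place of X_toy.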

Let's first split the data into training and testing:

(X_train, y_train), (X_test, y_test) = IAI.split_data(:classification, X, y, seed=4)

Approach 1: Optimal Imputation

Under the first approach, we use an Optimal Imputation learner (in this case :opt_knn) to fill in the missing values in X_train and X_test:

imputer = IAI.ImputationLearner(:opt_knn, random_seed=1)
X_train_imputed = IAI.fit_transform!(imputer, X_train)
X_test_imputed = IAI.transform(imputer, X_test)
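
The imputer is fitted on the training data only and then applied unchanged to the testing data, so no information from the test set leaks into the imputation. The sketch below illustrates this fit/transform pattern with plain mean imputation in place of :opt_knn; the toy vectors and variable names are assumptions for illustration and this is not how Optimal Imputation works internally:

```julia
using Statistics

# Hypothetical toy columns; `missing` marks absent entries.
train = [1.0, missing, 3.0]
test = [missing, 5.0]

# "Fit": learn the fill value from the training data only.
fill_value = mean(skipmissing(train))

# "Transform": apply the same learned value to both sets.
train_imputed = coalesce.(train, fill_value)
test_imputed = coalesce.(test, fill_value)
```

Note that the missing test entry is filled with the training mean rather than with a statistic computed from the test data.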

Now that the training and testing data contain no missing values, we can proceed to train the Optimal Feature Selection learner, and evaluate it on the transformed testing data:

grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionClassifier(random_seed=1),
    sparsity=1:size(X, 2),
)
IAI.fit!(grid, X_train_imputed, y_train)
IAI.score(grid, X_test_imputed, y_test, criterion=:auc)

[ Warning: For full sparsity, use ridge regression for faster performance.
0.8872767857142855

Approach 2: Engineer features to encode missingness pattern

The second approach is to engineer additional features to encode the pattern of missingness. To do so, we first create a simple imputer (such as a ZeroImputationLearner or MeanImputationLearner), then call fit_and_expand! on the training data to expand the data with additional features that encode the missingness, and finally call transform_and_expand on the testing data to apply the same transformation.

There are two types of feature expansion from which to choose, specified using the type keyword argument:

:finite adds an indicator encoding the missingness for each feature.
:affine creates a feature encoding the missingness relation between every pair of features, allowing the regression to adjust adaptively when certain features are missing but not others.
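
As a rough illustration of what a :finite-style expansion with zero imputation produces, the following sketch builds the expanded table by hand. It is a conceptual sketch on hypothetical toy data and not the library's implementation; only the _is_missing naming is taken from the output shown on this page:

```julia
using DataFrames

# Hypothetical toy data; `missing` marks absent entries.
X_toy = DataFrame(chol = [132, missing, 237], fbs = [0, 0, missing])

# Zero-impute each original column.
X_finite = DataFrame()
for name in names(X_toy)
    X_finite[!, name] = coalesce.(X_toy[!, name], 0)
end

# Append a 0/1 missingness indicator for each original column.
for name in names(X_toy)
    X_finite[!, name * "_is_missing"] = Int.(ismissing.(X_toy[!, name]))
end
```

Each original column gains one indicator column, so the number of features doubles.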

Let's first try the :finite expansion method:

imputer = IAI.ImputationLearner(:zero, normalize_X=false)
X_train_finite = IAI.fit_and_expand!(imputer, X_train, type=:finite)
X_test_finite = IAI.transform_and_expand(imputer, X_test, type=:finite)
names(X_train_finite)

26-element Vector{String}:
 "age"
 "sex"
 "cp"
 "trestbps"
 "chol"
 "fbs"
 "restecg"
 "thalach"
 "exang"
 "oldpeak"
 ⋮
 "chol_is_missing"
 "fbs_is_missing"
 "restecg_is_missing"
 "thalach_is_missing"
 "exang_is_missing"
 "oldpeak_is_missing"
 "slope_is_missing"
 "ca_is_missing"
 "thal_is_missing"

We can see that the number of features in X_train_finite has doubled from 13 to 26, because a missingness indicator is added for each feature. We can now train the Optimal Feature Selection learner on the expanded data:

grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionClassifier(random_seed=1),
    sparsity=1:size(X_train_finite, 2),
)
IAI.fit!(grid, X_train_finite, y_train)
IAI.get_learner(grid)

[ Warning: For full sparsity, use ridge regression for faster performance.
Fitted OptimalFeatureSelectionClassifier:
  Constant: -3.89513
  Weights:
    cp:                  0.614515
    exang:               0.942181
    fbs:                 0.777978
    oldpeak:             0.461175
    restecg:            -0.4196
    restecg_is_missing:  3.66969
    sex:                 0.8913
    slope:               0.432254
  (Higher score indicates stronger prediction for class `1`)

We see that in addition to a subset of the original features, the missing indicator restecg_is_missing is also selected. This means that if the variable restecg is missing, a constant term of 3.66969 is added to the score, whereas if restecg is known, a coefficient of -0.4196 is applied to its value.
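
The two cases can be worked through numerically. The sketch below uses the two coefficients copied from the fitted model above and assumes zero imputation, so a missing restecg is stored as 0 with its indicator set to 1; the function name contribution and the example value 2 are made up for illustration:

```julia
# Coefficients copied from the fitted model shown above.
w_restecg = -0.4196
w_restecg_is_missing = 3.66969

# Contribution of restecg to the linear score under zero imputation.
contribution(value, is_missing) =
    w_restecg * value + w_restecg_is_missing * is_missing

# restecg missing: only the indicator term is active.
score_when_missing = contribution(0, 1)

# restecg observed with value 2: only the coefficient on the value is active.
score_when_observed = contribution(2, 0)
```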

We can evaluate the model on the transformed and expanded test set:

IAI.score(grid, X_test_finite, y_test, criterion=:auc)

0.8775111607142859

Similarly, we can try the :affine expansion method:

imputer = IAI.ImputationLearner(:zero, normalize_X=false)
X_train_affine = IAI.fit_and_expand!(imputer, X_train, type=:affine)
X_test_affine = IAI.transform_and_expand(imputer, X_test, type=:affine)
names(X_train_affine)

182-element Vector{String}:
 "age"
 "sex"
 "cp"
 "trestbps"
 "chol"
 "fbs"
 "restecg"
 "thalach"
 "exang"
 "oldpeak"
 ⋮
 "thal_correction_for_trestbps_missing"
 "thal_correction_for_chol_missing"
 "thal_correction_for_fbs_missing"
 "thal_correction_for_restecg_missing"
 "thal_correction_for_thalach_missing"
 "thal_correction_for_exang_missing"
 "thal_correction_for_oldpeak_missing"
 "thal_correction_for_slope_missing"
 "thal_correction_for_ca_missing"

We see that X_train_affine is expanded even further to 182 features, due to the pairwise adjustment terms for every combination of features. Training the Optimal Feature Selection learner gives us:

grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionClassifier(random_seed=1),
    sparsity=1:size(X_train_affine, 2),
)
IAI.fit!(grid, X_train_affine, y_train)
IAI.get_learner(grid)

[ Warning: For full sparsity, use ridge regression for faster performance.
Fitted OptimalFeatureSelectionClassifier:
  Constant: -3.5853
  Weights:
    age_correction_for_ca_missing:           -0.013717
    age_correction_for_fbs_missing:          -0.0267239
    age_correction_for_restecg_missing:       0.0165299
    chol:                                     0.00160183
    chol_correction_for_restecg_missing:      0.00308185
    cp:                                       0.349488
    cp_correction_for_ca_missing:             0.239332
    cp_correction_for_restecg_missing:        0.909147
    cp_correction_for_thal_missing:           0.11822
    exang:                                    0.416671
    exang_correction_for_chol_missing:        1.24073
    exang_correction_for_slope_missing:       0.6669
    exang_correction_for_thal_missing:        0.315088
    fbs_correction_for_thal_missing:          0.950344
    oldpeak:                                  0.241003
    oldpeak_correction_for_ca_missing:        0.183674
    oldpeak_correction_for_slope_missing:     0.677676
    restecg:                                 -0.276532
    restecg_correction_for_ca_missing:       -0.276532
    restecg_is_missing:                       0.909147
    sex:                                      0.41225
    sex_correction_for_ca_missing:            0.301809
    sex_correction_for_restecg_missing:       0.909147
    sex_correction_for_thal_missing:          0.634863
    slope:                                    0.275441
    slope_correction_for_ca_missing:          0.200838
    slope_correction_for_thal_missing:        0.237835
    thal:                                     0.140411
    thal_correction_for_ca_missing:           0.0752845
    thalach:                                 -0.00574081
    thalach_correction_for_restecg_missing:   0.0066849
    trestbps_correction_for_restecg_missing:  0.00649391
  (Higher score indicates stronger prediction for class `1`)

We see that in addition to some of the original features, the missing indicator restecg_is_missing is selected, along with some of the pairwise adjustment terms.

To understand and interpret these adjustment terms, let us consider the feature age_correction_for_ca_missing, which is selected with a coefficient of -0.013717 in the model. This means that if the feature ca is missing and age is not missing, an additional coefficient of -0.013717 is applied to the value of age. In this way, the coefficient of age in the model is "corrected" due to the value of ca being missing. We can think of this as saying the model would prefer to use the ca feature if it is available, but if it is missing, it will compensate by adjusting the coefficients on other features in the data.
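
One way to picture such an adjustment term is as the value of one feature multiplied by the missingness indicator of another. The sketch below builds age_correction_for_ca_missing by hand on hypothetical toy data; this construction is an assumption consistent with the description above, not the library's implementation. It also checks one decomposition of the feature count that matches the 182 columns shown earlier:

```julia
using DataFrames

# Hypothetical toy data; `missing` marks absent entries.
X_toy = DataFrame(age = [28, 29, 30], ca = [0, missing, 1])

# Zero-imputed values of age and a 0/1 missingness indicator for ca.
age_filled = coalesce.(X_toy.age, 0)
ca_missing = Int.(ismissing.(X_toy.ca))

# Assumed construction of the pairwise term: the value of age is kept only
# in rows where ca is missing, and is zero elsewhere.
age_correction_for_ca_missing = age_filled .* ca_missing

# Feature count for p original features: originals, indicators, and one
# pairwise term for each ordered pair of distinct features.
p = 13
n_affine = p + p + p * (p - 1)
```

Under this construction, a coefficient on age_correction_for_ca_missing only affects rows where ca is missing, which is what lets the model shift the weight on age in exactly those rows.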

We evaluate the model on the transformed and expanded test set:

IAI.score(grid, X_test_affine, y_test, criterion=:auc)

0.8962053571428573

We see that both of the expansion approaches achieve results comparable to those under Optimal Imputation: a test AUC of about 0.878 for :finite and 0.896 for :affine, compared with 0.887 for Optimal Imputation. In practice, the best approach for a problem will depend heavily on the specific dataset and how the missing values were generated (e.g. random vs. structured missingness). We should also consider the computational requirements when choosing between these approaches. The Optimal Imputation approach requires training a separate imputation learner, while the expansion approaches add features to the dataset, which can increase the training time of the predictive model.