Using Optimal Feature Selection with Missing Data
Optimal Feature Selection does not directly support training or prediction when the data contains missing values. In this case, we recommend using one of the following approaches:
- Use OptImpute to impute the missing values, by using fit_transform! on the training data and transform on the testing data.
- Engineer additional features to encode the patterns of missingness in the data, as described in Bertsimas et al. (2021), by using fit_and_expand! on the training data and transform_and_expand on the testing data.
On this page, we demonstrate how to use these approaches in a classification problem, but the same approach also applies to regression problems.
We first load in the hungarian-heart-disease dataset.
using CSV, DataFrames
df = CSV.read(
"processed.hungarian.data", DataFrame,
missingstring="?",
header=[:age, :sex, :cp, :trestbps, :chol, :fbs, :restecg, :thalach, :exang,
:oldpeak, :slope, :ca, :thal, :num])
X = df[:, 1:(end - 1)]
y = df[:, end]
X
294×13 DataFrame
Row │ age sex cp trestbps chol fbs restecg thalach exang ⋯
│ Int64 Int64 Int64 Int64? Int64? Int64? Int64? Int64? Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 28 1 2 130 132 0 2 185 ⋯
2 │ 29 1 2 120 243 0 0 160
3 │ 29 1 2 140 missing 0 0 170
4 │ 30 0 1 170 237 0 1 170
5 │ 31 0 2 100 219 0 1 150 ⋯
6 │ 32 0 2 105 198 0 0 165
7 │ 32 1 2 110 225 0 0 184
8 │ 32 1 2 125 254 0 0 155
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
288 │ 50 1 4 140 341 0 1 125 ⋯
289 │ 52 1 4 140 266 0 0 134
290 │ 52 1 4 160 331 0 0 94
291 │ 54 0 3 130 294 0 1 100
292 │ 56 1 4 155 342 1 0 150 ⋯
293 │ 58 0 2 180 393 0 0 110
294 │ 65 1 4 130 275 0 1 115
5 columns and 279 rows omitted
We can see that X has some missing entries, and as a result, we would not be able to use Optimal Feature Selection directly.
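As a quick check, we can count the missing entries in each column using the describe function from DataFrames:
describe(X, :nmissing)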
Let's first split the data into training and testing:
(X_train, y_train), (X_test, y_test) = IAI.split_data(:classification, X, y, seed=4)
Approach 1: Optimal Imputation
Under the first approach, we will use an Optimal Imputation learner (in this case :opt_knn) to fill in the missing values in X_train and X_test:
imputer = IAI.ImputationLearner(:opt_knn, random_seed=1)
X_train_imputed = IAI.fit_transform!(imputer, X_train)
X_test_imputed = IAI.transform(imputer, X_test)
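As an optional sanity check, we can confirm that the imputed datasets contain no remaining missing values:
# Quick sanity check: no missing entries should remain after imputation
@assert !any(ismissing, Matrix(X_train_imputed))
@assert !any(ismissing, Matrix(X_test_imputed))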
Now that the training and testing data contain no missing values, we can proceed to train the Optimal Feature Selection learner, and evaluate it on the transformed testing data:
grid = IAI.GridSearch(
IAI.OptimalFeatureSelectionClassifier(random_seed=1),
sparsity=1:size(X, 2),
)
IAI.fit!(grid, X_train_imputed, y_train)
IAI.score(grid, X_test_imputed, y_test, criterion=:auc)
[ Warning: For full sparsity, use ridge regression for faster performance.
0.8872767857142855
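If desired, we can also inspect the chosen sparsity level and the fitted model for this approach, using the standard GridSearch utilities (get_learner is also used later on this page; get_best_params is assumed here):
IAI.get_best_params(grid)
IAI.get_learner(grid)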
Approach 2: Engineer features to encode missingness pattern
The second approach is to engineer additional features to encode the pattern of missingness. To do so, we first create a simple imputer (such as a ZeroImputationLearner or MeanImputationLearner), then call fit_and_expand! on the training data to expand the data with additional features that encode the missingness, and finally call transform_and_expand on the testing data to apply the same transformation.
There are two types of feature expansion from which to choose, specified using the type keyword argument:
- :finite adds an indicator encoding the missingness of each feature (a conceptual sketch follows this list).
- :affine creates a feature encoding the missingness relation between every pair of features, allowing for the regression to adaptively adjust when certain features are missing but not others.
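To make the :finite expansion concrete, the following is a rough conceptual sketch of the kind of output it produces when paired with zero imputation. The helper finite_expand_sketch is hypothetical and not the internal IAI implementation; the exact column order and types produced by fit_and_expand! may differ.
using DataFrames

# Rough conceptual sketch of a :finite-style expansion with zero imputation
# (hypothetical helper, not the IAI implementation)
function finite_expand_sketch(X)
    out = mapcols(col -> coalesce.(col, 0), X)   # zero-imputed copy of each feature
    for name in names(X)
        # indicator recording which entries of this feature were originally missing
        out[!, name * "_is_missing"] = Int.(ismissing.(X[!, name]))
    end
    return out
end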
Let's first try the :finite expansion method:
imputer = IAI.ImputationLearner(:zero, normalize_X=false)
X_train_finite = IAI.fit_and_expand!(imputer, X_train, type=:finite)
X_test_finite = IAI.transform_and_expand(imputer, X_test, type=:finite)
names(X_train_finite)
26-element Vector{String}:
"age"
"sex"
"cp"
"trestbps"
"chol"
"fbs"
"restecg"
"thalach"
"exang"
"oldpeak"
⋮
"chol_is_missing"
"fbs_is_missing"
"restecg_is_missing"
"thalach_is_missing"
"exang_is_missing"
"oldpeak_is_missing"
"slope_is_missing"
"ca_is_missing"
"thal_is_missing"
We can see that X_train_finite has doubled from 13 to 26 features due to the missingness indicator added for each feature. We can now train the Optimal Feature Selection learner on the expanded data:
grid = IAI.GridSearch(
IAI.OptimalFeatureSelectionClassifier(random_seed=1),
sparsity=1:size(X_train_finite, 2),
)
IAI.fit!(grid, X_train_finite, y_train)
IAI.get_learner(grid)
[ Warning: For full sparsity, use ridge regression for faster performance.
Fitted OptimalFeatureSelectionClassifier:
Constant: -3.89513
Weights:
cp: 0.614515
exang: 0.942181
fbs: 0.777978
oldpeak: 0.461175
restecg: -0.4196
restecg_is_missing: 3.66969
sex: 0.8913
slope: 0.432254
(Higher score indicates stronger prediction for class `1`)
We see that in addition to the original set of features, the missing indicator restecg_is_missing is also selected. This means that if the variable restecg is missing, a constant factor of 3.66969 will be used in the regression, whereas if restecg is known, a coefficient of -0.4196 will be applied to this known value.
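In other words, under zero imputation the contribution of restecg to the linear score can be thought of as follows (an illustrative helper using the fitted coefficients shown above, not part of the IAI API):
# Illustrative only: if restecg is missing, only the indicator term 3.66969
# contributes (the imputed value 0 adds nothing); otherwise the observed value
# is weighted by -0.4196.
restecg_contribution(restecg) = ismissing(restecg) ? 3.66969 : -0.4196 * restecg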
We can evaluate the model on the transformed and expanded test set:
IAI.score(grid, X_test_finite, y_test, criterion=:auc)
0.8775111607142859
Similarly, we can try the :affine expansion method:
imputer = IAI.ImputationLearner(:zero, normalize_X=false)
X_train_affine = IAI.fit_and_expand!(imputer, X_train, type=:affine)
X_test_affine = IAI.transform_and_expand(imputer, X_test, type=:affine)
names(X_train_affine)
182-element Vector{String}:
"age"
"sex"
"cp"
"trestbps"
"chol"
"fbs"
"restecg"
"thalach"
"exang"
"oldpeak"
⋮
"thal_correction_for_trestbps_missing"
"thal_correction_for_chol_missing"
"thal_correction_for_fbs_missing"
"thal_correction_for_restecg_missing"
"thal_correction_for_thalach_missing"
"thal_correction_for_exang_missing"
"thal_correction_for_oldpeak_missing"
"thal_correction_for_slope_missing"
"thal_correction_for_ca_missing"
We see that X_train_affine is expanded even further to 182 features, due to the pairwise adjustment terms for every combination of features. Training the Optimal Feature Selection learner gives us:
grid = IAI.GridSearch(
IAI.OptimalFeatureSelectionClassifier(random_seed=1),
sparsity=1:size(X_train_affine, 2),
)
IAI.fit!(grid, X_train_affine, y_train)
IAI.get_learner(grid)
[ Warning: For full sparsity, use ridge regression for faster performance.
Fitted OptimalFeatureSelectionClassifier:
Constant: -3.5853
Weights:
age_correction_for_ca_missing: -0.013717
age_correction_for_fbs_missing: -0.0267239
age_correction_for_restecg_missing: 0.0165299
chol: 0.00160183
chol_correction_for_restecg_missing: 0.00308185
cp: 0.349488
cp_correction_for_ca_missing: 0.239332
cp_correction_for_restecg_missing: 0.909147
cp_correction_for_thal_missing: 0.11822
exang: 0.416671
exang_correction_for_chol_missing: 1.24073
exang_correction_for_slope_missing: 0.6669
exang_correction_for_thal_missing: 0.315088
fbs_correction_for_thal_missing: 0.950344
oldpeak: 0.241003
oldpeak_correction_for_ca_missing: 0.183674
oldpeak_correction_for_slope_missing: 0.677676
restecg: -0.276532
restecg_correction_for_ca_missing: -0.276532
restecg_is_missing: 0.909147
sex: 0.41225
sex_correction_for_ca_missing: 0.301809
sex_correction_for_restecg_missing: 0.909147
sex_correction_for_thal_missing: 0.634863
slope: 0.275441
slope_correction_for_ca_missing: 0.200838
slope_correction_for_thal_missing: 0.237835
thal: 0.140411
thal_correction_for_ca_missing: 0.0752845
thalach: -0.00574081
thalach_correction_for_restecg_missing: 0.0066849
trestbps_correction_for_restecg_missing: 0.00649391
(Higher score indicates stronger prediction for class `1`)
We see that in addition to the original features, the missing indicator restecg_is_missing is selected, along with some of the pairwise adjustment terms.
To understand and interpret these adjustment terms, let us consider the feature age_correction_for_ca_missing, which is selected with a coefficient of -0.013717 in the model. This means that if the feature ca is missing and age is not missing, an additional coefficient of -0.013717 is applied to the value of age. In this way, the coefficient of age in the model is "corrected" due to the value of ca being missing. We can think of this as saying the model would prefer to use the ca feature if it is available, but if it is missing, it will compensate by adjusting the coefficients on other features in the data.
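Conceptually, such a correction feature is just the zero-imputed value of one feature multiplied by the missingness indicator of another. For example, a rough sketch of how age_correction_for_ca_missing could be constructed (again hypothetical, not the internal IAI implementation):
# Conceptual construction of one :affine correction feature: the zero-imputed
# value of age, active only in rows where ca is missing
age_correction_for_ca_missing =
    coalesce.(X_train.age, 0) .* Int.(ismissing.(X_train.ca))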
We evaluate the model on the transformed and expanded test set:
IAI.score(grid, X_test_affine, y_test, criterion=:auc)
0.8962053571428573
We see that both of the expansion approaches achieve results comparable to those under Optimal Imputation. In practice, the best approach for a problem will depend heavily on the specific dataset and how the missing values were generated (e.g. random vs. structured missingness). We should also consider the computational requirements when choosing between these approaches: the Optimal Imputation approach requires training a separate imputation learner, while the expansion approaches add additional features to the dataset, which can increase the training time of the predictive model.
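For reference, the out-of-sample AUCs obtained above can be collected for a side-by-side comparison (values rounded from the results shown earlier):
# Out-of-sample AUCs from the runs above, rounded for comparison
results = DataFrame(
    approach=["Optimal Imputation", ":finite expansion", ":affine expansion"],
    test_auc=[0.8873, 0.8775, 0.8962],
)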